Expert Coders

State-of-the-Art Software Development

"The software you built has made mud logging less stressful, enjoyable and flat out easy!" — Customer

Mike Cunningham

Owner

Privacy-Preserving Split Inference for Large Language Models

Overview

A research system, described in a paper published on arXiv, that enables privacy-conscious LLM inference by splitting a transformer model between a local trusted GPU and a remote cloud server. Only intermediate activations (abstract numerical representations) travel across the network; the raw text never leaves the local machine.
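The split can be pictured as three stages: embedding on the trusted device, the bulk of the transformer layers in the cloud, and unembedding back on the trusted device. A minimal, illustrative sketch of that data flow in plain Python (the real system uses PyTorch tensors and Mistral models; all names and the stand-in math here are hypothetical):

```python
# Illustrative split-inference data flow. The "layers" are trivial
# stand-ins; only the shape of the pipeline matters.

def local_embed(token_ids):
    # Trusted device: map raw token ids to hidden vectors.
    return [[float(t)] for t in token_ids]

def remote_layers(activations):
    # Cloud server: sees only intermediate activations, never token ids.
    return [[x * 2.0 for x in vec] for vec in activations]

def local_unembed(activations):
    # Trusted device: project activations back to output tokens.
    return [int(vec[0]) for vec in activations]

def generate_step(token_ids):
    hidden = local_embed(token_ids)   # raw tokens stay local
    hidden = remote_layers(hidden)    # only activations cross the network
    return local_unembed(hidden)      # decoding happens locally

print(generate_step([3, 7]))  # → [6, 14]
```

The privacy property falls out of the structure: the cloud server never receives anything it could trivially map back to vocabulary tokens, since the embedding and unembedding matrices stay on the local machine.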

The Challenge

Organizations in regulated industries (healthcare, legal, defense) want to use powerful language models but cannot send sensitive data to cloud providers. Running large models locally requires expensive hardware most businesses cannot justify. The challenge was to find a middle ground: use cloud compute power without exposing private data.

What I Built

  • Asymmetric model splitting — embedding and unembedding layers stay local, ensuring raw tokens never leave the trusted device
  • Lookahead speculative decoding — the first application of this technique to distributed inference, generating multiple candidate tokens per network round trip
  • Binary WebSocket protocol for efficient tensor transmission between local and cloud servers
  • KV-cache management across the split boundary for efficient autoregressive generation
  • Privacy attack evaluation framework measuring token recovery rates at different split points
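To make the binary transport concrete, here is a minimal sketch of how a tensor can be framed for transmission over a binary WebSocket channel. This is a hypothetical wire format for illustration, not the project's actual protocol: a one-byte rank, little-endian uint32 dimensions, then a float32 payload.

```python
import struct

# Hypothetical frame layout (illustrative only):
#   uint8 ndim | ndim * uint32 shape | float32 payload, little-endian

def pack_tensor(shape, values):
    header = struct.pack("<B", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    payload = struct.pack(f"<{len(values)}f", *values)
    return header + payload

def unpack_tensor(frame):
    ndim = frame[0]
    shape = struct.unpack_from(f"<{ndim}I", frame, 1)
    count = 1
    for dim in shape:
        count *= dim
    values = list(struct.unpack_from(f"<{count}f", frame, 1 + 4 * ndim))
    return shape, values

frame = pack_tensor((2, 2), [1.0, 2.0, 3.0, 4.0])
shape, values = unpack_tensor(frame)
```

Compared with JSON-encoding activations, a raw binary frame like this avoids base64 inflation and float-to-text conversion, which matters when a multi-megabyte activation tensor crosses the network on every round trip.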

Results

8.7–9.3 tokens/second on Mistral 7B over real-world WAN conditions (~80 ms round-trip latency). 7.8–8.7 tok/s on Mistral NeMo 12B using only 4.9 GB of local VRAM. With 8 transformer layers kept local, an attacker's token recovery rate drops to ~35%.
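Throughput at this latency hinges on how many drafted tokens survive verification each round trip. A minimal sketch of the accept-longest-matching-prefix rule common to speculative decoding schemes (the names and token ids here are hypothetical, not the project's actual code):

```python
# The local side drafts several tokens; the remote verifier returns its
# own next-token choice at each position. Keep the longest matching
# prefix, plus the verifier's correction token at the first mismatch.

def accept_prefix(draft, verified):
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    # One verified token is gained even when the draft diverges.
    bonus = verified[len(accepted)] if len(accepted) < len(verified) else None
    return accepted, bonus

accepted, bonus = accept_prefix([5, 9, 2], [5, 9, 7])
# accepted == [5, 9], bonus == 7: three tokens from a single round trip.
```

When several drafted tokens are accepted per trip, the per-token cost of the ~80 ms network hop is amortized, which is what keeps WAN throughput in the single-digit tokens-per-second range rather than far below it.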

Tech Stack

Python, PyTorch, Hugging Face Transformers, WebSocket, Flask, Mistral 7B/12B, CUDA