Expert Coders

State-of-the-Art Software Development

"The software you built has made mud logging less stressful, enjoyable and flat out easy!" — Customer

Mike Cunningham

Owner

Privacy-Preserving Split Inference for Large Language Models

Overview

A research system, described in a paper published on arXiv, that enables privacy-conscious LLM inference by splitting a transformer model between a local trusted GPU and a remote cloud server. Only intermediate activations (abstract numerical representations) travel across the network; the raw text never leaves the local machine.
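The split can be pictured as three stages: embedding on the trusted device, the bulk of the transformer layers in the cloud, and unembedding back on the trusted device. A minimal, illustrative sketch of that data flow in plain Python (the real system uses PyTorch tensors and Mistral models; all names and the stand-in math here are hypothetical):

```python
# Illustrative split-inference data flow. The "layers" are trivial
# stand-ins; only the shape of the pipeline matters.

def local_embed(token_ids):
    # Trusted device: map raw token ids to hidden vectors.
    return [[float(t)] for t in token_ids]

def remote_layers(activations):
    # Cloud server: sees only intermediate activations, never token ids.
    return [[x * 2.0 for x in vec] for vec in activations]

def local_unembed(activations):
    # Trusted device: project activations back to output tokens.
    return [int(vec[0]) for vec in activations]

def generate_step(token_ids):
    hidden = local_embed(token_ids)   # raw tokens stay local
    hidden = remote_layers(hidden)    # only activations cross the network
    return local_unembed(hidden)      # decoding happens locally

print(generate_step([3, 7]))  # → [6, 14]
```

The privacy property falls out of the structure: the cloud server never receives anything it could trivially map back to vocabulary tokens, since the embedding and unembedding matrices stay on the local machine.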

The Challenge

Organizations in regulated industries (healthcare, legal, defense) want to use powerful language models but cannot send sensitive data to cloud providers. Running large models locally requires expensive hardware most businesses cannot justify. The challenge was to find a middle ground: use cloud compute power without exposing private data.

What I Built

  • Asymmetric model splitting — embedding and unembedding layers stay local, ensuring raw tokens never leave the trusted device
  • Lookahead speculative decoding — the first application of this technique to distributed inference, generating multiple candidate tokens per network round trip
  • Binary WebSocket protocol for efficient tensor transmission between local and cloud servers
  • KV-cache management across the split boundary for efficient autoregressive generation
  • Privacy attack evaluation framework measuring token recovery rates at different split points
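To make the binary transport concrete, here is a minimal sketch of how a tensor can be framed for transmission over a binary WebSocket channel. This is a hypothetical wire format for illustration, not the project's actual protocol: a one-byte rank, little-endian uint32 dimensions, then a float32 payload.

```python
import struct

# Hypothetical frame layout (illustrative only):
#   uint8 ndim | ndim * uint32 shape | float32 payload, little-endian

def pack_tensor(shape, values):
    header = struct.pack("<B", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    payload = struct.pack(f"<{len(values)}f", *values)
    return header + payload

def unpack_tensor(frame):
    ndim = frame[0]
    shape = struct.unpack_from(f"<{ndim}I", frame, 1)
    count = 1
    for dim in shape:
        count *= dim
    values = list(struct.unpack_from(f"<{count}f", frame, 1 + 4 * ndim))
    return shape, values

frame = pack_tensor((2, 2), [1.0, 2.0, 3.0, 4.0])
shape, values = unpack_tensor(frame)
```

Compared with JSON-encoding activations, a raw binary frame like this avoids base64 inflation and float-to-text conversion, which matters when a multi-megabyte activation tensor crosses the network on every round trip.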

Results

8.7–9.3 tokens/second on Mistral 7B over real-world WAN conditions (~80 ms round-trip latency). 7.8–8.7 tok/s on Mistral NeMo 12B using only 4.9 GB of local VRAM. With 8 transformer layers kept local, an attacker's token recovery rate drops to ~35%.
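Throughput at this latency hinges on how many drafted tokens survive verification each round trip. A minimal sketch of the accept-longest-matching-prefix rule common to speculative decoding schemes (the names and token ids here are hypothetical, not the project's actual code):

```python
# The local side drafts several tokens; the remote verifier returns its
# own next-token choice at each position. Keep the longest matching
# prefix, plus the verifier's correction token at the first mismatch.

def accept_prefix(draft, verified):
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    # One verified token is gained even when the draft diverges.
    bonus = verified[len(accepted)] if len(accepted) < len(verified) else None
    return accepted, bonus

accepted, bonus = accept_prefix([5, 9, 2], [5, 9, 7])
# accepted == [5, 9], bonus == 7: three tokens from a single round trip.
```

When several drafted tokens are accepted per trip, the per-token cost of the ~80 ms network hop is amortized, which is what keeps WAN throughput in the single-digit tokens-per-second range rather than far below it.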

Tech Stack

Python, PyTorch, Hugging Face Transformers, WebSocket, Flask, Mistral 7B/12B, CUDA