Expert Coders

State-Of-The-Art Software Development

Mike Cunningham

Owner

I Published a Research Paper on Privacy-Preserving AI — Here's What I Built and Why It Matters

Earlier this year, I published a research paper on arXiv titled "Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks." It is the culmination of months of building, testing, and benchmarking a system that addresses one of the most important problems in AI right now: how do you use powerful language models without handing all your data to someone else?

I want to walk through what I built, why I built it, and what the results look like — in plain language.

The Problem

Large language models like GPT-4, Claude, and Llama are incredibly useful. But to use them, you typically have to send your prompts — which might contain proprietary business data, legal documents, medical records, or trade secrets — to a cloud server you don't control. The provider processes your data on their hardware and sends back a response. You're trusting them completely.

For a lot of businesses, that's a dealbreaker. Attorneys can't send client communications to OpenAI. Healthcare organizations have HIPAA obligations. Defense contractors have ITAR restrictions. And even for companies without regulatory requirements, there's a reasonable reluctance to send competitive intelligence to a third-party server.

Running models locally is one option, but the hardware requirements are steep. A 70-billion parameter model needs multiple high-end GPUs that cost tens of thousands of dollars. Most small businesses aren't going to make that investment.

The Solution: Split the Model

My approach splits the model in two. The first few layers run on a local GPU — something as modest as a consumer-grade card with 5 GB of VRAM. The remaining layers run on a powerful cloud server. The critical insight is that only intermediate activations — abstract mathematical representations — travel across the network. Your actual text never leaves your machine.

The embedding layer (which converts your words into numbers) and the unembedding layer (which converts numbers back into words) both stay local. Raw tokens never touch the cloud server. An attacker who intercepts the network traffic sees floating-point tensors, not your documents.
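To make the split concrete, here is a minimal sketch of the data flow. Everything in it is illustrative: the weights are random stand-ins for a real transformer, the layer count and sizes are toy values, and the function names (`client_encode`, `server_forward`, `client_decode`) are my own labels, not names from the paper. The point is which tensors cross the network boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL, N_LAYERS, K_LOCAL = 100, 16, 6, 2

# Toy stand-ins for model weights; a real deployment would load
# pretrained transformer blocks instead of random matrices.
embedding = rng.normal(size=(VOCAB, D_MODEL))   # stays on the client
layers = [rng.normal(scale=0.1, size=(D_MODEL, D_MODEL))
          for _ in range(N_LAYERS)]
unembedding = embedding.T                        # tied weights, client-side

def client_encode(token_ids):
    """Embed the prompt and run the first K_LOCAL layers locally."""
    h = embedding[token_ids]
    for w in layers[:K_LOCAL]:
        h = np.tanh(h @ w)
    return h  # only this activation tensor is sent over the network

def server_forward(h):
    """Cloud side: runs the remaining layers on activations only."""
    for w in layers[K_LOCAL:]:
        h = np.tanh(h @ w)
    return h

def client_decode(h):
    """Back on the client: project to the vocabulary, pick next token."""
    logits = h[-1] @ unembedding
    return int(np.argmax(logits))

prompt = np.array([3, 14, 15])        # token ids never leave the client
activations = client_encode(prompt)   # what an eavesdropper could see
next_token = client_decode(server_forward(activations))
```

Note that an interceptor on the wire sees only `activations`, a float tensor shaped (sequence length, hidden size), never the token ids themselves.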

Making It Fast: Speculative Decoding

The obvious problem with split inference over a wide-area network is latency. Every token the model generates requires a round trip to the cloud server and back. At 80 milliseconds per round trip (a typical WAN latency), that adds up fast.

I implemented a technique called lookahead speculative decoding; to my knowledge, this is the first application of the approach to distributed inference. Instead of generating one token per network round trip, the system speculatively drafts multiple candidate tokens locally using n-gram patterns, then verifies them in a single batched call to the cloud. On code-heavy content, I measured peaks of 7 tokens accepted per decoding step.
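The draft-and-verify loop can be sketched in a few dozen lines. This is a simplified toy, not the paper's implementation: the "remote model" is a deterministic stand-in function, the drafter is a plain bigram cache rather than the paper's n-gram scheme, and `k` drafted tokens per step is an arbitrary choice. What it does show faithfully is the core mechanic: draft cheaply and locally, verify all drafts in one batched remote call, and always gain at least one token per round trip.

```python
def target_next(seq):
    """Stand-in for the expensive remote model's greedy next token.
    A fixed deterministic rule keeps the example self-contained."""
    return (seq[-1] * 31 + len(seq)) % 50

def draft_tokens(seq, bigrams, k):
    """Cheap local drafter: follow previously seen bigram continuations."""
    out, cur = [], seq[-1]
    for _ in range(k):
        if cur not in bigrams:
            break
        cur = bigrams[cur]
        out.append(cur)
    return out

def speculative_decode(prompt, n_new, k=4):
    seq, bigrams, round_trips = list(prompt), {}, 0
    while len(seq) < len(prompt) + n_new:
        drafts = draft_tokens(seq, bigrams, k)
        round_trips += 1          # one batched "network call" per step
        accepted, ctx = [], list(seq)
        for d in drafts:          # verify drafts left to right
            t = target_next(ctx)
            if t != d:
                break             # first mismatch invalidates the rest
            accepted.append(t)
            ctx.append(t)
        # Always gain at least one token: the model's own next token.
        accepted.append(target_next(seq + accepted))
        full = seq + accepted     # refresh the bigram cache
        for a, b in zip(full[:-1], full[1:]):
            bigrams[a] = b
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], round_trips

tokens, trips = speculative_decode([7], n_new=20)
```

Because every accepted token is checked against the target model's greedy choice at its exact position, the output is token-identical to plain sequential decoding; speculation only changes how many round trips it takes, never what gets generated.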

The result: 8.7 to 9.3 tokens per second on a Mistral 7B model over real-world WAN conditions with approximately 80ms latency. On the larger Mistral NeMo 12B (40 layers), it achieved 7.8 to 8.7 tokens per second while using only 4.9 GB of local VRAM. At lower latencies around 20ms, projections show 15 to 19 tokens per second — well into the range of comfortable interactive use.
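The arithmetic behind these numbers is worth making explicit. Under a simple latency model, each decoding step costs one round trip plus some compute, and yields however many tokens were accepted. The compute overhead and acceptance rate below are illustrative values I chose for the sketch, not figures from the paper.

```python
def tokens_per_second(rtt_s, compute_s, avg_accepted):
    """Simple latency model: each decoding step costs one network
    round trip plus compute, and yields avg_accepted tokens."""
    return avg_accepted / (rtt_s + compute_s)

# 80 ms WAN round trip, 30 ms compute per step (illustrative):
naive = tokens_per_second(0.080, 0.030, 1.0)   # one token per round trip
spec = tokens_per_second(0.080, 0.030, 2.5)    # ~2.5 accepted per step
```

The model makes the two levers obvious: shrink the round-trip cost (lower-latency links) or raise the tokens gained per trip (better speculation). Both move throughput toward interactive territory.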

How Private Is It?

I ran formal attack evaluations to quantify the privacy guarantees. With just 2 layers kept local, an attacker can recover about 59% of original tokens from the intermediate activations. That's not great. But at 8 layers local, recovery drops to around 35%, and the throughput penalty is minimal. The system gives you a tunable knob: trade a small amount of performance for significantly better privacy.

I also formally proved that the lookahead decoding produces token-identical output to standard sequential decoding under greedy sampling. This is not an approximation — it is mathematically exact.

Why I Did This

I've spent 16 years running a business in industries where data sensitivity matters — oil and gas, healthcare, legal, defense. I've seen firsthand how many organizations want to use AI but can't because of legitimate privacy concerns. This is not a theoretical problem for me. It's a real constraint I've watched clients struggle with.

Building this system and publishing the research was my way of contributing a real solution. The code works. The benchmarks are reproducible. And the approach is practical enough to deploy on hardware that a small business can actually afford.

What's Next

I'm continuing to develop this technology and looking at ways to integrate it into the custom software I build for clients. If your organization needs to use AI but can't afford the privacy risk of sending data to the cloud, this is the kind of system I can build for you.

You can read the full paper on arXiv.