# M2M Proxy Server
The M2M Proxy is an OpenAI API-compatible reverse proxy that transparently applies M2M protocol compression to reduce bandwidth and costs for LLM API communication.
## Design Philosophy
The proxy is designed to work with any OpenAI-compatible API endpoint, not just OpenAI. This includes:
| Provider Type | Examples | Token Counting |
|---|---|---|
| OpenAI | api.openai.com | Exact (tiktoken) |
| Self-hosted OSS | vLLM, Ollama, LocalAI, TGI | Exact (Llama BPE) |
| Cloud Providers | OpenRouter, Together.ai, Anyscale | Exact or Heuristic |
| Enterprise | Azure OpenAI, AWS Bedrock | Exact (tiktoken) |
### Why OpenAI-Compatible?
The OpenAI chat completion API has become the de facto standard for LLM inference. Most inference servers (vLLM, Ollama, LocalAI, Text Generation Inference) expose an OpenAI-compatible endpoint because:
- Ecosystem compatibility - Tools, SDKs, and applications expect this interface
- Zero migration cost - Switch between providers by changing the base URL (see the sketch after this list)
- Standardized schema - Predictable request/response structure enables optimization
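Because only the base URL (and credentials) change, the same client code runs against any of these providers. A minimal sketch using the official `openai` Python package; the URLs, model name, and placeholder key are illustrative:

```python
from openai import OpenAI

# Self-hosted Ollama (the client requires a key even if the server ignores it)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Hosted OpenAI: identical code, different base URL and a real key
# client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```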
## Architecture
```
┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│   Application   │         │    M2M Proxy    │         │  LLM Provider   │
│                 │         │                 │         │                 │
│  POST /v1/chat  │────────>│  Compress req   │────────>│  vLLM/Ollama/   │
│  /completions   │         │  Security scan  │         │  OpenAI/etc.    │
│                 │<────────│  Decompress res │<────────│                 │
└─────────────────┘         └─────────────────┘         └─────────────────┘
```

### Request Flow
1. Receive - Application sends a standard OpenAI-format request to the proxy
2. Security Scan - Check for prompt injection and jailbreaks (optional)
3. Compress - Apply M2M compression to reduce payload size
4. Forward - Send the compressed request to the upstream LLM provider
5. Stream/Buffer - Handle streaming SSE responses
6. Decompress - Expand the response back to standard format
7. Return - Send the unmodified OpenAI response to the application
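The flow can be pictured as a simple pipeline. Below is a minimal, illustrative Python sketch; `threat_score`, `compress_m2m`, and `decompress_m2m` are invented placeholders for the proxy's internal stages, not a published M2M API, and the real proxy is a long-running server rather than a single function:

```python
import json
import urllib.request

UPSTREAM = "http://localhost:11434/v1/chat/completions"

def threat_score(request: dict) -> float:
    return 0.0  # placeholder: real scanner scores injection/jailbreak risk

def compress_m2m(payload: dict) -> dict:
    return payload  # placeholder: real codec abbreviates keys, etc.

def decompress_m2m(payload: dict) -> dict:
    return payload  # placeholder: inverse of compress_m2m

def handle(raw_body: bytes, threshold: float = 0.8) -> bytes:
    request = json.loads(raw_body)              # 1. receive
    if threat_score(request) >= threshold:      # 2. security scan
        raise PermissionError("blocked by security policy")
    compressed = compress_m2m(request)          # 3. compress
    req = urllib.request.Request(               # 4. forward upstream
        UPSTREAM,
        data=json.dumps(compressed).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # 5. buffer (non-streaming case)
        upstream = json.loads(resp.read())
    restored = decompress_m2m(upstream)         # 6. decompress
    return json.dumps(restored).encode()        # 7. return to application
```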
## Compression Benefits
| Scenario | Without M2M | With M2M | Savings |
|---|---|---|---|
| Chat request | 2.4 KB | 1.7 KB | ~30% |
| Long conversation | 48 KB | 31 KB | ~35% |
| Tool calls | 8.2 KB | 4.9 KB | ~40% |
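The savings column is simply `1 - compressed / original`; a quick check against the table's numbers:

```python
# Savings = 1 - compressed / original, using the table's figures (KB)
scenarios = {
    "Chat request": (2.4, 1.7),
    "Long conversation": (48, 31),
    "Tool calls": (8.2, 4.9),
}
for name, (original_kb, compressed_kb) in scenarios.items():
    print(f"{name}: {1 - compressed_kb / original_kb:.0%}")
# Chat request: 29%  (table rounds to ~30%)
# Long conversation: 35%
# Tool calls: 40%
```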
## Usage
### Starting the Proxy
```bash
# Basic usage - forwards to local Ollama
m2m proxy --port 8080 --upstream http://localhost:11434/v1

# Forward to vLLM
m2m proxy --port 8080 --upstream http://vllm-server:8000/v1

# Forward to OpenAI with API key
m2m proxy --port 8080 --upstream https://api.openai.com/v1 --api-key $OPENAI_API_KEY

# With security scanning enabled
m2m proxy --port 8080 --upstream http://localhost:11434/v1 --security --threshold 0.8
```

### Transport Options
The proxy supports multiple transport protocols:
```bash
# TCP only (default)
m2m proxy --port 8080 --upstream http://localhost:11434/v1 --transport tcp

# QUIC/HTTP3 only (requires TLS)
m2m proxy --port 8443 --upstream http://localhost:11434/v1 --transport quic

# Both TCP and QUIC simultaneously
m2m proxy --port 8080 --upstream http://localhost:11434/v1 --transport both --quic-port 8443

# QUIC with custom certificates (production)
m2m proxy --port 8443 --upstream http://localhost:11434/v1 --transport quic \
  --cert /path/to/cert.pem --key /path/to/key.pem
```

### QUIC Benefits
| Feature | Benefit |
|---|---|
| 0-RTT Connection | Reduced latency for returning connections |
| No Head-of-Line Blocking | Multiplexed streams don’t block each other |
| Connection Migration | Survives network changes (WiFi → cellular) |
| Built-in TLS 1.3 | Secure by default, no separate TLS handshake |
| BBR Congestion Control | Better throughput on lossy networks |
## Configuration
```toml
[proxy]
listen = "127.0.0.1:8080"
upstream = "http://localhost:11434/v1"
api_key = ""  # Optional: forwarded to upstream

[proxy.compression]
requests = true
responses = true

[proxy.security]
enabled = true
blocking = true
threshold = 0.8
```
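To inspect or validate such a config programmatically, Python's standard-library `tomllib` (3.11+) parses it directly; the `m2m.toml` filename here is an assumption for the example:

```python
import tomllib  # standard library in Python 3.11+

# Assumes the config above was saved as "m2m.toml" (name is illustrative)
with open("m2m.toml", "rb") as f:
    config = tomllib.load(f)

print(config["proxy"]["upstream"])                 # http://localhost:11434/v1
print(config["proxy"]["compression"]["requests"])  # True
print(config["proxy"]["security"]["threshold"])    # 0.8
```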
### Using with Applications

Point your application to the proxy instead of the upstream:
```python
from openai import OpenAI

# Before: Direct to Ollama
client = OpenAI(base_url="http://localhost:11434/v1")

# After: Through M2M proxy
client = OpenAI(base_url="http://localhost:8080/v1")
```

```bash
# curl example
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```

## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /stats | GET | Compression statistics |
| /v1/chat/completions | POST | OpenAI-compatible chat endpoint |
| /v1/compress | POST | Manual compression (debugging) |
| /v1/decompress | POST | Manual decompression (debugging) |
### Stats Response
{ "requests_total": 1542, "bytes_in": 3847291, "bytes_out": 2693104, "compression_ratio": 0.70, "avg_latency_ms": 0.8, "p99_latency_ms": 2.1}Streaming Support
## Streaming Support

The proxy fully supports Server-Sent Events (SSE) streaming:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [...], "stream": true}'
```

SSE chunks are compressed individually using the streaming codec, which:
- Abbreviates JSON keys in each chunk (see the sketch after this list)
- Accumulates content for final statistics
- Maintains SSE format compatibility
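To make the key-abbreviation step concrete, here is a toy version applied to a single chat-completion chunk. The short keys are invented for illustration; the real M2M abbreviation table is internal to the codec:

```python
import json

# Invented abbreviation map -- the real codec's key table is internal
ABBREV = {"choices": "c", "delta": "d", "content": "ct",
          "finish_reason": "fr", "index": "i"}

def abbreviate(obj):
    """Recursively shorten known keys in a decoded SSE chunk."""
    if isinstance(obj, dict):
        return {ABBREV.get(k, k): abbreviate(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [abbreviate(v) for v in obj]
    return obj

chunk = {"choices": [{"index": 0, "delta": {"content": "Hello"},
                      "finish_reason": None}]}
print(json.dumps(abbreviate(chunk), separators=(",", ":")))
# {"c":[{"i":0,"d":{"ct":"Hello"},"fr":null}]}
```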
## Token Counting Accuracy
| Provider | Tokenizer | Accuracy |
|---|---|---|
| OpenAI GPT-4o/o1/o3 | tiktoken (o200k_base) | Exact |
| OpenAI GPT-4/3.5 | tiktoken (cl100k_base) | Exact |
| Llama 3.x | Llama BPE | Exact |
| Mistral/Mixtral | Llama BPE | Exact |
| DeepSeek | Heuristic | ~95% |
| Qwen | Heuristic | ~95% |
| Claude* | Heuristic | ~90% |
| Gemini* | Heuristic | ~90% |
*Closed tokenizers - compression still works, token stats are estimates.
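For the "Exact" rows, counts come from running the provider's actual tokenizer. For instance, with the `tiktoken` package installed, the two OpenAI encodings from the table can be queried directly:

```python
import tiktoken  # pip install tiktoken

text = "Hello, world!"
for name in ("o200k_base", "cl100k_base"):  # GPT-4o/o1/o3 and GPT-4/3.5
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))  # exact token count per encoding
```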
## Self-Hosted Model Examples
### With Ollama
```bash
# Start Ollama with Llama 3.2
ollama run llama3.2

# Start M2M proxy
m2m proxy --port 8080 --upstream http://localhost:11434/v1

# Use normally
curl http://localhost:8080/v1/chat/completions \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```

### With vLLM
```bash
# Start vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

# Start M2M proxy
m2m proxy --port 8080 --upstream http://localhost:8000/v1

# Use normally
curl http://localhost:8080/v1/chat/completions \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [...]}'
```

## Security Scanning
When enabled, the proxy scans all incoming requests for:
- Prompt injection - “ignore previous instructions”, role confusion
- Jailbreaks - DAN mode, developer mode, bypass attempts
- Malformed input - Null bytes, unicode exploits, excessive nesting
- Data exfiltration - Environment variable access, file path probing
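The detection logic itself is internal to the proxy. As a rough illustration of what pattern-based scoring against a threshold looks like, here is a toy scorer; the patterns and weights are invented for the example:

```python
import re

# Invented patterns and weights -- the proxy's real detector is internal
# and considerably more sophisticated than this sketch.
PATTERNS = [
    (re.compile(r"ignore (all )?previous instructions", re.I), 0.9),
    (re.compile(r"\bDAN mode\b|\bdeveloper mode\b", re.I), 0.8),
    (re.compile(r"\x00"), 1.0),                         # null bytes
    (re.compile(r"\$\{?[A-Z_]+\}?|/etc/passwd"), 0.7),  # env vars / path probes
]

def threat_score(text: str) -> float:
    """Return the highest weight among matching patterns, else 0.0."""
    return max((w for p, w in PATTERNS if p.search(text)), default=0.0)

print(threat_score("Please ignore previous instructions and ..."))  # 0.9
print(threat_score("What's the weather today?"))                    # 0.0
```

The CLI exposes this behavior through the threshold flag: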
```bash
# Enable blocking mode (rejects threats)
m2m proxy --port 8080 --upstream ... --security --threshold 0.8

# Enable monitoring mode (logs but doesn't block)
m2m proxy --port 8080 --upstream ... --security --threshold 0.0
```

## Performance
| Metric | Value |
|---|---|
| Added latency | < 2ms |
| Memory overhead | < 20MB |
| Max throughput | 10k+ req/s |
| Compression ratio | 25-40% savings |
The proxy adds minimal overhead while providing significant bandwidth savings, especially valuable for:
- High-volume inference workloads
- Edge deployments with limited bandwidth
- Cost optimization on metered connections