Overview — Compresh Docs

Compresh is context compression middleware for LLM APIs. It sits between your application and your LLM provider as a transparent proxy, compressing conversation context as it deepens — so you send fewer tokens without losing quality.

The problem

LLMs forget. Every API call resends the entire conversation history — system prompts, prior turns, accumulated context. By turn 10, you're paying for the same information over and over. Token costs scale linearly with conversation depth, but the actual new information per turn barely grows.

How Compresh works

Compresh runs as a proxy between your app and the LLM provider. You change one line — your base_url — and Compresh handles the rest:

Intercepts outgoing requests before they reach the provider
Analyzes conversation depth and identifies redundant context
Compresses repeated information into compact episodic summaries
Forwards the compressed payload to your LLM provider
Returns the response unchanged — your app never knows the difference

Episodic Memory Architecture

Most compression tools treat conversations as flat text and apply generic summarization. Compresh uses Episodic Memory Architecture (EMA) — a depth-aware approach that understands when information was introduced and how it relates across turns.

Depth-aware: Compression intensity scales with conversation depth. Early turns stay intact; deep turns get aggressively compressed.
Episodic, not flat: Related information is grouped into episodes, preserving semantic relationships that flat summarization destroys.
Lossless where it matters: Recent context, tool calls, and critical instructions are never compressed.

Key numbers

Metric	Value
Token savings at turn 10+	~80%
Quality retention	No measurable loss (dual-judge benchmark)
Integration effort	One line change
Latency overhead	<50ms per request

Two ways to integrate

Compresh works in two architectural modes — pick whichever fits your privacy and latency posture. The compression quality is comparable in both modes; the difference is where the work runs.

Mode A — Drop-in proxy

Compresh is OpenAI-compatible. Point your existing SDK at https://api.compre.sh/v1 and you're done. Your provider API key passes through Compresh and reaches your LLM; we return the response unchanged.

client = OpenAI(
    api_key="comp_your_key",
    base_url="https://api.compre.sh/v1"
)

Mode B — Local MCP + server enhancement

Run the open-source compresh-mcp Python package on your machine. Your provider key stays inside your client (Claude Code, Cursor, Cowork, OpenClaw, etc.) and never reaches Compresh.

Local: the bundled tulbase core (MIT, vendored) handles base compression — LexRank summarization, Protection Zone, modality elision — on your machine.
Server (paid tier): the transcript is sent to api.compre.sh/v1/tul2 for the TUL 2.0 enhancement layer — query-aware retrieval that selects the older turns most relevant to the current question and returns them in full (no lossy summarization), keeping the Protection Zone tail raw. Saving grows as the conversation deepens. Gated by your Compresh API key + plan.
Fallback: if the server is unreachable, the local result is used silently — compression never blocks.

pipx install compresh-mcp
export COMPRESH_API_KEY=sk-comp_...

Then add compresh-mcp as an MCP server in your client config (Claude Code, Cursor, Cowork) or enable the OpenClaw hook.

Tip

Ready to set up? Head to the Quick Start guide — you'll be running in under two minutes.