compresh

Every model forgets — indiscriminately.
Compresh forgets selectively.

Reconstruct context, don't resend it. Hand the model memory, not just history.

Numbers, not promises

~66% fewer input tokens. No measurable quality loss.

Measured on 360 real-world Q&A items (from public StackExchange threads), replayed as one long, growing session — not synthetic dialogue. 108 of them refer back to earlier turns, to test recall. We compared full history, Compresh compression, and Compresh with the memory layer.

The memory layer is effectively free: same quality, no meaningful token overhead. Deeper episodic recall is under active benchmarking — we publish only what we've measured.

Input tokens over a 360-turn session — lower is better

40.9M raw 13.9M −66.0% tulbase 14.5M −64.5% tulngin

Accuracy vs the human accepted answer — higher is better

25 50 75 100 90.0% raw 87.5% tulbase 89.4% tulngin

Full results

Metric raw tulbase tulngin
Input tokens (total) 40,881,263 13,880,456 14,507,553
Token saving vs raw 66.0% 64.5%
Output tokens (total) 378,927 372,899 369,320
Equivalence vs ground truth (accuracy) 90.0% 87.5% 89.4%
Cosine vs ground truth (0–1) 0.670 0.667 0.670

What the numbers mean

Accuracy — share of answers an independent judge model marked equivalent in meaning to the human-accepted StackExchange answer.

Cosine — semantic-embedding similarity between the answer and the accepted answer (0–1; higher is closer).

Token saving — reduction in input tokens versus sending the full, uncompressed history every turn.

Output tokens stay flat across all three — compression adds no response bloat.

Fidelity — tulbase's answers were judged equivalent to the raw (no-compression) answer 91.7% of the time; tulngin matched tulbase 96.9%. Compression rarely changes the answer.

Model under test: GPT-5-mini. Judge: every answer was graded by a separate model (Llama 3.3 70B, via Groq); cosine uses sentence-embedding similarity (all-MiniLM-L6-v2). Same questions and system prompt across all three runs.

Methodology + data on GitHub

Compresh is built on the distinction, formalized by Endel Tulving, between semantic memory (the facts you know) and episodic memory (the events you lived). Tulving's episodic–semantic distinction ↗

Recent research argues today's LLMs still lack genuine episodic memory — the gap Compresh is built to close. Dong et al., Trends in Cognitive Sciences (2025) ↗ · Huet et al., arXiv (2025) ↗

How Compresh fits among memory approaches

These approaches are strong at what they target. Compresh focuses on a different axis.

Approach Best at As the conversation deepens
Retrieval-based memory
Excellent at pulling relevant facts from large stores. Retrieves matching chunks by similarity.
Long-context windows
Excellent when the whole conversation fits and cost isn't the constraint. Holds the full history — cost grows every turn.
Summarization buffers
Good for keeping a running gist in simple continuity. Replaces detail with a rolling summary.
Compresh
Episodic reconstruction for deepening conversations — where token cost compounds turn after turn. Reconstructs what mattered — savings grow with depth.

Try Compresh free

One line to integrate — change your base URL, keep everything else. You'll see the savings on your own long conversations within hours.

First 100 builders get $30 in credit — no card required.

Where we are

Compresh has three primary audiences. We're at different stages with each.

Ready

For agent and chatbot builders

If you're building tools that hold long, multi-turn conversations with users — agents, copilots, customer-service bots — Compresh is production-ready. Swap your base_url, and your conversations get compressed automatically. You'll see meaningful token reductions on deeper conversations within hours.

What we want from you: real workloads. Compresh learns fastest from production traffic, not synthetic benchmarks.

Plug us in →
Exploring

For RAG developers

Episodic memory and retrieval-augmented generation share a common question: how do you select what's relevant? Compresh's tag-based approach complements RAG in some workflows, replaces parts of it in others. Early signals look strong.

If you're solving retrieval at scale, we'd like to test the overlap together.

Test the overlap →
Talk to us

For teams using AI internally

If your employees use ChatGPT, Claude, or any LLM API for daily work, Compresh fits in front. One base_url change per developer, one master account for IT. Compressed conversations, shared system-prompt savings, per-employee usage analytics — without changing how anyone works.

What we want from you: team size, primary use cases, compliance requirements.

Reach out for team pricing →

Running at platform scale with large, repeated system prompts? That's a conversation for later — reach out and we'll figure out the fit together.

Integrate

Compresh fits in two ways. Pick the one that matches your environment — both run the same compression engine, only the privacy posture differs.

Direct SDK / IDE

Drop-in proxy

Change your base_url to Compresh. Works when you control the client — OpenAI/Anthropic SDKs, raw HTTP, or IDEs that expose an API base URL setting.

  • → Anthropic / OpenAI Python or JS SDK
  • → Cursor, Aider, LangChain, Claude Code
  • → Provider key passes through Compresh
Read integration docs
Managed agent

Hook / MCP

Install a hook in your agent platform. Your provider key never leaves the machine — Compresh only sees the transcript fragment your hook reveals.

  • → OpenClaw hook (live) · Claude Code hook (next)
  • → Compresh-MCP runs locally
  • → Provider key stays with you
See hook docs

Works with your stack, not instead of it. RAG brings in your docs, memory layers track who the user is, caching cuts repeat costs — Compresh handles the conversation itself. Drop it in front; the rest keep working.

Open source

Protocol open, implementation differentiated.

Open

TCCP — Tag Cloud Context Protocol

The wire format and conventions for conversation identity and compression signaling. Anyone can implement a TCCP-compatible proxy or SDK.

github.com/compresh
Patent-pending

Compression engine

Episodic memory architecture — turn-linked classification and progressive, scored compression. The tagging is an internal process: the model never receives the tags, only the reconstructed context they produce. Patent application filed (TR).

That's how open standards work — the protocol is free, the best implementation competes. We earn the way our incentive points: only as you save.

Is Compresh for you?

It earns its keep as conversations deepen — so it's built for some workloads, and honestly not for others.

For you if
  • You run long, multi-turn AI agents
  • You operate support bots that never reset
  • You're building coding agents or copilots
  • You ship a large system prompt on every call
  • Your monthly LLM bill is real
Probably not (yet) if
  • Your conversations stay under ~20–30 messages
  • You're doing simple, one-shot RAG
  • Your token cost is already low
Pricing

Pay only on savings.

You only pay a share of the input tokens Compresh removes — no savings, no fee. Free to start, no card. And local or free models? No savings-share at all.

Bring your own provider key — you keep paying your own LLM bill; we only ever take a cut of what we save you.

See full pricing

Stay close

Sharing this in the open. Follow along, or reach out directly.