Every model forgets — indiscriminately.
Compresh forgets selectively.
Reconstruct context, don't resend it. Hand the model memory, not just history.
Numbers, not promises
~66% fewer input tokens. No measurable quality loss.
Measured on 360 real-world Q&A items (from public StackExchange threads), replayed as one long, growing session — not synthetic dialogue. 108 of them refer back to earlier turns, to test recall. We compared full history, Compresh compression, and Compresh with the memory layer.
The memory layer is effectively free: same quality, no meaningful token overhead. Deeper episodic recall is under active benchmarking — we publish only what we've measured.
Input tokens over a 360-turn session — lower is better
Accuracy vs the human accepted answer — higher is better
Full results
| Metric | raw | tulbase | tulngin |
|---|---|---|---|
| Input tokens (total) | 40,881,263 | 13,880,456 | 14,507,553 |
| Token saving vs raw | — | 66.0% | 64.5% |
| Output tokens (total) | 378,927 | 372,899 | 369,320 |
| Equivalence vs ground truth (accuracy) | 90.0% | 87.5% | 89.4% |
| Cosine vs ground truth (0–1) | 0.670 | 0.667 | 0.670 |
What the numbers mean
Accuracy — share of answers an independent judge model marked equivalent in meaning to the human-accepted StackExchange answer.
Cosine — semantic-embedding similarity between the answer and the accepted answer (0–1; higher is closer).
Token saving — reduction in input tokens versus sending the full, uncompressed history every turn.
Output tokens stay flat across all three — compression adds no response bloat.
Fidelity — tulbase's answers were judged equivalent to the raw (no-compression) answer 91.7% of the time; tulngin matched tulbase 96.9%. Compression rarely changes the answer.
Model under test: GPT-5-mini. Judge: every answer was graded by a separate model (Llama 3.3 70B, via Groq); cosine uses sentence-embedding similarity (all-MiniLM-L6-v2). Same questions and system prompt across all three runs.
Compresh is built on the distinction, formalized by Endel Tulving, between semantic memory (the facts you know) and episodic memory (the events you lived). Tulving's episodic–semantic distinction ↗
Recent research argues today's LLMs still lack genuine episodic memory — the gap Compresh is built to close. Dong et al., Trends in Cognitive Sciences (2025) ↗ · Huet et al., arXiv (2025) ↗
How Compresh fits among memory approaches
These approaches are strong at what they target. Compresh focuses on a different axis.
| Approach | Best at | As the conversation deepens |
|---|---|---|
| Retrieval-based memory | Excellent at pulling relevant facts from large stores. | Retrieves matching chunks by similarity. |
| Long-context windows | Excellent when the whole conversation fits and cost isn't the constraint. | Holds the full history — cost grows every turn. |
| Summarization buffers | Good for keeping a running gist in simple continuity. | Replaces detail with a rolling summary. |
| Compresh | Episodic reconstruction for deepening conversations — where token cost compounds turn after turn. | Reconstructs what mattered — savings grow with depth. |
Try Compresh free
One line to integrate — change your base URL, keep everything else. You'll see the savings on your own long conversations within hours.
Where we are
Compresh has three primary audiences. We're at different stages with each.
For agent and chatbot builders
If you're building tools that hold long, multi-turn conversations with users — agents, copilots, customer-service bots — Compresh is production-ready. Swap your base_url, and your conversations get compressed automatically. You'll see meaningful token reductions on deeper conversations within hours.
What we want from you: real workloads. Compresh learns fastest from production traffic, not synthetic benchmarks.
Plug us in →For RAG developers
Episodic memory and retrieval-augmented generation share a common question: how do you select what's relevant? Compresh's tag-based approach complements RAG in some workflows, replaces parts of it in others. Early signals look strong.
If you're solving retrieval at scale, we'd like to test the overlap together.
Test the overlap →For teams using AI internally
If your employees use ChatGPT, Claude, or any LLM API for daily work, Compresh fits in front. One base_url change per developer, one master account for IT. Compressed conversations, shared system-prompt savings, per-employee usage analytics — without changing how anyone works.
What we want from you: team size, primary use cases, compliance requirements.
Reach out for team pricing →Running at platform scale with large, repeated system prompts? That's a conversation for later — reach out and we'll figure out the fit together.
Integrate
Compresh fits in two ways. Pick the one that matches your environment — both run the same compression engine, only the privacy posture differs.
Drop-in proxy
Change your base_url to Compresh. Works when you control the client — OpenAI/Anthropic SDKs, raw HTTP, or IDEs that expose an API base URL setting.
- → Anthropic / OpenAI Python or JS SDK
- → Cursor, Aider, LangChain, Claude Code
- → Provider key passes through Compresh
Hook / MCP
Install a hook in your agent platform. Your provider key never leaves the machine — Compresh only sees the transcript fragment your hook reveals.
- → OpenClaw hook (live) · Claude Code hook (next)
- → Compresh-MCP runs locally
- → Provider key stays with you
Works with your stack, not instead of it. RAG brings in your docs, memory layers track who the user is, caching cuts repeat costs — Compresh handles the conversation itself. Drop it in front; the rest keep working.
Open source
Protocol open, implementation differentiated.
TCCP — Tag Cloud Context Protocol
The wire format and conventions for conversation identity and compression signaling. Anyone can implement a TCCP-compatible proxy or SDK.
github.com/compreshCompression engine
Episodic memory architecture — turn-linked classification and progressive, scored compression. The tagging is an internal process: the model never receives the tags, only the reconstructed context they produce. Patent application filed (TR).
That's how open standards work — the protocol is free, the best implementation competes. We earn the way our incentive points: only as you save.Is Compresh for you?
It earns its keep as conversations deepen — so it's built for some workloads, and honestly not for others.
- You run long, multi-turn AI agents
- You operate support bots that never reset
- You're building coding agents or copilots
- You ship a large system prompt on every call
- Your monthly LLM bill is real
- Your conversations stay under ~20–30 messages
- You're doing simple, one-shot RAG
- Your token cost is already low
Pay only on savings.
You only pay a share of the input tokens Compresh removes — no savings, no fee. Free to start, no card. And local or free models? No savings-share at all.
Bring your own provider key — you keep paying your own LLM bill; we only ever take a cut of what we save you.
See full pricing