AI Gateway feature
Token Compression
In progress
Keep your prompts effective while sending fewer tokens. Compression helps control cost and p95 latency on long-context and agent workloads.
We’re validating compression strategies on real workloads (RAG, multi-turn, agent traces).
How it works
- Your app sends a request to Edgee.
- If enabled by policy, Edgee compresses eligible parts of the prompt/context.
- Edgee forwards the resulting request to the selected model/provider.
- You can review token savings and request traces in observability.
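To make the flow concrete, here is a minimal client-side sketch. The endpoint URL, header name, and policy value are illustrative assumptions rather than the shipped API; since compression happens inside Edgee, the client call looks like an ordinary OpenAI-style request.

```typescript
// Hypothetical sketch: the URL, header, and policy name below are
// illustrative assumptions, not the real Edgee API. Compression itself
// happens inside the gateway, so the client sends a normal request.

const GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"; // hypothetical

async function askWithCompression(question: string, context: string): Promise<string> {
  const res = await fetch(GATEWAY_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.GATEWAY_API_KEY}`,
      // Hypothetical per-request hint; in practice, compression is
      // enabled by a policy configured on the gateway side.
      "X-Compression-Policy": "default",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini", // the gateway routes to the selected provider
      messages: [
        { role: "system", content: "Answer using the provided context." },
        { role: "user", content: `${context}\n\nQuestion: ${question}` },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Gateway error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

The key point of the sketch: no application changes are needed beyond pointing requests at the gateway; where and how much to compress is decided by policy, not by the caller.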
Common use cases
- RAG prompts with large retrieved documents
- Multi-turn assistants with long conversation history
- Agents that accumulate tool traces and intermediate steps
- Apps with strict cost ceilings per user/session
Lower spend
Sending fewer input tokens for the same intent lowers model costs; a worked estimate follows below.
Better latency at scale
Smaller payloads mean less data to transmit and fewer tokens for the model to process, especially on long contexts.
More predictable budgets
Reduce variance when prompts balloon due to RAG payloads or tool traces.
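To see what this means for a budget, here is a back-of-the-envelope estimate. The provider rate, traffic figures, and compression ratio are assumptions for illustration, not measured results.

```typescript
// Back-of-the-envelope savings estimate. The price, traffic, and
// compression ratio below are illustrative assumptions, not benchmarks.

const PRICE_PER_1M_INPUT_TOKENS = 2.5; // USD, assumed provider rate
const REQUESTS_PER_DAY = 100_000;      // assumed traffic
const AVG_INPUT_TOKENS = 6_000;        // long RAG/agent prompts
const COMPRESSION_RATIO = 0.6;         // fraction of tokens kept (assumed)

const tokensPerDay = REQUESTS_PER_DAY * AVG_INPUT_TOKENS; // 600M tokens
const costBefore = (tokensPerDay / 1e6) * PRICE_PER_1M_INPUT_TOKENS;
const costAfter = costBefore * COMPRESSION_RATIO;

console.log(`Input cost/day before: $${costBefore.toFixed(2)}`); // $1500.00
console.log(`Input cost/day after:  $${costAfter.toFixed(2)}`);  // $900.00
console.log(`Daily savings:         $${(costBefore - costAfter).toFixed(2)}`); // $600.00
```

Because input cost scales linearly with token count, the same ratio applies whether prompts balloon from retrieved documents or accumulated tool traces, which is what makes per-user and per-session ceilings easier to hold.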
FAQ
Answers reflect current direction and may evolve as the platform ships.