AI Gateway feature

Token Compression

In progress

Keep your prompts effective while sending fewer tokens. Compression helps control cost and p95 latency on long-context and agent workloads.

We’re validating compression strategies on real workloads (RAG, multi-turn, agent traces).

How it works

  1. Your app sends a request to Edgee.
  2. If enabled by policy, Edgee compresses eligible parts of the prompt/context.
  3. Edgee forwards the resulting request to the selected model/provider.
  4. You see savings and request traces in observability.
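The flow above can be sketched as follows. Everything here is illustrative, not Edgee's actual API: the policy shape, the `compress_context` strategy, and the stubbed forwarding step are assumptions for the sake of the example.

```python
# Hypothetical sketch of the gateway flow: receive, compress per policy,
# forward, and report savings. Not Edgee's real interface.

def compress_context(text: str) -> str:
    # Placeholder strategy: drop exact duplicate lines (e.g. a retrieved
    # passage repeated across chunks). Real strategies would be smarter.
    seen, kept = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)

def handle_request(prompt: str, policy: dict) -> dict:
    # Step 2: compress eligible parts only when the policy enables it.
    sent = compress_context(prompt) if policy.get("compression") else prompt
    # Step 3: forward to the selected model/provider (stubbed out here).
    response = {"model": policy.get("model", "default"), "prompt": sent}
    # Step 4: surface savings for observability (whitespace word count
    # stands in for a real tokenizer).
    response["tokens_saved"] = len(prompt.split()) - len(sent.split())
    return response
```

When the policy leaves compression disabled, the prompt passes through untouched, so the behavior change is opt-in per policy.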

Common use cases

  • RAG prompts with large retrieved documents
  • Multi-turn assistants with long conversation history
  • Agents that accumulate tool traces and intermediate steps
  • Apps with strict cost ceilings per user/session

Lower spend

Fewer input tokens for the same intent means lower model costs.
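As a back-of-the-envelope illustration (the price, traffic, and compression ratio below are assumptions, not Edgee figures):

```python
# Illustrative numbers only: an assumed input price, traffic profile,
# and compression ratio to show how savings scale with input tokens.
price_per_1k_input = 0.003       # USD per 1K input tokens (assumed)
requests_per_day = 50_000        # assumed traffic
avg_input_tokens = 4_000         # assumed long-context prompt size
compression_ratio = 0.30         # assume 30% of input tokens removed

daily_cost = requests_per_day * avg_input_tokens / 1000 * price_per_1k_input
daily_savings = daily_cost * compression_ratio
print(round(daily_cost, 2), round(daily_savings, 2))  # 600.0 180.0
```

Because input cost scales linearly with tokens sent, any stable compression ratio translates directly into the same percentage of input spend saved.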

Better latency at scale

Less payload to process and transmit, especially for long contexts.

More predictable budgets

Reduce variance when prompts balloon due to RAG payloads or tool traces.


Ship faster

Start with one key. Scale with policies.

Use Edgee’s unified access to get moving quickly, then add routing, budgets, and privacy controls as your AI usage grows.
