Rate limiting for LLMs
Control LLM costs with token-based rate limiting and request-based limits.
About
Rate limiting for LLM traffic helps you control costs and prevent runaway token consumption. Agentgateway supports both traditional request-based limits and LLM-specific token-based budgets. Token limits let you cap spending on expensive prompts and prevent unexpected bills from prompt injection or misconfigured applications.
Agentgateway offers two modes of rate limiting:
Local rate limiting: Runs in-process on each agentgateway proxy replica. Each replica maintains its own independent rate limit counters. This is simpler to configure and requires no external services, but limits are per-replica rather than shared across the entire gateway fleet.
Global rate limiting: Uses an external rate limit service to coordinate limits across multiple proxy replicas. All replicas share the same counters, ensuring consistent enforcement regardless of which replica receives the request. This requires deploying and configuring an external rate limit service.
Local and global rate limiting each support request-based and token-based limits.
How token counting works
Agentgateway reads the usage field from every LLM response to accumulate token counts against the configured budget. Two behaviors are important to understand before applying limits:
Streaming responses: Token counts are only known after the full stream completes. The gateway cannot interrupt a response mid-stream. Token-based limits apply to future requests — the request that pushes you over the budget completes successfully, and the next request gets a 429.
Counting happens after the fact: This means token budgets are approximate. With a 1000-token-per-minute limit and a single request that returns 1200 tokens, that request succeeds, you’re 200 tokens over budget, and subsequent requests are blocked until the window resets.
Token budgets degrade gracefully: requests that exceed the budget fail fast with a 429 and are not forwarded to the backend. After the window resets, the token budget is restored and requests succeed again. No manual intervention is required.
This behavior is intentional and matches how real LLM providers implement soft quotas.
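The accounting described above can be modeled in a few lines of shell. This is a toy sketch, not agentgateway's implementation: the `simulate_window` function is hypothetical, it models a single fixed window on one replica, and it assumes a request is rejected once the tokens already used reach the limit.

```shell
# simulate_window LIMIT COST...
# Hypothetical model of after-the-fact token accounting in one window.
# Prints one line per request: the HTTP status the caller would see and
# the tokens charged so far.
simulate_window() {
  local limit used cost
  limit=$1
  shift
  used=0
  for cost in "$@"; do
    if [ "$used" -ge "$limit" ]; then
      # Budget already spent: reject without forwarding to the backend.
      echo "429 used=$used"
    else
      # Budget not yet spent: forward the request, then charge its tokens.
      used=$((used + cost))
      echo "200 used=$used"
    fi
  done
}

# A 1000-token budget, one 1200-token response, then a small request.
simulate_window 1000 1200 50
```

The first request succeeds and leaves the window 200 tokens over budget, so the second request is rejected even though it is small. Calling `simulate_window 100 39 39 39 39` shows a 100-token budget absorbing three 39-token responses before the fourth request is blocked.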
Response headers
When rate limiting is enabled, the following headers are added to responses. These headers help clients understand their current rate limit status and adapt their behavior accordingly.
Note: The x-envoy-ratelimited header is only present when using global rate limiting with an Envoy-compatible rate limit service. It is added by the rate limit service itself, not by agentgateway. As such, this header does not appear with local rate limiting.
| Header | Description | Added by | Example |
|---|---|---|---|
| x-ratelimit-limit | The rate limit ceiling for the given request. For local rate limiting, this is the base limit plus burst. For global rate limiting with time windows, this might include window information. | Agentgateway | 6 (local), 10, 10;w=60 (global with 60-second window) |
| x-ratelimit-remaining | The number of requests (or tokens for LLM rate limiting) remaining in the current time window. | Agentgateway | 5 |
| x-ratelimit-reset | The time in seconds until the rate limit window resets. | Agentgateway | 30 |
| x-envoy-ratelimited | Present when the request is rate limited. Only appears in 429 responses when using global rate limiting. | External rate limit service | (header present) |
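To inspect these headers from the command line, one option is to dump the response headers with curl and filter for the names in the table. The following is a minimal sketch: `ratelimit_headers` is a hypothetical helper, and the sample response it is fed uses made-up values rather than real gateway output.

```shell
# ratelimit_headers: hypothetical helper, not part of agentgateway.
# Reads raw HTTP response headers on stdin and prints only the rate limit
# headers listed in the table above, with carriage returns stripped.
ratelimit_headers() {
  tr -d '\r' | grep -iE '^(x-ratelimit-(limit|remaining|reset)|x-envoy-ratelimited):'
}

# Sample headers with illustrative values (not captured from a real gateway).
printf 'HTTP/1.1 200 OK\r\ncontent-type: application/json\r\nx-ratelimit-limit: 100\r\nx-ratelimit-remaining: 61\r\nx-ratelimit-reset: 30\r\n\r\n' \
  | ratelimit_headers
```

Against a live gateway, the same filter can be applied by piping `curl -s -o /dev/null -D -`, followed by your usual request arguments, into `ratelimit_headers`.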
Common use cases
Review the following table for example use cases and configuration guidance.
| What you want | How to configure it |
|---|---|
| Cap token spend on an LLM route | AgentgatewayPolicy targeting HTTPRoute, local[].tokens. |
| Limit requests independently of tokens | Add a second local[] entry with requests. |
| Streaming-safe token limits | No special config — token limits are always applied post-stream. |
| Hard token ceiling across the gateway | AgentgatewayPolicy targeting Gateway, local[].tokens. |
| Per-minute vs per-hour budget | Change unit — use Minutes for tighter windows, Hours for daily-style quotas. |
For other use cases, see the rate limiting guides.
Before you begin
Local token rate limiting
Local token rate limiting runs in-process on each agentgateway proxy replica. The following steps show how to apply a per-route token budget and test it with streaming and non-streaming requests.
Apply a token budget to your LLM HTTPRoute. The following example limits total token consumption to 100 tokens per minute across all requests hitting this route.

```shell
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: llm-token-budget
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: httpbun-llm
  traffic:
    rateLimit:
      local:
      - tokens: 100
        unit: Minutes
EOF
```

Review the following table to understand this configuration.

| Field | Required | Description |
|---|---|---|
| tokens | Yes (or requests) | Number of tokens allowed per unit. |
| unit | Yes | Seconds, Minutes, or Hours. |

Verify that the policy is attached.

```shell
kubectl get AgentgatewayPolicy llm-token-budget -n agentgateway-system \
  -o jsonpath='{.status.ancestors[0].conditions}' | jq .
```

Example output:

```json
[
  {
    "type": "Accepted",
    "status": "True",
    "message": "Policy accepted"
  },
  {
    "type": "Attached",
    "status": "True",
    "message": "Attached to all targets"
  }
]
```

Send repeated requests and watch the budget drain. If you reach the gateway at localhost:8080 instead, use that address in place of http://$INGRESS_GW_ADDRESS.

```shell
for i in $(seq 1 10); do
  RESPONSE=$(curl -s -w "\nHTTP_STATUS:%{http_code}" \
    http://$INGRESS_GW_ADDRESS/openai \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4",
      "messages": [{"role": "user", "content": "Say hello in exactly 10 words."}]
    }')
  STATUS=$(echo "$RESPONSE" | grep "HTTP_STATUS" | cut -d: -f2)
  # Drop the trailing HTTP_STATUS line before parsing. A non-JSON 429 body
  # makes jq fail, so fall back to "blocked".
  TOKENS=$(echo "$RESPONSE" | sed '$d' | jq -r '.usage.total_tokens // "blocked"' 2>/dev/null || echo "blocked")
  echo "Request $i: HTTP $STATUS - tokens: $TOKENS"
done
```

Example output:

```
Request 1: HTTP 200 - tokens: 39
Request 2: HTTP 200 - tokens: 39
Request 3: HTTP 200 - tokens: 39
Request 4: HTTP 429 - tokens: blocked
...
```

After the token budget is exhausted, subsequent requests return a 429 HTTP response until the minute window resets.

Test with streaming to verify that token limits work the same way with streaming responses. As before, use localhost:8080 in place of http://$INGRESS_GW_ADDRESS if that is how you reach the gateway.

```shell
curl -N http://$INGRESS_GW_ADDRESS/openai \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Count from 1 to 20."}],
    "stream": true
  }'
```

Notice that the SSE chunks arrive in full:

```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"1"},...}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":", 2"},...}]}
...
data: [DONE]
```

After the stream ends, agentgateway reads the accumulated token count from the final chunk's usage field and updates the budget.

Repeat the request a couple of times. After the budget is exhausted, the request is rejected with the following response.

```
rate limit exceeded
```
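A client can handle the 429 by waiting for the window to reset before retrying. The sketch below is one way to do that in shell, assuming the 429 response carries the x-ratelimit-reset header described earlier. `reset_delay` and `send_with_retry` are hypothetical helper names, and the 60-second fallback is an arbitrary choice that matches the one-minute window in this example.

```shell
# reset_delay [DEFAULT]: hypothetical helper.
# Reads raw response headers on stdin and prints how many seconds to wait.
# Falls back to DEFAULT (60 if omitted) when x-ratelimit-reset is missing
# or is not a whole number.
reset_delay() {
  local default value
  default=${1:-60}
  value=$(tr -d '\r' | awk -F': *' \
    'tolower($1) == "x-ratelimit-reset" && !seen { print $2; seen = 1 }')
  case $value in
    '' | *[!0-9]*) echo "$default" ;;
    *) echo "$value" ;;
  esac
}

# send_with_retry URL JSON_BODY: hypothetical helper.
# Sends one request; on a 429, sleeps for the advertised reset time and
# retries once. Prints the final HTTP status code.
send_with_retry() {
  local url body headers status
  url=$1
  body=$2
  headers=$(mktemp)
  status=$(curl -s -o /dev/null -D "$headers" -w '%{http_code}' "$url" \
    -H 'Content-Type: application/json' -d "$body")
  if [ "$status" = "429" ]; then
    sleep "$(reset_delay 60 < "$headers")"
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url" \
      -H 'Content-Type: application/json' -d "$body")
  fi
  rm -f "$headers"
  echo "$status"
}
```

For example, `send_with_retry "http://$INGRESS_GW_ADDRESS/openai" '{"model": "gpt-4", "messages": [{"role": "user", "content": "Say hello."}]}'` prints the status of the final attempt. A single retry keeps the sketch short; a production client would typically cap the wait and the number of attempts.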
Other configurations
The following examples show additional configuration patterns for LLM rate limiting.
Combine request and token limits
You can apply both request-based and token-based limits to the same route. Both limits are evaluated independently. A request must pass both checks to succeed.
```shell
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: llm-combined-limit
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  traffic:
    rateLimit:
      local:
      - requests: 10
        unit: Minutes
        burst: 5
      - tokens: 1000
        unit: Minutes
EOF
```

This configuration enforces:
- Maximum 10 requests per minute (with up to 5 burst)
- Maximum 1000 tokens per minute
Both limits apply to the same traffic. If either limit is exceeded, the request is rejected with a 429 response.
Gateway-level token limits
Apply a token budget across all LLM routes by targeting the Gateway resource.
```shell
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: gateway-token-limit
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    rateLimit:
      local:
      - tokens: 10000
        unit: Hours
EOF
```

This policy caps total token consumption across all routes on the gateway, regardless of which route is hit. Because the limit is local, each proxy replica enforces its own 10,000-token budget, and the ceiling is subject to the after-the-fact counting described earlier.
Global rate limiting for LLMs
Local rate limiting runs independently on each proxy replica. For shared token budgets across multiple agentgateway replicas, use global rate limiting with an external rate limit service.
For detailed instructions on setting up global rate limiting with descriptors and an external rate limit service, see the Global rate limiting guide.
Cleanup
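You can remove the resources that you created in this guide.

```shell
kubectl delete AgentgatewayPolicy llm-token-budget -n agentgateway-system
```

If you also applied the policies from the other configurations, delete them too.

```shell
kubectl delete AgentgatewayPolicy llm-combined-limit gateway-token-limit \
  -n agentgateway-system --ignore-not-found
```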