Rate limiting for LLMs

Control LLM costs with token-based rate limiting and request-based limits.

About

Rate limiting for LLM traffic helps you control costs and prevent runaway token consumption. Agentgateway supports both traditional request-based limits and LLM-specific token-based budgets. Token limits let you cap spending on expensive prompts and prevent unexpected bills from prompt injection or misconfigured applications.

Agentgateway offers two modes of rate limiting:

  • Local rate limiting: Runs in-process on each agentgateway proxy replica. Each replica maintains its own independent rate limit counters. This is simpler to configure and requires no external services, but limits are per-replica rather than shared across the entire gateway fleet.

  • Global rate limiting: Uses an external rate limit service to coordinate limits across multiple proxy replicas. All replicas share the same counters, ensuring consistent enforcement regardless of which replica receives the request. This requires deploying and configuring an external rate limit service.

Local and global rate limiting both support request-based and token-based limits.

How token counting works

Agentgateway reads the usage field from every LLM response to accumulate token counts against the configured budget. The following behaviors are important to understand before applying limits:

Streaming responses: Token counts are only known after the full stream completes. The gateway cannot interrupt a response mid-stream. Token-based limits apply to future requests — the request that pushes you over the budget completes successfully, and the next request gets a 429.

Counting happens after the fact: This means token budgets are approximate. With a 1000-token-per-minute limit and a single request that returns 1200 tokens, that request succeeds, you’re 200 tokens over budget, and subsequent requests are blocked until the window resets.

Token budgets degrade gracefully: requests that exceed the budget fail fast with a 429 and are not forwarded to the backend. After the window resets, the token budget is restored and requests succeed again. No manual intervention is required.

This behavior is intentional and matches how real LLM providers implement soft quotas.
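The post-hoc accounting described above can be sketched in a few lines of shell. This is an illustrative simulation, not agentgateway code: a request is admitted whenever the counter is still under budget, and its token usage is added to the counter only after the response completes.

```shell
# Illustrative simulation of post-hoc token counting (not agentgateway code).
# A request is admitted if the counter is under budget when it arrives;
# its tokens are added to the counter only after the response completes.
budget=1000   # tokens per window
used=0

for tokens in 400 400 1200 50; do
  if [ "$used" -lt "$budget" ]; then
    used=$((used + tokens))
    echo "allowed (200): window usage now $used of $budget"
  else
    echo "blocked (429): window usage $used of $budget"
  fi
done
```

The 1200-token request is admitted because the counter reads 800 when it arrives. It overshoots the budget, and the following 50-token request is the one that gets blocked.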

Response headers

When rate limiting is enabled, the following headers are added to responses. These headers help clients understand their current rate limit status and adapt their behavior accordingly.

Note: The x-envoy-ratelimited header is only present when using global rate limiting with an Envoy-compatible rate limit service. It is added by the rate limit service itself, not by agentgateway. As such, this header does not appear with local rate limiting.

  • x-ratelimit-limit: The rate limit ceiling for the given request. For local rate limiting, this is the base limit plus burst. For global rate limiting with time windows, this might include window information. Added by: Agentgateway. Example: 6 (local), or 10, 10;w=60 (global with a 60-second window).

  • x-ratelimit-remaining: The number of requests (or tokens for LLM rate limiting) remaining in the current time window. Added by: Agentgateway. Example: 5.

  • x-ratelimit-reset: The time in seconds until the rate limit window resets. Added by: Agentgateway. Example: 30.

  • x-envoy-ratelimited: Present when the request is rate limited. Only appears in 429 responses when using global rate limiting. Added by: External rate limit service. Example: (header present).
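A client can read these headers to decide whether to back off. The following sketch parses a sample header block with awk; the header values are hard-coded illustrations, not live gateway output. With a live gateway, the same values can be captured from curl -si output.

```shell
# Parse illustrative rate limit headers and decide whether to back off.
# The header values here are hard-coded samples, not live gateway output.
headers='x-ratelimit-limit: 10
x-ratelimit-remaining: 0
x-ratelimit-reset: 30'

remaining=$(printf '%s\n' "$headers" | awk -F': ' '$1 == "x-ratelimit-remaining" {print $2}')
reset=$(printf '%s\n' "$headers" | awk -F': ' '$1 == "x-ratelimit-reset" {print $2}')

if [ "$remaining" -eq 0 ]; then
  echo "budget exhausted; retry after ${reset}s"
else
  echo "budget remaining: $remaining"
fi
```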

Common use cases

Review the following example use cases and configuration guidance.

  • Cap token spend on an LLM route: AgentgatewayPolicy targeting the HTTPRoute, with local[].tokens.

  • Limit requests independently of tokens: Add a second local[] entry with requests.

  • Streaming-safe token limits: No special configuration; token limits are always applied post-stream.

  • Hard token ceiling across the gateway: AgentgatewayPolicy targeting the Gateway, with local[].tokens.

  • Per-minute vs. per-hour budget: Change unit; use Minutes for tighter windows and Hours for daily-style quotas.

Also, check out the rate limiting guides for other use cases.

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up access to the OpenAI LLM provider.

Local token rate limiting

Local token rate limiting runs in-process on each agentgateway proxy replica. The following steps show how to apply a per-route token budget and test it with streaming and non-streaming requests.

  1. Apply a token budget to your LLM HTTPRoute. The following example limits total token consumption to 100 tokens per minute across all requests hitting this route.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: llm-token-budget
      namespace: agentgateway-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: httpbun-llm
      traffic:
        rateLimit:
          local:
          - tokens: 100
            unit: Minutes
    EOF
    Review the following fields to understand this configuration.

      • tokens: Required (or specify requests instead). Number of tokens allowed per unit.
      • unit: Required. One of Seconds, Minutes, or Hours.
  2. Verify the policy attached.

    kubectl get AgentgatewayPolicy llm-token-budget -n agentgateway-system \
      -o jsonpath='{.status.ancestors[0].conditions}' | jq .

    Example output:

    [
      { "type": "Accepted", "status": "True", "message": "Policy accepted" },
      { "type": "Attached", "status": "True", "message": "Attached to all targets" }
    ]
  3. Send repeated requests and watch the budget drain.

    for i in $(seq 1 10); do
      RESPONSE=$(curl -s -w "\nHTTP_STATUS:%{http_code}" \
        http://$INGRESS_GW_ADDRESS/openai \
        -H "Content-Type: application/json" \
        -d '{
          "model": "gpt-4",
          "messages": [{"role": "user", "content": "Say hello in exactly 10 words."}]
        }')
      STATUS=$(echo "$RESPONSE" | grep "HTTP_STATUS" | cut -d: -f2)
      TOKENS=$(echo "$RESPONSE" | jq -r '.usage.total_tokens // "blocked"' 2>/dev/null)
      echo "Request $i: HTTP $STATUS — tokens: $TOKENS"
    done
    If you access the gateway through a local port-forward instead, send the same requests to localhost:8080/openai.

    Example output:

    Request 1:  HTTP 200 — tokens: 39
    Request 2:  HTTP 200 — tokens: 39
    Request 3:  HTTP 200 — tokens: 39
    Request 4:  HTTP 429 — tokens: blocked
    ...

    In this example, each request returns about 39 tokens, so the first three requests consume 117 tokens. After the 100-token budget is exhausted, subsequent requests return a 429 HTTP response until the minute window resets.

  4. Test with streaming to verify that token limits work the same way with streaming responses.

    curl -N http://$INGRESS_GW_ADDRESS/openai \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Count from 1 to 20."}],
        "stream": true
      }'
    If you access the gateway through a local port-forward instead, send the same request to localhost:8080/openai.

    Notice the SSE chunks arrive in full:

    data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"1"},...}]}
    data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":", 2"},...}]}
    ...
    data: [DONE]

    After the stream ends, agentgateway reads the accumulated token count from the final chunk’s usage field and updates the budget.

    Repeat the request a few times. After the budget is exhausted, the gateway rejects the request with the following message:

    rate limit exceeded
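The per-stream accounting described in step 4 can be illustrated with jq: the token total comes from the usage field of the final data chunk. The stream below is a captured-style sample with made-up chunk contents, not live provider output.

```shell
# Extract the token total from a sample SSE stream (illustrative data).
# The usage field appears in the final data chunk before [DONE].
stream='data: {"choices":[{"delta":{"content":"1"}}]}
data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":27,"total_tokens":39}}
data: [DONE]'

total=$(printf '%s\n' "$stream" \
  | sed -n 's/^data: //p' \
  | grep -v '^\[DONE\]$' \
  | jq -r 'select(.usage != null) | .usage.total_tokens')

echo "total tokens counted against the budget: $total"
```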

Other configurations

The following examples show additional configuration patterns for LLM rate limiting.

Combine request and token limits

You can apply both request-based and token-based limits to the same route. Both limits are evaluated independently. A request must pass both checks to succeed.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: llm-combined-limit
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  traffic:
    rateLimit:
      local:
      - requests: 10
        unit: Minutes
        burst: 5
      - tokens: 1000
        unit: Minutes
EOF

This configuration enforces:

  • Maximum 10 requests per minute (with up to 5 burst)
  • Maximum 1000 tokens per minute

Both limits apply to the same traffic. If either limit is exceeded, the request is rejected with a 429 response.
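The independent evaluation can be sketched as follows. The counter values are made up to show the case where one limit trips while the other still has room.

```shell
# Sketch of independent evaluation of two local limits (illustrative counters).
req_limit=10; req_used=10     # request budget exhausted
tok_limit=1000; tok_used=200  # token budget still has room

decision="allowed"
if [ "$req_used" -ge "$req_limit" ] || [ "$tok_used" -ge "$tok_limit" ]; then
  decision="rejected (429)"
fi
echo "$decision"
```

Here the request limit is the one that trips, so the request is rejected even though the token budget alone would allow it.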

Gateway-level token limits

Apply a token budget across all LLM routes by targeting the Gateway resource.

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: gateway-token-limit
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    rateLimit:
      local:
      - tokens: 10000
        unit: Hours
EOF

This policy acts as a hard ceiling on total token consumption across the entire gateway, regardless of which route is hit.

Global rate limiting for LLMs

Local rate limiting runs independently on each proxy replica. For shared token budgets across multiple agentgateway replicas, use global rate limiting with an external rate limit service.

For detailed instructions on setting up global rate limiting with descriptors and an external rate limit service, see the Global rate limiting guide.

Cleanup

You can remove the resources that you created in this guide. The --ignore-not-found flag skips the policies from the Other configurations section if you did not apply them.

kubectl delete AgentgatewayPolicy llm-token-budget -n agentgateway-system
kubectl delete AgentgatewayPolicy llm-combined-limit -n agentgateway-system --ignore-not-found
kubectl delete AgentgatewayPolicy gateway-token-limit -n agentgateway-system --ignore-not-found