Local rate limiting

Apply local and global rate limits to HTTP traffic to protect your backend services from overload.

About

Rate limiting in agentgateway protects your services from being overwhelmed by excessive traffic. A runaway automation script, a misconfigured retry loop, or a deliberate flood can exhaust your upstream’s capacity in seconds. Rate limiting gives you precise control over how much traffic reaches any route or the entire gateway — without any changes to the backend.

Rate limiting in agentgateway is expressed through AgentgatewayPolicy resources. A policy attaches to a Gateway or HTTPRoute target, and defines limits in the spec.traffic.rateLimit field. Gateway-level policies act as a hard ceiling on total traffic, while route-level policies provide finer-grained control.

Additionally, you can set up local or global rate limiting, depending on whether limits must be shared across proxy replicas.

| Mode | Where limits are enforced | Use case |
|------|---------------------------|----------|
| Local | In-process, per proxy replica | Simple per-route or gateway-wide limits |
| Global | External rate limit service | Shared limits across multiple proxy replicas |

For AI-specific use cases, see:

Gateway-level global DoS protection

Target your Gateway resource to apply a limit across all routes. This acts as a hard ceiling on total gateway throughput regardless of which route is hit.

Example gateway policy
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: gateway-rate-limit
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    rateLimit:
      local:
      - requests: 5000
        unit: Minutes
        burst: 1000
EOF

Route-level rate limit

Route-level policies take precedence over gateway-level ones for their specific traffic.

Example route policy
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: httpbin-rate-limit
  namespace: httpbin
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: httpbin
  traffic:
    rateLimit:
      local:
      - requests: 3
        unit: Seconds
        burst: 3
EOF

Inheritance

Policies apply at the attachment point with a clear precedence order:

Gateway → Listener → Route → Route Rule → Backend

More specific policies win. A route-level limit overrides a gateway-level limit for traffic on that route.

With both policies in place, traffic to www.example.com is subject to the route limit (3 req/s), while all other routes are bounded only by the gateway limit (5000 req/min).
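The precedence rule can be illustrated with a small sketch. This is a hand-written illustration of the lookup behavior, not agentgateway's implementation; the limits mirror the two example policies above.

```shell
# Illustration of policy precedence: the most specific attached policy wins.
# Limits mirror the example policies (3/s on the httpbin route, 5000/min gateway-wide).
effective_limit() {
  case "$1" in
    httpbin) echo "3/s (route policy)" ;;        # route-level policy overrides
    *)       echo "5000/min (gateway policy)" ;; # gateway-level ceiling applies
  esac
}

effective_limit httpbin
effective_limit other
```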

Response headers

When rate limiting is enabled, the following headers are added to responses. These headers help clients understand their current rate limit status and adapt their behavior accordingly.

Note: The x-envoy-ratelimited header is only present when using global rate limiting with an Envoy-compatible rate limit service. It is added by the rate limit service itself, not by agentgateway. As such, this header does not appear with local rate limiting.

| Header | Description | Added by | Example |
|--------|-------------|----------|---------|
| x-ratelimit-limit | The rate limit ceiling for the given request. For local rate limiting, this is the base limit plus burst. For global rate limiting with time windows, this might include window information. | Agentgateway | 6 (local); 10, 10;w=60 (global with a 60-second window) |
| x-ratelimit-remaining | The number of requests (or tokens for LLM rate limiting) remaining in the current time window. | Agentgateway | 5 |
| x-ratelimit-reset | The time in seconds until the rate limit window resets. | Agentgateway | 30 |
| x-envoy-ratelimited | Present when the request is rate limited. Only appears in 429 responses when using global rate limiting. | External rate limit service | (header present) |
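For example, a client can read these headers to back off instead of retrying immediately after a 429. The following sketch parses the documented example values from a saved response; the file path and header values are illustrative, not live output.

```shell
# Hypothetical client-side backoff sketch: parse rate limit headers from a
# captured 429 response and decide how long to wait before retrying.
# The header values below are the documented examples, not live output.
cat > /tmp/ratelimit_headers <<'EOF'
x-ratelimit-limit: 6
x-ratelimit-remaining: 0
x-ratelimit-reset: 1
EOF

# Extract the values (header names are matched case-insensitively).
remaining=$(awk 'tolower($1)=="x-ratelimit-remaining:" {print $2}' /tmp/ratelimit_headers)
reset=$(awk 'tolower($1)=="x-ratelimit-reset:" {print $2}' /tmp/ratelimit_headers)

if [ "$remaining" -eq 0 ]; then
  echo "quota exhausted; retry in ${reset}s"
fi
```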

Before you begin

  1. Important: Install the experimental channel of the Kubernetes Gateway API to use this feature.

    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/experimental-install.yaml --server-side
  2. Upgrade or install agentgateway with the KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES environment variable. This setting defaults to false and must be explicitly enabled to use Gateway API experimental features.

    Example command:

    helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway  \
      --namespace agentgateway-system \
      --version v1.0.0-alpha.4 \
      --set controller.image.pullPolicy=Always \
      --set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true
  3. Set up an agentgateway proxy.

  4. Follow the Sample app guide to deploy the httpbin sample app.

  5. Get the external address of the gateway and save it in an environment variable.

    export INGRESS_GW_ADDRESS=$(kubectl get svc -n agentgateway-system agentgateway-proxy -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
    echo $INGRESS_GW_ADDRESS

    If your cluster does not assign an external load balancer address, port-forward the proxy instead.

    kubectl port-forward deployment/agentgateway-proxy -n agentgateway-system 8080:80

Local rate limiting

Local rate limiting runs entirely inside the agentgateway proxy — no external service needed. The following steps show how to apply request-based limits to your HTTP traffic.

  1. Apply a rate limit to the httpbin HTTPRoute.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: httpbin-rate-limit
      namespace: httpbin
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: httpbin
      traffic:
        rateLimit:
          local:
          - requests: 3
            unit: Seconds
            burst: 3
    EOF
    Review the following table to understand this configuration.
    | Field | Required | Description |
    |-------|----------|-------------|
    | requests | Yes | Number of requests allowed per unit. |
    | unit | Yes | Seconds, Minutes, or Hours. |
    | burst | No | Extra requests allowed above the base rate in a short burst. The burst field implements a token bucket on top of the base rate. With requests: 3 and burst: 3, you get up to 6 requests in one burst (3 base + 3 burst capacity), then the bucket refills at 3 per second. This absorbs short traffic spikes without rejecting requests. This setting only works with requests, not with token rate limits. |
  2. Verify that the policy is attached.

    kubectl get AgentgatewayPolicy httpbin-rate-limit -n httpbin \
      -o jsonpath='{.status.ancestors[0].conditions}' | jq .

    A healthy policy reports both Accepted and Attached as True:

    [
      {
        "type": "Accepted",
        "status": "True",
        "message": "Policy accepted"
      },
      {
        "type": "Attached",
        "status": "True",
        "message": "Attached to all targets"
      }
    ]

    If Attached is False, the policy’s targetRef points to a resource that doesn’t exist. Check the message field for the exact resource name that’s missing.

  3. Fire 10 rapid requests to test the rate limit. If your gateway has an external address, run:

    for i in $(seq 1 10); do
      STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        http://$INGRESS_GW_ADDRESS:80/headers -H "host: www.example.com")
      echo "Request $i: HTTP $STATUS"
    done

    If you are using the port-forward instead, run:

    for i in $(seq 1 10); do
      STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        localhost:8080/headers -H "host: www.example.com")
      echo "Request $i: HTTP $STATUS"
    done

    Example output:

    Request 1: HTTP 200
    Request 2: HTTP 200
    Request 3: HTTP 200
    Request 4: HTTP 200
    Request 5: HTTP 200
    Request 6: HTTP 200
    Request 7: HTTP 429
    Request 8: HTTP 429
    Request 9: HTTP 429
    Request 10: HTTP 429

    The first 6 succeed (3 base + 3 burst), then requests are rejected until the bucket refills. Inspect a 429 response to see the rate limit headers:

    HTTP/1.1 429 Too Many Requests
    x-ratelimit-limit: 6
    x-ratelimit-remaining: 0
    x-ratelimit-reset: 0
    content-type: text/plain
    content-length: 19
    
    rate limit exceeded
  4. After 1 second the bucket refills and requests succeed again.

    sleep 1 && curl -o /dev/null -w "%{http_code}\n" \
      localhost:8080/headers -H "host: www.example.com"
    # 200
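The 200/429 pattern above follows token-bucket semantics. The following simulation (an illustration of the semantics only, not agentgateway code) replays 10 back-to-back requests against a bucket sized for requests: 3 with burst: 3:

```shell
# Token bucket simulation of requests: 3, burst: 3 (capacity 3 + 3 = 6).
# Ten back-to-back requests drain the bucket before any refill happens,
# so the first 6 pass and the remaining 4 are rejected, matching the
# output in step 3. Illustration only, not agentgateway code.
capacity=6
tokens=$capacity
allowed=0
rejected=0
for i in $(seq 1 10); do
  if [ "$tokens" -gt 0 ]; then
    tokens=$((tokens - 1))    # consume one token per admitted request
    allowed=$((allowed + 1))
  else
    rejected=$((rejected + 1))  # empty bucket: request gets a 429
  fi
done
echo "allowed=$allowed rejected=$rejected"   # allowed=6 rejected=4
```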

Global rate limiting

Local rate limiting runs independently on each proxy replica. If you run multiple agentgateway replicas and need a shared quota across the fleet, use global rate limiting backed by an external service such as Envoy’s rate limit service.

For detailed instructions on setting up global rate limiting with descriptors and an external rate limit service, see the Global rate limiting guide.

Cleanup

You can remove the resources that you created in this guide.
kubectl delete AgentgatewayPolicy httpbin-rate-limit -n httpbin