Rate limiting
Agentgateway rate limiting policies control the rate of incoming requests to a route.
Rate limit types
Agentgateway exposes two types of rate limits:
- Local rate limits apply in memory; counters are not shared between replicas of agentgateway, nor across restarts. These have very low overhead, but are not appropriate where exact global counts are required, or for limits with long time windows (such as monthly limits).
- Remote rate limits store counters in a pluggable external data store, which enables shared state across replicas of agentgateway. This is controlled via the Envoy Rate Limit gRPC service to enable reuse of existing rate limiting services built for Envoy; the Envoy project provides an example rate limiter service that can be used.
Rate limit modes
In addition to simple request-based rate limits, agentgateway can limit requests based on the tokens consumed by LLM traffic.
Request-based rate limits
By default, agentgateway applies rate limits to requests. Therefore, each request consumes 1 unit of capacity.
To explicitly set request-based rate limits, set the rate limiting type to `requests`, as shown in the following example.
```yaml
policies:
  localRateLimit:
  - maxTokens: 10
    tokensPerFill: 1
    fillInterval: 60s
    type: requests
```
Token-based rate limits
For tokens, each token (prompt or completion) consumes 1 unit of capacity. Because the number of tokens that are used for the completion is not known at the time the request is sent, calculating the number of tokens can become tricky. To work around this issue, agentgateway checks token-based rate limits in two phases, at request time and at response time.
To enable token-based rate limiting, set the rate limiting type to `tokens`, as shown in the following example.
```yaml
policies:
  localRateLimit:
  - maxTokens: 10
    tokensPerFill: 1
    fillInterval: 60s
    type: tokens
```
At request time
- When `tokenize: true` is not set or is set to `false` on the AI backend, the number of tokens that the request uses cannot be calculated. Because of this, the request is always allowed, unless the rate limit is set to 0 tokens. The LLM typically returns the number of tokens that were used when it sends the response, and agentgateway counts the tokens used in the request and the response to determine whether the rate limit was reached. By default, `tokenize` is set to `false`.
- When `tokenize: true` is set, agentgateway estimates the number of tokens at request time. The request is only allowed if the estimated number of tokens does not exceed the set rate limit.
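As a sketch, `tokenize` is configured on the AI backend. The backend name, provider, and model below are illustrative placeholders (your backend layout may differ); only the `tokenize` field is the point of the example:

```yaml
backends:
- ai:
    name: openai          # illustrative backend name
    provider:
      openAI:
        model: gpt-4o-mini   # illustrative model
    # Estimate token usage at request time so token-based rate
    # limits can be enforced before the request is forwarded
    tokenize: true
```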
At response time
When the LLM returns a response, it typically provides the number of tokens that were used during the request and response. Agentgateway uses these numbers to determine if the rate limit was reached.
Note that this determination happens after the response is returned. Even if the number of tokens that are used in the response exceeds the number of allowed tokens, the response is still returned to the user; only subsequent requests are rate limited. If `tokenize: true` is set on the AI backend and tokens were estimated during the request, agentgateway verifies the actual number of tokens that were used when the LLM returns its response. If the initial estimate was off, agentgateway adjusts the count of used tokens so that the actual usage is counted against the set rate limit.
Configuration
Local
Local rate limiting uses a token bucket algorithm.
| Field | Meaning |
|---|---|
| `maxTokens` | Maximum, and initial, size of the bucket |
| `fillInterval` | How often to refill the bucket |
| `tokensPerFill` | How many tokens to replenish per fill |
| `type` | The type of rate limiting: `requests` for request-based rate limits, or `tokens` for token-based rate limits |
The following example rate limit configuration allows 5,000 tokens per hour, and 1 request per second with bursts of up to 60 requests.
```yaml
localRateLimit:
- maxTokens: 5000
  # Every hour, refill 5000 tokens
  tokensPerFill: 5000
  fillInterval: 1h
  type: tokens
- maxTokens: 60
  # Every second, refill 1 token
  tokensPerFill: 1
  fillInterval: 1s
  type: requests
```
Note
The term “tokens” is used with two distinct meanings. In `maxTokens` and `tokensPerFill`, it refers to the “tokens” in the token bucket counter. Each bucket token allows either 1 LLM token or 1 HTTP request, depending on the `type`.
Remote
Remote rate limits are not defined directly in agentgateway. Instead, agentgateway is configured with the address of an external rate limit server and the “descriptors” to send to it. The rate limit server is responsible for defining, and enforcing, the limits that match the descriptors.
```yaml
remoteRateLimit:
  # The address to access the rate limit server
  host: localhost:9090
  # Arbitrary 'domain' to match limits on the rate limit server
  domain: example.com
  descriptors:
  # Rate limit requests based on a static value (used to match a specific
  # rate limit rule on the rate limit server), a header, and whether the
  # user is authenticated
  - entries:
    - key: some-static-value
      value: '"something"'
    - key: organization
      value: 'request.headers["x-organization"]'
    - key: authenticated
      value: 'has(jwt.sub)'
    type: tokens # or 'requests'
```
Each descriptor value is a CEL expression.
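For reference, a matching rule on the Envoy rate limit reference implementation nests one level per descriptor entry, in order. The sketch below pairs with the descriptors above; the unit and limit values are illustrative, not prescribed by agentgateway:

```yaml
# Rate limit server configuration (envoyproxy/ratelimit format)
domain: example.com
descriptors:
  - key: some-static-value
    value: something          # matches the CEL literal '"something"'
    descriptors:
      - key: organization     # no value: matches any x-organization header
        descriptors:
          - key: authenticated
            value: "true"     # has(jwt.sub) evaluated to true
            rate_limit:
              unit: minute
              requests_per_unit: 100   # illustrative limit
```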