KServe
Use KServe with agentgateway.
KServe is a Kubernetes-native platform for serving machine learning models. With agentgateway in front of KServe, you can enforce traffic management policies, such as token-based rate limiting, for inference requests without modifying your inference services.
Before you begin
Follow the Get started guide to install agentgateway.
Follow the Sample app guide to create a gateway proxy with an HTTP listener and deploy the httpbin sample app.
Get the external address of the gateway and save it in an environment variable.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n agentgateway-system http -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS
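If the load balancer has not been assigned an address yet, the variable can end up empty and later curl commands fail in confusing ways. The following is a minimal sketch of a guard you could run after the export. The function name require_gateway_address is hypothetical and not part of agentgateway or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: fail early when INGRESS_GW_ADDRESS was not populated,
# for example because the Service has no external hostname or IP yet.
require_gateway_address() {
  if [ -z "${INGRESS_GW_ADDRESS:-}" ]; then
    echo "INGRESS_GW_ADDRESS is empty; check the gateway Service for an external address." >&2
    return 1
  fi
  echo "Using gateway address: ${INGRESS_GW_ADDRESS}"
}

# Usage:
#   require_gateway_address || exit 1
```

On clusters without a load balancer integration, the address may never be assigned, so a check like this only reports the problem and does not resolve it.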
Step 1: Install cert-manager
Install cert-manager, which KServe requires for webhook certificates.
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.20.2/cert-manager.yaml
Wait for cert-manager to be ready before you continue.
kubectl wait --for=condition=available deployment --all -n cert-manager --timeout=120s
Step 2: Create the KServe namespace and gateway
Create the kserve namespace.
kubectl create namespace kserve
Create a Gateway resource that agentgateway manages. KServe attaches HTTPRoute resources to this gateway automatically for each InferenceService you deploy.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-ingress-gateway
  namespace: kserve
spec:
  gatewayClassName: agentgateway
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All
  infrastructure:
    labels:
      serving.kserve.io/gateway: kserve-ingress-gateway
EOF
Verify that the gateway service is created.
kubectl get svc -n kserve kserve-ingress-gateway
Example output:
NAME                     TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
kserve-ingress-gateway   LoadBalancer   10.96.4.5    <pending>     80:32764/TCP,443:31766/TCP   11s
Step 3: Install KServe
Install the KServe CRDs.
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.17.0
Install KServe resources using Helm.
helm install kserve oci://ghcr.io/kserve/charts/kserve-resources \
  --version v0.17.0 \
  --namespace kserve \
  --create-namespace \
  --set kserve.controller.deploymentMode=Standard \
  --set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
  --set kserve.controller.gateway.ingressGateway.createGateway=false \
  --set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway \
  --set kserve.controller.gateway.ingressGateway.className=agentgateway \
  --set kserve.controller.gateway.disableIstioVirtualHost=true \
  --set kserve.controller.gateway.disableIngressCreation=false \
  --set kserve.controller.knativeAddressableResolver.enabled=false \
  --set kserve.controller.gateway.localGateway.gateway="" \
  --set kserve.controller.gateway.localGateway.gatewayService=""
Step 4: Deploy a mocked LLM with llm-d-inference-sim
Instead of a real model, this guide uses llm-d-inference-sim to serve a mock OpenAI-compatible endpoint. Its /v1/chat/completions path returns a properly structured OpenAI chat completion response that includes usage.total_tokens in the response body, which agentgateway reads to enforce token-based rate limits.
Create the test namespace.
kubectl create namespace kserve-test
Deploy an InferenceService using llm-d-inference-sim directly via spec.predictor.containers. This approach bypasses KServe’s model runtime machinery entirely, so no ClusterServingRuntime or model storage is needed.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mock-llm
  namespace: kserve-test
spec:
  predictor:
    containers:
    - name: kserve-container
      image: ghcr.io/llm-d/llm-d-inference-sim:v0.9.0-rc3
      args:
      - --model
      - mock-llm
      - --port
      - "8080"
      - --mode
      - echo
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
EOF
Wait for the InferenceService to become ready.
kubectl get inferenceservices mock-llm -n kserve-test --watch
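The watch command above runs until you interrupt it. For scripted setups, a non-interactive alternative is sketched below using kubectl wait. It assumes the InferenceService reports a condition of type Ready, and the function name wait_for_isvc is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: block until the named InferenceService reports the
# Ready condition, or return non-zero once the timeout expires.
# Arguments: <name> <namespace> [timeout, default 180s]
wait_for_isvc() {
  kubectl wait --for=condition=Ready "inferenceservice/$1" \
    -n "$2" --timeout="${3:-180s}"
}

# Usage (requires a cluster with the KServe CRDs installed):
#   wait_for_isvc mock-llm kserve-test 180s
```

The default timeout of 180 seconds is an arbitrary choice for this sketch; image pulls on a fresh cluster may need longer.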
Optional Step 4b: Apply a transformation policy to the KServe-generated HTTPRoute
Without a policy, agentgateway forwards requests and responses as-is. This step shows how a transformation policy can enrich responses with additional headers without touching the inference service itself.
Verify that KServe created an HTTPRoute after the Gateway becomes READY. The route attaches to kserve/kserve-ingress-gateway with hostname mock-llm-kserve-test.example.com.
kubectl get httproute mock-llm -n kserve-test -o yaml
Get the external address of the gateway and save it in an environment variable.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n kserve kserve-ingress-gateway \
  -o=jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS
Confirm that the response contains no custom headers.
curl -s -X POST http://$INGRESS_GW_ADDRESS/v1/chat/completions \
  -H "Host: mock-llm-kserve-test.example.com" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mock-llm",
    "messages": [{"role": "user", "content": "Hello"}]
  }' -v 2>&1 | grep "^<"
Example output:
< HTTP/1.1 200 OK
< server: fasthttp
< date: Mon, 18 May 2026 21:55:33 GMT
< content-type: application/json
< content-length: 353
Apply a transformation policy that reads the model name from the request and response body and injects them as response headers.
kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: model-echo-headers
  namespace: kserve-test
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mock-llm
  traffic:
    transformation:
      response:
        set:
        - name: x-requested-model
          value: 'string(json(request.body).model)'
        - name: x-actual-model
          value: 'string(json(response.body).model)'
EOF
Send the same request again and check the headers.
curl -s -X POST http://$INGRESS_GW_ADDRESS/v1/chat/completions \
  -H "Host: mock-llm-kserve-test.example.com" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mock-llm",
    "messages": [{"role": "user", "content": "Hello"}]
  }' -v 2>&1 | grep "^<"
Example output:
< HTTP/1.1 200 OK
< server: fasthttp
< date: Mon, 18 May 2026 21:56:12 GMT
< content-type: application/json
< content-length: 353
< x-requested-model: mock-llm
< x-actual-model: mock-llm
Step 5: Create an AgentgatewayBackend
KServe generates the HTTPRoute with a plain Kubernetes Service as the backendRef. To apply a token-based rate limiting policy, however, agentgateway needs the backend to be an AgentgatewayBackend, which tells agentgateway that the backend is an LLM whose response body includes the usage.total_tokens field to count against the rate limit bucket. In the following steps, you create an AgentgatewayBackend and a second HTTPRoute that routes to it, as a workaround for the KServe-created, Service-based setup.
Create an AgentgatewayBackend that points at the llm-d-inference-sim service.
kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: mock-llm-backend
  namespace: kserve-test
spec:
  ai:
    provider:
      openai:
        model: mock-llm
      host: mock-llm-predictor.kserve-test.svc.cluster.local
      port: 80
      path: "/v1/chat/completions"
EOF
Create a second HTTPRoute that routes to the AgentgatewayBackend. This route uses the same hostname as the KServe-generated route but matches only the /v1/chat/completions path, so the gateway prefers it for LLM traffic.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-llm-ai
  namespace: kserve-test
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: kserve-ingress-gateway
    namespace: kserve
  hostnames:
  - mock-llm-kserve-test.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - name: mock-llm-backend
      namespace: kserve-test
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
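Because the second route references a custom backend kind, it can help to confirm that the gateway accepted the route and resolved its backendRefs before you send traffic. The sketch below reads the standard Gateway API route conditions (Accepted and ResolvedRefs) with a kubectl JSONPath filter. The function name route_condition is hypothetical, and the sketch assumes the first entry in status.parents belongs to kserve-ingress-gateway.

```shell
#!/bin/sh
# Hypothetical helper: print the status ("True" or "False") of one condition
# type reported for the first parent gateway of an HTTPRoute.
# Arguments: <route-name> <namespace> <condition-type>
route_condition() {
  kubectl get httproute "$1" -n "$2" \
    -o jsonpath="{.status.parents[0].conditions[?(@.type==\"$3\")].status}"
}

# Usage:
#   route_condition mock-llm-ai kserve-test Accepted
#   route_condition mock-llm-ai kserve-test ResolvedRefs
```

If ResolvedRefs is not True, the AgentgatewayBackend name, namespace, group, or kind in backendRefs is the first thing to check.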
Step 6: Test the endpoint
Get the external address of the gateway and save it in an environment variable.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n kserve kserve-ingress-gateway \
  -o=jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS
Send a request to verify the setup works end-to-end.
curl -s http://$INGRESS_GW_ADDRESS/v1/chat/completions \
  -H "Host: mock-llm-kserve-test.example.com" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mock-llm",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }' | jq
Example output:
{
  "model": "mock-llm",
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 1,
    "total_tokens": 7,
    "prompt_tokens_detail": {
      "cached_tokens": 0
    }
  },
  "choices": [
    {
      "message": {
        "content": "Hello",
        "role": "assistant"
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "id": "chatcmpl-98473698-57bc-5d69-b91e-af0aace83ac9",
  "object": "chat.completion",
  "kv_transfer_params": null,
  "created": 1779134384
}
Optional Step 7: Apply token-based rate limiting
How token counting works: Agentgateway reads usage.total_tokens from the JSON response body returned by the inference service. Each request deducts that many tokens from the bucket. When the bucket empties, subsequent requests receive 429 Too Many Requests until the next fill interval.
Apply an AgentgatewayPolicy that caps requests at 70 tokens per minute. The policy targets the mock-llm-ai route that selects the AgentgatewayBackend.
kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: llm-token-budget
  namespace: kserve-test
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mock-llm-ai
  traffic:
    rateLimit:
      local:
      - tokens: 70
        unit: Minutes
EOF
Verify the policy is accepted and attached. Both Accepted and Attached conditions must be True.
kubectl get agentgatewaypolicy llm-token-budget -n kserve-test \
  -o jsonpath='{.status.ancestors[0].conditions}'
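The command above prints the raw conditions list, which you then read by eye. As a rough sketch, the same check can be scripted with a JSONPath filter per condition type. The function names policy_condition and policy_ready are hypothetical, and the sketch assumes the conditions live under status.ancestors[0] as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the status of one condition type on the first
# ancestor of an AgentgatewayPolicy.
# Arguments: <policy-name> <namespace> <condition-type>
policy_condition() {
  kubectl get agentgatewaypolicy "$1" -n "$2" \
    -o jsonpath="{.status.ancestors[0].conditions[?(@.type==\"$3\")].status}"
}

# Succeed only when both Accepted and Attached report "True".
# Arguments: <policy-name> <namespace>
policy_ready() {
  [ "$(policy_condition "$1" "$2" Accepted)" = "True" ] &&
    [ "$(policy_condition "$1" "$2" Attached)" = "True" ]
}

# Usage:
#   policy_ready llm-token-budget kserve-test && echo "policy is active"
```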
Get the external address of the gateway and save it in an environment variable.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n kserve kserve-ingress-gateway \
  -o=jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS
Run a burst of requests to trigger the token rate limit. With tokens: 70 and each response consuming 7 tokens, the budget is exhausted after roughly 10 requests.
for i in $(seq 1 30); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST http://$INGRESS_GW_ADDRESS/v1/chat/completions \
    -H "Host: mock-llm-kserve-test.example.com" \
    -H "Content-Type: application/json" \
    -d '{"model": "mock-llm", "messages": [{"role": "user", "content": "Hello"}]}'
done
Example output:
200
200
200
200
200
200
200
200
200
200
429
429
429
...
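The arithmetic behind that output can be sketched offline. The model below is deliberately simplified and is not agentgateway's implementation: it assumes a request is admitted while any budget remains, that the response's total_tokens are deducted afterwards, and that no refill happens during the burst. The function name simulate_budget is hypothetical.

```shell
#!/bin/sh
# Simplified, hypothetical model of a token budget during a single burst.
# Arguments: <budget> <tokens-per-response> <number-of-requests>
# Prints one HTTP status code per simulated request.
simulate_budget() (
  remaining="$1"
  cost="$2"
  requests="$3"
  i=1
  while [ "$i" -le "$requests" ]; do
    if [ "$remaining" -gt 0 ]; then
      echo 200
      # Deduct the tokens that the response reported.
      remaining=$((remaining - cost))
    else
      echo 429
    fi
    i=$((i + 1))
  done
)

# Usage: 70-token budget, 7 tokens per response, 30 requests.
#   simulate_budget 70 7 30
```

With these inputs, the model admits 10 requests (10 × 7 = 70 tokens) and rejects the remaining 20. Real traffic can differ slightly if the bucket refills mid-burst or if responses report different token counts.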
Cleanup
Remove the resources created in this guide.
kubectl delete agentgatewaypolicy llm-token-budget -n kserve-test --ignore-not-found
kubectl delete agentgatewaypolicy model-echo-headers -n kserve-test --ignore-not-found
kubectl delete httproute mock-llm-ai -n kserve-test
kubectl delete agentgatewaybackend mock-llm-backend -n kserve-test
kubectl delete inferenceservice mock-llm -n kserve-test
kubectl delete namespace kserve-test
helm uninstall kserve -n kserve
helm uninstall kserve-crd
kubectl delete gateway kserve-ingress-gateway -n kserve
kubectl delete namespace kserve