KServe

Use KServe with agentgateway.

KServe is a Kubernetes-native platform for serving machine learning models. With agentgateway in front of KServe, you can enforce traffic management policies, such as token-based rate limiting, for inference requests without modifying your inference services.

Before you begin

  1. Follow the Get started guide to install agentgateway.

  2. Follow the Sample app guide to create a gateway proxy with an HTTP listener and deploy the httpbin sample app.

  3. Get the external address of the gateway and save it in an environment variable.

    export INGRESS_GW_ADDRESS=$(kubectl get svc -n agentgateway-system http -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
    echo $INGRESS_GW_ADDRESS  

Step 1: Install cert-manager

  1. Install cert-manager, which KServe requires for webhook certificates.

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.20.2/cert-manager.yaml
  2. Wait for cert-manager to be ready before you continue.

    kubectl wait --for=condition=available deployment --all -n cert-manager --timeout=120s

Step 2: Create the KServe namespace and gateway

  1. Create the kserve namespace.

    kubectl create namespace kserve
  2. Create a Gateway resource that agentgateway manages. KServe attaches HTTPRoute resources to this gateway automatically for each InferenceService you deploy.

    kubectl apply -f - <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: kserve-ingress-gateway
      namespace: kserve
    spec:
      gatewayClassName: agentgateway
      listeners:
        - name: http
          protocol: HTTP
          port: 80
          allowedRoutes:
            namespaces:
              from: All
      infrastructure:
        labels:
          serving.kserve.io/gateway: kserve-ingress-gateway
    EOF
  3. Verify the gateway service is created.

    kubectl get svc -n kserve kserve-ingress-gateway

    Example output:

    NAME                     TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
    kserve-ingress-gateway   LoadBalancer   10.96.4.5    <pending>     80:32764/TCP,443:31766/TCP   11s
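
The example output shows EXTERNAL-IP as <pending>, so an address lookup may return an empty string for a while. The following is a minimal polling sketch for that case. The helper name poll_until_nonempty, the attempt count, and the delay are illustrative assumptions, not part of agentgateway or KServe.

```shell
# Hypothetical helper: poll a command until it prints a non-empty value.
# Usage: poll_until_nonempty <max_attempts> <delay_seconds> <command...>
poll_until_nonempty() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the command; treat a failure the same as empty output.
    value=$("$@" 2>/dev/null) || value=""
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  # No value appeared within the allowed attempts.
  return 1
}

# Example (requires cluster access, so it is left commented out):
# poll_until_nonempty 30 5 kubectl get svc -n kserve kserve-ingress-gateway \
#   -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}"
```

The helper returns a non-zero status when no address appears in time, so a calling script can stop instead of sending requests to an empty host.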

Step 3: Install KServe

  1. Install the KServe CRDs.

    helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.17.0
  2. Install KServe resources using Helm.

    helm install kserve oci://ghcr.io/kserve/charts/kserve-resources \
      --version v0.17.0 \
      --namespace kserve \
      --create-namespace \
      --set kserve.controller.deploymentMode=Standard \
      --set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
      --set kserve.controller.gateway.ingressGateway.createGateway=false \
      --set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway \
      --set kserve.controller.gateway.ingressGateway.className=agentgateway \
      --set kserve.controller.gateway.disableIstioVirtualHost=true \
      --set kserve.controller.gateway.disableIngressCreation=false \
      --set kserve.controller.knativeAddressableResolver.enabled=false \
      --set kserve.controller.gateway.localGateway.gateway="" \
      --set kserve.controller.gateway.localGateway.gatewayService=""

Step 4: Deploy a mocked LLM with llm-d-inference-sim

Instead of a real model, this guide uses llm-d-inference-sim to serve a mock OpenAI-compatible endpoint. Its /v1/chat/completions path returns a properly structured OpenAI chat completion response, including usage.total_tokens in the response body, which agentgateway reads to enforce token-based rate limits.

  1. Create the test namespace.

    kubectl create namespace kserve-test
  2. Deploy an InferenceService that runs llm-d-inference-sim directly via spec.predictor.containers. This approach bypasses KServe’s model runtime machinery entirely, so no ClusterServingRuntime or model storage is needed.

    kubectl apply -f - <<EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: mock-llm
      namespace: kserve-test
    spec:
      predictor:
        containers:
          - name: kserve-container
            image: ghcr.io/llm-d/llm-d-inference-sim:v0.9.0-rc3
            args:
              - --model
              - mock-llm
              - --port
              - "8080"
              - --mode
              - echo
            ports:
              - containerPort: 8080
                protocol: TCP
            resources:
              requests:
                cpu: "100m"
                memory: "128Mi"
              limits:
                cpu: "500m"
                memory: "256Mi"
    EOF
  3. Wait for the InferenceService to become ready.

    kubectl get inferenceservices mock-llm -n kserve-test --watch
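
The --watch command above keeps running until you interrupt it. For scripted setups, a blocking wait might be more convenient. The sketch below wraps kubectl wait and assumes that the InferenceService reports a Ready condition in its status. The wrapper name wait_for_isvc and the default timeout are illustrative.

```shell
# Hypothetical wrapper: block until an InferenceService reports Ready.
# Usage: wait_for_isvc <name> <namespace> [timeout]
wait_for_isvc() {
  name=$1
  namespace=$2
  timeout=${3:-300s}   # assumed default; adjust for slower clusters
  kubectl wait --for=condition=Ready "inferenceservice/$name" \
    -n "$namespace" --timeout="$timeout"
}

# Example (requires cluster access, so it is left commented out):
# wait_for_isvc mock-llm kserve-test 180s
```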

Optional Step 4b: Apply a transformation policy to the KServe-generated HTTPRoute

Without a policy, agentgateway forwards requests and responses as-is. This step shows how a transformation policy can enrich responses with additional headers without touching the inference service itself.

Before you start, verify that KServe created an HTTPRoute after the Gateway becomes READY. The route attaches to kserve/kserve-ingress-gateway with hostname mock-llm-kserve-test.example.com.

    kubectl get httproute mock-llm -n kserve-test -o yaml
  1. Get the external address of the gateway and save it in an environment variable.

    export INGRESS_GW_ADDRESS=$(kubectl get svc -n kserve kserve-ingress-gateway \
      -o=jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
    echo $INGRESS_GW_ADDRESS
  2. Confirm that the response contains no custom headers.

    curl -s -X POST http://$INGRESS_GW_ADDRESS/v1/chat/completions \
      -H "Host: mock-llm-kserve-test.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mock-llm",
        "messages": [{"role": "user", "content": "Hello"}]
      }' -v 2>&1 | grep "^<"

    Example output:

    < HTTP/1.1 200 OK
    < server: fasthttp
    < date: Mon, 18 May 2026 21:55:33 GMT
    < content-type: application/json
    < content-length: 353
  3. Apply a transformation policy that reads the model name from the request and response body and injects them as response headers.

    kubectl apply -f - <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: model-echo-headers
      namespace: kserve-test
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: mock-llm
      traffic:
        transformation:
          response:
            set:
              - name: x-requested-model
                value: 'string(json(request.body).model)'
              - name: x-actual-model
                value: 'string(json(response.body).model)'
    EOF
  4. Send the same request again and check the headers.

    curl -s -X POST http://$INGRESS_GW_ADDRESS/v1/chat/completions \
      -H "Host: mock-llm-kserve-test.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mock-llm",
        "messages": [{"role": "user", "content": "Hello"}]
      }' -v 2>&1 | grep "^<"

    Example output:

    < HTTP/1.1 200 OK
    < server: fasthttp
    < date: Mon, 18 May 2026 21:56:12 GMT
    < content-type: application/json
    < content-length: 353
    < x-requested-model: mock-llm
    < x-actual-model: mock-llm

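To check only the injected headers instead of reading the full verbose output, the response header lines can be filtered. The following is a small sketch that works on curl -v style lines. The helper name model_headers is hypothetical, and the sample input is copied from the example output above.

```shell
# Hypothetical helper: keep only the two headers that the policy injects.
# Reads curl -v style response header lines ("< name: value") on stdin.
model_headers() {
  grep -i -E '^< (x-requested-model|x-actual-model):' | sed 's/^< //' | tr -d '\r'
}

# Sample input copied from the example output above.
printf '%s\n' \
  '< HTTP/1.1 200 OK' \
  '< server: fasthttp' \
  '< content-type: application/json' \
  '< x-requested-model: mock-llm' \
  '< x-actual-model: mock-llm' | model_headers
```

Against a live gateway, the same filter could replace the grep "^<" stage at the end of the curl command in this step.
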
Step 5: Create an AgentgatewayBackend

KServe generates the HTTPRoute with a plain Kubernetes Service as the backendRef. To apply a token-based rate limiting policy, however, agentgateway needs the backend to be an AgentgatewayBackend, which tells agentgateway that the backend is an LLM whose response body carries the usage.total_tokens field to count against the rate limit bucket. In the following steps, you create an AgentgatewayBackend and a second HTTPRoute that routes to it, as a workaround for the KServe-created, Service-based setup.

  1. Create an AgentgatewayBackend that points at the llm-d-inference-sim service.

    kubectl apply -f - <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: mock-llm-backend
      namespace: kserve-test
    spec:
      ai:
        provider:
          openai:
            model: mock-llm
          host: mock-llm-predictor.kserve-test.svc.cluster.local
          port: 80
          path: "/v1/chat/completions"
    EOF
  2. Create a second HTTPRoute that routes to the AgentgatewayBackend. This route uses the same hostname as the KServe-generated route but matches only paths that start with /v1/chat/completions, so the gateway prefers it for LLM traffic.

    kubectl apply -f - <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mock-llm-ai
      namespace: kserve-test
    spec:
      parentRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: kserve-ingress-gateway
          namespace: kserve
      hostnames:
        - mock-llm-kserve-test.example.com
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /v1/chat/completions
          backendRefs:
            - name: mock-llm-backend
              namespace: kserve-test
              group: agentgateway.dev
              kind: AgentgatewayBackend
    EOF

Step 6: Test the endpoint

  1. Get the external address of the gateway and save it in an environment variable.

    export INGRESS_GW_ADDRESS=$(kubectl get svc -n kserve kserve-ingress-gateway \
      -o=jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
    echo $INGRESS_GW_ADDRESS
  2. Send a request to verify the setup works end-to-end.

    curl -s http://$INGRESS_GW_ADDRESS/v1/chat/completions \
      -H "Host: mock-llm-kserve-test.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mock-llm",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }' | jq

    Example output:

    {
      "model": "mock-llm",
      "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 1,
        "total_tokens": 7,
        "prompt_tokens_detail": {
          "cached_tokens": 0
        }
      },
      "choices": [
        {
          "message": {
            "content": "Hello",
            "role": "assistant"
          },
          "index": 0,
          "finish_reason": "stop"
        }
      ],
      "id": "chatcmpl-98473698-57bc-5d69-b91e-af0aace83ac9",
      "object": "chat.completion",
      "kv_transfer_params": null,
      "created": 1779134384
    }
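
Because the rate limit in the next step is driven by usage.total_tokens, it can help to pull that single field out of a response. The sketch below uses a plain-text match so that it runs without extra tools. It is not a real JSON parser, and the helper name total_tokens is hypothetical. When jq is installed, jq -r '.usage.total_tokens' is the more robust choice.

```shell
# Hypothetical helper: print usage.total_tokens from a chat completion body
# read on stdin. A plain-text match, not a real JSON parser.
total_tokens() {
  grep -o '"total_tokens": *[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Trimmed sample based on the example output above.
printf '%s\n' \
  '{"model":"mock-llm","usage":{"prompt_tokens":6,"completion_tokens":1,"total_tokens":7}}' \
  | total_tokens
```

For the sample body, the helper prints 7, which matches the total_tokens value in the example output.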

Optional Step 7: Apply token-based rate limiting

How token counting works: Agentgateway reads usage.total_tokens from the JSON response body returned by the inference service. Each request deducts that many tokens from the bucket. When the bucket empties, subsequent requests receive 429 Too Many Requests until the next fill interval.

  1. Apply an AgentgatewayPolicy that caps usage at 70 tokens per minute. The policy targets the mock-llm-ai route that selects the AgentgatewayBackend.

    kubectl apply -f - <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: llm-token-budget
      namespace: kserve-test
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: mock-llm-ai
      traffic:
        rateLimit:
          local:
            - tokens: 70
              unit: Minutes
    EOF
  2. Verify the policy is accepted and attached. Both Accepted and Attached conditions must be True.

    kubectl get agentgatewaypolicy llm-token-budget -n kserve-test \
      -o jsonpath='{.status.ancestors[0].conditions}'
  3. Run a burst of requests to trigger the token rate limit. With tokens: 70 and each response consuming 7 tokens, the budget is exhausted after roughly 10 requests.

    for i in $(seq 1 30); do
      curl -s -o /dev/null -w "%{http_code}\n" \
        -X POST http://$INGRESS_GW_ADDRESS/v1/chat/completions \
        -H "Host: mock-llm-kserve-test.example.com" \
        -H "Content-Type: application/json" \
        -d '{"model": "mock-llm", "messages": [{"role": "user", "content": "Hello"}]}'
    done

    Example output:

    200
    200
    200
    200
    200
    200
    200
    200
    200
    200
    429
    429
    429
    ...
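
The arithmetic behind this output can be reproduced offline. The sketch below is a simplified model and not agentgateway's implementation: it assumes that a request is admitted while the bucket still holds tokens, that the response's token count is deducted afterwards, and that no refill happens during the burst. The function name simulate_budget is hypothetical.

```shell
# Simplified model, not agentgateway's implementation: a request is admitted
# while the bucket still holds tokens, and the response's token count is
# deducted afterwards. Refills are ignored.
simulate_budget() {
  remaining=$1      # tokens in the bucket, for example 70
  per_response=$2   # usage.total_tokens per response, for example 7
  requests=$3       # number of requests in the burst
  i=1
  while [ "$i" -le "$requests" ]; do
    if [ "$remaining" -gt 0 ]; then
      echo 200
      remaining=$((remaining - per_response))
    else
      echo 429
    fi
    i=$((i + 1))
  done
}

# Tally the status codes for the burst used in this guide.
simulate_budget 70 7 30 | sort | uniq -c
```

With a budget of 70 and 7 tokens per response, the tally shows 10 responses with 200 and 20 with 429, which lines up with the example output. The same sort | uniq -c tally can be appended to the curl loop above to summarize a real run.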

Cleanup

Remove the resources created in this guide.

    kubectl delete agentgatewaypolicy llm-token-budget -n kserve-test
    kubectl delete agentgatewaypolicy model-echo-headers -n kserve-test
    kubectl delete httproute mock-llm-ai -n kserve-test
    kubectl delete agentgatewaybackend mock-llm-backend -n kserve-test
    kubectl delete inferenceservice mock-llm -n kserve-test
    kubectl delete namespace kserve-test
    helm uninstall kserve -n kserve
    helm uninstall kserve-crd
    kubectl delete gateway kserve-ingress-gateway -n kserve
    kubectl delete namespace kserve