Skip to content
✨ agentgateway has joined the Agentic AI Foundation (AAIF) — Learn more

For the complete documentation index, see llms.txt. Markdown versions of all docs pages are available by appending .md to any docs URL.

Page as Markdown

Inference routing

Use agentgateway with the Kubernetes Gateway API Inference Extension to route requests to AI inference workloads, such as Large Language Models (LLMs) that run in your Kubernetes environment.

This page covers Kubernetes Gateway API mode, where agentgateway routes to InferencePool backends from Gateway API resources. If you want to run the Endpoint Picker Extension (EPP) with agentgateway as a standalone sidecar proxy, see the standalone request scheduler guide instead.

For more information, see the following resources.

Before you begin

To use the Inference Extension with agentgateway, upgrade your Helm installation with the inferenceExtension.enabled=true value.

helm upgrade -i -n agentgateway-system agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --version $AGENTGATEWAY_VERSION \
  --set inferenceExtension.enabled=true \
  --reuse-values

About

The Inference Extension extends the Gateway API with two key resources, an InferencePool and an InferenceModel, as shown in the following diagram.

    graph TD
    InferencePool --> InferenceModel_v1["InferenceModel v1"]
    InferencePool --> InferenceModel_v2["InferenceModel v2"]
    InferencePool --> InferenceModel_v3["InferenceModel v3"]
  

The InferencePool groups together InferenceModels of LLM workloads into a routable backend resource that the Gateway API can route inference requests to. An InferenceModel represents not just a single LLM model, but a specific configuration that includes information such as the version and criticality. The InferencePool uses this information to ensure fair consumption of compute resources across competing LLM workloads and share routing decisions with the Gateway API.

Agentgateway with Inference Extension

Agentgateway integrates with the Inference Extension as a supported Gateway API provider. A Gateway can route requests to InferencePools, as shown in the following diagram.

    graph LR
    Client -->|inference request| agentgateway
    agentgateway -->|routes to| InferencePool
    subgraph  
        subgraph InferencePool
            direction LR
            InferenceModel_v1
            InferenceModel_v2
            InferenceModel_v3
        end
        agentgateway
    end
  

The client sends an inference request to get a response from a local LLM workload. The Gateway receives the request and routes to the InferencePool as a backend. Then, the InferencePool selects a specific InferenceModel to route the request to, based on criteria such as the least-loaded model or highest criticality. The Gateway returns the response to the client.

Set up Inference Extension

Refer to the Kgateway tabs in the Getting started guide in the Inference Extension docs.

Quickstart

In this quickstart, you deploy the following components.

  • vLLM for model serving.
  • A local model configuration. Qwen is used in this example.
  • Kubernetes Gateway API Inference Extension.
  • Agentgateway with inference enabled.
  • The llm-d InferencePool via Helm, configured for Qwen.

Steps:

  1. Deploy the Qwen vLLM instance. The container image uses CPU instead of GPU, which makes for easier local or small cluster testing.

    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-qwen25-15b-instruct
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-qwen25-15b-instruct
      template:
        metadata:
          labels:
            app: vllm-qwen25-15b-instruct
        spec:
          containers:
            - name: vllm
              image: "vllm/vllm-openai-cpu:v0.18.0" # CPU image for local testing; pin tag to avoid drift
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "Qwen/Qwen2.5-1.5B-Instruct"
              - "--port"
              - "8000"
              env:
                - name: PORT
                  value: "8000"
                - name: VLLM_CPU_KVCACHE_SPACE
                  value: "4"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 240
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 180
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 600
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 180
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                 limits:
                   cpu: "11"
                   memory: "10Gi"
                 requests:
                   cpu: "11"
                   memory: "10Gi"
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
    EOF

    Wait about 2-3 minutes for the Qwen model to download. Verify that the pod is running.

    kubectl get pods -w
  2. Install the CRDs for the Kubernetes Gateway API Inference Extension.

    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.4.0/manifests.yaml
  3. Install the Kubernetes Gateway API CRDs, agentgateway, and the agentgateway CRDs.

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml
    helm upgrade -i --create-namespace \
      --namespace agentgateway-system \
      --version v1.3.0-alpha.1 \
      agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds
    helm upgrade -i -n agentgateway-system agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
      --version v1.3.0-alpha.1 \
      --set inferenceExtension.enabled=true
  4. Deploy the InferencePool and the Endpoint Picker extension (EPP/llm-d) via Helm. The InferencePool acts as a logical grouping of AI model servers for load balancing and routing inference requests. The EPP provides intelligent selection among available model servers.

    The GATEWAY_PROVIDER is set to none because you install your own gateway provider, agentgateway.
    export IGW_CHART_VERSION=v1.1.0
    export GATEWAY_PROVIDER=none
    
    helm install vllm-qwen25-15b-instruct \
      --set inferencePool.modelServers.matchLabels.app=vllm-qwen25-15b-instruct \
      --set provider.name=$GATEWAY_PROVIDER \
      --version $IGW_CHART_VERSION \
      oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

    Verify that the InferencePool is deployed.

    kubectl get inferencepool
  5. Deploy a Gateway and HTTPRoute for inference routing. The HTTPRoute routes to the InferencePool that you created in the previous step. The inferencePool.modelServers.matchLabels.app selector matches any pod with the vllm-qwen25-15b-instruct label from step 1.

    kubectl apply -f - <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: agentgateway
      listeners:
      - name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
      rules:
      - backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-qwen25-15b-instruct
        matches:
        - path:
            type: PathPrefix
            value: /
        timeouts:
          request: 300s
    EOF
  6. Verify the end-to-end flow. A request flows through the following path.

        graph LR
        Client -->|curl| Gateway
        Gateway -->|path prefix /| HTTPRoute
        HTTPRoute --> InferencePool
        InferencePool -->|selects model server| vLLM["vLLM pod"]
        vLLM -->|response| Client
      

    Send a test request to the inference gateway.

    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
    PORT=80
    
    curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
      "model": "Qwen/Qwen2.5-1.5B-Instruct",
      "prompt": "What is the warmest city in the USA?",
      "max_tokens": 100,
      "temperature": 0.5
    }'

    Example output:

    HTTP/1.1 200 OK
    date: Sat, 11 April 2026 19:54:07 GMT
    server: uvicorn
    content-type: application/json
    transfer-encoding: chunked
    
    {"choices":[{"finish_reason":"length","index":0,"text":" The warmest city in the United States is Phoenix, Arizona..."}],"model":"Qwen/Qwen2.5-1.5B-Instruct","object":"text_completion","usage":{"completion_tokens":100,"prompt_tokens":10,"total_tokens":110}}

Use AI policies with InferencePools

The quickstart routes directly from an HTTPRoute to an InferencePool. Use that pattern when you only need Gateway API Inference Extension endpoint selection.

To also use agentgateway LLM features, such as token counting, token-based rate limits, guardrails, transformations, and LLM observability, route the HTTPRoute to an AgentgatewayBackend. Then, configure a custom provider on the AgentgatewayBackend that targets the InferencePool.

    graph LR
    Client --> Gateway
    Gateway --> HTTPRoute
    HTTPRoute --> AgentgatewayBackend
    AgentgatewayBackend --> InferencePool
    InferencePool --> ModelServer["model server"]
  
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: qwen-inferencepool
  namespace: agentgateway-system
spec:
  ai:
    provider:
      custom:
        backendRef:
          group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-qwen25-15b-instruct
        model: Qwen/Qwen2.5-1.5B-Instruct
        formats:
        - type: Completions
          path: /v1/chat/completions
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: agentgateway-system
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - group: agentgateway.dev
      kind: AgentgatewayBackend
      name: qwen-inferencepool
    timeouts:
      request: 300s

The following example applies an LLM token budget to the same HTTPRoute. Because the route points to an AgentgatewayBackend with a custom provider, agentgateway can parse the LLM response usage and enforce the token limit while still using the InferencePool for endpoint selection.

apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: qwen-token-budget
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-route
  traffic:
    rateLimit:
      local:
      - tokens: 1000
        unit: Minutes

For more token rate limiting details, see Rate limiting for LLMs.

For more custom provider examples, see Custom providers.

Most users can keep the default llm-d Router OpenAI parser and send OpenAI-compatible requests, such as /v1/chat/completions. If clients send a different request format, configure the llm-d Router EPP parser, such as router.epp.parser, for that client-facing format. For parser options, see the llm-d Router parser docs.
Was this page helpful?
Agentgateway assistant

Ask me anything about agentgateway configuration, features, or usage.

Note: AI-generated content might contain errors; please verify and test all returned information.

Tip: one topic per conversation gives the best results. Use the + button in the chat header to start a new conversation.

Switching topics? Starting a new conversation improves accuracy.
↑↓ navigate select esc dismiss

What could be improved?

Your feedback helps us improve assistant answers and identify docs gaps we should fix.

Need more help? Join us on Discord: https://discord.gg/y9efgEmppm

Want to use your own agent? Add the Solo MCP server to query our docs directly. Get started here: https://search.solo.io/.