Content safety and PII protection

Protect LLM requests and responses from sensitive data exposure and harmful content using layered content safety controls.

About

Content safety helps you prevent sensitive information from reaching LLM providers and block harmful content in both requests and responses. Content safety covers a range of techniques, including personally identifiable information (PII) detection, PII sanitization, data loss prevention, prompt guards, and other guardrail features.

Agentgateway provides a layered approach to content safety through prompt guards that can reject, mask, or moderate content before it reaches the LLM or returns to users.

You can layer multiple protection mechanisms to create comprehensive content safety:

  • Regex-based detection: Fast, deterministic matching for known patterns like credit cards, SSNs, emails, and custom patterns
  • External moderation: Leverage cloud provider guardrails for advanced content filtering
  • Custom webhooks: Integrate your own content safety logic for specialized requirements

This guide shows you how to use each layer and combine them for defense-in-depth content protection.

How content safety works

Agentgateway processes content safety checks in the request and response paths. You can configure multiple prompt guards that run in sequence, allowing you to combine different detection methods.

  sequenceDiagram
    participant Client
    participant Gateway as Agentgateway
    participant Guard as Content Safety Layer
    participant LLM

    Client->>Gateway: Send prompt
    Gateway->>Guard: 1. Regex check (fast)
    Guard-->>Gateway: Pass/Reject/Mask

    alt Passed Regex
        Gateway->>Guard: 2. External moderation (if configured)
        Guard-->>Gateway: Pass/Reject/Mask

        alt Passed Moderation
            Gateway->>Guard: 3. Custom webhook (if configured)
            Guard-->>Gateway: Pass/Reject/Mask

            alt Passed All Guards
                Gateway->>LLM: Forward sanitized request
                LLM-->>Gateway: Generate response
                Gateway->>Guard: Response guards
                Guard-->>Gateway: Pass/Reject/Mask
                Gateway-->>Client: Return sanitized response
            end
        end
    else Rejected
        Gateway-->>Client: Return rejection message
    end

The diagram shows content flowing through multiple guard layers. Each layer can:

  • Pass: Allow content to proceed to the next layer
  • Reject: Block the request and return an error message
  • Mask: Replace sensitive patterns with placeholders and continue
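
The three outcomes can be modeled as a simple decision type. The following is an illustrative sketch, not agentgateway's implementation; the `GuardResult` type and `mask_emails` guard are hypothetical names used only to show the pass/reject/mask semantics.

```python
import re
from dataclasses import dataclass

# Hypothetical model of a guard outcome: pass, reject (with a message),
# or mask (with sanitized content). Not agentgateway's actual code.
@dataclass
class GuardResult:
    action: str            # "pass", "reject", or "mask"
    content: str = ""      # sanitized content when action == "mask"
    message: str = ""      # rejection message when action == "reject"

def mask_emails(text: str) -> GuardResult:
    """Example guard: mask email addresses with a placeholder."""
    masked = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)
    if masked != text:
        return GuardResult("mask", content=masked)
    return GuardResult("pass", content=text)

result = mask_emails("Contact alice@example.com for details")
print(result.action, result.content)   # → mask Contact <EMAIL> for details
```

A masking guard lets the request continue with sanitized content, while a rejecting guard short-circuits the request entirely.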

Choose the right approach

Use this table to decide which content safety layer to use for your requirements.

| Requirement | Recommended approach | Reason |
| --- | --- | --- |
| Detect known PII formats (SSN, credit cards, emails) | Regex with builtins | Fast, deterministic, no external dependencies |
| Block hate speech, violence, harmful content | External moderation (OpenAI, Bedrock) | ML-based detection trained for content safety |
| Organization-specific restricted terms | Regex with custom patterns | Simple pattern matching for known strings |
| Named entity recognition (people, orgs, places) | Custom webhook | Requires NER models not available in built-in options |
| HIPAA, PCI-DSS, or other compliance requirements | Layered approach | Combine regex + external moderation + custom validation |
| Integration with existing DLP tools | Custom webhook | Allows reuse of existing security infrastructure |
| Fastest performance with minimal latency | Regex only | No external API calls |
| Most comprehensive protection | All three layers | Defense-in-depth with multiple detection methods |

Performance considerations

Each content safety layer adds latency to requests. Plan your configuration accordingly.

  • Regex guards: < 1ms per check, negligible latency impact
  • External moderation: 50-200ms depending on provider and network latency
  • Custom webhooks: Varies based on webhook implementation and location

To optimize performance:

  • Use regex for fast, deterministic checks before slower external checks
  • Deploy webhook servers in the same region as the gateway
  • Configure appropriate timeouts for external moderation endpoints
  • Consider request size limits to avoid processing very large prompts
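
Using the rough upper-bound figures above, you can estimate a worst-case latency budget for a fully layered configuration. The arithmetic below is illustrative only; the webhook figure is an assumption, and real latencies depend on your deployment.

```python
# Rough worst-case added latency for a request that passes all layers,
# using the upper-bound figures listed above (illustrative only).
regex_ms = 1            # < 1 ms per regex check
moderation_ms = 200     # 50-200 ms for external moderation
webhook_ms = 100        # assumed figure; depends on webhook deployment

request_path = regex_ms + moderation_ms + webhook_ms
response_path = regex_ms + webhook_ms   # e.g. regex mask + webhook on responses
total = request_path + response_path
print(f"~{total} ms worst-case added latency")
```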

For webhook-specific performance tuning, see the Guardrail Webhook optimization guide.

ℹ️
Evaluation order: Prompt guards are evaluated after rate limiting, so requests that content safety rejects (403 Forbidden) still consume rate limit quota. To avoid spending quota on blocked requests, use authentication policies (JWT/OPA), which are evaluated before rate limiting.

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up access to the OpenAI LLM provider.

Layer 1: Regex-based detection

Regex-based prompt guards provide fast, deterministic pattern matching for known sensitive data formats. Use this layer for common PII patterns and custom organization-specific strings.

Built-in patterns

Agentgateway includes built-in regex patterns for common sensitive data types:

  • CreditCard: Credit card numbers (Visa, MasterCard, Amex, Discover)
  • Ssn: US Social Security Numbers
  • Email: Email addresses
  • PhoneNumber: US phone numbers
  • CaSin: Canadian Social Insurance Numbers
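
The exact built-in patterns are internal to agentgateway, but the kind of matching they perform can be sketched. The candidate regex and Luhn checksum below are illustrative, not the shipped implementation; a checksum pass is a common way to cut false positives on card-like digit runs.

```python
import re

# Illustrative credit-card detection: a broad candidate regex plus a
# Luhn checksum to filter false positives. Not agentgateway's patterns.
CARD_CANDIDATE = re.compile(r"\b\d{13,19}\b")

def luhn_valid(number: str) -> bool:
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def mask_cards(text: str) -> str:
    def replace(m: re.Match) -> str:
        return "<CREDIT_CARD>" if luhn_valid(m.group()) else m.group()
    return CARD_CANDIDATE.sub(replace, text)

print(mask_cards("What type of number is 5105105105105100?"))
# → What type of number is <CREDIT_CARD>?
```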

Example configuration that masks credit cards in responses:

kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: content-safety-regex
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  backend:
    ai:
      promptGuard:
        response:
        - regex:
            builtins:
            - CreditCard
            - Ssn
            - Email
            action: Mask
EOF

Custom patterns

You can also define custom regex patterns for organization-specific sensitive data.

Example that rejects requests containing specific restricted terms:

kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: content-safety-custom
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  backend:
    ai:
      promptGuard:
        request:
        - response:
            message: "Request blocked due to policy violation"
          regex:
            action: Reject
            matches:
            - "confidential"
            - "internal-only"
            - "project-\\w+-secret"  # Custom pattern with regex
EOF
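
You can sanity-check a custom pattern such as `project-\w+-secret` before deploying it. Note the escaping difference: the YAML above doubles the backslash (`\\w`), while a Python raw string needs only one.

```python
import re

# Same pattern as in the policy above; raw string needs a single backslash.
pattern = re.compile(r"project-\w+-secret")

assert pattern.search("discussing project-apollo-secret details")
assert not pattern.search("discussing the apollo project openly")
print("pattern behaves as expected")
```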

Test regex guards

Send a request with a fake credit card number and verify it gets masked in the response:

curl "$INGRESS_GW_ADDRESS/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What type of number is 5105105105105100?"
    }
  ]
}' | jq
If you access the gateway through a local port-forward instead, send the request to localhost:

curl "localhost:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What type of number is 5105105105105100?"
    }
  ]
}' | jq

Example output showing the credit card masked as <CREDIT_CARD>:

{
  "choices": [
    {
      "message": {
        "content": "<CREDIT_CARD> is an even number."
      }
    }
  ]
}

Layer 2: External moderation endpoints

External moderation endpoints use cloud provider AI services to detect harmful content, hate speech, violence, and other policy violations. These services often use ML models trained specifically for content moderation.

OpenAI Moderation

The OpenAI Moderation API detects potentially harmful content across categories including hate, harassment, self-harm, sexual content, and violence.

  1. Create a secret with your OpenAI API key:

    kubectl create secret generic openai-secret \
      -n agentgateway-system \
      --from-literal="Authorization=Bearer $OPENAI_API_KEY"
  2. Configure the prompt guard to use OpenAI Moderation:

    kubectl apply -f - <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: content-safety-openai
      namespace: agentgateway-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      backend:
        ai:
          promptGuard:
            request:
            - openAIModeration:
                policies:
                  auth:
                    secretRef:
                      name: openai-secret
                model: omni-moderation-latest
              response:
                message: "Content blocked by moderation policy"
    EOF
  3. Test with content that triggers moderation:

    curl -i "$INGRESS_GW_ADDRESS/openai" \
      -H "content-type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [
          {
            "role": "user",
            "content": "I want to harm myself"
          }
        ]
      }'
    Or, if you access the gateway through a local port-forward:

    curl -i "localhost:8080/openai" \
      -H "content-type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [
          {
            "role": "user",
            "content": "I want to harm myself"
          }
        ]
      }'

    Expected response:

    HTTP/1.1 403 Forbidden
    Content blocked by moderation policy
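
Conceptually, the gateway's decision reduces to checking the moderation verdict. The sketch below interprets a response shaped like the OpenAI Moderation API's (`results[0].flagged`, `results[0].categories`); the gateway performs this check internally, so the code is illustrative only.

```python
# Illustrative reject decision based on an OpenAI-Moderation-shaped
# verdict: any flagged result blocks the request.
def should_reject(moderation_response: dict) -> bool:
    return any(r.get("flagged", False)
               for r in moderation_response.get("results", []))

sample = {
    "model": "omni-moderation-latest",
    "results": [{"flagged": True, "categories": {"self-harm": True}}],
}
print(should_reject(sample))   # → True
```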

AWS Bedrock Guardrails

AWS Bedrock Guardrails provide content filtering, PII detection, topic restrictions, and word filters. You must first create a guardrail in the AWS Bedrock console.

ℹ️
For instructions on creating Bedrock Guardrails, see the AWS Bedrock Guardrails documentation.
  1. Get your guardrail identifier and version:

    aws bedrock list-guardrails
  2. Configure the prompt guard:

    kubectl apply -f - <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: content-safety-bedrock
      namespace: agentgateway-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      backend:
        ai:
          promptGuard:
            request:
            - bedrockGuardrails:
                guardrailIdentifier: your-guardrail-id
                guardrailVersion: "1"  # or "DRAFT"
                region: us-west-2
                policies:
                  backendAuth:
                    aws: {}
            response:
            - bedrockGuardrails:
                guardrailIdentifier: your-guardrail-id
                guardrailVersion: "1"
                region: us-west-2
                policies:
                  backendAuth:
                    aws: {}
    EOF
ℹ️
The aws: {} configuration uses the default AWS credential chain (IAM role, environment variables, or instance profile). For authentication details, see the AWS authentication documentation.

Google Model Armor

Google Model Armor (formerly Vertex AI Safety) provides content safety filtering for Google Cloud customers. Configuration follows a similar pattern to other external moderation endpoints.

ℹ️
For Google Model Armor configuration details, contact Solo.io support or consult the Google Cloud documentation for Vertex AI content safety features.

Layer 3: Custom webhook integration

For advanced content safety requirements beyond regex and cloud provider services, you can integrate custom webhook servers. This allows you to use specialized ML models, proprietary detection logic, or integrate with existing security tools.

Use cases for custom webhooks

  • Named Entity Recognition (NER) for detecting person names, organizations, locations
  • Industry-specific compliance rules (HIPAA, PCI-DSS, GDPR)
  • Integration with existing DLP or security tools
  • Custom ML models for domain-specific content detection
  • Multi-step validation workflows
  • Advanced contextual analysis

Webhook configuration

Configure a prompt guard to call your webhook service:

kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: content-safety-webhook
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  backend:
    ai:
      promptGuard:
        request:
        - webhook:
            backendRef:
              kind: Service
              name: content-safety-webhook
              port: 8000
        response:
        - webhook:
            backendRef:
              kind: Service
              name: content-safety-webhook
              port: 8000
EOF

For a complete guide on implementing and deploying custom webhook servers, see the Guardrail Webhook API documentation.
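
To make the idea concrete, the decision logic inside such a webhook might look like the sketch below. The JSON contract here (messages in, an `action` verdict out) is hypothetical; the real request and response schema is defined by the Guardrail Webhook API.

```python
# Minimal sketch of webhook-side guard logic. The in/out shapes are
# hypothetical -- see the Guardrail Webhook API for the real schema.
BLOCKED_TERMS = {"confidential", "internal-only"}

def check_messages(messages: list[dict]) -> dict:
    for msg in messages:
        content = msg.get("content", "").lower()
        if any(term in content for term in BLOCKED_TERMS):
            return {"action": "reject", "message": "Blocked by webhook policy"}
    return {"action": "pass"}

verdict = check_messages([{"role": "user", "content": "This is internal-only data"}])
print(verdict["action"])   # → reject
```

In practice this logic would sit behind an HTTP endpoint exposed by the `content-safety-webhook` Service that the policy above references.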

Combining multiple layers

You can configure multiple prompt guards that run in sequence, creating defense-in-depth protection. Guards are evaluated in the order they appear in the configuration.
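
The sequencing can be sketched as a short-circuiting pipeline: a reject stops evaluation, while a mask rewrites the content and passes it to the next guard. This is an illustrative model with hypothetical guard functions, not agentgateway's implementation.

```python
# Illustrative pipeline: guards run in configuration order; reject
# short-circuits, mask rewrites content and continues.
def run_guards(content, guards):
    for guard in guards:
        verdict, content = guard(content)
        if verdict == "reject":
            return "reject", content
    return "pass", content

def regex_guard(text):
    masked = text.replace("123-45-6789", "<SSN>")  # stand-in for a real regex
    return ("mask" if masked != text else "pass"), masked

def term_guard(text):
    return ("reject", "Blocked") if "secret" in text else ("pass", text)

print(run_guards("My SSN is 123-45-6789", [regex_guard, term_guard]))
# → ('pass', 'My SSN is <SSN>')
```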

Example configuration that uses all three layers:

kubectl apply -f - <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: content-safety-layered
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  backend:
    ai:
      promptGuard:
        request:
        # Layer 1: Fast regex check for known patterns
        - regex:
            builtins:
            - Ssn
            - CreditCard
            - Email
            action: Reject
          response:
            message: "Request contains PII and cannot be processed"
        # Layer 2: OpenAI moderation for harmful content
        - openAIModeration:
            policies:
              auth:
                secretRef:
                  name: openai-secret
            model: omni-moderation-latest
          response:
            message: "Content blocked by moderation policy"
        # Layer 3: Custom webhook for domain-specific checks
        - webhook:
            backendRef:
              kind: Service
              name: content-safety-webhook
              port: 8000
        response:
        # Response guards run in same order
        - regex:
            builtins:
            - Ssn
            - CreditCard
            action: Mask
        - webhook:
            backendRef:
              kind: Service
              name: content-safety-webhook
              port: 8000
EOF
