Request transformations

Use LLM request transformations to dynamically compute and set fields in LLM requests with Common Expression Language (CEL) expressions. CEL is a simple expression language used throughout agentgateway for flexible configuration; expressions can access the request context, JWT claims, and other variables to make dynamic decisions. Transformations let you enforce policies, such as capping token usage or conditionally modifying request parameters, without changing client code.

To learn more about CEL, see the CEL documentation at https://cel.dev.

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up access to the OpenAI LLM provider.

Configure LLM request transformations

  1. Create an AgentgatewayPolicy resource to apply an LLM request transformation. The following example caps max_tokens to 10, regardless of what the client requests.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: cap-max-tokens
      namespace: agentgateway-system
      labels:
        app: agentgateway
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      backend:
        ai:
          transformations:
          - field: max_tokens
            expression: "min(llmRequest.max_tokens, 10)"
    EOF
    | Setting | Description |
    | --- | --- |
    | `backend.ai.transformations` | A list of LLM request field transformations. |
    | `field` | The name of the LLM request field to set. Maximum 256 characters. |
    | `expression` | A CEL expression that computes the value for the field. Use the `llmRequest` variable to access the original LLM request body. Maximum 16,384 characters. |
    ℹ️ Note: You can specify up to 64 transformations per policy. Transformations take priority over overrides for the same field. If an expression fails to evaluate, the field is silently removed from the request.
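Because a failed expression silently removes the field, you might guard optional fields before reading them. A sketch using the standard CEL `has()` macro (assuming `llmRequest` behaves like a map, so accessing a missing key would otherwise error):

```
has(llmRequest.max_tokens) ? min(llmRequest.max_tokens, 10) : 10
```

This sets `max_tokens` to 10 when the client omits it, instead of dropping the field entirely.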

    Thinking budget fields, such as reasoning_effort and thinking_budget_tokens, can also be set or capped by using transformations. This way, operators can enforce reasoning limits centrally without requiring client changes. For example, use "field": "reasoning_effort" with the expression "medium" to force all requests to medium reasoning effort, regardless of what the client sends.
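As a sketch, the reasoning limits described above might look like the following transformation entries (the `thinking_budget_tokens` cap and its limit of 1024 are illustrative, not taken from this guide):

```yaml
backend:
  ai:
    transformations:
    # Force every request to medium reasoning effort; note the CEL string literal
    # needs its own quotes inside the YAML string.
    - field: reasoning_effort
      expression: '"medium"'
    # Cap an explicit thinking budget instead of overriding it outright
    # (illustrative field name and limit).
    - field: thinking_budget_tokens
      expression: 'min(llmRequest.thinking_budget_tokens, 1024)'
```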

  2. Send a request with max_tokens set to a value greater than 10. The transformation caps it to 10 before the request reaches the LLM provider. Verify that the completion_tokens value in the response is 10 or fewer, that the response text is truncated, and that finish_reason is set to length.

    curl "$INGRESS_GW_ADDRESS/v1/chat/completions" \
    -H "content-type: application/json" \
    -d '{
      "model": "gpt-3.5-turbo",
      "max_tokens": 5000,
      "messages": [
        {
          "role": "user",
          "content": "Tell me a short story"
        }
      ]
    }' | jq

    If you access the gateway through a port-forward instead, replace $INGRESS_GW_ADDRESS with localhost:8080.

    Example output:

    {
      "model": "gpt-3.5-turbo-0125",
      "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 10,
        "total_tokens": 22,
        "completion_tokens_details": {
          "reasoning_tokens": 0,
          "audio_tokens": 0,
          "accepted_prediction_tokens": 0,
          "rejected_prediction_tokens": 0
        },
        "prompt_tokens_details": {
          "cached_tokens": 0,
          "audio_tokens": 0
        }
      },
      "choices": [
        {
          "message": {
            "content": "Once upon a time, in a small village nestled",
            "role": "assistant",
            "refusal": null,
            "annotations": []
          },
          "index": 0,
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      ...
    }
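Both conditions can be checked in one step from a saved response. The following sketch inlines a trimmed response body with only the fields the check needs, and assumes `jq` is installed:

```shell
# Trimmed response body (only the fields the check reads).
response='{"usage":{"completion_tokens":10},"choices":[{"finish_reason":"length"}]}'

tokens=$(printf '%s' "$response" | jq '.usage.completion_tokens')
reason=$(printf '%s' "$response" | jq -r '.choices[0].finish_reason')

# The cap is enforced when the completion stayed within 10 tokens
# and the provider stopped because of the length limit.
if [ "$tokens" -le 10 ] && [ "$reason" = "length" ]; then
  echo "cap enforced: completion_tokens=$tokens finish_reason=$reason"
fi
```

In a real check, replace the inline `response` with the body captured from the curl command above.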
    

Inject LLM model information as response headers

Use CEL expressions to inject LLM model information as response headers. This strategy is useful for detecting silent fallbacks, where a request is redirected to a different model without the client being notified. However, because the full response body must be available to parse, this setup might not be suitable for streaming responses.

Inject model headers from request and response bodies

Parse the model field from the incoming request body and the upstream response body using json(), then inject them as response headers. This configuration lets you compare which model was requested against which model actually responded.

  • json(request.body).model: Reads the model field from the incoming request body.
  • json(response.body).model: Reads the model field from the upstream response body.
  1. Create an AgentgatewayPolicy resource that targets the OpenAI provider’s HTTPRoute and injects the model fields as response headers.

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayPolicy
    metadata:
      name: llm-model-headers
      namespace: agentgateway-system
      labels:
        app: agentgateway
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      traffic:
        transformation:
          response:
            set:
            - name: x-requested-model
              value: 'string(json(request.body).model)'
            - name: x-actual-model
              value: 'string(json(response.body).model)'
    EOF
  2. Send a chat completion request through the gateway and inspect the response headers.

    curl -vi "http://$INGRESS_GW_ADDRESS/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hi"}]}'

    If you access the gateway through a port-forward instead, replace $INGRESS_GW_ADDRESS with localhost:8080.

    Example output:

    < HTTP/1.1 200 OK
    < content-type: application/json
    < x-requested-model: gpt-4
    < x-actual-model: gpt-4
    ...
    

    Actual model values might differ slightly from the requested model, even when the same model serves the request, because some responses include a unique version identifier as part of the model name. In these cases, use the contains() function to compare models instead of checking for exact equality.

    When a fallback model handles the request, x-actual-model differs from x-requested-model:

    < x-requested-model: gpt-4o
    < x-actual-model: gpt-4o-mini
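The comparison that the policy performs with `json()` can be reproduced locally from captured request and response bodies. The bodies below are illustrative, and the sketch assumes `jq` is installed:

```shell
# Illustrative captured bodies: the client asked for gpt-4o,
# but a fallback model answered.
request='{"model":"gpt-4o","messages":[{"role":"user","content":"Hi"}]}'
response='{"model":"gpt-4o-mini","choices":[]}'

requested=$(printf '%s' "$request" | jq -r '.model')
actual=$(printf '%s' "$response" | jq -r '.model')

if [ "$requested" != "$actual" ]; then
  echo "fallback detected: requested=$requested actual=$actual"
else
  echo "no fallback: model=$actual"
fi
```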
    

Detect fallbacks with the llm context variables

When the agentgateway proxy routes to an AI backend, the llm CEL context provides first-class variables that are parsed directly from the LLM protocol layer rather than from raw body strings:

  • llm.requestModel: The model name from the original request.
  • llm.responseModel: The model name the upstream LLM provider reported in the response.

Use metadata to compute each value once and reference it by name. This setup avoids repeating the default() fallback expression in every header and keeps the x-model-fallback condition readable:

kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: llm-context-vars
  namespace: agentgateway-system
  labels:
    app: agentgateway
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  traffic:
    transformation:
      response:
        metadata:
          requestedModel: 'default(llm.requestModel, string(json(request.body).model))'
          actualModel: 'default(llm.responseModel, string(json(response.body).model))'
        set:
        - name: x-requested-model
          value: metadata.requestedModel
        - name: x-actual-model
          value: metadata.actualModel
        - name: x-model-fallback
          value: 'metadata.requestedModel != metadata.actualModel ? "true" : "false"'
EOF

The default() fallback is written once per value rather than repeated in every header and in the comparison.
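To check the new header without reading the whole response, you can grep a captured `curl -vi` header dump. The header values below are illustrative:

```shell
# Illustrative response headers captured from a request that fell back
# to a smaller model.
headers='< x-requested-model: gpt-4o
< x-actual-model: gpt-4o-mini
< x-model-fallback: true'

if printf '%s\n' "$headers" | grep -q 'x-model-fallback: true'; then
  echo "fallback occurred"
fi
```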

Cleanup

You can remove the resources that you created in this guide.
kubectl delete AgentgatewayPolicy cap-max-tokens -n agentgateway-system --ignore-not-found
kubectl delete AgentgatewayPolicy llm-model-headers -n agentgateway-system --ignore-not-found
kubectl delete AgentgatewayPolicy llm-context-vars -n agentgateway-system --ignore-not-found
kubectl delete httproute openai -n agentgateway-system --ignore-not-found