# gpt-oss-20b

> OpenAI's open-weight 20B reasoning model on Tera. OpenAI-compatible. 131k context. US-only, zero retention.

<Info>
  Model id: `openai/gpt-oss-20b`. Pass this as the `model` field. The API surface is the OpenAI Chat Completions API; existing OpenAI SDKs and any gateway that abstracts over OpenAI-compatible providers work once the base URL and API key point at Tera.
</Info>

## At a glance

|                    |                                                                   |
| ------------------ | ----------------------------------------------------------------- |
| **Model id**       | `openai/gpt-oss-20b`                                              |
| **Provider**       | openai                                                            |
| **HuggingFace**    | [`openai/gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) |
| **Context length** | 131,072 tokens                                                    |
| **Max output**     | 4,096 tokens                                                      |
| **Quantization**   | `mxfp4` (native), prefix caching enabled                          |
| **Reasoning**      | Emits `reasoning` (OpenAI gpt-oss parser)                         |
| **Tool calling**   | OpenAI-format, auto tool choice enabled                           |

## Pricing

|        | per million tokens |
| ------ | ------------------ |
| Input  | \$0.07             |
| Output | \$0.25             |

Reasoning tokens count toward output. See [How billing works](/pricing#how-billing-works).

## Quickstart

Set `base_url` to `https://api.tera.gw/v1` and pass your `sk-tera-...` key. No other code change.

<CodeGroup>
  ```python python theme={null}
  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.tera.gw/v1",
      api_key="sk-tera-...",
  )

  resp = client.chat.completions.create(
      model="openai/gpt-oss-20b",
      messages=[{"role": "user", "content": "Summarize the gpt-oss license in 2 sentences."}],
      max_tokens=512,
  )
  print(resp.choices[0].message.content)
  ```

  ```javascript node theme={null}
  import OpenAI from "openai";

  const client = new OpenAI({
    baseURL: "https://api.tera.gw/v1",
    apiKey: process.env.TERA_API_KEY,
  });

  const resp = await client.chat.completions.create({
    model: "openai/gpt-oss-20b",
    messages: [{ role: "user", content: "Summarize the gpt-oss license in 2 sentences." }],
    max_tokens: 512,
  });

  console.log(resp.choices[0].message.content);
  ```

  ```bash curl theme={null}
  curl https://api.tera.gw/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TERA_API_KEY" \
    -d '{
      "model": "openai/gpt-oss-20b",
      "messages": [{"role": "user", "content": "Summarize the gpt-oss license in 2 sentences."}],
      "max_tokens": 512
    }'
  ```
</CodeGroup>

## Reasoning

gpt-oss-20b runs with the OpenAI gpt-oss reasoning parser. Chain-of-thought tokens are returned in a separate `reasoning` field so they don't pollute `content`. OpenAI SDKs that expect a plain `content` string continue to work.

<Note>
  Some providers expose this as `reasoning_content`. We follow OpenAI's recommendation and use `reasoning`. If you're porting code that expects `reasoning_content`, treat the two as aliases.
</Note>

<CodeGroup>
  ```json non-streaming theme={null}
  {
    "choices": [{
      "message": {
        "role": "assistant",
        "reasoning": "The user is asking about... I should think about...",
        "content": "The final answer is X."
      },
      "finish_reason": "stop"
    }]
  }
  ```

  ```text streaming theme={null}
  data: {"choices":[{"delta":{"role":"assistant","reasoning":"The user is"}}]}
  data: {"choices":[{"delta":{"reasoning":" asking about"}}]}
  ...
  data: {"choices":[{"delta":{"content":"The final answer is"}}]}
  data: {"choices":[{"delta":{"content":" X."}}]}
  data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
  data: [DONE]
  ```
</CodeGroup>

See [Reasoning models](/concepts/reasoning) for the streaming Python loop and rendering patterns.
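
When porting code between providers, a small accessor can hide the `reasoning` / `reasoning_content` difference described in the note above. The sketch below is an illustration rather than part of the Tera API: `extract_reasoning` is a hypothetical helper name, and it assumes the message is either a plain dict (parsed JSON) or an SDK object that exposes extra fields as attributes.

```python
from typing import Any, Optional


def extract_reasoning(message: Any) -> Optional[str]:
    """Return chain-of-thought text from `reasoning`, else `reasoning_content`.

    Hypothetical helper: accepts a dict (raw JSON) or an attribute-style object.
    """
    for field in ("reasoning", "reasoning_content"):
        if isinstance(message, dict):
            value = message.get(field)
        else:
            value = getattr(message, field, None)
        if value:
            return value
    return None  # neither field present, or both empty
```

With the Python SDK this would be called as `extract_reasoning(resp.choices[0].message)`; the final answer still comes from `message.content`.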

## Tool calling

gpt-oss-20b runs with the OpenAI tool-call parser and `enable_auto_tool_choice=true`. The request and response shapes match the OpenAI Chat Completions API 1:1.

```python theme={null}
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Turn 1: model decides to call the tool
first = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# Assumes the model chose to call the tool; `tool_calls` is None when it answers directly.
tool_call = first.choices[0].message.tool_calls[0]

# Turn 2: provide tool output, get final answer
final = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        first.choices[0].message,
        {"role": "tool", "tool_call_id": tool_call.id, "content": "18C, partly cloudy"},
    ],
    tools=tools,
)
print(final.choices[0].message.content)
```

`tool_choice` accepts `"auto"` (default), `"none"`, `"required"`, or `{"type": "function", "function": {"name": "..."}}`. Parallel tool calls are supported — the response can contain multiple entries in `tool_calls`. Streaming tool calls arrive as `delta.tool_calls[i].function.arguments` JSON fragments that must be concatenated by call index.

See [Tool calling](/concepts/tool-calling) for the full streaming reconstruction example.
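
As a compact illustration of concatenating fragments by call index, the sketch below merges streamed deltas into complete calls. It is a minimal version under stated assumptions: `assemble_tool_calls` is a hypothetical helper, the deltas are plain dicts shaped like parsed `choices[0].delta` objects, and the example values are made up. For SDK chunk objects, something like `chunk.choices[0].delta.model_dump()` should produce this shape.

```python
import json
from typing import Any, Dict, Iterable, List


def assemble_tool_calls(deltas: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Merge streamed `delta.tool_calls` fragments into complete calls, keyed by `index`."""
    calls: Dict[int, Dict[str, str]] = {}
    for delta in deltas:
        for frag in delta.get("tool_calls") or []:
            call = calls.setdefault(frag["index"], {"id": "", "name": "", "arguments": ""})
            if frag.get("id"):
                call["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                call["name"] = fn["name"]
            # Argument JSON arrives in pieces; append them in arrival order.
            call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Illustrative deltas (made-up values), not captured model output.
example = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city": '}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '"Paris"}'}}]},
]
calls = assemble_tool_calls(example)
args = json.loads(calls[0]["arguments"])  # parse only after the stream has finished
```

Parsing the arguments only once the stream ends avoids decoding a partial JSON string.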

## Structured outputs / JSON mode

Pass `response_format` to constrain the assistant's output.

```python theme={null}
# JSON mode — guarantees valid JSON, schema not enforced
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "List 3 colors as JSON."}],
    response_format={"type": "json_object"},
)

# Structured outputs — JSON Schema enforced
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: 'Alice is 30 years old.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)
```
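
The constrained output still arrives as a string in `message.content`, so the caller decodes it. The following is a minimal sketch, assuming the `person` schema from the example above; `parse_person` is a hypothetical helper, and the sample string stands in for real model output. It also checks `finish_reason`, since a response cut off at the output limit can end mid-object.

```python
import json
from typing import Any, Dict


def parse_person(content: str, finish_reason: str = "stop") -> Dict[str, Any]:
    """Decode the `person` object described by the example schema."""
    if finish_reason == "length":
        # Output hit the token limit, so the JSON may be cut off mid-object.
        raise ValueError("response truncated; raise max_tokens and retry")
    data = json.loads(content)
    # Light sanity check mirroring the schema's required fields.
    if (not isinstance(data, dict)
            or not isinstance(data.get("name"), str)
            or not isinstance(data.get("age"), int)):
        raise ValueError(f"unexpected shape: {data!r}")
    return data


# Stand-in for resp.choices[0].message.content (illustrative value).
sample = '{"name": "Alice", "age": 30}'
person = parse_person(sample)
```

In real use, pass `resp.choices[0].message.content` and `resp.choices[0].finish_reason`.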

## Streaming

Set `"stream": true` and consume Server-Sent Events on the same endpoint. See [Streaming](/concepts/streaming) for the wire format.

```python theme={null}
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)

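thinking = False
for chunk in stream:
    if not chunk.choices:  # skip chunks that carry no choices
        continue
    delta = chunk.choices[0].delta
    if reasoning := getattr(delta, "reasoning", None):
        if not thinking:
            print("[think] ", end="", flush=True)  # label the reasoning once, not per chunk
            thinking = True
        print(reasoning, end="", flush=True)
    if content := delta.content:
        if thinking:
            print()  # end the reasoning line before the answer starts
            thinking = False
        print(content, end="", flush=True)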
```

## Sampling parameters

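`temperature`, `top_p`, `top_k`, `min_p`, `max_tokens`, `stop`, `seed`, `frequency_penalty`, `presence_penalty`, `repetition_penalty`, `logprobs`, `top_logprobs`

`seed` is honored for deterministic sampling. `top_k`, `repetition_penalty`, and `min_p` are vLLM extensions beyond OpenAI's surface. They are optional: requests that omit them behave like standard OpenAI requests. The OpenAI SDKs have no named arguments for them, so send them as extra body fields (`extra_body` in the Python SDK).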
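
One way to pass the vLLM extensions from the OpenAI Python SDK is its `extra_body` argument, which merges additional fields into the request JSON. The sketch below only builds the keyword arguments so it runs without a network call; `build_request` is a hypothetical helper and the sampling values are illustrative, not recommendations.

```python
from typing import Any, Dict


def build_request(prompt: str, **vllm_extras: Any) -> Dict[str, Any]:
    """Build keyword arguments for `client.chat.completions.create`.

    Standard OpenAI parameters stay top-level; non-OpenAI sampling fields
    travel in `extra_body`.
    """
    kwargs: Dict[str, Any] = {
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,  # illustrative value
        "seed": 1234,        # illustrative value
    }
    if vllm_extras:
        kwargs["extra_body"] = dict(vllm_extras)
    return kwargs


request = build_request("Count to five.", top_k=40, repetition_penalty=1.05)
# With a configured client: resp = client.chat.completions.create(**request)
```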

## Supported features

`tools`, `json_mode`, `structured_outputs`, `reasoning`, `logprobs`

## OpenAI compatibility matrix

| Field / behavior                                      | Status | Notes                                                                                                             |
| ----------------------------------------------------- | ------ | ----------------------------------------------------------------------------------------------------------------- |
| `messages` (system / user / assistant / tool)         | ✅      | Standard OpenAI shape                                                                                             |
| `tools` / `tool_choice` / `tool_calls`                | ✅      | Auto tool choice enabled                                                                                          |
| `response_format` (`json_object`, `json_schema`)      | ✅      | Strict schema enforcement supported                                                                               |
| `stream` (SSE)                                        | ✅      | Terminated by `data: [DONE]`                                                                                      |
| `seed`                                                | ✅      | Deterministic                                                                                                     |
| `logprobs` / `top_logprobs`                           | ✅      |                                                                                                                   |
| `reasoning` field                                     | ✅      | Sibling of `content`; OpenAI-spec field for chain-of-thought. Some providers use `reasoning_content` as an alias. |
| `OpenAI-Organization`, `OpenAI-Project` headers       | ➖      | Accepted and ignored                                                                                              |
| `/v1/moderations`                                     | ❌      | Not offered                                                                                                       |
| `/v1/embeddings`, `/v1/images/*`, `/v1/fine_tuning/*` | ❌      | Not offered                                                                                                       |
| Image inputs                                          | ❌      | Text-only today                                                                                                   |

## Reliability and routing

For gateways that route across multiple providers (Respan, OpenRouter, in-house abstractions), these are the relevant behaviors:

* **Cold start**: The first request after a backend cold-boot is slow (\~2–12s TTFT) because vLLM compiles CUDA graphs on first traffic. Subsequent requests are warm. Schedule a warmup probe before routing real traffic if you can.
* **Gateway-side retry**: 5xx errors trigger automatic retry across healthy replicas within the Tera gateway before being returned to you. You'll see a single response.
* **Health-aware routing**: Unhealthy backends are taken out of rotation automatically; clients don't need to manage this.
* **Concurrency**: Per-key concurrency and throughput are sized to your workload. Reach out for higher provisioned envelopes.
* **Idempotency**: Requests are not deduplicated server-side. If you retry a request that may have succeeded, you may be billed for both.
* **Streaming cancellation**: If the client disconnects mid-stream, generation is cancelled on the backend.

## Observability

Every response carries a header and body fields useful for trace correlation.

| Surface                             | Value                                             |
| ----------------------------------- | ------------------------------------------------- |
| `X-Tera-Request-ID` response header | Unique per request. Quote this in support emails. |
| `usage.prompt_tokens`               | Input tokens billed                               |
| `usage.completion_tokens`           | Output tokens billed (includes reasoning)         |
| `usage.total_tokens`                | Sum                                               |
| `choices[0].finish_reason`          | `stop`, `length`, or `tool_calls`                 |

Read the header with the OpenAI Python SDK via the `with_raw_response` accessor:

```python theme={null}
raw = client.chat.completions.with_raw_response.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "hi"}],
)
request_id = raw.http_response.headers.get("X-Tera-Request-ID")
resp = raw.parse()
```

## Errors

All errors return a JSON body of the shape:

```json theme={null}
{ "error": { "message": "...", "type": "...", "code": "..." } }
```

| HTTP  | `error.type`            | When                                                 | Retry?                                           |
| ----- | ----------------------- | ---------------------------------------------------- | ------------------------------------------------ |
| `400` | `invalid_request_error` | Malformed body, unknown sampling parameter values    | No — fix the request                             |
| `401` | `authentication_error`  | Missing or invalid API key                           | No — rotate the key                              |
| `403` | `permission_error`      | Key not scoped to this model                         | No — request scope                               |
| `404` | `model_not_found`       | `model` field not in the catalog                     | No — check `/v1/models`                          |
| `429` | `rate_limit_error`      | Per-key concurrency / token-rate limit hit           | Yes — honor `Retry-After` header                 |
| `500` | `server_error`          | Gateway-side failure after backend retries exhausted | Yes — bounded retry with backoff                 |
| `503` | `service_unavailable`   | Backend cold or no healthy replica                   | Yes — first request after long idle may hit this |
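
The "Retry?" column above can be turned into a small client-side policy. This is a sketch under assumptions, not Tera's prescribed algorithm: `retry_delay` is a hypothetical helper, `base` and `cap` are illustrative defaults, and `Retry-After` is assumed to be a number of seconds (other formats fall back to backoff).

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {429, 500, 503}  # rows marked "Yes" in the table above


def retry_delay(status: int, attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, cap: float = 30.0) -> Optional[float]:
    """Return seconds to wait before the next attempt, or None if not retryable.

    `attempt` is 0 for the first retry. `base` and `cap` are illustrative.
    """
    if status not in RETRYABLE_STATUSES:
        return None
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # honor the server's hint
        except ValueError:
            pass  # non-numeric value (e.g. an HTTP date); fall back to backoff
    # Exponential backoff with full jitter, bounded by `cap`.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Because requests are not deduplicated server-side, a retry of a request that actually succeeded can be billed twice, so keep the attempt count bounded. The OpenAI SDKs also retry some failures on their own (`max_retries` on the client), so account for that before layering another loop on top.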

## Rate limits

Per-key concurrency and tokens-per-second are provisioned to your expected workload. Tell us the shape — peak QPS, sustained concurrency, rough token volumes — and we'll size accordingly. Bursts beyond your provisioned envelope return `429` with `Retry-After`.

## Cost example

Typical agentic turn with a tool call (1,000 input tokens, 600 output tokens):

| Stage                      |           Tokens |             Cost |
| -------------------------- | ---------------: | ---------------: |
| User prompt + system       |           700 in |      \$0.0000490 |
| Reasoning + tool call      |          400 out |      \$0.0001000 |
| Tool result + final answer | 300 in / 200 out |      \$0.0000710 |
| **Per turn**               |                  | **\~\$0.000220** |

At 50,000 turns/day this runs \~\$11.00/day (\~\$330/month). Volume committed-use pricing is available; email [hello@tera.gw](mailto:hello@tera.gw).
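
The arithmetic behind the table can be reproduced with a few lines. This sketch uses the list prices from the Pricing section and the token counts from the example; `turn_cost` is a hypothetical helper name, and the 30-day month is an assumption.

```python
INPUT_PER_M = 0.07   # USD per million input tokens (Pricing table)
OUTPUT_PER_M = 0.25  # USD per million output tokens, reasoning included


def turn_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given token counts at the listed rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000


per_turn = turn_cost(1_000, 600)  # 700 + 300 input, 400 + 200 output
per_day = per_turn * 50_000       # 50,000 turns/day
per_month = per_day * 30          # assumes a 30-day month
print(f"{per_turn:.6f} USD/turn, {per_day:.2f} USD/day, {per_month:.2f} USD/month")
```

To estimate from live traffic instead, feed `usage.prompt_tokens` and `usage.completion_tokens` from each response into the same function.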

## Onboard

1. Email [hello@tera.gw](mailto:hello@tera.gw?subject=Tera%20API%20access%20%E2%80%94%20gpt-oss-20b) — tell us expected concurrency, peak QPS, and rough token volumes.
2. We issue an `sk-tera-...` key.
3. Smoke-test against `https://api.tera.gw/v1` from your gateway.
4. Ramp.

Bring the `X-Tera-Request-ID` of any failing request and we can trace it end-to-end.
