gpt-oss-120b

Model id: openai/gpt-oss-120b — pass this as the model field. The API surface is the OpenAI Chat Completions API; existing OpenAI SDKs and any gateway that abstracts over OpenAI-compatible providers work without code changes.

At a glance


Model id	`openai/gpt-oss-120b`
Provider	openai
HuggingFace	`openai/gpt-oss-120b`
Context length	131,072 tokens
Max output	4,096 tokens
Quantization	`mxfp4` (native)
Reasoning	Emits `reasoning` (OpenAI gpt-oss parser)
Tool calling	OpenAI-format, auto tool choice enabled

Pricing

	per million tokens
Input	$0.09
Output	$0.36

Reasoning tokens count toward output. See How billing works.

Quickstart

Set base_url to https://api.tera.gw/v1 and pass your sk-tera-... key. No other code change.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.tera.gw/v1",
    api_key="sk-tera-...",
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the gpt-oss license in 2 sentences."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)

Reasoning

gpt-oss-120b runs with the OpenAI gpt-oss reasoning parser. Chain-of-thought tokens are returned in a separate reasoning field so they don’t pollute content. OpenAI SDKs that expect a plain content string continue to work.

Some providers expose this as reasoning_content. We follow OpenAI’s recommendation and use reasoning. If you’re porting code that expects reasoning_content, treat the two as aliases.

{
  "choices": [{
    "message": {
      "role": "assistant",
      "reasoning": "The user is asking about... I should think about...",
      "content": "The final answer is X."
    },
    "finish_reason": "stop"
  }]
}
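When porting code between providers, a small accessor can hide the naming difference. The helper below is a minimal sketch: `get_reasoning` is a hypothetical name, and it assumes the message is either a raw JSON dict or an SDK object that exposes extra fields as attributes.

```python
from typing import Any, Optional


def get_reasoning(message: Any) -> Optional[str]:
    """Return chain-of-thought text from `reasoning`, else `reasoning_content`."""
    for field in ("reasoning", "reasoning_content"):
        if isinstance(message, dict):
            value = message.get(field)
        else:
            value = getattr(message, field, None)
        if value:
            return value
    return None


# Works on a raw JSON dict as well as on an SDK message object.
ported = {"role": "assistant", "reasoning_content": "thinking...", "content": "X"}
print(get_reasoning(ported))
```

Checking `reasoning` first keeps the helper aligned with the field name used here, and the fallback only matters for responses from other providers.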

See Reasoning models for the streaming Python loop and rendering patterns.

Tool calling

gpt-oss-120b runs with the OpenAI tool-call parser and enable_auto_tool_choice=true. The request and response shapes match the OpenAI Chat Completions API 1:1.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Turn 1: model decides to call the tool
first = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
tool_call = first.choices[0].message.tool_calls[0]

# Turn 2: provide tool output, get final answer
final = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        first.choices[0].message,
        {"role": "tool", "tool_call_id": tool_call.id, "content": "18C, partly cloudy"},
    ],
    tools=tools,
)
print(final.choices[0].message.content)

tool_choice accepts "auto" (default), "none", "required", or {"type": "function", "function": {"name": "..."}}. Parallel tool calls are supported — the response can contain multiple entries in tool_calls. Streaming tool calls arrive as delta.tool_calls[i].function.arguments JSON fragments that must be concatenated by call index. See Tool calling for the full streaming reconstruction example.

Structured outputs / JSON mode

Pass response_format to constrain the assistant’s output.

# JSON mode — guarantees valid JSON, schema not enforced
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "List 3 colors as JSON."}],
    response_format={"type": "json_object"},
)

# Structured outputs — JSON Schema enforced
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Extract: 'Alice is 30 years old.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)
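In both modes the JSON arrives as a string in `message.content`, so the caller still has to parse it. The sketch below shows one way to do that for the `person` schema. `parse_person` is a hypothetical helper, and the client-side check is a defensive assumption (for example, a response cut off with `finish_reason` of `length` may contain incomplete JSON), not something the API requires.

```python
import json


def parse_person(content: str) -> dict:
    """Parse message.content and sanity-check it against the `person` schema."""
    data = json.loads(content)  # raises json.JSONDecodeError on truncated output
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got: {data!r}")
    name, age = data.get("name"), data.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(name, str) or not isinstance(age, int) or isinstance(age, bool):
        raise ValueError(f"unexpected payload: {data!r}")
    return data


# With a real response: parse_person(resp.choices[0].message.content)
person = parse_person('{"name": "Alice", "age": 30}')
print(person["name"], person["age"])
```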

Streaming

Set "stream": true in the request body (stream=True in the Python SDK) and consume Server-Sent Events on the same endpoint. See Streaming for the wire format.

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if reasoning := getattr(delta, "reasoning", None):
        print(f"[think] {reasoning}", end="", flush=True)
    if content := delta.content:
        print(content, end="", flush=True)

Sampling parameters

temperature, top_p, top_k, max_tokens, stop, seed, frequency_penalty, presence_penalty, repetition_penalty, logprobs, top_logprobs

seed is honored for deterministic sampling. top_k, repetition_penalty, and min_p are vLLM extensions beyond OpenAI’s surface; clients that don’t send them are unaffected.
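The OpenAI Python SDK's `create()` does not take the vLLM extensions as keyword arguments, but its `extra_body` option merges additional fields into the request JSON. The sketch below separates the two groups. `split_sampling_params` is a hypothetical helper, and it assumes the gateway reads the extension fields from the top level of the request body.

```python
# Extension parameters named above that create() does not accept directly.
EXTENSION_PARAMS = {"top_k", "repetition_penalty", "min_p"}


def split_sampling_params(params: dict) -> dict:
    """Return kwargs for create(), with vLLM-only fields moved under extra_body."""
    kwargs = {k: v for k, v in params.items() if k not in EXTENSION_PARAMS}
    extra = {k: v for k, v in params.items() if k in EXTENSION_PARAMS}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


kwargs = split_sampling_params({"temperature": 0.7, "top_k": 40, "seed": 1})
print(kwargs)
# → {'temperature': 0.7, 'seed': 1, 'extra_body': {'top_k': 40}}
```

The result can be splatted into the call, e.g. `client.chat.completions.create(model="openai/gpt-oss-120b", messages=[...], **kwargs)`.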

Supported features

tools, json_mode, structured_outputs, reasoning, logprobs

OpenAI compatibility matrix

Field / behavior	Status	Notes
`messages` (system / user / assistant / tool)	✅	Standard OpenAI shape
`tools` / `tool_choice` / `tool_calls`	✅	Auto tool choice enabled
`response_format` (`json_object`, `json_schema`)	✅	Strict schema enforcement supported
`stream` (SSE)	✅	Terminated by `data: [DONE]`
`seed`	✅	Deterministic
`logprobs` / `top_logprobs`	✅
`reasoning` field	✅	Sibling of `content`; OpenAI-spec field for chain-of-thought. Some providers use `reasoning_content` as an alias.
`OpenAI-Organization`, `OpenAI-Project` headers	➖	Accepted and ignored
`/v1/moderations`	❌	Not offered
`/v1/embeddings`, `/v1/images/`, `/v1/fine_tuning/`	❌	Not offered
Image inputs	❌	Text-only today

Reliability and routing

For gateways that route across multiple providers (Respan, OpenRouter, in-house abstractions), the relevant behaviors:

Cold start: The first request after a backend cold-boot is slow (~2–12s TTFT) because vLLM compiles CUDA graphs on first traffic. Subsequent requests are warm. Schedule a warmup probe before routing real traffic if you can.
Gateway-side retry: 5xx errors trigger automatic retry across healthy replicas within the Tera gateway before being returned to you. You’ll see a single response.
Health-aware routing: Unhealthy backends are taken out of rotation automatically; clients don’t need to manage this.
Concurrency: Per-key concurrency and throughput are sized to your workload. Reach out for higher provisioned envelopes.
Idempotency: Requests are not deduplicated server-side. If you retry a request that may have succeeded, you may be billed for both.
Streaming cancellation: If the client disconnects mid-stream, generation is cancelled on the backend.
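A warmup probe for the cold-start case can be as small as a single one-token request. The sketch below is one possible shape: `warmup` is a hypothetical helper that takes an already configured OpenAI client and returns the observed latency. Note that the probe is billed like any other request.

```python
import time


def warmup(client, model: str = "openai/gpt-oss-120b") -> float:
    """Send one tiny request and return its end-to-end latency in seconds."""
    start = time.monotonic()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,  # keep the probe as cheap as possible
    )
    return time.monotonic() - start
```

A gateway could call `warmup(client)` before adding the backend to rotation and log the returned latency to confirm that later requests are warm.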

Observability

Every response carries headers and a body useful for trace correlation.

Surface	Value
`X-Tera-Request-ID` response header	Unique per request. Quote this in support emails.
`usage.prompt_tokens`	Input tokens billed
`usage.completion_tokens`	Output tokens billed (includes reasoning)
`usage.total_tokens`	Sum
`choices[0].finish_reason`	`stop`, `length`, or `tool_calls`

Read the header with the OpenAI Python SDK via the with_raw_response accessor:

raw = client.chat.completions.with_raw_response.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "hi"}],
)
request_id = raw.http_response.headers.get("X-Tera-Request-ID")
resp = raw.parse()

Errors

All errors return a JSON body of the shape:

{ "error": { "message": "...", "type": "...", "code": "..." } }

HTTP	`error.type`	When	Retry?
`400`	`invalid_request_error`	Malformed body, unknown sampling parameter values	No — fix the request
`401`	`authentication_error`	Missing or invalid API key	No — rotate the key
`403`	`permission_error`	Key not scoped to this model	No — request scope
`404`	`model_not_found`	`model` field not in the catalog	No — check `/v1/models`
`429`	`rate_limit_error`	Per-key concurrency / token-rate limit hit	Yes — honor `Retry-After` header
`500`	`server_error`	Gateway-side failure after backend retries exhausted	Yes — bounded retry with backoff
`503`	`service_unavailable`	Backend cold or no healthy replica	Yes — first request after long idle may hit this
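The Retry? column can be turned into a small client-side policy. The sketch below is an assumption about how a caller might do it, not gateway behavior: `retry_delay` is a hypothetical helper, the retry cap and backoff constants are arbitrary, and `headers` is assumed to be a plain dict keyed exactly as `Retry-After`.

```python
import random

RETRYABLE_STATUSES = {429, 500, 503}  # the "Yes" rows in the table above


def retry_delay(status, headers, attempt, max_attempts=5, cap=30.0):
    """Return seconds to wait before the next retry, or None to stop retrying.

    `attempt` is zero-based: 0 means the first retry.
    """
    if status not in RETRYABLE_STATUSES or attempt >= max_attempts:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds (e.g. an HTTP date); use backoff
    # Exponential backoff with full jitter: up to 1s, 2s, 4s, ... capped at `cap`.
    return random.uniform(0.0, min(cap, 2.0 ** attempt))


print(retry_delay(429, {"Retry-After": "3"}, attempt=0))  # → 3.0
print(retry_delay(400, {}, attempt=0))                    # → None
```

Because requests are not deduplicated server-side, a retry of a request that actually succeeded may be billed twice, so keep `max_attempts` small. The OpenAI Python SDK also has its own `max_retries` client option, which may be enough for simple cases.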

Rate limits

Per-key concurrency and tokens-per-second are provisioned to your expected workload. Tell us the shape — peak QPS, sustained concurrency, rough token volumes — and we’ll size accordingly. Bursts beyond your provisioned envelope return 429 with Retry-After.

Cost example

Typical agentic turn with a tool call (1,000 input tokens, 600 output tokens):

Stage	Tokens	Cost
User prompt + system	700 in	$0.0000630
Reasoning + tool call	400 out	$0.0001440
Tool result + final answer	300 in / 200 out	$0.0000990
Per turn		~$0.000306

At 50,000 turns/day this runs ~$15.30/day (~$459/month). Volume committed-use pricing available: email hello@tera.gw.
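The arithmetic above can be reproduced from the `usage` fields of a response. This is a minimal sketch using the listed prices; `turn_cost` is a hypothetical helper, and the 30-day month is an assumption.

```python
INPUT_USD_PER_M = 0.09   # input price per million tokens
OUTPUT_USD_PER_M = 0.36  # output price per million tokens (reasoning included)


def turn_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for one turn, given usage.prompt_tokens / usage.completion_tokens."""
    total = prompt_tokens * INPUT_USD_PER_M + completion_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


per_turn = turn_cost(1_000, 600)
print(f"per turn:    ${per_turn:.6f}")                # → $0.000306
print(f"per day:     ${per_turn * 50_000:.2f}")       # → $15.30
print(f"per 30 days: ${per_turn * 50_000 * 30:.2f}")  # → $459.00
```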

Onboard

Email hello@tera.gw — tell us expected concurrency, peak QPS, and rough token volumes.
We issue an sk-tera-... key.
Smoke-test against https://api.tera.gw/v1 from your gateway.
Ramp.

Bring the X-Tera-Request-ID of any failing request and we can trace it end-to-end.
