Model id: openai/gpt-oss-120b — pass this as the model field. The API surface is the OpenAI Chat Completions API; existing OpenAI SDKs and any gateway that abstracts over OpenAI-compatible providers work without code changes.

At a glance

Model id: openai/gpt-oss-120b
Provider: openai
HuggingFace: openai/gpt-oss-120b
Context length: 131,072 tokens
Max output: 4,096 tokens
Quantization: mxfp4 (native)
Reasoning: Emits reasoning (OpenAI gpt-oss parser)
Tool calling: OpenAI-format, auto tool choice enabled

Pricing

Input: $0.09 per million tokens
Output: $0.36 per million tokens
Reasoning tokens count toward output. See How billing works.

Quickstart

Set base_url to https://api.tera.gw/v1 and pass your sk-tera-... key. No other code changes are needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tera.gw/v1",
    api_key="sk-tera-...",
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the gpt-oss license in 2 sentences."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)

Reasoning

gpt-oss-120b runs with the OpenAI gpt-oss reasoning parser. Chain-of-thought tokens are returned in a separate reasoning field so they don’t pollute content. OpenAI SDKs that expect a plain content string continue to work.
Some providers expose this as reasoning_content. We follow OpenAI’s recommendation and use reasoning. If you’re porting code that expects reasoning_content, treat the two as aliases.
{
  "choices": [{
    "message": {
      "role": "assistant",
      "reasoning": "The user is asking about... I should think about...",
      "content": "The final answer is X."
    },
    "finish_reason": "stop"
  }]
}
See Reasoning models for the streaming Python loop and rendering patterns.
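When porting code between providers, a small accessor can hide the field-name difference described above. This is a minimal sketch, assuming the message is either an SDK object or a plain dict; get_reasoning is a hypothetical helper name, not part of any SDK, and the sample messages are stand-ins rather than real responses.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_reasoning(message: Any) -> Optional[str]:
    """Return chain-of-thought text from `reasoning`, falling back to `reasoning_content`."""
    for field in ("reasoning", "reasoning_content"):
        if isinstance(message, dict):
            value = message.get(field)
        else:
            value = getattr(message, field, None)
        if value:
            return value
    return None  # the message carried no reasoning under either name


# Stand-in messages (no network call): one per naming convention.
tera_style = {"role": "assistant", "reasoning": "thinking...", "content": "X"}
other_style = SimpleNamespace(reasoning_content="thinking...", content="X")
print(get_reasoning(tera_style))
print(get_reasoning(other_style))
```

Checking reasoning first keeps the behavior aligned with this endpoint, while the fallback lets the same code read responses from providers that use reasoning_content.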

Tool calling

gpt-oss-120b runs with the OpenAI tool-call parser and enable_auto_tool_choice=true. The request and response shapes match the OpenAI Chat Completions API 1:1.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# Turn 1: model decides to call the tool
first = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
tool_call = first.choices[0].message.tool_calls[0]

# Turn 2: provide tool output, get final answer
final = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        first.choices[0].message,
        {"role": "tool", "tool_call_id": tool_call.id, "content": "18C, partly cloudy"},
    ],
    tools=tools,
)
print(final.choices[0].message.content)
tool_choice accepts "auto" (default), "none", "required", or {"type": "function", "function": {"name": "..."}}. Parallel tool calls are supported — the response can contain multiple entries in tool_calls. Streaming tool calls arrive as delta.tool_calls[i].function.arguments JSON fragments that must be concatenated by call index. See Tool calling for the full streaming reconstruction example.
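The concatenate-by-index rule for streamed tool calls can be sketched as follows. This is an illustrative reconstruction, not the full example from the Tool calling page: accumulate_tool_calls is a hypothetical helper, and the fake deltas (including the id call_a and the way the JSON is split) are invented so the snippet runs without a network call.

```python
import json
from types import SimpleNamespace as NS


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by call index."""
    calls = {}
    for delta in deltas:
        for frag in getattr(delta, "tool_calls", None) or []:
            call = calls.setdefault(frag.index, {"id": None, "name": None, "arguments": ""})
            if getattr(frag, "id", None):
                call["id"] = frag.id
            fn = getattr(frag, "function", None)
            if fn is not None:
                if getattr(fn, "name", None):
                    call["name"] = fn.name
                if getattr(fn, "arguments", None):
                    # Argument JSON arrives in pieces; append in arrival order.
                    call["arguments"] += fn.arguments
    return [calls[i] for i in sorted(calls)]


# Fake deltas standing in for chunk.choices[0].delta objects.
fake_deltas = [
    NS(tool_calls=[NS(index=0, id="call_a", function=NS(name="get_weather", arguments='{"ci'))]),
    NS(tool_calls=[NS(index=0, id=None, function=NS(name=None, arguments='ty": "Paris"}'))]),
    NS(tool_calls=None),  # a content-only delta carries no tool-call fragments
]
calls = accumulate_tool_calls(fake_deltas)
print(calls[0]["name"], json.loads(calls[0]["arguments"]))
```

Parse the arguments with json.loads only after the stream ends, since individual fragments are not valid JSON on their own.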

Structured outputs / JSON mode

Pass response_format to constrain the assistant’s output.
# JSON mode — guarantees valid JSON, schema not enforced
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "List 3 colors as JSON."}],
    response_format={"type": "json_object"},
)

# Structured outputs — JSON Schema enforced
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Extract: 'Alice is 30 years old.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)
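Either mode returns the JSON as a string in message.content, so the caller still has to decode it. The sketch below shows one way to do that, assuming a reply that stopped because of max_tokens (finish_reason of length) may contain incomplete JSON; parse_structured is a hypothetical helper and the sample string stands in for a real response.

```python
import json


def parse_structured(content, finish_reason):
    """Decode a structured-output reply; refuse replies cut off by max_tokens."""
    if finish_reason == "length":
        raise ValueError("output truncated before the JSON was complete")
    return json.loads(content)


# Stand-ins for resp.choices[0].message.content and resp.choices[0].finish_reason.
person = parse_structured('{"name": "Alice", "age": 30}', "stop")
print(person["name"], person["age"])
```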

Streaming

Set "stream": true and consume Server-Sent Events on the same endpoint. See Streaming for the wire format.
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)

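shown_tag = False
for chunk in stream:
    if not chunk.choices:  # skip chunks that carry no choices (e.g. usage-only)
        continue
    delta = chunk.choices[0].delta
    # `reasoning` is not a typed field in the SDK, so read it defensively
    if reasoning := getattr(delta, "reasoning", None):
        print(reasoning if shown_tag else f"[think] {reasoning}", end="", flush=True)
        shown_tag = True
    if content := delta.content:
        print(content, end="", flush=True)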

Sampling parameters

Supported parameters: temperature, top_p, top_k, max_tokens, stop, seed, frequency_penalty, presence_penalty, repetition_penalty, logprobs, top_logprobs. seed is honored for deterministic sampling. top_k, repetition_penalty, and min_p are vLLM extensions beyond OpenAI’s surface; clients that don’t send them are unaffected.
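The OpenAI Python SDK rejects keyword arguments it does not know, so the vLLM extensions have to travel through its extra_body parameter, which merges extra fields into the request JSON. The sketch below separates the two groups; it assumes the gateway reads the extensions from the top level of the request body, and split_sampling_params is a hypothetical helper.

```python
OPENAI_PARAMS = {
    "temperature", "top_p", "max_tokens", "stop", "seed",
    "frequency_penalty", "presence_penalty", "logprobs", "top_logprobs",
}
VLLM_EXTENSIONS = {"top_k", "repetition_penalty", "min_p"}


def split_sampling_params(params):
    """Build create() kwargs: standard fields stay top-level, extensions go in extra_body."""
    unknown = set(params) - OPENAI_PARAMS - VLLM_EXTENSIONS
    if unknown:
        raise ValueError(f"unsupported sampling parameters: {sorted(unknown)}")
    kwargs = {k: v for k, v in params.items() if k in OPENAI_PARAMS}
    extra = {k: v for k, v in params.items() if k in VLLM_EXTENSIONS}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


kwargs = split_sampling_params({"temperature": 0.2, "seed": 7, "top_k": 40})
print(kwargs)
# Usage (needs a configured client):
# client.chat.completions.create(model="openai/gpt-oss-120b", messages=[...], **kwargs)
```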

Supported features

tools, json_mode, structured_outputs, reasoning, logprobs

OpenAI compatibility matrix

Field / behavior | Status | Notes
messages (system / user / assistant / tool) | | Standard OpenAI shape
tools / tool_choice / tool_calls | | Auto tool choice enabled
response_format (json_object, json_schema) | | Strict schema enforcement supported
stream (SSE) | | Terminated by data: [DONE]
seed | | Deterministic
logprobs / top_logprobs | |
reasoning field | | Sibling of content; OpenAI-spec field for chain-of-thought. Some providers use reasoning_content as an alias.
OpenAI-Organization, OpenAI-Project headers | | Accepted and ignored
/v1/moderations | | Not offered
/v1/embeddings, /v1/images/*, /v1/fine_tuning/* | | Not offered
Image inputs | | Text-only today

Reliability and routing

For gateways that route across multiple providers (Respan, OpenRouter, in-house abstractions), the relevant behaviors:
  • Cold start: The first request after a backend cold-boot is slow (~2–12s TTFT) because vLLM compiles CUDA graphs on first traffic. Subsequent requests are warm. Schedule a warmup probe before routing real traffic if you can.
  • Gateway-side retry: 5xx errors trigger automatic retry across healthy replicas within the Tera gateway before being returned to you. You’ll see a single response.
  • Health-aware routing: Unhealthy backends are taken out of rotation automatically; clients don’t need to manage this.
  • Concurrency: Per-key concurrency and throughput are sized to your workload. Reach out for higher provisioned envelopes.
  • Idempotency: Requests are not deduplicated server-side. If you retry a request that may have succeeded, you may be billed for both.
  • Streaming cancellation: If the client disconnects mid-stream, generation is cancelled on the backend.
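The warmup probe suggested in the cold-start bullet could look like the sketch below. It assumes a one-token request is enough to trigger backend warmup, which this page does not confirm; warmup is a hypothetical helper, the attempt count and pause are arbitrary, and the broad except is for brevity only.

```python
import time


def warmup(client, model="openai/gpt-oss-120b", attempts=3, pause_s=2.0):
    """Send tiny requests until one succeeds; return seconds taken by the successful call."""
    last_error = None
    for attempt in range(attempts):
        start = time.monotonic()
        try:
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
            return time.monotonic() - start
        except Exception as exc:  # narrow this to the SDK's error types in real code
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(pause_s)
    raise RuntimeError("warmup failed") from last_error


# Usage (needs a configured client):
# elapsed = warmup(client)
# print(f"backend warm after {elapsed:.1f}s")
```

Run the probe before adding the endpoint to a routing pool, so the slow first request is not served to a real user.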

Observability

Every response carries headers and a body useful for trace correlation.
Surface | Value
X-Tera-Request-ID response header | Unique per request. Quote this in support emails.
usage.prompt_tokens | Input tokens billed
usage.completion_tokens | Output tokens billed (includes reasoning)
usage.total_tokens | Sum
choices[0].finish_reason | stop, length, or tool_calls
Read the header with the OpenAI Python SDK via the with_raw_response accessor:
raw = client.chat.completions.with_raw_response.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "hi"}],
)
request_id = raw.http_response.headers.get("X-Tera-Request-ID")
resp = raw.parse()
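For trace correlation, the request ID and usage fields can be flattened into one log record. This is a sketch only: trace_record and its field names are hypothetical, and the stand-in response uses made-up values so the snippet runs without a network call.

```python
from types import SimpleNamespace as NS


def trace_record(request_id, resp):
    """Flatten the request ID, token usage, and finish reason into one loggable dict."""
    finish_reason = resp.choices[0].finish_reason
    return {
        "request_id": request_id,
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,  # includes reasoning tokens
        "total_tokens": resp.usage.total_tokens,
        "finish_reason": finish_reason,
        "truncated": finish_reason == "length",
    }


# Stand-in for a parsed response; the ID and token counts are invented.
fake_resp = NS(
    usage=NS(prompt_tokens=12, completion_tokens=30, total_tokens=42),
    choices=[NS(finish_reason="stop")],
)
record = trace_record("req-example", fake_resp)
print(record)
```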

Errors

All errors return a JSON body of the shape:
{ "error": { "message": "...", "type": "...", "code": "..." } }
HTTP | error.type | When | Retry?
400 | invalid_request_error | Malformed body, unknown sampling parameter values | No (fix the request)
401 | authentication_error | Missing or invalid API key | No (rotate the key)
403 | permission_error | Key not scoped to this model | No (request scope)
404 | model_not_found | model field not in the catalog | No (check /v1/models)
429 | rate_limit_error | Per-key concurrency / token-rate limit hit | Yes (honor Retry-After header)
500 | server_error | Gateway-side failure after backend retries exhausted | Yes (bounded retry with backoff)
503 | service_unavailable | Backend cold or no healthy replica | Yes (first request after long idle may hit this)
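The retry column above can be turned into a small client-side policy. The sketch below is one possible reading of it, not the gateway's own logic: retry_delay is a hypothetical helper, the base delay and cap are arbitrary, and exponential backoff with full jitter is a choice made here for illustration.

```python
import random

RETRYABLE = {429, 500, 503}


def retry_delay(status, attempt, retry_after=None, base_s=0.5, cap_s=30.0):
    """Return seconds to wait before retrying, or None if the error is not retryable."""
    if status not in RETRYABLE:
        return None
    if status == 429 and retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to backoff
    backoff = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, backoff)  # full jitter spreads out simultaneous retries


print(retry_delay(400, attempt=0))
print(retry_delay(429, attempt=0, retry_after="2"))
```

Cap the number of attempts in the calling loop. Because requests are not deduplicated server-side, a retried request that had already succeeded may be billed twice.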

Rate limits

Per-key concurrency and tokens-per-second are provisioned to your expected workload. Tell us the shape — peak QPS, sustained concurrency, rough token volumes — and we’ll size accordingly. Bursts beyond your provisioned envelope return 429 with Retry-After.

Cost example

Typical agentic turn with a tool call (1,000 input tokens, 600 output tokens):
Stage | Tokens | Cost
User prompt + system | 700 in | $0.0000630
Reasoning + tool call | 400 out | $0.0001440
Tool result + final answer | 300 in / 200 out | $0.0000990
Per turn | | ~$0.000306
At 50,000 turns/day this runs ~$15.30/day (~$459/month). Volume committed-use pricing is available: email hello@tera.gw.
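The arithmetic can be reproduced with the listed prices. This sketch assumes a 30-day month and counts reasoning tokens as output, as stated under Pricing; turn_cost is a hypothetical helper name.

```python
INPUT_USD_PER_M = 0.09   # listed input price per million tokens
OUTPUT_USD_PER_M = 0.36  # listed output price per million tokens


def turn_cost(input_tokens, output_tokens):
    """Cost in USD for one turn; reasoning tokens belong in output_tokens."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


per_turn = turn_cost(1_000, 600)  # 700 + 300 input, 400 + 200 output
per_day = per_turn * 50_000
print(f"{per_turn:.6f} {per_day:.2f} {per_day * 30:.2f}")
# prints: 0.000306 15.30 459.00
```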

Onboard

  1. Email hello@tera.gw — tell us expected concurrency, peak QPS, and rough token volumes.
  2. We issue an sk-tera-... key.
  3. Smoke-test against https://api.tera.gw/v1 from your gateway.
  4. Ramp.
Bring the X-Tera-Request-ID of any failing request and we can trace it end-to-end.