
Documentation Index

Fetch the complete documentation index at: https://docs.tera.gw/llms.txt

Use this file to discover all available pages before exploring further.

Tera implements the OpenAI Chat Completions API surface. Existing OpenAI SDKs work against Tera once you change two settings: the base URL and the API key.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tera.gw/v1",
    api_key="sk-tera-...",
)
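
A minimal request against that client, as a sketch. The model ID is the catalog example used later on this page; substitute any ID from Models:

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)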

What’s the same

  • Endpoints — /v1/chat/completions, /v1/completions, /v1/models, /v1/audio/speech
  • Streaming — Server-Sent Events with data: {...} frames terminated by data: [DONE] (see the sketch after this list)
  • Request shape — messages, temperature, top_p, max_tokens, stop, seed, frequency_penalty, presence_penalty, stream, tools, tool_choice, response_format
  • Response shape — id, object, created, model, choices[].message, choices[].finish_reason, usage
  • Tool calling — OpenAI-compatible tools array and tool_calls in responses
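
Because the streaming format matches OpenAI's, the standard SDK iteration works unchanged; the SDK parses the data: frames and the data: [DONE] terminator for you. A sketch, reusing the example model ID:

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True,
)
for chunk in stream:
    # Each chunk is one parsed `data: {...}` SSE frame; content arrives as deltas.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)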

What’s different

Model IDs

Tera uses HuggingFace IDs as the canonical model name — no provider prefix.
{ "model": "Qwen/Qwen2.5-7B-Instruct" }
See Models for the catalog.
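
To discover the catalog programmatically instead of from the Models page, the standard /v1/models listing works; IDs come back in the HuggingFace form shown above. A sketch:

for model in client.models.list():
    print(model.id)  # e.g. "Qwen/Qwen2.5-7B-Instruct"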

Extra sampling parameters

Tera supports a few sampling parameters beyond OpenAI’s surface. They’re optional and ignored if you don’t pass them.
  • top_k — top-k sampling
  • repetition_penalty — additional penalty term (distinct from OpenAI’s frequency/presence penalties)
  • min_p — minimum probability threshold
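
The official OpenAI Python SDK has no named arguments for these, so pass them through extra_body, which injects them as top-level request fields. A sketch; the specific values here are illustrative:

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    temperature=0.7,
    extra_body={
        "top_k": 40,                # top-k sampling
        "repetition_penalty": 1.1,  # distinct from frequency/presence penalties
        "min_p": 0.05,              # minimum probability threshold
    },
)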

Reasoning models

Models like Qwen/Qwen3.5-27B emit explicit reasoning traces. Tera splits these into a separate reasoning_content field rather than mixing them with the user-facing answer. See Reasoning.
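
A sketch of reading the split fields, assuming reasoning_content sits alongside content on the returned message (see Reasoning for the exact shape):

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Is 97 prime? Think it through."}],
)
message = response.choices[0].message
print(message.content)                              # user-facing answer
print(getattr(message, "reasoning_content", None))  # reasoning trace, if present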

No org / project headers

We don’t use OpenAI-Organization or OpenAI-Project headers. If your client sends them, they’re ignored, so there’s no need to strip them.

No moderation endpoint

We don’t offer /v1/moderations. Use OpenAI’s moderation endpoint if you need it, or run a separate guard model.
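
If you want moderation in front of generation, one option is two clients: screen the input with OpenAI’s moderation endpoint, then call Tera. A sketch; the moderation model name and the pass/fail handling are illustrative, so check OpenAI’s docs for current options:

from openai import OpenAI

openai_client = OpenAI(api_key="sk-...")  # talks to api.openai.com
tera_client = OpenAI(base_url="https://api.tera.gw/v1", api_key="sk-tera-...")

user_input = "Tell me about solar panels."
moderation = openai_client.moderations.create(
    model="omni-moderation-latest",
    input=user_input,
)
if not moderation.results[0].flagged:
    response = tera_client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": user_input}],
    )
    print(response.choices[0].message.content)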

No embeddings, no images, no fine-tuning

Today: text generation and TTS only. No /v1/embeddings, no /v1/images/*, no /v1/fine_tuning/*. Let us know if you need these.

Behavioral notes

  • First request after a cold start is slower (~2–12 s time to first token) due to CUDA graph compilation; subsequent requests are fast. See the timeout sketch after this list.
  • 5xx errors from a backend trigger automatic retries within the gateway against healthy replicas before any error is returned to you.
  • Health-aware routing — if a backend fails health checks, traffic is steered to healthy replicas with no client changes.
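
The cold-start note mostly matters for client timeouts. A sketch of giving the first request extra headroom using the SDK’s per-request timeout override; the numbers and the warmup prompt are illustrative:

# Allow extra time for the first request after a cold start (CUDA graph compilation),
# then let subsequent calls use your normal, tighter default timeout.
warmup = client.with_options(timeout=30.0).chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)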