Max output tokens

Short answer

Max output tokens is the upper limit on how many tokens a model can generate in a single response. It is usually much smaller than the context window: GPT-5.5, for example, caps output at 128K even though its context window is 1M.

The context window and max output tokens are separate limits, and a large context window does not imply a large output cap. Depending on the model, a 1M context window might come with an output cap of 8K, 32K, or 128K.

  • GPT-5.5 / GPT-5.4: context 1M, max output 128K.
  • Claude Opus 4.8 / Sonnet 4.6: context 1M, max output 128K.
  • Gemini 3.1 Pro: context 1M, max output 64K.
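
To make the two limits concrete, here is a minimal Python sketch of how much output is actually available for one request. It assumes, as many APIs do, that the prompt and the response share the context window, and it reads "1M" and "128K" as round decimal figures; exact values vary by provider. The names `ModelLimits`, `LIMITS`, and `output_budget` are hypothetical helpers, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    context_window: int  # total tokens the model can attend to
    max_output: int      # cap on tokens generated in one response


# Figures taken from the list above; "K" and "M" are treated as
# decimal thousands and millions, which is an assumption.
LIMITS = {
    "GPT-5.5": ModelLimits(context_window=1_000_000, max_output=128_000),
    "Gemini 3.1 Pro": ModelLimits(context_window=1_000_000, max_output=64_000),
}


def output_budget(limits: ModelLimits, prompt_tokens: int) -> int:
    """Return the most tokens a single response could contain.

    Assumes prompt and response share the context window, so the
    budget is the smaller of the output cap and the space left over.
    """
    remaining = max(limits.context_window - prompt_tokens, 0)
    return min(limits.max_output, remaining)
```

With a short prompt the output cap is the binding limit; the context window only takes over once the prompt fills most of it.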

If you're doing long-form generation (long articles, refactors that span a large repo), max output is often the real constraint, because the model can read far more in one request than it can write in one response.
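
As a rough planning aid, the number of separate responses a long job needs can be estimated from the output cap alone. This is a sketch under simple assumptions: the target length is known in tokens, every response uses the full cap, and `responses_needed` is a hypothetical helper name.

```python
def responses_needed(target_output_tokens: int, max_output: int) -> int:
    """Estimate how many responses are needed to emit the target length.

    Uses integer ceiling division, so a partial final chunk still
    counts as one more response.
    """
    if max_output <= 0:
        raise ValueError("max_output must be positive")
    if target_output_tokens <= 0:
        return 0
    return (target_output_tokens + max_output - 1) // max_output
```

For example, a 300K-token rewrite against a 128K output cap would take three responses under these assumptions, however large the context window is.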

Related terms