We've been calculating model cost and value wrong, and it has never mattered more. Models are getting more expensive, people can burn through a $20 Anthropic plan in a couple of Opus-heavy sessions, and GitHub Copilot, where many enterprises pay for developer AI consumption, is moving toward token-based pricing.
For years, the large language model market had a familiar split: US frontier labs sold the best capability, while open-weight and international models competed on value. GPT-5.5 complicates that story. The important detail is not just that GPT-5.5 is intelligent. It is that GPT-5.5 low reasoning appears to land in the same cost-performance cluster that previously belonged to the affordable model challengers.
How To Read The Charts
Price per million tokens is still useful, but it is only the sticker price. The realized cost depends on how many tokens the model needs to solve the task, how often it needs retries, and whether higher reasoning settings burn extra output tokens to reach a better answer.
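The sticker-versus-realized distinction is easy to make concrete. The prices and token counts below are invented for illustration; the point is only the shape of the calculation.

```python
def realized_cost(in_price, out_price, in_tokens, out_tokens):
    """Dollar cost of one task: per-million-token prices times tokens consumed."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# A model with a low sticker price that rambles through 12k output tokens...
chatty = realized_cost(in_price=0.30, out_price=1.20, in_tokens=4_000, out_tokens=12_000)
# ...versus a pricier model that answers the same task in 2k output tokens.
terse = realized_cost(in_price=1.00, out_price=4.00, in_tokens=4_000, out_tokens=2_000)

print(f"chatty: ${chatty:.4f}, terse: ${terse:.4f}")  # chatty: $0.0156, terse: $0.0120
```

Despite charging roughly three times more per token, the terse model finishes the task cheaper, which is exactly the effect that sticker price alone hides.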
The first chart is useful because it compares intelligence against cumulative benchmark cost rather than list price alone. The second chart is useful because output token usage is a proxy for token bloat, latency, and future context-window pressure in agentic systems.
Why This Surprised Me
As someone who has spent a lot of time around open-weight and locally hosted models, the old tradeoff felt stable. If you wanted the strongest model, you usually paid for a closed frontier model. If you wanted something dramatically cheaper and still good enough, you looked toward the open-weight ecosystem, often including models from Chinese labs, whether self-hosted or served by providers in the US, EU, and Singapore.

That value lane mattered. It gave builders access to capable models at prices that sometimes looked dramatically better than American frontier offerings. The models were often a little behind on raw intelligence, but the economics were compelling enough that the tradeoff made sense.
GPT-5.5 changes the shape of that conversation. Looking at the cost-to-run data, GPT-5.5 low reasoning appears cheaper than DeepSeek V4 Pro while nearly matching its performance. It also appears cheaper than GLM-5 while slightly exceeding its average intelligence score, and it lands near Qwen3.6 with lower cumulative cost and higher intelligence in this view. That is a different kind of OpenAI story: not just best-in-class capability, but credible value.
Why Cumulative Cost Matters
We usually talk about models in two incomplete ways. First, we talk about benchmark scores, either individually or as an average across multiple benchmarks. Second, we talk about price per million input or output tokens. Both are useful, but neither fully captures what it feels like to operate these models in real workflows.
A model can look inexpensive on a per-token basis and still be expensive to operate if it burns a large number of output tokens to reach its answer. This is especially important as reasoning modes become more common. Low, medium, high, and extended reasoning settings can improve benchmark scores, but they often do so by spending more tokens.
That is why I find the cumulative cost of running a benchmark suite more useful than token price alone. It captures not just what each token costs, but how many tokens the model tends to spend while solving the task. For agents, RAG systems, code assistants, and long-running workflows, that is closer to the bill we actually feel.
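As a toy illustration of how cumulative cost can reorder a list-price ranking, consider two hypothetical models running the same benchmark suite. All prices and token totals here are invented:

```python
# Hypothetical models: output price ($/1M tokens) and total output tokens
# emitted across an entire benchmark suite.
models = {
    "budget-verbose": {"out_price": 1.00, "out_tokens": 40_000_000},
    "frontier-terse": {"out_price": 4.00, "out_tokens": 8_000_000},
}

def cumulative_cost(m):
    """Total dollars spent on output tokens over the whole suite."""
    return m["out_price"] * m["out_tokens"] / 1_000_000

for name, m in models.items():
    print(f"{name}: ${cumulative_cost(m):.2f}")
# budget-verbose: $40.00
# frontier-terse: $32.00
```

The model with a 4x higher list price finishes the suite cheaper because it emits 5x fewer tokens, which is the ranking flip that cumulative cost captures and per-token price does not.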
Token Efficiency Is Not Just Cost Efficiency
In a world where GitHub Copilot and other developer tools are moving toward token-sensitive economics, token bloat matters. Agents and RAG systems can burn through millions of tokens quickly. If a model needs to ramble its way into the right answer, the cost problem is obvious. The less obvious problem is that every extra token also competes for context.
GPT-5.5 low reasoning is interesting because it appears to compete with strong value models on both price and performance while using far fewer output tokens. For example, in this Artificial Analysis benchmark view, GPT-5.5 low edges out DeepSeek V4 Pro in intelligence while using roughly 22% of the output tokens. That is not a universal runtime guarantee, but it is not just a pricing detail either. It is an architectural advantage for systems that run many iterative steps.
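To see why that output-token ratio dominates, here is the back-of-envelope version. The 22% figure comes from the benchmark view described above; the 3x price multiplier is a deliberately pessimistic assumption, not a quoted rate.

```python
output_ratio = 0.22     # GPT-5.5 low's output tokens vs. DeepSeek V4 Pro's (per the chart)
price_multiplier = 3.0  # pessimistic assumption: its output tokens cost 3x as much

relative_output_spend = output_ratio * price_multiplier
print(f"{relative_output_spend:.2f}")  # 0.66
```

Even under that unfavorable pricing assumption, the terser model spends roughly a third less on output tokens, before counting the latency and context benefits of emitting less text.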
Token Bloat Becomes Context Bloat
People who describe themselves as agentic engineers or context engineers will probably appreciate the token reduction more than anyone. Extra reasoning tokens do not only increase the bill. They also increase context-window pressure.
One way to improve benchmark scores is to let models think longer, explain more, and generate more intermediate text until they stumble into a better answer. It can work. The problem is that long agent runs already suffer from context-window bloat. Even when a system advertises a very large context window, model performance can degrade as the context becomes crowded, stale, or noisy.
That is why so much attention is moving toward context minimization strategies: progressive disclosure, skill-based context loading, focused subagents, small task-specific toolsets, and avoiding the old pattern of giving one model thirty tools and ten knowledge articles all at once. A model that achieves stronger results with fewer output tokens helps with that same problem by default.
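One of those minimization strategies, task-scoped tool loading, can be sketched in a few lines. The tool groups and names below are hypothetical:

```python
# Hypothetical tool registry, grouped by skill rather than dumped wholesale
# into every prompt.
TOOLS = {
    "git":    ["git_diff", "git_commit", "git_log"],
    "search": ["web_search", "fetch_url"],
    "files":  ["read_file", "write_file", "list_dir"],
}

def tools_for(task_tags):
    """Progressive disclosure: expose only the tool groups the task declares."""
    return [tool for tag in task_tags for tool in TOOLS.get(tag, [])]

# An agent tagged for file edits plus git sees six tools, not all eight.
print(tools_for(["files", "git"]))
```

The design choice is the same one the paragraph describes: context is spent on the tools a task actually needs, rather than on a fixed thirty-tool preamble.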
The Bigger Shift
The value frontier may no longer be defined only by open-weight or China-based models. GPT-5.5 suggests that a US frontier model can compete in the same economic conversation when the right reasoning mode is selected. That does not erase the value of open-weight models, local deployment, or model sovereignty. Those still matter for privacy, control, resilience, and cost negotiation.
But it does mean the model-selection conversation should change. The best value model is not always the model with the cheapest listed token price. It is the model that can complete the work reliably with the lowest total operational cost, the least unnecessary context growth, and the fewest retries.
For agentic systems, that may make GPT-5.5 low reasoning one of the more important data points to watch.
Operator Takeaway
When choosing models for agents, compare cost per successful task, output tokens, latency, retries, and context growth, not just benchmark rank or token price. A model that costs more per token can still be cheaper in production if it finishes the work with fewer tokens, fewer retries, and less context pollution.
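The comparison above can be collapsed into one hedged metric: expected cost per successful task, assuming retries follow a simple geometric model. The figures are invented for illustration.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected cost of one usable answer: 1/success_rate expected attempts."""
    return cost_per_attempt / success_rate

# A cheap-per-attempt model that fails 40% of the time...
flaky = cost_per_success(cost_per_attempt=0.010, success_rate=0.60)
# ...versus a 50%-pricier attempt that almost always lands.
steady = cost_per_success(cost_per_attempt=0.015, success_rate=0.95)

print(f"flaky: ${flaky:.4f}, steady: ${steady:.4f}")
```

Here the model that looks 50% more expensive per attempt is cheaper per successful task, which is the production number that actually hits the bill.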