Gemini 2.5 Flash Is Quietly Billing You for Thoughts You Never Asked to See
Google's thinking tokens are counted as output—and their billing meter doesn't match what you'll see in the API.
We ran the numbers on Gemini 2.5 Flash’s thinking mode and found two things: the headline price of $2.50 per million output tokens is accurate, and the bill that arrives is not what you expect. Here’s what’s actually happening inside your token count.
The headline price (which is real)
Google publishes it plainly: $0.30 per million input tokens, $2.50 per million output tokens for Gemini 2.5 Flash in standard mode. Those numbers don’t lie. But they don’t tell you the whole story either.
The trap isn’t in the rate card. It’s in what Google bills as “output.”
Where the bill grows: thinking tokens count as output tokens
When you enable thinking mode on Flash, Google charges for both visible output and the internal reasoning tokens. The full token cost—thinking + summary—goes on your bill at the $2.50/M output rate. Only the summary gets sent back to you.
Here’s a concrete example. You send a 200-token prompt asking for a classification decision. Flash thinks internally for 2,000 tokens to be confident, then outputs a 100-token summary. Your bill doesn’t count 100 output tokens. It counts 2,100. At $2.50 per million, that’s $0.00525 in output charges per request instead of $0.00025 (the 200 input tokens add about $0.00006 either way). A 21× multiplier on a task you might have expected to cost almost nothing.
The same multiplier hurts more at scale. A 1-million-request batch at 100-token summaries should cost $250 in output charges. With thinking enabled at 2,000 reasoning tokens per request, that same batch costs $5,250, and more if the model reasons longer. Google doesn’t advertise this prominently; the detail is buried in the API docs.
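The arithmetic above is easy to check in a few lines of Python. This is a minimal sketch: the rate is the rate-card figure quoted earlier, the token counts are the ones from the example, and the function name is ours for illustration, not part of any Google SDK.

```python
OUTPUT_RATE_PER_M = 2.50  # USD per million output tokens, from the rate card


def output_cost(visible_tokens: int, thinking_tokens: int = 0) -> float:
    """Output charge in USD for one request; thinking tokens bill at the output rate."""
    return (visible_tokens + thinking_tokens) * OUTPUT_RATE_PER_M / 1_000_000


without_thinking = output_cost(100)       # the 100-token summary alone
with_thinking = output_cost(100, 2_000)   # summary plus 2,000 thinking tokens
multiplier = with_thinking / without_thinking
batch_cost = with_thinking * 1_000_000    # the same request repeated a million times

print(f"per request: ${without_thinking:.5f} vs ${with_thinking:.5f}")
print(f"multiplier:  {multiplier:.0f}x")
print(f"1M requests: ${batch_cost:,.0f}")
```

Swap in your own average thinking depth to see where your workload lands; the multiplier depends only on the ratio of thinking tokens to visible tokens.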
We don’t fault the pricing itself—you’re paying for work the model actually did. We fault the silence about how much work that is.
The Flash Lite billing bug (with receipts)
Flash Lite—the $0.40/M output model—has a separate, sharper problem: the API’s candidatesTokenCount field underreports actual billing by approximately 8.5×.
One developer logged $12.65 in actual charges while their application tracked only ~$1.49 using the API’s own reported token counts. Over 28 days, users in the bug thread reported overcharges ranging from $11 to over $127. Not rounding errors. Not edge cases. A systematic ~8.5× gap between what the API says you used and what Google Cloud Billing actually charges.
This matters because developers set cost-cap alarms and budget monitors based on usageMetadata.candidatesTokenCount. If that field is wrong by an order of magnitude, your budget controls are theater. You’ll get blindsided by your invoice.
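One way to catch a metering gap like this is to reconcile what your application tracked against what the invoice says. The sketch below assumes you can get both numbers for the same period; the function names and the 1.25 tolerance are hypothetical choices of ours, and the example figures are the ones reported in the bug thread.

```python
def metering_gap(billed_usd: float, tracked_usd: float) -> float:
    """Ratio of invoiced spend to spend estimated from API-reported token counts."""
    if tracked_usd <= 0:
        raise ValueError("tracked_usd must be positive")
    return billed_usd / tracked_usd


def needs_investigation(billed_usd: float, tracked_usd: float, tolerance: float = 1.25) -> bool:
    """True when the invoice exceeds the tracked estimate by more than the tolerance."""
    return metering_gap(billed_usd, tracked_usd) > tolerance


# Figures from the developer report: $12.65 billed, about $1.49 tracked.
gap = metering_gap(12.65, 1.49)
print(f"billed / tracked = {gap:.2f}x")
print("investigate:", needs_investigation(12.65, 1.49))
```

A check like this only works after the invoice data exists, so it complements real-time alarms rather than replacing them.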
We haven’t tested this ourselves yet—Flash Lite only hit general availability recently—but the dev forum thread includes multiple reproducible reports and Google’s support team has acknowledged it as a bug. Contact billing support if you’ve been affected.
How to actually budget Flash in 2026
If you want to use Flash safely, here are three concrete levers:
1. Use the thinking_budget parameter. When you call Flash with thinking enabled, cap the reasoning tokens at a fixed number (up to roughly 24,000; Google’s documentation lists 24,576 as the maximum for 2.5 Flash). This prevents runaway internal reasoning from exploding your bill. With a cap in place, you know the ceiling on what each request can cost, not just the average.
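2. Read response.usage_metadata.thoughts_token_count alongside candidates_token_count (thoughtsTokenCount and candidatesTokenCount in the REST response). The candidates count is documented as covering the visible output; the thoughts count is the reasoning Google also charges for at the output rate. Their sum is the meter that matters. Log it, monitor it, alert on it.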
3. A/B test: cap vs. uncapped thinking. Run a small batch of real requests with thinking_budget=5000 and another with thinking_budget=24000. Compare output quality and token burn. You’ll often find the cheaper cap produces equally good results for classification, summarization, and other task types that don’t require deep reasoning.
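The three levers above can be sketched together as follows. This assumes the google-genai Python SDK and its ThinkingConfig(thinking_budget=...) option; the helper names are ours, and the usage records at the bottom are stand-ins with placeholder numbers, not measurements. The SDK import is done lazily so the accounting helpers run without it installed.

```python
from statistics import mean
from types import SimpleNamespace

OUTPUT_RATE_PER_M = 2.50  # USD per million output tokens, from the rate card


def generate_with_cap(client, prompt: str, budget: int):
    """Lever 1: call Flash with a thinking cap (assumes the google-genai SDK)."""
    from google.genai import types as genai_types  # lazy import, see lead-in

    return client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=genai_types.GenerateContentConfig(
            thinking_config=genai_types.ThinkingConfig(thinking_budget=budget)
        ),
    )


def billed_output_tokens(usage) -> int:
    """Lever 2: visible plus thinking tokens; a missing or None field counts as 0."""
    visible = getattr(usage, "candidates_token_count", None) or 0
    thinking = getattr(usage, "thoughts_token_count", None) or 0
    return visible + thinking


def mean_output_cost(usages) -> float:
    """Lever 3: average output charge in USD per request across one test arm."""
    return mean(billed_output_tokens(u) for u in usages) * OUTPUT_RATE_PER_M / 1_000_000


# Placeholder usage records standing in for response.usage_metadata from two arms.
capped_arm = [SimpleNamespace(candidates_token_count=100, thoughts_token_count=500)]
uncapped_arm = [SimpleNamespace(candidates_token_count=100, thoughts_token_count=2_000)]

print(f"capped:   ${mean_output_cost(capped_arm):.5f} per request")
print(f"uncapped: ${mean_output_cost(uncapped_arm):.5f} per request")
```

In a real test you would collect response.usage_metadata from each generate_with_cap call into the two lists, then compare cost side by side with whatever quality metric your task uses.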
Even if you’re already deep in Gemini’s ecosystem, the broader price war around LLM APIs means cheaper alternatives exist for many workloads. Flash with capped thinking can still win on speed and cost for high-volume, low-reasoning tasks. But you have to set the cap.
When Flash is still the right choice
Flash isn’t a scam—it’s a footgun if you don’t read the safety label. For high-volume classification, entity extraction, and short-form generation without reasoning, Flash with a capped thinking budget ran faster and cheaper than comparable Haiku-class models in our classification load tests. The $2.50/M output rate is genuinely competitive.
What changed is transparency. If you enable thinking without understanding that every thought token costs money, you’ll lose the pricing game. We’ve tracked pricing cuts across the entire LLM API landscape—Flash is winning partly because Google’s per-token costs are low, not because thinking is free.
For complex reasoning tasks that genuinely need sustained thinking—code generation, multi-turn analysis, novel problem-solving—check our comparison of Flash against Claude Opus 4.7 and GPT-5.5. The thinking tax might push you to a different model.
What we’re doing
We’re setting up cost monitors on all our Flash workloads, with hard caps on thinking tokens and alerts on thoughts_token_count. If Google ships a fix for the Flash Lite bug, we’ll test it and report back. Until then, we’re not billing Flash Lite to clients: the metering is too broken.
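A monitor of that kind can be quite small. The following is a rough sketch rather than our production setup: the class, its thresholds, and the alert strings are hypothetical, and it assumes you feed it the visible and thinking token counts from each response.

```python
from dataclasses import dataclass, field


@dataclass
class ThinkingMonitor:
    per_request_limit: int        # hypothetical alert threshold, in thinking tokens
    budget_usd: float             # hypothetical spend ceiling for the period
    output_rate_per_m: float = 2.50
    spent_usd: float = 0.0
    alerts: list = field(default_factory=list)

    def record(self, request_id: str, visible_tokens: int, thinking_tokens: int) -> None:
        """Add one request's output charge and append any alerts it triggers."""
        billed = visible_tokens + thinking_tokens
        self.spent_usd += billed * self.output_rate_per_m / 1_000_000
        if thinking_tokens > self.per_request_limit:
            self.alerts.append(f"{request_id}: {thinking_tokens} thinking tokens over limit")
        if self.spent_usd > self.budget_usd:
            self.alerts.append(f"{request_id}: spend ${self.spent_usd:.4f} over budget")


monitor = ThinkingMonitor(per_request_limit=1_000, budget_usd=0.01)
monitor.record("req-1", visible_tokens=100, thinking_tokens=500)
monitor.record("req-2", visible_tokens=100, thinking_tokens=2_000)
for alert in monitor.alerts:
    print(alert)
```

In practice the alerts list would be replaced by whatever paging or logging channel you already use.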
For new projects, we’re treating Flash’s thinking mode as “expensive” until you prove otherwise with capped budgets and real load tests. That’s the posture that keeps invoices sane.
What we don’t know is documented at the end of this article. We’ll update it when we learn more.