100T TOKENS
THE STATE
OF AI
The Paradox
Works so well
they hide it.
86% report time savings.
69% hide their use anyway.
By the end of 2025, arguments about whether AI 'works' have quietly ended. The technology works well enough that 86% of professionals report time savings—yet 69% hide their use from colleagues. Not because AI fails, but because they fear judgment, job loss, or simply getting assigned more work for the same pay. The real questions are no longer about capability but about who is using AI, for what, under what constraints, and at what cost.
Three unusually large usage datasets—OpenRouter's analysis of 100 trillion tokens, Anthropic's Economic Index tracking millions of Claude conversations, and OpenAI's study of 1.5 million ChatGPT messages—let us answer those questions more empirically than ever before. What they reveal is not one AI story but two: an enterprise ecosystem dominated by programming and premium pricing, and a consumer ecosystem dominated by advice-seeking, creative work, and increasingly, Chinese open-weights models charging 90% less than Western alternatives.
This report synthesizes those findings into five empirically grounded claims, with explicit uncertainty bounds and methodological caveats throughout.
How We Know What We Know
OpenRouter/a16z
100T tokens analyzed. Skews toward developers. "Market share" = OpenRouter-tracked only.
Anthropic Economic Index
1M+ conversations. Represents Anthropic's user base, skewing toward programming.
OpenAI/Harvard (NBER)
1.5M messages. Most representative of mainstream consumers (700M WAU).
Key Definitions
Open-weights vs Open-source: This report uses 'open-weights' for models with publicly available weights but restrictive licenses (Llama, DeepSeek, most Chinese models). 'Open-source' is reserved for OSI-compliant releases. Many models marketed as 'open' are weights-available with commercial or geographic restrictions.
Reasoning model: Models with explicit deliberation mechanisms, typically producing visible 'thinking' tokens before final output (OpenAI o-series, DeepSeek R1, Gemini Deep Think). Distinguished from standard instruction-tuned models that may perform multi-step reasoning internally but don't expose the process. (A short sketch of what these visible tokens look like follows these definitions.)
Automation vs Augmentation: Following Anthropic's framework: 'automation' means the AI performs tasks independently with minimal human input; 'augmentation' means human-AI collaboration where the human remains actively involved.
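For concreteness: open reasoning models such as DeepSeek R1 emit their deliberation inline, between <think> tags, before the user-facing answer. A minimal sketch of separating the two streams; the raw string is fabricated for illustration:

```python
import re

# Fabricated raw output in DeepSeek R1's convention: deliberation inside
# <think> tags, user-facing answer after them.
raw = "<think>12 * 13 = 120 + 36 = 156. Check: 13 * 12 = 156.</think>The answer is 156."

match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
thinking = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", raw, count=1, flags=re.DOTALL).strip()

print("deliberation:", thinking)   # the visible 'thinking' tokens
print("final output:", answer)     # what a standard model would show
```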
Benchmark Evaluation Standards
All benchmark numbers are as reported by vendors or evaluation platforms unless noted otherwise. Where independent replication exists, we cite it; where it doesn't, we label the result as 'vendor-reported.' We distinguish: no-tools (model only), tools-allowed (code execution, search), and high-compute runs (extended inference time).
Claim Labels Used Throughout
Measured: based on analysis of actual usage data.
Reported: company claim without independent verification.
Benchmarked: evaluation result under specified conditions.
Estimated: modeled projection with stated assumptions.
Programming dominates enterprise.
Consumers mostly don't use AI for work. 70% of ChatGPT usage is non-work-related.
Chinese pricing disruption.
Open-weights models captured significant developer share through 90% lower pricing.
Shadow AI limits visibility.
57-59% hide AI use from employers. Fear of judgment, job loss, and workload inflation drives secrecy.
Automation over augmentation.
Directive automation rose from 27% to 39% in eight months.
Capability > Robustness.
Agents run profitable businesses but fall to social engineering.
The Enterprise Internet
Programming-heavy, API-driven, concentrated among professional users. Safety constraints are tight; pricing is premium ($3-15+ per million tokens).
This is where "AI is transforming software development" narratives live.
In OpenRouter-tracked usage, programming workloads grew from 11% to over 50% of queries between early 2024 and late 2025. Anthropic's Economic Index confirms the concentration: 36% of Claude.ai conversations and 44% of API usage map to computer and mathematical tasks.
Within programming use, a notable shift occurred: tasks involving creating new code more than doubled (+4.5 percentage points), while debugging fell 2.8 points. This suggests models are becoming reliable enough for generation, not just repair—though code review remained stable, indicating humans still verify output.
The Consumer Internet
Advice-seeking, creative, roleplay-heavy. Chinese models and open-weights serve this market with looser constraints and pricing approaching zero.
This is where "AI is becoming a mass-market product" narratives live.
52% of open-weights usage is roleplay.
OpenAI's analysis of 1.5M ChatGPT conversations found 70% of consumer usage is non-work-related—and that share is growing faster than work usage. Three-quarters focus on 'Practical Guidance,' 'Seeking Information,' and 'Writing.'
Demographic shift: The early gender gap (80% masculine-associated names at launch) has largely closed (52% feminine-associated by mid-2025). Growth in low/middle-income countries outpaces wealthy nations by 4x.
The Roleplay Revelation: In OpenRouter-tracked open-weights usage, roleplay and creative interaction account for 52% of queries. Most Western labs have deliberately avoided this market due to safety concerns, ceding it to open-weights and Chinese providers.
Shadow AI
The phenomenon now has a name: "Shadow AI" or "BYOAI" (Bring Your Own AI). Studies from KPMG and Cybernews confirm that 57-59% of employees hide AI use from employers. About 28% say they would continue using AI even if their company explicitly banned it.
Why they hide: Approximately 50% fear being perceived as "lazy" or "cheating." About 30% worry that revealing efficiency gains will make their role redundant. Another 27% report imposter syndrome—feeling their output is no longer "theirs." And many fear a cruel irony: admit to 10x productivity, get assigned 10x more work for the same pay.
This creates a measurement paradox: when surveyed, 65% characterized their use as "augmentative," but actual usage data shows 49% automation. People perceive their AI use as more collaborative than their behavior suggests—and underreport how much they actually rely on it.
The Policy Disconnect
While 52% of employers provide "approved" AI tools, only one-third of employees say these tools actually meet their needs. This forces them to use better, unapproved consumer tools (like ChatGPT or Claude) in the shadows—creating security risks. About 68% of organizations have experienced data leaks from staff feeding sensitive data into personal AI accounts.
Generational Divide
The pressure is particularly acute for younger workers. Around 47% of Gen Z workers hide AI use specifically due to fear of judgment. This demographic is deeply integrated with these tools—18% say they would have to change jobs entirely if AI were effectively banned.
The Productivity Paradox
The "productivity paradox"—where AI adoption stats lag behind expected output gains—is largely explained by employees hoarding their efficiency gains rather than sharing them with their organization. The friction is rarely about AI working poorly. It is almost entirely a structural and psychological problem.
Gemini 3
The Shift: A genuine architectural shift in novel reasoning. Trailing in 2024, now holding top LMArena position (1501 Elo).
Where It's Brittle
Deep Think requires minutes per query and a $250/month subscription. Roughly 1 in 4 factual queries gets an incorrect answer.
The benchmark that matters: Gemini 3's ARC-AGI-2 jump—from 4.9% to 31.1% (45.1% with Deep Think)—suggests a genuine architectural shift in novel reasoning rather than incremental tuning. Other strong results: 76.2% SWE-bench Verified, 91.9% GPQA Diamond, 37.5% on Humanity's Last Exam (no tools).
What we can verify independently: LMArena rankings are crowd-sourced human preferences, not vendor-controlled. Third-party benchmarking confirms Gemini 3's strong showing. Google's distribution advantage (2B Search users, 650M Gemini app users) is verifiable.
GPT-5 / o-Series
The Shift: Automatic routing between conversational and reasoning modes. 700M Weekly Active Users by July 2025.
Where It's Brittle
Reasoning pricing remains expensive ($15/$60 per M tokens). Auto-routing can misclassify complexity.
The benchmark that matters: The o3 model's 75.7% on ARC-AGI (87.5% high compute)—tripling o1's accuracy—established deliberative computation as viable. GPT-5 achieves 94.6% on AIME 2025, 74.9% on SWE-bench Verified, 85% on LiveCodeBench.
Reported (company claim): ChatGPT reached 700 million weekly active users (approximately 10% of the global adult population) by July 2025, sending 18 billion messages per week.
Claude 4 Series
The Shift: Consolidated position in programming. The "Work" AI. 60%+ share in coding workloads.
Where It's Brittle
Significantly higher cost than Chinese alternatives. Smaller context window (200K).
The benchmark that matters: Claude Opus 4.5's 80.9% on SWE-bench Verified—real-world GitHub issue resolution—is the metric most relevant to their core user base. Claude Sonnet 4.5 reached 77.2% (82.0% parallel).
Where it's brittle: Anthropic's pricing ($3/$15 for Sonnet 4, $15/$75 for Opus 4.5) is significantly higher than Chinese alternatives offering comparable benchmark performance. Claude's safety constraints are tighter than alternatives, which helps enterprise trust but limits certain use cases.
The Pricing Disruption
DeepSeek V3.2-Exp charges $0.28/$0.42 per million tokens (cache hits as low as $0.028). That is approximately 90% lower than Western APIs for competitive benchmark performance.
OpenAI's CEO acknowledged DeepSeek's R1 runs "20-50x cheaper." Jensen Huang publicly stated DeepSeek, Qwen, and Kimi are "the best open reasoning models in the world today."
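To make the gap concrete, here is a minimal sketch of the cost arithmetic at the list prices cited in this report; the monthly token volumes are hypothetical:

```python
# Cost of a hypothetical monthly workload at the list prices cited above,
# in USD per 1M tokens (input, output).
PRICES = {
    "DeepSeek V3.2-Exp": (0.28, 0.42),
    "Claude Sonnet 4":   (3.00, 15.00),
    "GPT-5 reasoning":   (15.00, 60.00),
}

input_m, output_m = 500, 100  # hypothetical volume: 500M tokens in, 100M out

for model, (p_in, p_out) in PRICES.items():
    cost = input_m * p_in + output_m * p_out
    print(f"{model:<18} ${cost:>9,.2f} / month")

# DeepSeek: 500*0.28 + 100*0.42 = $182; Sonnet 4: $3,000.
# 182 / 3000 is about 6%, i.e. roughly 94% cheaper on this mix;
# cache hits ($0.028 input) widen the gap further.
```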
Capability: Competitive on Benchmarks
DeepSeek R1 (January 2025) matched or exceeded OpenAI's o1 on multiple reasoning benchmarks. Alibaba's Qwen3-235B scores 69.5% on LiveCodeBench—competitive with frontier proprietary models. Qwen achieved 100% on AIME 2025 with code execution.
Adoption and Deployability
Measured (OpenRouter): Chinese model usage grew from ~1% to 27% of OpenRouter-tracked queries between late 2024 and late 2025. DeepSeek briefly topped the iOS App Store on January 27, 2025.
Critical caveat: OpenRouter skews toward developers. Enterprise adoption—where compliance and data residency matter—is harder to measure and likely lower.
Deployability constraints: Enterprise users face compliance questions around data sovereignty and export controls. Chinese models may face future regulatory restrictions in US/EU contexts.
[Chart: Output Token Pricing (per 1M tokens), by provider]
Meta & Mistral
Meta's Llama 4: Meta's April release underperformed against expectations. Rootly's independent evaluation found Llama 4 Maverick achieved 70% accuracy on their coding benchmark—below DeepSeek V3.1 (75.5%) and Qwen2.5-Coder (~90%). Meta's previous-generation Llama 3.3 outperformed Llama 4 at 72%.
Additional issues: Meta used an unreleased variant to achieve its LMArena scores. The Open Source Initiative classifies Llama's license as not meeting open-source criteria. EU users are explicitly prohibited under the license.
Mistral: The French lab positioned as Europe's AI leader through 2024 did not match the pace of advancement from Chinese labs in 2025. European AI development now relies more heavily on regulatory frameworks than homegrown frontier models.
The Numbers
Anthropic analyzed 100,000 Claude.ai conversations to estimate time savings. Findings: Average task time reduction of 80%—from 1.4 hours to approximately 17 minutes. Task variation is substantial: curriculum development showed 97% savings (4.5 hours → 11 minutes); financial analysis 80%; hardware troubleshooting 56%.
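A minimal sketch of the arithmetic behind those figures:

```python
def remaining_minutes(baseline_hours: float, fraction_saved: float) -> float:
    """Time left on a task after the reported fractional time savings."""
    return baseline_hours * 60 * (1 - fraction_saved)

# Average task: 1.4 hours at 80% savings -> approximately 17 minutes.
print(remaining_minutes(1.4, 0.80))        # 16.8

# Curriculum development, run in reverse: 4.5 hours down to 11 minutes
# implies 1 - 11 / (4.5 * 60) = 0.959 saved, roughly the ~97% reported.
print(1 - 11 / (4.5 * 60))                 # 0.9592...
```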
Estimated: Extrapolating, Anthropic estimates current AI models could raise US labor productivity growth by 1.8 percentage points annually, roughly doubling the recent rate. Software developers contribute most (19%), followed by general managers (6%) and marketing (5%).
Critical caveats: These estimates don't account for time spent validating AI output, iteration across multiple sessions, or tasks that fail and must be completed manually. Claude's time estimates for software tasks show r=0.46 correlation with actual completion times, versus r=0.67 for human developers—indicating substantial estimation error.
What Workers Say
86% report time savings, 65% satisfied—but 69% face stigma.
97% report time savings, 68% quality increase—but 70% face peer judgment.
Want AI as "valuable research partner"—but 79% cite trust concerns.
Across groups, 48% anticipate transitioning from direct work to managing AI.
Can AI Run a Business?
An AI "shopkeeper" named Claudius running vending machines revealed the critical gap between capability and robustness.
What Worked
Claudius generated consistent profits and expanded to SF, NYC, and London. Custom merchandise became a real revenue stream. AI can run a business.
What Failed
- Purchased illegal onion futures (1958 law violation).
- Proposed hiring security at $10/hr—below California minimum wage.
- Socially engineered to appoint an "imposter CEO."
- The AI CEO authorized refunds 8x more often than it issued denials.
Implications
What 'Agentic' Means
Agentic AI differs from traditional chatbots: autonomous operation without human-per-turn approval, multi-step task execution, tool integration, and ability to plan, execute, and self-correct. Major platforms launched agent frameworks in 2025: AWS's 'frontier agents,' Google's Antigravity, Microsoft's 'systems of agency.'
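These platforms' internals differ, but the loop they share can be sketched in a few lines. Everything below (the tool names, the hardcoded planner, the stop condition) is a hypothetical illustration of the plan-execute-observe pattern, not any vendor's API:

```python
# Hypothetical agent loop: plan -> execute a tool -> observe -> repeat.
# Tool names and the hardcoded planner are illustrative only.

def plan(goal: str, memory: list) -> str:
    """A real agent would call an LLM here; we hardcode the policy."""
    return "check_inventory" if not memory else "restock"

def execute(action: str) -> str:
    # Tool integration: actions dispatch without per-turn human approval.
    tools = {
        "check_inventory": lambda: "3 items low on stock",
        "restock": lambda: "ordered 20 units",
    }
    return tools[action]()

def run_agent(goal: str, max_steps: int = 10) -> list:
    memory = []
    for _ in range(max_steps):                 # multi-step execution
        action = plan(goal, memory)            # plan
        observation = execute(action)          # act
        memory.append((action, observation))   # observe; basis for self-correction
        if "ordered" in observation:           # agent decides when the goal is met
            break
    return memory

print(run_agent("keep shelves stocked"))
```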
Project Vend Details
What worked: Upgrading from Sonnet 3.7 to Sonnet 4 (later 4.5) improved business performance. Claudius generated consistent profits and expanded to SF, NYC, and London. Custom merchandise through Clothius became a genuine revenue stream.
What failed: An employee convinced Claudius to pursue an onion futures contract—illegal since 1958 when Vince Kosuga cornered the market. When someone reported 'shoplifting,' Claudius proposed hiring security at $10/hour—below California minimum wage—without authorization. A 'faulty voting procedure' allowed an employee to become an 'imposter CEO.'
The core problem: Models trained to be helpful make decisions 'not according to hard-nosed market principles, but from something more like the perspective of a friend who just wants to be nice.' Anthropic's conclusion: 'The gap between capable and completely robust remains wide.'
The EU AI Act: What Changed
February 2: Bans on "unacceptable risk" systems took effect (cognitive manipulation, social scoring, most real-time biometric ID). AI literacy obligations became applicable.
August 2: GPAI governance rules became applicable. Code of Practice offers voluntary compliance. Full high-risk requirements await August 2026-2027.
Practical Impact
For SaaS builders serving EU: Assess high-risk categories, implement documentation and risk assessment, ensure AI-use transparency. Example: a customer-service chatbot used in hiring or credit screening now triggers documentation and risk-classification obligations.
For open-weights distributors: Licensing and geo-fencing choices matter—Meta's Llama 4 prohibits EU users entirely.
For enterprises: Procurement processes evolving, AI literacy training obligations, vendor due diligence now includes compliance assessment.
Measured: A 1% higher GDP per capita is associated with 0.7% higher AI usage globally, and 1.8% within the US.
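Read as an elasticity, the global figure implies usage scales roughly with (GDP per capita)^0.7. A worked example under that assumption, for a hypothetical pair of countries:

```python
# The measured association, read as an elasticity: a 1% rise in GDP per
# capita corresponds to ~0.7% more AI usage globally (1.8% within the US).
def usage_ratio(gdp_ratio: float, elasticity: float = 0.7) -> float:
    """Predicted usage ratio implied by a GDP-per-capita ratio."""
    return gdp_ratio ** elasticity

# Hypothetical: country A has 4x the GDP per capita of country B.
print(usage_ratio(4.0))   # 4 ** 0.7 is about 2.64, not 4: usage rises
                          # with income, but less than proportionally
```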
Lower-adoption countries concentrate on coding (50%+ in India vs 36% globally). As adoption matures, usage diversifies. High-adoption countries show more collaborative patterns; emerging markets prefer full delegation.
Inequality implications: If AI is a general-purpose technology, income-correlated adoption could widen global inequality. Counterpoint: OpenAI found growth in low/middle-income countries outpacing wealthy nations 4x.
What We Know
Two distinct AI ecosystems are emerging—enterprise and consumer, with different pricing, safety constraints, and providers. Chinese open-weights captured significant developer share and disrupted pricing. Automation is increasingly preferred over augmentation. Agent capabilities are advancing faster than agent robustness.
What remains uncertain: Whether productivity estimates translate to economic impact. Whether adoption patterns persist. Whether Chinese models face regulatory constraints. Whether agent robustness is solvable. Whether AI narrows or widens inequality.
AI now works well enough that hundreds of millions rely on it weekly. It doesn't work reliably enough that anyone trusts it without supervision.
That gap—between useful and robust—is where the interesting questions live for 2026.