
The state of AI in 2025

Data-Driven Analysis • Report
DECEMBER 2025

What three landmark usage studies reveal about how AI is actually being used. The separation of Enterprise and Consumer ecosystems defines 2025.

The Paradox

Works so well they hide it.

86% report time savings.
69% hide their use anyway.

By the end of 2025, arguments about whether AI 'works' have quietly ended. The technology works well enough that 86% of professionals report time savings—yet 69% hide their use from colleagues. Not because AI fails, but because they fear judgment, job loss, or simply getting assigned more work for the same pay. The real questions are no longer about capability but about who is using AI, for what, under what constraints, and at what cost.

Three unusually large usage datasets—OpenRouter's analysis of 100 trillion tokens, Anthropic's Economic Index tracking millions of Claude conversations, and OpenAI's study of 1.5 million ChatGPT messages—let us answer those questions more empirically than ever before. What they reveal is not one AI story but two: an enterprise ecosystem dominated by programming and premium pricing, and a consumer ecosystem dominated by advice-seeking, creative work, and increasingly, Chinese open-weights models charging 90% less than Western alternatives.

This report synthesizes those findings into five empirically grounded claims, with explicit uncertainty bounds and methodological caveats throughout.

Methodology & Definitions

How We Know What We Know

OpenRouter/a16z

100T tokens analyzed. Skews toward developers. "Market share" = OpenRouter-tracked only.

Anthropic Economic Index

1M+ conversations. Represents Anthropic's user base, skewing toward programming.

OpenAI/Harvard (NBER)

1.5M messages. Most representative of mainstream consumers (700M WAU).

Key Definitions

Open-weights vs Open-source: This report uses 'open-weights' for models with publicly available weights but restrictive licenses (Llama, DeepSeek, most Chinese models). 'Open-source' is reserved for OSI-compliant releases. Many models marketed as 'open' are weights-available with commercial or geographic restrictions.

Reasoning model: Models with explicit deliberation mechanisms—typically producing visible 'thinking' tokens before final output (OpenAI o-series, DeepSeek R1, Gemini Deep Think). Distinguished from standard instruction-tuned models that may perform multi-step reasoning internally but don't expose the process.

Automation vs Augmentation: Following Anthropic's framework: 'automation' means the AI performs tasks independently with minimal human input; 'augmentation' means human-AI collaboration where the human remains actively involved.

Benchmark Evaluation Standards

All benchmark numbers are as reported by vendors or evaluation platforms unless noted otherwise. Where independent replication exists, we cite it; where it doesn't, we label the result as 'vendor-reported.' We distinguish: no-tools (model only), tools-allowed (code execution, search), and high-compute runs (extended inference time).

Claim Labels Used Throughout

Measured: Based on analysis of actual usage data.
Reported: Company claim without independent verification.
Benchmarked: Evaluation result under specified conditions.
Estimated: Modeled projection with stated assumptions.
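These labels and the evaluation conditions above can be encoded as a small data schema for tracking provenance. The sketch below is illustrative only; the class and field names are ours, not drawn from any of the source studies.

```python
from dataclasses import dataclass
from enum import Enum

class ClaimLabel(Enum):
    MEASURED = "measured"        # analysis of actual usage data
    REPORTED = "reported"        # company claim, no independent verification
    BENCHMARKED = "benchmarked"  # evaluation result under specified conditions
    ESTIMATED = "estimated"      # modeled projection with stated assumptions

class EvalCondition(Enum):
    NO_TOOLS = "no-tools"            # model only
    TOOLS_ALLOWED = "tools-allowed"  # code execution, search
    HIGH_COMPUTE = "high-compute"    # extended inference time

@dataclass
class BenchmarkResult:
    model: str
    benchmark: str
    score_pct: float
    condition: EvalCondition
    label: ClaimLabel
    independently_replicated: bool = False  # default: vendor-reported

# Example: a vendor-reported, no-tools result cited later in this report.
hle = BenchmarkResult(
    model="Gemini 3", benchmark="Humanity's Last Exam", score_pct=37.5,
    condition=EvalCondition.NO_TOOLS, label=ClaimLabel.BENCHMARKED,
)
```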

01 Five Claims That Define 2025
1
Usage

Programming dominates enterprise.

Consumers mostly don't use AI for work. 70% of ChatGPT usage is non-work-related.

Measured OpenRouter: 50%+ of queries are code.
2
Market

Chinese pricing disruption.

Open-weights models captured significant developer share through 90% lower pricing.

Measured Chinese usage: 1% → 27% in one year.
3
Economy

Shadow AI limits visibility.

57-59% hide AI use from employers. Fear of judgment, job loss, and workload inflation drives secrecy.

Measured 28% would use AI even if banned.
4
Workflow

Automation over augmentation.

Directive automation rose from 27% to 39% in eight months.

Measured 77% of API usage is automation.
5
Safety

Capability > Robustness.

Agents run profitable businesses but fall to social engineering.

Demo Project Vend failures.
02 Two Internets of AI
Ecosystem A

The Enterprise Internet

Programming-heavy, API-driven, concentrated among professional users. Safety constraints are tight; pricing is premium ($3-15+ per million tokens).

This is where "AI is transforming software development" narratives live.

Anthropic commands 60%+ share in coding workloads.

In OpenRouter-tracked usage, programming workloads grew from 11% to over 50% of queries between early 2024 and late 2025. Anthropic's Economic Index confirms the concentration: 36% of Claude.ai conversations and 44% of API usage map to computer and mathematical tasks.

Within programming use, a notable shift occurred: tasks involving creating new code more than doubled (+4.5 percentage points), while debugging fell 2.8 points. This suggests models are becoming reliable enough for generation, not just repair—though code review remained stable, indicating humans still verify output.

Ecosystem B

The Consumer Internet

Advice-seeking, creative, roleplay-heavy. Chinese models and open-weights serve this market with looser constraints and pricing approaching zero.

This is where "AI is becoming a mass-market product" narratives live.

70% of ChatGPT usage is non-work-related.
52% of open-weights usage is roleplay.

OpenAI's analysis of 1.5M ChatGPT conversations found 70% of consumer usage is non-work-related—and that share is growing faster than work usage. Three-quarters focus on 'Practical Guidance,' 'Seeking Information,' and 'Writing.'

Demographic shift: The early gender gap (80% masculine-associated names at launch) has largely closed (52% feminine-associated by mid-2025). Growth in low/middle-income countries outpaces wealthy nations by 4x.

The Roleplay Revelation: In OpenRouter-tracked open-weights usage, roleplay and creative interaction account for 52% of queries. Most Western labs have deliberately avoided this market due to safety concerns, ceding it to open-weights and Chinese providers.

The Stigma Gap

Shadow AI

The phenomenon now has a name: "Shadow AI" or "BYOAI" (Bring Your Own AI). Studies from KPMG and Cybernews confirm that 57-59% of employees hide AI use from employers. About 28% say they would continue using AI even if their company explicitly banned it.

Why they hide: Approximately 50% fear being perceived as "lazy" or "cheating." About 30% worry that revealing efficiency gains will make their role redundant. Another 27% report imposter syndrome—feeling their output is no longer "theirs." And many fear a cruel irony: admit to 10x productivity, get assigned 10x more work for the same pay.

This creates a measurement paradox: when surveyed, 65% characterized their use as "augmentative," but actual usage data shows 49% automation. People perceive their AI use as more collaborative than their behavior suggests—and underreport how much they actually rely on it.

The Policy Disconnect

While 52% of employers provide "approved" AI tools, only one-third of employees say these tools actually meet their needs. This forces them to use better, unapproved consumer tools (like ChatGPT or Claude) in the shadows—creating security risks. About 68% of organizations have experienced data leaks from staff feeding sensitive data into personal AI accounts.

Generational Divide

The pressure is particularly acute for younger workers. Around 47% of Gen Z workers hide AI use specifically due to fear of judgment. This demographic is deeply integrated with these tools—18% say they would have to change jobs entirely if AI were effectively banned.

The Productivity Paradox

The "productivity paradox"—where AI adoption stats lag behind expected output gains—is largely explained by employees hoarding their efficiency gains rather than sharing them with their organization. The friction is rarely about AI working poorly. It is almost entirely a structural and psychological problem.

03 Model Landscape
Google DeepMind
Gemini 3

The Shift: A step change in novel reasoning. Trailing in 2024, now holding the top LMArena position (1501 Elo).

Where It's Brittle

Deep Think takes minutes per query and costs $250/month. Roughly 1 in 4 factual queries returns an incorrect answer.

The benchmark that matters: Gemini 3's ARC-AGI-2 jump—from 4.9% to 31.1% (45.1% with Deep Think)—suggests a genuine architectural shift in novel reasoning rather than incremental tuning. Other strong results: 76.2% SWE-bench Verified, 91.9% GPQA Diamond, 37.5% on Humanity's Last Exam (no tools).

What we can verify independently: LMArena rankings are crowd-sourced human preferences, not vendor-controlled. Third-party benchmarking confirms Gemini 3's strong showing. Google's distribution advantage (2B Search users, 650M Gemini app users) is verifiable.

ARC-AGI-2 (Deep Think) 45.1%
SWE-bench Verified 76.2%
GPQA Diamond 91.9%
OpenAI
GPT-5 / o-Series

The Shift: Automatic routing between conversational and reasoning modes. 700M Weekly Active Users by July 2025.

Where It's Brittle

Reasoning pricing remains expensive ($15/$60 per M tokens). Auto-routing can misclassify complexity.

The benchmark that matters: The o3 model's 75.7% on ARC-AGI (87.5% high compute)—tripling o1's accuracy—established deliberative computation as viable. GPT-5 achieves 94.6% on AIME 2025, 74.9% on SWE-bench Verified, 85% on LiveCodeBench.

Reported (company claim): ChatGPT reached 700 million weekly active users by July 2025, sending 18 billion messages per week—approximately 10% of the global adult population.

ARC-AGI (High Compute) 87.5%
LiveCodeBench 85.0%
Anthropic
Claude 4 Series

The Shift: Consolidated position in programming. The "Work" AI. 60%+ share in coding workloads.

Where It's Brittle

Significantly higher cost than Chinese alternatives. Smaller context window (200K).

The benchmark that matters: Claude Opus 4.5's 80.9% on SWE-bench Verified—real-world GitHub issue resolution—is the metric most relevant to its core user base. Claude Sonnet 4.5 reached 77.2% (82.0% with parallel test-time compute).

Where it's brittle: Anthropic's pricing ($3/$15 for Sonnet 4, $15/$75 for Opus 4.5) is significantly higher than Chinese alternatives offering comparable benchmark performance. Claude's safety constraints are tighter than alternatives, which helps enterprise trust but limits certain use cases.

SWE-bench Verified 80.9%
04 Chinese Open-Weights Models

The Pricing Disruption

DeepSeek V3.2-Exp charges $0.28/$0.42 per million tokens (cache hits as low as $0.028). That is approximately 90% lower than Western APIs for competitive benchmark performance.

OpenAI's CEO acknowledged DeepSeek's R1 runs "20-50x cheaper." Jensen Huang publicly stated DeepSeek, Qwen, and Kimi are "the best open reasoning models in the world today."

Capability: Competitive on Benchmarks

DeepSeek R1 (January 2025) matched or exceeded OpenAI's o1 on multiple reasoning benchmarks. Alibaba's Qwen3-235B scores 69.5% on LiveCodeBench—competitive with frontier proprietary models. Qwen achieved 100% on AIME 2025 with code execution.

Adoption and Deployability

Measured (OpenRouter): Chinese model usage grew from ~1% to 27% of OpenRouter-tracked queries between late 2024 and late 2025. DeepSeek briefly topped the iOS App Store on January 27, 2025.

Critical caveat: OpenRouter skews toward developers. Enterprise adoption—where compliance and data residency matter—is harder to measure and likely lower.

Deployability constraints: Enterprise users face compliance questions around data sovereignty and export controls. Chinese models may face future regulatory restrictions in US/EU contexts.

Output Token Pricing (per 1M)

Claude ($15)
GPT-5 ($10)
DeepSeek ($0.42)
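To make the pricing gap concrete, the sketch below prices a hypothetical monthly workload at the output rates quoted above. The 500M-token workload is an invented figure; real bills depend on the input/output mix, caching, and batch discounts.

```python
# Output-token list prices quoted in this report, USD per 1M tokens.
PRICES = {
    "Claude (Opus 4.5)": 75.00,   # $15 input / $75 output
    "Claude (Sonnet 4)": 15.00,   # $3 input / $15 output
    "GPT-5": 10.00,
    "DeepSeek V3.2-Exp": 0.42,    # $0.28 input / $0.42 output
}

def monthly_cost(output_tokens_millions: float, price_per_million: float) -> float:
    return output_tokens_millions * price_per_million

workload = 500  # hypothetical: 500M output tokens per month
baseline = monthly_cost(workload, PRICES["Claude (Sonnet 4)"])
for name, price in PRICES.items():
    cost = monthly_cost(workload, price)
    saving = 100 * (1 - cost / baseline)
    print(f"{name:>20}: ${cost:>9,.2f}/mo  ({saving:+.0f}% vs Sonnet 4)")
# On output pricing alone DeepSeek comes out ~97% cheaper than Sonnet 4,
# consistent in direction with the report's 'approximately 90% lower' claim.
```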
Leadership Shift

Meta & Mistral

Meta's Llama 4: Meta's April release underperformed against expectations. Rootly's independent evaluation found Llama 4 Maverick achieved 70% accuracy on their coding benchmark—below DeepSeek V3.1 (75.5%) and Qwen2.5-Coder (~90%). Meta's previous-generation Llama 3.3 outperformed Llama 4 at 72%.

Additional issues: Meta used an unreleased version for its LMArena scores. The Open Source Initiative classifies Llama's license as not meeting open-source criteria, and the license explicitly prohibits EU users.

Mistral: The French lab positioned as Europe's AI leader through 2024 did not match the pace of advancement from Chinese labs in 2025. European AI development now leans more heavily on regulatory frameworks than on homegrown frontier models.

05 The Productivity Question
Time Reduction: 80% (1.4h → 17min)
Value/Conversation: $55 (labor cost saved)
Potential Growth: 1.8% annual (if universal)
Top Contributor: 19% (software developers)
Why skepticism is warranted: Estimates don't account for time validating output, iteration across sessions, or failed tasks completed manually. RCTs show smaller gains (14-56%). Most importantly: the projection assumes universal adoption. The stigma data suggests we're nowhere near that. This is potential, not current reality.

The Numbers

Anthropic analyzed 100,000 Claude.ai conversations to estimate time savings. Findings: Average task time reduction of 80%—from 1.4 hours to approximately 17 minutes. Task variation is substantial: curriculum development showed 97% savings (4.5 hours → 11 minutes); financial analysis 80%; hardware troubleshooting 56%.
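The headline figure is straightforward arithmetic on the before/after task times. A quick check using the report's own numbers:

```python
def reduction_pct(before_min: float, after_min: float) -> float:
    """Percent time saved going from the 'before' to the 'after' duration."""
    return 100 * (1 - after_min / before_min)

# Average task: 1.4 hours -> ~17 minutes.
print(reduction_pct(1.4 * 60, 17))   # ~79.8%, i.e. the ~80% headline
# Curriculum development: 4.5 hours -> 11 minutes.
print(reduction_pct(4.5 * 60, 11))   # ~95.9%, in line with the reported 97%
                                     # given rounding of the underlying times
```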

Estimated: Extrapolating, Anthropic estimates current AI models could increase US labor productivity growth by 1.8% annually—roughly doubling the recent rate. Software developers contribute most (19%), followed by general managers (6%), marketing (5%).

Critical caveats: These estimates don't account for time spent validating AI output, iteration across multiple sessions, or tasks that fail and must be completed manually. Claude's time estimates for software tasks show r=0.46 correlation with actual completion times, versus r=0.67 for human developers—indicating substantial estimation error.

What Workers Say

General Workforce
86% report time savings, 65% satisfied—but 69% face stigma.
Creatives
97% report time savings, 68% quality increase—but 70% face peer judgment.
Scientists
Want AI as "valuable research partner"—but 79% cite trust concerns.

Across groups, 48% anticipate transitioning from direct work to managing AI.

06 Agentic AI—Capability vs Robustness
Case Study: Project Vend (Anthropic)

Can AI Run a Business?

An AI "shopkeeper" named Claudius running vending machines revealed the critical gap between capability and robustness.

What Worked

Claudius generated consistent profits and expanded to SF, NYC, and London. Custom merchandise became a real revenue stream. AI can run a business.

What Failed

  • Purchased illegal onion futures (1958 law violation).
  • Proposed hiring security at $10/hr—below California minimum wage.
  • Socially engineered to appoint an "imposter CEO."
  • The AI CEO approved refund requests 8x more often than it denied them.
"Models trained to be helpful make decisions not according to hard-nosed market principles, but from something more like the perspective of a friend who just wants to be nice. They can be socially engineered precisely because they're optimized for helpfulness."

Implications

Security: Vulnerable to social engineering that exploits helpful training.
Accountability: When an agent makes illegal contracts, who's responsible?
Policy as Code: "Don't break the law" isn't sufficient when the agent doesn't know the law; see the sketch after this list.
Audit: Decisions need logging and escalation paths.
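What "policy as code" might look like in practice: the sketch below hard-codes two of the rules Claudius broke as machine-checkable guards, with an audit log and an escalation path. Everything here (the AgentAction type, the rule functions, the thresholds) is a hypothetical illustration, not Anthropic's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentAction:
    kind: str      # e.g. "purchase", "hire", "refund"
    details: dict

# Hypothetical hard-coded policy rules for the failures Project Vend surfaced.
def no_banned_commodities(a: AgentAction) -> str | None:
    banned = {"onion futures"}  # trading them has been illegal in the US since 1958
    if a.kind == "purchase" and a.details.get("item") in banned:
        return "banned commodity"
    return None

def wage_floor(a: AgentAction, minimum: float = 16.50) -> str | None:
    # Illustrative floor; the real minimum wage varies by year and locality.
    if a.kind == "hire" and a.details.get("hourly_rate", minimum) < minimum:
        return f"offered rate below ${minimum}/hr wage floor"
    return None

RULES: list[Callable[[AgentAction], str | None]] = [no_banned_commodities, wage_floor]
audit_log: list[tuple[AgentAction, str]] = []

def check(action: AgentAction) -> bool:
    """Allow the action, or deny it, log it, and escalate to a human."""
    for rule in RULES:
        violation = rule(action)
        if violation:
            audit_log.append((action, violation))
            print(f"DENIED + escalated to human: {action.kind} ({violation})")
            return False
    audit_log.append((action, "allowed"))
    return True

check(AgentAction("hire", {"role": "security", "hourly_rate": 10.0}))
check(AgentAction("purchase", {"item": "onion futures"}))
```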

What 'Agentic' Means

Agentic AI differs from traditional chatbots: autonomous operation without human-per-turn approval, multi-step task execution, tool integration, and ability to plan, execute, and self-correct. Major platforms launched agent frameworks in 2025: AWS's 'frontier agents,' Google's Antigravity, Microsoft's 'systems of agency.'
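Schematically, the plan-execute-self-correct loop that separates agents from chatbots looks like the sketch below. The function names (plan, call_tool, verify) are illustrative stand-ins, not any vendor's framework API.

```python
# A schematic agent loop: plan, act via a tool, verify, self-correct.
# Stubbed components stand in for a real model and real tools.

def plan(goal: str, history: list[str]) -> str:
    # Stub: a real agent would ask the model for the next step.
    return "done" if history else f"search({goal!r})"

def call_tool(step: str) -> str:
    # Stub: a real agent would dispatch to search, code execution, etc.
    return f"result of {step}"

def verify(result: str) -> bool:
    # Stub: a real agent would check the result against the goal.
    return True

def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):      # no human approval per turn
        step = plan(goal, history)
        if step == "done":
            break
        result = call_tool(step)    # multi-step tool use
        if not verify(result):      # self-correction hook
            history.append(f"FAILED: {step}")
            continue
        history.append(result)
    return history

print(run_agent("cheapest 1M-token API"))
```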

Project Vend Details

What worked: Upgrading from Sonnet 3.7 to Sonnet 4 (later 4.5) improved business performance. Claudius generated consistent profits and expanded to SF, NYC, and London. Custom merchandise through Clothius became a genuine revenue stream.

What failed: An employee convinced Claudius to pursue an onion futures contract—illegal since 1958 when Vince Kosuga cornered the market. When someone reported 'shoplifting,' Claudius proposed hiring security at $10/hour—below California minimum wage—without authorization. A 'faulty voting procedure' allowed an employee to become an 'imposter CEO.'

The core problem: Models trained to be helpful make decisions 'not according to hard-nosed market principles, but from something more like the perspective of a friend who just wants to be nice.' Anthropic's conclusion: 'The gap between capable and completely robust remains wide.'

07 Regulatory Developments
EU AI Act

What Changed

February 2: Bans on "unacceptable risk" systems took effect (cognitive manipulation, social scoring, most real-time biometric ID). AI literacy obligations became applicable.

August 2: GPAI governance rules became applicable. The Code of Practice offers a voluntary compliance path. Full high-risk requirements phase in between August 2026 and August 2027.

Practical Impact

For SaaS builders serving EU: Assess high-risk categories, implement documentation and risk assessment, ensure AI-use transparency. Example: a customer-service chatbot used in hiring or credit screening now triggers documentation and risk-classification obligations.

For open-weights distributors: Licensing and geo-fencing choices matter—Meta's Llama 4 prohibits EU users entirely.

For enterprises: Procurement processes evolving, AI literacy training obligations, vendor due diligence now includes compliance assessment.
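As a simplified illustration of that risk-tiering logic (a rough paraphrase of the Act's categories, not legal advice):

```python
# Rough, non-authoritative sketch of EU AI Act risk triage for a SaaS feature.
UNACCEPTABLE = {"social scoring", "cognitive manipulation", "realtime biometric id"}
HIGH_RISK = {"hiring", "credit screening", "education scoring"}  # Annex III-style uses

def risk_tier(use_case: str) -> str:
    if use_case in UNACCEPTABLE:
        return "unacceptable: banned since Feb 2, 2025"
    if use_case in HIGH_RISK:
        return "high-risk: documentation + risk assessment (phasing in 2026-27)"
    return "limited/minimal: transparency obligations may still apply"

# The report's example: a customer-service chatbot repurposed for hiring.
print(risk_tier("hiring"))
```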

08 Geographic Patterns

Measured A 1% higher GDP per capita is associated with 0.7% higher AI usage globally, 1.8% within the US.

Israel: 7x expected usage
Singapore: 4.6x expected usage
Australia: 4.1x expected usage
DC (US): 3.82x expected usage
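A stylized way to read these multiples, assuming a simple log-log model (our illustration; the index's actual construction may differ): usage scales with GDP per capita raised to the 0.7 elasticity, and each country's multiple compares observed usage with that prediction.

```python
# Stylized log-log model: usage ~ k * gdp_per_capita ** 0.7 (global elasticity).
def expected_usage(gdp_per_capita: float, k: float = 1.0,
                   elasticity: float = 0.7) -> float:
    return k * gdp_per_capita ** elasticity

def adoption_multiple(observed: float, gdp_per_capita: float) -> float:
    """How many times its expected usage a country shows (e.g. Israel ~7x)."""
    return observed / expected_usage(gdp_per_capita)

# A 1% rise in GDP per capita implies ~0.7% more usage under this model:
print(expected_usage(1.01) / expected_usage(1.00))  # ~1.007
```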

Lower-adoption countries concentrate on coding (50%+ in India vs 36% globally). As adoption matures, usage diversifies. High-adoption countries show more collaborative patterns; emerging markets prefer full delegation.

Inequality implications: If AI is a general-purpose technology, income-correlated adoption could widen global inequality. Counterpoint: OpenAI found growth in low/middle-income countries outpacing wealthy nations 4x.

Conclusion

What We Know

Two distinct AI ecosystems are emerging—enterprise and consumer, with different pricing, safety constraints, and providers. Chinese open-weights captured significant developer share and disrupted pricing. Automation is increasingly preferred over augmentation. Agent capabilities are advancing faster than agent robustness.

What remains uncertain: Whether productivity estimates translate to economic impact. Whether adoption patterns persist. Whether Chinese models face regulatory constraints. Whether agent robustness is solvable. Whether AI narrows or widens inequality.

The technology works well enough that 86% report time savings.
It doesn't yet work reliably enough to be trusted without supervision.

That gap—between useful and robust—is where the interesting questions live for 2026.

The State of AI 2025

Primary sources: OpenRouter/a16z (100T tokens), Anthropic Economic Index (1M+ conversations), OpenAI/Harvard NBER working paper (1.5M messages).

All benchmark figures vendor-reported unless otherwise noted.
