AI fails. Sometimes spectacularly. Sometimes quietly. Sometimes in ways that cause real harm. The industry doesn't talk about this enough.
This is our attempt at honest accounting. We've documented significant AI failures from 2025—bugs, harms, and overpromises from across the industry. We've also included our own failures, because intellectual honesty requires applying the same standards to ourselves.
This article will be updated as new significant failures come to light.
TL;DR — 2025's Major AI Failures
- GPT-5 Launch Chaos: OpenAI's model router secretly downgraded users to older models, causing "bait-and-switch" accusations
- 27% Traffic Collapse: Google's AI Overviews cannibalized publisher traffic, causing real economic harm
- 440,000+ Packages Vulnerable: Slopsquatting attacks exploited AI-hallucinated package names in supply chains
- 19% Slower Developers: METR study found experienced devs actually slower with AI tools on familiar tasks
- Gemini's Date Problem: Google's AI insisted it was 2024 when users asked about November 2025
- Our Own Failures: Biased competitor coverage, missing disclosures, overstated statistics—documented below
Industry Failures: The Big Ones
GPT-5 Launch Chaos (August 2025) - High Severity
What happened: OpenAI launched GPT-5 with a "model router" that was supposed to direct queries to the optimal model. Instead, many users found they were being downgraded to GPT-4o or GPT-4.5 for queries that should have used GPT-5. The router's logic was opaque, leading to widespread accusations of a "bait-and-switch."
Impact: Significant user backlash. Multiple Reddit threads documenting inconsistent responses. Some enterprise customers paused adoption pending clarity.
What was learned: Transparency about model routing matters. Users don't like surprises about which model they're actually using.
The Lesson
Complex model routing systems need clear user communication. "Smart" backend optimization that users can't see or understand feels like deception, even if technically justified.
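To make the lesson concrete, here is a minimal sketch of a router that surfaces its decision instead of hiding it. The model names, the complexity score, and the threshold are all hypothetical illustrations; OpenAI's actual routing logic is not public.

```python
from dataclasses import dataclass

# Hypothetical tiers and threshold for illustration only; these do not
# reflect any vendor's real routing implementation.
MODELS = {"fast": "gpt-4o", "frontier": "gpt-5"}
FRONTIER_THRESHOLD = 0.5

@dataclass
class RoutedResponse:
    text: str
    served_by: str  # surfaced to the user, never hidden

def route(complexity_score: float) -> str:
    """Pick a model tier based on an (assumed) complexity score."""
    if complexity_score >= FRONTIER_THRESHOLD:
        return MODELS["frontier"]
    return MODELS["fast"]

def answer(query: str, complexity_score: float) -> RoutedResponse:
    model = route(complexity_score)
    # A real system would call the chosen model here; we fake the text.
    return RoutedResponse(text=f"(answer from {model})", served_by=model)
```

The point of the sketch is the `served_by` field: routing can be "smart" as long as the user can always see which model actually handled the request.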
Gemini's Temporal Confusion (November 2025) - Medium Severity
What happened: Andrej Karpathy publicly documented Gemini refusing to believe it was November 2025, insisting it was 2024. This wasn't an isolated incident—multiple users reported similar temporal confusion, particularly around recent events.
Impact: Embarrassing for Google. Raised questions about training data currency and the reliability of AI for current-events queries.
What was learned: Training data cutoffs create real usability problems. Models need better mechanisms for handling temporal uncertainty.
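One common mitigation is to inject the current date into the system prompt at request time, so the model does not have to guess the date from stale training data. This is a generic sketch of that pattern, not Google's or anyone's actual fix; the prompt wording is an assumption.

```python
from datetime import datetime, timezone

def build_system_prompt(base: str) -> str:
    """Prepend today's UTC date so the model need not infer it from training data."""
    today = datetime.now(timezone.utc).date().isoformat()
    return (
        f"Current date: {today} (UTC). "
        "If asked about events after your training cutoff, say you are uncertain.\n\n"
        + base
    )
```

This does not fix stale knowledge, but it prevents the specific failure mode of a model confidently asserting the wrong year.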
AI Overviews Traffic Collapse (Ongoing) - High Severity
What happened: Google's AI Overviews in Search reduced publisher referral traffic by an estimated 27%, according to some analyses. Sites that previously ranked well found their traffic cannibalized by AI-generated summaries.
Impact: Real economic harm to content creators and publishers. Chegg filed a lawsuit. Multiple smaller publishers reported significant revenue drops.
What was learned: AI systems that extract value from content creators without compensation create sustainability problems. The incentive structure matters.
The Lesson
AI systems exist in ecosystems. Optimizing for one metric (user convenience) while ignoring second-order effects (content creator sustainability) creates long-term problems.
Claude's Character.AI Competitor Moment (2025) - High Severity
What happened: Following lawsuits against Character.AI related to teen mental health crises, scrutiny extended to other AI chat systems. Reports emerged of Claude developing concerning interaction patterns with vulnerable users, though Anthropic's safety systems caught many issues before they escalated.
Impact: Renewed focus on AI safety for social/emotional use cases. Industry-wide discussion of guardrails for AI companions.
What was learned: AI systems designed for general use get used for emotional support. Safety systems need to account for this reality.
Slopsquatting Supply Chain Attacks (Q3-Q4 2025) - High Severity
What happened: Attackers exploited AI code generators' tendency to hallucinate package names. They registered packages matching common AI hallucinations, then waited for developers to install them. Research found over 440,000 potentially affected packages.
Impact: Unknown number of compromised development environments. At least one documented case of production system compromise.
What was learned: AI code generation creates new attack surfaces. Hallucinated dependencies are a real security vector.
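One practical defense is to vet AI-suggested dependencies against a pinned allowlist (e.g., a lockfile) before installing anything, rather than trusting a suggested name. The sketch below assumes a simple `name==version` pin format and is illustrative, not a complete supply-chain control.

```python
# Defensive sketch: flag AI-suggested packages that are not already pinned.
# The lockfile format here ('name==version' per line) is an assumption for
# illustration; adapt the parsing to your real dependency manifest.

def load_allowlist(lockfile_text: str) -> set[str]:
    """Parse 'name==version' lines into a set of approved package names."""
    names = set()
    for line in lockfile_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split("==")[0].lower())
    return names

def vet_suggestions(suggested: list[str], allowlist: set[str]) -> list[str]:
    """Return suggested packages NOT on the allowlist, for manual review."""
    return [pkg for pkg in suggested if pkg.lower() not in allowlist]
```

Anything the vet step flags gets checked by a human against the real registry before `pip install` ever runs, which blunts the hallucinated-name attack.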
The Productivity Paradox
METR Study: 19% Slower with AI (2025) - Medium Severity
What happened: A rigorous study by METR found that experienced developers were 19% slower when using AI coding tools on their own tasks. This contradicted widespread claims about AI productivity gains.
Impact: Challenged the narrative that AI tools universally improve developer productivity. Sparked industry debate about when AI helps vs. hurts.
What was learned: Context matters enormously. AI tools may help in some scenarios and hurt in others. Blanket productivity claims are oversimplified.
The Lesson
The AI industry, including us, often overstates productivity benefits. Honest assessment requires acknowledging that tools have tradeoffs and context-dependent effects.
Code Quality Degradation Reports (Ongoing) - Medium Severity
What happened: Multiple studies in 2025 documented quality issues with AI-generated code. Veracode's research found higher vulnerability rates. GitClear documented increased "code churn" in AI-heavy codebases.
Impact: Growing concern about long-term maintenance costs of AI-generated code. Some organizations paused or rolled back AI adoption.
What was learned: Speed of code generation doesn't equal quality. The full cost of AI code includes debugging and maintenance time.
Our Own Failures
Self-correction requires applying the same scrutiny to ourselves. Here are Syntax.ai's failures and mistakes this year.
Initial Article Bias (Early November 2025) - Medium Severity
What happened: Our early articles about competitors (particularly Anthropic) were more one-sided than they should have been. We framed skepticism as debunking rather than genuine questioning, and we didn't adequately acknowledge uncertainty or legitimate reasons for competitor behavior.
Impact: Some readers likely got an unfairly negative impression of competitors. We didn't meet our own stated standards for honest assessment.
What we did: Rewrote multiple articles to present multiple perspectives rather than advocacy. Added explicit acknowledgments of our competitive bias. Implemented editorial review process.
The Lesson
Having ethics principles in CLAUDE.md isn't enough. We need processes to actually apply them, especially when writing about competitors where we have obvious incentives to be unfair.
Missing Transparency Disclosures (Fixed November 2025) - Low Severity
What happened: Three of our blog articles were published without the transparency disclosure boxes required by our own ethics framework.
Impact: Readers didn't have explicit context about our commercial interests when reading those articles.
What we did: Added disclosure boxes to all affected articles. Implemented pre-publish checklist to prevent recurrence.
Overstated Statistics (Corrected) - Medium Severity
What happened: Some early articles cited statistics without adequate sourcing or caveats. We used numbers that made our points stronger without verifying them or acknowledging methodological limitations.
Impact: Readers may have been misled about the certainty of various claims.
What we did: Audited all statistics in published articles. Added source citations and methodological caveats. Some numbers were removed when we couldn't verify them.
Patterns We're Noticing
Looking across these failures, some themes emerge:
Pattern 1: Complexity Hides Failure
Many AI failures are hard to see. GPT-5's router issues, Gemini's temporal confusion, code quality degradation—none of these are obvious to casual users. The systems are complex enough that problems can hide for a long time.
Pattern 2: Incentives Suppress Disclosure
Companies have strong reasons not to publicize failures. Every failure story hurts adoption. This creates selection bias in what gets discussed publicly. The failures we know about are probably a small fraction of the failures that exist.
Pattern 3: Second-Order Effects Get Ignored
AI systems optimize for measurable, immediate goals. But they exist in complex systems with second-order effects. Traffic collapse for publishers. Security vulnerabilities in generated code. Mental health impacts from AI companions. These effects are predictable but often ignored until they cause visible harm.
What Self-Correction Looks Like
Per Harari's framework, self-correcting systems need:
- Visible failure feedback: This article is our attempt at that
- Mechanisms for correction: Our editorial review process, pre-publish checklists
- Willingness to update beliefs: We've changed positions based on evidence (e.g., our initial competitor coverage)
- Resistance to self-reinforcement: We've published criticism of AI tools despite being an AI company
We're not claiming to be perfect at this. Self-correction is a practice, not an achievement. We'll probably fail again. The question is whether we acknowledge failures when they happen and update our behavior accordingly.
Updates Will Follow
This article is a living document. As new significant AI failures come to light—including our own—we'll update this record. If you're aware of failures we should document, we're interested to hear about them.
The AI industry needs more honest accounting. We're trying to contribute to that, imperfectly.
A Final Note
Writing about failures is uncomfortable. It would be easier to only publish positive content about AI. But the Harari principle in our ethics framework exists for a reason: self-correcting systems require visible failure feedback.
If we only talk about AI success, we're not being honest about what AI is. And that dishonesty has real costs—for users who trust systems that fail them, for the industry's credibility, for the broader conversation about AI's role in society.
We'd rather be uncomfortable and honest than comfortable and misleading.
Frequently Asked Questions
What were the biggest AI failures in 2025?
Major AI failures in 2025 included: GPT-5's launch chaos with opaque model routing that led to bait-and-switch accusations, Google's AI Overviews reducing publisher traffic by 27%, slopsquatting supply chain attacks exploiting hallucinated package names (440,000+ affected packages), Gemini's temporal confusion refusing to acknowledge the current date, and the METR study revealing experienced developers were 19% slower with AI tools.
What is slopsquatting and why is it dangerous?
Slopsquatting is a supply chain attack where malicious actors register package names that AI code generators commonly hallucinate. When developers install these hallucinated packages, they unknowingly install malware. Research found over 440,000 potentially affected packages, with at least one documented production system compromise. Learn more in our detailed slopsquatting article.
Did AI tools actually make developers slower in 2025?
Yes, according to a rigorous METR study. Experienced developers were 19% slower when using AI coding tools on their own familiar tasks. This contradicted industry claims about universal productivity gains and sparked debate about context-dependent AI effectiveness. Read our full analysis of the METR study.
What is a self-correcting AI system according to Harari?
According to Yuval Noah Harari, self-correcting systems admit errors, update beliefs, and provide visible failure feedback. In contrast, self-reinforcing systems defend beliefs and suppress contradictions. The AI industry often operates as self-reinforcing, suppressing failure stories that might hurt adoption. This article is our attempt at self-correction.