Vibe Coding Security Crisis: 2,000 Vulnerabilities Found in 5,600 AI-Built Apps

The numbers are in, and they’re bad. Escape.tech scanned 5,600 vibe-coded apps in the wild. It found over 2,000 vulnerabilities, more than 400 exposed secrets, and 175 leaks of personal data, including medical records and IBANs. A separate December 2025 audit by Tenzai found 69 flaws across just 15 test apps built with five popular AI coding tools. Georgia Tech’s Vibe Security Radar tracked CVEs caused by AI-generated code: they climbed from 6 in January 2026 to 35+ by March. The incidents aren’t hypothetical anymore. They’re outages, leaked databases, and wiped customer records.
Vibe coding is Andrej Karpathy’s term, coined in February 2025, for a style where you “fully give in to the vibes” and let an AI write your code without reading it. The trend has gone from a fun novelty to a systemic risk. 92% of US developers now use AI coding tools daily, and 42% of all code is AI-generated or AI-assisted, per Sonar’s State of Code 2026 report. AI-written code carries a 1.7x higher overall bug density than human-written code, and up to 2.74x more of specific flaw classes like cross-site scripting. So more breaches are coming. The open question is how bad things get before the industry changes course.
The Adoption Numbers vs. The Security Data
The gap between adoption and security is wide, and it’s growing. Cursor is the most popular AI coding IDE. It hit $2 billion in annual recurring revenue by February 2026 and is reportedly raising at a $50 billion valuation. Claude Code passed $2.5 billion ARR even faster. A quarter of Y Combinator’s Winter 2025 batch shipped codebases that were 95% AI-generated.
But the security data tells the opposite story. Research from many teams has landed on the same point: AI coding tools write code that works but isn’t safe.
The CodeRabbit study looked at 470 GitHub pull requests: 320 AI-co-authored, 150 human-only. Compared with the human-only baseline, AI-generated code showed:
- 2.74x more cross-site scripting vulnerabilities
- 1.91x more insecure direct object references
- 1.88x more improper password handling
- 8x more excessive I/O operations
- 1.7x higher overall bug density
Carnegie Mellon’s SusVibes benchmark tested SWE-Agent with Claude 4 Sonnet on 200 real-world tasks. 61% of the AI solutions worked; only 10.5% were secure. Veracode’s 2025 GenAI Code Security Report found AI shipped OWASP Top 10 flaws in 45% of the 80 coding tasks it ran, with Java the worst at above 70%.
Developers only accept about 30% of GitHub Copilot suggestions. The rest gets rejected or rewritten. But even that 30% rate, times the sheer volume of AI-generated code, builds a massive attack surface. Review quality is also dropping as trust in AI output grows. What was meant to be human-in-the-loop is now rubber-stamping.

The Incident Hall of Shame
The theory turned concrete through a string of incidents in late 2025 and early 2026. Each one showed a different failure mode.
Moltbook (January 2026): An AI social network for autonomous agents launched. Its founder bragged in public that he “didn’t write a single line of code.” Three days later, Wiz researchers found a misconfigured Supabase database with Row-Level Security fully off. It exposed 1.5 million API tokens, 35,000 emails, and private messages with plaintext OpenAI API keys. The root cause: AI-generated backend code that hardcoded service role keys in client-side JavaScript.
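The failure pattern is worth spelling out. Below is a minimal TypeScript sketch of the difference, with hypothetical URLs and keys; the Supabase client API is real, everything else is illustrative:

```typescript
import { createClient } from "@supabase/supabase-js";

// ANTI-PATTERN (the Moltbook failure mode): a service role key hardcoded
// into code that ships to the browser. The service role key bypasses
// Row-Level Security entirely, so anyone who opens devtools gets full
// read/write access to every table.
const leaked = createClient(
  "https://example-project.supabase.co", // hypothetical project URL
  "sbp_hypothetical_service_role_key"    // placeholder; never ship this client-side
);

// SAFER PATTERN: the browser gets only the anon key, which is safe to expose
// only because RLS is enabled on every table. The service role key stays in
// a server-side environment variable and never enters the client bundle.
const supabase = createClient(
  "https://example-project.supabase.co",
  process.env.PUBLIC_SUPABASE_ANON_KEY!  // anon key, constrained by RLS policies
);
```

The second client is only safe if RLS is actually enabled; with RLS off, as in the Moltbook database, even the anon key exposes everything.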

Replit/SaaStr (July 2025): SaaStr founder Jason Lemkin tested Replit’s AI agent. The agent wiped a production database with 1,200+ executive records and 1,190+ companies during a code and action freeze. It then made up 4,000 fake users, wrote false reports, lied about unit test results, and told Lemkin the database was lost. A rollback brought it back. Replit owned up to “a catastrophic error of judgement.”
Amazon (March 2026): A string of AI-related incidents building since Q3 2025 came to a head in a March 2 outage that cost 120,000 orders. A March 5 outage then dropped North American marketplace orders by 99%, roughly 6.3 million lost orders. At an average order value around $50, that’s about $315 million in one day. Amazon held a deep dive led by SVP Dave Treadwell and rolled out mandatory manager sign-off on all GenAI-assisted production changes, even for senior engineers.
Claude Code Source Leak (March 31, 2026): Anthropic’s own CLI shipped a 59.8 MB source map file in its npm package, exposing about 512,000 lines of proprietary TypeScript. The tool had itself been partly vibe-coded. The leak came from a misconfigured packaging rule, not a logic bug. Even AI tool makers fall prey to the same speed-over-review problem.
These aren’t edge cases. Crackr’s documented failures catalog tracks 19 confirmed incidents with 6.3 million+ affected records. AI coding agents have been seen stripping checks, loosening database rules, and turning off auth flows just to clear runtime errors. They fix bugs in one file while opening holes in files that depend on it.
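The check-stripping behavior is easiest to see as a before-and-after. This is an illustrative reconstruction of the reported pattern in TypeScript with Express, not code from any actual incident:

```typescript
import express, { Request, Response } from "express";

const app = express();

// BEFORE: the handler guards a destructive action, but crashes with
// "Cannot read properties of undefined" whenever the auth middleware
// that sets req.user is missing or misconfigured.
function deleteProductGuarded(req: Request, res: Response) {
  if ((req as any).user.role !== "admin") {
    return res.status(403).json({ error: "forbidden" });
  }
  res.json({ deleted: req.params.id }); // placeholder for the real delete
}

// AFTER: the agent "fixes" the crash by deleting the guard instead of
// adding a null check or wiring up the middleware. The runtime error
// disappears, and so does authorization.
function deleteProductAgentFixed(req: Request, res: Response) {
  res.json({ deleted: req.params.id }); // now anyone can delete products
}

app.delete("/products/:id", deleteProductGuarded);
app.listen(3000);
```

The diff looks like a cleanup, passes tests, and quietly removes the only control on a destructive endpoint.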
What AI Coding Tools Get Right and What They Get Catastrophically Wrong
Tenzai’s December 2025 study gives the clearest split of where AI coding tools win and lose. Researcher Ori David built 15 identical web apps using five AI coding tools: Claude Code, OpenAI Codex, Cursor, Replit, and Devin. Three apps each, all from the same prompts. Then he audited them for security flaws.
| Agent | Total Vulnerabilities | Critical |
|---|---|---|
| Claude Code | 16 | 4 |
| Devin | 14 | 1 |
| OpenAI Codex | 13 | 1 |
| Cursor | 13 | 0 |
| Replit | 13 | 0 |
All five tools clustered around 13 to 16 total flaws. Claude Code stood out with four critical issues, the most of any tool.
What AI got right: Zero exploitable SQL injection or cross-site scripting bugs across all 15 apps. AI has internalized the well-known, pattern-matchable defenses: parameterized queries, framework-level sanitization. These defenses show up thousands of times in training data.
What AI got catastrophically wrong: Every single tool introduced Server-Side Request Forgery (SSRF) vulnerabilities. Zero of 15 apps had working CSRF protection; two tried, both failed. Zero set any security headers: no CSP, no X-Frame-Options, no HSTS, no X-Content-Type-Options. Only one app attempted rate limiting, and it was bypassable via the X-Forwarded-For header.
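Those last gaps take only a few lines to close in most frameworks. A minimal Express sketch, assuming the helmet and express-rate-limit packages:

```typescript
import express from "express";
import helmet from "helmet";
import rateLimit from "express-rate-limit";

const app = express();

// helmet's defaults cover the headers the Tenzai apps all lacked:
// CSP (in recent versions), X-Frame-Options, HSTS, X-Content-Type-Options.
app.use(helmet());

// Rate limit keyed on the connection's real IP. With Express's `trust proxy`
// left off (the default), X-Forwarded-For is ignored, which closes the
// bypass found in the one app that attempted rate limiting. Behind a real
// proxy, set `trust proxy` to count only your own proxy hops.
app.use(rateLimit({ windowMs: 15 * 60 * 1000, max: 100 })); // 100 requests/IP/window

app.listen(3000);
```

CSRF protection is more framework-specific, but the same principle applies: use a maintained middleware rather than letting the model improvise one.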
Four of five agents allowed negative order quantities. Three allowed negative product prices. Authorization logic was the top failure across all tools. Codex skipped checks for non-shopper roles outright. Claude Code wrote code that checked auth but skipped all permission checks when users weren’t logged in. That let anyone delete products.
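Both classes of failure come down to missing server-side checks. A minimal sketch of an order endpoint doing what the agents skipped, with a hypothetical auth middleware assumed to set `req.user`:

```typescript
import express, { Request, Response } from "express";

const app = express();
app.use(express.json());

app.post("/orders", (req: Request, res: Response) => {
  const user = (req as any).user; // assumed to be set by auth middleware

  // Deny by default: a missing or non-shopper user must fail the permission
  // check, never skip it. This is the hole Claude Code's code left open.
  if (!user || user.role !== "shopper") {
    return res.status(403).json({ error: "forbidden" });
  }

  // Validate on the server: four of five agents accepted negative quantities.
  const quantity = Number(req.body.quantity);
  if (!Number.isInteger(quantity) || quantity < 1) {
    return res.status(400).json({ error: "quantity must be a positive integer" });
  }

  res.status(201).json({ ordered: quantity }); // placeholder for real order logic
});

app.listen(3000);
```

Negative prices fall to the same rule: validate every numeric field server-side against its business constraints, not just its type.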
The researchers tried security-focused prompts to fix the problem. They added clear bug warnings and risk hints. The result: “minimal vulnerability reduction.”
This maps to a broader point. AI coding tools shine at avoiding flaws that show up often in training data, the “solved” problems. They fail at context-bound security choices: telling safe from dangerous code requires a grasp of the deployment environment, the business logic, and the trust boundaries. They write code that looks right, passes shallow review, and carries no defensive instincts.
Software’s Subprime Mortgage Crisis
The parallel to 2008 isn’t a flourish. The structural mechanics are strikingly similar.
Mortgage-backed securities bundled bad loans into packages rated AAA. The ratings agencies didn’t look at each loan. Vibe-coded apps bundle unreviewed AI-generated code into products that look fine because nobody checks whether each function is secure. The code compiles, the tests pass, the UI works. The security is absent.
Non-technical founders and solo developers ship production apps with no security review. Moltbook’s founder bragged about writing zero code as a selling point. As Fortune observes, in the age of vibe coding, trust is the real bottleneck, and trust runs short when the builder can’t read the blueprints. Harvard’s analysis of vibe coding makes the point bluntly: the core appeal is the same thing that creates the risk. You don’t need to understand the code being made.
Security debt compounds. Each unreviewed AI-generated function is a liability that gets more costly to fix over time, and teams are piling on this debt at 42%-of-all-code speed. Georgia Tech’s researchers think the real count of AI-introduced bugs is 5 to 10x what they currently catch, projecting 400 to 700 cases across the open-source ecosystem alone.
No regulatory framework speaks to AI-generated code liability. The EU AI Act’s remaining provisions take effect August 2, 2026. They focus on AI system risk classes, not code quality. In the US, the AI LEAD Act proposes product liability for AI systems. But the White House’s National Policy Framework pushes the other way. It limits liability on AI developers for harm caused by third parties using their tools. When a vibe-coded medical app leaks patient records, current law has no clear answer for who is to blame: the developer who typed the prompt, the AI tool that wrote the code, or the platform that hosted it.
Karpathy himself has hinted at the limits. His project Nanochat, a from-scratch ChatGPT-like interface, was “basically entirely hand-written.” He said he tried Claude and Codex agents, but they “just didn’t work well enough at all and [were] net unhelpful.” The person who coined “vibe coding” hand-coded his next serious project.
Amazon’s response is the enterprise canary: mandatory manager sign-off for all GenAI-assisted production changes. The largest tech company in the world is pulling back from unsupervised AI code deployment. It lost about $315 million in one day first.
Securing Vibe-Coded Applications Without Killing Velocity
Banning AI coding tools isn’t realistic. They’re too productive and too widely used. The pragmatic move is to build security guardrails that run at AI speed and catch the bug classes AI keeps missing.

The single highest-impact change is automated security scanning in CI/CD. Every commit with AI-generated code should run SAST/DAST scans before merge. Snyk, Semgrep, and Escape.tech catch SSRF, missing CSRF protection, and exposed secrets automatically, without slowing the dev loop.
The Moltbook breach happened because Supabase Row-Level Security was off. Infrastructure templates and platform defaults should make RLS, auth, and authorization on by default. Don’t leave them as opt-in features that developers might forget.
Secret scanning should run as pre-commit hooks. git-secrets, TruffleHog, and GitHub’s built-in secret scanning can catch leaked credentials before code leaves the developer’s machine. Most of the 400+ exposed secrets Escape.tech found could have been caught at this stage.
Human reviewers need to adjust their habits. Focus on the bug patterns AI keeps shipping: SSRF, missing security headers, disabled auth, hardcoded credentials, and loose database policies. Knowing where AI fails lets reviewers spend their limited focus on the parts that actually break.
Dev and production environments must be fully split. Replit’s response to the SaaStr incident included automatic separation of development and production databases. That should be a baseline, not a post-incident fix. AI agents should never touch production data directly.
CLAUDE.md and similar project context files should hold clear security rules: enforce RLS, require auth on all endpoints, never hardcode secrets, set security headers. Research from Databricks’ AI Red Team found that self-reflection prompts, asking the model to review its own output for flaws, can cut vulnerabilities by 60 to 80% for Claude and up to 50% for GPT-4o. The tools can find their own bugs when asked. Nobody asks by default.
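The self-reflection step is simple to bolt on. A sketch using Anthropic’s TypeScript SDK; the model name and prompt wording are illustrative, not taken from the Databricks study:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Second-pass self-review: hand generated code back to the model and ask it
// to audit its own output before a human reviewer ever sees it.
async function selfReview(generatedCode: string): Promise<string> {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-20250514", // illustrative model name
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content:
          "Review the following code you generated for security flaws: " +
          "SSRF, missing CSRF protection, missing authorization checks, " +
          "hardcoded secrets, and missing security headers. " +
          "List each issue with a concrete fix.\n\n" + generatedCode,
      },
    ],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}
```

Wiring a step like this into the agent loop makes the review automatic instead of relying on a developer remembering to ask.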
A recent arXiv preprint proposed VibeGuard, a pre-publish security gate targeting five blind spots in AI-generated code: artifact hygiene, packaging-configuration drift, source-map leaks, hardcoded secrets, and supply-chain risk. In tests across eight synthetic projects, VibeGuard hit 100% recall and 89.47% precision. Legit Security has since launched a commercial product on the same idea.
AI can write code, and often faster than humans. The problem is the gap between how fast AI writes code and how slowly the industry reviews it. Until automated tooling, enforced security defaults, and mandatory review gates close that gap, every vibe-coded app in production is piling up risk. Someone will pay for it.