A new benchmark—SUSVIBES—has just exposed something the industry has been quietly ignoring: our growing obsession with “vibe coding” is producing software that works… but is dangerously insecure.
Vibe coding is the increasingly common workflow where engineers hand off complex tasks to LLM agents, give a few nudges, and trust the model to “figure out the rest.” The problem isn’t that these agents can’t build features. They absolutely can. The problem is that they build those features with the security mindset of a beginner on a caffeine crash.

What the Numbers Really Show
Across 200 real-world software engineering tasks, researchers tested leading agents and base models—Claude 4 Sonnet, Gemini 2.5 Pro, and Kimi K2—using frameworks like SWE-AGENT, OPENHANDS, and Claude Code. The results were blunt:
1. The Code Works. The Code Is Insecure.
The strongest setup (SWE-AGENT + Claude 4 Sonnet) produced functionally correct solutions for 61% of tasks.
But over 80% of those “correct” solutions contained security vulnerabilities.
This is vibe coding in a sentence: great demos, terrible defenses.
2. Even the “secure” agents fail most of the time
The best security score—OPENHANDS + Claude 4 Sonnet—achieved a 12.5% SECPASS rate.
Even then, 74.7% of its functionally correct solutions were still insecure.
The average SECPASS across all agents? Around 10%.
3. The vulnerabilities aren’t trivial
These aren’t cosmetic issues. They include:
- Missing input sanitization → CRLF attacks, header injection
- Missing timing defenses → username enumeration via timing side channels
- Weak patch logic → reintroducing vulnerabilities in other files
This is the kind of code that breaks in production and gets exploited in the real world.
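The first two failure modes above are easy to see in miniature. The sketch below is illustrative only, not code from the benchmark; the function names are hypothetical, and it contrasts a naive pattern with the standard defense in each case.

```python
import hmac
import hashlib

# 1. Header injection via missing input sanitization.
# Echoing user input into an HTTP header without stripping CR/LF lets an
# attacker terminate the header line and smuggle extra headers.
def set_redirect_header_unsafe(target: str) -> str:
    # "x\r\nSet-Cookie: ..." would split the response into two headers.
    return f"Location: {target}"

def set_redirect_header_safe(target: str) -> str:
    # Strip CR and LF so user input cannot break out of the header value.
    cleaned = target.replace("\r", "").replace("\n", "")
    return f"Location: {cleaned}"

# 2. Username enumeration via a timing side channel.
# A plain == comparison can return as soon as a byte differs, so response
# time leaks information about how close a guess is.
def check_token_unsafe(stored: bytes, supplied: bytes) -> bool:
    return stored == supplied  # early exit on mismatch: timing leak

def check_token_safe(stored: bytes, supplied: bytes) -> bool:
    # hmac.compare_digest takes time independent of where the bytes differ.
    return hmac.compare_digest(stored, supplied)

if __name__ == "__main__":
    evil = "https://a.example\r\nSet-Cookie: session=attacker"
    print(set_redirect_header_safe(evil))  # CR/LF removed: one header line
    token = hashlib.sha256(b"secret").digest()
    print(check_token_safe(token, token))
```

Both defenses are one-liners, which is exactly the point: these are not exotic mitigations, yet the benchmark shows agents routinely omit them while still passing functional tests.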
Why SUSVIBES Matters
Existing benchmarks were too shallow. They tested single-function edits, small snippets, or toy scenarios. SUSVIBES forces the models into real-world complexity:
- multi-file repositories
- tasks requiring edits across ~172 lines of code on average
- 77 different CWE categories sourced directly from real GitHub vulnerabilities
This is the first benchmark that reflects what vibe coding actually looks like outside of marketing videos.
Why “Security Hints” Don’t Fix the Problem
Researchers also tried giving the models more guidance: generic security hints, CWE hints, even oracle-level security descriptions.
The result?
Functionality dropped. Security didn’t improve.
With added security prompts, models became hyper-focused on checks and validations—and forgot core functionality.
Functional correctness fell by up to 8.5 percentage points.
Security improved only marginally, and often not at all.
The Bigger Message
SUSVIBES exposes a problem that many teams are already feeling:
LLM agents can build features fast, but they are not thinking like security engineers.
Right now, vibe coding is shipping software that looks impressive, passes tests, and quietly carries vulnerabilities that attackers can find in seconds.
Where This Leaves Us
The study’s conclusion is uncomfortably accurate: vibe coding makes developers faster, but it also makes insecurity easier to produce and harder to detect. Until we treat security as a first-class objective—not an afterthought—we’re essentially scaling insecurity at machine speed.
Vibe coding isn’t just a workflow risk; it’s a systemic one.
Reference
S. Zhao, D. Wang, K. Zhang, J. Luo, Z. Li, and L. Li, “Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks,” arXiv preprint arXiv:2512.03262. [Accessed: Dec. 5, 2025].
