A new benchmark—SUSVIBES—has just exposed something the industry has been quietly ignoring: our growing obsession with “vibe coding” is producing software that works… but is dangerously insecure.
Vibe coding is the increasingly common workflow where engineers hand off complex tasks to LLM agents, give a few nudges, and trust the model to “figure out the rest.” The problem isn’t that these agents can’t build features. They absolutely can. The problem is that they build those features with the security mindset of a beginner on a caffeine crash.

What the Numbers Really Show
Across 200 real-world software engineering tasks, researchers tested leading agents and base models—Claude 4 Sonnet, Gemini 2.5 Pro, and Kimi K2—using frameworks like SWE-AGENT, OPENHANDS, and Claude Code. The results were blunt:
1. The Code Works. The Code Is Insecure.
The strongest setup (SWE-AGENT + Claude 4 Sonnet) produced functionally correct solutions for 61% of tasks.
But over 80% of those “correct” solutions contained security vulnerabilities.
This is vibe coding in a sentence: great demos, terrible defenses.
2. Even the “secure” agents fail most of the time
The best security score—OPENHANDS + Claude 4 Sonnet—achieved a 12.5% SECPASS rate.
Even then, 74.7% of its functionally correct solutions were still insecure.
The average SECPASS across all agents? Around 10%.
3. The vulnerabilities aren’t trivial
These aren’t cosmetic issues. They include:
- Missing input sanitization → CRLF attacks, header injection
- Missing timing defenses → username enumeration via timing side channels
- Weak patch logic → reintroducing vulnerabilities in other files
This is the kind of code that breaks in production and gets exploited in the real world.
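The first two failure modes above are easy to see in miniature. The sketch below is illustrative only, not code from the benchmark; the function names are hypothetical, and it contrasts a naive pattern with the standard defense in each case.

```python
import hmac
import hashlib

# 1. Header injection via missing input sanitization.
# Echoing user input into an HTTP header without stripping CR/LF lets an
# attacker terminate the header line and smuggle extra headers.
def set_redirect_header_unsafe(target: str) -> str:
    # "x\r\nSet-Cookie: ..." would split the response into two headers.
    return f"Location: {target}"

def set_redirect_header_safe(target: str) -> str:
    # Strip CR and LF so user input cannot break out of the header value.
    cleaned = target.replace("\r", "").replace("\n", "")
    return f"Location: {cleaned}"

# 2. Username enumeration via a timing side channel.
# A plain == comparison can return as soon as a byte differs, so response
# time leaks information about how close a guess is.
def check_token_unsafe(stored: bytes, supplied: bytes) -> bool:
    return stored == supplied  # early exit on mismatch: timing leak

def check_token_safe(stored: bytes, supplied: bytes) -> bool:
    # hmac.compare_digest takes time independent of where the bytes differ.
    return hmac.compare_digest(stored, supplied)

if __name__ == "__main__":
    evil = "https://a.example\r\nSet-Cookie: session=attacker"
    print(set_redirect_header_safe(evil))  # CR/LF removed: one header line
    token = hashlib.sha256(b"secret").digest()
    print(check_token_safe(token, token))
```

Both defenses are one-liners, which is exactly the point: these are not exotic mitigations, yet the benchmark shows agents routinely omit them while still passing functional tests.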
Why SUSVIBES Matters
Existing benchmarks were too shallow. They tested single-function edits, small snippets, or toy scenarios. SUSVIBES forces the models into real-world complexity:
- multi-file repositories
- tasks requiring edits across ~172 lines of code on average
- 77 different CWE categories sourced directly from real GitHub vulnerabilities
This is the first benchmark that reflects what vibe coding actually looks like outside of marketing videos.
Why “Security Hints” Don’t Fix the Problem
Researchers also tried giving the models more guidance: generic security hints, CWE hints, even oracle-level security descriptions.
The result?
Functionality dropped. Security didn’t improve.
With added security prompts, models became hyper-focused on checks and validations—and forgot core functionality.
Functional correctness fell by up to 8.5 percentage points.
Security improved only marginally, and often not at all.
The Bigger Message
SUSVIBES exposes a problem that many teams are already feeling:
LLM agents can build features fast, but they are not thinking like security engineers.
Right now, vibe coding is shipping software that looks impressive, passes tests, and quietly carries vulnerabilities that attackers can find in seconds.
Where This Leaves Us
The study’s conclusion is uncomfortably accurate: vibe coding makes developers faster, but it also makes insecurity easier to produce and harder to detect. Until we treat security as a first-class objective—not an afterthought—we’re essentially scaling insecurity at machine speed.
Vibe coding isn’t just a workflow risk; it’s a systemic one.
Reference
S. Zhao, D. Wang, K. Zhang, J. Luo, Z. Li, and L. Li, “Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks,” arXiv preprint arXiv:2512.03262. [Accessed: Dec. 5, 2025].
