AI Dependency Tools Hallucinate Security Fixes, Leaving Vulnerabilities in Production
Sonatype study finds frontier LLMs recommend nonexistent software patches and insecure library versions, creating hidden technical debt across AI stacks.

Large language models tasked with recommending software dependency upgrades are systematically introducing security vulnerabilities into production code, according to new research from software supply chain firm Sonatype that tested leading frontier models on real-world upgrade scenarios.
The study, published in two parts beginning in February and concluding this week, found that even the most advanced reasoning models from OpenAI, Anthropic, and Google generated hallucinated version numbers, nonexistent security patches, and upgrade paths that left critical vulnerabilities unaddressed. In the initial study, nearly 28 percent of dependency recommendations from OpenAI's GPT-5 were complete fabrications; newer models, including GPT-5.2, Claude Sonnet 3.7 and 4.5, Claude Opus 4.6, and Gemini 2.5 Pro and 3 Pro, showed improvement but still produced significant error rates.
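One practical guardrail these findings suggest is to verify any model-suggested version against the package registry before applying it. Below is a minimal, hypothetical Python sketch of that check; the registry snapshot, package name, and version numbers are invented for illustration, and a real tool would query a live index such as PyPI's JSON API rather than a hard-coded set.

```python
# Hypothetical guardrail: reject a dependency upgrade whose version
# string does not correspond to a real release. The registry snapshot
# below is hard-coded for illustration only; a production tool would
# query a package index instead.

KNOWN_VERSIONS = {
    "example-lib": {"1.0.0", "1.1.0", "1.2.3"},
}

def validate_suggestion(package: str, version: str) -> bool:
    """Return True only if the suggested version actually exists."""
    return version in KNOWN_VERSIONS.get(package, set())

# A fabricated version number is flagged instead of merged:
print(validate_suggestion("example-lib", "1.2.3"))  # True: real release
print(validate_suggestion("example-lib", "9.9.9"))  # False: hallucinated
```

The point of the check is that a hallucinated version fails a cheap existence test long before it reaches a lockfile or CI run.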
"These are the libraries used to train, fine-tune, orchestrate, and serve LLMs," the Sonatype report stated. "The irony is difficult to ignore: AI agents recommending upgrades inside the AI stack are themselves failing to avoid critical vulnerabilities in the very tools that power them."
Sonatype co-founder and CTO Brian Fox characterized the problem as a new category of technical debt that organizations often fail to detect. Errors in software dependency recommendations are "subtle, structured, and quietly becoming part of normal development work," he said, rather than obvious failures that trigger immediate review.
The research comes as reasoning models—which devote additional compute time to planning sequences of steps before generating output—have become the dominant architecture across frontier labs. The approach, which OpenAI researcher Noam Brown recently described as "surprisingly similar to AlphaGo," allows models to scale performance by using more time and computing power on difficult tasks, analogous to how harder problems take humans longer to solve.
Sonatype's experiment demonstrated that equipping GPT-5 Nano with access to a real-time version recommendation API backed by vulnerability data and trust scores significantly reduced hallucinations and steered the model toward safer upgrade paths. "Grounding doesn't just prevent hallucinations; it steers the model toward versions with fewer known vulnerabilities when a perfect option doesn't exist," the report noted.
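The selection rule the report describes can be sketched in a few lines. Everything below is hypothetical: the candidate versions, CVE counts, and trust scores are invented, and a real system would pull them from a live vulnerability feed. The sketch only illustrates the fallback behavior quoted above: when no clean version exists, prefer the candidate with the fewest known vulnerabilities.

```python
# Hypothetical grounding step: instead of trusting a model's free-text
# upgrade suggestion, rank real candidate versions by known-CVE count
# (fewest first) and break ties with a trust score (highest first).
# All data below is invented for illustration.

candidates = [
    {"version": "2.0.1", "cve_count": 3, "trust": 0.90},
    {"version": "2.1.0", "cve_count": 1, "trust": 0.80},
    {"version": "2.1.1", "cve_count": 1, "trust": 0.95},
]

def pick_safest(versions):
    """Choose the version with fewest known CVEs; ties go to higher trust."""
    return min(versions, key=lambda v: (v["cve_count"], -v["trust"]))

best = pick_safest(candidates)
print(best["version"])  # 2.1.1: no clean option exists, so least-vulnerable wins
```

Constraining the model's output to ranked, verifiable candidates is what makes the grounded setup degrade gracefully when a perfect upgrade path is unavailable.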
The dependency hallucination findings add to mounting evidence that generative AI deployments are struggling to deliver measurable value. A widely cited MIT study from mid-2025 found that roughly 95 percent of generative AI investments had yet to produce meaningful returns, while analysts labeled the final quarter of that year the "Great Decoupling," a period when markets began penalizing companies unable to demonstrate clear revenue or efficiency gains from AI spending. Industry research firm Gartner reported that more than half of surveyed companies lacked data environments prepared to support AI at scale, effectively blocking pilot programs from reaching production deployment.
Sources
https://www.darkreading.com/application-security/ai-powered-dependency-decisions-security-bugs
Detailed Sonatype study findings on hallucination rates and grounding experiments across frontier models
https://www.theatlantic.com/technology/2026/03/alphago-ai-boom/686618/
Contextualizes reasoning model architecture as analogous to AlphaGo's planning approach
https://www.fsrmagazine.com/feature/restaurants-rethink-ai-strategy-as-costs-rise-and-results-lag/
Broader AI deployment challenges including MIT study on investment returns and data readiness gaps
https://www.news-medical.net/news/20260401/New-method-improves-reliability-of-protein-language-model-predictions.aspx
Parallel reliability challenges in biological language models and prediction uncertainty
