A massive audit of 2.5 million biomedical papers reveals a twelvefold increase in AI-hallucinated citations, highlighting a dangerous gap in enterprise AI security and data integrity.
Biomedical research is quietly accumulating a structural integrity problem that most publishers are not equipped — or motivated — to fix. A large-scale audit of 2.5 million biomedical papers, conducted by Columbia University and partner institutions, found that fabricated references have increased more than twelvefold since 2023, with 98% of affected papers receiving no publisher response. The citations look legitimate. They match the paper's topic, mirror the formatting conventions of the journal, and slot into reference lists without triggering any existing detection system. They are, in the precise technical sense, AI-hallucinated citations — and they are now embedded in the literature that informs clinical guidelines.
This is not a hypothetical risk. It is a documented, accelerating failure mode at the intersection of generative AI deployment and scientific publishing infrastructure — and it represents one of the most underexamined enterprise AI security risks in any regulated domain.
The Audit: What 2.5 Million Papers Revealed
The Columbia University-led analysis represents one of the most comprehensive systematic examinations of AI-generated content contamination in academic literature to date. Researchers cross-referenced citation records across 2.5 million biomedical papers, checking whether cited works actually exist, whether they say what the citing paper claims, and whether the cited authors, journals, volumes, and page numbers resolve to real publications.
The findings are stark:
Fabricated references increased more than twelvefold since 2023, yet 98% of papers containing hallucinated citations received no corrective action from publishers.
The twelvefold figure is not a marginal statistical artifact. It tracks closely with the mass adoption of large language model writing assistants in academic workflows following the public release of ChatGPT in late 2022 and of GPT-4 and subsequent models in 2023. Researchers using LLMs to draft, edit, or expand manuscript sections are, in many cases, unknowingly inheriting fabricated references that the model confidently generates to fill bibliographic gaps.
What makes this particularly difficult to detect is the topical coherence of the hallucinated citations. Unlike early AI errors — which might produce obviously wrong author names or nonexistent journal titles — modern LLM hallucinations generate references that are plausible in every surface dimension. The fabricated paper title sounds like something that would exist. The journal name is real. The volume and issue numbers are plausible. The author names may even belong to real researchers in the relevant field. Only a direct lookup — checking whether the DOI resolves, whether the paper appears in PubMed or Scopus, whether the cited content actually supports the claim — exposes the fabrication.
Why Publishers Are Failing to Catch This
The 98% non-response rate is not primarily a story about publisher negligence, though that is a factor. It is a story about infrastructure misalignment: the tools publishers use to verify submissions were not designed for this threat model.
Reference Verification Is Not Standard Practice
Most journals do not systematically verify that every cited reference exists and says what the author claims. Peer reviewers are expected to flag obviously wrong citations in their area of expertise, but reviewers are not librarians conducting systematic audits. They read for scientific validity, methodological soundness, and contribution to the field — not for bibliographic integrity across dozens of references.
Some publishers use automated tools to check for duplicate submissions, plagiarism, and image manipulation. Reference hallucination detection is largely absent from these pipelines. The tools that do exist — such as reference validation plugins in manuscript submission systems — typically check formatting compliance, not existence or accuracy.
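The gap between formatting compliance and existence can be shown with a minimal Python sketch. The regular expression and the sample identifier below are assumptions made for illustration. They do not reproduce any particular submission system's validator, and the DOI is invented rather than taken from the audit.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
# This checks the shape of the string only.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(value: str) -> bool:
    """Return True if the string is formatted like a DOI. Says nothing about existence."""
    return bool(DOI_SHAPE.match(value.strip()))

# Hypothetical, invented identifier used only to show that a shape check passes.
invented = "10.9999/example.2024.00123"
print(looks_like_doi(invented))     # well-formed, so the format check passes
print(looks_like_doi("not-a-doi"))  # malformed, so the format check fails
```

A well-formed string passes whether or not any registry has ever heard of it, which is why a formatting check alone cannot surface a hallucinated reference.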
The Scale Problem
Biomedical publishing operates at enormous volume. PubMed indexes over 36 million citations, with approximately 1.5 million new records added annually. Even if a publisher wanted to manually verify every reference in every submission, the labor cost would be prohibitive. This is precisely the environment in which AI-assisted fraud scales fastest: the attack surface is vast, the verification infrastructure is thin, and the cost of insertion is near zero for a researcher using an LLM.
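A rough back-of-envelope sketch of the labor cost follows. Only the 1.5 million records per year comes from the text. The references-per-paper, minutes-per-lookup, and hours-per-year values are illustrative guesses, not measured data, and should be replaced with real figures before drawing any conclusion.

```python
def manual_verification_fte(papers_per_year: int,
                            refs_per_paper: int,
                            minutes_per_ref: float,
                            hours_per_fte_year: float = 2000.0) -> float:
    """Full-time staff needed to hand-check every reference in one year's papers."""
    total_hours = papers_per_year * refs_per_paper * minutes_per_ref / 60.0
    return total_hours / hours_per_fte_year

# 1.5 million records per year is the figure from the text.
# 40 references per paper and 3 minutes per lookup are assumptions for illustration.
fte = manual_verification_fte(1_500_000, refs_per_paper=40, minutes_per_ref=3)
print(f"{fte:,.0f} full-time checkers")  # 1,500 full-time checkers
```

Under those assumed inputs, checking existence alone would occupy about 1,500 people full time, before anyone reads a cited paper to see whether it supports the claim.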
Incentive Structures That Discourage Retraction
When fabricated citations are identified, the path to correction is slow and institutionally uncomfortable. Retractions carry reputational costs for journals, not just authors. Editors face pressure to minimize the appearance of quality failures. The result is a documented tendency toward corrections (which are less visible than retractions) or, in 98% of cases, no action at all.
This is consistent with broader patterns in retraction science. Analyses of the Retraction Watch Database have long documented that the average time between publication and retraction is over two years, and that many papers continue to accumulate citations even after retraction notices are issued.
Clinical Consequences: When Evidence Bases Are Corrupted
The stakes in biomedical research are categorically different from those in, say, economics or literary criticism. Clinical guidelines — the documents that determine how physicians treat patients — are built on systematic reviews and meta-analyses that aggregate evidence from primary literature. If that primary literature contains fabricated citations pointing to studies that do not exist or do not support the claimed findings, the downstream effects can propagate into treatment protocols.
Consider the mechanism:
- A researcher uses an LLM to assist in drafting a review article on, for example, drug interaction thresholds in elderly patients.
- The LLM generates several plausible-sounding citations to support a dosing claim.
- The researcher, under time pressure, does not verify each reference individually.
- The paper passes peer review, where reviewers assume bibliographic accuracy.
- The paper is indexed in PubMed and cited in subsequent systematic reviews.
- A clinical guideline committee incorporates the systematic review into updated treatment recommendations.
At no point in this chain does any automated system flag the original fabrication. The hallucinated citation has, in effect, laundered itself into clinical evidence.
This is not a theoretical chain of events. The Columbia University audit identified papers that have already been cited in downstream literature — meaning the contamination is not contained at the point of origin.
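The propagation chain described above can be sketched as a breadth-first walk over a "cited by" graph. This is an illustrative sketch, not the audit's method. The paper identifiers are hypothetical, and the graph is assumed to come from some citation index. The result is an upper bound on exposure, since a citing paper does not necessarily rely on the fabricated reference.

```python
from collections import deque

def downstream_of(contaminated: set[str], cited_by: dict[str, list[str]]) -> dict[str, int]:
    """Breadth-first walk over a 'cited by' graph.

    Returns each reachable paper mapped to its distance (in citation hops)
    from the nearest contaminated paper. The contaminated papers themselves
    are excluded from the result.
    """
    distance: dict[str, int] = {p: 0 for p in contaminated}
    queue = deque(contaminated)
    while queue:
        paper = queue.popleft()
        for citing in cited_by.get(paper, []):
            if citing not in distance:
                distance[citing] = distance[paper] + 1
                queue.append(citing)
    return {p: d for p, d in distance.items() if p not in contaminated}

# Hypothetical identifiers mirroring the chain above.
cited_by = {
    "review_article": ["systematic_review"],
    "systematic_review": ["clinical_guideline"],
    "unrelated_paper": ["other_review"],
}
print(downstream_of({"review_article"}, cited_by))
# {'systematic_review': 1, 'clinical_guideline': 2}
```

The hop count gives a simple triage order: documents one hop away are the first candidates for re-checking, and anything that reaches a guideline deserves priority regardless of distance.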
Enterprise AI Security Risks: The Broader Frame
The biomedical citation crisis is a specific instance of a general class of enterprise AI security risk that organizations across industries are beginning to confront: the problem of LLM output laundering.
In enterprise contexts, this risk manifests when AI-generated content — which may contain hallucinated facts, fabricated sources, or confidently wrong technical claims — enters workflows that treat it as verified information. In financial services, this might mean an AI-drafted compliance memo citing a regulation that does not exist. In legal contexts, it might mean a brief citing a case that was never decided. In biomedical research, it means fabricated studies entering the evidence base.
The common thread is that LLMs do not distinguish between what they know and what they confabulate. They generate fluent, contextually appropriate text regardless of whether the underlying facts are accurate. In low-stakes contexts, this is an inconvenience. In regulated, high-consequence domains, it is a systemic vulnerability.
Three Dimensions of Enterprise Exposure
1. Trust Chain Contamination: Once hallucinated content enters a trusted repository (a published journal, a regulatory filing, a compliance database), downstream consumers inherit the error with a false confidence premium. The source's authority masks the content's unreliability.
2. Detection Asymmetry: The cost of inserting hallucinated content (near zero, via LLM) is orders of magnitude lower than the cost of detecting and removing it. This asymmetry favors contamination at scale.
3. Audit Trail Opacity: Most enterprise AI deployments do not maintain granular logs of which outputs were generated by LLMs versus human authors. When a hallucination is eventually identified, tracing it to its origin, and assessing how far it has propagated, is often impossible.
What Effective Mitigation Looks Like
The Columbia University findings implicitly define the requirements for any credible response. Several approaches are technically feasible today, though none is currently deployed at scale in biomedical publishing.
Automated Reference Existence Verification
Every submitted manuscript's reference list should be automatically checked against PubMed, Crossref, Scopus, and other canonical databases at submission time. DOI resolution, author name matching, and journal volume/issue validation can be performed programmatically. This would catch a significant fraction of hallucinated citations before peer review begins.
Some preprint servers and submission platforms are beginning to explore this, but it is not yet standard practice at major journals.
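A minimal sketch of a submission-time existence check against the public Crossref REST API might look like the following. The reference dictionary layout, the status labels, and the substring title comparison are assumptions made for illustration. A production pipeline would also need PubMed and Scopus lookups, fuzzy matching on authors and volume, and rate-limit handling.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional

CROSSREF_WORKS = "https://api.crossref.org/works/"

def fetch_crossref(doi: str) -> Optional[dict]:
    """Return Crossref metadata for a DOI, or None if Crossref has no record (HTTP 404)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # outages or rate limits are not evidence of fabrication

def check_reference(ref: dict,
                    fetch: Callable[[str], Optional[dict]] = fetch_crossref) -> str:
    """Classify one reference as 'ok', 'not_found', 'mismatch', or 'no_doi'."""
    doi = ref.get("doi")
    if not doi:
        return "no_doi"
    record = fetch(doi)
    if record is None:
        return "not_found"
    # Crossref returns the title as a list of strings.
    registered = " ".join(record.get("title", [])).casefold()
    claimed = ref.get("title", "").casefold()
    # Crude comparison (assumption): the claimed title must appear in the registered one.
    return "ok" if claimed and claimed in registered else "mismatch"

# Offline usage with a stub lookup table standing in for the network call.
# The DOIs and titles here are invented for the example.
stub = {"10.1000/real": {"title": ["A Real Study of Dosing"]}}
print(check_reference({"doi": "10.1000/real", "title": "a real study of dosing"}, stub.get))
print(check_reference({"doi": "10.1000/missing", "title": "Anything"}, stub.get))
```

Passing the lookup function as a parameter keeps the classification logic testable without network access. The "mismatch" status matters as much as "not_found", because a hallucinated reference can borrow a real DOI that belongs to a different paper.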
Semantic Claim-Citation Alignment
More sophisticated — and more technically demanding — is checking whether a cited paper actually supports the claim for which it is cited. This requires retrieving the cited paper's abstract or full text and using NLP or LLM-based verification to assess alignment. This is an active research area, with tools like Semantic Scholar's citation context analysis providing early-stage infrastructure.
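As a deliberately crude stand-in for the NLP or LLM-based verification described above, the sketch below scores lexical overlap between a claim and a cited abstract. The stopword list, the 0.5 threshold, and the example sentences are all assumptions. Word overlap cannot tell whether a paper on the same topic supports or contradicts a claim, so this only illustrates where such a check would sit in a pipeline.

```python
import re

# Small illustrative stopword list (assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "for",
             "with", "is", "are", "was", "were", "that", "by", "we"}

def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def overlap_score(claim: str, abstract: str) -> float:
    """Fraction of the claim's content words that also appear in the abstract (0.0 to 1.0)."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(abstract)) / len(claim_words)

def needs_human_review(claim: str, abstract: str, threshold: float = 0.5) -> bool:
    """Flag the claim-citation pair when overlap falls below the (assumed) threshold."""
    return overlap_score(claim, abstract) < threshold

# Invented example text, not drawn from any real paper.
claim = "Warfarin dose should be reduced in elderly patients"
related = "We studied warfarin dose requirements in elderly patients and found reduced requirements."
unrelated = "Soil microbial diversity in alpine grasslands"
print(needs_human_review(claim, related))
print(needs_human_review(claim, unrelated))
```

A real system would replace `overlap_score` with an entailment or LLM-based judgment over the retrieved abstract or full text. The surrounding structure (retrieve, score, route low scores to a human) would stay the same.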
AI Disclosure Requirements With Audit Hooks
Journals requiring authors to disclose LLM use in manuscript preparation is a necessary but insufficient step. More useful would be structured disclosure with audit hooks: authors specifying which sections were LLM-assisted, enabling editors to apply heightened reference scrutiny to those sections specifically.
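One way structured disclosure with audit hooks could be represented is sketched below. The `Section` record and its field names are hypothetical and do not correspond to any journal's actual disclosure schema. The point is only that a per-section flag lets an editor compute which references deserve extra checking.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    """Hypothetical disclosure record for one manuscript section."""
    name: str
    llm_assisted: bool
    cited_refs: list[str] = field(default_factory=list)

def refs_for_heightened_scrutiny(sections: list[Section]) -> list[str]:
    """Reference IDs cited in any LLM-assisted section, deduplicated, in first-seen order."""
    seen: dict[str, None] = {}
    for section in sections:
        if section.llm_assisted:
            for ref in section.cited_refs:
                seen.setdefault(ref, None)
    return list(seen)

# Invented manuscript layout for illustration.
manuscript = [
    Section("Introduction", llm_assisted=True, cited_refs=["ref1", "ref2"]),
    Section("Methods", llm_assisted=False, cited_refs=["ref3"]),
    Section("Discussion", llm_assisted=True, cited_refs=["ref2", "ref4"]),
]
print(refs_for_heightened_scrutiny(manuscript))  # ['ref1', 'ref2', 'ref4']
```

This depends on honest disclosure, so it narrows where scrutiny is applied first rather than replacing a check of the full reference list.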
Publisher Liability Frameworks
Ultimately, the 98% non-response rate reflects an absence of accountability. Regulatory frameworks in the EU and, increasingly, in U.S. federal research funding contexts are beginning to address AI-generated content in research. The FDA's evolving guidance on AI in clinical research submissions is one signal that liability frameworks may eventually reach the publishing layer.
The Compounding Timeline
The twelvefold increase since 2023 is not a plateau — it is a trajectory. LLM adoption in academic writing continues to accelerate. Models are becoming more capable of generating topically coherent, superficially credible hallucinations. The gap between fabrication capability and detection capability is widening, not narrowing.
The Columbia University audit provides a baseline. Without structural intervention at the publishing infrastructure level, the 2026 figures will likely make the 2023-to-2025 growth curve look modest. The clinical evidence base is not a static archive — it is a living system that practitioners depend on in real time. Contamination that compounds unchecked does not stay contained in academic databases. It moves into guidelines, into formularies, into treatment decisions.
The 98% non-response rate is not just a publishing statistic. It is a measure of how far the enterprise AI security risk has already advanced before the systems responsible for containing it have meaningfully engaged.
Sources:
- Columbia University / partner institution audit via The Decoder: https://the-decoder.com/ai-hallucinated-citations-are-creeping-into-papers-that-shape-clinical-guidelines-researchers-warn/
- Retraction Watch Database: https://retractionwatch.com
- Semantic Scholar: https://www.semanticscholar.org
- PubMed / NCBI: https://pubmed.ncbi.nlm.nih.gov
- Crossref DOI Resolution Infrastructure: https://www.crossref.org
Last reviewed: May 27, 2026



