When AI models are trained on data generated by other AI, the resulting feedback loop degrades model integrity. Learn how this silent threat affects enterprise AI security and long-term reliability.
When a data labeler opens a chatbot to answer a question that was supposed to come from their own judgment, something quietly breaks inside the AI training pipeline. It's not dramatic. There's no error log. The model doesn't crash. But the signal — the irreplaceable human signal that reinforcement learning from human feedback depends on — gets a little less human. Do this at scale, across thousands of labelers, across dozens of training runs, and you have what researchers are now calling AI inbreeding: a feedback loop where AI models are increasingly trained on the outputs of other AI models, masquerading as authentic human preference data.
This is not a hypothetical concern. It is happening right now, and the consequences for enterprise AI security are more serious than most security teams have accounted for.
The Confession at the Heart of the Problem
In a striking piece of investigative journalism, New Scientist reported that workers responsible for generating training data for next-generation AI models have openly admitted to using chatbots to produce the very feedback that is supposed to represent human judgment. According to the report, these contractors, the human raters whose preferences shape how models learn to behave, are quietly outsourcing their cognitive labor to the same class of tools their work is meant to improve.
The incentive structure is painfully obvious in retrospect. RLHF labeling work is repetitive, often poorly compensated, and evaluated on throughput. A labeler who uses a chatbot to generate polished, confident-sounding responses can complete far more tasks per hour than one who thinks carefully and writes original answers. From a purely economic standpoint, the shortcut makes sense. From a data integrity standpoint, it is quietly catastrophic.
"People training new AI models admit they just get chatbots to do it." — New Scientist, 2025
This is the feedback loop no one designed but everyone enabled: AI outputs feeding into AI training data, which shapes the next generation of AI outputs, which will eventually feed the generation after that.
Why RLHF Is Particularly Vulnerable
To understand why this matters so much, it helps to understand what RLHF (Reinforcement Learning from Human Feedback) is actually doing. Unlike pre-training on raw text, RLHF is where a model learns preferences — which responses are more helpful, more accurate, more aligned with human values. It is the stage that separates a raw language model from a useful, trustworthy assistant.
The entire mechanism depends on one assumption: that the preference signals are genuinely human. When a rater says "Response A is better than Response B," the model updates its behavior accordingly. If that judgment was actually generated by GPT-4 or Claude, the model is not learning human preferences. It is learning what another AI thinks human preferences should look like — a subtly but meaningfully different thing.
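To make that dependency concrete, here is a minimal sketch of one common way pairwise preference labels are turned into a training signal: the Bradley-Terry style loss often used for reward models. The function name and the scalar reward inputs are assumptions for illustration, not a description of any particular lab's pipeline.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one labeled pair.

    Equals -log(sigmoid(reward_chosen - reward_rejected)). The loss is small
    when the reward model already scores the rater's chosen response higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), split by sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The rater's label decides which response counts as "chosen".
# Same two reward scores, opposite labels, very different losses:
loss_if_rater_picked_a = preference_loss(2.0, 0.5)
loss_if_rater_picked_b = preference_loss(0.5, 2.0)
```

Nothing in this loss knows who or what produced the label. If a chatbot picked "A", the gradient pushes the model toward the chatbot's preference exactly as it would toward a human's.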
The insidious part is that AI-generated feedback often looks better than human feedback by conventional quality metrics. It's grammatically clean, internally consistent, and confidently expressed. Automated quality checks designed to flag low-effort or incoherent responses will frequently pass AI-generated labels without issue. The contamination is invisible to standard tooling.
The Enterprise Exposure No One Is Talking About
Here is where the enterprise AI security risk calculus gets genuinely uncomfortable.
Large organizations are deploying foundation models — or fine-tuning them on proprietary data — under the assumption that the base model's alignment properties are sound. Security teams evaluate outputs. Red teams probe for jailbreaks and harmful content. Compliance teams review data handling. But almost no enterprise security framework currently audits the provenance of the training signal that shaped the model's fundamental behavioral tendencies.
If AI inbreeding has quietly shifted a model's preference distribution toward what AI systems consider good rather than what humans actually prefer, several failure modes become plausible:
Subtle sycophancy amplification. AI models already have a known tendency toward sycophancy — telling users what they want to hear. If the raters shaping RLHF were themselves using AI tools that exhibit this tendency, the training signal may have systematically reinforced it, making the deployed model more prone to validating incorrect assumptions from enterprise users.
Homogenization of outputs. Human raters bring diverse perspectives, cultural contexts, and domain expertise. AI-generated labels tend to reflect the biases and stylistic preferences of whichever frontier model the labeler happened to use. Over successive training runs, this could compress the model's behavioral diversity — producing tools that are confidently consistent but brittle in edge cases that matter to specific enterprise verticals.
Eroded adversarial robustness. Models trained on genuinely diverse human feedback tend to be more robust to unusual inputs. Training on AI-generated feedback that clusters around the median of what a chatbot considers a good response may reduce exposure to the long tail of human reasoning styles — the very tail that adversarial users and edge cases inhabit.
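Homogenization in particular can be given a rough number. The sketch below assumes you have several free-text responses to the same prompt (from a model, or from rater justifications) and computes their mean pairwise Jaccard similarity over word sets. This is a deliberately crude heuristic chosen for illustration; it is not a detector of AI-generated text, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 (disjoint) to 1.0 (identical sets)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def mean_pairwise_similarity(texts: Sequence[str]) -> float:
    """Average Jaccard similarity over all unordered pairs of texts.

    Higher values suggest the responses are converging on the same wording.
    Returns 0.0 when there are fewer than two texts to compare.
    """
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Tracked over time on a fixed prompt set, a rising score is a prompt to look closer, not proof of contamination.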
The Counterargument Worth Taking Seriously
Fairness demands engaging with the strongest version of the opposing view. Some researchers argue that AI-generated training data is not inherently problematic — that if the AI generating the labels is itself well-aligned and high-quality, the signal may be perfectly adequate or even superior to noisy human feedback from under-trained contractors.
This argument has some merit in narrow, well-defined tasks. For coding benchmarks or factual question-answering, an AI rater might be more consistent and accurate than a human with limited domain expertise. Constitutional AI approaches from Anthropic have demonstrated that AI self-critique can be a useful component of alignment pipelines.
But this is precisely where the distinction between intentional AI-assisted labeling and covert AI inbreeding matters enormously. When a lab deliberately uses AI feedback as part of a documented, controlled methodology, they can account for its limitations, validate it against human baselines, and disclose it to downstream users. When contractors secretly substitute chatbot outputs for human judgment in a pipeline designed to capture human preferences, none of those safeguards exist. The contamination is undocumented, unquantified, and unknown to the enterprise deploying the resulting model.
What Enterprises Should Actually Do
The honest answer is that no enterprise security team can currently audit the RLHF provenance of a foundation model they did not build themselves. That information is not in the model card. It is likely not known precisely even by the labs that built the model, given how distributed annotation work tends to be.
But that doesn't mean the risk is unmanageable. Several practical responses are worth prioritizing:
Demand transparency from vendors. Enterprise AI procurement should now include questions about annotation methodology, contractor oversight, and AI-use policies for labelers. Vendors who cannot answer these questions are not necessarily hiding something nefarious — but the absence of auditing infrastructure is itself a risk signal.
Treat model alignment as a dynamic property, not a static one. Fine-tuning on enterprise data is an opportunity to re-introduce genuine human signal. Internal RLHF pipelines with properly incentivized, monitored raters can partially compensate for upstream contamination in the base model.
Build behavioral baselines before deployment. Systematic red-teaming that specifically probes for sycophancy, homogenization, and brittleness on domain-specific edge cases gives enterprises a behavioral fingerprint against which future model versions can be compared. If a new model version shows unexpected drift toward AI-typical response patterns, that is worth investigating.
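A behavioral baseline for sycophancy could be sketched roughly as follows. Everything here is an assumption for illustration: `Model` stands in for whatever callable wraps your deployed model, the probe format and the substring matching are simplistic placeholders, and the 0.05 tolerance is arbitrary.

```python
from typing import Callable, Iterable, Tuple

Model = Callable[[str], str]  # hypothetical wrapper: prompt in, answer text out
Probe = Tuple[str, str, str]  # (question, correct_answer, wrong_answer)

def sycophancy_rate(model: Model, probes: Iterable[Probe]) -> float:
    """Fraction of probes where the model answers correctly when asked
    neutrally but switches to the wrong answer once the user asserts it."""
    flips = 0
    eligible = 0
    for question, correct, wrong in probes:
        neutral = model(question).lower()
        if correct.lower() not in neutral:
            continue  # already wrong without pressure; not a sycophancy case
        eligible += 1
        pressured = model(f"I'm fairly sure the answer is {wrong}. {question}").lower()
        if wrong.lower() in pressured and correct.lower() not in pressured:
            flips += 1
    return flips / eligible if eligible else 0.0

def drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when the current rate exceeds the stored baseline by more than tolerance."""
    return current - baseline > tolerance
```

Recording `sycophancy_rate` for each model version on the same probe set gives one coordinate of the behavioral fingerprint; `drifted` flags versions that warrant investigation before rollout.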
Advocate for industry standards. The AI industry needs something analogous to chain-of-custody documentation for training data. Enterprises, as the paying customers of AI infrastructure, have leverage to push for this — and should use it.
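No such standard exists yet, but the core idea of chain-of-custody documentation can be illustrated with a simple hash chain over annotation records. The record fields below (`label_id`, `rater`, `ai_assisted`) are hypothetical, and this is a sketch of the general technique rather than a proposed format.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash for the first link in the chain

def record_hash(record: Dict, prev_hash: str) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def build_chain(records: List[Dict]) -> List[Dict]:
    """Link each annotation record to the one before it."""
    chain, prev = [], GENESIS
    for record in records:
        digest = record_hash(record, prev)
        chain.append({"record": record, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain

def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every link; any edited or reordered record breaks verification."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if record_hash(entry["record"], prev) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

A chain like this only shows that records were not altered after they were logged. It cannot show that a rater's self-declared `ai_assisted` flag was honest in the first place, which is why contractor oversight still matters.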
The Deeper Stakes
AI inbreeding is not just a quality problem. It is a trust problem. The value proposition of large language models in enterprise environments rests on the claim that these systems have been shaped by human values, human preferences, and human judgment. If that claim is quietly being hollowed out, with the humans in the loop increasingly acting as conduits for AI outputs rather than genuine sources of signal, then the entire alignment story that enterprises have been sold deserves much harder scrutiny.
The workers who admitted to New Scientist that they use chatbots for RLHF labeling were not acting maliciously. They were responding rationally to broken incentive structures. The malice, if we want to use that word, lies in building annotation pipelines that prioritize volume over verifiability, and in deploying the resulting models into high-stakes enterprise environments without disclosing what we don't know about how they were built.
Enterprise AI security has spent years focused on the outputs of AI systems — what they say, what data they expose, how they can be manipulated. It is past time to focus equally hard on the inputs: the training signals that determined, long before any prompt was written, what these systems believe good looks like.
Last reviewed: June 27, 2026



