Unsolvable Problems Expose Hidden Enterprise AI Security Risks

Published: May 18, 2026 (6 min read)

AI models are failing to recognize unsolvable tasks, creating a dangerous blind spot for businesses. Learn why confident hallucinations are a major security risk.

When an AI model confidently produces a wrong answer, that's a bug. When it confidently produces an answer to a question that has no answer, that's a category of failure most enterprise security frameworks aren't built to catch.

A new benchmark called SOOHAK has made this problem impossible to ignore. Constructed by 64 mathematicians, SOOHAK contains 439 handwritten tasks, of which 99 are deliberately unsolvable. The results are sobering: Google's Gemini 3 Pro, currently the top performer, solves only 30 percent of research-level problems correctly — and not a single model exceeds 50 percent accuracy at identifying tasks that simply cannot be solved. The implication isn't just academic. For organizations deploying AI in high-stakes workflows, this represents a concrete and underappreciated enterprise AI security risk.

Here are three reasons why this failure mode exists — and why it should change how you think about AI deployment.

1. Models Are Trained to Answer, Not to Abstain

The most fundamental reason AI systems fail to recognize unsolvable problems lies in how they are trained: they are optimized to produce outputs. Most training signals, from RLHF rewards to the benchmarks used to evaluate model quality, push the model toward generating a confident, coherent response. Saying "this problem has no solution" is, from a training dynamics perspective, almost always the losing move.

This isn't a flaw in any particular model — it's a structural property of how language models are built and evaluated. The SOOHAK findings make this concrete. According to reporting on the benchmark, increased compute improves problem-solving performance but does not improve models' ability to admit a problem has no solution. Scale, in other words, makes models better at answering — not better at knowing when not to.

For enterprise deployments, this asymmetry is dangerous. A model integrated into a legal research workflow, a financial analysis pipeline, or a clinical decision-support system will encounter edge cases — malformed queries, contradictory constraints, logically impossible requirements. The assumption that a capable model will flag these cases is, according to SOOHAK, empirically false. The model will generate something. It will look confident. And downstream systems will treat it as valid.

2. Mathematical Reasoning Is the Wrong Proxy for Epistemic Honesty

For years, math benchmarks have served as the gold standard for measuring AI reasoning. The logic is intuitive: if a model can solve competition-level mathematics, it must have a robust internal model of what constitutes a valid solution. SOOHAK dismantles this assumption.

The benchmark was designed specifically to separate two distinct capabilities: problem-solving (can the model find a correct answer?) and problem recognition (can the model determine whether a valid answer exists?). These are not the same skill, and SOOHAK's results show they don't correlate the way the field has assumed.

Gemini 3 Pro leads on the former at 30 percent — a respectable score on genuinely hard, research-level mathematics. But that performance doesn't transfer to the latter. The 99 unsolvable problems in SOOHAK expose a gap that raw reasoning benchmarks were never designed to measure.
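To make the distinction concrete, here is a minimal Python sketch of scoring the two capabilities separately. It is an illustration under stated assumptions, not SOOHAK's actual scoring code: the `Result` record, its fields, and the convention that a model signals "unsolvable" by returning no answer are all hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    solvable: bool            # ground truth: does a valid answer exist?
    expected: Optional[str]   # reference answer (None for unsolvable tasks)
    answer: Optional[str]     # model output; None means "declared unsolvable" (assumed convention)


def split_scores(results):
    """Return (solve_rate, detect_rate) computed on disjoint task subsets."""
    solvable = [r for r in results if r.solvable]
    unsolvable = [r for r in results if not r.solvable]
    # Problem-solving: fraction of solvable tasks answered correctly.
    solve_rate = (
        sum(r.answer == r.expected for r in solvable) / len(solvable)
        if solvable else 0.0
    )
    # Problem recognition: fraction of unsolvable tasks the model declined to answer.
    detect_rate = (
        sum(r.answer is None for r in unsolvable) / len(unsolvable)
        if unsolvable else 0.0
    )
    return solve_rate, detect_rate


# Made-up toy data, for illustration only.
results = [
    Result(True, "42", "42"),    # solved correctly
    Result(True, "7", "9"),      # solved incorrectly
    Result(False, None, "3"),    # confabulated an answer to an unsolvable task
    Result(False, None, None),   # correctly declared unsolvable
]
solve_rate, detect_rate = split_scores(results)
print(solve_rate, detect_rate)
```

Reporting the two numbers side by side, rather than folding them into one accuracy figure, keeps a strong solve rate from masking a weak detection rate.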

This matters for enterprise AI security because organizations routinely use benchmark performance as a proxy for deployment readiness. A model that scores well on MATH, GSM8K, or even competition-level benchmarks like AIME is assumed to be "reliable." But reliability in the sense that matters for production systems, meaning knowing the boundaries of one's own competence, is a separate capability that current evaluation frameworks don't adequately measure.

The security implication is direct: if your model evaluation process doesn't include adversarial unsolvable inputs, you have a blind spot. And blind spots in enterprise AI aren't theoretical risks — they're liability exposure.

3. Confidence Signals Are Uncalibrated for Impossibility

The third failure mode is perhaps the most insidious: AI models don't just fail to identify unsolvable problems — they fail confidently. The outputs produced for SOOHAK's 99 unsolvable tasks aren't hedged, uncertain, or flagged with low confidence. They look like answers.

This is a calibration problem with a specific character. Standard calibration research focuses on whether a model's expressed confidence matches its accuracy on solvable problems. SOOHAK introduces a different question: what happens to confidence signals when the problem is structurally impossible? The answer, based on the benchmark results, is that confidence doesn't collapse the way it should. Models continue to generate plausible-looking mathematics for problems that have no mathematical solution.
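One rough way to check for this in your own evaluations is to compare average confidence on solvable and unsolvable tasks. The Python sketch below assumes each evaluation record carries a ground-truth solvability flag and a confidence score in [0, 1]; how that score is obtained (self-reported, derived from token probabilities, or otherwise) is left open, and both the record shape and the function name are hypothetical.

```python
from statistics import mean


def confidence_gap(records):
    """Mean confidence on solvable tasks minus mean confidence on unsolvable ones.

    records: iterable of (solvable: bool, confidence: float) pairs.
    A gap near zero suggests confidence is not dropping on impossible inputs.
    """
    records = list(records)
    solvable = [c for s, c in records if s]
    unsolvable = [c for s, c in records if not s]
    if not solvable or not unsolvable:
        raise ValueError("need at least one solvable and one unsolvable record")
    return mean(solvable) - mean(unsolvable)


# Made-up toy data, for illustration only.
toy_records = [(True, 0.9), (True, 0.8), (False, 0.9), (False, 0.7)]
gap = confidence_gap(toy_records)
print(round(gap, 2))
```

A large positive gap would indicate that confidence falls when no valid answer exists. This is a coarse diagnostic, not a substitute for a full calibration analysis.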

For enterprise risk management, this is the nightmare scenario. Hallucination detection systems, human-in-the-loop review processes, and output validation pipelines are typically calibrated to catch low-confidence or incoherent outputs. A model that produces high-confidence, internally coherent, but fundamentally invalid outputs bypasses most of these controls.

Consider the attack surface this creates. A malicious actor who understands this failure mode could craft inputs — in customer service, contract review, code generation, or data analysis contexts — that are structurally unsolvable but designed to elicit confident, plausible-looking outputs. The model doesn't know it's being manipulated. The downstream system doesn't know the output is invalid. The human reviewer, if there is one, sees a confident response and moves on.

What Enterprises Should Do Differently

The SOOHAK findings don't mean AI models are unusable in enterprise settings. They mean the current evaluation and deployment practices are insufficient for the risk profile most organizations are actually operating under.

Three concrete adjustments follow from this analysis:

Audit your evaluation benchmarks. If your model selection process relies entirely on task-completion accuracy, you're missing the abstention dimension. Add adversarial unsolvable inputs to your internal evaluation suites and measure how often models correctly refuse to answer versus how often they confabulate.

Treat confident outputs as higher risk, not lower. Counterintuitively, high-confidence model outputs in ambiguous or edge-case scenarios should trigger more scrutiny, not less. Confidence calibration for impossible inputs is broken; your review processes should account for that.

Separate capability from reliability in vendor assessments. A model that leads benchmarks like SOOHAK's problem-solving component is capable. A model that can identify the boundaries of its own competence is reliable. For most enterprise use cases, you need both — and right now, the market doesn't have a clean signal for the latter.
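The second adjustment above can be sketched as a review-routing rule. This is a minimal Python illustration under assumptions: the `Output` fields, the thresholds, and the idea that an upstream check can flag an input as an edge case are all hypothetical and would need tuning against your own data.

```python
from dataclasses import dataclass


@dataclass
class Output:
    confidence: float  # model confidence in [0, 1] (assumed to be available)
    edge_case: bool    # input flagged as ambiguous or malformed by an upstream check (hypothetical)


def review_level(output, low=0.5, high=0.8):
    """Return 'mandatory', 'standard', or 'none'.

    On edge-case inputs, confidence never waives review, and high confidence
    escalates it. On ordinary inputs, only low confidence triggers review.
    """
    if output.edge_case:
        return "mandatory" if output.confidence >= high else "standard"
    return "standard" if output.confidence < low else "none"


print(review_level(Output(confidence=0.95, edge_case=True)))   # mandatory
print(review_level(Output(confidence=0.95, edge_case=False)))  # none
```

The design choice is that edge-case inputs always receive at least a standard review, so a confident-looking answer cannot talk its way past the reviewer.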

The SOOHAK benchmark, built by 64 mathematicians with 439 carefully constructed tasks, has done something valuable: it has given the field a precise, measurable way to see a failure mode that was previously invisible. The 50 percent ceiling on unsolvable-problem detection should not be treated as a footnote. Unsolvable-problem detection should become a standard part of how enterprise AI security is evaluated.

The models will keep getting better at answering. The question is whether we'll build the infrastructure to know when they shouldn't.

Sources:

The Decoder: New math benchmark reveals AI models confidently solve problems that have no solution

Last reviewed: May 18, 2026

Tags: Enterprise AI Security, AI Risk Management, LLM Evaluation, AI Governance
