How Reliable Is AI? What Reliability Really Means in 2026
A model that answers a question correctly 95% of the time sounds excellent. Then you ask it to do a real job (read an email, look up the customer in your CRM, draft a reply, and schedule a follow-up) and the failure rate quietly stops being 5%. If each of those four steps is 95% reliable, the whole job succeeds only about 81% of the time (0.95 × 0.95 × 0.95 × 0.95), so roughly one run in five goes wrong somewhere. That gap between “usually right on one answer” and “dependable across a whole task” is the entire story of AI reliability in 2026.
This is the question that decides whether AI stays a demo or becomes infrastructure. So let’s be precise about what reliability actually means for AI, why language models hallucinate, why AI agents are harder to make reliable than chatbots, and how you can tell whether a tool is trustworthy enough to hand real work to.
Why AI reliability is a different problem from software reliability
Traditional software is deterministic. Given the same input, a function returns the same output every time. When it breaks, it breaks the same way, so you can write a test that catches the bug, and once the bug is fixed that test stays green. Reliability engineering for normal software is mostly about uptime, latency, and not shipping regressions.
Large language models are probabilistic. They predict a probability distribution over the next token given everything before it, then sample from that distribution rather than executing a fixed rule. The same prompt can produce different answers on different runs, and a tiny change in wording can flip a correct answer into a wrong one. That non-determinism is not a bug you can patch; it is how the technology works.
This breaks the usual definition of reliability. The hard part of AI reliability, as one widely cited framing of the enterprise problem puts it, is “less ‘the model is wrong’ and more ‘we cannot tell ahead of time when it is wrong, and our regression tests don’t catch it.’” In a 2026 survey of engineering leaders, 70% named non-deterministic outputs as their number-one production-readiness barrier. A system that is usually right but unpredictably wrong, and confidently wrong when it is wrong, is harder to operate than one that fails loudly and consistently.
So when we talk about AI reliability, we mean something more demanding than accuracy on a benchmark. We mean: does it do the right thing consistently, across many runs, under messy real-world inputs — and does it know when it doesn’t know?
Can’t you just turn the randomness off?
A common reaction is: if non-determinism is the problem, set the temperature to zero and make the model deterministic. It helps, but it doesn’t solve the problem, for two reasons.
First, even at temperature zero, identical outputs aren’t fully guaranteed in practice. Floating-point arithmetic on parallel hardware, batching, and changes to the model or its serving infrastructure can all produce different results for the same prompt. “Deterministic” sampling reduces variance; it does not eliminate it.
Second, and more importantly, low temperature makes a model consistent, not correct. If the model is confidently wrong, turning down the randomness just makes it reliably wrong — it will hallucinate the same fake citation every single time. Determinism settings are a useful tool for tasks that have one right answer and need to be repeatable, but they treat the symptom (variance) rather than the disease (the model’s willingness to assert things it doesn’t know). Reliability has to be built above the model, not only tuned inside it.
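To make the temperature point concrete, here is a minimal sketch of how temperature reshapes a next-token distribution. The token names and logit values are invented for illustration, and a real model works over a vocabulary of many thousands of tokens rather than three.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive (T=0 is usually served as plain argmax)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: the model slightly prefers a fabricated citation.
tokens = ["fake_citation", "real_citation", "i_dont_know"]
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```

In this toy setup, lowering the temperature never changes which token is on top. It only concentrates probability on the token the model already preferred, which here is the wrong one.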
Hallucinations: why models confidently make things up
A hallucination is when a model generates information that is fluent, plausible, and wrong — a fake citation, an invented policy, a customer who doesn’t exist. The unsettling part is that hallucinations don’t look like errors. They look exactly like correct answers, because the model is doing the same thing it always does: producing the most statistically likely continuation.
The most important recent explanation of why came from OpenAI. In their September 2025 paper Why Language Models Hallucinate, researchers argue that hallucinations are not a mysterious glitch but a predictable consequence of how models are trained and graded. Two mechanisms stand out:
- Some errors are statistically unavoidable. During pretraining, a model learns the distribution of language. For facts that appear rarely or only once in the training data, the math of next-token prediction guarantees the model will sometimes guess wrong — there simply isn’t enough signal to pin the fact down.
- Evaluation rewards confident guessing over honesty. Most benchmarks score a model on accuracy and give zero credit for saying “I don’t know.” Under that scheme, guessing is the optimal test-taking strategy, exactly the way a student guesses on a multiple-choice exam rather than leaving it blank. As OpenAI’s summary of the work puts it, the dominant scoreboards “reward lucky guesses over honest acknowledgments of uncertainty.” Models that hallucinate fluently outscore models that admit doubt — so we train the fluency in.
That second point reframes hallucination from a data-quality problem into an incentive problem. We built leaderboards that punish abstention, and we got models that would rather invent an answer than withhold one.
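The incentive argument can be checked with a few lines of arithmetic. The sketch below compares the expected score of guessing against abstaining under two grading schemes; the penalty of one point for a wrong answer is an illustrative assumption, not a scheme taken from the paper.

```python
def expected_score_if_guessing(p_correct, wrong_penalty=0.0):
    """Expected score on one question: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def should_guess(p_correct, wrong_penalty=0.0, abstain_score=0.0):
    """Guessing is the better strategy only if it beats the score for abstaining."""
    return expected_score_if_guessing(p_correct, wrong_penalty) > abstain_score

# Accuracy-only grading: even a 10%-confident guess beats saying "I don't know".
print(should_guess(0.10, wrong_penalty=0.0))  # True

# Hypothetical grading that costs 1 point per wrong answer:
# guessing pays off only above 50% confidence.
print(should_guess(0.10, wrong_penalty=1.0))  # False
print(should_guess(0.60, wrong_penalty=1.0))  # True
```

With a penalty of w points for a wrong answer and zero for abstaining, guessing has positive expected value only when the model's confidence exceeds w / (1 + w). At w = 0 that threshold is zero, so a model graded this way should never abstain.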
How often do models actually hallucinate?
The honest answer is: it depends heavily on the task, and the numbers are not as reassuring as the headlines about ever-smarter models suggest.
The most consistent public measurement is Vectara’s Hallucination Leaderboard, which tests a narrow, fair task: given a source document, summarize it without adding facts that aren’t there. This is grounded summarization — the easiest possible setting, because the correct information is sitting right in front of the model. On the original dataset, the best models score remarkably well, with leading models hallucinating under 2% of the time.
But Vectara’s refreshed late-2025 benchmark, using longer documents (up to 32K tokens) across law, medicine, finance, and technology — closer to what enterprise systems actually face — told a harsher story. On the harder data, several frontier reasoning models, including GPT-5, Claude Sonnet 4.5, Grok-4, and Gemini-3-Pro, all exceeded 10% hallucination rates, with one fast-reasoning variant hitting 20.2%. Notably, reasoning-tuned models often did worse at staying faithful to a source — more elaboration meant more opportunities to drift.
Read that carefully. Even when you hand a model the exact source material and ask it only to summarize, several frontier systems still added unsupported claims more than one time in ten on realistic documents. Now imagine a task where the answer isn’t sitting in front of the model.
When hallucinations meet the real world
These aren’t academic concerns. Two cases have become the standard cautionary tales.
In Moffatt v. Air Canada, decided by British Columbia’s Civil Resolution Tribunal in February 2024, an airline customer was told by Air Canada’s website chatbot that he could claim a bereavement discount retroactively after booking. That was false; the real policy, linked elsewhere on the same site, said the opposite. Air Canada argued the chatbot was a “separate legal entity” responsible for its own statements. The tribunal rejected that outright, found the company liable for negligent misrepresentation, and held that a business is responsible for everything on its website — chatbot output included. The damages were small (about CA$650), but the precedent was not: you own what your AI says.
In Mata v. Avianca, two New York lawyers used ChatGPT to draft a legal brief and filed it with the court. The brief cited multiple cases that did not exist — complete with fabricated quotations and citations. When opposing counsel and the judge couldn’t find the cases, the truth came out. In June 2023, Judge P. Kevin Castel sanctioned the attorneys and their firm $5,000, finding they had acted in bad faith. The cases have since multiplied: courts worldwide now regularly catch AI-fabricated citations in filings.
The common thread is that the AI was fluent and confident in both cases. Nobody double-checked, because the output looked right. That is precisely what makes hallucination a reliability problem and not just an accuracy footnote.
There’s a deeper governance point buried in the Air Canada ruling, and researchers have given it a name: the accountability gap. When an AI system acts on a company’s behalf, who is responsible when it’s wrong? The company that deployed it would like to say “the AI did that, not us.” Courts are saying the opposite — that you cannot delegate liability to a piece of software you chose to put in front of customers. As a Springer analysis of the case frames it, AI introduces an agency-and-responsibility gap that the law is now closing in the deployer’s disfavor. The practical consequence for anyone shipping an AI product: reliability isn’t just a quality metric, it’s a liability surface. Every confident hallucination your system emits is one you may have to answer for.
AI agent reliability: where the real difficulty lives
A chatbot produces one answer, and a human reads it. An AI agent takes many steps on its own — calling tools, reading data, making decisions, acting on the results — often without a human checking each one. That changes the reliability math completely, and it’s why “are AI agents reliable?” is a genuinely harder question than “is this model accurate?”
The compounding-error problem
Here is the single most important idea in agent reliability. If a task requires a chain of steps, and each step succeeds independently with probability p, then the end-to-end success rate is p raised to the number of steps. Reliability that looks fine per-step collapses across a chain.
| Per-step reliability | 3-step task | 5-step task | 10-step task | 20-step task |
|---|---|---|---|---|
| 90% | 73% | 59% | 35% | 12% |
| 95% | 86% | 77% | 60% | 36% |
| 99% | 97% | 95% | 90% | 82% |
| 99.9% | 99.7% | 99.5% | 99% | 98% |
A 95%-reliable step — which sounds great — yields only a 60% chance of finishing a 10-step task without a single error. To get a 10-step task to 90% end-to-end, every individual step has to hit 99%. This is sometimes called the “march of nines”: each additional nine of per-step reliability takes roughly as much engineering effort as the last, but the payoff is enormous because the errors multiply. Most agent demos look magical on a three-step happy path and fall apart on the twelve-step real workflow precisely because of this curve.
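The table can be reproduced in a few lines, which is a useful way to plug in your own step counts. This sketch assumes, as the table does, that steps succeed or fail independently; the function names are illustrative.

```python
def end_to_end_success(per_step, steps):
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps

def required_per_step(target, steps):
    """Per-step reliability needed to reach `target` end-to-end over `steps` steps."""
    return target ** (1.0 / steps)

# Rebuild the rows of the table above (values in percent).
for p in (0.90, 0.95, 0.99, 0.999):
    row = [round(end_to_end_success(p, n) * 100, 1) for n in (3, 5, 10, 20)]
    print(f"{p:.1%}: {row}")

# Per-step reliability needed for a 10-step task to succeed 90% of the time.
print(f"{required_per_step(0.90, 10):.4f}")
```

Real workflows rarely have perfectly independent steps, so treat the result as a rough planning number rather than a prediction.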
Benchmarks confirm the gap between capable and reliable
Two benchmark families make this concrete.
GAIA, built by Meta and Hugging Face researchers, tests agents on messy, real-world tasks that chain web browsing, file parsing, calculation, and reasoning. Its harder levels are deliberately designed to be sensitive to error accumulation and tool-use failures — exactly the conditions that expose compounding error.
τ-bench (tau-bench), from Sierra, is even more revealing because it measures consistency, not just success. Its pass^k metric asks: can the agent solve the same task correctly on every one of k independent attempts? In the original study, GPT-4o, the best performer tested, solved fewer than 50% of tasks overall on a single attempt, and in the retail customer-service domain its pass^8 score (succeeding on all eight tries) dropped to about 25%, a roughly 60% relative fall from its single-attempt rate in that domain. As Sierra put it, that means “there is only a 25% chance that the agent will resolve 8 cases of the same issue with different customers”, which is far below what a real user-facing deployment needs.
The story holds at the frontier. When Sierra released a harder knowledge-intensive variant, τ-knowledge, in early 2026, the best model (GPT-5.2 with high reasoning) passed just 25.5% of tasks on the first try and only 9.3% reliably across four runs. Capability keeps climbing; reliability lags well behind it.
The lesson from pass^k is blunt: an agent that works once in a demo tells you almost nothing about whether it works every time in production. Reliability is a separate axis from intelligence, and it’s the axis that matters when an agent is touching your inbox, your calendar, and your customers.
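For readers who want to compute a pass^k-style score on their own logs, here is a minimal sketch. It uses a combinatorial estimator (the all-successes counterpart of the familiar pass@k estimator); treat the exact formula as an assumption rather than Sierra's reference implementation, and note that the task names and results are made up.

```python
from math import comb

def pass_hat_k(successes, trials, k):
    """Estimate the chance that k attempts drawn from `trials` recorded attempts all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(successes, k) is 0 when there are fewer than k successes.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate; `results` maps task name -> list of pass/fail booleans."""
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)

# Hypothetical log: three tasks, four attempts each.
results = {
    "refund_order": [True, True, True, True],
    "change_address": [True, False, True, True],
    "cancel_subscription": [False, False, True, False],
}

for k in (1, 2, 4):
    print(k, round(benchmark_pass_hat_k(results, k), 3))
```

On this toy log the score falls from about 0.67 at k = 1 to about 0.33 at k = 4, because only the task that passed every attempt still counts once all four runs must succeed.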
The specific ways agents break
Beyond compounding error, agents fail in characteristic ways that a single-shot chatbot never encounters:
- Tool-use errors: calling the wrong tool, passing malformed arguments, or misreading a tool’s output and acting on the misreading.
- Error propagation: one wrong intermediate result quietly poisons every step that follows, and the agent has no idea anything went wrong.
- No error recovery: when a step fails, a brittle agent plows ahead instead of noticing, retrying, or asking for help.
- Context drift: over a long task the agent loses track of the original goal or earlier constraints.
- Silent failure: the most dangerous mode — the agent reports success while having done the wrong thing.
How reliability is actually measured and improved
The good news is that reliability is engineerable. You can’t make a probabilistic model deterministic, but you can wrap it in systems that catch, constrain, and correct its mistakes. The most dependable AI products are not the ones with the smartest base model — they’re the ones with the most disciplined scaffolding around it.
| Technique | What it does | What it fixes |
|---|---|---|
| Grounding / RAG | Feeds the model verified source data at answer time instead of relying on memorized facts | Hallucinations on factual lookups; RAG alone is reported to cut hallucinations by roughly 40–70% |
| Evals | Automated test suites that score outputs against expected behavior, run continuously | The “regression tests don’t catch it” problem; turns reliability into something you can measure |
| Structured outputs | Forces responses into a strict schema (JSON, function calls) | Malformed tool calls; downstream parsing failures |
| Guardrails & abstention | Rules that block unsupported answers and let the model say “I don’t know” or escalate | Confident wrong answers; pairs directly with the abstention fix from the hallucination research |
| Verification & retries | Checks each step’s output before continuing; retries or re-plans on failure | Error propagation; silent failure; the compounding-error curve |
| Human-in-the-loop | Routes low-confidence or high-stakes actions to a person before they execute | Irreversible mistakes on consequential tasks |
| Determinism settings | Lower sampling temperature and fixed seeds for tasks that need repeatability | Run-to-run variance on tasks with one right answer |
None of these is a silver bullet, and the OpenAI hallucination work is clear that you can’t fully eliminate the problem at the model level. But layered together — grounding plus guardrails plus evals plus human review for the risky cases — they move a system from “impressive demo” to “trustworthy enough to run unattended.” The most reliable production stacks combine all of them rather than betting on any single fix.
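As one illustration of how two of these layers combine, the sketch below wraps a single agent step in verification, bounded retries, and escalation to a person. The names `run_step` and `StepResult` and the date check are hypothetical, and a production system would also log each attempt.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepResult:
    ok: bool
    value: Optional[str]
    attempts: int
    escalated: bool

def run_step(action: Callable[[], str],
             verify: Callable[[str], bool],
             max_attempts: int = 3) -> StepResult:
    """Run `action`, check its output with `verify`, retry on failure, escalate if it never passes."""
    for attempt in range(1, max_attempts + 1):
        try:
            output = action()
        except Exception:
            continue  # treat a crash like a failed check and try again
        if verify(output):
            return StepResult(ok=True, value=output, attempts=attempt, escalated=False)
    # Never pass an unverified result downstream: hand off to a person instead.
    return StepResult(ok=False, value=None, attempts=max_attempts, escalated=True)

# Hypothetical flaky step: the first output is malformed, the second is a valid date.
outputs = iter(["next tuesday-ish", "2026-03-10"])
result = run_step(lambda: next(outputs),
                  verify=lambda s: len(s) == 10 and s[4] == "-")
print(result)
```

The key design choice is that a step which never passes its check returns no value at all, so later steps cannot quietly build on a bad intermediate result.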
Reliability is a systems problem, not a model problem
The biggest mental shift for anyone building or buying AI is to stop treating the model as the product. The model is one component. Reliability is a property of the whole system — the retrieval layer that feeds it grounded data, the verification layer that checks its work, the guardrails that constrain its actions, the eval harness that catches regressions, and the human escalation path for the cases that shouldn’t be automated.
This is why two products built on the same underlying model can have wildly different reliability. One ships the raw model with a thin prompt and lets it run. The other invests in the scaffolding: it grounds every factual claim, it forces tool calls through schema validation, it re-checks intermediate results before acting on them, and it routes anything irreversible to a person. The second one is slower to build and less flashy in a demo, but it’s the one you can actually leave running on your inbox.
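A rough sketch of what “forcing tool calls through schema validation” and “routing anything irreversible to a person” can look like in practice follows. The tool names, schemas, and approval policy are all hypothetical, and real systems typically use a full schema library rather than hand-written checks.

```python
import json

# Hypothetical tool schemas: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "send_email": {"to": str, "subject": str, "body": str},
    "lookup_customer": {"email": str},
}
# Hypothetical policy: these tools change the outside world and need sign-off.
IRREVERSIBLE_TOOLS = {"send_email"}

def validate_tool_call(raw: str) -> dict:
    """Parse a model-produced tool call and reject anything that does not match its schema."""
    call = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name, args = call.get("tool"), call.get("arguments")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    schema = TOOL_SCHEMAS[name]
    missing, extra = set(schema) - set(args), set(args) - set(schema)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise ValueError(f"argument {key!r} must be {expected_type.__name__}")
    return {"tool": name, "arguments": args,
            "needs_human_approval": name in IRREVERSIBLE_TOOLS}

good = '{"tool": "send_email", "arguments": {"to": "a@example.com", "subject": "Hi", "body": "Hello"}}'
print(validate_tool_call(good)["needs_human_approval"])
```

A call that fails validation raises before anything executes, which turns a silent downstream failure into a loud, early one.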
Research on agent reliability increasingly studies exactly these failure trajectories — how an erroneous intermediate output derails a downstream task, and how systems can detect and correct failures from execution traces before they compound. The frontier of the field is no longer “make the model smarter.” It’s “make the system around the model catch the model’s mistakes.” That distinction is the whole game.
Why reliability is the real blocker to AI adoption
If you only read launch announcements, you’d think capability is the bottleneck. The data from the field says otherwise: the thing keeping agents out of production is reliability.
LangChain’s 2025 State of Agent Engineering survey found that quality — accuracy, consistency, and staying on-policy — was the single biggest barrier to getting agents into production, cited by roughly a third of respondents as their primary blocker. Separate 2026 industry research reported that the large majority of agent pilots never graduate to production, with evaluation gaps, governance friction, and model reliability among the most-cited reasons. Across surveys, the pattern repeats: unreliable performance and non-deterministic outputs, not raw intelligence, are what stall deployments.
This is also why teams keep paying for capabilities they already have. A model smart enough to draft a perfect email is useless if it sends the email to the wrong person one time in twenty. Adoption is gated by trust, and trust is built on consistency, not peak performance.
How to tell whether an AI tool is reliable enough to trust
When you’re evaluating any AI assistant or agent — including Carly — push past the demo and ask the questions that separate reliable systems from flashy ones:
- Does it work the second, fifth, and fiftieth time? Run the same task repeatedly. A reliable tool produces consistent results; an unreliable one dazzles once and drifts. This is the pass^k question applied to your own work.
- What does it do when it’s unsure? The right answer is “it asks, abstains, or escalates” — not “it confidently guesses.” A tool that never says “I don’t know” is a tool that hallucinates silently.
- Is its work grounded in your real data? An agent acting on your actual emails, calendar, and CRM records is far more reliable than one improvising from training data. Grounding is the difference between reading the file and guessing what’s in it.
- Can you see and verify what it did? Reliable agents leave a trail. You should be able to check the actions before or after they happen, not just trust a green checkmark.
- Does it recover from failures? Watch what happens when a step goes wrong. Good agents notice, retry, or hand off. Brittle ones barrel ahead.
- Is reliability the product, or an afterthought? Tools built for reliability talk about consistency, verification, and error handling. Tools built for demos talk about how many things they can do.
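The first question on this list is easy to automate. Here is a small sketch of a repeat-run harness; `task` stands for any zero-argument callable that invokes the tool under test and returns a comparable result, and the fake tool below is a deterministic stand-in used only so the example runs without a network call.

```python
from collections import Counter
from itertools import cycle

def consistency_report(task, runs=10):
    """Run `task` repeatedly and report how often it returns its most common output."""
    outputs = [task() for _ in range(runs)]
    counts = Counter(outputs)
    top_output, top_count = counts.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outputs": len(counts),
        "most_common": top_output,
        "agreement": top_count / runs,
    }

# Hypothetical stand-in for a scheduling agent that drifts on every fourth run.
fake_outputs = cycle(["Tue 10:00", "Tue 10:00", "Tue 10:00", "Wed 14:00"])
report = consistency_report(lambda: next(fake_outputs), runs=8)
print(report)
```

Agreement measures consistency, not correctness, so a high score still needs a spot check that the most common output is actually the right one.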
If you’re comparing options against these criteria, our roundups of the best AI agent platforms, the best AI personal assistants, and the best AI agents for productivity are built around exactly this distinction between capable and dependable.
Where Carly fits: built to be dependable, not just impressive
This is the problem Carly is built around. The pitch isn’t “Carly is the smartest agent” — it’s that Carly does real work dependably, across email, calendar, CRM, and 200+ connected tools, the way an actual AI employee would.
That dependability comes from the techniques above, not from hoping the model behaves:
- Grounded in your real systems. Carly acts on your live inbox, calendar, and CRM data — labeling and filing email, moving attachments into the right folders, updating records, scheduling, running sequences — rather than improvising from memory. Grounding is the most direct defense against the hallucinations that sink ungrounded chatbots.
- Persistent memory across tasks. Carly remembers your preferences, contacts, and how you work, so it stays consistent over time instead of starting fresh — and forgetting your constraints — on every request.
- Real workflows, not happy-path demos. Carly is designed to run multi-step jobs that have to survive the compounding-error curve, which is exactly where demo-grade agents fall apart on the twelfth step.
Carly starts at $35/month, and you can put the reliability questions above to it directly — run a real workflow, run it again, and watch whether it holds — at dashboard.carlyassistant.com.
The number that matters in 2026
The headline metric for AI has quietly shifted. For two years the question was “how capable is it?” — and on that axis, the models are extraordinary. The question that decides whether any of that capability turns into real work is narrower and harder: how reliable is it, across many runs, on the messy task you actually have?
Capability is now abundant. Reliability is scarce. The systems that win the next phase of AI won’t be the ones that can do the most in a demo — they’ll be the ones you can stop watching.


