AI makes things up. You have probably watched it happen, and it is fair to ask the next question: compared to what? People make mistakes too. The honest comparison is not one number against another number. It depends entirely on the task. On some grounded, source-based tasks the best models now match or beat skilled people in head-to-head tests. Cut loose from a source, they still trail. Once you see why, the fix stops being about the model and starts being about the work you build around it.
Why there is no single answer to this question
People want one number for AI and one number for humans so they can line them up. That number does not exist for either side.
AI error swings with the job. The same model that stays very close to a document you hand it can invent a citation when you ask it to recall a fact from memory. Human error swings the same way. The same person who types one number wrong in a thousand can misremember a face with total confidence.
So a fair comparison has one rule: match the task. Put the model and the person on the same job, with the same information, and then read the error rates side by side. Anything short of that is guesswork dressed up as a verdict. This is the same habit we walked through in Why Does AI Make Things Up?, now turned toward the human side of the ledger.
Reading your own work this carefully, deciding what to trust and what to check, is the first skill in The 7 Levels of AI Proficiency. It is the difference between using AI and being used by it.
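The matching rule can be sketched in a few lines of Python. This is a minimal illustration and not a benchmark harness: the `Trial` record, the `matched_comparison` helper, and the sample data are all hypothetical. A real comparison would also need agreed grading criteria and enough items for the rates to mean anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One graded attempt at one item (all fields are illustrative)."""
    item_id: str   # which document or question was attempted
    worker: str    # "human" or "model"
    correct: bool  # graded against the same answer key


def error_rate(trials: list[Trial], worker: str) -> float:
    """Share of a worker's attempts that were graded wrong."""
    own = [t for t in trials if t.worker == worker]
    if not own:
        raise ValueError(f"no trials recorded for {worker!r}")
    return sum(not t.correct for t in own) / len(own)


def matched_comparison(trials: list[Trial]) -> dict[str, float]:
    """Compare workers only on items that both of them attempted."""
    human_items = {t.item_id for t in trials if t.worker == "human"}
    model_items = {t.item_id for t in trials if t.worker == "model"}
    shared = human_items & model_items
    matched = [t for t in trials if t.item_id in shared]
    return {w: error_rate(matched, w) for w in ("human", "model")}


# Made-up data: item "c" was attempted only by the model, so it is dropped.
trials = [
    Trial("a", "human", True), Trial("a", "model", True),
    Trial("b", "human", False), Trial("b", "model", True),
    Trial("c", "model", False),
]
rates = matched_comparison(trials)
```

Dropping the unmatched item is the point of the sketch: a rate measured on work the other side never attempted says nothing about the comparison.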
What "grounded" means, and why it changes everything
Two kinds of AI work look the same from the outside and behave completely differently underneath.
Grounded work gives the model a source and asks it to stay inside that source. Summarize this report. Pull the dates from this contract. Answer this question using only the file I attached. The source is the boundary, and the model is graded on whether it stays inside it.
Ungrounded work gives the model nothing but the question and asks it to answer from memory. What year did this happen? Write me an original analysis of this market. Tell me about this person. There is no boundary, so the model fills the space with whatever its training pulled together, and some of that turns out to be wrong.
Most business AI is grounded, or it can be. You usually have the document, the data, the policy, or the email thread. That single design choice, whether the model works from a source or from memory, moves the error rate more than picking one model over another. Hold that idea. It is where this whole comparison ends up.
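In practice, the difference between the two kinds of work often comes down to what goes into the prompt. The sketch below is an assumption-laden illustration: the function names and the instruction wording are hypothetical, and an instruction like this does not by itself guarantee that a model stays inside the source.

```python
def ungrounded_prompt(question: str) -> str:
    """The model gets only the question and must answer from memory."""
    return question


def grounded_prompt(question: str, source: str) -> str:
    """The model gets the source and is told to stay inside it."""
    if not source.strip():
        raise ValueError("grounded work needs a non-empty source")
    return (
        "Answer using only the source below. "
        "If the source does not contain the answer, say so.\n\n"
        f"SOURCE:\n{source}\n\n"
        f"QUESTION:\n{question}"
    )


# Illustrative source text, not taken from any real contract.
contract = "This agreement takes effect on 1 March and ends on 31 August."
prompt = grounded_prompt("When does the agreement end?", contract)
```

The same question goes to the model either way. Only the grounded version carries the boundary with it.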
How often AI makes mistakes, by task
Here is the spread, from the work AI does best to the work it does worst.
On grounded summarization, the leaders are very accurate. The current Vectara Hallucination Leaderboard, which tests how often a model adds something the source never said, shows a top model at 1.8% on its summarization benchmark. Independent reviews put leading models in the 0.7% to 1.5% range on the same kind of grounded task.
Push the difficulty up and the number climbs. Vectara's newer, harder benchmark uses more than 7,700 documents from law, medicine, finance, education, and tech, and the early leader there sits at 3.3%. Several frontier "thinking" models that score well on easy material climb above 10% on the hard set, and one large model reaches 13.6%. The harder and more specialized the source, the more even good models slip.
Cut the source away entirely and the picture changes. OpenAI's own system card for its o3 and o4-mini reasoning models reports hallucination rates of 33% and 48%, respectively, on PersonQA, a benchmark of questions about people that is answered from memory rather than from a source. Secondary summaries put fully open-ended generation higher still, often cited in the 40% to 80% range depending on how the task is set up. These are the same kinds of models doing a different job, and the error rate jumps by a wide margin once the grounding is gone.
How often humans make mistakes, by the same kind of task
Now the side that gets less attention. It is easy to treat people as the reliable baseline. The measured record is more humbling than that.
Data entry. Studies of human keying put the error rate at 1% to 4% per field, with about 1% treated as the mark of a careful operator, per aggregated human-error research. One wrong field in a hundred is what "good" looks like.
Radiology. Experienced radiologists working in daily practice run a 3% to 5% real-time error rate, and retrospective reviews of CT and MRI studies surface errors in 20% to 30% of cases, per a review in the medical literature. These are trained specialists reading at the top of their field.
Eyewitness memory. Confident human recall is one of the least reliable instruments we have. Mistaken eyewitness identification appears in 69% of the wrongful convictions later overturned by DNA evidence, the single leading cause, per the Innocence Project. In those cases a witness recalled the wrong face with full confidence.
None of this means people are unreliable across the board. It means human error is real, measurable, and usually larger than we admit when we set ourselves up as the gold standard.
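The per-field data-entry figure above looks small until it compounds across a whole record. The sketch below works that out under a loud assumption: every field fails independently at the same rate, which is a simplification of ours and not a claim from the cited research. The 20-field form is hypothetical.

```python
def record_error_rate(per_field_rate: float, fields: int) -> float:
    """Chance that a record with `fields` entries has at least one error.

    Assumes each field fails independently with the same probability.
    """
    if not 0.0 <= per_field_rate <= 1.0:
        raise ValueError("per_field_rate must be between 0 and 1")
    if fields < 0:
        raise ValueError("fields must be non-negative")
    return 1.0 - (1.0 - per_field_rate) ** fields


# A hypothetical 20-field form at the careful-operator and upper-end rates.
careful = record_error_rate(0.01, 20)  # roughly 0.18
upper = record_error_rate(0.04, 20)    # roughly 0.56
```

Under that assumption, a careful operator at 1% per field would still leave about one 20-field record in five with at least one wrong entry.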
The task-matched comparison
Put them on the same job, same information, and read across.
| Task (same information, same job) | Human error | AI error | Honest read |
|---|---|---|---|
| Grounded document summary | n/a as a clean benchmark; humans miss and misstate at meaningful rates under time pressure | 0.7% to 1.8% on the leaders; 3.3% on a harder enterprise set | AI is strong here. This is most business AI work. |
| Clinical notes from a source | human notes averaged at least 1 error and about 4 omissions each | one model: 1.47% added errors, 3.45% omissions on the same task | Roughly matched, AI slightly ahead on this study. |
| Speech transcription | professional transcribers: 5.9% word errors (clean audio), 11.3% (harder audio) | a measured system: 5.8% and 11.0% on the same audio | Essentially even. Human parity reached. |
| Data entry, per field | 1% to 4%, about 1% for a careful operator | grounded extraction sits in a similar single-digit band | Comparable; the win comes from pairing them. |
| Open-ended answer from memory | a knowledgeable person is usually careful and checkable | OpenAI's o3 and o4-mini hallucinated on 33% and 48% of a from-memory person-fact benchmark; secondary summaries cite 40% to 80% on open generation | AI is clearly worse. Do not run it ungrounded. |
| Confident recall of a detail | eyewitness misidentification appears in 69% of DNA exonerations | grounded model with the source in hand stays close to the source | Grounded AI beats unaided human memory. |
Sources for these rows: the clinical note head-to-head, the speech-recognition human-parity study and its companion paper, the Vectara leaderboard, the OpenAI o3 and o4-mini system card, and a secondary review of model error rates.
Read the table top to bottom and a pattern shows up on its own. Wherever the source is present, AI is competitive or ahead. Wherever the source is absent, AI falls behind. What separates "trust it" from "check it twice" is largely whether the model was given something to work from. The workflow usually counts for more than the logo on the tool, though the model you choose still pulls real weight.
A grounded model beats an unaided human on grounded work. A memory-only model loses to a careful human on open questions. Grounding is the lever you reach for first, and the model you choose is the lever you reach for second.
The clinical-notes case, up close
One study makes the point cleanly because it ran the two side by side on the same job. Human-written clinical notes averaged at least one outright error and about four omissions per note. A model doing the same task produced added errors in 1.47% of cases and omissions in 3.45%, per the head-to-head in Nature's npj Digital Medicine.
The model still made mistakes here. The point of the study is narrower and more useful: a fair, side-by-side test put the human baseline lower than the people in the room had assumed. When the comparison is honest, "but it makes mistakes" turns into "compared to what, exactly?"
Where reasoning models fit, and why harder is not always safer
There is a wrinkle worth naming, because it surprises people. The newest "thinking" models, the ones built to reason through a problem step by step, are not automatically the most reliable on grounded summary. Several of them sit above 10% on Vectara's hard enterprise test, higher than smaller models built for tight grounded work.
The takeaway is not "avoid reasoning models." It is that more reasoning power suits open-ended problem solving and does not, by itself, tell you how well a model stays inside a source. Pick the model for the job. A focused grounded model for grounded work. A reasoning model for the messy, open problems where you expect to check the output anyway.
What this means for the work in front of you
The whole comparison points at one practical step, and it has nothing to do with waiting for a perfect model.
If the answer to "is AI more reliable than a person here?" depends on grounding, then your job is to build the grounding in. Give the model the source. Keep a person on the result. Design the steps so the model works inside a boundary and a human signs off at the point where a mistake would actually cost something.
That is a workflow decision, and no software purchase replaces it. It is the skill that separates a team that gets burned by AI from a team that gets value out of it. The skill is designing the path the model runs and choosing where the human checkpoint sits, which counts for more than picking the smartest model. In The 7 Levels of AI Proficiency, that is the step from using a tool to building the system the tool runs inside.
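As a rough sketch of that design, the snippet below runs one task inside a source boundary and flags it for human sign-off only when the stakes are high. Everything here is hypothetical: `run_grounded_step`, `Result`, and `fake_model` are illustrative names, and `fake_model` is a placeholder so the example runs without any AI service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    answer: str
    needs_signoff: bool  # True when a person must approve before use


def run_grounded_step(
    question: str,
    source: str,
    model: Callable[[str], str],
    high_stakes: bool,
) -> Result:
    """Run one task inside a source boundary and flag it for review.

    `model` is any function that maps a prompt to text; it stands in for
    whatever tool the team actually uses.
    """
    if not source.strip():
        raise ValueError("refusing to run ungrounded: no source supplied")
    prompt = (
        "Use only the source below.\n\n"
        f"SOURCE:\n{source}\n\nQUESTION:\n{question}"
    )
    answer = model(prompt)
    # The checkpoint sits only where a wrong answer carries a real cost.
    return Result(answer=answer, needs_signoff=high_stakes)


def fake_model(prompt: str) -> str:
    """Placeholder so the sketch runs without any AI service."""
    return "31 August"


result = run_grounded_step(
    "When does the agreement end?",
    "This agreement ends on 31 August.",
    fake_model,
    high_stakes=True,
)
```

The design choice worth noticing is that the step refuses to run without a source, and the sign-off flag is set by the cost of a mistake, not by which model is behind `model`.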
What you can set up this week
Three small steps. None of them requires new software.
Ground one task you already do.
Pick something you ask AI to do from memory and hand it the source instead. Attach the document, the data, the policy. The error rate should fall without any other change.
Add one human checkpoint where it counts.
Find the one place in a task where a wrong answer would cost real money or real trust, and put a person's sign-off right there. Not everywhere. The one place that carries the cost.
Match the model to the job.
Use a tight, grounded model for source-based work and save the reasoning models for open problems you plan to review anyway.
If you want to see where your team sits on this skill today, the free 7 Levels of AI Proficiency assessment takes about ten minutes and shows you where the strength and the missing layer are.
Related reading: Level 4: The Commander (Context Engineer).
Sources
- Vectara. "Introducing the Next Generation of Vectara's Hallucination Leaderboard." Accessed June 24, 2026.
- Vectara. "Hallucination Leaderboard" (dataset and methodology). Accessed June 24, 2026.
- SQ Magazine. "LLM Hallucination Statistics." Accessed June 24, 2026.
- npj Digital Medicine (Nature). "Comparison of large language model and human clinical note generation." Accessed June 24, 2026.
- Microsoft Research. "Toward Human Parity in Conversational Speech Recognition." Accessed June 24, 2026.
- Xiong et al. "Toward Human Parity in Conversational Speech Recognition" (paper, PDF). Accessed June 24, 2026.
- Parsli. "Human Error Statistics" (data-entry error research). Accessed June 24, 2026.
- National Center for Biotechnology Information. "Errors in Radiology" (review). Accessed June 24, 2026.
- Innocence Project. "How Eyewitness Misidentification Can Send Innocent People to Prison." Accessed June 24, 2026.
- OpenAI. "OpenAI o3 and o4-mini System Card" (PersonQA hallucination evaluation: o3 33%, o4-mini 48%). Accessed June 24, 2026.
Frequently Asked Questions
Does AI hallucinate more than humans?
It depends on the task. On grounded work, where the model summarizes or pulls facts from a source you give it, the best models now rival or beat skilled people in head-to-head tests like clinical-note and transcription studies, with leaders measured at 0.7% to 1.8% on standard summarization tests. On open-ended work from memory, AI is clearly worse: OpenAI's own o3 and o4-mini hallucinated on 33% and 48% of a from-memory benchmark. There is no single number for either side, so the only honest comparison is task by task.
How often does AI make mistakes?
On grounded summarization, leading models fall between 0.7% and 1.8% on standard datasets and about 3.3% on a harder enterprise set spanning law, medicine, and finance. On answers pulled from memory the risk climbs sharply: OpenAI reported its o3 and o4-mini models hallucinating on 33% and 48% of a person-fact benchmark, and secondary summaries put fully open-ended generation in the 40% to 80% range.
Are humans more reliable than AI?
Not by as much as people assume, and it depends on the task. Human data entry runs 1% to 4% errors per field, radiologists run 3% to 5% in real time and 20% to 30% on retrospective review, and eyewitness misidentification appears in 69% of DNA exonerations. On grounded tasks, the best models are now competitive with skilled humans in measured head-to-head tests, though the result still turns on the task, the source, and the review process.