Hi all - Nate here.
Here’s the next installment on using AI in research and striving for factual accuracy in the results.
Most conversations about AI errors treat them as one problem: hallucination. The AI made something up. But after months of using AI to research community health programs, I found that "hallucination" doesn't come close to describing what actually goes wrong. The errors I encountered fell into at least six distinct types, each with different causes. And if you don't tell them apart, you'll build the wrong safeguards.
1. Fabrication
This is the error everyone talks about. The AI generates something that has no basis in any source.
In a comparison between Claude and NotebookLM reports on the same set of programs, NotebookLM's report mentioned a program called "MarIA - DeepSeek," described as using a framework to predict chronic risks. This program didn't appear anywhere in the Claude report and couldn't be matched to any known program in the dataset. It was flagged as a possible fabrication that needed verification.
This is the classic "hallucination." But in my experience, it was the easiest type of error to spot. An invented program name jumps out. The five errors below are subtler and, for that reason, more dangerous.
2. Miscounting
I gave NotebookLM all 37 individual program reports and asked it to write a synthesis. It counted 31 programs. I told it the correct number was 37 and asked it to try again. It counted 34.
This isn't hallucination. The information was all there. The AI just couldn't reliably count across its own inputs when everything was written in paragraphs. There was no list, no table, no structure it could count against. It was reading narrative text and trying to keep a running tally, a task language models do poorly.
A count of 34 out of 37 looks close enough to be plausible, which makes it easy to miss.
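The structural fix is to count from data, not prose. If each program is one row in a spreadsheet or CSV, the count becomes deterministic and the AI never has to tally anything. A minimal sketch (the filename and columns here are hypothetical, just for illustration):

```python
import csv

def count_programs(path):
    """Count programs from a structured file: one row per program.

    Hypothetical CSV layout: a header row, then one row per program.
    With structure like this, the count is exact -- no tallying
    across narrative paragraphs.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return len(rows)
```

The point isn't the code itself but the shape of the input: a list or table gives you (and the AI) something to count against, where 37 paragraphs do not.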
3. Misclassification
Programs ended up with wrong labels when AI read descriptions and made judgment calls about how to categorize them. The most telling example was deployment status. Many programs were coded as "Active scaling" based on language in their source documents. But when I looked more closely, phrases like "scaling nationally" often meant "we plan to scale nationally" or "we received funding to scale," not that scaling was actually happening. About 12 programs had to be reclassified from "scaling" to "pilot" after comparing the actual deployment footprint against the self-reported language.
The problem isn't that AI can't categorize. It's that source documents often use promotional or aspirational language, and AI takes that language at face value. A program website that says "transforming healthcare delivery nationwide" gets coded as national scale, even when the actual footprint is 21 facilities in one region.
4. Citation errors
NotebookLM uses numbered citations like [1], [3], [5]. But I found that several different numbers all pointed back to the same document. It turned out NotebookLM was citing individual passages within a document rather than the document itself. So what looked like three independent sources supporting a claim was actually one source cited three times. Meanwhile, other source documents weren't being cited at all.
Citation errors also came in other forms: a fact credited to the wrong source, a citation pointing to a source that didn't exist, or a citation pointing to a real source that didn't contain the claimed information.
The fix was to stop using numbered citations and instead name the source directly in the text. Writing "according to the ReliefWeb press release" instead of "[3]" made it immediately clear which source supported each claim and whether multiple claims were all relying on the same one.
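Another way to surface the duplicate-citation problem is to resolve each citation number to its underlying document before trusting the count of "independent" sources. A hypothetical sketch (the mapping below is invented for illustration; in practice it would come from the tool's source panel):

```python
# Hypothetical map from citation numbers to the documents they resolve to.
citation_to_doc = {
    1: "reliefweb_press_release",
    3: "reliefweb_press_release",
    5: "reliefweb_press_release",
    7: "who_report",
}

def distinct_sources(citation_numbers):
    """Collapse citation numbers to the distinct documents behind them."""
    return {citation_to_doc[n] for n in citation_numbers}

# A claim "supported" by [1], [3], [5] rests on a single document:
distinct_sources([1, 3, 5])  # one document, not three
```

Naming sources in the text makes this visible to a human reader; a mapping like this makes it checkable mechanically.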
5. Content that looks real but isn't substantiated
This one is different from hallucination. The AI didn't make anything up. It found real content on the internet that used all the right words but had nothing behind it.
AI search tools surfaced a product called EasyClinic as a community health worker program in Rwanda. It matched all the right search terms: AI, community health workers, Rwanda. The company's website had pages with language like "our mobile tools empower CHWs to..." that read like descriptions of a real program. But there was no named facility, no district, no implementing partner, no date of deployment, and no third-party confirmation. The company did have a real product (an electronic medical records system used in a clinical study of clinicians in Nairobi), but the community health worker story existed only as marketing content.
AI search tools can't tell the difference between content that describes something real and content that just uses the right language. A blog post that says "our tools empower CHWs" looks the same to a search engine as a WHO report documenting an actual deployment. The only way to catch this is to look for independent confirmation: a third-party source, a government partner, a named location, a date. If none of that exists, the content isn't evidence of a program. It's just words on a website.
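That checklist can be made explicit. A sketch of the independent-confirmation check, with hypothetical field names (adapt them to however you record evidence for each program):

```python
# Hypothetical evidence fields; a program needs at least one of these
# to count as more than marketing language.
REQUIRED_EVIDENCE = (
    "third_party_source",   # e.g. a WHO report or news coverage
    "named_location",       # a specific facility or district
    "implementing_partner", # a government or NGO partner
    "deployment_date",      # when it actually went live
)

def is_substantiated(program: dict) -> bool:
    """True if the record carries at least one piece of independent
    evidence, rather than only the program's own website copy."""
    return any(program.get(field) for field in REQUIRED_EVIDENCE)
```

A record that fails this check isn't necessarily fake; it just hasn't earned a place in the dataset yet and goes to manual review.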
6. Inference stated as fact
In one program report, the original source said the research team would test the AI model's "accuracy and cultural appropriateness." The AI-generated report reframed this as the program's goal being "to generate accurate, culturally appropriate responses."
The shift is subtle. The source said "we will assess whether it's accurate and culturally appropriate," a question being tested. The report turned this into a stated intention: "the program aims to be accurate and culturally appropriate." These sound similar but mean different things.
This type of error is especially hard to catch because the reframed version sounds reasonable. You'd only spot it by comparing the AI's output word-for-word against what the original source actually said.
Why the distinction matters
If you treat all of these as "hallucination" and respond with a single fix (like "add a fact-checking step"), you'll catch some errors and miss others entirely. Miscounting requires a structural fix: use a spreadsheet, not paragraphs. Content that looks real but isn't substantiated requires looking for independent confirmation, not just checking whether the AI quoted its source correctly. Inference-as-fact requires comparing the AI's words against the exact source language, not just checking whether the claim is roughly correct.
Each error type has a different cause and needs a different response. Lumping them together under "AI makes stuff up" is both inaccurate and unhelpful.
Quick reference
| Error type | What happens | What to do |
|---|---|---|
| Fabrication | AI invents something with no basis in any source | Verify names, programs, and claims against original sources |
| Miscounting | AI loses track when counting across narrative text | Use structured data (tables, lists) instead of paragraphs |
| Misclassification | AI takes promotional or aspirational language at face value | Compare labels against actual deployment footprint, not self-reported language |
| Citation errors | Citations point to wrong sources, or multiple citations point to the same one | Name sources directly in text instead of using numbered references |
| Unsubstantiated content | AI surfaces real content that uses the right words but has no evidence behind it | Look for independent confirmation: third-party sources, named locations, dates |
| Inference as fact | AI restates a research question or aspiration as a stated outcome | Compare AI output word-for-word against original source language |

