Hi all - Nate here,
If you're using AI for research, you should be concerned about factual accuracy.
I've spent the last several months building a spreadsheet that maps programs where AI is used to support community health workers in low- and middle-income countries. I've documented 38 programs so far. I used AI tools throughout the process, and I fact-checked practically every claim they produced. What I found has changed how I work with these tools.
The broad pattern
I started with Perplexity, widely considered the gold standard for AI web research. It gave broken URLs, fabricated citations, and mixed up details across different programs. It kept counting things as CHW programs when they didn't work with CHWs at all, and labeling tools as AI when they were just digital health platforms with no AI component. ChatGPT had the same problems.
It was so unreliable that I had to manually verify practically everything. By the end, I'd spent as much time verifying AI outputs as I would have doing the research without AI.
That's when it started to feel like a hall of mirrors. Each stage of the research (discovery, synthesis, verification) introduced its own layer of potential errors. And each stage built on the one before it. An AI finds a program. It writes a summary. You ask it to verify the summary. At every step, new inaccuracies can enter, and they compound. After enough layers, it gets hard to know what's real.
The experiment I didn't plan
After I had gathered all the programs and moved to fact-checking, I stumbled onto one source of errors.
I had one Claude chat that had been doing broad web searches to find programs I might have missed. It had run dozens of searches and accumulated a large collection of results. I opened a separate Claude chat and asked it to write a detailed entry for one specific program, fetching and reading the full web pages for that program.
I took the detailed entry back to the first chat and asked it to verify the claims, one by one.
It got the specifics wrong repeatedly. It said Bangladesh wasn't a program country. It was. It said the open-source date was September 2023. It was August 2023. It said the journal was BMJ Health & Care Informatics. It was Digital Health. It said the lead author wasn't Gathecha. It was.
The chat that had done more research got the facts wrong. The one that had read fewer sources but read them completely was right nearly every time.
Why this happens
When Claude searches the web, it uses a tool that returns short snippets, similar to Google search previews. A separate tool retrieves the full content of a specific URL. The model decides which to use based on context. If it already has text that seems relevant, it tends to work with that rather than fetching something new.
The first chat had accumulated dozens of snippets from earlier work. When I asked it to verify specific claims, it didn't go back and fetch the full pages. It relied on what it already had. Those snippets were incomplete, but nothing signaled they were incomplete. A blog post snippet included three of four AI techniques but cut off before the fourth. An article snippet didn't include the journal name or full author list.
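To make that failure mode concrete, here's a toy sketch in Python. This is not Claude's actual tooling; the page text, the snippet length, and the naive "check against context" step are all invented for illustration. The point is that a snippet is just a silent truncation, so a fact that falls past the cutoff looks unsupported:

```python
# Toy illustration: a search "snippet" is a truncated preview of a page.
# Nothing marks it as truncated, so checking a claim against the snippet
# alone silently misses facts that only appear in the full text.

FULL_PAGE = (
    "The tool combines four AI techniques: speech recognition, "
    "machine translation, decision support, and image triage."
)

def make_snippet(page: str, limit: int = 100) -> str:
    """Simulate a search result: the first `limit` characters, no marker."""
    return page[:limit]

def claim_supported(claim: str, context: str) -> bool:
    """Naive verification: is the claim's key phrase in the context?"""
    return claim.lower() in context.lower()

snippet = make_snippet(FULL_PAGE)

# The fourth technique is cut off by the snippet...
print(claim_supported("image triage", snippet))    # False
# ...but it is right there in the full page.
print(claim_supported("image triage", FULL_PAGE))  # True
```

A chat holding only the snippet would confidently report that the fourth technique isn't part of the tool, which is exactly the shape of error I kept finding.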
There's also a known problem with how language models handle long contexts. Research on the "Lost in the Middle" problem (Liu et al., 2023) shows that LLMs pay the most attention to information near the beginning and end of their context window and lose track of content in the middle. Once the first chat had processed dozens of search results, earlier snippets fell into that blind spot.
And search results themselves can be stale or come from secondary sources that paraphrase the original incorrectly. Even when the first chat did search, it sometimes found third-party summaries rather than the primary source, and those summaries contained errors that the original pages did not.
What to do about it
The fixes are structural, not just about prompting better.
Use separate chats for separate tasks. Don't use the same conversation for broad discovery and precise verification. One chat for searching, a separate chat for detailed write-up, a fresh chat for verification. Each starts with a clean context window.
Tell the model which tool to use. If you need precise verification, say "fetch the full page at this URL and check this claim." Without that instruction, the model often defaults to snippets or reuses what's already in context.
Watch for hedging. When a model says "I can't verify this from my sources," that's a signal its sources are incomplete, not that the claim is wrong. That's the moment to dig deeper.
Cross-check between instances. Two separate chats working the same problem surfaces errors that are invisible within a single conversation.
Keep verification conversations short. The longer the conversation, the more noise in the context window. If you shift to a precision task, start a new chat.
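The cross-check step can even be semi-automated. Here's a minimal sketch (the field names and the "Kenya, Uganda" values are placeholders I made up; the mismatched answers echo the errors described earlier): collect the same factual fields from two separate chats, diff them, and hand-verify only where they disagree. Agreement doesn't prove correctness, but disagreement is a cheap, reliable flag.

```python
# Toy cross-check: two separate chats answered the same factual fields.
# Fields where they disagree are the ones to verify against the
# primary source yourself.

chat_a = {  # e.g. the long "discovery" chat, working from snippets
    "countries": "Kenya, Uganda",
    "open_source_date": "September 2023",
    "journal": "BMJ Health & Care Informatics",
}
chat_b = {  # e.g. a fresh chat that fetched and read the full pages
    "countries": "Kenya, Uganda, Bangladesh",
    "open_source_date": "August 2023",
    "journal": "Digital Health",
}

def disagreements(a: dict, b: dict) -> dict:
    """Return the fields where the two chats gave different answers."""
    return {k: (a[k], b[k]) for k in a if a.get(k) != b.get(k)}

for field, (ans_a, ans_b) in disagreements(chat_a, chat_b).items():
    print(f"VERIFY {field}: chat A says {ans_a!r}, chat B says {ans_b!r}")
```

In my case every one of these flagged fields turned out to be a real error in the long-context chat, which is what convinced me the structural fixes above are worth the overhead.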
What I’m working on
The spreadsheet I published on AI and CHW programs is accurate because I did the verification work myself, program by program. AI tools are unreliable on exactly the kind of factual precision that research demands, and the failures can be hard to spot. But I don't think the right conclusion is that AI is useless for research. These tools still have enormous potential to let us do far more work in a fixed amount of time and to improve the quality of that work.
So, I'm doing a deep dive on this. I'm testing different workflows, documenting what fails and what actually works, and building toward practical methods for using AI in research without sacrificing accuracy. I will share what I learn along the way. And if you're working on similar problems, feel free to reach out and share your experiences.
