Hi all, Nate here.

I built a citation-verification skill for Claude Cowork (you can download it at the end of this post). Does it actually beat the tools people already use?

I took a 35-citation report on an AI health platform and rigged it with fabricated statistics, reversed findings, and a made-up study. Then I fed it through four AI tools to see which ones caught what.

My skill caught 10 of 11. ChatGPT caught 5 and fabricated quoted evidence for 3 more. Plain Claude chat caught 3. Perplexity refused the task.

A quick aside on Cowork and skills

Most global health folks who have tried Claude have only used the chat window. There are two other features worth knowing about: Cowork and skills.

Cowork lets you work with AI agents to complete multi-step workflows. It can also run multiple agents at once (sub-agents), which can work in parallel, hand off parts of a task to each other, and check each other's work. For this test, Cowork set up two agents. The first agent worked through each of the 35 citations one by one. The second agent independently reviewed the first agent's verdicts, starting fresh without seeing how the first one had reasoned, and caught things the first agent had missed or gotten wrong. If you've been wondering what AI agents actually do in practice, this is a concrete example.
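If you want to picture what that orchestration looks like, here's a minimal sketch in Python. To be clear, this is not Cowork's actual implementation or API (I haven't seen either); claude() and the function names are hypothetical stand-ins, just to make the verify-then-review pattern concrete.

```python
# A rough sketch of the verify-then-review pattern. NOT Cowork's actual
# code or API; claude() is a hypothetical stand-in for whatever call
# sends a prompt to a model and returns its reply.

def claude(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client here."""
    raise NotImplementedError

def verify_citations(citations, skill_instructions):
    """Agent 1: work through each citation one by one."""
    return [
        claude(f"{skill_instructions}\n\nVerify this citation:\n{c}")
        for c in citations
    ]

def review_verdicts(citations, verdicts, skill_instructions):
    """Agent 2: re-check every verdict from scratch, seeing only
    Agent 1's verdicts, not its reasoning."""
    return [
        claude(
            f"{skill_instructions}\n\n"
            f"A first reviewer gave the verdict '{v}' for this citation:\n{c}\n"
            "Re-verify independently and flag any disagreement."
        )
        for c, v in zip(citations, verdicts)
    ]
```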

A skill is a set of detailed instructions Claude follows when it encounters a task the skill is designed for. I don't write skills from scratch. I usually build one after a long back-and-forth with Claude on a specific task, where we work through different approaches, figure out what works, and surface things I hadn't thought of at the start. When we land on a method I'll want to use again and again, I ask Claude to turn the conversation into a skill. Then I test that skill on the next similar task, see where it falls short, and revise. The citation-verification skill I used here went through several rounds of that process and is now about 2,000 words. It tells Claude how to handle each citation, what verdicts it can give (SUPPORTED, PARTIAL, UNSUPPORTED, and so on), what to do when a source is inaccessible, and how to structure the second agent's independent review.
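To make that concrete, here's the flavor of what the skill file contains. This is a heavily condensed paraphrase, not the actual text (the full skill runs about 2,000 words):

```
For each citation in the document:
1. Open the cited source itself; do not rely on search snippets.
2. Find the specific passage that supports or contradicts the claim.
3. Assign exactly one verdict: SUPPORTED, PARTIAL, UNSUPPORTED,
   INACCESSIBLE, or HALLUCINATED SOURCE.
4. If a source cannot be opened, mark it INACCESSIBLE. Never invent
   or reconstruct a quote to fill the gap.

After the first pass, launch a second agent that re-verifies every
verdict from scratch, without reading the first agent's reasoning.
```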

Think of a skill as an expert playbook that sits ready for when you need it. Cowork is the environment where the playbook actually gets used.

How I tested it

Four tools, six test runs in total. Eleven errors in the document for the tools to catch: nine I planted deliberately, plus two I'd already found in the original document when I went through it by hand.

For Cowork, I just asked it to run the citation-verification skill. All the instructions are built into the skill itself, so nothing else was needed.

For ChatGPT and Perplexity, I ran each one twice, with two different prompts.

The rigorous prompt was about 250 words. It asked the tool to return a verdict and supporting passage for every citation, using specific categories (SUPPORTED, PARTIAL, UNSUPPORTED, INACCESSIBLE, HALLUCINATED SOURCE), and instructed it not to guess.
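Paraphrased and heavily shortened (the original ran about 250 words), the rigorous prompt looked roughly like this:

```
For every citation in the attached document, open the cited source and
report (a) one verdict (SUPPORTED, PARTIAL, UNSUPPORTED, INACCESSIBLE,
or HALLUCINATED SOURCE) and (b) the exact passage from the source that
justifies it. If you cannot open a source, mark it INACCESSIBLE rather
than guessing.
```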

The basic prompt was a single sentence: "Check the accuracy of the citations in this document." I wanted to see what happens when you don't put careful thought into the prompt, which is how most people prompt most of the time, me included.

Claude chat (no skill) got the basic prompt only, to see what the underlying Claude model catches on its own, with no scaffolding around it.

Before running any of the tools, I went through all 35 citations by hand. That gave me my own answer key to compare each tool's results against.

Full results

Cowork with the skill:

  • 10 of 11 caught (91%), 0 fabrications

Rigorous prompt:

  • ChatGPT: 5 of 11 caught (45%), 3 fabricated quotes

  • Perplexity: refused the task

Basic prompt:

  • Claude chat (no skill): 3 of 11 caught (27%), 0 fabrications

  • ChatGPT: 1 of 11 caught (9%), 0 fabrications

  • Perplexity: 0 of 11 caught (0%), 0 fabrications

Three things worth your attention

The skill outperformed every other setup, by a lot. The same Claude model, given a simple one-sentence prompt without the skill, caught 27% of the errors. With the skill running in Cowork, it caught 91%. Part of the difference is the skill itself, which is a much more detailed set of instructions than any user would write on the fly. Part of it is what Cowork adds: the second agent caught errors the first agent missed, and when a source couldn't be fully opened, the workflow flagged the verdict as uncertain rather than inventing a quote to match the claim.

If you're doing any task where accuracy matters and you'd benefit from a repeatable workflow (literature reviews, document checking, structured research, fact-checking), a good skill running in Cowork is probably the best option.

ChatGPT caught more errors under a careful prompt. It also made things up. Under the rigorous prompt, ChatGPT caught 5 of 11 errors, versus 1 of 11 under the basic prompt. But it also fabricated three verdicts: confident SUPPORTED labels backed up by quoted passages that don't actually appear in the source. For example, I planted a claim that a tool correctly interpreted "94.15% of negative mRDTs" when the real figure in the source was 97.12%. ChatGPT's "quote from the source" read: "96.38% of positives and 94.15% of negatives correctly interpreted." That quote isn't in the source. It's a rewording of my planted (wrong) claim. The rigorous prompt's instruction to quote exact passages pressured ChatGPT to produce evidence even when it couldn't reach the source. The more careful prompt gave me more of what I wanted, and more of what I didn't.

Perplexity refused, and caught nothing when it didn't. Under the rigorous prompt, Perplexity declined the task. Under the basic prompt, it engaged but caught zero errors.

Perplexity has a reputation for being the best AI tool for factual, source-based answers. I've defaulted to it as my first-line fact-checker on that assumption. These results make me question it.

Two things got in the way. When you upload a document, Perplexity doesn't actually read the whole thing. It breaks the document into pieces and only looks at the ones it thinks are relevant to the question you ask. That's fine for a search-and-summarize task, but it breaks citation checking, where every citation matters. Even when I pasted the document text directly into the prompt (which gets around the upload problem), Perplexity couldn't open the full content of most of the cited sources. Rather than guess from search snippets, it refused to give verdicts on those citations. It's good to know Perplexity won't make things up, but a refusal isn't usable output either.

Perplexity is good at search. It isn't built for checking specific claims against specific sources.

Takeaways

  • It's worth the effort to build effective, comprehensive Claude skills for complex tasks. This test is one example of what that investment buys you.

  • Claude Cowork is impressive. Use it for complex tasks.

  • A simple prompt like "check these citations" probably won't give you a great result.

  • Be careful: a more detailed prompt can have the counter-intuitive effect of pushing a tool to hallucinate. If you ask for specific formatted output like quoted passages, the tool may produce that output even when the real version isn't there.

I've included the citation-verification skill as a download with this post. If you try it on your own documents and it fails or surprises you, I'd like to hear about it.

To use the skill:

  1. Download the Claude desktop app if you haven’t already, and sign in.

  2. Download the skill to your computer.

  3. Double-click the file. It should open in the Claude desktop app.

  4. Click “Add to library” in Claude.

  5. Tell Claude to use the skill next time you want to verify citations in a document.
