Hi all, Nate here.

I asked five AI "advisors" to independently critique a draft of last week’s essay on evaluating AI health interventions. Each one came back with a confident, well-reasoned critique. Each one pointed to a different fundamental problem. And here's the part that stopped me: if I'd asked just one of them, I would have followed that single thread and never known the other four existed.

This is a continuation of something I've been working on (obsessing over, really) for the past year: understanding where AI gets things wrong, how to catch it, and how to build workflows that produce outputs you can actually trust. I've written about AI's factual accuracy problems in research, about the difference between web search snippets and full-page retrieval, and about building and testing a citation-verification skill. The LLM Council is another powerful method in that same vein.

What is the LLM Council?

The concept comes from Andrej Karpathy (former head of AI at Tesla and co-founder of OpenAI). His original version sends your question to multiple different AI models (ChatGPT, Claude, Gemini, and Grok) simultaneously, has them review and rank each other's responses, and then a designated lead model synthesizes the best answer.

The version I used is a skill built by Ole Lehmann for Claude. Instead of using multiple different models (which would give you genuine architectural diversity but requires more setup), it assigns five distinct analytical roles to a single model (Claude): a contrarian, a first-principles thinker, an expansionist, a practical executor, and an outsider representing the target audience. They each analyze independently, then peer-review each other, and a final synthesis identifies where they agree, where they clash, and what everyone missed.

What five advisors found in my essay

My essay argued that traditional methods for evaluating health interventions don't work for AI tools, because the AI models underneath these tools can change without warning. My proposed solution was continuous monitoring. It seemed solid to me.

The Contrarian said I was treating all evidence as equally perishable. It isn't. A study showing that clinicians make better decisions with AI support tells you something about the workflow, and that survives a model update. What expires is the specific accuracy and safety data. I was lumping durable and perishable evidence together without distinguishing between them.

The First-Principles Thinker argued that my argument quietly shifts the burden of proof from "the developer proves safety before deployment" to "we deploy first and watch what happens." That's a fundamentally different risk allocation, and I was sliding between the two without flagging it.

The Expansionist argued I was thinking too small. Continuous monitoring isn't just a defensive response to model changes. Done right, it creates a real-time learning system that generates better evidence than any traditional trial could.

The Executor costed out what "lightweight monitoring" actually requires in a district health system and showed it's not lightweight at all. Then proposed adapting supplier quality agreements from pharmaceutical procurement, where manufacturers must specify what changes they can make and what notice they must provide.

The Outsider, representing my target audience, asked the question nobody else raised: who actually does this work? "Build data collection into the clinical workflow" sounds simple until the answer is always the same overstretched health workers doing everything else.

Five analyses. All coherent. All pointing in different directions. Any one of them, delivered alone, would have felt like the answer.

Why this matters for how you use AI

Every time you ask an AI model a question, you get one confident, well-reasoned response. You follow that thread. You feel like you're getting to some "truth." But what you're actually getting is one of many possible analytical paths. The model could just as easily have emphasized a different angle or framed the problem through a different lens, and given you an equally convincing argument for a different conclusion.

Is there evidence this works?

A study by Shaikh et al. tested a "Council of AIs" (five GPT-4 instances working together through iterative discussion) on 325 U.S. medical licensing exam questions. The council achieved 97%, 93%, and 94% accuracy across the three exam stages, outperforming individual GPT-4 models. When the five models initially disagreed, the deliberation process reached the correct answer 83% of the time.

A separate study tested a multi-agent debate framework on mathematical reasoning and found 4-6% higher accuracy and over 30% fewer factual errors compared to single-model approaches. Multi-agent debate frameworks have also been developed specifically for health misinformation detection and fact-checking, with similar results.

Where the council got it wrong

Each advisor also proposed a concrete solution. Several of them were questionable. The Outsider proposed SMS-based patient feedback that a reviewer called "dangerously misleading" because patients can't judge diagnostic accuracy. The Expansionist proposed three ideas borrowed from other fields (ensemble evaluation from weather forecasting, canary deployments from software engineering, a "leapfrog" framing from development economics) that all sounded sophisticated, but broke down when considering the actual constraints of LMIC health systems.

The peer-review phase caught most logical and structural problems. But it didn't catch everything. The council is good at identifying reasoning flaws. It's less reliable at catching "this sounds new but is actually something we already do under a different name." That's where your domain expertise still matters.

When to use this

Not every question needs five advisors. A council is overkill for "summarize this report." It's powerful when being wrong has consequences: pressure-testing an argument before publishing, evaluating a program design before presenting it to a funder, reviewing a policy recommendation before it goes to a ministry.

The question to ask yourself: if there were five smart colleagues with different perspectives in the room, would this decision benefit from hearing all of them? If yes, the council is worth the time.

What I'd carry forward from this

The most important thing the LLM Council showed me isn't any single critique. It's how different the five analyses were. Five coherent, persuasive arguments, each pointing somewhere different, each one plausible enough to follow on its own.

This highlights both the benefit and the risk of using AI as a thinking partner. It will give you one good answer that you may mistake for the only answer. But structured disagreement allows you to test all assumptions. And AI gives you every different perspective at your fingertips if you choose to use it.

You can get the skill here. You just have to give the skill text to Claude (upload the document or paste the text into the chat) and tell it to create a skill based on this text. It will create the skill and you just click “Save skill.” Then whenever you want to use it, tell Claude to use the LLM Council skill, and it will launch the whole process.

Keep Reading