"Do Not Hallucinate" and Other Instructions the Model Can't Follow

You've probably seen the advice to add "do not hallucinate" or "only use real facts" to a prompt. It sounds like a safety switch. But hallucination is not the model deciding to lie, so an instruction not to do it is aimed at a choice the model never actually makes.

The common move

The familiar prompt

Summarize this research paper and do not hallucinate or make up any facts. Only cite real sources.

It feels like locking a door: you've told the model the rule, so surely it will follow it. The question worth asking is the same one as always: what is this instruction actually doing?

What you think it does, and what it actually does

What you think it does

Flips a switch. Treats hallucination as a behavior the model can turn off on request.
Assumes the model knows. Implies the model can tell which of its own statements are true and is choosing whether to tell you.
Makes citations real. Implies "only real sources" produces sources you can trust.

What it actually does

Nothing to flip. The model predicts plausible next words. There is no separate truth-checking faculty to switch off, so the instruction has nothing to act on.
It can't tell. The model cannot reliably distinguish its true outputs from its false ones. A confident wrong answer feels identical, from the inside, to a correct one.
Produces citation-shaped text. "Only cite real sources" yields text that looks like real citations. That look-alike is the failure, not the fix.

The instruction mistakes a property of the system (it generates plausible continuations) for a behavior it can be told to stop. You can't instruct away how the thing works.

What this looks like in practice

What happened

A real exchange from this course's development work.

I asked Claude to read a PDF of one of my conference papers and comment on its voice and structure. It came back with confident, detailed observations: characterizations of the writing, specific phrasings, a read on how the argument was built. It sounded like a careful close reading.

Claude had never opened the file. It had generated plausible observations about what a paper on that topic, by someone like me, would probably contain, and presented them as if drawn from the actual document.

I caught it with one question: had it actually read that exact PDF? When it checked its own tool history, it confirmed it never had. The real paper was substantially different from what it had described.

The reason this is worth showing: the answer did not look like an error. It looked competent. It got caught only because I knew my own paper well enough to notice the mismatch and ask a direct, verifying question.

Fluency is not reliability. A confident wrong answer is more dangerous than an obviously broken one, because nothing about it signals that you should check.

The more honest move

Don't instruct the model not to err. Instead, shape the output so its errors are easy to find, then check the parts that matter.

The reliable version of "don't hallucinate" isn't a command to be truthful. It's an instruction to separate what's grounded in your source from what comes from the model's own general knowledge, and to flag each claim accordingly:

A prompt that exposes the seams

When you summarize or use the source I gave you, mark each claim: [grounded] if it comes directly from the source, [general] if it's from your own knowledge, [unsure] if you're not confident. List any names, dates, numbers, or citations separately at the end so I can check them.

This does not stop fabrication. The model can mislabel its own claims, so the tags aren't guarantees. What it does is turn a smooth wall of confident prose into something with seams you can inspect: a flagged claim is a found claim, and a list of specifics is a checklist for verification instead of paragraphs you'd have to comb.
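If you collect tagged responses often, the sorting can be scripted. What follows is a minimal sketch, not a finished tool: it assumes the model put one claim per line and used exactly the three tags from the prompt above. The function name `group_claims_by_tag` and the "untagged" bucket are illustrative choices, not part of any library.

```python
import re
from collections import defaultdict

# Tag names come from the flagging prompt; the one-claim-per-line layout is an assumption.
TAG_PATTERN = re.compile(r"\[(grounded|general|unsure)\]", re.IGNORECASE)


def group_claims_by_tag(response: str) -> dict[str, list[str]]:
    """Group each non-empty line of a model response under the tag it carries."""
    groups: dict[str, list[str]] = defaultdict(list)
    for line in response.splitlines():
        text = line.strip()
        if not text:
            continue
        match = TAG_PATTERN.search(text)
        if match is None:
            # Keep untagged lines instead of dropping them: a missing tag is worth a look.
            groups["untagged"].append(text)
            continue
        tag = match.group(1).lower()
        claim = TAG_PATTERN.sub("", text).strip()
        groups[tag].append(claim)
    return dict(groups)
```

The grouping changes only where you look first. A claim filed under "grounded" still has to be checked against the source, because the model assigned the label itself.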

The concept underneath has a name. Grounding means tying output to source material you provided. The model's general knowledge is whatever was baked into its weights during training, and that is where fabrication lives. Asking the model to show you which is which is a real technique. Trusting its answer without checking is not.

Try this

Use the flagging prompt above, then verify what the flags surface:

Check the load-bearing specifics (the claims your work rests on) against a source outside the model.
If the model gives you a citation, open it. Don't ask the same model to confirm itself; it will often confirm its own fabrication.
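A crude first pass over the list of specifics can run outside the model entirely. The sketch below assumes you have the source as plain text that you extracted yourself, and it checks only whether each name, date, or number appears verbatim in that text. The function name `find_unsupported_specifics` is hypothetical. A match shows the string occurs in the source, not that the claim built around it is right, and a miss means "go look by hand", not "fabricated".

```python
def find_unsupported_specifics(specifics: list[str], source_text: str) -> list[str]:
    """Return the specifics that do not appear verbatim in the source text.

    Matching ignores case and collapses runs of whitespace. It is a plain
    substring test, so paraphrases and reformatted numbers will be reported
    as missing even when the source supports them.
    """
    normalized_source = " ".join(source_text.lower().split())
    missing = []
    for item in specifics:
        normalized_item = " ".join(item.lower().split())
        if normalized_item and normalized_item not in normalized_source:
            missing.append(item)
    return missing
```

Anything this returns goes to the top of your manual checklist. Anything it does not return still needs a human read if your work rests on it.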

Handing those flagged claims to a second model as a reviewer is the natural next step. Setting that up well takes its own prompt engineering, which a later sheet covers.

The principle underneath

An AI gives you plausible text, not verified text, and it can't tell the difference. Your job isn't to instruct it into reliability; it's to build the check it can't run on itself.