The human test
When "pretending you know nothing" shows you know a lot more than you're letting on
If a state-of-the-art model gives you a dumb response, it’s your fault.
I’ve said this before. I’ve written it before. I believe it more every month. And yet I keep watching teams, including my own, blame the model for failures that are obviously prompt failures. Obviously. In hindsight. Which is the problem: in hindsight.
So I want to propose something concrete. A test. I call it the Human Test, and it’s embarrassingly simple.
The test
Take your prompt. The exact prompt. No modifications. No additional context. No “oh but the model also knows about...” No preamble. Just the prompt, exactly as the LLM receives it.
Give it to a colleague. A smart one. Someone who could, in theory, do the task.
Don’t tell them what project it’s for. Don’t tell them which product, which feature, which customer. Don’t tell them what you’re testing or why. Don’t give them anything that isn’t in the prompt itself.
Then ask them to do the task.
Can they? Do they give you the answer you’re looking for? Do they even understand what you’re asking?
If yes: your prompt is probably fine. If the model still fails, it might genuinely be a model limitation.
If no: the model won’t either. And you just found out without burning tokens.
The discipline
The hard part isn’t running the test. The hard part is being honest about it.
Because the moment you hand someone your prompt, you’ll want to add context. You’ll want to say “this is for the customer onboarding flow” or “the user asking this is probably a Dutch speaker” or “they’ll have already searched our knowledge base before asking this.” Your brain will itch to explain.
Don’t.
Be ruthless about this. The entire point is to simulate what the LLM actually receives. The LLM doesn’t get your mental context. It doesn’t get the Slack thread you had before writing the prompt. It doesn’t know what team this is for, what the user’s intent probably is, or what you meant but didn’t write.
It gets the prompt. Only the prompt.
If you give your colleague even a single sentence of extra context, you’ve invalidated the test. You’ve given them something the model doesn’t have, and now their success tells you nothing.
Where we got burned
We build AI colleagues at Neople. Each one gets access to tools for searching and reading knowledge bases: knowledge_list, knowledge_read, knowledge_search. A standard agentic setup.
In the system prompt, we’d describe the available knowledge: “You have access to documents X, Y, and Z.” Names. Maybe a one-liner about what each one contains. Seemed reasonable.
Then we’d watch the AI search for Dutch words in English documents. Or search for product names in a policy document. Or read the FAQ when the answer was in the troubleshooting guide. Consistently. Across different models. Across different customers.
Our first instinct, every time: the model is bad at tool use. It doesn’t understand retrieval. It’s picking the wrong source.
But if we’d done the Human Test, if we’d handed a new hire that same system prompt and said “a customer is asking about return policies, find the answer,” they would’ve been just as lost. Because we told them the documents existed. We didn’t tell them which ones are in Dutch and which are in English. We didn’t tell them which document covers which topic. We didn’t tell them the FAQ is for pre-sales questions and the troubleshooting guide is for post-purchase issues.
We assumed the model would figure out the same things we’d figure out. But we’d figure them out because we have context the model doesn’t have. We built the knowledge base. We know what’s in it. The model has a list of file names.
The fix wasn’t a better model. It was a better prompt. Language tags. Topic descriptions. When-to-use guidance for each source. The kind of context a new hire would need on their first day. Obvious in retrospect. Invisible until we ran the test.
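To make the difference concrete, here is a minimal sketch of the two styles of knowledge-source description. The file names, fields, and rendering helper are invented for illustration; they are not Neople's actual schema. The point is what the bare listing leaves out: language, topic coverage, and when-to-use guidance.

```python
# Hypothetical sketch: two ways to describe knowledge sources in a
# system prompt. All names and fields here are made up for illustration.

# The "bare" version: just file names. This is what burned us.
BARE_SOURCES = [
    {"name": "faq.md"},
    {"name": "troubleshooting.md"},
    {"name": "retourbeleid.md"},
]

# The annotated version: language tags, topics, when-to-use guidance.
# The kind of context a new hire would need on their first day.
ANNOTATED_SOURCES = [
    {
        "name": "faq.md",
        "language": "en",
        "topics": "pricing, plans, pre-sales questions",
        "use_when": "the customer has not purchased yet",
    },
    {
        "name": "troubleshooting.md",
        "language": "en",
        "topics": "error messages, setup problems",
        "use_when": "the customer reports a post-purchase issue",
    },
    {
        "name": "retourbeleid.md",
        "language": "nl",
        "topics": "returns and refunds policy",
        "use_when": "the question concerns returns; search in Dutch",
    },
]

def render_sources(sources):
    """Render a knowledge-source section for a system prompt."""
    lines = ["Available knowledge sources:"]
    for s in sources:
        desc = "- " + s["name"]
        if "language" in s:
            desc += (
                f" (language: {s['language']}). Covers {s['topics']}."
                f" Use when {s['use_when']}."
            )
        lines.append(desc)
    return "\n".join(lines)

print(render_sources(ANNOTATED_SOURCES))
```

Run the Human Test on both renderings: hand a colleague the bare version and ask them to answer a Dutch return-policy question, and they will search the wrong document in the wrong language, exactly as the model did.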
Why we don’t skip it
The Human Test feels silly. You’re handing a colleague a paragraph of text and asking them to pretend they know nothing. It feels like a waste of their time. It feels like something you should be able to evaluate yourself.
You can’t. That’s the whole point.
You can’t evaluate your own prompts because you have the curse of knowledge. You know what you meant. You know the context. You read your own prompt and your brain fills in the gaps automatically. Of course it’s clear, you think. I wrote it. I know what I’m asking.
The model doesn’t know what you’re asking. It knows what you wrote. Those are different things more often than you’d like to admit.
A colleague, stripped of context, is the closest thing you have to a fresh pair of eyes that actually processes language the way an LLM does: literally.
When to run it
Not every prompt needs the Human Test. Your one-off “summarize this email” doesn’t need peer review.
Run it when:
The prompt will run thousands of times. System prompts, agent instructions, automated pipelines. A 5% improvement in a prompt that runs 10,000 times a day is worth an hour of testing.
You’re consistently getting bad results. If you’ve tweaked the prompt three times and the model keeps doing the wrong thing, stop tweaking. Run the test. The problem is probably something you can’t see because you’re too close to it.
The task requires implicit knowledge. Anything where you think “well, the model should know that...” is a red flag. Should know based on what? If the knowledge isn’t in the prompt or the documents, it isn’t anywhere.
You’re blaming the model. This is the biggest signal. The moment you catch yourself saying “the model is just bad at this,” run the Human Test. Nine times out of ten, a human without context would be just as bad at it. The model isn’t bad. Your prompt is incomplete.
The uncomfortable truth
Prompt engineering isn’t a mystical skill. It’s communication. The same skill you need to write a clear ticket, a clear spec, a clear email. The ability to say what you mean without relying on shared context that doesn’t exist.
Most of us are terrible at this. Not because we can’t write. Because we’re so used to communicating with people who share our context that we’ve forgotten how much we leave unsaid. Every conversation with a coworker is built on a foundation of shared knowledge: the product, the codebase, the customer, the last three meetings. We take all of that for granted.
An LLM starts every conversation from zero. And the Human Test is the fastest way to feel what that zero actually looks like.
Testing prompts on humans at neople.io.



