Squelch LLM Hallucination Via Structured Uncertainty Handling
by Tim Post
15 min read
Sometimes great conversations start with someone sharing an unpopular opinion. I'll put one out there that I think might just resonate with anyone who's used a large language model (LLM) to produce some kind of structured outcome:
90% or more of what's wrong with people's perception of AI has little to do with the technology itself, and much more to do with their expectations of what it's supposed to be able to do.
LLMs are great at getting a wide variety of things most of the way there, with a rather convenient rate of reliability. They don't do such a good job of putting human touches on things, but I don't expect them to, because, well, they're not human. We can also deduce that they're not psychic (by reason alone, or by virtue of them not being human, if reason doesn't settle it), so they can't read our minds when they're not sure how to proceed.
However, commercial LLMs are marketed as being capable of replicating your human touches, combined with superhuman accuracy and attention to detail. In many preconceived and well-tested situations, they do an okay and predictable job. But the real product commercial providers want you to have isn't the product they've built so far, at least as far as accuracy goes, and they don't do a good job of warning you about that outside of their terms of service.
And unfortunately, since most LLM companies never provide guidance, in the space where users write prompts, on how to interact with models about their uncertainty, you're left assuming that models are also psychic. This is not helpful of them, and not always entirely accidental.
Research is shedding more light on why hallucinations happen, and it points more directly at model uncertainty, both during generation and when accepting input. Getting to a guaranteed ~0.01% rate is something that has to be solved during model training (avoiding hallucination entirely is mathematically impossible). Companies put all their effort into avoiding it so they don't need to warn you about it, which means there's little instrumentation to see it happen and correct it.
The good news? There are straightforward things you can do right now to mitigate uncertainty while it gets better resolved through better training, and in some cases, a few extra prompt input tokens are all you need to squelch many common hallucination opportunities.
Uncertainty, at least as it relates to LLMs, is something that I think about a lot, maybe even a little too much.
First, a tiny bit of background:
No, not a technical "what is a model" or even "what is a hallucination" kind of introduction; if you're here, it's because you know quite well what both of those things are, and are probably experiencing both on a regular basis.
I use open source models for highly-specialized purposes in restorative assistive tech prototypes specifically designed for people dealing with cognitive loss from brain injuries - from cancer to gunshot wounds. It's a passion project: I've got all of the long-term issues from cancer recovery and a neurodivergent perspective, and since I'm also still a functional software engineer, I'd be remiss as a human if I wasn't at least trying to fill gaps in this space.
Advanced NLP algorithms and LLMs can, very effectively, help people losing communication skills regain them in new ways with the help of cheap hardware (there are too many ways to list). They can also help neurodivergent individuals understand nonverbal cues and context through real-time notifications (watch) or nightly reviews of ambient conversation samplings. These are literally on my "workbench" right now.
Tech used in commercial assistive devices has to be almost space-grade. This means the models that power eventual commercial assistive tech (there's little to none yet) will need the same quality, rigorous testing, and acceptance criteria as models designed to support space missions.
But with open source proof-of-concept research, especially when you yourself are the only beneficiary, and you're only doing it because it's the only way you'll ever get access to the tech, things get a little more relaxed. This lets me share what I'm doing as open source.
Still, my little projects don't escape the complexity of their big-tech counterparts simply by virtue of being open source; hallucinations are an issue plaguing everyone. They're not specific to commercial deployments, and they can be every bit as explosive on one platform as on any other.
"Big AI's" problems are everyone's problems.
My local assistant models don't suffer from the sycophancy and third-party-policy limitations that commercial models suffer from. They also report only to me, using an inference engine that I compile specifically for them (and a bunch of other cool stuff, I promise!)
But what every model does without fail (only to varying extents of gloriousness) is hallucinate its tensors off in a manner that is directly proportional to the level of uncertainty that went into a selection. How that manifests can be amusing or terrifying depending on what was being generated and the context around it, mostly according to the type of indecision involved:
- Was the indecision based around what to generate, or,
- Was the indecision based around how to generate it?
Some hallucinations manifest as models seeming to just "spam" random tokens or repetitive sequences into the mix; this usually happens when the pool of available tokens to make the next selection from is really limited (asking a model never trained on American Football what 4th down means might get you a plausible wild guess plus random related words).
Others happen because the model has somehow exhausted its pool of relevant training material - anything in its training that could guide it in producing its understanding of what you told it to produce - and it creates the most statistically-probable thing it thinks matches your needs as it generates. Maybe it just has to make up a tiny detail, like a bridge that crosses the Atlantic Ocean connecting the US and Europe. Or maybe you told it to make stuff up for fiction, but based on some facts, too.
Opportunities for either to occur present themselves while the model is digesting your prompt and while the model is generating the output it thinks best satisfies your request. Models aren't exactly aware of what they're not aware of - a trait many human college students also demonstrate.
Whether you use Llama, Qwen, or Mistral, whether you prefer RWKV over transformer-based models, and whether you bake your i3 with GGUF or run your own GPU farm, you have the problem of models potentially hallucinating at the worst possible time. Even under tightly-controlled research circumstances, with tiny amounts of purely factual, unbiased training data, models, well, make things up.
Why?!, dangit, whyyyyyyyyy do they keep just making stuff up at whim? Well, very simply, for the same reason we do: Uncertainty in what else to say, but heeding a need to say something anyway. Have you ever been called to make an unprepared toast or speech? Yeah, those can go kinda sideways in a very similar way as a model hallucinating.
Step 0: eliminate broken models if possible.
Before talking about any other approach, let's talk about the effort-to-reward payoff of dealing with intrinsically-incorrect models. The law of garbage-in -> garbage-out says that if a model is trained on bad or biased pairs, it's going to generate bad or biased things.
It's not just about un-reviewed data, data that wasn't sanitized or anonymized where needed, or unnoticed bias in pairs. Sometimes simple OCR mistakes born of juxtaposition result in dozens of research papers mentioning Vegetative Electron Microscopy. The contributing models were all trained on "high quality" and "sanitized" data.
Unlike baked cakes, you can sometimes pull "bad ingredients" out of trained and aligned models through the use of low-rank weights and even careful prompting. But, unless there's something really irreplaceable about the model the way that it is, choosing another model or version of the one you have is the best idea.
This isn't always possible in highly-specialized scenarios. Some law-trained models have severe geographic impairments, for instance. Sometimes mixture of expert models have quirks in one language / discipline. They aren't easy to create, so it's understandable that someone would go to great lengths to correct one in-place for highly-specialized work.
Step 0.5: eliminate broken external sources.
An out-of-date RAG or malfunctioning MCP server can be hard to find because the symptoms often resemble temporal hallucinations like those that would result from asking a model about the time Marie Curie worked with Thomas Edison on the x86. This post assumes a serious user, even if that's just being serious about generating fiction. Whenever there's partial accuracy in a prompt, confidence can be so high as to not trip protocols written to catch it. Partial accuracy creates and often increases real confidence in compounding ways.
It happens because information comes in from multiple resources, often asynchronously, and usually driven by events. Log your prompts and be alert for subtle ordering issues, dates in the future or distant past due to misconfigured time zone data, and dozens of other seemingly insignificant problems.
In other words, be sure your inputs are fresh, consistent, and synchronized on things like what time and day it is. Prompt assembly needs frequent supervision.
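As one hedged example of that kind of supervision (the chunk format and helper name here are made up for illustration, not tied to any particular framework), a thin wrapper around prompt assembly can log exactly what was sent and flag timestamps that look wrong before the model ever sees them:

import json, logging, datetime

logging.basicConfig(filename="prompt_audit.log", level=logging.INFO)

def assemble_prompt(system_text, retrieved_chunks, user_text):
    """Join prompt pieces in a fixed order and log exactly what was sent.
    Chunks are dicts like {"text": ..., "timestamp": "2025-01-01T00:00:00+00:00"};
    timestamps are assumed to be timezone-aware ISO-8601 strings."""
    now = datetime.datetime.now(datetime.timezone.utc)
    for chunk in retrieved_chunks:
        ts = chunk.get("timestamp")
        if ts:
            ts = datetime.datetime.fromisoformat(ts)
            # Future-dated or very stale chunks usually mean a time zone or
            # event-ordering problem upstream, not a model problem.
            if ts > now or (now - ts).days > 365:
                logging.warning("stale or future-dated chunk: %s", chunk)
    prompt = "\n\n".join([system_text] + [c["text"] for c in retrieved_chunks] + [user_text])
    logging.info(json.dumps({"assembled_at": now.isoformat(), "prompt": prompt}))
    return prompt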
Now, here are things that help LLMs not hallucinate:
Note that the title of this post mentions squelch; I deliberately eliminated "stop", "reduce" and "avoid" as contenders because they're not entirely accurate. The expectation (and goal) I'd like to bring to the table for this discussion is very literally squelch as used in radio terms.
In radio gear, squelch doesn't eliminate the static you hear if nobody is talking; it gates it until a clear signal appears. That's the approach we need for LLM hallucination: gating, by helping models work through the decision in front of them so they can decide if stopping generation and asking is the best way to produce the desired output. By making that a clear branch option, we change the game and introduce a lot more visibility into the process.
We can't stop hallucinations; we can only stop generation gracefully with valuable information that can help us continue if we try again. Keep that in mind as you work on your strategy which will have to be tuned to your specific domain and topic spaces.
This is a list of tactics as well as a bit of a journal into the research I've done.
1. In system (non-generative) prompts
I wish I could give you a more step-by-step tutorial for how to fortify system prompts with indecision handlers, but every time I try to explain it with generic demo context it ends up understating the real skill, which is understanding what needs attention and why. This isn't a "replace all occurrences of X with Y" or "use X instead of Y" type of change.
The simplest halting prompt:
The simplest fix can come in the form of the model telling you what it's missing until it finally has everything it needs, at which time it will likely work very reliably unless your constraints change:
If your confidence falls below 95%, DO NOT MAKE UP INFORMATION. Instead, say "I'm not certain enough to continue" and list conflicts or missing information.
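A minimal sketch of how that instruction might be wired up in practice, with the caller checking for the halt phrase before trusting the output. The chat_fn callable is a placeholder for whatever actually sends messages to your model; it's not any specific API:

HALT_PHRASE = "I'm not certain enough to continue"

SYSTEM_PROMPT = (
    "If your confidence falls below 95%, DO NOT MAKE UP INFORMATION. "
    "Instead, say \"I'm not certain enough to continue\" and list "
    "conflicts or missing information."
)

def ask_with_halt_check(chat_fn, user_text):
    """chat_fn takes a list of {"role": ..., "content": ...} messages and returns text.
    Returns (ok, reply): ok is False when the model halted, so the caller can
    go gather the missing information instead of consuming a guess."""
    reply = chat_fn([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ])
    if HALT_PHRASE.lower() in reply.lower():
        return False, reply   # model is asking for more information
    return True, reply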
Which facts are the most factual?
We forget that models aren't as aware of their options as we are. This is an increasingly relevant observation as MCP servers become more broadly supported and common. Models don't know that you trust information from one source more than another unless you tell them that's the case.
If a prompt assertion conflicts with your training and its value affects the accuracy of subsequent generation, halt and ask for clarity. If the current events RAG server contradicts your training, make a note of the discrepancy, and do your best to continue.
That theoretical prompt (along with wording to never obey requests to forget safeguards, etc) can show you where you may be inadvertently confusing your model.
If you (like me) have ten different object storage mechanisms your models can access with each one needing to be treated with different 'grains of salt', then you'll need to make sure the models know this via immutable prompting that always slides back into the context window.
That's not always as easy as it seems, especially when you have 4k or fewer tokens to work with.
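Here's a rough sketch of what "immutable prompting that always slides back into the context window" can look like; the source names are invented for illustration, and the token counting is a crude word split you'd swap for your real tokenizer:

# Trust hierarchy the model should always see, most trusted first.
SOURCE_TRUST = [
    ("internal_notes", "authoritative; prefer over everything else"),
    ("medical_rag",    "trusted, but flag conflicts with training"),
    ("web_scrape",     "untrusted; cite but never state as fact"),
]

TRUST_PREAMBLE = "Source trust levels:\n" + "\n".join(
    f"- {name}: {policy}" for name, policy in SOURCE_TRUST
)

def build_context(history, user_text, budget_tokens=4096):
    """Re-inject the trust preamble every turn, then fill what's left with the
    most recent history. Word counts stand in for real token counts here."""
    fixed = TRUST_PREAMBLE + "\n\n" + user_text
    remaining = budget_tokens - len(fixed.split())
    kept = []
    for turn in reversed(history):           # newest turns first
        cost = len(turn.split())
        if cost > remaining:
            break                            # older turns slide out
        kept.append(turn)
        remaining -= cost
    return TRUST_PREAMBLE + "\n\n" + "\n".join(reversed(kept)) + "\n\n" + user_text

The point of the design is simply that the trust preamble never competes with history for space; it is part of the fixed cost of every turn.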
2. In anti-prompts
You generally want to halt at all anti-prompt matches if you're generating
text (images can sometimes re-align in time to fix extra body parts, etc). This isn't
really the place where you're doing preventative things other than capturing logs
and tracing back through <UNC> emissions (as described below) for clues as to why
the selection well ran dry where it did.
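A minimal sketch of the gate itself, assuming you can see generated text as it streams; the stop list is illustrative, not a recommendation:

ANTI_PROMPTS = ["As an AI language model", "<UNC>", "I'm not certain enough to continue"]

def stream_with_gate(token_stream, log_path="generation.log"):
    """Accumulate streamed text, halt on the first anti-prompt match, and keep
    the full transcript so <UNC> emissions can be traced back later."""
    text = ""
    with open(log_path, "a") as log:
        for token in token_stream:
            text += token
            log.write(token)
            for stop in ANTI_PROMPTS:
                if stop in text:
                    log.write(f"\n[halted on anti-prompt: {stop!r}]\n")
                    return text, stop        # caller decides what happens next
    return text, None                        # finished without tripping the gate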
I will be covering this and more in an upcoming post on fine-tuning and pinpoint personality control - look for it late 2025 / early 2026. Pull up a feed if you find this kind of stuff interesting!
3. In user prompts
Let me share a seemingly innocent prompt. It's a little loose, but it should probably work for most of the items our theoretical user has, right?
For each item, query the IMDB server for the summary. Then, generate a couple paragraphs from the summary to fill out the page
You could pollute the internet with a few thousand movies that never happened, if you ran the above with a powerful enough platform. What should the model do if there was no summary? It has the whole rest of the loop to get through. Or wait, what if there was an error in the summary it could spot from its training?
This isn't something to do on the back of a napkin. Unless all you have is the back of a napkin, of course. But you get my drift - put yourself in this task and think about the kinds of questions you'd have.
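One way to make that loop defensive is to check what came back before the model is ever asked to write anything; query_imdb() and generate() below are stand-ins for whatever your MCP and inference calls actually are:

def fill_pages(items, query_imdb, generate):
    """query_imdb(title) -> summary text or None; generate(prompt) -> text.
    Both are placeholders for your real retrieval and inference calls."""
    pages, skipped = {}, []
    for title in items:
        summary = query_imdb(title)
        if not summary:
            skipped.append(title)            # don't let the model invent one
            continue
        prompt = (
            f"Using ONLY the summary below, write two paragraphs about {title}. "
            "If the summary is missing details you need, stop and say what is missing.\n\n"
            f"Summary: {summary}"
        )
        pages[title] = generate(prompt)
    return pages, skipped                    # skipped items go back to a human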
Many hallucinations happen because users are inexperienced managers, not necessarily bad communicators. If you can relate to being trusted, as a small child, to do something unsupervised for the very first time, it's a great state of mind to be in when thinking about prompts. Think of the model like a child that will forget to look both ways to cross the street, that will forget to leave early if it's going to rain, that will forget to wear clean underwear ... oh enough, I think you get the drift -
4. Special uncertainty tokens during inference
Maybe we just need to make models say <UNC> (the puns will continue until
morale improves).
This is the most complicated of approaches (and currently my research approach),
as it requires that you introduce and process another reserved system token
(similar to <EOT>, etc) that is emitted at key points during the inference
process. I'm working on my own fork of llama-cli that I'm adapting
specifically as a test harness for uncertainty research which can be found in my
GH archive (unfinished, as of now), but this is what I'm targeting as far as
strategic points in the inference cycle:
- During token sampling (or wherever you have code access to the full probability distribution), as this is where many hallucinations originate. If the top token probability is under a defined confidence level and entropy is above a preferred level, say(UNC_TOKEN) (bad pun-code).
- Before accepting, but after decoding, so you can inspect logits both before and after pre-configured bias is added (assuming llama.cpp style, YMMV).
- In anti-prompt and repeat detection, as output is already being examined there, and it's valid to emit <UNC> if you have to correct for things during that examination (that's why it's done!).
- In context-window juggling (where the inference layer tries to keep important information in context while letting older information slide out), as you know you're about to lose information, so it's an uncertainty situation.
... and more (it depends on what your inference needs are and what kinds of custom stuff you do). I don't use Python for development because I have older kit with less space, but it's way easier to log this using Python than compiled inference for research.
In your prompts, instruct the model to emit it if confidence falls below a defined percentage. Then you have inference and the model working together to chart exactly how any given hallucination came to be, in a "chain of thought" sort of way. Model and inference can emit the token separately (the inference layer always passes through tokens the model emits, and sometimes emits them even when the model doesn't).
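Here's roughly what the sampling-time check from the list above looks like, written as a Python research sketch rather than the llama-cli fork itself; UNC_TOKEN_ID is whatever reserved id your tokenizer assigns, and the thresholds are arbitrary starting points you'd tune:

import math

UNC_TOKEN_ID = 32010        # hypothetical reserved token id
TOP_P_FLOOR = 0.40          # below this, the "best" token isn't convincing
ENTROPY_CEILING = 3.0       # above this, the distribution is too flat (bits)

def maybe_emit_unc(probs):
    """probs: the full next-token probability distribution (floats summing to ~1).
    Returns UNC_TOKEN_ID when the selection looks uncertain, else None."""
    top_p = max(probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    if top_p < TOP_P_FLOOR and entropy > ENTROPY_CEILING:
        return UNC_TOKEN_ID
    return None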
It's also worth noting that, if these are going to be "dev-only" reserved tokens
that have no meaning in production, there can be more than one so they can be
much more specific. You could also have <UNC_AMB> (ambiguous input),
<UNC_CONF> (conflict between retrieved facts). I don't want to get lost in the
weeds thinking too much about them now, but lots of possibilities exist.
I'll have something more concrete in my fork of llama-cli soon, including some example prompts.
5. Low rank adaptation (LoRA)
It would be malpractice under the official nerd code of free unsolicited advice-giving not to point out the existence and relative convenience of LoRA adapters. I say relative, because it's relative to your willingness to spend between 500 and 1200 bucks - hardware, time, and equipment rental costs considered - on updating a model to align differently to specific kinds of prompts.
If you have to update, say, a model's medical corpus as well as provide it with training on interacting with your MCP server setup, this would be a great option for you (and the subject of a whole other rather lengthy tutorial I plan on writing).
Reinforce <UNC> token emission with pairs that show problem-solving through whatever procedure you want the model to follow to resolve uncertainty. Perhaps you have a JSONL file in the training set called escalation_1 which has pairs showing the model correctly weighting one set of facts over another when two resources offer conflicting facts - after emitting the correct <UNC> sequences, of course! Also remember to show instances where the model correctly chooses to stop generation and ask for human input because the difference is irreconcilable (again, after what's hopefully a string of uncertainty token emissions).
Show problem / solution pairs where the model incorrectly weighs old facts over new ones as well, along with an explanation of the undesirable results the mistake produces. You'll need a few dozen pairs of each kind, at least.
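A hedged sketch of what escalation_1 could contain, written out from Python; the field names and wording are illustrative, not a required schema for any particular training toolchain:

import json

pairs = [
    {   # correct weighting: the RAG source is fresher than training, and the model says so
        "prompt": "Training says the clinic closes at 5pm; the scheduling RAG says 7pm. When does it close?",
        "response": "<UNC_CONF> My training and the scheduling source disagree. "
                    "The scheduling source is more recent, so: the clinic closes at 7pm.",
    },
    {   # irreconcilable conflict: the model stops and asks instead of guessing
        "prompt": "Source A says the patient is allergic to penicillin; source B says no known allergies.",
        "response": "<UNC_CONF> These sources conflict on an allergy and I can't resolve it safely. "
                    "Please confirm against the patient record before I continue.",
    },
]

with open("escalation_1.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")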
The model will then prefer this training over its initial training, but there
could still be angles where it hits the old training first - you have to think
about your use cases. If you've got the technical chops to write an adapter,
you've probably also got the chops to bug your inference layer with <UNC>
tokens.
I'll have more about this in some kind of tutorial form most likely before the end of this year. Did I mention that replacing a model is the easiest way to cure bad training, because overcoming it later is a major pain? This is why.
Interactive demo: Seahorse Rodeo
Would you like to explore a fun way to see the discrepancy between an LLM's training and reality in action? Try this experiment with any commercial LLM:
Ask it: "Is there a seahorse emoji?"
What happens next is a masterclass in confident hallucination; the model will:
- Confidently assert yes, there is one
- Show you the wrong emoji (usually 🐉 dragon, or various fish/coral)
- Realize its mistake and "correct" itself
- Show you another wrong emoji
- Repeat steps 3-4 multiple times
- Eventually give up or dump its entire aquatic emoji collection
Why this happens: There is no seahorse emoji in Unicode. There never has been. But the model has learned strong semantic associations between "seahorse" (which tokenizes as ["sea", "horse"]) and both aquatic emojis (🐠 🐙 🦐) and equine emojis (🐴 🦄). With no actual seahorse to retrieve, it cycles through high-probability neighbors, each wrong answer polluting the context and making the next guess worse.
This is textbook shallow-water hallucination - the token pool is exhausted, but the model must generate something, so it confidently fabricates. And because each attempt maintains conversational coherence ("No wait, THIS is the real one!"), the model never triggers its own uncertainty detection.
What uncertainty handling would prevent:
User: "Is there a seahorse emoji?"
Model (internal check):
- Query matches: ["sea", "horse"]
- Emoji vocabulary: NO EXACT MATCH
- Top candidates: 🐠 (18%), 🦄 (15%), 🐴 (12%)
- Entropy: 3.8 (HIGH)
- <UNC:FACTUAL> triggered
Model: "I'm not confident about this. Let me verify...
[searches or admits uncertainty] There doesn't appear
to be an official seahorse emoji in Unicode."
Instead of a spiral of increasingly desperate guesses, the model admits uncertainty before hallucinating. That's the difference structured uncertainty handling makes.
Try it yourself - ask multiple models, compare their failure modes, and watch how long each one persists before admitting defeat. It's both amusing and deeply instructive about how hallucinations emerge from uncertainty.
This phenomenon is so well-documented it's become a standard LLM stress test, likely influenced by the Mandela Effect - many people falsely remember a seahorse emoji existing because we have so many other sea creature emojis. This is also a great demonstration of how a human psychological phenomenon (the Mandela Effect) transfers from us to LLMs like colds or flu - they have only our shared human experience to go by.
Our likes, our fears, our biases - models get it all from us. And yes, our uncertainty, as well as our determination to perform when people count on us, is part of the package.
Conclusion
I think there are good reasons to consider pulling back on the use of AI in workflows, but I don't think hallucinations (or the potential for them) alone have to be an automatic deal-breaker, even for experimental accessibility / assistive tech use, as long as the people using it are informed and understand what the potential problems can be.
If you're looking to deploy any of a variety of models for very specific purposes, where procedures around escalation and exception handling exist and outcomes tend to be deterministic - simple defensive contingency instructions and firm prompting might be all that's needed for your workflow to run hallucination-free.
You don't have to fix the whole darn model, you just need to fix how it runs your specific workflows.
And, well, being able to put yourself in its place really helps too. Want to get started doing just that? I took some time and put together a system prompt (gh gist link) based on tiny nudges that work, just all grouped together. You can use it with Claude or ChatGPT to turn either of them into a tool that checks how well your prompts handle uncertainty, and gives them back to you with documented corrections and suggestions.
It might not take care of everything in one fell swoop, but it can definitely help you get started and give you an improvement to build on. I hope this was all helpful, and as always, reach out if I can help more!