Published 6 May 2026 · 10 min read

The week an LLM hallucinated a political position — and a journalist nearly quoted it

An auto-generated TBMM summary said a member voted on a bill that hadn't been voted on. The journalist held the story for 24 hours. Four days of building a verification layer that can say no.

nlp, llm, civic-tech, verification

Late October, a Wednesday. A journalist who follows the NLP Parliament project emailed me a screenshot. The summary I had auto-generated for a single TBMM session claimed that a member from a minority party had voted in favor of a controversial bill. The sentence was clean and confident. The vote had never happened. The bill had not even come up that session. The member had not spoken about it. The journalist was hours away from publishing a story that cited my model. I asked him to hold for 24 hours.

He held. The story did not run with the bad sentence. The four days that followed are what this post is about.

I am going to anonymise the actors. The journalist is "the journalist." The member is "an opposition MP from a smaller party." The bill is "a controversial procurement bill." These details do not matter for the technical story. They matter very much if you happen to be the wrong MP or the wrong party.

What the summariser actually was

The TBMM session summariser took a transcript, ran it through the speaker attribution pipeline I described in the previous post, and produced two things per speaker:

A short list of topics that speaker addressed.
A short list of positions that speaker took on each topic.

Internally there was an extractive pass — sentence ranking — followed by a generative pass: Claude rewriting the highlights into clean prose. The generative pass is the part that bit me.

The pipeline used a sliding context window, because TBMM sessions are long. I left about 1,500 tokens of overlap between chunks. The idea was that a speaker whose speech straddles a chunk boundary still gets full context. It sounded sensible at the whiteboard. It was also the root cause of what happened.
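
A minimal sketch of the overlapping window described above. The post only states the overlap (about 1,500 tokens), so the function name, the `chunk_size` parameter, and the toy numbers are assumptions made for illustration, not the production chunker.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Toy numbers (hypothetical): 10 tokens, windows of 4, overlap of 2.
toy = [f"t{i}" for i in range(10)]
windows = chunk_with_overlap(toy, chunk_size=4, overlap=2)
```

Each window repeats the tail of the previous one, which is what keeps a speech whole across a boundary. Nothing in this sketch knows where one session ends and the next begins.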

How the hallucination happened

Two adjacent sessions, on consecutive sitting days. Call them A and B.

Session A: the substantive debate on the procurement bill. The opposition MP gave a long, sceptical speech about the bill's contracting clauses. He never said how he would vote, because there was no vote that day.
Session B: a scheduled vote on a different bill, plus a procedural vote that touched on procurement language.

Both sessions ended up in the same chunked context for the summariser run, because the production batch grouped sessions by week.

The 1,500-token overlap window contained two specific fragments:

From session A, the MP's name immediately followed by the topic kamu ihale yasası (public procurement law).
From session B, an unrelated speaker mentioning a vote outcome on procurement language.

The LLM completed the pattern. Here is what it produced:

The MP from [opposition party] voted in favor of the procurement bill, citing reforms to the contracting clauses.

That sentence has a subject, a verb, an object, a justification, a tone. It also has zero anchor in either session. The MP never said "I will vote." There was no vote on that bill on that day. The "reforms" were the journalist's later words, not the MP's.

It looked right. That was the problem.
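
The cross-session mixing described in this section can be reproduced on toy data. The sketch below tags every token with its session and reports which chunks contain more than one session. The `Token` shape, the function names, and the window sizes are hypothetical; the post does not describe a check like this in the original pipeline.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (session_id, token_text)


def sessions_in_chunk(chunk: List[Token]) -> Set[str]:
    """Collect the distinct session ids that appear in one chunk."""
    return {session_id for session_id, _ in chunk}


def mixed_session_chunks(chunks: List[List[Token]]) -> List[int]:
    """Indexes of chunks that contain tokens from more than one session."""
    return [i for i, chunk in enumerate(chunks) if len(sessions_in_chunk(chunk)) > 1]


# Toy weekly batch: session A followed directly by session B.
batch = [("A", f"a{i}") for i in range(6)] + [("B", f"b{i}") for i in range(6)]
# Windows of 4 tokens with an overlap of 2, as a stand-in for the real chunker.
chunks = [batch[i:i + 4] for i in range(0, len(batch) - 2, 2)]
flagged = mixed_session_chunks(chunks)
```

In this toy batch exactly one window straddles the A/B boundary. That window is the analogue of the overlap that put the MP's name next to an unrelated vote outcome.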

Why this category of failure is dangerous

Three things made it dangerous.

First, confident tone. The LLM did not say "I'm not sure" or "possibly." It said "voted in favor." Generative models are tuned for fluency, not epistemic humility. They fill in a missing verb with the most plausible one from the training distribution, and "voted" is extremely plausible in a parliamentary text.

Second, structured citation. The summariser attached a source_session_id to each generated sentence. That made the lie traceable to the wrong session — meaning a human spot-check would land on session A, see the procurement speech, and nod. Yes, this MP did address procurement. The vote claim survives the check.

Third, sampling bias in QA. We had been spot-checking summaries by sampling random sentences and reading the surrounding transcript. The check was always "is the topic right?" The check was never "did the verb actually happen?"

The journalist's screenshot was the first time anyone had bothered to verify a verb. He only did it because he was about to print it.

The naive prompt that did this

Here is the prompt that produced the hallucination. I am keeping it because it looks reasonable.

# `llm` and `join_chunks` are pipeline helpers that are not shown here.
def summarise_speaker(name, party, chunks):
    prompt = (
        "Given the transcript fragments below from one MP "
        "in a parliamentary session, produce 2-4 sentences "
        "describing what this MP said, including any positions taken.\n\n"
        f"MP: {name} ({party})\n\n"
        f"FRAGMENTS:\n{join_chunks(chunks)}\n"
    )
    return llm.generate(prompt, temperature=0.2, max_tokens=200)

The bug is not in the prompt. The bug is the assumption that the LLM will hold itself to the fragments. It will not. It has been trained on millions of news articles where an MP speaks about a bill and then votes on it. Pattern completion closes the gap unless something else stops it.

What I built: claim-level verification

Four days. Roughly.

The core idea is simple. The LLM's output gets parsed into structured claims, and every claim is re-verified against the OCR'd transcript. If a claim cannot be anchored to a specific sentence span in the source, the claim is dropped. Not softened. Dropped.

Step one: change the prompt to emit structured claims instead of free prose.

CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "claims": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "topic":        {"type": "string"},
                    "verb":         {"enum": [
                        "addressed", "criticised", "supported",
                        "questioned", "voted_for", "voted_against",
                        "abstained"
                    ]},
                    "object":       {"type": "string"},
                    "evidence_span":{"type": "string"}
                },
                "required": ["topic", "verb", "object", "evidence_span"]
            }
        }
    }
}

The evidence_span is the key. The model has to quote the supporting span from the transcript, verbatim or near-verbatim. Without a span, no claim.
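
A rough sketch of what "without a span, no claim" could look like when parsing the model's JSON, using only the standard library. The function name `parse_claims` and the rule that a blank `evidence_span` counts as missing are assumptions; the schema above only requires the field to be a string.

```python
import json
from typing import Any, Dict, List

ALLOWED_VERBS = {
    "addressed", "criticised", "supported", "questioned",
    "voted_for", "voted_against", "abstained",
}
REQUIRED_KEYS = ("topic", "verb", "object", "evidence_span")


def parse_claims(raw: str) -> List[Dict[str, Any]]:
    """Return only the claims that satisfy the schema; discard the rest."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output yields no claims at all
    if not isinstance(payload, dict) or not isinstance(payload.get("claims"), list):
        return []
    valid = []
    for claim in payload["claims"]:
        if not isinstance(claim, dict):
            continue
        if any(not isinstance(claim.get(key), str) for key in REQUIRED_KEYS):
            continue  # a required field is missing or not a string
        if claim["verb"] not in ALLOWED_VERBS:
            continue  # verbs outside the enum are rejected
        if not claim["evidence_span"].strip():
            continue  # assumption: a blank quote counts as no evidence
        valid.append(claim)
    return valid


# Toy model output: one usable claim, one invented verb, one blank span.
raw = json.dumps({"claims": [
    {"topic": "procurement", "verb": "criticised",
     "object": "procurement bill", "evidence_span": "the contracting clauses"},
    {"topic": "procurement", "verb": "agreed_with",
     "object": "procurement bill", "evidence_span": "the contracting clauses"},
    {"topic": "procurement", "verb": "voted_for",
     "object": "procurement bill", "evidence_span": "   "},
]})
parsed = parse_claims(raw)
```

Passing this check only means a claim is well-formed. Whether the quoted span exists in the transcript is a separate question.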

Step two: re-anchor every span against the OCR'd transcript.

from rapidfuzz import fuzz  # assumption: thefuzz exposes the same fuzz.partial_ratio

def verify_span(evidence_span, transcript, speaker_name, threshold=0.85):
    """Return the speaker's best-matching line, or None if nothing anchors."""
    # Restrict the search to lines actually attributed to this speaker.
    speaker_lines = [line for line in transcript if line.speaker == speaker_name]
    if not speaker_lines:
        return None
    # Fuzzy match: partial_ratio scores the best-aligned substring from 0 to 100.
    scored = [(fuzz.partial_ratio(line.text, evidence_span), line) for line in speaker_lines]
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score / 100 >= threshold else None

If verify_span returns None, the claim is dropped. Not annotated, not flagged — dropped. The summary is allowed to come out shorter than the model wanted. That is the discipline.
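
To show the drop rule end to end without a third-party dependency, here is a sketch that swaps `fuzz.partial_ratio` for a sliding-window comparison built on the standard library's `difflib.SequenceMatcher`. The scores will not be identical to the fuzzy-matching library's. The `Line` dataclass, the function names, and the English toy transcript are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class Line:
    speaker: str
    text: str


def window_similarity(span: str, text: str) -> float:
    """Best similarity (0..1) between span and any same-length window of text."""
    span, text = span.lower(), text.lower()
    if not span or not text:
        return 0.0
    if len(span) >= len(text):
        return SequenceMatcher(None, span, text).ratio()
    width = len(span)
    return max(
        SequenceMatcher(None, span, text[i:i + width]).ratio()
        for i in range(len(text) - width + 1)
    )


def anchor(span: str, transcript: List[Line], speaker: str,
           threshold: float = 0.85) -> Optional[Line]:
    """Return the speaker's best-matching line, or None below the threshold."""
    lines = [line for line in transcript if line.speaker == speaker]
    if not lines:
        return None
    scored = [(window_similarity(span, line.text), line) for line in lines]
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None


def keep_anchored(claims: List[Dict[str, str]], transcript: List[Line],
                  speaker: str) -> List[Dict[str, str]]:
    """Drop every claim whose evidence span does not anchor. No flags, no notes."""
    return [c for c in claims if anchor(c["evidence_span"], transcript, speaker)]


transcript = [
    Line("MP_A", "The contracting clauses in this bill leave too much discretion to the ministry."),
    Line("MP_B", "The motion on procurement language was accepted by a show of hands."),
]
claims = [
    {"verb": "criticised", "evidence_span": "contracting clauses in this bill leave too much discretion"},
    {"verb": "voted_for", "evidence_span": "I will vote in favour of the procurement bill"},
]
kept = keep_anchored(claims, transcript, "MP_A")
```

The invented "I will vote" quote finds no anchor in MP_A's lines, so that claim disappears and the summary comes out one claim shorter.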

Step three: verb-specific gates.

ACT_VERBS = {"voted_for", "voted_against", "abstained"}

def gate_act_verb(claim, vote_record_api):
    """Return True if the claim may be published, False if it must be dropped."""
    if claim["verb"] not in ACT_VERBS:
        return True
    # Voting verbs require an explicit row in the official vote record.
    # session_id and speaker_id are not in CLAIM_SCHEMA; they are assumed to be
    # attached by the pipeline rather than emitted by the model.
    row = vote_record_api.lookup(
        session=claim["session_id"],
        topic=claim["object"],
        member=claim["speaker_id"],
    )
    if row is None:
        return False
    # The recorded vote must match the LLM's claimed verb.
    return row["vote"] == claim["verb"]

This is the gate that would have caught the hallucination. The vote record API would have returned None for "vote on procurement bill in session A," because no such vote existed. The claim drops, the summary reads "addressed the procurement bill," and the journalist has no screenshot to send.
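
A small usage sketch of the gate against an in-memory stand-in for the vote record. `InMemoryVoteRecord`, the identifiers, and the toy rows are hypothetical. The real API is only known here through the `lookup(session, topic, member)` call shown above, and the sketch assumes recorded votes use the same vocabulary as the claim verbs.

```python
from typing import Dict, Optional, Tuple

ACT_VERBS = {"voted_for", "voted_against", "abstained"}


class InMemoryVoteRecord:
    """Hypothetical stand-in for the official vote record API."""

    def __init__(self, rows: Dict[Tuple[str, str, str], str]):
        self._rows = rows

    def lookup(self, session: str, topic: str, member: str) -> Optional[Dict[str, str]]:
        vote = self._rows.get((session, topic, member))
        return None if vote is None else {"vote": vote}


def gate_act_verb(claim: Dict[str, str], vote_record_api: InMemoryVoteRecord) -> bool:
    """Compact copy of the gate: voting verbs need a matching recorded vote."""
    if claim["verb"] not in ACT_VERBS:
        return True
    row = vote_record_api.lookup(
        session=claim["session_id"], topic=claim["object"], member=claim["speaker_id"]
    )
    return row is not None and row["vote"] == claim["verb"]


# Toy record: session B held a vote on a different bill; session A held none.
record = InMemoryVoteRecord({("B", "other bill", "mp-2"): "voted_against"})

hallucinated = {"verb": "voted_for", "object": "procurement bill",
                "session_id": "A", "speaker_id": "mp-1"}
speech = {"verb": "criticised", "object": "procurement bill",
          "session_id": "A", "speaker_id": "mp-1"}
real_vote = {"verb": "voted_against", "object": "other bill",
             "session_id": "B", "speaker_id": "mp-2"}
wrong_direction = {"verb": "voted_for", "object": "other bill",
                   "session_id": "B", "speaker_id": "mp-2"}

results = [gate_act_verb(c, record) for c in (hallucinated, speech, real_vote, wrong_direction)]
```

The hallucinated vote fails because the lookup finds no row. The speech claim passes untouched, since the gate only applies to voting verbs.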

Results

I ran the verification layer over the back catalogue. The first scan was uncomfortable.

17% of original claims were dropped because their evidence_span could not be anchored above threshold.
Of those dropped claims, about 6% — roughly 1% of all claims — were the bad kind. Claims asserting a specific verb that had never happened. Hallucinated positions, hallucinated votes, hallucinated "agreed with."
The remaining 11% were soft drops: claims whose evidence span was present but had been rephrased too aggressively by the LLM. The information was real; the wording would not anchor. We later relaxed the threshold for non-action verbs to 0.78 and recovered most of these.
Vote-verb claims dropped by the gate: 4.2% of all vote claims. Every single one was a false claim: the summariser asserted a vote that the official record did not back up.
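
The relaxed threshold mentioned above could be wired in per verb, roughly like this. The sketch assumes "non-action verbs" means every verb outside `ACT_VERBS`; the constant and function names are hypothetical.

```python
ACT_VERBS = {"voted_for", "voted_against", "abstained"}
STRICT_THRESHOLD = 0.85   # the original default used by verify_span
RELAXED_THRESHOLD = 0.78  # the relaxed value for non-action verbs


def threshold_for(verb: str) -> float:
    """Voting verbs keep the strict threshold; all other verbs get the relaxed one."""
    return STRICT_THRESHOLD if verb in ACT_VERBS else RELAXED_THRESHOLD


def passes(verb: str, score: float) -> bool:
    """True if a normalised match score (0..1) anchors a claim with this verb."""
    return score >= threshold_for(verb)
```

A rephrased "criticised" claim scoring 0.80 would now anchor, while a "voted_for" claim with the same score would still be dropped.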

Why the fix has to be at the system level

When you discover an LLM hallucination, the temptation is to fix the prompt. Add "Do not make claims that are not supported." Add "Only state facts from the fragments." I tried both. They cut the hallucination rate by maybe 30%. They do not eliminate it. They cannot.

The model is doing what it was trained to do: complete plausible sequences. Plausibility is not truth. The fix has to live outside the model, in a layer that can refuse to publish a claim. And that layer has to be willing to say no.

That sentence is the entire post: the layer needs to be willing to say no.

A summary that says nothing about a topic is recoverable. A summary that says the wrong thing about a topic, with a confident verb and a citation, is a public record problem. The worst case is not "we missed a position." The worst case is "we manufactured one."

A note on responsibility

I want to say this carefully, because this is the part where AI writing turns into preaching, and I am trying hard not to preach.

Parliamentary records are documents that journalists, researchers, students, and citizens use to make claims about elected officials. If a generative model produces a fluent sentence that ends up pasted into a news article, the model has, in effect, contributed to the public record. The model cannot be held accountable. The person who built the system can.

So I went back and read every part of the pipeline asking one question: where could this lie, and who would catch it? Wherever the answer was "nobody," I added a layer that could.

I do not think this makes the system safe. I think it makes it less unsafe. There is a difference, and the difference is the work.

Two months later, the journalist sent me a second screenshot. The story he was about to file — sourced from my model — was right this time. He had spent six hours verifying it. I told him I had spent four days building a thing that would have saved him those six hours. He paused, then typed back: "Yeah, but I still wouldn't trust it." That is the correct answer.