Published 22 March 2026 · 11 min read

Speaker attribution on noisy OCR: an evening-by-evening notebook

How a regex turned a Speaker's plea for silence into twelve confident misattributions, and the five evenings it took to climb from 95.6% to 99.1% accuracy across 850 parliamentary sessions.

nlp · civic-tech · ocr · production

The first time I ran the speaker attribution model on the TBMM session of 6 February 2023, it confidently labeled twelve of the Speaker's statements as Kemal Kılıçdaroğlu's. That session had aired three days after the earthquake. The Speaker had spent eleven minutes asking for silence so the names of the dead could be read aloud. None of those twelve sentences belonged to Kılıçdaroğlu. I knew this because I had watched the session live that night. The model did not.

I sat in front of the terminal and felt a very specific kind of small. A model that gets a normal session 95.6% right is a model that can fail catastrophically on the day the transcript matters most.

This post is the evening-by-evening notebook that took that number from 95.6 to 99.1. It is not a victory lap. It is the story of how a bad regex took five evenings to fix, and how the LLM — the thing everyone assumes is the answer — only did four percent of the actual work.

What the corpus actually looks like

The NLP Parliament project ingests TBMM session transcripts from 1995 to 2024. Once you flatten the PDFs into text, that's about 6,000 pages. The PDFs themselves are scanned and then OCR'd. Signatures bleed into the margin. Page breaks land mid-sentence. There's a particular horizontal line used to separate procedural blocks — and the OCR insists on reading it as a row of lowercase L's.

The speaker tag, in the clean case, looks like this:

MUSTAFA KARA (X Partisi) – Sayın Başkan, değerli milletvekilleri...

All-caps name, party in parentheses, an en dash, then the speech. Easy. A regex does it in one line.

In the noisy case, the same line OCRs as:

MUSTAFA KARA (X Parlisl) — Sayın Başkan, değerli...

Three things to notice. "Partisi" became "Parlisl": a faint scan rendered the "t" as "l" and the final "i" as "l". The dash got wider. And on twenty percent of pages the dash isn't there at all; the OCR drops it.

A regex that was 100% confident on the clean case fails silently on the noisy one. Silent failure is the worst kind. The speech gets attributed to the previous speaker, who is still sitting in the regex's memory.

Evening 1: the regex

I started where everyone starts.

import re

# Clean-case tag: ALL-CAPS NAME (Party) <dash> speech
SPEAKER_RE = re.compile(
    r"^([A-ZÇĞİÖŞÜ ]{4,40})\s*\(([^)]+)\)\s*[–—-]\s*(.*)"
)

def attribute(lines):
    current = None
    for line in lines:
        m = SPEAKER_RE.match(line)
        if m:
            # new speaker block starts
            current = (m.group(1).strip(), m.group(2).strip())
            yield current, m.group(3)
        elif current:
            # no tag found: assume the current speaker continues
            yield current, line

This pattern handles the clean case. It also assumes that if no new tag is found, the current speaker continues. That assumption is the bug. On a malformed tag the regex doesn't match, the line is treated as continuation, and the previous speaker eats the next speaker's words.
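To make the failure concrete, here is a small reproduction using the regex and loop from this evening. The second speaker, AYŞE DEMİR, and her party are hypothetical. Her line simulates a tag where the OCR dropped the dash.

```python
import re

SPEAKER_RE = re.compile(
    r"^([A-ZÇĞİÖŞÜ ]{4,40})\s*\(([^)]+)\)\s*[–—-]\s*(.*)"
)

def attribute(lines):
    current = None
    for line in lines:
        m = SPEAKER_RE.match(line)
        if m:
            current = (m.group(1).strip(), m.group(2).strip())
            yield current, m.group(3)
        elif current:
            # No tag matched, so the line is credited to the previous speaker.
            yield current, line

lines = [
    "MUSTAFA KARA (X Partisi) – Sayın Başkan, değerli milletvekilleri...",
    # Hypothetical second speaker; the dash after the party is missing.
    "AYŞE DEMİR (Y Partisi) Teşekkür ederim.",
]

result = list(attribute(lines))
for speaker, text in result:
    print(speaker, "|", text)
```

Both rows come out under MUSTAFA KARA, and nothing in the output signals that the second line ever looked like a tag.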

I spent four hours on evening one trying variations. More dash characters. Looser whitespace. A more permissive party group. Every fix shifted the failure to a different page. I stopped around 1:30 a.m. because I had the feeling I was confusing accuracy with stubbornness.

Evening 2: fuzzy matching, and twelve merged speakers

On evening two I gave up on exact matching. The plan: fuzzy-match the speaker block against a known shape and allow up to two edits.

from rapidfuzz import fuzz
 
def looks_like_speaker_tag(line: str) -> bool:
    # Speaker tags are short, mostly uppercase, contain parens.
    if "(" not in line or ")" not in line:
        return False
    head = line.split("–")[0] if "–" in line else line.split("—")[0]
    # Uppercase ratio in the first 30 chars; humans rarely shout this much.
    upper = sum(c.isupper() for c in head[:30])
    return upper >= 10 and fuzz.partial_ratio(head, head.upper()) > 85

This caught the malformed tags. It also caught things that weren't tags at all. Section headings in caps. Page footers. Procedural markers like OTURUM AÇILIRKEN. On one session it identified twelve "new speakers" that did not exist; their lines got ingested as fresh blocks, fragmenting the actual speakers' speeches into confetti.

Recall up, precision down. The model was now wrong in a more interesting way. I stopped at midnight, because being wrong differently is not progress.
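One way to make an "up to two edits" budget explicit is a plain Levenshtein distance against the token you expect. The sketch below is stdlib-only and is not the rapidfuzz check from this evening; `edit_distance`, `within_edits`, and the `budget` parameter are illustrative names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]

def within_edits(observed: str, expected: str, budget: int = 2) -> bool:
    """True when observed is at most `budget` edits away from expected."""
    return edit_distance(observed, expected) <= budget

print(edit_distance("Parlisl", "Partisi"))   # two substitutions
print(within_edits("Parlisl", "Partisi"))
print(within_edits("AÇILIRKEN", "Partisi"))
```

A bounded distance like this only says how far a token is from one expected string. It still needs something trustworthy to compare against, which is the gap the false positives above exposed.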

Evening 3: the dictionary

This was the evening I stopped trying to be clever.

I built a dictionary. For each session I knew the date. From the date I could pull the official MP roster — who was sworn in, who was absent, who had switched parties since the last session. The dictionary for a 2017 session contains around 550 entries; for a 2023 session, around 600.

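import json

def load_speaker_dict(session_date):
    # roster.json is hand-curated from official TBMM records.
    with open(f"rosters/{session_date.year}.json", encoding="utf-8") as f:
        roster = json.load(f)
    # On any given day, only members present can speak. Absences matter.
    # Fall back to everyone sworn in when attendance for the day is missing.
    present = set(roster["present_on"].get(session_date.isoformat(), roster["sworn_in"]))
    # normalize() is the project's name-normalization helper (not shown here).
    return {normalize(m["name"]): m for m in roster["members"] if m["id"] in present}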

Now the fuzzy tag match had an anchor. The candidate name was matched against the day's roster, not the entire universe of Turkish names. False positives dropped sharply.
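A minimal sketch of what anchoring to the roster can look like, using only the standard library. The `normalize` here is an assumed stand-in for the project's helper, `match_against_roster` and the 0.85 cutoff are illustrative choices, and every roster name except MUSTAFA KARA is hypothetical.

```python
import difflib

# Fold Turkish letters to ASCII so a lost dot or cedilla does not break lookup.
_TR_TO_ASCII = str.maketrans("ÇĞİÖŞÜçğıöşü", "CGIOSUcgiosu")

def normalize(name: str) -> str:
    """Fold Turkish letters, uppercase, and collapse runs of whitespace."""
    return " ".join(name.translate(_TR_TO_ASCII).upper().split())

def match_against_roster(candidate, roster_names, cutoff=0.85):
    """Return the roster name closest to candidate, or None if nothing is close."""
    by_key = {normalize(n): n for n in roster_names}
    hits = difflib.get_close_matches(
        normalize(candidate), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[hits[0]] if hits else None

roster = ["MUSTAFA KARA", "AYŞE DEMİR", "İSMET ÖRNEK"]

print(match_against_roster("MUSTAFA KAPA", roster))      # one OCR slip
print(match_against_roster("ISMET ORNEK", roster))       # dots lost
print(match_against_roster("OTURUM AÇILIRKEN", roster))  # procedural heading
```

The heading that fooled the shape-only check has nothing on the roster to match, so it returns `None` instead of becoming a new speaker.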

Then I hit the case I hadn't anticipated: members whose names change. A member who married mid-term and took her husband's surname. A member expelled from a party who kept sitting as an independent. The roster had her one way, the OCR had her another, and the dictionary missed the link.

I added an alias table. Each entry can carry a list of past names with effective date ranges. By the end of the evening, the dictionary had grown a small set of footnotes that read like tiny biographies. Some of them are still my favourite part of the codebase.
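The alias table's actual schema isn't shown in this post, so the following is only one plausible shape for it: each member carries past names with the date range in which they were in effect. The class names, field names, member, and dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Alias:
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the alias is still in effect

    def covers(self, day: date) -> bool:
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)

@dataclass
class Member:
    member_id: str
    canonical_name: str
    aliases: List[Alias] = field(default_factory=list)

def resolve(name: str, day: date, members: List[Member]) -> Optional[Member]:
    """Find the member who went by `name` on `day`, or None."""
    for m in members:
        if m.canonical_name == name:
            return m
        if any(a.name == name and a.covers(day) for a in m.aliases):
            return m
    return None

# Hypothetical member who changed surname at the end of 2017.
members = [
    Member(
        member_id="m-001",
        canonical_name="AYŞE DEMİR KAYA",
        aliases=[Alias("AYŞE DEMİR", date(2015, 6, 7), date(2017, 12, 31))],
    )
]

print(resolve("AYŞE DEMİR", date(2016, 1, 1), members))  # alias in range
print(resolve("AYŞE DEMİR", date(2019, 1, 1), members))  # alias expired
```

Scoping aliases by date keeps an old name from matching in sessions where it no longer applies, which matters when a different member with a similar name joins later.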

Evening 4: position and shape

Evening four was the boring evening that did most of the work.

After looking at a thousand correctly-attributed lines and a thousand incorrectly-attributed ones, I noticed this: a real speaker tag almost always starts at column zero, follows a blank line, and is followed by either a dash or a hard newline. Procedural calls like BAŞKAN: use a colon, not a dash, and the next line is short and parenthetical.

def is_substantive_tag(prev_blank, line, next_line):
    # Substantive speakers get a dash; procedural calls get a colon.
    has_dash = any(d in line for d in ("–", "—"))
    has_colon = line.rstrip().endswith(":")
    # Real speeches continue on the next line; procedural calls are short.
    next_is_substance = len(next_line.strip()) > 40
    return prev_blank and has_dash and not has_colon and next_is_substance

Combined with the dictionary, this pushed attribution accuracy to roughly 96% on a held-out batch of 50 sessions. Three evenings of careful work, 0.4 points above the original regex. This is what production NLP feels like.
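For illustration, this is how the shape check might be driven over a page. `is_substantive_tag` is copied from this evening; `find_tags` is a hypothetical wrapper, and its column-zero test is an addition taken from the prose observation rather than from the original function.

```python
def is_substantive_tag(prev_blank, line, next_line):
    # Substantive speakers get a dash; procedural calls get a colon.
    has_dash = any(d in line for d in ("–", "—"))
    has_colon = line.rstrip().endswith(":")
    # Real speeches continue on the next line; procedural calls are short.
    next_is_substance = len(next_line.strip()) > 40
    return prev_blank and has_dash and not has_colon and next_is_substance

def find_tags(lines):
    """Return the indices of lines that pass the position-and-shape check."""
    hits = []
    for i, line in enumerate(lines):
        # The start of the page counts as following a blank line.
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # Assumption from the prose: a real tag starts at column zero.
        at_col0 = bool(line) and not line[0].isspace()
        if at_col0 and is_substantive_tag(prev_blank, line, next_line):
            hits.append(i)
    return hits

sample = [
    "",
    "MUSTAFA KARA (X Partisi) – Sayın Başkan,",
    "değerli milletvekilleri, bugün burada önemli bir konuyu görüşüyoruz.",
    "",
    "BAŞKAN:",
    "(Alkışlar)",
]

hits = find_tags(sample)
print(hits)
```

Only the line with the dash and a long continuation is flagged; the colon-terminated procedural call is left alone.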

Evening 5: the LLM, used sparingly

The last 4% was the long tail. Cases where the OCR was so degraded that the dictionary could not anchor a candidate. Cases where two similarly-named members spoke in the same minute. Cases where the page break landed exactly between the speaker tag and the first word.

For these — and only for these — I added an LLM adjudicator.

def adjudicate(window, candidates, model):
    # Only called when heuristic confidence < 0.7.
    # We pass a 6-line window and the dictionary candidates for the day.
    prompt = (
        "Given the following transcript window and the list of MPs "
        "present in this session, identify the speaker of the last line. "
        "If uncertain, return 'unknown'.\n\n"
        f"WINDOW:\n{window}\n\nCANDIDATES:\n{candidates}\n"
    )
    out = model.generate(prompt, max_tokens=40, temperature=0)
    return parse_candidate(out, candidates)
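`parse_candidate` is called above but not shown. A minimal sketch of what such a helper could do, assuming `candidates` is a list of name strings: accept the model's answer only when it matches exactly one candidate, and treat everything else as unknown.

```python
def parse_candidate(output, candidates):
    """Map raw model output to exactly one candidate name, else None."""
    # Trim whitespace and stray quotes or full stops, then collapse inner spaces.
    answer = " ".join(output.strip().strip("'\".").split()).casefold()
    if not answer or answer == "unknown":
        return None
    matches = [c for c in candidates if c.casefold() == answer]
    # Refuse ambiguous or off-list answers rather than guessing.
    return matches[0] if len(matches) == 1 else None

candidates = ["MUSTAFA KARA", "AYŞE DEMİR"]  # second name is hypothetical

print(parse_candidate(" MUSTAFA KARA\n", candidates))
print(parse_candidate("'unknown'", candidates))
print(parse_candidate("Probably someone else", candidates))
```

Being strict here means a name the model invents can never enter the attribution; the worst case is an explicit unknown that a human can review.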

The adjudicator runs on about 200 lines per session on average. At current pricing, roughly three cents a session. Across 850 sessions, the LLM handled the cases that moved accuracy from 96% to 99.1%.
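The totals implied by those figures are easy to work out. The inputs below are the rounded numbers from the paragraph above, so the results are rough estimates and not billing data.

```python
SESSIONS = 850
LINES_PER_SESSION = 200   # average adjudicated lines per session
CENTS_PER_SESSION = 3     # "roughly three cents"

total_calls = SESSIONS * LINES_PER_SESSION
total_cents = SESSIONS * CENTS_PER_SESSION  # integer cents avoid float drift

print(f"adjudicator calls: {total_calls:,}")     # adjudicator calls: 170,000
print(f"total cost: ${total_cents / 100:.2f}")   # total cost: $25.50
```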

The counter-intuitive lesson sits right there. The LLM did not solve the problem. The boring heuristic and the hand-curated dictionary did 96% of the work. The LLM only handled the last 4%, which made it look magical. If I had started with the LLM, I would have spent a fortune and still been chasing the same long tail at 96%.

A small zoo of OCR errors

A non-exhaustive sample, kept in a file called ocr_zoo.md:

Partisi → Parlisl (faint scan, t→l, i→l)
İSMET → ISMET (dotted-I lost on tilted page)
BAŞKAN → BASKAN → BAŞKAH (cedilla migrates; N→H on bottom margin)
Em-dash → nothing (1 page in 5)
Two columns merged into one when the centre fold curls
(X.Y.Z. Partisi) punctuation reduced to (X Y Z Partisi)

Each of these demanded an evening of its own at some point.
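One way to put a list like this to work is a lossy comparison key that folds known confusions together before lookup. This is a sketch under that assumption, not code from the project; `ocr_key` and its mapping are hypothetical and cover only the single-character entries listed above.

```python
# Characters the zoo shows being confused are mapped to one representative.
_FOLD = str.maketrans({
    "t": "l", "i": "l", "ı": "l",            # faint lowercase strokes
    "İ": "I",                                # dotted capital I loses its dot
    "Ş": "S", "ş": "s", "Ç": "C", "ç": "c",  # cedillas dropped
    "H": "N",                                # N read as H
    ".": " ",                                # punctuation lost in abbreviations
})

def ocr_key(text: str) -> str:
    """Fold confusable characters and collapse whitespace for comparison only."""
    return " ".join(text.translate(_FOLD).split())

print(ocr_key("Partisi") == ocr_key("Parlisl"))
print(ocr_key("BAŞKAN") == ocr_key("BAŞKAH"))
print(ocr_key("(X.Y.Z. Partisi)") == ocr_key("(X Y Z Partisi)"))
```

The key throws information away on purpose (it cannot tell H from N), so it is only safe for matching against a known roster, never for producing the name that gets stored.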

What I'd tell past me

If I could go back to evening one, I would say this: don't start with the regex. Start with the dictionary. The dictionary is the thing that turns guessing into looking up. Everything else is decoration.

And the easter egg, because every notebook needs one. The Speaker dictionary now holds 1,247 entries. Three of them carry the note ⚠️ not the same person as the X-name with similar spelling. I'll let you guess which three names cause the most confusion. Guess the obvious ones and you'll probably get two right — and the third one painfully wrong.