Published 3 April 2026 · 8 min read

The week BM25 beat my cross-encoder — and why I kept the reranker anyway

Sometimes the smartest engineering decision is to delete the thing you built three weeks ago. Sometimes it's just to gate it behind an if statement.

hybrid-search · reranking · production · multilingual

I had just spent three weeks bolting a BGE-Reranker stage onto the end of our hybrid pipeline. The cross-encoder cost us 40 ms of P95 latency. In return, it lifted nDCG@10 by 6 points. Then queries from the Macedonian marketplace started arriving. For half of them, BM25 alone — no fusion, no dense vectors, no rerank — was beating the full pipeline.

That one sentence ruined a week. The week, in turn, taught me what the cross-encoder was actually for.

The pipeline as it stood

The hybrid search engine has three stages: retrieval, fusion, rerank. Retrieval fires three named-vector queries against Qdrant in parallel — BM25 sparse, dense text, dense image. Fusion runs Reciprocal Rank Fusion across the three. Rerank takes the top 50 of the fused list and runs a BGE-Reranker-Base cross-encoder over (query, document) pairs to produce the final ordering.
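
For readers who haven't met it, Reciprocal Rank Fusion is small enough to sketch in a few lines. This is a minimal illustration rather than our production code: the function name, the toy rank lists, and the constant k = 60 (a value commonly used for RRF, not something this post depends on) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rank_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ordering.

    Each document earns 1 / (k + rank) from every list it appears in,
    and the contributions are summed. Only ranks matter, never raw scores.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy rank lists standing in for the three retrieval legs.
bm25 = ["a", "b", "c"]
dense_text = ["b", "a", "d"]
dense_image = ["b", "d", "a"]

fused = reciprocal_rank_fusion([bm25, dense_text, dense_image])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Because the fusion only looks at ranks, it needs no score calibration between the sparse and dense legs, which is part of why it is a common default.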

The reranker is the only stage that gets to read both the query and the document at the same time. Retrieval embeds them separately and compares distances. Fusion reads nothing — it just merges rank lists. The cross-encoder is the only place in the system where the model can say "this exact query and this exact title and this exact description belong together."

It took three weeks to make it fit inside our latency budget. The win was clean: nDCG@10 went from 0.74 to 0.80 on the global eval set. The kind of number you can defend at a roadmap review.
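
Since nDCG@10 carries most of this story, here is a minimal sketch of the metric. It assumes graded relevance labels keyed by document id and the linear-gain form of DCG; the function names and signatures are illustrative, and a real harness may well use the exponential-gain variant instead. A "point" in this post is 0.01 of nDCG, so 0.74 to 0.80 is 6 points.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_ids: list[str], relevance_labels: dict[str, float], k: int = 10
) -> float:
    """DCG of the returned order divided by the DCG of the ideal order."""
    gains = [relevance_labels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance_labels.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0.0:
        return 0.0  # no relevant documents labelled for this query
    return dcg_at_k(gains, k) / ideal


labels = {"a": 3.0, "b": 2.0, "c": 0.0}
perfect = ndcg_at_k(["a", "b", "c"], labels)
scrambled = ndcg_at_k(["c", "b", "a"], labels)
```

The same documents in a worse order score lower, which is exactly the failure mode a misfiring reranker produces: it cannot lose results from the top 50, only shuffle them.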

The Macedonian Friday

Two weeks after the cross-encoder hit 100 percent traffic, the data lead pulled per-locale metrics for a planning meeting. The headline was fine. The breakdown wasn't.

locale   nDCG@10 before   nDCG@10 after   delta (points)
------   --------------   -------------   --------------
tr-TR    0.71             0.81            +10
en-US    0.78             0.84            +6
mk-MK    0.69             0.65            -4

Macedonian queries had lost four points. The expensive thing I had just shipped was making them worse.

I sat with it for a day. Retrieval was the same code path for every locale. Fusion was the same. The only piece doing locale-conditional work was the cross-encoder, because language is something a cross-encoder absorbs implicitly through training. BGE-Reranker-Base is trained predominantly on English and Chinese. Turkish gets some incidental help through transfer. Macedonian — Cyrillic and Latin mixed, often code-switched with Albanian or Serbian fragments — was a distribution the model had effectively never seen.

Faced with a Macedonian query and a Macedonian product description, the cross-encoder was producing scores that were almost random with respect to relevance. It would re-sort the top-50 of a fused list that BM25 had usually nailed, scrambling the order so that correct results were pushed down and noisy ones up.

For half of the Macedonian queries, the simplest possible system — just BM25, no fusion, no rerank — beat the full pipeline. I ran the eval three times because I did not want it to be true.

It was true.

The reflex to delete

My first instinct was to rip the cross-encoder out. Three weeks of work, gone. The latency budget would loosen. The infra cost would drop. The team would have one less moving part to babysit. There was a quiet, ego-shaped pull toward "I was wrong about this whole thing."

Two things stopped me.

First, the Turkish and English numbers. The reranker was a +10 point gain on Turkish, +6 on English. Those are the two biggest traffic segments. Throwing it away to fix Macedonian would have been the most expensive locale-specific bug fix in the history of search infrastructure.

Second, the merchant CRM had quietly been running a small test: replaying real Turkish purchase-intent queries against the system with and without the reranker. With the reranker on, click-through rate on the top-3 results was up 8 percent. That number is not nDCG. It is dollars.

The right answer wasn't "delete the thing you built three weeks ago." The right answer was "gate the thing you built three weeks ago behind an if."

The conditional rerank

The fix was simpler than the bug.

// Rerank only for locales where the cross-encoder has been shown to help.
// The list is data-driven from the nightly eval: if mk-MK ever clears the
// lift threshold, it is enabled automatically. We do not hardcode language opinions.
func shouldRerank(query Query) bool {
    cfg := rerankerConfig.Current()
    if !cfg.GlobalEnabled {
        return false
    }
    locale := query.Locale.Normalize()
    return cfg.LocalesWithLift[locale]
}
 
func RunPipeline(ctx context.Context, q Query) ([]Result, error) {
    fused, err := retrieveAndFuse(ctx, q)
    if err != nil {
        return nil, err
    }
    if shouldRerank(q) {
        return rerankTop50(ctx, q, fused)
    }
    return fused[:min(20, len(fused))], nil
}

LocalesWithLift is not a constant. It's a map the nightly evaluator maintains. Every night, the eval harness re-runs the previous day's queries through both pipelines — with and without rerank — for each locale, and updates the map. A locale that shows at least a 1.5-point nDCG@10 lift with rerank on gets included. Everything else stays out.

That was the part that let me ship it. The decision was no longer me at a desk picking which languages get the cross-encoder. The decision was the data. If Macedonian got a new reranker tomorrow that actually worked for it, the map would catch the lift and flip the bit automatically. If Turkish ever stopped benefiting, it would drop off the map without anyone touching code.

The locale-aware eval harness

The harness was the engineering work that made the policy defensible. It runs nightly on a frozen 35,000-query sample drawn from the previous week's production logs, stratified by locale so that small-traffic locales like Macedonian get enough sample to produce a stable number.
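
The stratification step can be sketched roughly like this. It is an illustration under assumptions, not the harness itself: the per-locale floor, the dict-shaped queries, and the function name are all hypothetical. The idea is to guarantee every locale a minimum number of queries first, then fill the rest of the sample from what is left, which stays roughly proportional to traffic.

```python
import random
from collections import defaultdict


def stratified_sample(
    queries: list[dict], total: int, min_per_locale: int, seed: int = 0
) -> list[dict]:
    """Sample `total` queries while guaranteeing each locale a floor."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_locale: dict[str, list[dict]] = defaultdict(list)
    for q in queries:
        by_locale[q["locale"]].append(q)

    sample: list[dict] = []
    leftover: list[dict] = []
    for items in by_locale.values():
        rng.shuffle(items)
        take = min(min_per_locale, len(items))
        sample.extend(items[:take])   # the guaranteed floor for this locale
        leftover.extend(items[take:])

    # Fill the remainder uniformly from everything not yet chosen.
    remaining = max(0, total - len(sample))
    sample.extend(rng.sample(leftover, min(remaining, len(leftover))))
    return sample


# Toy log: one dominant locale and one small one.
log = (
    [{"locale": "en-US", "text": f"q{i}"} for i in range(900)]
    + [{"locale": "tr-TR", "text": f"q{i}"} for i in range(80)]
    + [{"locale": "mk-MK", "text": f"q{i}"} for i in range(20)]
)
picked = stratified_sample(log, total=100, min_per_locale=15)
```

A purely proportional draw of 100 from this toy log would give mk-MK about two queries, which is not enough to say anything about it. The floor is what makes the small-locale number stable.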

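import statistics
from collections import defaultdict

# Query, pipeline_without_rerank, pipeline_with_rerank and ndcg_at_k live
# elsewhere in the harness.
async def evaluate_rerank_lift(
    queries: list[Query],
    locales: list[str],
) -> dict[str, float]:
    # Run each query through both pipelines, compute nDCG@10 delta per locale.
    # Stratified sampling guarantees mk-MK gets enough volume to be honest.
    wanted = set(locales)
    lifts: dict[str, list[float]] = defaultdict(list)
    for q in queries:
        if q.locale not in wanted:
            continue  # only evaluate the locales we were asked about
        baseline = await pipeline_without_rerank(q)
        candidate = await pipeline_with_rerank(q)
        b = ndcg_at_k(baseline, q.relevance_labels, k=10)
        c = ndcg_at_k(candidate, q.relevance_labels, k=10)
        # Store the delta in points (nDCG x 100) so it compares directly
        # with the 1.5-point threshold.
        lifts[q.locale].append(100.0 * (c - b))
    return {loc: statistics.mean(vals) for loc, vals in lifts.items()}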

The output of that function drives the entire LocalesWithLift policy. A locale ships into rerank if its mean lift exceeds 1.5 points and the lower bound of its bootstrap 90 percent confidence interval is above zero. The 1.5 threshold is not a magic number. It's calibrated against the rerank's latency cost, about 40 ms: it is the smallest lift at which we think the rerank is worth paying for.
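
The gating rule itself might look something like the sketch below. It assumes a percentile bootstrap over per-query lifts already expressed in points (nDCG x 100), and it reads "the interval is positive" as "the lower bound is above zero". The function names, resample count, and toy data are illustrative, not what the nightly job actually runs.

```python
import random
import statistics


def bootstrap_ci(
    values: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.90,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


def locales_with_lift(
    lifts_by_locale: dict[str, list[float]], min_lift_points: float = 1.5
) -> dict[str, bool]:
    """Enable rerank where the mean lift clears the threshold and the CI is positive."""
    enabled: dict[str, bool] = {}
    for locale, lifts in lifts_by_locale.items():
        mean_lift = statistics.fmean(lifts)
        lower, _ = bootstrap_ci(lifts)
        enabled[locale] = mean_lift > min_lift_points and lower > 0.0
    return enabled


# Toy per-query lifts in points; real inputs come from the nightly eval.
toy_lifts = {
    "tr-TR": [8.0, 12.0, 10.0, 9.0, 11.0] * 20,
    "mk-MK": [-6.0, -2.0, -4.0, -5.0, -3.0] * 20,
}
policy = locales_with_lift(toy_lifts)
```

Requiring both conditions matters for small locales: a mean above the threshold on a noisy handful of queries should not be enough to flip the bit.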

What you fall back to when rerank is off

Skipping the rerank stage is not free either. The fused list is still produced by RRF over three modalities, and the head of the fused list is usually good, but the order inside the head is rougher without the cross-encoder's reordering. We had to put more thought into the fused order whenever the no-rerank fallback kicked in.

def confidence_aware_fallback(fused: list[ScoredDoc], q: Query) -> list[Result]:
    # When the cross-encoder is skipped, fall back to a confidence-weighted
    # blend of the fusion score and a tiny logistic regression trained on
    # historical click data per locale. Cheap, deterministic, no model load.
    blend = []
    for doc in fused[:20]:
        ctr_prior = locale_ctr_prior(q.locale, doc.category)
        blended = 0.6 * doc.fusion_score + 0.4 * ctr_prior
        blend.append(Result(doc=doc.id, score=blended))
    blend.sort(key=lambda r: r.score, reverse=True)
    return blend

The logistic regression is one of those parts of the system you have to remind yourself is even there. It runs in microseconds. It was trained on six months of historical click data, broken out by locale and category. It is not a brain. It's a small, dumb, useful piece of statistics that picks up "in this locale, in this category, products of this rough shape get clicked more often." When the cross-encoder is in the pipeline, it overwhelms this signal completely. When the cross-encoder is gone, the small signal stops getting overwhelmed and becomes the difference between an okay fallback and a noticeably worse one.
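
To make "small and dumb" concrete, here is a sketch of what a prior like locale_ctr_prior could look like at inference time. Every coefficient below is made up for illustration, and the feature set (a bias, a locale term, a category term, and a locale-category interaction) is an assumption; the real model's weights come from the historical click data.

```python
import math

# Hypothetical coefficients, for illustration only. A real model would
# learn these from click logs.
BIAS = -2.0
LOCALE_WEIGHTS = {"mk-MK": 0.2, "tr-TR": 0.1}
CATEGORY_WEIGHTS = {"phones": 0.9, "garden": -0.3}
PAIR_WEIGHTS = {("mk-MK", "phones"): 0.4}


def locale_ctr_prior(locale: str, category: str) -> float:
    """Logistic-regression estimate of click probability for a locale and category."""
    z = (
        BIAS
        + LOCALE_WEIGHTS.get(locale, 0.0)
        + CATEGORY_WEIGHTS.get(category, 0.0)
        + PAIR_WEIGHTS.get((locale, category), 0.0)
    )
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the prior in (0, 1)


phones_prior = locale_ctr_prior("mk-MK", "phones")
garden_prior = locale_ctr_prior("mk-MK", "garden")
```

Inference is a few dictionary lookups and one exponential, which is why a model of this shape runs in microseconds and needs no model server.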

The cost angle

The numbers that made this an easy ship past the engineering manager were the cost numbers. Macedonian traffic is around 12 percent of total queries. Skipping the rerank on it saved 22 percent of total reranker inference, because Macedonian queries were also disproportionately long — more tokens, bigger cross-encoder context, longer per-pair compute. Dropping that load freed up GPU capacity we had previously been planning to expand. The infra ticket for the expansion got cancelled the week after the conditional rerank shipped.
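
The two percentages imply how much heavier a Macedonian query was than everyone else's. A quick back-of-the-envelope check, assuming all traffic was going through the reranker before the change and that the 22 percent refers to share of total reranker compute (the function name is illustrative):

```python
def relative_cost_per_query(traffic_share: float, cost_share: float) -> float:
    """Per-query cost of one segment relative to the average query outside it."""
    segment_cost_per_query = cost_share / traffic_share
    rest_cost_per_query = (1.0 - cost_share) / (1.0 - traffic_share)
    return segment_cost_per_query / rest_cost_per_query


ratio = relative_cost_per_query(traffic_share=0.12, cost_share=0.22)
print(f"{ratio:.2f}")  # 2.07
```

In other words, under those assumptions each Macedonian query cost roughly twice as much reranker compute as an average query from the other locales, which is consistent with the longer-query explanation.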

The latency budget gained from skipping rerank for Macedonian — about 38 ms at P95 — got re-invested into a slightly larger retrieval limit for Macedonian queries specifically. Bigger candidate pool, more BM25 head to work with, better odds the right item is in the top 20 to begin with. Three weeks later, the data lead's locale chart had Macedonian back at +2 points relative to the pre-rerank baseline. Not because we built it a reranker. Because we stopped breaking what was already working.

What I wrote on the planning doc that quarter

The closing line of my quarterly write-up was this: the reranker is a tool to use sometimes, not a pipeline default. A year of search work, compressed into twelve words.

The hybrid search literature tends to present pipelines as monolithic — retrieve, fuse, rerank, ship. In production, every stage is conditional on something. Rerank is conditional on the model knowing the language. The image stage is conditional on the query having visual intent. The dense text stage is conditional on the model having seen catalogue-like data. The only thing that runs unconditionally is BM25, because BM25 has no opinions to disagree with.

Sometimes the smartest engineering decision is to delete the thing you built three weeks ago. Sometimes it's just to gate it behind an if statement. Knowing which one your situation needs is the senior version of the job. Mine that month needed the if.

The reranker is still in production. It still gives Turkish and English their +10 and +6. Macedonian still gets the un-reranked fused list, the small logistic regression, and the extended candidate pool. The pipeline got smarter not by adding anything, but by learning when to step out of the way.