Özgür Işık Damar
10 min read

Six months with RRF in production: what k=60 doesn't tell you

The 60 in '1/(60+rank)' came from a 2009 paper. Most teams who copy it think it's a Qdrant default.

Tags: hybrid-search, qdrant, production

Recall@10 looked great on the benchmark notebook. Three weeks into production the merchant CRM team started forwarding me screenshots. "Why does mavi gömlek return a navy blanket?" (Mavi gömlek is Turkish for "blue shirt".) The answer was in k.

That's the whole post in one paragraph. The rest is detail I wish I had read in August 2025, before I shipped RRF with the default constant the rest of the internet copy-pastes.

How we got here

We run hybrid search over a cross-border catalogue between Türkiye and North Macedonia. About 7 million live items, three retrieval modalities, one fusion stage. The modalities are the usual suspects:

  • a dense multilingual text encoder
  • a BM25 sparse index with IDF reweighting
  • an image embedding tower (CLIP/SigLIP) for cross-modal retrieval

When Qdrant shipped native hybrid search, switching over took an afternoon. The API is clean, fusion is built in, and the default fusion strategy is RRF — Reciprocal Rank Fusion. The formula is the part nobody questions:

score(d) = sum over modalities of 1 / (k + rank_modality(d))

That k is a constant. The Qdrant docs use 60. The Elastic docs use 60. The HuggingFace examples use 60. A senior on my team with ten years of search experience used 60. So I used 60.

The naive benchmark, then real users

Our offline eval set was 4,200 labelled queries. nDCG@10 with k=60 was 0.61. Recall@10 was 0.78. The numbers looked fine on the notebook. I rolled out behind a feature flag at 5 percent, watched dashboards for a week, ramped to 100 percent. Latency held. Error rate was flat. Everything was green.

Then the screenshots started.

A merchant ops lead from the Skopje office sent the navy blanket one. A customer service supervisor sent another where a search for Adidas Samba beyaz (white Adidas Samba) returned a pair of white socks: Samba branded, technically a match, but socks. A junior on the data team rebuilt the eval set from a week of real production logs and the offline numbers got worse: nDCG@10 dropped from 0.61 to 0.49.

The eval set had been clean. Production wasn't. Every interesting failure was the tail of one modality bubbling into the top of the fused result.

Why k=60 hurts you specifically

Look at the formula again. With k=60, a result ranked 200th in a single modality still contributes 1/(60+200) = 0.0038 to the fused score. Small number. But if a query weakly matches one modality across many candidates, those tiny contributions stack. Worse, the dense text tower had a habit of producing low-quality, semantically adjacent neighbours at ranks 50 to 200. Things like navy blanket when you asked for navy shirt. The model wasn't wrong. The model was saying "these are kind of similar." RRF took that lukewarm opinion and gave it enough weight to land in the top 10 once the other two modalities had nothing to say.
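To put numbers on "small number", here is a minimal sketch (written for this post, not taken from our production code) that compares what a rank-200 hit contributes relative to the rank-1 hit of the same list. The function names are illustrative only.

```python
def rrf_contribution(rank: int, k: int) -> float:
    """Contribution of one list position to the fused RRF score."""
    return 1.0 / (k + rank)


def tail_to_head_ratio(k: int, tail_rank: int = 200) -> float:
    """How loud a tail hit is relative to the rank-1 hit of the same list."""
    return rrf_contribution(tail_rank, k) / rrf_contribution(1, k)


if __name__ == "__main__":
    for k in (20, 60, 200):
        ratio = tail_to_head_ratio(k)
        print(f"k={k:>3}: a rank-200 hit carries {ratio:.1%} of a rank-1 vote")
```

At k=20 the rank-200 hit is worth roughly a tenth of a rank-1 vote, at k=60 a bit under a quarter, and at k=200 about half. The absolute contribution is tiny at any k; what k changes is how much of a head vote the tail gets.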

Every merchant CRM screenshot had the same shape: one modality has nothing useful in its top 20, the other two are noisy, and RRF stitches a fused list where the tail of the noisy modalities outranks the head of the silent one. The constant k=60 is the dial that controls how loud the tail can shout. Sixty is loud.
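The shape of that failure can be reproduced with two hypothetical documents: one that sits at rank 1 in a single list, and one that sits deep in two lists. This is a sketch under that simplified setup, not a replay of a real query.

```python
def fused_score(ranks: list[int], k: int) -> float:
    """RRF score of a document, given its rank in each list where it appears."""
    return sum(1.0 / (k + r) for r in ranks)


def tail_beats_head(k: int, tail_rank: int) -> bool:
    """True if a doc at tail_rank in two lists outscores a doc at rank 1 in one list."""
    return fused_score([tail_rank, tail_rank], k) > fused_score([1], k)


if __name__ == "__main__":
    for k in (15, 60):
        print(f"k={k}: rank-50-in-two-lists beats rank-1-in-one-list -> {tail_beats_head(k, 50)}")
```

With k=60, a document at rank 50 in two lists scores 2/110 and beats the rank-1 document's 1/61. With k=15 the same document scores 2/65 and loses to 1/16. Working the inequality through, the doubled tail wins whenever its rank is below k + 2, so k is quite literally the depth at which two lukewarm votes overrule one confident one.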

The naive RRF, for reference

This is what I shipped. It's also what's on the front page of half the search tutorials right now.

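# Qdrant's native RRF. k is implicit and equals 60.
# Looks innocent. Behaves like it owes you nothing.
# Assumes `client` is an AsyncQdrantClient and this runs inside a coroutine.
from qdrant_client import models
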
results = await client.query_points(
    collection_name="products",
    prefetch=[
        models.Prefetch(query=text_vec,  using="text",  limit=200),
        models.Prefetch(query=sparse,    using="bm25",  limit=200),
        models.Prefetch(query=image_vec, using="image", limit=200),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=20,
)

There is no k argument anywhere in that snippet. That's the whole problem. You inherit 60 by silence.

The sweep

I ran the same 4,200-query eval set, plus the production-rebuilt set, with k in {20, 30, 60, 120, 200}. Results were not subtle.

k     nDCG@10 (clean)   nDCG@10 (prod)   comment
20    0.69              0.71             head dominates, tail muted
30    0.74              0.73             sweet spot for our mix
60    0.61              0.49             the default we shipped
120   0.55              0.44             tail is loud
200   0.48              0.38             everyone votes equally, badly

On long-tail queries (four or more tokens, often a phrase) k=20 was even better — 0.77 — because the head of each modality was reliable and we didn't need the smoothing.

Small k is sharp. Large k is mushy. The default is mushy.
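For anyone who wants to run the same kind of sweep, here is a rough sketch of the loop. It is not our harness: the `Query` shape, the linear-gain nDCG and the function names are assumptions made for the example, and the retrieval step is replaced by ranked id lists you already have on hand.

```python
import math
from collections import defaultdict

# Assumed shape: (one ranked id list per modality, relevance gain per doc id).
Query = tuple[list[list[str]], dict[str, float]]


def rrf(lists: list[list[str]], k: int) -> list[str]:
    """Fuse ranked id lists with a single RRF constant; best first."""
    scores: dict[str, float] = defaultdict(float)
    for ids in lists:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by id so repeated runs give the same order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def ndcg_at_10(ranked: list[str], gains: dict[str, float]) -> float:
    """nDCG@10 with linear gains; docs missing from `gains` count as 0."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked[:10]))
    ideal = sorted(gains.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def sweep(queries: list[Query], ks=(20, 30, 60, 120, 200)) -> dict[int, float]:
    """Mean nDCG@10 over all queries, for each candidate k."""
    return {
        k: sum(ndcg_at_10(rrf(lists, k), gains) for lists, gains in queries) / len(queries)
        for k in ks
    }
```

Run it once on the clean set and once on the production-log set. The gap between the two columns matters as much as the best value in either.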

Heterogeneous k beats homogeneous k

This is the part I didn't see coming. Our three modalities have very different rank distributions. BM25 has a steep head — the first ten results are usually the only ten that matter, then it falls off a cliff. The dense text tower is gentler — the first forty or so are all plausibly relevant. The image tower is the gentlest of all — the first hundred can be useful, especially for cross-modal queries.

Use one k for everything and you under-weight BM25's confident head while over-weighting the image tower's noisy tail. The fix was to give each modality its own k:

from collections import defaultdict

# Per-modality RRF: rewrite the fusion ourselves because Qdrant's
# native fusion takes one k for all prefetches.
def per_modality_rrf(hits_by_modality: dict[str, list[str]],
                     k_per_modality: dict[str, int]) -> dict[str, float]:
    # Accumulate one reciprocal-rank vote per modality for every doc id.
    scores: dict[str, float] = defaultdict(float)
    for modality, ids in hits_by_modality.items():
        k = k_per_modality[modality]
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return scores

# What landed in production, after the sweep.
K = {"bm25": 15, "text": 40, "image": 60}

BM25 with k=15: when keywords match, the top of the keyword list dominates. The text tower with k=40: we trust the top forty or so before things get fuzzy. The image tower stays at k=60 because for picture-shaped queries its rank 100 is sometimes still useful. That single change took nDCG@10 on the production set from 0.49 to 0.74.

There's no magic in those numbers. They're what fell out of the sweep. The point is that they're different numbers, and the default API doesn't let you have different numbers.
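A toy run makes the difference visible. The document ids below are made up, and `top_n` is a helper added for this example; the fusion function and the constants are the ones from the post, repeated so the snippet runs on its own.

```python
from collections import defaultdict

K = {"bm25": 15, "text": 40, "image": 60}      # constants from the post
UNIFORM = {"bm25": 60, "text": 60, "image": 60}  # what a single k=60 amounts to


def per_modality_rrf(hits_by_modality: dict[str, list[str]],
                     k_per_modality: dict[str, int]) -> dict[str, float]:
    scores: dict[str, float] = defaultdict(float)
    for modality, ids in hits_by_modality.items():
        k = k_per_modality[modality]
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return scores


def top_n(scores: dict[str, float], n: int = 20) -> list[str]:
    """Sort fused scores best-first; ties broken by id so output is stable."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:n]


# Hypothetical hits: BM25 is confident about shirts, the dense towers drift to a blanket.
hits = {
    "bm25":  ["shirt-1", "shirt-2"],
    "text":  ["blanket-9", "shirt-1"],
    "image": ["blanket-9", "shirt-2"],
}

tuned = top_n(per_modality_rrf(hits, K))
uniform = top_n(per_modality_rrf(hits, UNIFORM))
print("per-modality k:", tuned)
print("uniform k=60:  ", uniform)
```

With the per-modality constants both shirts outrank the blanket, because a BM25 rank-1 vote (1/16) is worth far more than a text or image rank-1 vote (1/41 and 1/61). With a uniform k=60 every rank-1 vote is worth the same 1/61, and the blanket, which tops two lists, comes first.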

Query-length-aware k

The other thing the sweep revealed: short queries and long queries want different smoothing. A two-token query like kırmızı elbise (red dress) should be served almost entirely by the head of each list: there's nothing to disambiguate, the user knows what they want. A seven-token query like yazlık keten erkek pantolon bej beden 32 (summer linen men's trousers, beige, size 32) is a long thin needle that benefits from a slightly wider net.

// chooseK picks k per modality and per query length.
// The constants below came from a sweep on six weeks of prod logs.
// Don't copy them, sweep your own. Yours will be different.
func chooseK(modality string, tokenCount int) int {
    base := map[string]int{"bm25": 15, "text": 40, "image": 60}[modality]
    switch {
    case tokenCount <= 2:
        return base / 2
    case tokenCount >= 5:
        return base + 20
    default:
        return base
    }
}

Yes, this is more code than Fusion.RRF. Yes, it produces better results. The trade is the trade.
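For readers following along in Python, here is a straight port of chooseK, plus a small helper that builds the per-modality config for a query. The helper `k_config_for` and its whitespace token count are assumptions for the sketch; a real tokenizer will count differently. One behavioural difference from the Go version: an unknown modality raises KeyError here instead of silently yielding a base of 0.

```python
BASE_K = {"bm25": 15, "text": 40, "image": 60}


def choose_k(modality: str, token_count: int) -> int:
    """Pick k per modality and query length, mirroring the Go chooseK."""
    base = BASE_K[modality]
    if token_count <= 2:
        return base // 2   # short query: sharpen, let the head decide
    if token_count >= 5:
        return base + 20   # long query: widen the net a little
    return base


def k_config_for(query: str) -> dict[str, int]:
    """Assumption: tokens are whitespace-separated."""
    n = len(query.split())
    return {modality: choose_k(modality, n) for modality in BASE_K}


if __name__ == "__main__":
    print(k_config_for("kırmızı elbise"))
    print(k_config_for("yazlık keten erkek pantolon bej beden 32"))
```

The two-token query ends up with k of 7, 20 and 30 for bm25, text and image; the long one gets 35, 60 and 80.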

The eval harness, briefly

The reason any of this is defensible is the harness. We run nightly evals against a frozen production-log sample with the latest catalogue snapshot. The harness can break the build if nDCG@10 drops more than 1.5 points on any locale.

async def eval_run(queries: list[QueryRecord], k_config: dict) -> Report:
    # Run the same query batch under candidate config and current prod config.
    # Compare per-locale because TR and MK have very different distributions.
    cand = await run_batch(queries, k_config)
    prod = await run_batch(queries, PROD_K_CONFIG)
    return Report(
        per_locale_ndcg=ndcg_at_k_per_locale(cand, prod, k=10),
        regression_threshold=1.5,
    )

The boring infrastructure is what makes the interesting tuning possible. Without a harness you're tuning by vibes and tickets, and tickets are a slow gradient.
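The gate itself is small. A sketch of how the build-breaking check could look, assuming per-locale nDCG@10 values on a 0 to 1 scale and reading "1.5 points" as 1.5 hundredths of nDCG. Both are my assumptions for this example, and `regressed_locales` and `gate` are hypothetical names, not parts of our harness.

```python
def regressed_locales(cand: dict[str, float],
                      prod: dict[str, float],
                      threshold_points: float = 1.5) -> dict[str, float]:
    """Locales where the candidate drops more than the threshold below prod.

    Drops are reported in points, i.e. nDCG difference times 100.
    A locale missing from the candidate run is treated as nDCG 0.
    """
    drops: dict[str, float] = {}
    for locale, prod_ndcg in prod.items():
        drop = (prod_ndcg - cand.get(locale, 0.0)) * 100.0
        if drop > threshold_points:
            drops[locale] = drop
    return drops


def gate(cand: dict[str, float], prod: dict[str, float],
         threshold_points: float = 1.5) -> None:
    """Exit non-zero so CI fails when any locale regresses."""
    bad = regressed_locales(cand, prod, threshold_points)
    if bad:
        raise SystemExit(f"nDCG@10 regression beyond {threshold_points} points: {bad}")
```

Checking per locale rather than on the global mean is the point: a gain on TR can otherwise hide a loss on MK.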

The thing nobody tells you

The 60 in 1/(60+rank) came from a 2009 paper by Cormack, Clarke and Buettcher on combining the results of multiple search engines on TREC tracks. They tried a handful of constants and 60 happened to do well on the corpora they had. Most teams who copy it think it's a Qdrant default. It is not. It's a number from a paper, sized for a benchmark that isn't your traffic.

One rule, if I had to leave you with one: k is not a confidence parameter and it is not a "how much do I trust this modality" parameter. It's a smoothing parameter that decides how much the tail of each list contributes to the fused head. Pick it per modality. Pick it for your traffic. Re-pick it when your catalogue or query mix changes. The default value is somebody else's traffic.

Our k has moved twice since February 2026. Once when we added Macedonian Cyrillic queries and the BM25 head got more confident. Once when we swapped image encoders and the image tail got a lot less noisy. Each move was a sweep, an eval, and a quiet PR titled "k tuning, June 2026." Last Friday, the junior who rebuilt the eval set from prod logs pinged me on Slack: "k changed again?" Yes. It's a dial. We turned it.
