Özgür Işık Damar

CLIP vs SigLIP for a Turkish product catalog: a brand-affinity ablation

By month two, our CTO had stopped asking 'is it CLIP-quality' and started asking 'is it SigLIP-fair?'

hybrid-search, embeddings, computer-vision

We had 2.4 million product images, most of them shot by sellers on phones and most of them captioned in Turkish, in captions our text encoder could not match. The first image search shipped on CLIP ViT-B/32. A week in, the brand-affinity scores were wrong in a very specific direction: every dark-coloured shoe looked like Nike to the model.

That one sentence is the entire story of why we moved off CLIP. The rest is just receipts.

The catalogue we were embedding

The image tower sits inside the hybrid search pipeline. Text, BM25, and image embeddings are retrieved in parallel, then fused. The image tower carries about a third of the load on visual queries — a photo upload, a screenshot, a vague "something like this" — plus a smaller but non-zero share of text queries via a text-to-image projection head.
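The fusion method itself is not spelled out here. One common choice for merging ranked lists from different retrievers is reciprocal rank fusion (RRF); the sketch below assumes that choice. The function name, the `k = 60` constant, the per-source weights, and the SKU ids are all illustrative, not taken from the production pipeline.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    ranked_lists: dict[str, list[str]],
    k: int = 60,
    weights: Optional[dict[str, float]] = None,
) -> list[tuple[str, float]]:
    """Fuse several best-first lists of SKU ids into one ranking.

    Each source contributes weight / (k + rank) for every id it returned,
    so an id ranked well by several retrievers rises to the top.
    """
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for source, ids in ranked_lists.items():
        w = weights.get(source, 1.0)  # unweighted unless told otherwise
        for rank, sku in enumerate(ids, start=1):
            scores[sku] += w / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical result lists from the three retrievers.
fused = reciprocal_rank_fusion({
    "text":  ["sku-1", "sku-2", "sku-3"],
    "bm25":  ["sku-2", "sku-1", "sku-4"],
    "image": ["sku-2", "sku-5", "sku-1"],
})
```

RRF only looks at ranks, not raw scores, which is why it is a convenient default when the retrievers produce scores on incompatible scales (BM25 scores versus cosine similarities).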

Catalogue scale was the easy part to describe. 2.4M images, one image per SKU, average dimension somewhere around 1200 px wide, JPEG quality all over the map because seller phones are inconsistent. The hard part to describe was the distribution. Roughly 38 percent of the catalogue was apparel. Inside apparel, the long tail was Turkish and Macedonian local brands — names a CLIP training set would never have seen.

Shipping on CLIP, the obvious choice

CLIP was the obvious starting point. The team already had infrastructure for it. The vector dimension fits Qdrant's named-vector slot cleanly. There were good Turkish blog posts about how to wire it up. The throughput numbers were known.

# What we shipped first. Embed once, upsert once, search forever.
# This was the version that made dark shoes into Nike.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval().to("cuda")
proc  = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(pil_img) -> list[float]:
    # Resize and normalise the PIL image, then move the tensors to the GPU.
    inputs = proc(images=pil_img, return_tensors="pt").to("cuda")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
        # L2-normalise so a dot product equals cosine similarity.
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()[0].tolist()

On the offline benchmark, CLIP was fine. Recall@10 against a labelled set of 8,000 image-only queries came in at 0.71. P95 latency was 38 ms per embed on an A10, well under our 80 ms budget for the image step.
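For readers who have not computed Recall@10 by hand, a minimal sketch follows. It assumes each labelled query comes with a set of relevant SKU ids and that per-query recall is averaged over the set; the exact definition used for the 8,000-query benchmark is not stated, and the function names and ids are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of a query's relevant items that appear in its top-k results."""
    if not relevant:
        raise ValueError("query has no labelled relevant items")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Two toy queries: the first finds its only relevant item,
# the second finds one of its two relevant items.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y"], {"y", "z"}),
]
score = mean_recall_at_k(runs, k=10)  # (1.0 + 0.5) / 2 = 0.75
```

When a query has more than k relevant items this definition can never reach 1.0; some teams divide by `min(k, len(relevant))` instead, which is worth checking before comparing numbers across benchmarks.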

We rolled it out, and the searches looked plausible. Looked plausible.

The day we noticed

Brand-affinity was a metric the merchant ops team had built for a different purpose. It scored each merchant's products on how cleanly their images clustered around the merchant's intended brand identity. A boutique selling minimalist Scandinavian apparel should be coherent across its own catalogue. A multi-brand store should be less coherent by design, but the brands inside should still be recognisable.
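The merchant ops formula is not given here. As a rough illustration of what "coherent across its own catalogue" can mean in embedding space, the sketch below scores a set of image embeddings by their mean cosine similarity to their own centroid. This is an assumed stand-in, not the actual brand-affinity metric, and the function names are hypothetical.

```python
import math


def _unit(v: list[float]) -> list[float]:
    """Scale a vector to length 1 so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def catalogue_coherence(embeddings: list[list[float]]) -> float:
    """Mean cosine similarity between each embedding and the group centroid.

    Values near 1.0 mean the images sit in one tight neighbourhood;
    lower values mean the catalogue is spread across image space.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    unit = [_unit(v) for v in embeddings]
    dim = len(unit[0])
    centroid = _unit([sum(v[d] for v in unit) / len(unit) for d in range(dim)])
    sims = [sum(a * b for a, b in zip(v, centroid)) for v in unit]
    return sum(sims) / len(sims)


# Toy 2-d embeddings: a single-brand boutique versus a multi-brand store.
boutique = catalogue_coherence([[1.0, 0.0], [0.9, 0.1]])
multi_brand = catalogue_coherence([[1.0, 0.0], [0.0, 1.0]])
```

A score like this cannot tell a genuinely coherent brand from an encoder that collapses unrelated products into one neighbourhood, which is the failure described next.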

On the Wednesday after rollout, the merchant ops lead sent me a chart. The brand-affinity score for every dark-coloured shoe in the catalogue — across at least sixty Turkish and Macedonian boutique brands — had drifted toward "Nike-shaped." The image embedder was projecting them all into the same neighbourhood. A pair of locally made Yıldız leather boots and a pair of Nike Cortez were near-neighbours in image space. The only thing they shared was being dark and shoe-shaped.

The merchant ops lead asked, gently, whether this was on purpose.

It was not on purpose. It was the training data.

Why CLIP did this

CLIP was trained on roughly 400 million image-text pairs scraped from the open web. In that corpus, the photo-text pairs for Nike, Adidas, and Puma vastly outnumber the pairs for any Turkish or Macedonian local brand. The dense regions of CLIP's image-space around "shoe + dark colour" are packed with popular brands. The long-tail brands are sparse outposts, dragged into that dense neighbourhood by plain nearest-neighbour geometry.

This is not a bug in CLIP. CLIP is doing exactly what it was trained to do. The bug is in the mismatch between our catalogue and CLIP's training distribution. The catalogue is heavy on long-tail brands. CLIP is heavy on global brands. The two distributions did not overlap where it mattered.

I spent two days fine-tuning a head on top of CLIP with our own brand labels. It helped a little. Recall@10 went up by maybe 2 points; brand fairness barely moved. The information needed to fix the bias was not in the head. It was in the encoder.

So we swapped the encoder.

The SigLIP swap

What nobody warns you about moving from CLIP to SigLIP is how small the diff is. The architectures are similar enough that the embedding code barely changes:

# The entire migration cost. One import line, one model id.
# The pipeline downstream did not change.
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-256").eval().to("cuda")
proc  = AutoProcessor.from_pretrained("google/siglip-base-patch16-256")

def embed_image(pil_img) -> list[float]:
    inputs = proc(images=pil_img, return_tensors="pt").to("cuda")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
        # L2-normalise so a dot product equals cosine similarity.
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()[0].tolist()

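That is it. Same input, same call, same Qdrant upsert path. The one thing that does change is the vector width: CLIP ViT-B/32 emits 512-dimensional image features and SigLIP-Base emits 768, so the named vector has to be created at the new size. The catalogue had to be re-vectorized anyway (every image gets embedded once more on the GPU farm), but in code the diff was one import line and one model id.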

What is not small is what changed underneath. SigLIP trains with a pairwise sigmoid loss rather than CLIP's softmax contrastive loss. Every image-text pair in the batch is scored as an independent yes/no decision, so no pair's loss is normalised against every other pair in the batch. The objective is per-pair, not per-batch. Our reading of the practical consequence: SigLIP's image-space is less warped around the brand-rich regions, and the geometry stays flatter where the data is sparse.
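To make the difference between the two objectives concrete, here is a small pure-Python sketch of both losses on a toy similarity matrix. It is a simplified illustration, not either model's training code: `sim[i][j]` is assumed to be the cosine similarity between image i and text j, matching pairs sit on the diagonal, and the temperature and bias are fixed illustrative constants (in real training they are learned).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(sim: list[list[float]],
                          temperature: float = 10.0,
                          bias: float = -10.0) -> float:
    """SigLIP-style loss: every (image, text) cell is its own binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total -= log_sigmoid(label * (temperature * sim[i][j] + bias))
    return total / n


def softmax_contrastive_loss(sim: list[list[float]],
                             temperature: float = 10.0) -> float:
    """CLIP-style loss: cross-entropy over each row and each column."""
    n = len(sim)

    def cross_entropy(matrix: list[list[float]]) -> float:
        loss = 0.0
        for i in range(n):
            logits = [temperature * v for v in matrix[i]]
            m = max(logits)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]  # the matching pair is at index i
        return loss / n

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(transposed))


aligned = [[1.0, 0.0], [0.0, 1.0]]   # each image matches its own caption
confused = [[0.0, 1.0], [1.0, 0.0]]  # each image matches the wrong caption
```

In the sigmoid version, the term for cell (i, j) depends only on `sim[i][j]`. In the softmax version, each row's term depends on every entry in that row through the normaliser `log_z`, which is the batch-wide coupling the sigmoid loss avoids.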

The numbers, against the same 8k queries

On the offline set:

  • Recall@10: CLIP 0.71, SigLIP 0.83. +12 percentage points.
  • Brand fairness index: CLIP 0.42, SigLIP 0.70. +28 points.
  • P95 embed latency: CLIP 38 ms, SigLIP 53 ms. +15 ms.

The latency cost was real, and we paid it. SigLIP-Base at patch-16-256 is meaningfully heavier to run than CLIP ViT-B/32: 16-pixel patches on a 256 px input give 256 patch tokens per image, against 49 for 32-pixel patches on a 224 px input. We considered dropping to the smaller siglip-small-patch16-224 to claw the latency back, ran the eval, watched the brand fairness number sag, and stuck with the bigger one.
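The compute gap is easiest to see as patch arithmetic. The sketch below counts patch tokens per image for the two configurations, assuming square inputs of 224 px for CLIP ViT-B/32 and 256 px for SigLIP-Base patch16-256; the helper name is illustrative, and the count ignores any extra class token.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side


clip_tokens = patch_tokens(224, 32)    # 7 x 7   = 49
siglip_tokens = patch_tokens(256, 16)  # 16 x 16 = 256
ratio = siglip_tokens / clip_tokens    # a bit over 5x the sequence length
```

Measured latency does not scale one-for-one with token count, since image decoding, preprocessing, and transfer overheads are shared, which is consistent with the P95 going from 38 ms to 53 ms rather than quintupling.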

The headline was the brand fairness lift. Long-tail brands stopped being absorbed into popular-brand neighbourhoods. The Yıldız boots and the Nike Cortez separated in vector space. They stopped being near-neighbours. They went back to being what they actually were — two different shoes from two different brands that happened to both be dark.

The bonus we didn't expect

Our text-to-image projection — for text queries hitting the image index — got noticeably better with SigLIP. Turkish text queries especially. SigLIP's training included a much larger multilingual subset than the original CLIP, and the alignment between Turkish captions and image regions was tighter. A query like kırmızı kemerli yazlık elbise ("red-belted summer dress") started returning coherent dress images at a rate we hadn't seen on CLIP without an extra fine-tune.

We didn't do anything specifically for this. It was a side effect of the swap. The same model getting more multilingual and more fair to long-tail brands in one go was the part that made the trade easy.

What the model-size people get wrong

There is a reflexive instinct on the team to ask "is the new model bigger?" whenever something improves. In parameter count, SigLIP-Base is not meaningfully bigger than CLIP ViT-B/32; they are the same order of magnitude. The extra latency comes from processing more patches per image, not from carrying more weights. The win wasn't size. The win was the training loss and the training data mix.

Jumping to CLIP-Large, with well over twice the parameters, made our brand fairness numbers slightly worse, because biased data modelled more aggressively deepens the same biases. We tested it: CLIP-Large's brand fairness was 0.39 against CLIP-Base's 0.42. Bigger did not help. Different did.

Counter-intuitive on the face of it. Obvious in hindsight. The encoder is a lens. The training data is the light. A bigger lens focused on biased light just produces a sharper biased image.

A small brand-fairness eval

The brand fairness index runs nightly on a 12,000-item sample. The math isn't exotic — it compares how tightly your top-N neighbours cluster on brand label, normalised against how the brand is distributed in the catalogue overall.

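import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class Result:
    # Minimal stand-in for a search hit; only the brand label is used here.
    brand: str

def brand_fairness(query_results: list[Result],
                   catalogue_brand_counts: dict[str, int]) -> float:
    # Higher is fairer. A perfectly proportional sample scores 1.0.
    # A sample that over-represents popular brands scores below 1.
    if not query_results:
        return 1.0  # nothing retrieved, so nothing is over-represented
    total = sum(catalogue_brand_counts.values())
    expected = {b: c / total for b, c in catalogue_brand_counts.items()}
    counts = Counter(r.brand for r in query_results)
    n = len(query_results)
    observed = {b: c / n for b, c in counts.items()}
    # KL divergence of observed from expected, inverted so higher = fairer.
    # Brands missing from the catalogue counts get a tiny floor probability.
    kl = sum(o * math.log(o / expected.get(b, 1e-9))
             for b, o in observed.items())
    return 1.0 / (1.0 + kl)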

We track it per merchant and per category. It is the metric that catches "the search secretly only knows about the top brands" before merchants do.
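Tracking per merchant and per category amounts to grouping per-query scores and averaging them. A minimal sketch of that roll-up follows; the `FairnessSample` record, the function names, and the 0.5 alert threshold are hypothetical, chosen only to show the shape of the aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FairnessSample:
    """One nightly measurement: a fairness score tagged with its slice keys."""
    merchant: str
    category: str
    score: float


def mean_fairness_by(samples: list[FairnessSample], field: str) -> dict[str, float]:
    """Average the scores per value of `field` ("merchant" or "category")."""
    groups: dict[str, list[float]] = defaultdict(list)
    for s in samples:
        groups[getattr(s, field)].append(s.score)
    return {key: mean(scores) for key, scores in groups.items()}


def below_threshold(grouped: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Slices whose mean fairness fell under the (assumed) alert threshold."""
    return sorted(key for key, score in grouped.items() if score < threshold)


# Made-up samples for two merchants across two categories.
samples = [
    FairnessSample("merchant-a", "shoes", 0.8),
    FairnessSample("merchant-a", "dresses", 0.6),
    FairnessSample("merchant-b", "shoes", 0.3),
]
per_merchant = mean_fairness_by(samples, "merchant")
per_category = mean_fairness_by(samples, "category")
flagged = below_threshold(per_merchant)
```

Slicing both ways matters because a category-level average can look healthy while a single merchant inside it is being absorbed into a popular brand's neighbourhood.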

Six months on

We are still on SigLIP-Base-Patch16-256. We have re-trained the brand fairness head once, when a wave of new Macedonian boutiques onboarded and shifted the distribution again. We have not gone back to CLIP for any production traffic.

By month two, our CTO had stopped asking "is it CLIP-quality" and started asking "is it SigLIP-fair?" That is the kind of small linguistic shift that tells you a team has internalised what the actual problem was. The problem was never visual quality on stock photography. The problem was whether the lens we were holding up to our catalogue had ever been trained on anything that looked like our catalogue.

It had not. We swapped the lens. The catalogue, finally, looked like itself.
