Data Collection & Crawler Infrastructure

Useful data collection is an engineering discipline, not a cat-and-mouse game. I build crawler infrastructure for sources you're allowed to read — your own supplier network, public data, licensed feeds — and I build it to keep working after the source changes its layout.

When you'd call me

Price tracking is someone's manual job — a person, a spreadsheet, and an afternoon that produces stale numbers.
Supplier catalogs arrive as portals and PDFs, and someone retypes them into your system by hand.
You need public data — registries, tenders, announcements — at a scale where clicking through is no longer an option.
You already have crawlers, but they break every time a source changes its markup, and nobody notices until the data is a week old.

What I do

Distributed crawler infrastructure — product catalogs, price and stock monitoring at the scale of millions of pages a day.
The full pipeline, not just fetching: crawl, clean, normalize, feed the catalog — the same pipeline that supplied the search index and the vectorizer.
Parsers built for schema drift: the defensive-parsing approach that survived a customs API with seven schema variants applies here one to one.
Operational discipline: rate limiting, retry with backoff, proxy management, and monitoring that wakes you before the data goes stale.
A compliance frame from the start, not as an afterthought — robots.txt, the source's terms of service and GDPR/KVKK constraints are checked per source before a single request is sent. I don't build anti-bot evasion, and I'll tell you when a source is off limits.
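
To make the "retry with backoff" item concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the production code: the names `fetch_with_retry` and `backoff_delays`, the default limits, and the choice of full jitter are all hypothetical, and `fetch` stands in for whatever function performs the actual request.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Upper bound for each retry's wait: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 4,
    base: float = 1.0,
    cap: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on a retryable error, wait and try again.

    The wait is drawn uniformly from [0, bound] ("full jitter") so that many
    workers hitting the same source do not retry in lockstep.
    """
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the error to monitoring
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists only so the function can be exercised without real waiting. In a real crawler the same loop would sit behind a per-source rate limit, so that retries never exceed what the source has agreed to serve.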

Collection is half the job — what the data becomes afterwards is covered under data engineering & analytics.

Numbers, not adjectives

The catalog pipelines I've run collected from supplier and partner sources at millions of pages a day, fed a 7-million-product catalog, and kept both search and the vectorizer supplied without manual touch-ups. The parsers come from the same school as the customs integration: assume the source will change, and survive it when it does.
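
As a rough illustration of "assume the source will change", the sketch below normalizes product records whose field names and price shapes vary between schema versions. Everything in it is hypothetical: the alias lists, the `Product` type and the sample shapes are invented for the example and are not taken from the customs integration or any real source.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Any, Optional, Tuple

# Hypothetical aliases: the same field under the names different schema
# variants might use. A real source needs its own list.
SKU_KEYS = ("sku", "product_code", "id")
PRICE_KEYS = ("price", "unit_price", "amount")


@dataclass
class Product:
    sku: str
    price: Optional[Decimal]


def first_present(record: dict, keys: Tuple[str, ...]) -> Any:
    """Return the first non-empty value found under any of the aliases."""
    for key in keys:
        value = record.get(key)
        if value not in (None, ""):
            return value
    return None


def to_decimal(value: Any) -> Optional[Decimal]:
    """Parse a price, or return None instead of raising on unknown shapes."""
    if isinstance(value, dict):  # e.g. {"value": "3.10", "currency": "EUR"}
        value = value.get("value")
    if value is None:
        return None
    try:
        # Naive decimal-comma handling; thousands separators end up as None.
        return Decimal(str(value).strip().replace(",", "."))
    except InvalidOperation:
        return None


def parse_product(record: dict) -> Optional[Product]:
    """Return a Product, or None when the record has no usable identifier."""
    sku = first_present(record, SKU_KEYS)
    if sku is None:
        return None  # the caller counts these; a rising count signals drift
    return Product(sku=str(sku), price=to_decimal(first_present(record, PRICE_KEYS)))
```

The design choice is that a record with an unexpected shape degrades to a missing value that can be counted and alerted on, rather than an exception that stops the whole run.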

Field notes

The customs API that returned 7 different schemas (and the parser that survived all of them)
The defensive-parsing approach these crawlers are built on.

Speaker attribution on noisy OCR: an evening-by-evening notebook
Extracting structure from genuinely messy source data.

Hybrid search with Qdrant: what nobody tells you about BM25 + dense + image
Where the collected data ends up: the search index.

Where we'd start

Discovery produces three things: a source list with the legal frame per source — robots.txt, terms, GDPR/KVKK status — a volume projection, and a pipeline design. If a source can't be collected cleanly, the document says so before any code exists.
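
The robots.txt part of that per-source check can be automated. The sketch below uses Python's standard `urllib.robotparser`; the `RobotsFinding` type, the `check_robots` helper, the user agent string and the example URLs are assumptions made for illustration. It covers robots.txt only: terms of service and GDPR/KVKK status still need a person to read them.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.robotparser import RobotFileParser


@dataclass
class RobotsFinding:
    url: str
    allowed: bool
    crawl_delay: Optional[float]  # seconds, if the source declares one


def check_robots(robots_txt: str, user_agent: str, url: str) -> RobotsFinding:
    """Evaluate one URL against an already downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return RobotsFinding(
        url=url,
        allowed=parser.can_fetch(user_agent, url),
        crawl_delay=parser.crawl_delay(user_agent),
    )


# Hypothetical robots.txt for an example source.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

catalog = check_robots(EXAMPLE_ROBOTS, "example-bot", "https://example.com/catalog/")
private = check_robots(EXAMPLE_ROBOTS, "example-bot", "https://example.com/private/prices")
```

Here `catalog.allowed` is true and `private.allowed` is false, and both carry the declared crawl delay, which would feed the rate limit for that source.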

Tell me about your data sources