Data Collection & Crawler Infrastructure
Useful data collection is an engineering discipline, not a cat-and-mouse game. I build crawler infrastructure for sources you're allowed to read — your own supplier network, public data, licensed feeds — and I build it to keep working after the source changes its layout.
When you'd call me
- Price tracking is someone's manual job — a person, a spreadsheet, and an afternoon that produces stale numbers.
- Supplier catalogs arrive as portals and PDFs, and someone retypes them into your system by hand.
- You need public data — registries, tenders, announcements — at a scale where clicking through is no longer an option.
- You already have crawlers, but they break every time a source changes its markup, and nobody notices until the data is a week old.
What I do
- Distributed crawler infrastructure for product catalogs and for price and stock monitoring, at the scale of millions of pages a day.
- The full pipeline, not just fetching: crawl, clean, normalize, feed the catalog — the same pipeline that supplied the search index and the vectorizer.
- Parsers built for schema drift: the defensive-parsing approach that survived a customs API with fourteen schema variants carries over directly.
- Operational discipline: rate limiting, retry with backoff, proxy management, and monitoring that wakes you before the data goes stale.
- A compliance frame from the start, not as an afterthought — robots.txt, the source's terms of service and GDPR/KVKK constraints are checked per source before a single request is sent. I don't build anti-bot evasion, and I'll tell you when a source is off limits.
Collection is half the job — what the data becomes afterwards is covered under data engineering & analytics.
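As a rough illustration of two of the points above (checking robots.txt before a request is sent, and retrying with backoff), here is a minimal Python sketch using only the standard library. The function names `is_allowed` and `backoff_delays`, the `example-bot` user agent and the default delay values are assumptions made for this example, not a description of any production system.

```python
import random
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse already-downloaded robots.txt content; no network call happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0,
                   jitter: float = 0.0) -> list[float]:
    """Exponential backoff: base * 2**attempt, capped, plus optional random jitter.

    jitter is a fraction of the delay (0.0 disables it); the defaults are
    illustrative values, not recommendations for any particular source.
    """
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * 2 ** attempt)
        delay += random.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


# Hypothetical robots.txt content for a source.
ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(ROBOTS, "example-bot", "https://example.com/catalog/1"))
print(is_allowed(ROBOTS, "example-bot", "https://example.com/private/x"))
print(backoff_delays(4))
```

In a real crawler the robots.txt check would sit in front of the fetch queue, and the delays would drive the sleep between retry attempts; terms of service and GDPR/KVKK status still need a human review per source and are not something a function like this can decide.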
Numbers, not adjectives
The catalog pipelines I've run collected from supplier and partner sources at millions of pages a day, fed a 7-million-product catalog, and kept both search and the vectorizer supplied without manual touch-ups. The parsers come from the same school as the customs integration: assume the source will change, and survive it when it does.
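To make "assume the source will change" concrete, here is a small sketch of one defensive-parsing pattern: accept several known spellings of a field, and record problems instead of raising. The field aliases, the `ParsedProduct` type and the decimal-comma handling are hypothetical choices for this example; the actual customs and catalog parsers are not shown here.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

# Hypothetical aliases: the same field under the names different schema variants use.
SKU_KEYS = ("sku", "product_code", "productCode")
PRICE_KEYS = ("price", "unit_price", "unitPrice")


@dataclass
class ParsedProduct:
    sku: Optional[str] = None
    price: Optional[Decimal] = None
    problems: list[str] = field(default_factory=list)


def first_present(record: dict[str, Any], keys: tuple[str, ...]) -> Any:
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        if key in record and record[key] not in (None, ""):
            return record[key]
    return None


def parse_product(record: dict[str, Any]) -> ParsedProduct:
    """Parse one record without raising; anything unexpected lands in .problems."""
    result = ParsedProduct()

    sku = first_present(record, SKU_KEYS)
    if sku is None:
        result.problems.append("missing sku")
    else:
        result.sku = str(sku).strip()

    raw_price = first_present(record, PRICE_KEYS)
    if raw_price is None:
        result.problems.append("missing price")
    else:
        try:
            # Assumption: a comma is a decimal separator, as in "12,50".
            result.price = Decimal(str(raw_price).replace(",", ".").strip())
        except InvalidOperation:
            result.problems.append(f"unparseable price: {raw_price!r}")

    return result


print(parse_product({"productCode": " A-1 ", "unitPrice": "12,50"}))
print(parse_product({"sku": "B-2", "price": "n/a"}))
```

The point of the design is that a changed source produces a record with a non-empty `problems` list that monitoring can count, rather than a crashed job or a silently wrong value.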
Where we'd start
Discovery produces three things: a source list with the legal frame per source (robots.txt, terms of service, GDPR/KVKK status), a volume projection, and a pipeline design. If a source can't be collected cleanly, the discovery document says so before any code exists.