Data Engineering & Analytics
Most companies don't have a data shortage; they have a refinement shortage. Every pipeline below exists because a production system needed it, not because it sounded good in a conference talk.
When you'd call me
- You have years of data and still make decisions by gut feel, because nothing downstream of the database is trustworthy.
- Your LLM or RAG project stalled exactly where most do: the data wasn't clean, chunked or verifiable enough to retrieve against.
- Search or recommendations need an embedding pipeline, and nobody on the team has run one beyond a notebook.
- Fraud or price manipulation is happening in your marketplace and your current reports can't see it.
What I do
- Embedding & vector infrastructure — a production vectorization pipeline over product data: 512-dimension text and 512-dimension image vectors plus BM25 sparse signals in Qdrant, across 7 million products.
- LLM-ready data preparation and RAG pipelines — chunking strategy, quality filtering, and claim-level verification of the kind I built for a parliamentary NLP archive.
- Anomaly detection — cross-market fraud and price-manipulation detection with graph-based relationship models, and scoring a human can audit instead of a black box.
- Recommender systems — real-time, reinforcement-learning based, with the MLOps loop attached: training, deployment, monitoring.
- ETL/ELT pipeline design — from sources to warehouse with schema versioning and data contracts, learned the hard way on a data-mesh setup.
- Data quality and drift monitoring — automated checks that keep production accuracy from decaying silently.
- Test data engineering — production-sampled, KVKK-compliant test databases with sequence offsets, so staging finally behaves like prod.
- Reporting and observability — metric dashboards and funnel analysis that answer questions instead of decorating them.
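To make the drift-monitoring item above concrete, here is a minimal sketch of one common automated check, the Population Stability Index (PSI), comparing a feature's baseline sample with a current sample. It is an illustration under stated assumptions, not the monitoring stack used on any project described here: the function names `psi` and `drift_alert`, the 10-bin layout and the 0.2 alert threshold (a widely used rule of thumb) are all choices made for this example.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 if the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def drift_alert(baseline: Sequence[float], current: Sequence[float],
                threshold: float = 0.2) -> bool:
    """True when the PSI exceeds the (assumed) alert threshold."""
    return psi(baseline, current) > threshold


if __name__ == "__main__":
    last_month = [i / 100 for i in range(1000)]   # synthetic baseline
    this_week = [x + 5 for x in last_month]       # synthetic shifted sample
    print(drift_alert(last_month, last_month))    # identical samples: no alert
    print(drift_alert(last_month, this_week))     # shifted sample: alert
```

A check like this only says that a distribution moved, not why; in practice it would sit next to schema and null-rate checks, with thresholds tuned per feature.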
When the embedding pipeline's purpose is product search, that end of the problem has its own page: hybrid search & retrieval.
Numbers, not adjectives
The vectorization pipeline behind Nova's catalog embeds 7 million products into Qdrant (512-dimension text vectors, 512-dimension image vectors and BM25 sparse signals alongside), and the hybrid search runs on exactly that. The verification layer from the parliamentary NLP project checks assertions against sources at the claim level, because the alternative came close to happening: a journalist almost quoted a hallucination.
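As a rough illustration of what one record in such a pipeline might look like before it is written to the vector store, the sketch below validates a product point against the dimensions quoted above. Only the two 512-dimension figures come from the text; the class name `ProductPoint`, its field names and the `{term_index: weight}` encoding of the sparse signal are hypothetical, and no Qdrant client call is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TEXT_DIM = 512   # text vector size quoted above
IMAGE_DIM = 512  # image vector size quoted above


@dataclass
class ProductPoint:
    """Hypothetical shape of one product record before upsert."""
    product_id: int
    text_vector: List[float]
    image_vector: List[float]
    # Sparse BM25-style signal as {term_index: weight}; this encoding is an assumption.
    sparse_terms: Dict[int, float] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """Return contract violations; an empty list means the point is valid."""
        found: List[str] = []
        if len(self.text_vector) != TEXT_DIM:
            found.append(
                f"text_vector has {len(self.text_vector)} dims, expected {TEXT_DIM}")
        if len(self.image_vector) != IMAGE_DIM:
            found.append(
                f"image_vector has {len(self.image_vector)} dims, expected {IMAGE_DIM}")
        if any(weight <= 0 for weight in self.sparse_terms.values()):
            found.append("sparse_terms contains non-positive weights")
        return found


if __name__ == "__main__":
    point = ProductPoint(1, [0.0] * 300, [0.0] * 512, {7: 1.4})
    for problem in point.problems():
        print(problem)
```

Rejecting malformed points before they reach the store is cheaper than discovering them at query time, which is the same idea as the data contracts mentioned under ETL/ELT.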
Where we'd start
Discovery delivers a data inventory, a quality map and a priority order: which dataset unlocks which decision, what it costs to make it trustworthy, and what to deliberately ignore. If the honest finding is that you need three SQL views before you need any machine learning, that's what the document says.