A smart web scraper? - Shivam Anand

Here's the story of how we built a smart web crawler before Firecrawl and similar tools existed.

Story Time

(you may skip if you want)

Our work on a smart web scraper began when we tried to build a scraper for the Web Archive (Internet Archive). As you may know, the Web Archive is a massive corpus of snapshots of nearly every website over the past 20 years.

The initial implementation used the Scrapy package (Python) to write custom web scrapers, but this got tedious very quickly. We (a team of 2) started looking for better solutions.

Phase 1

Enter the AI-based scraper. This is where it all began: we realised that we could take HTML pages as they are, convert the content to markdown, and use LLMs to classify each section quite easily. Converting HTML to markdown wasn't very difficult either. This worked great for the PoC.

But this didn't scale well, for the following reasons:
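To make the Phase 1 idea concrete, here is a minimal sketch of the HTML to markdown to LLM flow. It is an illustration under assumptions, not the code we actually ran: `MarkdownConverter`, `html_to_markdown`, `split_sections` and `classify_sections` are hypothetical names, only headings, paragraphs and list items are handled, and `llm` is any callable you supply that takes a prompt string and returns a label.

```python
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Tiny HTML -> markdown converter: headings, paragraphs, list items only."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished markdown blocks
        self.prefix = ""      # markdown prefix for the block being read
        self.buffer = []      # text fragments of the current block
        self.skip_depth = 0   # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()
            self.prefix = self.HEADINGS[tag]
        elif tag == "li":
            self._flush()
            self.prefix = "- "
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.HEADINGS or tag in ("li", "p"):
            self._flush()

    def handle_data(self, data):
        # Keep visible text only; inline tags such as <b> are flattened away.
        if self.skip_depth == 0 and data.strip():
            self.buffer.append(data.strip())

    def _flush(self):
        if self.buffer:
            self.blocks.append(self.prefix + " ".join(self.buffer))
        self.buffer = []
        self.prefix = ""


def html_to_markdown(html):
    parser = MarkdownConverter()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)


def split_sections(markdown):
    """Group markdown blocks into sections, starting a new one at each heading."""
    sections = []
    for block in markdown.split("\n\n"):
        if block.startswith("#") or not sections:
            sections.append([block])
        else:
            sections[-1].append(block)
    return ["\n\n".join(s) for s in sections]


def classify_sections(markdown, llm):
    """Label each section using `llm`, a caller-supplied callable (hypothetical)."""
    labels = []
    for section in split_sections(markdown):
        prompt = "Classify this section in one word:\n\n" + section
        labels.append((section.splitlines()[0], llm(prompt)))
    return labels
```

Note how the converter keeps text but throws away every tag it does not know about. That is convenient for a PoC, and it is also exactly where the structural information goes missing.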

Markdown conversion was lossy: critical DOM structure, hierarchy, and semantic context were discarded during conversion.
Boilerplate content overwhelmed the signal, causing navigation, footers, and legal text to be misclassified as meaningful sections.
Section boundaries became ambiguous, forcing the LLM to guess structural intent from flattened text.
Images, PDFs, and JS-rendered content were often incomplete or missing, leading to inconsistent or partial extractions.
Token usage scaled poorly, as markdown inflated content size and forced aggressive chunking.
Expensive AF!
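The chunking problem is easy to reproduce. Below is a rough sketch, assuming a greedy packer over blank-line-separated blocks and a crude estimate of about 4 characters per token. Both the heuristic and the names `estimate_tokens` and `chunk_blocks` are assumptions for illustration, not what any particular tokenizer or library does.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def chunk_blocks(markdown, max_tokens):
    """Greedily pack blank-line-separated blocks into chunks of at most max_tokens.

    A single block larger than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        tokens = estimate_tokens(block)
        # Close the current chunk if adding this block would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks


page = "# Pricing\n\n" + "a" * 40 + "\n\n" + "b" * 40 + "\n\n" + "c" * 40
chunks = chunk_blocks(page, max_tokens=12)
```

With this input only the first chunk still carries the "# Pricing" heading. The later chunks are bare text, so the LLM has to guess which section they belong to, and every extra chunk is another paid call.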

Phase 2

RAG Approach

Key Features

98% Retrieval Precision: Achieved industry-leading accuracy in document retrieval
Automated Content Optimization: ML-powered content enhancement and SEO optimization
Scalable Architecture: Handles millions of documents with sub-second query times
Vector Database Integration: Efficient semantic search using state-of-the-art embeddings

Technical Stack

Python, LangChain, Redis Queues (yes, this exists), Qdrant (open-source vector DB)
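This post does not spell out the retrieval pipeline, so the following is only a sketch of the general idea behind semantic search over scraped sections. To stay self-contained it swaps in toy stand-ins: `embed` is a bag-of-words counter instead of a real embedding model, and `InMemoryIndex` is a hypothetical in-memory list instead of a Qdrant collection. None of these names come from LangChain or Qdrant.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Stand-in for a vector DB collection: stores (vector, payload) pairs."""

    def __init__(self):
        self.points = []

    def upsert(self, text, payload):
        self.points.append((embed(text), payload))

    def search(self, query, limit=3):
        q = embed(query)
        scored = [(cosine(q, vec), payload) for vec, payload in self.points]
        # Sort on the score only, so payload dicts are never compared.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


index = InMemoryIndex()
index.upsert("Pricing plans and monthly subscription cost", {"section": "pricing"})
index.upsert("Contact our support team by email", {"section": "contact"})
index.upsert("Privacy policy and cookie notice", {"section": "legal"})
hits = index.search("how much does the subscription cost", limit=1)
```

The point of the design is that only the few best-matching sections are sent to the LLM, instead of the whole converted page.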

Impact

Deployed to production, serving thousands of daily queries with consistent sub-second response times and a 98% user satisfaction rate.