A physics enthusiast who found his calling in solving impossible scraping challenges for D2C brands and lead gen agencies
That kid obsessed with Einstein and Feynman? Now I'm building production scrapers that extract millions of records while bypassing the toughest anti-bot systems. 50+ production systems built. Co-founder @ CubikTech.
From a small town in Goa to building production scrapers that break through the toughest anti-bot systems
Grew up in a small town in Goa, searching for my passion. Everything changed when a physics teacher ignited my curiosity.
Became obsessed with Einstein, Feynman, and relativity. Read everything from "Einstein's Relativity" to "Life 3.0". Topped my class in math and built a deep grounding in physics.
CS Engineering degree. The problem-solving approach from physics applied perfectly to code.
Failed first startup (MSME aggregator) taught me to solve real problems. Won multiple hackathons — loved breaking down complex systems and finding loopholes.
Co-founded CubikTech. Clients kept asking for "impossible" data extraction.
Breakthrough: Cloudflare-protected hotel data — found API loopholes, forged tokens, built custom decryption. Delivered what seemed impossible.
Scraping = system analysis + creative problem-solving + production engineering.
Follow for more scraping breakdowns and D2C automation insights
The physics mindset applied to breaking through anti-bot systems
Every website is a system with assumptions. My job? Find the assumptions, find the loophole.
Example: Most scrapers try to solve CAPTCHAs. I study the authentication flow to find API loopholes. For alltophotels.io, I discovered I could forge tokens directly. No CAPTCHA solving needed.
Don't accept "This can't be scraped." Break it down to fundamentals.
Example: Cloudflare blocks traditional scraping. But websites still need to load data for real users. Find how they authenticate real users, replicate that—not the browser automation.
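To make that concrete, here's a minimal sketch of the token approach. Everything in it is hypothetical: assume the frontend signs each API path with an HMAC key recovered from its JS bundle; the domain, headers, and signing scheme are illustrative, not any real site's actual flow.

```python
import hashlib
import hmac
import time

import requests

# Hypothetical signing scheme: HMAC-SHA256 over "<path>:<timestamp>" with a
# key recovered from the site's JS bundle. Illustrative only.
SIGNING_KEY = b"key-recovered-from-js-bundle"

def forge_headers(path: str) -> dict:
    """Rebuild the auth headers the site's own frontend would attach."""
    ts = str(int(time.time()))
    sig = hmac.new(SIGNING_KEY, f"{path}:{ts}".encode(), hashlib.sha256).hexdigest()
    return {"X-Timestamp": ts, "X-Signature": sig}

def fetch_hotels(page: int) -> dict:
    path = f"/api/v1/hotels?page={page}"
    resp = requests.get("https://hotels.example" + path,
                        headers=forge_headers(path), timeout=30)
    resp.raise_for_status()
    return resp.json()  # clean JSON from the API itself: no CAPTCHA in the loop
```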
One-off scripts fail. Production systems need resilience, monitoring, graceful degradation.
Example: The Blinkit scraper kept crashing every 6-8 hours. Three days of investigation revealed a Docker configuration issue. One config line (init: true) fixed it; it now runs for 48+ hours without issues.
Modern scraping isn't just extraction—it's extraction + enrichment + intelligence.
Example: Scraping ABC licenses is step 1. Using Gemini 2.5 Flash with real-time Google Search grounding to enrich with emails, websites, and business info? That's what turns data into intelligence.
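A minimal sketch of that enrichment step using the google-genai SDK's Search grounding tool. The prompt and output fields are illustrative, not the production pipeline:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

def enrich(business_name: str, city: str) -> str:
    """Ask Gemini for enrichment data, grounded in live Google Search results."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=(
            f"Find the website, contact email, and a one-line description for "
            f"the licensed business '{business_name}' in {city}. "
            f"Reply as JSON with keys: website, email, description."
        ),
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    return response.text  # grounded answer backed by live search
```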
Here's what most agencies won't tell you: Scrapers break. Often. Websites change their structure, update anti-bot systems, rotate authentication methods. If an agency promises "scrapers that never break," they either don't understand production scraping or they're lying.
At scale (20k+ products daily), you're not fighting one website—you're fighting dozens of platforms, each with their own quirks, rate limits, and protection systems. Maintenance isn't a bug, it's the job.
The question is: How fast can I detect the issue, identify the change, and deploy a fix?
This is why internal dev teams struggle with scraping. It's not just about writing code—it's about proxy rotation expertise, bot protection patterns, and knowing when to pivot from browser automation to API interception. That's the specialized knowledge that separates hobbyist scripts from production systems.
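Here's what fast detection can look like: a cheap health check after every run that compares extraction yield against a baseline. A minimal sketch; the thresholds and fields are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    pages_fetched: int
    records_parsed: int
    fields_missing: int

def health_check(stats: RunStats, baseline_yield: float = 40.0) -> list[str]:
    """Flag the two common failure modes: blocking and silent schema drift."""
    alerts = []
    yield_per_page = stats.records_parsed / max(stats.pages_fetched, 1)
    if yield_per_page < 0.5 * baseline_yield:
        alerts.append("yield collapsed: likely blocked or layout changed")
    if stats.records_parsed and stats.fields_missing / stats.records_parsed > 0.2:
        alerts.append("field fill-rate dropped: selectors or schema drifted")
    return alerts  # wire these into whatever pager/Slack alerting you run
```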
The technical capabilities that enable reliable, scalable scraping systems
Why it matters: Scraping at scale (20k+ products daily) requires job orchestration that survives crashes, handles retries, and distributes work across multiple workers. Message queues ensure no data is lost when scrapers fail.
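A minimal sketch of that pattern with Redis lists (queue names and the scrape step are placeholders): each job is moved atomically into a per-worker processing list, so nothing is lost if the worker dies mid-job.

```python
import json

import redis

r = redis.Redis()

def scrape(job: dict) -> None:
    """Placeholder for the actual extraction step."""
    print("scraping", job["url"])

def work_forever(worker_id: str) -> None:
    processing = f"processing:{worker_id}"
    while True:
        # Atomically pop a job and park it in this worker's processing list
        raw = r.brpoplpush("jobs", processing, timeout=5)
        if raw is None:
            continue  # queue is idle
        try:
            scrape(json.loads(raw))
            r.lrem(processing, 1, raw)  # ack: remove only after success
        except Exception:
            r.lpush("jobs", raw)        # crude retry: hand the job back
            r.lrem(processing, 1, raw)
```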
Beyond basic extraction — retry loops, intelligent caching, and AI enrichment that scales to millions of records
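The retry-and-cache layer can be as small as this sketch (backoff parameters are illustrative, and a production cache would be persistent rather than in-memory):

```python
import random
import time

import requests

_cache: dict[str, str] = {}  # in-memory for the sketch; persist in production

def fetch(url: str, retries: int = 4) -> str:
    if url in _cache:
        return _cache[url]  # skip pages already downloaded this run
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code in (429, 503):
                raise requests.HTTPError(f"throttled ({resp.status_code})")
            resp.raise_for_status()
            _cache[url] = resp.text
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # exponential backoff + jitter
```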
Production-grade scraping systems handling millions of records
The Problem: D2C brands lose revenue to stockouts and get undercut by competitors who monitor pricing in real-time. Manual price checking doesn't scale. Most scrapers crash or get blocked.
Key Achievements: API interception (10× faster than HTML parsing), residential proxy integration, batch processing (48× database speedup), full pagination (50 pages per job), 3-6 parallel workers with intelligent job orchestration.
→ Perfect for D2C brands tracking competitor pricing across marketplaces, or agencies building competitive intelligence tools.
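To show two of those wins concretely, here's a hedged sketch: call the JSON API the site's own frontend uses (instead of parsing HTML), then write a full page of rows per database round-trip. The endpoint, schema, and field names are hypothetical.

```python
import sqlite3

import requests

db = sqlite3.connect("prices.db")
db.execute("CREATE TABLE IF NOT EXISTS prices (sku TEXT, price REAL, seen_at TEXT)")

def pull_page(page: int) -> list[dict]:
    resp = requests.get(
        "https://marketplace.example/api/catalog",   # the frontend's own API
        params={"page": page, "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["products"]  # structured data, no HTML parsing

for page in range(1, 51):  # full pagination: 50 pages per job
    rows = [(p["sku"], p["price"], p["updated_at"]) for p in pull_page(page)]
    db.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)  # one write per page
    db.commit()
```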
The Problem: Lead gen agencies burn 80+ hours/month on manual research: finding licensed businesses, enriching with emails, verifying contacts, categorizing by territory. This doesn't scale.
Google Maps scraping (7 endpoints). 9 venue categories. Multi-factor luxury scoring. 89% valid contact data.
alltophotels.io extraction (the origin story!). Authentication handling, 10+ data points per hotel.
Apify + BrightData versions. UK/US targeting. 1-10 parallel workers. Automatic checkpointing.
UK directory scraping. Letter-based batching (A-Z). 500 companies per batch.
800+ ads per brand. ML ensemble (92.7% accuracy). 8 LangGraph agents. Multi-modal analysis.
→ Built for lead gen agencies, GTM teams, and anyone drowning in manual prospect research.
Client: Stack Optimise
The Problem: AbarAbove.com (bar accessories D2C brand) was manually researching 50 leads/week. Each lead took 30+ minutes: Find ABC license → Find business → Find contact info → Verify → Add to CRM. 100+ hours/month of pure manual work.
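A skeleton of the automated version of that loop. Every function here is a stub standing in for the real scraper or enrichment call; names and fields are illustrative:

```python
def scrape_abc_licenses(state: str) -> list[dict]:
    return [{"license": "ABC-123", "name": "Example Bar Co", "state": state}]

def enrich_contact(lead: dict) -> dict:
    return {**lead, "email": "owner@example.com", "website": "example.com"}

def verify_contact(lead: dict) -> bool:
    return "@" in lead.get("email", "")  # stand-in for real email verification

def push_to_crm(leads: list[dict]) -> None:
    print(f"pushed {len(leads)} leads")  # stand-in for the CRM API call

# License -> business -> contact -> verify -> CRM, with no human in the loop
leads = [enrich_contact(l) for l in scrape_abc_licenses("CA")]
push_to_crm([l for l in leads if verify_contact(l)])
```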
Deep domain expertise in the data challenges that matter to D2C brands, lead gen agencies, and GTM teams
Real-time intelligence for D2C brands competing on fast-moving marketplaces
Transform manual prospect research into automated intelligence pipelines
Monitor competitor moves before your competitors monitor yours
Hyperlocal intelligence for site selection and market analysis
The difference between scrapers that break in production and systems that run for months without supervision
Production scrapers need more than good code. They need robust infrastructure, monitoring, automatic recovery, and the ability to run unattended.
My Blinkit scraper kept crashing every 6-8 hours. The scraping logic was perfect; the issue was how Docker handled background browser processes. One configuration change (init: true) fixed the cleanup issue, turning 6-hour crashes into 48+ hour continuous runs.
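The fix itself is a single line of Compose config: init: true makes Docker run a minimal init process as PID 1, which reaps the zombie processes headless browsers leave behind. A sketch (service and image names are illustrative):

```yaml
services:
  scraper:
    image: blinkit-scraper:latest   # illustrative image name
    init: true   # PID 1 init reaps orphaned browser processes
```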
The takeaway: When something breaks, dig deep into how the entire system works—not just your code. Understand the platform's behavior, find where it falls short, and work around it. That's the physics approach to scraping.
Dealing with similar production challenges? DM me on X to discuss your approach.
@HarshGawas25 →
From discovery to deployment — a proven 4-step process
Understand your data needs and assess target complexity
Analyze architecture, anti-bot protection, and find loopholes
Custom development, proxy setup, and production deployment
Monitoring, adaptation, and continuous optimization
I co-founded CubikTech, an AI-first automation agency for D2C brands. While scraping is my specialty, we also build broader automation solutions.
Customer support, operations automation, intelligent workflows
n8n, Make, custom APIs - automate repetitive ops
Custom dashboards, reconciliation, marketplace insights
90-day delivery for funded startups using AI tools
If your challenge is broader than data extraction, let's talk about what AI automation can do for your business.
No pitch, no commitment. Just a conversation about what's possible and how production scraping actually works.