Project Snapshot
Overview of client, duration, and goals.
- Client: Sakura (Project Manager, Japan)
- Duration: 1.5 months
- Goal: Scrape crowdfunding sites, use OCR+NLP to fill hidden fields, enable search/save/export
- Capacity: 100k–200k users/day

Problem
Key details like price, funding, and location are often inside images or non-scrapable sections. Traditional scraping misses these, hurting research and sourcing.
What We Built
A full SaaS: users sign up and search by category or keyword, and the platform aggregates results across 9 crowdfunding sites. When a listing's text is incomplete, the scraper captures a full-page screenshot and sends it to an OCR+NLP service that extracts the missing fields. Users review the results, save them to a private workspace, or export them to Google Sheets.
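A minimal sketch of that fallback path, assuming Puppeteer for scraping and an internal HTTP endpoint for the OCR+NLP service; the endpoint URL, selectors, and field names are illustrative, not the production code.

```ts
// Hypothetical fallback path: if required fields are missing after DOM
// scraping, capture a full-page screenshot and send it to the OCR+NLP service.
import puppeteer from "puppeteer";

interface ScrapedListing {
  title?: string;
  price?: string;
  funding?: string;
  location?: string;
  url: string;
}

const REQUIRED_FIELDS: (keyof ScrapedListing)[] = ["price", "funding", "location"];

async function scrapeWithOcrFallback(url: string): Promise<ScrapedListing> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "networkidle2" });

    // First pass: plain DOM scraping (selectors vary per site; "h1" is a stand-in).
    const listing: ScrapedListing = {
      url,
      title: await page
        .$eval("h1", (el) => el.textContent?.trim())
        .catch(() => undefined),
    };

    // If key fields are still missing, fall back to a full-page screenshot.
    const missing = REQUIRED_FIELDS.filter((f) => !listing[f]);
    if (missing.length > 0) {
      const screenshot = await page.screenshot({ fullPage: true, encoding: "base64" });
      // POST to the Python OCR+NLP service (URL and payload shape are assumptions).
      const res = await fetch("http://ocr-service:8000/extract", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ image: screenshot, fields: missing, sourceUrl: url }),
      });
      const extracted = (await res.json()) as Partial<ScrapedListing>;
      Object.assign(listing, extracted);
    }
    return listing;
  } finally {
    await browser.close();
  }
}
```

Capturing the screenshot only when fields are missing keeps the OCR service off the hot path for listings that scrape cleanly.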
System Architecture
Microservices:
- Frontend: Next.js + TypeScript
- Scraper: Node + Express + Puppeteer, with anti-bot hardening
- OCR+NLP: Python + FastAPI + EasyOCR, with embeddings and schema alignment
- Export: Node + Express + Google Sheets API
- Database: NeonDB Postgres via Prisma
- Deployment: VPS + Dokploy
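Schema alignment hinges on every service agreeing on one listing shape. The interface below is a hypothetical rendering of that shared contract in TypeScript, not the actual Prisma model; the field names and the confidence convention are assumptions.

```ts
// Illustrative shared listing schema that the scraper and OCR+NLP service
// align to. Scraped text is treated as fully trusted; OCR values carry the
// model's confidence so the merge step can decide what to keep.
type FieldSource = "scrape" | "ocr";

interface ExtractedField<T = string> {
  value: T;
  source: FieldSource;
  confidence: number; // 0..1; scraped text defaults to 1.0
}

interface Listing {
  id: string;
  sourceSite: string; // one of the 9 aggregated crowdfunding sites
  url: string;
  title: ExtractedField;
  price?: ExtractedField;
  funding?: ExtractedField;
  location?: ExtractedField;
  language: "ja" | "ko" | "zh-TW" | "en";
  scrapedAt: Date;
}
```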
Key Flows
- Search & scrape, with screenshot capture for missing data
- OCR+NLP enrichment, with normalization and a confidence-based merge (sketched after this list)
- Save to a private workspace and export to the user's Google Sheets, with strong tenant isolation
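A minimal sketch of the confidence-based merge, reusing the Listing and ExtractedField types from the architecture sketch above; the 0.8 threshold and the audit-entry shape are assumptions.

```ts
// An OCR-extracted value only overwrites an existing field when its
// confidence clears a threshold, and every decision is recorded for audit.
const MIN_OCR_CONFIDENCE = 0.8;

interface MergeAuditEntry {
  field: string;
  accepted: boolean;
  confidence: number;
  reason: string;
}

type EnrichableField = "price" | "funding" | "location";

function mergeOcrFields(
  listing: Listing,
  ocrFields: Partial<Record<EnrichableField, ExtractedField>>,
  audit: MergeAuditEntry[]
): Listing {
  const merged = { ...listing };
  for (const [field, candidate] of Object.entries(ocrFields)) {
    if (!candidate) continue;
    const existing = merged[field as EnrichableField];
    // Only fill gaps or replace lower-confidence values.
    const accepted =
      candidate.confidence >= MIN_OCR_CONFIDENCE &&
      (!existing || existing.confidence < candidate.confidence);
    audit.push({
      field,
      accepted,
      confidence: candidate.confidence,
      reason: accepted
        ? "above threshold and better than existing"
        : "below threshold or weaker than existing",
    });
    if (accepted) {
      merged[field as EnrichableField] = candidate;
    }
  }
  return merged;
}
```

Recording rejected candidates alongside accepted ones is what makes the merge auditable: a reviewer can see exactly why a field kept its scraped value.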
Demo Video
Technical Highlights
- Anti-bot resilience: stealth plugins, rotating user agents, randomized delays (sketched after this list)
- OCR precision: full-page screenshots plus de-noising
- NLP field inference across JP/KR/TW/EN with a rule + ML hybrid
- Data integrity: confidence scores and an auditable merge
- Export isolation: per-user OAuth and scoped writes
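The anti-bot layer can be sketched with the real puppeteer-extra stealth plugin; the user-agent pool and delay bounds below are illustrative, not the production values.

```ts
// Stealth plugin masks headless fingerprints; randomized user agents and
// human-ish pauses avoid the burst patterns that trip rate-based detection.
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
];

// Random pause between requests, within the given bounds.
const randomDelay = (minMs: number, maxMs: number) =>
  new Promise((resolve) => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));

async function openHardenedPage(url: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setUserAgent(USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)]);
  await randomDelay(1_000, 4_000); // jitter before navigation
  await page.goto(url, { waitUntil: "networkidle2" });
  return { browser, page };
}
```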
Challenges & Solutions
- Cloudflare blocks → mitigated with stealth plugins, randomized timing, and realistic request headers.
- Export isolation bug → fixed with a strict user context on every export job, per-tenant OAuth tokens, and pre-write ownership checks (sketched below).
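A sketch of that isolation fix using the real googleapis Sheets client; the job shape and the two store helpers are hypothetical stand-ins for the production token store and workspace checks.

```ts
// Every export job carries an explicit user context, builds a Sheets client
// from that user's own OAuth tokens, and verifies ownership before writing.
import { google } from "googleapis";

interface ExportJob {
  userId: string;
  spreadsheetId: string;
  rows: string[][];
}

// Placeholder: the real service would read per-user tokens from Postgres.
async function loadUserTokens(userId: string): Promise<{ access_token: string }> {
  return { access_token: `token-for-${userId}` }; // hypothetical
}

// Placeholder: the real service would check the workspace table and throw
// if the spreadsheet does not belong to this user.
async function assertSpreadsheetOwnedBy(
  userId: string,
  spreadsheetId: string
): Promise<void> {}

async function runExportJob(job: ExportJob): Promise<void> {
  // Pre-write check: never write into another tenant's sheet.
  await assertSpreadsheetOwnedBy(job.userId, job.spreadsheetId);

  // Per-tenant credentials: this client is scoped to one user's tokens only.
  const auth = new google.auth.OAuth2();
  auth.setCredentials(await loadUserTokens(job.userId));
  const sheets = google.sheets({ version: "v4", auth });

  await sheets.spreadsheets.values.append({
    spreadsheetId: job.spreadsheetId,
    range: "Sheet1!A1",
    valueInputOption: "RAW",
    requestBody: { values: job.rows },
  });
}
```

Passing the user context on the job itself, rather than reading it from shared state, is what prevents a background worker from ever mixing tenants.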
Outcome
A scalable cross-market aggregator that unlocks reliable, export-ready data otherwise hidden inside images, usable for research, sourcing, and tracking.
Learnings
Set constraints early, price for value and risk, and document trade-offs to maintain velocity.