Project Snapshot

Overview of client, duration, and goals.

  • Client: Sakura (Project Manager, Japan)
  • Duration: 1.5 months
  • Goal: Scrape crowdfunding sites, use OCR+NLP to fill hidden fields, enable search/save/export
  • Capacity: 100k–200k users/day

OCR-Powered Crowdfunding Data Scraper (Full SaaS)

Problem

Key details like price, funding, and location are often inside images or non-scrapable sections. Traditional scraping misses these, hurting research and sourcing.

What We Built

A full SaaS product: users sign up, search by category or keyword, and get results aggregated across 9 websites. When the scraped text is incomplete, the scraper captures a full-page screenshot and sends it to an OCR+NLP service that extracts the missing fields. Users then review the results, save them to a private workspace, or export them to Google Sheets.
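
The decision to fall back to screenshots can be pictured as a simple completeness check on each scraped record. The sketch below is illustrative only: the field names in REQUIRED_FIELDS and the helper names are assumptions, since the actual schema is not listed in this snapshot.

```python
from __future__ import annotations

# Hypothetical required fields; the real schema is not given in this snapshot.
REQUIRED_FIELDS = ("title", "price", "funding", "location")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or blank in a scraped record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


def needs_ocr(record: dict) -> bool:
    """Route a record to the screenshot + OCR path only when something is missing."""
    return bool(missing_fields(record))
```

Under these assumptions, a record with a title but an empty price would report price, funding, and location as missing, and only such records would incur the cost of a screenshot and an OCR call.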

System Architecture

Microservices:

  • Frontend: Next.js + TypeScript
  • Scraper: Node + Express + Puppeteer, with anti-bot hardening
  • OCR+NLP: Python + FastAPI + EasyOCR, with embeddings and schema alignment
  • Export: Node + Express + Google Sheets
  • Database: NeonDB Postgres via Prisma
  • Deployment: VPS + Dokploy

Key Flows

  • Search & scrape, with screenshot capture for missing data
  • OCR+NLP enrichment, with normalization and confidence-based merge
  • Save to a private workspace and export to the user's Google Sheets, with strong tenant isolation
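
One way to picture a confidence-based merge: each field carries a value, a confidence score, and its source, and the merge keeps the most confident candidate while logging every decision. This is a minimal sketch, not the project's implementation. The FieldValue type, the 0.6 threshold, and the audit-line format are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldValue:
    value: str
    confidence: float  # 0.0 to 1.0
    source: str        # "scrape" or "ocr"


def merge_fields(
    scraped: dict[str, FieldValue],
    ocr: dict[str, FieldValue],
    min_confidence: float = 0.6,  # assumed threshold, not from the project
) -> tuple[dict[str, FieldValue], list[str]]:
    """Keep the most confident candidate per field and record why."""
    merged: dict[str, FieldValue] = {}
    audit: list[str] = []
    for name in sorted(set(scraped) | set(ocr)):
        # Scraped value is listed first, so it wins an exact confidence tie.
        candidates = [
            c for c in (scraped.get(name), ocr.get(name))
            if c is not None and c.confidence >= min_confidence
        ]
        if not candidates:
            audit.append(f"{name}: no candidate met the threshold")
            continue
        best = max(candidates, key=lambda c: c.confidence)
        merged[name] = best
        audit.append(f"{name}: kept {best.source} ({best.confidence:.2f})")
    return merged, audit
```

Returning the audit list alongside the merged record is what would make the merge reviewable later: a low-confidence OCR value is dropped rather than silently written, and the reason is kept.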

Demo Video

Coming soon: product walkthrough video

Technical Highlights

  • Anti-bot resilience (stealth plugins, user-agents, delays)
  • OCR precision with full-page screenshots and de-noising
  • NLP field inference (JP/KR/TW/EN) with a rule+ML hybrid
  • Data integrity via confidence scores and an auditable merge
  • Export isolation via per-user OAuth and scoped writes
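
The rule half of a rule+ML hybrid might look like the sketch below, which pulls a price and currency out of OCR text using currency markers. The marker table and the function name are assumptions for illustration. The real rules are not described in this snapshot, and the mapping is simplified (for example, "¥" is treated as JPY and a bare "$" as USD).

```python
from __future__ import annotations

import re

# Assumed marker-to-currency mapping; simplified for illustration.
CURRENCY_MARKERS = {
    "¥": "JPY", "円": "JPY",
    "₩": "KRW", "원": "KRW",
    "NT$": "TWD",
    "$": "USD",
}

# Optional prefix symbol, an amount with optional thousands separators,
# and an optional suffix marker. "NT$" is listed before "$" so it wins.
PRICE_RE = re.compile(
    r"(?P<prefix>NT\$|¥|₩|\$)?\s*"
    r"(?P<amount>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<suffix>円|원)?"
)


def extract_price(text: str) -> tuple[float, str] | None:
    """Return (amount, currency) for the first number that has a currency marker."""
    for match in PRICE_RE.finditer(text):
        marker = match.group("prefix") or match.group("suffix")
        if marker is None:
            continue  # a bare number such as a backer count is not a price
        amount = float(match.group("amount").replace(",", ""))
        return amount, CURRENCY_MARKERS[marker]
    return None
```

Text that matches no rule returns None, which is the natural point to hand the field to a model-based extractor.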

Challenges & Solutions

  • Cloudflare blocks → stealth + timing + headers
  • Export isolation bug → strict user context in jobs, per-tenant tokens, and pre-write checks
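
A pre-write check of the kind described above can be sketched as a guard that runs immediately before any sheet write. The ExportJob and UserToken types, the allowed_sheet_ids field, and the exception name are hypothetical. The snapshot does not describe how the actual export service models jobs or tokens.

```python
from __future__ import annotations

from dataclasses import dataclass


class TenantMismatchError(Exception):
    """Raised when a job would write with another user's credentials or target."""


@dataclass(frozen=True)
class ExportJob:
    user_id: str   # user context captured when the job was queued
    sheet_id: str  # target spreadsheet
    rows: tuple


@dataclass(frozen=True)
class UserToken:
    user_id: str
    allowed_sheet_ids: frozenset  # assumed per-tenant scope


def check_before_write(job: ExportJob, token: UserToken) -> None:
    """Refuse the write unless the job and the token belong to the same user and scope."""
    if job.user_id != token.user_id:
        raise TenantMismatchError(
            f"job belongs to {job.user_id!r} but token belongs to {token.user_id!r}"
        )
    if job.sheet_id not in token.allowed_sheet_ids:
        raise TenantMismatchError(
            f"sheet {job.sheet_id!r} is outside the scope of user {token.user_id!r}"
        )
```

Failing loudly at this point means a job that lost or mixed up its user context is rejected before any row reaches another tenant's sheet.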

Outcome

A scalable cross-market aggregator that surfaces reliable, export-ready data otherwise hidden inside images, usable for research, sourcing, and tracking.

Learnings

Set constraints early, price for value and risk, and document trade-offs to maintain velocity.