
Lead Web Scraper

AI-Powered Business Intelligence & Lead Generation
TYPESCRIPT • OPENAI GPT-4o-mini • PLAYWRIGHT • POSTGRESQL

Project Overview

AI-Powered B2B Lead Generation & Enrichment Platform

A sophisticated business intelligence system that discovers, crawls, and analyzes restaurant businesses using a combination of headless browser automation (Playwright), multi-provider API aggregation (Google Places, SerpAPI), and six distinct AI-powered analyzers (OpenAI GPT-4o-mini). The platform generates comprehensive lead profiles including AI-estimated revenue breakdowns, owner identification, chain affiliation detection, review problem analysis, and ordering system detection.

AI-Powered Capabilities:

  • Owner Name Extraction — AI crawls about/team/founder pages, extracts owner names with confidence scoring and multi-source consensus aggregation
  • Two-Stage Revenue Prediction — Initial AI estimate with adaptive web search enrichment when confidence falls below 0.6, breaking revenue into dine-in, takeout, and delivery channels
  • Chain Affiliation Detection — AI determines if a business is independent or part of a franchise, identifies parent companies, and estimates location counts
  • Business Type Classification — AI classifies businesses into 16 categories using website content, Google Place types, and service indicators
  • Review Problem Analysis — AI analyzes negative reviews across 7 problem categories (food quality, service, delivery, cleanliness, pricing, wait times, order accuracy)
  • Ordering Vendor Detection — Multi-signal scoring engine using domain matching, URL regex, text patterns, and script host detection

6 AI analyzers • 5 data providers • 7 database tables • 10+ REST endpoints • Session-based discovery pipeline

AI Analyzers & Detectors

Two-Stage Revenue Prediction Pipeline

OpenAI GPT-4o-mini • Adaptive Web Search • Google Search (Playwright) • Bing Fallback • Multi-Channel Breakdown

A unique two-stage revenue estimation pipeline that adapts its data gathering based on confidence. The AI first estimates revenue from available business data, then autonomously decides whether to perform web searches for industry benchmarks to refine its prediction.

Pipeline Flow:

  1. Stage 1 — Initial AI Estimate: GPT-4o-mini analyzes the restaurant profile (name, category, rating, review count, price level, location) and produces a revenue breakdown by channel
  2. Confidence Gate: If the AI's confidence score falls below 0.6, the pipeline automatically triggers web research
  3. Web Research: Three targeted search queries are executed:
    • "{category} restaurant average monthly revenue"
    • "restaurant industry revenue per location {location}"
    • "{category} restaurant sales benchmarks"
  4. Search Provider Cascade: Google Search via headless Playwright first; if no results, falls back to Bing Search API
  5. Stage 2 — Enhanced Estimate: AI re-analyzes with the original data plus web search results for a refined prediction
  6. Multi-Channel Output: Final estimate breaks revenue into dine-in, takeout/pickup, and delivery channels with total monthly revenue
Technical Highlight: The adaptive confidence gate (threshold 0.6) prevents unnecessary API calls for straightforward businesses while ensuring data-poor leads get enriched with market research. The search provider cascade (Google → Bing) ensures resilience. The output includes a usedWebSearch flag for transparency. This two-stage pattern could be generalized to any AI estimation task requiring variable data quality.
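
A minimal sketch of the confidence-gated flow, assuming hypothetical helper names (askModelForRevenue, searchWithFallback) in place of the project's actual provider modules:

```typescript
// Sketch only: askModelForRevenue and searchWithFallback are illustrative stand-ins
// for the OpenAI and search providers described above.
interface RestaurantProfile {
  name: string;
  category: string;
  rating: number;
  reviewCount: number;
  priceLevel: number;
  location: string;
}

interface RevenueEstimate {
  dineIn: number;
  takeout: number;
  delivery: number;
  totalMonthly: number;
  confidence: number;      // 0-1, returned by the model
  usedWebSearch: boolean;  // transparency flag from the description above
}

declare function askModelForRevenue(
  profile: RestaurantProfile,
  webResults?: string[]
): Promise<Omit<RevenueEstimate, "usedWebSearch">>;
declare function searchWithFallback(queries: string[]): Promise<string[]>; // Google via Playwright, then Bing

const CONFIDENCE_THRESHOLD = 0.6;

export async function estimateRevenue(profile: RestaurantProfile): Promise<RevenueEstimate> {
  // Stage 1: initial estimate from the business profile alone
  const first = await askModelForRevenue(profile);
  if (first.confidence >= CONFIDENCE_THRESHOLD) {
    return { ...first, usedWebSearch: false };
  }

  // Confidence gate tripped: gather industry benchmarks via web search
  const queries = [
    `${profile.category} restaurant average monthly revenue`,
    `restaurant industry revenue per location ${profile.location}`,
    `${profile.category} restaurant sales benchmarks`,
  ];
  const results = await searchWithFallback(queries);

  // Stage 2: re-estimate with the original data plus search snippets
  const second = await askModelForRevenue(profile, results);
  return { ...second, usedWebSearch: true };
}
```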

Multi-Source Owner Detection with Consensus Scoring

OpenAI GPT-4o-mini • Playwright Page Crawling • Priority Page Ordering • Confidence Aggregation • Early Exit Optimization

An intelligent multi-page owner extraction system that discovers restaurant owners by crawling relevant pages, sending each to the AI for name extraction, and aggregating results across sources using a custom consensus scoring algorithm.

Detection Algorithm:

  1. Page Discovery: Scans the website for links containing keywords: about, story, team, founder, contact, leadership, staff
  2. Priority Ordering: Pages are ranked by relevance (about > story > team > founder > contact) for optimal processing order
  3. Batch Processing: Pages are loaded 2 at a time (max 10 total) via Playwright, with visible text extracted (max 8000 chars per page)
  4. AI Extraction: Each page's content is sent to GPT-4o-mini with a prompt instructing it to identify ownership roles only (not managers or staff)
  5. Structured Response: AI returns JSON with owner_name, confidence (0-1), reasoning, and source_snippet for evidence
  6. Early Exit: If any single extraction returns confidence ≥ 0.8, processing stops immediately to save API calls
  7. Consensus Aggregation: All extractions are grouped by owner name. Final score = sum(confidences) × number of sources, with a multi-source boost of up to +0.15 for names confirmed across multiple pages
  8. Best Match: The owner name with the highest aggregated score is selected, with evidence from all confirming pages
Technical Highlight: The consensus scoring algorithm (confidence × sources + multi-source boost) rewards consistency across pages rather than a single high-confidence hit. This dramatically reduces false positives from AI hallucination. The early exit optimization at 0.8 confidence avoids unnecessary crawling on pages where the owner is clearly identified. Multiple owners are separated by " & " in the output.
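
A minimal sketch of the consensus scoring described above. The extraction shape follows the description; the per-extra-page boost of 0.05 (capped at +0.15) is an assumed implementation detail.

```typescript
interface OwnerExtraction {
  ownerName: string;
  confidence: number;     // 0-1 from a single page
  sourceUrl: string;
  sourceSnippet: string;
}

function pickBestOwner(extractions: OwnerExtraction[]) {
  // Group extractions by normalized owner name
  const groups = new Map<string, OwnerExtraction[]>();
  for (const e of extractions) {
    const key = e.ownerName.trim().toLowerCase();
    groups.set(key, [...(groups.get(key) ?? []), e]);
  }

  let best: { name: string; score: number; evidence: OwnerExtraction[] } | null = null;
  for (const group of groups.values()) {
    const sumConfidence = group.reduce((sum, e) => sum + e.confidence, 0);
    // Multi-source boost, capped at +0.15, for names confirmed on several pages
    const boost = Math.min(0.15, 0.05 * (group.length - 1));
    const score = sumConfidence * group.length + boost;
    if (!best || score > best.score) {
      best = { name: group[0].ownerName, score, evidence: group };
    }
  }
  return best; // null if no extractions were produced
}
```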

Chain Affiliation Detection & Pre-Screening

OpenAI GPT-4o-mini • Two-Phase Detection • Auto-Decline Pipeline • Location Count Estimation

A two-phase chain detection system that first pre-screens businesses to identify obvious chains (saving processing time), then performs deep AI analysis for borderline cases. Chains are automatically declined from the lead pipeline.

Two-Phase Flow:

  1. Phase 1 — Pre-Screening: chainPreScreener sends a quick AI query to classify the business as a large chain, small chain, or independent
  2. Skip Logic: If pre-screening identifies a large chain with high confidence, the full crawl is skipped entirely
  3. Phase 2 — Deep Analysis: For non-obvious cases, chainDetector performs full analysis considering website structure, location count, and business name patterns
  4. Output: Returns is_chain (boolean), group_name (parent company), confidence score, and evidence text
  5. Auto-Decline: When summary.isChainOrGroup.value === true, the lead status is automatically set to "declined" in the database
Technical Highlight: The pre-screening phase prevents expensive full crawls on obvious chains like McDonald's or Starbucks. The confidence threshold (< 0.7 treated as independent) is tuned to avoid false chain detections on independent restaurants that happen to have multiple locations. The auto-decline pipeline keeps the lead database clean without manual review of chain restaurants.
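
Illustrative control flow for the two-phase detection and auto-decline. The function names (preScreenChain, analyzeChain, markLeadDeclined) and the 0.9 pre-screen threshold are assumptions; the 0.7 deep-analysis threshold matches the description above.

```typescript
interface BusinessRecord { id: string; name: string; website?: string; }

// Hypothetical stand-ins for chainPreScreener, chainDetector, and the lead-status update.
declare function preScreenChain(name: string, website?: string):
  Promise<{ classification: "large_chain" | "small_chain" | "independent"; confidence: number }>;
declare function analyzeChain(business: BusinessRecord):
  Promise<{ is_chain: boolean; group_name?: string; confidence: number; evidence: string }>;
declare function markLeadDeclined(id: string, reason: string): Promise<void>;

async function screenForChain(business: BusinessRecord): Promise<void> {
  // Phase 1: cheap pre-screen; obvious large chains skip the full crawl entirely
  const pre = await preScreenChain(business.name, business.website);
  if (pre.classification === "large_chain" && pre.confidence >= 0.9) {
    await markLeadDeclined(business.id, "pre-screened as large chain");
    return;
  }

  // Phase 2: deep analysis for borderline cases;
  // below 0.7 confidence the business is treated as independent
  const result = await analyzeChain(business);
  if (result.is_chain && result.confidence >= 0.7) {
    await markLeadDeclined(business.id, `chain: ${result.group_name ?? "unknown group"}`);
  }
}
```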

AI Review Problem Analysis

OpenAI GPT-4o-mini • SerpAPI Review Scraping • 7 Problem Categories • Response Rate Calculation

Combines SerpAPI-powered review scraping with AI analysis to identify recurring business problems from customer reviews. The review scraper handles multiple response formats and calculates owner engagement metrics, while the AI categorizes problems across 7 dimensions.

Analysis Pipeline:

  1. Review Scraping: SerpAPI fetches up to 30 recent Google reviews with pagination (4 pages, 800ms delay between requests)
  2. Date Parsing: Handles three date formats: Unix timestamps, ISO dates, and relative strings ("3 days ago", "a week ago", "2 months ago") with estimated absolute date calculation
  3. Metrics Calculation:
    • reviewsLast4Weeks: Reviews in past 28 days
    • reviewsLast4Months: Reviews in past 4 months
    • reviewsPerWeek: Calculated from date range of scraped reviews
    • responseRate: 0-1 decimal of reviews with owner responses
    • responseFrequency: "always" (≥80%), "sometimes" (1-79%), "never" (0%)
  4. Problem Detection: AI analyzes up to 10 most negative reviews and categorizes problems: food quality, service, delivery, cleanliness, pricing, wait times, order accuracy
Technical Highlight: The review scraper handles three different SerpAPI response formats for owner replies (response.snippet, response.extracted_snippet.original, owner_response.text), providing resilience against API format changes. The relative date parser handles irregular English patterns ("a week ago" vs "2 weeks ago") with estimated date calculation.
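
A small sketch of the relative-date handling and response-frequency bucketing described above. The unit table (months approximated as 30 days) and the regex are illustrative, not copied from the project.

```typescript
const MS_PER: Record<string, number> = {
  day: 86_400_000,
  week: 7 * 86_400_000,
  month: 30 * 86_400_000,   // approximation used for estimated absolute dates
  year: 365 * 86_400_000,
};

function parseRelativeDate(text: string, now: Date = new Date()): Date | null {
  // Handles "3 days ago", "a week ago", "2 months ago"
  const match = text.trim().match(/^(a|an|\d+)\s+(day|week|month|year)s?\s+ago$/i);
  if (!match) return null;
  const count = /^\d+$/.test(match[1]) ? Number(match[1]) : 1;
  return new Date(now.getTime() - count * MS_PER[match[2].toLowerCase()]);
}

function responseFrequency(responseRate: number): "always" | "sometimes" | "never" {
  if (responseRate >= 0.8) return "always";   // >= 80% of reviews answered by the owner
  if (responseRate > 0) return "sometimes";
  return "never";
}
```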

Business Type Classification (16 Categories)

OpenAI GPT-4o-mini • Regex Pre-Classification • Google Place Type Mapping • 16-Category Taxonomy

A hybrid classification system that combines regex-based pre-screening with AI-powered deep classification. The regex classifier provides a fast initial guess, while GPT-4o-mini analyzes website content, Google Place types, and service indicators for a definitive classification across 16 business categories.

Classification Pipeline:

  1. Fast Regex Pass: Pattern matching on website text for keywords (bakery, coffee, espresso, menu, entree, frozen yogurt, gelato) with confidence 0.3-0.7
  2. Google Type Mapping: Maps Google Place types (restaurant, cafe, bakery, bar, etc.) to the internal taxonomy
  3. AI Deep Classification: GPT-4o-mini receives website content + Google types + service indicators and selects from 16 categories:
    • restaurant, fast_food, fast_casual, cafe, bakery, bar, pizzeria
    • food_truck, catering, deli, dessert_shop, breakfast_spot
    • juice_bar, buffet, food_hall, ghost_kitchen, other
  4. Validation: AI output is validated against the enum list; invalid types are rejected and defaulted
Technical Highlight: The hybrid approach uses cheap regex for obvious cases (bakery with "bakery" in the text) and reserves AI calls for ambiguous businesses. The 16-category taxonomy is defined as a TypeScript enum with human-readable labels, ensuring type safety throughout the codebase. Google type mapping serves as a bridge between Google's proprietary categories and the internal taxonomy.
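
A sketch of keeping the taxonomy type-safe and validating model output against it; the constant and function names are illustrative.

```typescript
const BUSINESS_TYPES = [
  "restaurant", "fast_food", "fast_casual", "cafe", "bakery", "bar", "pizzeria",
  "food_truck", "catering", "deli", "dessert_shop", "breakfast_spot",
  "juice_bar", "buffet", "food_hall", "ghost_kitchen", "other",
] as const;

type BusinessType = (typeof BUSINESS_TYPES)[number];

function validateBusinessType(raw: string): BusinessType {
  // Reject anything outside the enum list and fall back to a safe default
  const normalized = raw.trim().toLowerCase();
  return (BUSINESS_TYPES as readonly string[]).includes(normalized)
    ? (normalized as BusinessType)
    : "other";
}
```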

Multi-Signal Ordering Vendor Detection

JSON Rules Engine • Domain Matching • URL Regex • Script Host Detection • Weighted Scoring

A configurable rules-based detection engine that identifies which ordering platform a restaurant uses. The system follows ordering CTAs, navigates to order pages, and applies a multi-signal scoring algorithm across four detection dimensions.

Detection Signals & Weights:

  1. CTA Discovery: Scans for links matching order keywords (order online, start order, pickup, delivery, order now)
  2. Link Following: Playwright navigates to the top 2 ordering links, capturing final URLs after redirects
  3. Domain Match (+5 pts): Checks if the order page domain matches known vendor domains (owner.com, popmenu.com, spothopper.com, etc.)
  4. URL Regex (+4 pts): Pattern matching on the full URL for vendor-specific patterns
  5. Script Host Detection (+3 pts): Collects all <script src> hostnames and matches against vendor script domains
  6. Text Pattern (+2 pts): Scans page content for strings like "powered by Owner" or "made with Popmenu"
  7. POS Hints: Each ordering vendor maps to likely POS systems (e.g., Toast ordering → Toast POS)
Technical Highlight: The rules engine is externalized in ordering.rules.json, making it trivially extensible without code changes. The weighted scoring system (domain: 5, URL: 4, script: 3, text: 2) prioritizes hard evidence (domain match) over soft signals (text mention). The versioned rules file enables A/B testing different detection strategies. Playwright's script host collection catches integrations that aren't visible in the HTML source.
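
An illustrative scoring pass over one candidate order page. The weights match the description above; the rule shape and helper names are assumptions about what ordering.rules.json might contain.

```typescript
interface VendorRule {
  vendor: string;
  domains: string[];        // e.g. ["owner.com", "popmenu.com"]
  urlPatterns: string[];    // regex sources applied to the final URL
  scriptHosts: string[];    // hostnames of known vendor scripts
  textPatterns: string[];   // e.g. "powered by Owner"
}

interface PageSignals {
  finalUrl: string;         // URL captured after following redirects
  scriptHosts: string[];    // collected <script src> hostnames
  visibleText: string;
}

function scoreVendor(rule: VendorRule, page: PageSignals): number {
  let score = 0;
  const host = new URL(page.finalUrl).hostname;
  const text = page.visibleText.toLowerCase();

  if (rule.domains.some(d => host === d || host.endsWith(`.${d}`))) score += 5;        // domain match
  if (rule.urlPatterns.some(p => new RegExp(p, "i").test(page.finalUrl))) score += 4;  // URL regex
  if (rule.scriptHosts.some(s => page.scriptHosts.includes(s))) score += 3;            // script host
  if (rule.textPatterns.some(t => text.includes(t.toLowerCase()))) score += 2;         // text mention

  return score; // the highest-scoring vendor across all rules wins
}
```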

Crawling & Data Pipelines

Zipcode Discovery Pipeline with Session Management

Google Places Nearby Search • Quadrant-Based Search • Session State • Batch Processing • Scaffold Entries

A session-based discovery pipeline that finds all restaurants in a zipcode, creates lightweight scaffold database entries, and processes them in resumable batches with configurable concurrency.

Discovery Flow:

  1. Geocoding: Converts zipcode to lat/lng coordinates via Google Geocoding API
  2. Multi-Strategy Search: Searches with 6 place types (restaurant, cafe, bakery, bar, meal_takeaway, meal_delivery), plus an optional set of 16 food keywords in comprehensive mode
  3. Quadrant Division: For large search radii, the area is divided into quadrants so coverage isn't capped by the API's per-request result limits
  4. Deduplication: Merges results by google_place_id across all search strategies
  5. Scaffold Creation: Pre-creates minimal database entries (name, place_id, zipcode) for all discovered restaurants
  6. Batch Processing: POST /api/zipcode/:sessionId/next processes configurable batches with chain pre-screening, full crawl, and AI summary generation
  7. Session Cleanup: Sessions auto-expire after 1 hour; can be manually deleted
Technical Highlight: The scaffold pattern prevents duplicate API calls when re-running discovery on the same zipcode. Rate limiting (200ms between Google Places requests, 2-second pagination delays) prevents API throttling. Concurrency is controlled via p-limit (default 3 parallel operations) to balance speed and API rate limits.
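
A sketch of one batch step with p-limit, assuming hypothetical loadScaffolds, isObviousChain, and crawlAndSummarize helpers; the default concurrency of 3 matches the description above.

```typescript
import pLimit from "p-limit";

interface Scaffold { placeId: string; name: string; zipcode: string; }

// Hypothetical stand-ins for the session store, chain pre-screener, and crawl pipeline.
declare function loadScaffolds(sessionId: string, batchSize: number): Promise<Scaffold[]>;
declare function isObviousChain(scaffold: Scaffold): Promise<boolean>;
declare function crawlAndSummarize(scaffold: Scaffold): Promise<unknown>;

async function processNextBatch(sessionId: string, batchSize = 10, concurrency = 3) {
  const limit = pLimit(concurrency);
  const scaffolds = await loadScaffolds(sessionId, batchSize); // pending scaffold entries

  // Each scaffold runs through pre-screen + crawl, at most `concurrency` at a time
  return Promise.allSettled(
    scaffolds.map(scaffold =>
      limit(async () => {
        if (await isObviousChain(scaffold)) {
          return { placeId: scaffold.placeId, skipped: true };
        }
        return crawlAndSummarize(scaffold);
      })
    )
  );
}
```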

Confidence-Weighted Evidence System

Typed Field System • Evidence Tracking • Source Attribution • JSONB Storage

Every data field in the system carries a confidence score (0-1) and an array of evidence objects with source URLs, descriptive notes, and optional source snippets. This creates a fully auditable data provenance chain from raw source to final value.

Field Structure:

  1. Value: The extracted data (text, number, or JSON)
  2. Normalized Choice: For enum-like fields, the canonical value (e.g., "restaurant", "yes", "toast")
  3. Confidence: 0.0-1.0 score indicating reliability of the extraction
  4. Evidence Array: Each entry contains:
    • url: Source URL where data was found
    • note: Human-readable description of how the data was extracted
    • snippet: Optional raw text excerpt from the source
  5. Database Storage: Stored in crawl_fields table with value_text, value_num, value_json columns and JSONB evidence column
Technical Highlight: The evidence system enables downstream consumers to audit any data point back to its source. The unique constraint (crawl_run_id, field_key) prevents duplicate fields per crawl. The confidence score drives automated decisions throughout the pipeline (revenue web search trigger at 0.6, owner early exit at 0.8, chain classification at 0.7).
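
A sketch of the field shape implied by the description; property names mirror the columns listed above, but the project's actual type definitions may differ.

```typescript
interface Evidence {
  url: string;        // source URL where the data was found
  note: string;       // human-readable description of the extraction
  snippet?: string;   // optional raw excerpt from the source
}

interface CrawlField {
  fieldKey: string;            // unique per (crawl_run_id, field_key)
  valueText?: string;
  valueNum?: number;
  valueJson?: unknown;
  normalizedChoice?: string;   // canonical value for enum-like fields
  confidence: number;          // 0.0-1.0
  evidence: Evidence[];        // persisted as a JSONB column
}
```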

Technical Architecture

Modular Pipeline Architecture

The application is organized into specialized modules that compose into configurable pipelines. Each module is independently testable and replaceable, with clear input/output contracts.

Providers (Data Sources)

  • Google Places API — Business data, ratings, reviews, geocoding, nearby search with pagination
  • SerpAPI — Google Maps review scraping with date parsing and owner response detection
  • OpenAI (GPT-4o-mini) — 6 distinct AI analysis functions with structured JSON responses
  • Google Search (Playwright) — Headless browser SERP scraping for market research
  • Bing Search — Fallback search provider when Google returns no results

Analyzers (AI Intelligence)

  • revenuePredictor — Two-stage revenue estimation with adaptive web search
  • reviewAnalyzer — Negative review problem categorization across 7 dimensions
  • chainDetector — Full chain/franchise affiliation analysis
  • chainPreScreener — Quick large-chain identification for skip optimization
  • businessTypeClassifier — 16-category classification with hybrid regex + AI
  • businessSummary — Orchestrates parallel execution of all analyzers
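
An illustrative way the summary step could orchestrate the analyzers in parallel with Promise.allSettled, so one failing analyzer does not sink the whole summary; the input shape and analyzer names here are placeholders, not the module's real exports.

```typescript
interface CrawlInput { businessId: string; websiteText: string; reviews: string[]; }

// Placeholder analyzer signatures for the sketch.
declare function predictRevenue(input: CrawlInput): Promise<unknown>;
declare function analyzeReviews(input: CrawlInput): Promise<unknown>;
declare function detectChain(input: CrawlInput): Promise<unknown>;
declare function classifyBusinessType(input: CrawlInput): Promise<unknown>;

async function buildSummary(input: CrawlInput) {
  const [revenue, reviews, chain, businessType] = await Promise.allSettled([
    predictRevenue(input),
    analyzeReviews(input),
    detectChain(input),
    classifyBusinessType(input),
  ]);

  // Keep whatever succeeded; failed analyzers simply contribute null
  const value = (r: PromiseSettledResult<unknown>) => (r.status === "fulfilled" ? r.value : null);
  return {
    revenue: value(revenue),
    reviewProblems: value(reviews),
    chain: value(chain),
    businessType: value(businessType),
  };
}
```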

Detectors (Pattern Matching)

  • ownerDetection — Multi-page crawling with consensus scoring
  • orderingVendor — Rules-based multi-signal vendor identification
  • classifyBusiness — Fast regex-based pre-classification
  • googleTypeMapping — Google Place types to internal taxonomy bridge

Crawlers & Scrapers

  • playwrightCrawler — Headless Chromium website crawling with DOM parsing, link following, and script host detection
  • serpApiReviews — Paginated Google Maps review scraping with multi-format response handling
  • googleReviewsScraper — Legacy Playwright-based review scraper

Database Layer (PostgreSQL)

  • 7 Tables — businesses, crawl_runs, crawl_fields, crawl_pages, review_metrics, social_links, ordering_detections
  • UUID Primary Keys — Globally unique identifiers across all tables
  • JSONB Columns — Flexible storage for evidence, artifacts, and review data
  • Transactional Writes — All crawl data written in a single transaction with rollback on error
  • Upsert Patterns — ON CONFLICT handling for review_metrics, social_links, and ordering_detections
  • 6 Migrations — Incremental schema evolution from init through social links
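
A sketch of the transactional write and ON CONFLICT upsert pattern using node-postgres. The crawl_fields columns follow the schema described above; the exact SQL and the CrawlField shape are illustrative.

```typescript
import { Pool } from "pg";

interface CrawlField {
  fieldKey: string;
  valueText?: string;
  valueNum?: number;
  valueJson?: unknown;
  confidence: number;
  evidence: Array<{ url: string; note: string; snippet?: string }>;
}

const pool = new Pool(); // connection settings read from PG* environment variables

async function saveCrawlFields(crawlRunId: string, fields: CrawlField[]): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const f of fields) {
      await client.query(
        `INSERT INTO crawl_fields
           (crawl_run_id, field_key, value_text, value_num, value_json, confidence, evidence)
         VALUES ($1, $2, $3, $4, $5, $6, $7)
         ON CONFLICT (crawl_run_id, field_key) DO UPDATE
           SET value_text = EXCLUDED.value_text,
               value_num  = EXCLUDED.value_num,
               value_json = EXCLUDED.value_json,
               confidence = EXCLUDED.confidence,
               evidence   = EXCLUDED.evidence`,
        [
          crawlRunId,
          f.fieldKey,
          f.valueText ?? null,
          f.valueNum ?? null,
          f.valueJson != null ? JSON.stringify(f.valueJson) : null,
          f.confidence,
          JSON.stringify(f.evidence),
        ]
      );
    }
    await client.query("COMMIT"); // all field writes land together
  } catch (err) {
    await client.query("ROLLBACK"); // undo everything on any error
    throw err;
  } finally {
    client.release();
  }
}
```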

API Layer (Express.js)

  • Leads API — Full CRUD with advanced filtering (rating, reviews, chain, vendor, business type), multi-field sorting, and pagination
  • Zipcode Discovery API — Session-based discovery with start/next/status/cancel lifecycle
  • Recrawl API — Batch re-crawling with configurable skip flags (reviews, owner detection)
  • Health Check — /health endpoint for uptime monitoring

Runners (Entry Points)

  • crawlOne — Single business crawl from CLI or API
  • crawlCsv — Batch CSV processing with p-limit concurrency control
  • discoverZipcode — Zipcode-based restaurant discovery with scaffold creation
  • detectOwner — Standalone owner detection runner

Skills Demonstrated

AI & LLM Integration

  • OpenAI GPT-4o-mini (6 distinct use cases)
  • Structured JSON prompt engineering
  • Adaptive confidence-gated pipelines
  • Multi-source consensus scoring
  • AI-driven web search enrichment
  • Review sentiment & problem analysis

Web Scraping & Automation

  • Playwright headless browser automation
  • DOM parsing & text extraction
  • Script host detection
  • CTA link following with redirect capture
  • Google SERP scraping
  • Rate limiting & pagination handling

API Integration

  • Google Places API (Nearby, Details, Geocoding)
  • SerpAPI (Google Maps Reviews)
  • Bing Search API
  • OpenAI Chat Completions API
  • Provider cascade pattern (fallbacks)

Database & Data Design

  • PostgreSQL schema design
  • UUID primary keys
  • JSONB for flexible evidence storage
  • Transactional writes with rollback
  • Upsert patterns (ON CONFLICT)
  • Incremental migrations

Architecture & Design

  • Modular pipeline composition
  • Confidence-weighted evidence system
  • Session-based discovery lifecycle
  • Rules engine (externalized JSON)
  • Multi-signal weighted scoring
  • Graceful degradation & fallbacks

TypeScript & Node.js

  • TypeScript 5.6 with strict typing
  • Express.js 5 REST API
  • p-limit concurrency control
  • CLI & API dual entry points
  • Comprehensive error handling
  • Structured logging system

Key Achievements

6 AI Analyzers

Revenue prediction, owner detection, chain detection, review analysis, business classification, pre-screening

Adaptive Pipeline

Confidence-gated web search that autonomously enriches low-confidence estimates

Consensus Scoring

Multi-source owner detection with confidence aggregation across pages

5 Data Providers

Google Places, SerpAPI, OpenAI, Google Search, Bing with cascade fallbacks

Rules Engine

Externalized JSON vendor detection with 4-signal weighted scoring

Evidence Provenance

Every data point carries confidence score and source evidence chain

Session Discovery

Resumable zipcode-based restaurant discovery with scaffold entries

Auto-Decline Chains

AI-detected chains are automatically removed from the lead pipeline

Technology Stack

TypeScript 5.6 • Node.js • OpenAI GPT-4o-mini • Playwright • PostgreSQL • Express.js 5 • Google Places API • SerpAPI • Bing Search API • p-limit • Axios • JSONB • UUID • REST API • ESLint + Prettier • dotenv