
Lead Web Scraper

AI-Powered Business Intelligence & Lead Generation
TYPESCRIPT • OPENAI GPT-4o-mini • PLAYWRIGHT • POSTGRESQL

Project Overview

AI-Powered B2B Lead Generation & Enrichment Platform

A sophisticated business intelligence system that discovers, crawls, and analyzes restaurant businesses using a combination of headless browser automation (Playwright), multi-provider API aggregation (Google Places, SerpAPI), and six distinct AI-powered analyzers (OpenAI GPT-4o-mini). The platform generates comprehensive lead profiles including AI-estimated revenue breakdowns, owner identification, chain affiliation detection, review problem analysis, and ordering system detection.

AI-Powered Capabilities:

  • Owner Name Extraction — AI crawls about/team/founder pages, extracts owner names with confidence scoring and multi-source consensus aggregation
  • Two-Stage Revenue Prediction — Initial AI estimate with adaptive web search enrichment when confidence falls below 0.6, breaking revenue into dine-in, takeout, and delivery channels
  • Chain Affiliation Detection — AI determines if a business is independent or part of a franchise, identifies parent companies, and estimates location counts
  • Business Type Classification — AI classifies businesses into 16 categories using website content, Google Place types, and service indicators
  • Review Problem Analysis — AI analyzes negative reviews across 7 problem categories (food quality, service, delivery, cleanliness, pricing, wait times, order accuracy)
  • Ordering Vendor Detection — Multi-signal scoring engine using domain matching, URL regex, text patterns, and script host detection

6 AI analyzers • 5 data providers • 7 database tables • 10+ REST endpoints • Session-based discovery pipeline

AI Analyzers & Detectors

Two-Stage Revenue Prediction Pipeline

OpenAI GPT-4o-mini • Adaptive Web Search • Google Search (Playwright) • Bing Fallback • Multi-Channel Breakdown

A unique two-stage revenue estimation pipeline that adapts its data gathering based on confidence. The AI first estimates revenue from available business data, then autonomously decides whether to perform web searches for industry benchmarks to refine its prediction.

Pipeline Flow:

  1. Stage 1 — Initial AI Estimate: GPT-4o-mini analyzes the restaurant profile (name, category, rating, review count, price level, location) and produces a revenue breakdown by channel
  2. Confidence Gate: If the AI's confidence score falls below 0.6, the pipeline automatically triggers web research
  3. Web Research: Three targeted search queries are executed:
    • "{category} restaurant average monthly revenue"
    • "restaurant industry revenue per location {location}"
    • "{category} restaurant sales benchmarks"
  4. Search Provider Cascade: Google Search via headless Playwright first; if no results, falls back to Bing Search API
  5. Stage 2 — Enhanced Estimate: AI re-analyzes with the original data plus web search results for a refined prediction
  6. Multi-Channel Output: Final estimate breaks revenue into dine-in, takeout/pickup, and delivery channels with total monthly revenue
Technical Highlight: The adaptive confidence gate (threshold 0.6) prevents unnecessary API calls for straightforward businesses while ensuring data-poor leads get enriched with market research. The search provider cascade (Google → Bing) ensures resilience. The output includes a usedWebSearch flag for transparency. This two-stage pattern could be generalized to any AI estimation task requiring variable data quality.
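
A minimal sketch of the confidence-gated flow, assuming hypothetical helper names (askModelForRevenue, searchWithFallback) in place of the project's actual provider modules:

```typescript
// Sketch only: askModelForRevenue and searchWithFallback are illustrative stand-ins
// for the OpenAI and search providers described above.
interface RestaurantProfile {
  name: string;
  category: string;
  rating: number;
  reviewCount: number;
  priceLevel: number;
  location: string;
}

interface RevenueEstimate {
  dineIn: number;
  takeout: number;
  delivery: number;
  totalMonthly: number;
  confidence: number;      // 0-1, returned by the model
  usedWebSearch: boolean;  // transparency flag from the description above
}

declare function askModelForRevenue(
  profile: RestaurantProfile,
  webResults?: string[]
): Promise<Omit<RevenueEstimate, "usedWebSearch">>;
declare function searchWithFallback(queries: string[]): Promise<string[]>; // Google via Playwright, then Bing

const CONFIDENCE_THRESHOLD = 0.6;

export async function estimateRevenue(profile: RestaurantProfile): Promise<RevenueEstimate> {
  // Stage 1: initial estimate from the business profile alone
  const first = await askModelForRevenue(profile);
  if (first.confidence >= CONFIDENCE_THRESHOLD) {
    return { ...first, usedWebSearch: false };
  }

  // Confidence gate tripped: gather industry benchmarks via web search
  const queries = [
    `${profile.category} restaurant average monthly revenue`,
    `restaurant industry revenue per location ${profile.location}`,
    `${profile.category} restaurant sales benchmarks`,
  ];
  const results = await searchWithFallback(queries);

  // Stage 2: re-estimate with the original data plus search snippets
  const second = await askModelForRevenue(profile, results);
  return { ...second, usedWebSearch: true };
}
```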

Multi-Source Owner Detection with Consensus Scoring

OpenAI GPT-4o-mini • Playwright Page Crawling • Priority Page Ordering • Confidence Aggregation • Early Exit Optimization

An intelligent multi-page owner extraction system that discovers restaurant owners by crawling relevant pages, sending each to the AI for name extraction, and aggregating results across sources using a custom consensus scoring algorithm.

Detection Algorithm:

  1. Page Discovery: Scans the website for links containing keywords: about, story, team, founder, contact, leadership, staff
  2. Priority Ordering: Pages are ranked by relevance (about > story > team > founder > contact) for optimal processing order
  3. Batch Processing: Pages are loaded 2 at a time (max 10 total) via Playwright, with visible text extracted (max 8000 chars per page)
  4. AI Extraction: Each page's content is sent to GPT-4o-mini with a prompt instructing it to identify ownership roles only (not managers or staff)
  5. Structured Response: AI returns JSON with owner_name, confidence (0-1), reasoning, and source_snippet for evidence
  6. Early Exit: If any single extraction returns confidence ≥ 0.8, processing stops immediately to save API calls
  7. Consensus Aggregation: All extractions are grouped by owner name. Final score = sum(confidences) × number of sources, with a multi-source boost of up to +0.15 for names confirmed across multiple pages
  8. Best Match: The owner name with the highest aggregated score is selected, with evidence from all confirming pages
Technical Highlight: The consensus scoring algorithm (confidence × sources + multi-source boost) rewards consistency across pages rather than a single high-confidence hit. This dramatically reduces false positives from AI hallucination. The early exit optimization at 0.8 confidence avoids unnecessary crawling on pages where the owner is clearly identified. Multiple owners are separated by " & " in the output.
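
A minimal sketch of the consensus scoring described above. The extraction shape follows the description; the per-extra-page boost of 0.05 (capped at +0.15) is an assumed implementation detail.

```typescript
interface OwnerExtraction {
  ownerName: string;
  confidence: number;     // 0-1 from a single page
  sourceUrl: string;
  sourceSnippet: string;
}

function pickBestOwner(extractions: OwnerExtraction[]) {
  // Group extractions by normalized owner name
  const groups = new Map<string, OwnerExtraction[]>();
  for (const e of extractions) {
    const key = e.ownerName.trim().toLowerCase();
    groups.set(key, [...(groups.get(key) ?? []), e]);
  }

  let best: { name: string; score: number; evidence: OwnerExtraction[] } | null = null;
  for (const group of groups.values()) {
    const sumConfidence = group.reduce((sum, e) => sum + e.confidence, 0);
    // Multi-source boost, capped at +0.15, for names confirmed on several pages
    const boost = Math.min(0.15, 0.05 * (group.length - 1));
    const score = sumConfidence * group.length + boost;
    if (!best || score > best.score) {
      best = { name: group[0].ownerName, score, evidence: group };
    }
  }
  return best; // null if no extractions were produced
}
```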

Chain Affiliation Detection & Pre-Screening

OpenAI GPT-4o-mini • Two-Phase Detection • Auto-Decline Pipeline • Location Count Estimation

A two-phase chain detection system that first pre-screens businesses to identify obvious chains (saving processing time), then performs deep AI analysis for borderline cases. Chains are automatically declined from the lead pipeline.

Two-Phase Flow:

  1. Phase 1 — Pre-Screening: chainPreScreener sends a quick AI query to classify the business as a large chain, small chain, or independent
  2. Skip Logic: If pre-screening identifies a large chain with high confidence, the full crawl is skipped entirely
  3. Phase 2 — Deep Analysis: For non-obvious cases, chainDetector performs full analysis considering website structure, location count, and business name patterns
  4. Output: Returns is_chain (boolean), group_name (parent company), confidence score, and evidence text
  5. Auto-Decline: When summary.isChainOrGroup.value === true, the lead status is automatically set to "declined" in the database
Technical Highlight: The pre-screening phase prevents expensive full crawls on obvious chains like McDonald's or Starbucks. The confidence threshold (< 0.7 treated as independent) is tuned to avoid false chain detections on independent restaurants that happen to have multiple locations. The auto-decline pipeline keeps the lead database clean without manual review of chain restaurants.
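
Illustrative control flow for the two-phase detection and auto-decline. The function names (preScreenChain, analyzeChain, markLeadDeclined) and the 0.9 pre-screen threshold are assumptions; the 0.7 deep-analysis threshold matches the description above.

```typescript
interface BusinessRecord { id: string; name: string; website?: string; }

// Hypothetical stand-ins for chainPreScreener, chainDetector, and the lead-status update.
declare function preScreenChain(name: string, website?: string):
  Promise<{ classification: "large_chain" | "small_chain" | "independent"; confidence: number }>;
declare function analyzeChain(business: BusinessRecord):
  Promise<{ is_chain: boolean; group_name?: string; confidence: number; evidence: string }>;
declare function markLeadDeclined(id: string, reason: string): Promise<void>;

async function screenForChain(business: BusinessRecord): Promise<void> {
  // Phase 1: cheap pre-screen; obvious large chains skip the full crawl entirely
  const pre = await preScreenChain(business.name, business.website);
  if (pre.classification === "large_chain" && pre.confidence >= 0.9) {
    await markLeadDeclined(business.id, "pre-screened as large chain");
    return;
  }

  // Phase 2: deep analysis for borderline cases;
  // below 0.7 confidence the business is treated as independent
  const result = await analyzeChain(business);
  if (result.is_chain && result.confidence >= 0.7) {
    await markLeadDeclined(business.id, `chain: ${result.group_name ?? "unknown group"}`);
  }
}
```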

AI Review Problem Analysis

OpenAI GPT-4o-mini • SerpAPI Review Scraping • 7 Problem Categories • Response Rate Calculation

Combines SerpAPI-powered review scraping with AI analysis to identify recurring business problems from customer reviews. The review scraper handles multiple response formats and calculates owner engagement metrics, while the AI categorizes problems across 7 dimensions.

Analysis Pipeline:

  1. Review Scraping: SerpAPI fetches up to 30 recent Google reviews with pagination (4 pages, 800ms delay between requests)
  2. Date Parsing: Handles three date formats: Unix timestamps, ISO dates, and relative strings ("3 days ago", "a week ago", "2 months ago") with estimated absolute date calculation
  3. Metrics Calculation:
    • reviewsLast4Weeks: Reviews in past 28 days
    • reviewsLast4Months: Reviews in past 4 months
    • reviewsPerWeek: Calculated from date range of scraped reviews
    • responseRate: 0-1 decimal of reviews with owner responses
    • responseFrequency: "always" (≥80%), "sometimes" (1-79%), "never" (0%)
  4. Problem Detection: AI analyzes up to 10 most negative reviews and categorizes problems: food quality, service, delivery, cleanliness, pricing, wait times, order accuracy
Technical Highlight: The review scraper handles three different SerpAPI response formats for owner replies (response.snippet, response.extracted_snippet.original, owner_response.text), providing resilience against API format changes. The relative date parser handles irregular English patterns ("a week ago" vs "2 weeks ago") with estimated date calculation.
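
A small sketch of the relative-date handling and response-frequency bucketing described above. The unit table (months approximated as 30 days) and the regex are illustrative, not copied from the project.

```typescript
const MS_PER: Record<string, number> = {
  day: 86_400_000,
  week: 7 * 86_400_000,
  month: 30 * 86_400_000,   // approximation used for estimated absolute dates
  year: 365 * 86_400_000,
};

function parseRelativeDate(text: string, now: Date = new Date()): Date | null {
  // Handles "3 days ago", "a week ago", "2 months ago"
  const match = text.trim().match(/^(a|an|\d+)\s+(day|week|month|year)s?\s+ago$/i);
  if (!match) return null;
  const count = /^\d+$/.test(match[1]) ? Number(match[1]) : 1;
  return new Date(now.getTime() - count * MS_PER[match[2].toLowerCase()]);
}

function responseFrequency(responseRate: number): "always" | "sometimes" | "never" {
  if (responseRate >= 0.8) return "always";   // >= 80% of reviews answered by the owner
  if (responseRate > 0) return "sometimes";
  return "never";
}
```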

Business Type Classification (16 Categories)

OpenAI GPT-4o-mini • Regex Pre-Classification • Google Place Type Mapping • 16-Category Taxonomy

A hybrid classification system that combines regex-based pre-screening with AI-powered deep classification. The regex classifier provides a fast initial guess, while GPT-4o-mini analyzes website content, Google Place types, and service indicators for a definitive classification across 16 business categories.

Classification Pipeline:

  1. Fast Regex Pass: Pattern matching on website text for keywords (bakery, coffee, espresso, menu, entree, frozen yogurt, gelato) with confidence 0.3-0.7
  2. Google Type Mapping: Maps Google Place types (restaurant, cafe, bakery, bar, etc.) to the internal taxonomy
  3. AI Deep Classification: GPT-4o-mini receives website content + Google types + service indicators and selects from 16 categories:
    • restaurant, fast_food, fast_casual, cafe, bakery, bar, pizzeria
    • food_truck, catering, deli, dessert_shop, breakfast_spot
    • juice_bar, buffet, food_hall, ghost_kitchen, other
  4. Validation: AI output is validated against the enum list; invalid types are rejected and defaulted
Technical Highlight: The hybrid approach uses cheap regex for obvious cases (bakery with "bakery" in the text) and reserves AI calls for ambiguous businesses. The 16-category taxonomy is defined as a TypeScript enum with human-readable labels, ensuring type safety throughout the codebase. Google type mapping serves as a bridge between Google's proprietary categories and the internal taxonomy.
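
A sketch of keeping the taxonomy type-safe and validating model output against it; the constant and function names are illustrative.

```typescript
const BUSINESS_TYPES = [
  "restaurant", "fast_food", "fast_casual", "cafe", "bakery", "bar", "pizzeria",
  "food_truck", "catering", "deli", "dessert_shop", "breakfast_spot",
  "juice_bar", "buffet", "food_hall", "ghost_kitchen", "other",
] as const;

type BusinessType = (typeof BUSINESS_TYPES)[number];

function validateBusinessType(raw: string): BusinessType {
  // Reject anything outside the enum list and fall back to a safe default
  const normalized = raw.trim().toLowerCase();
  return (BUSINESS_TYPES as readonly string[]).includes(normalized)
    ? (normalized as BusinessType)
    : "other";
}
```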

Multi-Signal Ordering Vendor Detection

JSON Rules Engine • Domain Matching • URL Regex • Script Host Detection • Weighted Scoring

A configurable rules-based detection engine that identifies which ordering platform a restaurant uses. The system follows ordering CTAs, navigates to order pages, and applies a multi-signal scoring algorithm across four detection dimensions.

Detection Signals & Weights:

  1. CTA Discovery: Scans for links matching order keywords (order online, start order, pickup, delivery, order now)
  2. Link Following: Playwright navigates to the top 2 ordering links, capturing final URLs after redirects
  3. Domain Match (+5 pts): Checks if the order page domain matches known vendor domains (owner.com, popmenu.com, spothopper.com, etc.)
  4. URL Regex (+4 pts): Pattern matching on the full URL for vendor-specific patterns
  5. Script Host Detection (+3 pts): Collects all <script src> hostnames and matches against vendor script domains
  6. Text Pattern (+2 pts): Scans page content for strings like "powered by Owner" or "made with Popmenu"
  7. POS Hints: Each ordering vendor maps to likely POS systems (e.g., Toast ordering → Toast POS)
Technical Highlight: The rules engine is externalized in ordering.rules.json, making it trivially extensible without code changes. The weighted scoring system (domain: 5, URL: 4, script: 3, text: 2) prioritizes hard evidence (domain match) over soft signals (text mention). The versioned rules file enables A/B testing different detection strategies. Playwright's script host collection catches integrations that aren't visible in the HTML source.
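
An illustrative scoring pass over one candidate order page. The weights match the description above; the rule shape and helper names are assumptions about what ordering.rules.json might contain.

```typescript
interface VendorRule {
  vendor: string;
  domains: string[];        // e.g. ["owner.com", "popmenu.com"]
  urlPatterns: string[];    // regex sources applied to the final URL
  scriptHosts: string[];    // hostnames of known vendor scripts
  textPatterns: string[];   // e.g. "powered by Owner"
}

interface PageSignals {
  finalUrl: string;         // URL captured after following redirects
  scriptHosts: string[];    // collected <script src> hostnames
  visibleText: string;
}

function scoreVendor(rule: VendorRule, page: PageSignals): number {
  let score = 0;
  const host = new URL(page.finalUrl).hostname;
  const text = page.visibleText.toLowerCase();

  if (rule.domains.some(d => host === d || host.endsWith(`.${d}`))) score += 5;        // domain match
  if (rule.urlPatterns.some(p => new RegExp(p, "i").test(page.finalUrl))) score += 4;  // URL regex
  if (rule.scriptHosts.some(s => page.scriptHosts.includes(s))) score += 3;            // script host
  if (rule.textPatterns.some(t => text.includes(t.toLowerCase()))) score += 2;         // text mention

  return score; // the highest-scoring vendor across all rules wins
}
```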

Crawling & Data Pipelines

Zipcode Discovery Pipeline with Session Management

Google Places Nearby Search • Quadrant-Based Search • Session State • Batch Processing • Scaffold Entries

A session-based discovery pipeline that finds all restaurants in a zipcode, creates lightweight scaffold database entries, and processes them in resumable batches with configurable concurrency.

Discovery Flow:

  1. Geocoding: Converts zipcode to lat/lng coordinates via Google Geocoding API
  2. Multi-Strategy Search: Searches with 6 place types (restaurant, cafe, bakery, bar, meal_takeaway, meal_delivery), plus an optional set of 16 food keywords in comprehensive mode
  3. Quadrant Division: For large search radii, the area is divided into quadrants so coverage isn't capped by the API's per-request result limits
  4. Deduplication: Merges results by google_place_id across all search strategies
  5. Scaffold Creation: Pre-creates minimal database entries (name, place_id, zipcode) for all discovered restaurants
  6. Batch Processing: POST /api/zipcode/:sessionId/next processes configurable batches with chain pre-screening, full crawl, and AI summary generation
  7. Session Cleanup: Sessions auto-expire after 1 hour; can be manually deleted
Technical Highlight: The scaffold pattern prevents duplicate API calls when re-running discovery on the same zipcode. Rate limiting (200ms between Google Places requests, 2-second pagination delays) prevents API throttling. Concurrency is controlled via p-limit (default 3 parallel operations) to balance speed and API rate limits.
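
A sketch of one batch step with p-limit, assuming hypothetical loadScaffolds, isObviousChain, and crawlAndSummarize helpers; the default concurrency of 3 matches the description above.

```typescript
import pLimit from "p-limit";

interface Scaffold { placeId: string; name: string; zipcode: string; }

// Hypothetical stand-ins for the session store, chain pre-screener, and crawl pipeline.
declare function loadScaffolds(sessionId: string, batchSize: number): Promise<Scaffold[]>;
declare function isObviousChain(scaffold: Scaffold): Promise<boolean>;
declare function crawlAndSummarize(scaffold: Scaffold): Promise<unknown>;

async function processNextBatch(sessionId: string, batchSize = 10, concurrency = 3) {
  const limit = pLimit(concurrency);
  const scaffolds = await loadScaffolds(sessionId, batchSize); // pending scaffold entries

  // Each scaffold runs through pre-screen + crawl, at most `concurrency` at a time
  return Promise.allSettled(
    scaffolds.map(scaffold =>
      limit(async () => {
        if (await isObviousChain(scaffold)) {
          return { placeId: scaffold.placeId, skipped: true };
        }
        return crawlAndSummarize(scaffold);
      })
    )
  );
}
```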

Confidence-Weighted Evidence System

Typed Field System • Evidence Tracking • Source Attribution • JSONB Storage

Every data field in the system carries a confidence score (0-1) and an array of evidence objects with source URLs, descriptive notes, and optional source snippets. This creates a fully auditable data provenance chain from raw source to final value.

Field Structure:

  1. Value: The extracted data (text, number, or JSON)
  2. Normalized Choice: For enum-like fields, the canonical value (e.g., "restaurant", "yes", "toast")
  3. Confidence: 0.0-1.0 score indicating reliability of the extraction
  4. Evidence Array: Each entry contains:
    • url: Source URL where data was found
    • note: Human-readable description of how the data was extracted
    • snippet: Optional raw text excerpt from the source
  5. Database Storage: Stored in crawl_fields table with value_text, value_num, value_json columns and JSONB evidence column
Technical Highlight: The evidence system enables downstream consumers to audit any data point back to its source. The unique constraint (crawl_run_id, field_key) prevents duplicate fields per crawl. The confidence score drives automated decisions throughout the pipeline (revenue web search trigger at 0.6, owner early exit at 0.8, chain classification at 0.7).
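
A sketch of the field shape implied by the description; property names mirror the columns listed above, but the project's actual type definitions may differ.

```typescript
interface Evidence {
  url: string;        // source URL where the data was found
  note: string;       // human-readable description of the extraction
  snippet?: string;   // optional raw excerpt from the source
}

interface CrawlField {
  fieldKey: string;            // unique per (crawl_run_id, field_key)
  valueText?: string;
  valueNum?: number;
  valueJson?: unknown;
  normalizedChoice?: string;   // canonical value for enum-like fields
  confidence: number;          // 0.0-1.0
  evidence: Evidence[];        // persisted as a JSONB column
}
```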

Technical Architecture

Modular Pipeline Architecture

The application is organized into specialized modules that compose into configurable pipelines. Each module is independently testable and replaceable, with clear input/output contracts.

Providers (Data Sources)

  • Google Places API — Business data, ratings, reviews, geocoding, nearby search with pagination
  • SerpAPI — Google Maps review scraping with date parsing and owner response detection
  • OpenAI (GPT-4o-mini) — 6 distinct AI analysis functions with structured JSON responses
  • Google Search (Playwright) — Headless browser SERP scraping for market research
  • Bing Search — Fallback search provider when Google returns no results

Analyzers (AI Intelligence)

  • revenuePredictor — Two-stage revenue estimation with adaptive web search
  • reviewAnalyzer — Negative review problem categorization across 7 dimensions
  • chainDetector — Full chain/franchise affiliation analysis
  • chainPreScreener — Quick large-chain identification for skip optimization
  • businessTypeClassifier — 16-category classification with hybrid regex + AI
  • businessSummary — Orchestrates parallel execution of all analyzers
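
An illustrative way the summary step could orchestrate the analyzers in parallel with Promise.allSettled, so one failing analyzer does not sink the whole summary; the input shape and analyzer names here are placeholders, not the module's real exports.

```typescript
interface CrawlInput { businessId: string; websiteText: string; reviews: string[]; }

// Placeholder analyzer signatures for the sketch.
declare function predictRevenue(input: CrawlInput): Promise<unknown>;
declare function analyzeReviews(input: CrawlInput): Promise<unknown>;
declare function detectChain(input: CrawlInput): Promise<unknown>;
declare function classifyBusinessType(input: CrawlInput): Promise<unknown>;

async function buildSummary(input: CrawlInput) {
  const [revenue, reviews, chain, businessType] = await Promise.allSettled([
    predictRevenue(input),
    analyzeReviews(input),
    detectChain(input),
    classifyBusinessType(input),
  ]);

  // Keep whatever succeeded; failed analyzers simply contribute null
  const value = (r: PromiseSettledResult<unknown>) => (r.status === "fulfilled" ? r.value : null);
  return {
    revenue: value(revenue),
    reviewProblems: value(reviews),
    chain: value(chain),
    businessType: value(businessType),
  };
}
```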

Detectors (Pattern Matching)

  • ownerDetection — Multi-page crawling with consensus scoring
  • orderingVendor — Rules-based multi-signal vendor identification
  • classifyBusiness — Fast regex-based pre-classification
  • googleTypeMapping — Google Place types to internal taxonomy bridge

Crawlers & Scrapers

  • playwrightCrawler — Headless Chromium website crawling with DOM parsing, link following, and script host detection
  • serpApiReviews — Paginated Google Maps review scraping with multi-format response handling
  • googleReviewsScraper — Legacy Playwright-based review scraper

Database Layer (PostgreSQL)

  • 7 Tables — businesses, crawl_runs, crawl_fields, crawl_pages, review_metrics, social_links, ordering_detections
  • UUID Primary Keys — Globally unique identifiers across all tables
  • JSONB Columns — Flexible storage for evidence, artifacts, and review data
  • Transactional Writes — All crawl data written in a single transaction with rollback on error
  • Upsert Patterns — ON CONFLICT handling for review_metrics, social_links, and ordering_detections
  • 6 Migrations — Incremental schema evolution from init through social links
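
A sketch of the transactional write and ON CONFLICT upsert pattern using node-postgres. The crawl_fields columns follow the schema described above; the exact SQL and the CrawlField shape are illustrative.

```typescript
import { Pool } from "pg";

interface CrawlField {
  fieldKey: string;
  valueText?: string;
  valueNum?: number;
  valueJson?: unknown;
  confidence: number;
  evidence: Array<{ url: string; note: string; snippet?: string }>;
}

const pool = new Pool(); // connection settings read from PG* environment variables

async function saveCrawlFields(crawlRunId: string, fields: CrawlField[]): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const f of fields) {
      await client.query(
        `INSERT INTO crawl_fields
           (crawl_run_id, field_key, value_text, value_num, value_json, confidence, evidence)
         VALUES ($1, $2, $3, $4, $5, $6, $7)
         ON CONFLICT (crawl_run_id, field_key) DO UPDATE
           SET value_text = EXCLUDED.value_text,
               value_num  = EXCLUDED.value_num,
               value_json = EXCLUDED.value_json,
               confidence = EXCLUDED.confidence,
               evidence   = EXCLUDED.evidence`,
        [
          crawlRunId,
          f.fieldKey,
          f.valueText ?? null,
          f.valueNum ?? null,
          f.valueJson != null ? JSON.stringify(f.valueJson) : null,
          f.confidence,
          JSON.stringify(f.evidence),
        ]
      );
    }
    await client.query("COMMIT"); // all field writes land together
  } catch (err) {
    await client.query("ROLLBACK"); // undo everything on any error
    throw err;
  } finally {
    client.release();
  }
}
```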

API Layer (Express.js)

  • Leads API — Full CRUD with advanced filtering (rating, reviews, chain, vendor, business type), multi-field sorting, and pagination
  • Zipcode Discovery API — Session-based discovery with start/next/status/cancel lifecycle
  • Recrawl API — Batch re-crawling with configurable skip flags (reviews, owner detection)
  • Health Check — /health endpoint for uptime monitoring

Runners (Entry Points)

  • crawlOne — Single business crawl from CLI or API
  • crawlCsv — Batch CSV processing with p-limit concurrency control
  • discoverZipcode — Zipcode-based restaurant discovery with scaffold creation
  • detectOwner — Standalone owner detection runner

Skills Demonstrated

AI & LLM Integration

  • OpenAI GPT-4o-mini (6 distinct use cases)
  • Structured JSON prompt engineering
  • Adaptive confidence-gated pipelines
  • Multi-source consensus scoring
  • AI-driven web search enrichment
  • Review sentiment & problem analysis

Web Scraping & Automation

  • Playwright headless browser automation
  • DOM parsing & text extraction
  • Script host detection
  • CTA link following with redirect capture
  • Google SERP scraping
  • Rate limiting & pagination handling

API Integration

  • Google Places API (Nearby, Details, Geocoding)
  • SerpAPI (Google Maps Reviews)
  • Bing Search API
  • OpenAI Chat Completions API
  • Provider cascade pattern (fallbacks)

Database & Data Design

  • PostgreSQL schema design
  • UUID primary keys
  • JSONB for flexible evidence storage
  • Transactional writes with rollback
  • Upsert patterns (ON CONFLICT)
  • Incremental migrations

Architecture & Design

  • Modular pipeline composition
  • Confidence-weighted evidence system
  • Session-based discovery lifecycle
  • Rules engine (externalized JSON)
  • Multi-signal weighted scoring
  • Graceful degradation & fallbacks

TypeScript & Node.js

  • TypeScript 5.6 with strict typing
  • Express.js 5 REST API
  • p-limit concurrency control
  • CLI & API dual entry points
  • Comprehensive error handling
  • Structured logging system

Key Achievements

6 AI Analyzers

Revenue prediction, owner detection, chain detection, review analysis, business classification, pre-screening

Adaptive Pipeline

Confidence-gated web search that autonomously enriches low-confidence estimates

Consensus Scoring

Multi-source owner detection with confidence aggregation across pages

5 Data Providers

Google Places, SerpAPI, OpenAI, Google Search, Bing with cascade fallbacks

Rules Engine

Externalized JSON vendor detection with 4-signal weighted scoring

Evidence Provenance

Every data point carries confidence score and source evidence chain

Session Discovery

Resumable zipcode-based restaurant discovery with scaffold entries

Auto-Decline Chains

AI-detected chains are automatically removed from the lead pipeline

Technology Stack

TypeScript 5.6 • Node.js • OpenAI GPT-4o-mini • Playwright • PostgreSQL • Express.js 5 • Google Places API • SerpAPI • Bing Search API • p-limit • Axios • JSONB • UUID • REST API • ESLint + Prettier • dotenv