Project Overview
A sophisticated business intelligence system that discovers, crawls, and analyzes restaurant businesses using a combination of headless browser automation (Playwright), multi-provider API aggregation (Google Places, SerpAPI), and six distinct AI-powered analyzers (OpenAI GPT-4o-mini). The platform generates comprehensive lead profiles including AI-estimated revenue breakdowns, owner identification, chain affiliation detection, review problem analysis, and ordering system detection.
AI-Powered Capabilities:
- Owner Name Extraction — AI crawls about/team/founder pages, extracts owner names with confidence scoring and multi-source consensus aggregation
- Two-Stage Revenue Prediction — Initial AI estimate with adaptive web search enrichment when confidence falls below 0.6, breaking revenue into dine-in, takeout, and delivery channels
- Chain Affiliation Detection — AI determines if a business is independent or part of a franchise, identifies parent companies, and estimates location counts
- Business Type Classification — AI classifies businesses into 16 categories using website content, Google Place types, and service indicators
- Review Problem Analysis — AI analyzes negative reviews across 7 problem categories (food quality, service, delivery, cleanliness, pricing, wait times, order accuracy)
- Ordering Vendor Detection — Multi-signal scoring engine using domain matching, URL regex, text patterns, and script host detection
6 AI analyzers · 5 data providers · 7 database tables · 10+ REST endpoints · Session-based discovery pipeline
AI Analyzers & Detectors
Two-Stage Revenue Prediction Pipeline
A unique two-stage revenue estimation pipeline that adapts its data gathering based on confidence. The AI first estimates revenue from available business data, then autonomously decides whether to perform web searches for industry benchmarks to refine its prediction.
Pipeline Flow:
- Stage 1 — Initial AI Estimate: GPT-4o-mini analyzes the restaurant profile (name, category, rating, review count, price level, location) and produces a revenue breakdown by channel
- Confidence Gate: If the AI's confidence score falls below 0.6, the pipeline automatically triggers web research
- Web Research: Three targeted search queries are executed:
- "{category} restaurant average monthly revenue"
- "restaurant industry revenue per location {location}"
- "{category} restaurant sales benchmarks"
- Search Provider Cascade: Google Search via headless Playwright first; if no results, falls back to Bing Search API
- Stage 2 — Enhanced Estimate: AI re-analyzes with the original data plus web search results for a refined prediction
- Multi-Channel Output: Final estimate breaks revenue into dine-in, takeout/pickup, and delivery channels with total monthly revenue
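The confidence-gated flow above can be sketched as follows; `estimateRevenue` and `webSearch` are hypothetical stubs standing in for the real GPT-4o-mini call and the Google-then-Bing search cascade, and the stub confidence values are illustrative only:

```typescript
interface RevenueEstimate {
  dineIn: number;
  takeout: number;
  delivery: number;
  confidence: number; // 0-1
}

const CONFIDENCE_GATE = 0.6;

async function estimateRevenue(
  profile: { name: string; category: string; location: string },
  searchResults: string[] = []
): Promise<RevenueEstimate> {
  // Stub: the real implementation prompts GPT-4o-mini for structured JSON;
  // benchmark data raises confidence in this toy version.
  const confidence = searchResults.length > 0 ? 0.8 : 0.5;
  return { dineIn: 40000, takeout: 15000, delivery: 10000, confidence };
}

async function webSearch(query: string): Promise<string[]> {
  // Stub: the real pipeline tries Playwright Google search, then Bing.
  return [`result for: ${query}`];
}

async function predictRevenue(profile: { name: string; category: string; location: string }) {
  const initial = await estimateRevenue(profile); // Stage 1
  if (initial.confidence >= CONFIDENCE_GATE) return initial;

  // Confidence gate tripped: gather industry benchmarks.
  const queries = [
    `${profile.category} restaurant average monthly revenue`,
    `restaurant industry revenue per location ${profile.location}`,
    `${profile.category} restaurant sales benchmarks`,
  ];
  const results = (await Promise.all(queries.map(q => webSearch(q)))).flat();
  return estimateRevenue(profile, results); // Stage 2
}
```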
Multi-Source Owner Detection with Consensus Scoring
An intelligent multi-page owner extraction system that discovers restaurant owners by crawling relevant pages, sending each to the AI for name extraction, and aggregating results across sources using a custom consensus scoring algorithm.
Detection Algorithm:
- Page Discovery: Scans the website for links containing keywords: about, story, team, founder, contact, leadership, staff
- Priority Ordering: Pages are ranked by relevance (about > story > team > founder > contact) for optimal processing order
- Batch Processing: Pages are loaded 2 at a time (max 10 total) via Playwright, with visible text extracted (max 8000 chars per page)
- AI Extraction: Each page's content is sent to GPT-4o-mini with a prompt instructing it to identify ownership roles only (not managers or staff)
- Structured Response: AI returns JSON with owner_name, confidence (0-1), reasoning, and source_snippet for evidence
- Early Exit: If any single extraction returns confidence ≥ 0.8, processing stops immediately to save API calls
- Consensus Aggregation: All extractions are grouped by owner name. Final score = sum(confidences) × number of sources, with a multi-source boost of up to +0.15 for names confirmed across multiple pages
- Best Match: The owner name with the highest aggregated score is selected, with evidence from all confirming pages
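The consensus step can be sketched directly from the formula above (score = sum of confidences × source count, plus a multi-source boost capped at +0.15); the +0.05-per-extra-page boost curve and tie-breaking are assumptions:

```typescript
interface Extraction {
  ownerName: string;
  confidence: number; // 0-1, from the per-page AI extraction
  sourceUrl: string;
}

function pickOwner(extractions: Extraction[]) {
  // Group extractions by (case-insensitive) owner name.
  const byName = new Map<string, Extraction[]>();
  for (const e of extractions) {
    const key = e.ownerName.toLowerCase();
    if (!byName.has(key)) byName.set(key, []);
    byName.get(key)!.push(e);
  }

  let best: { name: string; score: number; sources: string[] } | null = null;
  for (const group of byName.values()) {
    const sum = group.reduce((s, e) => s + e.confidence, 0);
    // Multi-source boost: +0.05 per extra confirming page, capped at +0.15.
    const boost = Math.min(0.05 * (group.length - 1), 0.15);
    const score = sum * group.length + boost;
    if (!best || score > best.score) {
      best = { name: group[0].ownerName, score, sources: group.map(e => e.sourceUrl) };
    }
  }
  return best;
}
```

Multiplying by source count means a name confirmed on two pages at moderate confidence outranks a single high-confidence hit.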
Chain Affiliation Detection & Pre-Screening
A two-phase chain detection system that first pre-screens businesses to identify obvious chains (saving processing time), then performs deep AI analysis for borderline cases. Chains are automatically declined from the lead pipeline.
Two-Phase Flow:
- Phase 1 — Pre-Screening: chainPreScreener sends a quick AI query to classify the business as a large chain, small chain, or independent
- Skip Logic: If pre-screening identifies a large chain with high confidence, the full crawl is skipped entirely
- Phase 2 — Deep Analysis: For non-obvious cases, chainDetector performs full analysis considering website structure, location count, and business name patterns
- Output: Returns is_chain (boolean), group_name (parent company), confidence score, and evidence text
- Auto-Decline: When summary.isChainOrGroup.value === true, the lead status is automatically set to "declined" in the database
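The two-phase gate can be sketched like this; `preScreen` stands in for the chainPreScreener AI call (here a toy name check), and the 0.8 skip threshold and return shape are assumptions:

```typescript
type ChainKind = "large_chain" | "small_chain" | "independent";

async function preScreen(name: string): Promise<{ kind: ChainKind; confidence: number }> {
  // Stub: the real pre-screener asks GPT-4o-mini for a quick classification.
  return /starbucks|mcdonald|subway/i.test(name)
    ? { kind: "large_chain", confidence: 0.95 }
    : { kind: "independent", confidence: 0.5 };
}

async function checkChain(name: string): Promise<{ is_chain: boolean; skippedCrawl: boolean }> {
  const pre = await preScreen(name);
  if (pre.kind === "large_chain" && pre.confidence >= 0.8) {
    // Phase 1 hit: obvious chain, skip the full crawl entirely.
    return { is_chain: true, skippedCrawl: true };
  }
  // Phase 2: borderline case; hand off to the deep chainDetector
  // analysis (omitted here) before deciding.
  return { is_chain: false, skippedCrawl: false };
}
```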
AI Review Problem Analysis
Combines SerpAPI-powered review scraping with AI analysis to identify recurring business problems from customer reviews. The review scraper handles multiple response formats and calculates owner engagement metrics, while the AI categorizes problems across 7 dimensions.
Analysis Pipeline:
- Review Scraping: SerpAPI fetches up to 30 recent Google reviews with pagination (4 pages, 800ms delay between requests)
- Date Parsing: Handles three date formats: Unix timestamps, ISO dates, and relative strings ("3 days ago", "a week ago", "2 months ago") with estimated absolute date calculation
- Metrics Calculation:
- reviewsLast4Weeks: Reviews in past 28 days
- reviewsLast4Months: Reviews in past 4 months
- reviewsPerWeek: Calculated from date range of scraped reviews
- responseRate: 0-1 decimal of reviews with owner responses
- responseFrequency: "always" (≥80%), "sometimes" (1-79%), "never" (0%)
- Problem Detection: AI analyzes up to 10 most negative reviews and categorizes problems: food quality, service, delivery, cleanliness, pricing, wait times, order accuracy
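The engagement-metric math from the bullets above can be sketched as follows; the `Review` shape is assumed (the real scraper also estimates absolute dates from relative strings like "3 days ago"):

```typescript
interface Review {
  date: Date;
  ownerResponded: boolean;
}

function reviewMetrics(reviews: Review[], now = new Date()) {
  const DAY_MS = 86_400_000;

  // Reviews in the past 28 days.
  const reviewsLast4Weeks = reviews.filter(
    r => now.getTime() - r.date.getTime() <= 28 * DAY_MS
  ).length;

  // Share of reviews with an owner response, bucketed into a label.
  const responded = reviews.filter(r => r.ownerResponded).length;
  const responseRate = reviews.length ? responded / reviews.length : 0;
  const responseFrequency =
    responseRate >= 0.8 ? "always" : responseRate > 0 ? "sometimes" : "never";

  return { reviewsLast4Weeks, responseRate, responseFrequency };
}
```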
Business Type Classification (16 Categories)
A hybrid classification system that combines regex-based pre-screening with AI-powered deep classification. The regex classifier provides a fast initial guess, while GPT-4o-mini analyzes website content, Google Place types, and service indicators for a definitive classification across 16 business categories.
Classification Taxonomy:
- Fast Regex Pass: Pattern matching on website text for keywords (bakery, coffee, espresso, menu, entree, frozen yogurt, gelato) with confidence 0.3-0.7
- Google Type Mapping: Maps Google Place types (restaurant, cafe, bakery, bar, etc.) to the internal taxonomy
- AI Deep Classification: GPT-4o-mini receives website content + Google types + service indicators and selects from 16 categories (plus an "other" fallback):
- restaurant, fast_food, fast_casual, cafe, bakery, bar, pizzeria
- food_truck, catering, deli, dessert_shop, breakfast_spot
- juice_bar, buffet, food_hall, ghost_kitchen, other
- Validation: AI output is validated against the enum list; invalid types are rejected and defaulted
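The validation step can be sketched like this; the category list mirrors the taxonomy above, while normalizing the AI output and defaulting invalid types to "other" are assumptions:

```typescript
const BUSINESS_TYPES = [
  "restaurant", "fast_food", "fast_casual", "cafe", "bakery", "bar",
  "pizzeria", "food_truck", "catering", "deli", "dessert_shop",
  "breakfast_spot", "juice_bar", "buffet", "food_hall", "ghost_kitchen",
  "other",
] as const;

type BusinessType = (typeof BUSINESS_TYPES)[number];

function validateType(aiOutput: string): BusinessType {
  const normalized = aiOutput.trim().toLowerCase();
  // Reject anything outside the enum and fall back to the default.
  return (BUSINESS_TYPES as readonly string[]).includes(normalized)
    ? (normalized as BusinessType)
    : "other";
}
```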
Multi-Signal Ordering Vendor Detection
A configurable rules-based detection engine that identifies which ordering platform a restaurant uses. The system follows ordering CTAs, navigates to order pages, and applies a multi-signal scoring algorithm across four detection dimensions.
Detection Signals & Weights:
- CTA Discovery: Scans for links matching order keywords (order online, start order, pickup, delivery, order now)
- Link Following: Playwright navigates to the top 2 ordering links, capturing final URLs after redirects
- Domain Match (+5 pts): Checks if the order page domain matches known vendor domains (owner.com, popmenu.com, spothopper.com, etc.)
- URL Regex (+4 pts): Pattern matching on the full URL for vendor-specific patterns
- Script Host Detection (+3 pts): Collects all <script src> hostnames and matches against vendor script domains
- Text Pattern (+2 pts): Scans page content for strings like "powered by Owner" or "made with Popmenu"
- POS Hints: Each ordering vendor maps to likely POS systems (e.g., Toast ordering → Toast POS)
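The weighted scoring pass over the four signals above can be sketched as follows; the `VendorRule` shape is an assumption (the real rules live in externalized JSON config):

```typescript
interface VendorRule {
  name: string;
  domains: string[];       // known vendor domains
  urlPatterns: RegExp[];   // vendor-specific URL patterns
  scriptHosts: string[];   // vendor script hostnames
  textPatterns: RegExp[];  // "powered by ..." style strings
}

function scoreVendor(
  rule: VendorRule,
  page: { url: string; scriptHosts: string[]; text: string }
): number {
  let score = 0;
  const host = new URL(page.url).hostname;
  if (rule.domains.some(d => host.endsWith(d))) score += 5;                 // domain match
  if (rule.urlPatterns.some(p => p.test(page.url))) score += 4;             // URL regex
  if (rule.scriptHosts.some(h => page.scriptHosts.includes(h))) score += 3; // script host
  if (rule.textPatterns.some(p => p.test(page.text))) score += 2;           // text pattern
  return score;
}
```

Each rule is scored independently and the highest-scoring vendor wins, so a page that matches on domain, URL, script host, and text tops out at 14 points.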
Crawling & Data Pipelines
Zipcode Discovery Pipeline with Session Management
A session-based discovery pipeline that finds all restaurants in a zipcode, creates lightweight scaffold database entries, and processes them in resumable batches with configurable concurrency.
Discovery Flow:
- Geocoding: Converts zipcode to lat/lng coordinates via Google Geocoding API
- Multi-Strategy Search: Searches using 6 place types (restaurant, cafe, bakery, bar, meal_takeaway, meal_delivery), plus 16 food keywords in comprehensive mode
- Quadrant Division: For large search radii, divides the area into quadrants so coverage is not capped by per-search API result limits
- Deduplication: Merges results by google_place_id across all search strategies
- Scaffold Creation: Pre-creates minimal database entries (name, place_id, zipcode) for all discovered restaurants
- Batch Processing: POST /api/zipcode/:sessionId/next processes configurable batches with chain pre-screening, full crawl, and AI summary generation
- Session Cleanup: Sessions auto-expire after 1 hour; can be manually deleted
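The deduplication step can be sketched like this: results from every search strategy are collapsed by google_place_id, keeping the first occurrence (the `DiscoveredPlace` shape is an assumption):

```typescript
interface DiscoveredPlace {
  google_place_id: string;
  name: string;
}

function dedupeByPlaceId(results: DiscoveredPlace[]): DiscoveredPlace[] {
  const seen = new Map<string, DiscoveredPlace>();
  for (const r of results) {
    // First strategy to find a place wins; later duplicates are dropped.
    if (!seen.has(r.google_place_id)) seen.set(r.google_place_id, r);
  }
  return [...seen.values()];
}
```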
Confidence-Weighted Evidence System
Every data field in the system carries a confidence score (0-1) and an array of evidence objects with source URLs, descriptive notes, and optional source snippets. This creates a fully auditable data provenance chain from raw source to final value.
Field Structure:
- Value: The extracted data (text, number, or JSON)
- Normalized Choice: For enum-like fields, the canonical value (e.g., "restaurant", "yes", "toast")
- Confidence: 0.0-1.0 score indicating reliability of the extraction
- Evidence Array: Each entry contains:
- url: Source URL where data was found
- note: Human-readable description of how the data was extracted
- snippet: Optional raw text excerpt from the source
- Database Storage: Stored in crawl_fields table with value_text, value_num, value_json columns and JSONB evidence column
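A possible TypeScript shape for the field structure described above; the names mirror the bullets and the database columns, but the exact types are assumptions:

```typescript
interface Evidence {
  url: string;      // source URL where the data was found
  note: string;     // human-readable description of the extraction
  snippet?: string; // optional raw text excerpt from the source
}

interface CrawlField<T = string | number | object> {
  value: T;                  // maps to value_text / value_num / value_json
  normalizedChoice?: string; // canonical enum value, e.g. "toast"
  confidence: number;        // 0.0-1.0 reliability score
  evidence: Evidence[];      // stored in the JSONB evidence column
}

const businessType: CrawlField<string> = {
  value: "restaurant",
  normalizedChoice: "restaurant",
  confidence: 0.92,
  evidence: [
    { url: "https://example.com/about", note: "AI classification from website content" },
  ],
};
```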
Technical Architecture
Modular Pipeline Architecture
The application is organized into specialized modules that compose into configurable pipelines. Each module is independently testable and replaceable, with clear input/output contracts.
Providers (Data Sources)
- Google Places API — Business data, ratings, reviews, geocoding, nearby search with pagination
- SerpAPI — Google Maps review scraping with date parsing and owner response detection
- OpenAI (GPT-4o-mini) — 6 distinct AI analysis functions with structured JSON responses
- Google Search (Playwright) — Headless browser SERP scraping for market research
- Bing Search — Fallback search provider when Google returns no results
Analyzers (AI Intelligence)
- revenuePredictor — Two-stage revenue estimation with adaptive web search
- reviewAnalyzer — Negative review problem categorization across 7 dimensions
- chainDetector — Full chain/franchise affiliation analysis
- chainPreScreener — Quick large-chain identification for skip optimization
- businessTypeClassifier — 16-category classification with hybrid regex + AI
- businessSummary — Orchestrates parallel execution of all analyzers
Detectors (Pattern Matching)
- ownerDetection — Multi-page crawling with consensus scoring
- orderingVendor — Rules-based multi-signal vendor identification
- classifyBusiness — Fast regex-based pre-classification
- googleTypeMapping — Google Place types to internal taxonomy bridge
Crawlers & Scrapers
- playwrightCrawler — Headless Chromium website crawling with DOM parsing, link following, and script host detection
- serpApiReviews — Paginated Google Maps review scraping with dual-format response handling
- googleReviewsScraper — Legacy Playwright-based review scraper
Database Layer (PostgreSQL)
- 7 Tables — businesses, crawl_runs, crawl_fields, crawl_pages, review_metrics, social_links, ordering_detections
- UUID Primary Keys — Globally unique identifiers across all tables
- JSONB Columns — Flexible storage for evidence, artifacts, and review data
- Transactional Writes — All crawl data written in a single transaction with rollback on error
- Upsert Patterns — ON CONFLICT handling for review_metrics, social_links, and ordering_detections
- 6 Migrations — Incremental schema evolution from init through social links
API Layer (Express.js)
- Leads API — Full CRUD with advanced filtering (rating, reviews, chain, vendor, business type), multi-field sorting, and pagination
- Zipcode Discovery API — Session-based discovery with start/next/status/cancel lifecycle
- Recrawl API — Batch re-crawling with configurable skip flags (reviews, owner detection)
- Health Check — /health endpoint for uptime monitoring
Runners (Entry Points)
- crawlOne — Single business crawl from CLI or API
- crawlCsv — Batch CSV processing with p-limit concurrency control
- discoverZipcode — Zipcode-based restaurant discovery with scaffold creation
- detectOwner — Standalone owner detection runner
Skills Demonstrated
AI & LLM Integration
- OpenAI GPT-4o-mini (6 distinct use cases)
- Structured JSON prompt engineering
- Adaptive confidence-gated pipelines
- Multi-source consensus scoring
- AI-driven web search enrichment
- Review sentiment & problem analysis
Web Scraping & Automation
- Playwright headless browser automation
- DOM parsing & text extraction
- Script host detection
- CTA link following with redirect capture
- Google SERP scraping
- Rate limiting & pagination handling
API Integration
- Google Places API (Nearby, Details, Geocoding)
- SerpAPI (Google Maps Reviews)
- Bing Search API
- OpenAI Chat Completions API
- Provider cascade pattern (fallbacks)
Database & Data Design
- PostgreSQL schema design
- UUID primary keys
- JSONB for flexible evidence storage
- Transactional writes with rollback
- Upsert patterns (ON CONFLICT)
- Incremental migrations
Architecture & Design
- Modular pipeline composition
- Confidence-weighted evidence system
- Session-based discovery lifecycle
- Rules engine (externalized JSON)
- Multi-signal weighted scoring
- Graceful degradation & fallbacks
TypeScript & Node.js
- TypeScript 5.6 with strict typing
- Express.js 5 REST API
- p-limit concurrency control
- CLI & API dual entry points
- Comprehensive error handling
- Structured logging system
Key Achievements
6 AI Analyzers
Revenue prediction, owner detection, chain detection, review analysis, business classification, pre-screening
Adaptive Pipeline
Confidence-gated web search that autonomously enriches low-confidence estimates
Consensus Scoring
Multi-source owner detection with confidence aggregation across pages
5 Data Providers
Google Places, SerpAPI, OpenAI, Google Search, Bing with cascade fallbacks
Rules Engine
Externalized JSON vendor detection with 4-signal weighted scoring
Evidence Provenance
Every data point carries confidence score and source evidence chain
Session Discovery
Resumable zipcode-based restaurant discovery with scaffold entries
Auto-Decline Chains
AI-detected chains are automatically removed from the lead pipeline