`, `

` tags with CSS class/id hints like `"listing"`, `"card"`, `"member"`
- `parse_listing(tag)` at line 58: extracts name (tries `h1-h4`, `[itemprop=name]`, `a[title]`), website (first absolute href), city/category by keyword

---

### 6. `agents/scout/google_maps_client.py`: Google Places API

- `search_companies(industry, location, limit, query_text)` at line 65
  - Uses **Google Places API v1** (`places:searchText`) with `X-Goog-FieldMask` header to request only the fields we need
  - `query_text` comes from LLM query planner and overrides default query string
- `_map_industry(raw_type, fallback)` at line 161: maps Google place types (`"hospital"`, `"lodging"` etc.) to our 6-bucket industry taxonomy
- `_parse_city_state(formatted_address)` at line 170: parses `"123 Main St, Buffalo, NY 14201, USA"` by splitting on commas

---

### 7. `agents/scout/llm_deduplicator.py`: Two-Pass Deduplication

**Agentic concept:** LLM-assisted fuzzy matching

- `deduplicate(companies)` at line 167
  - **Pass 1**: exact domain matching using `_extract_domain()` at line 61. Fast, handles ~80% of duplicates
  - **Pass 2**: `_find_suspicious_pairs()` at line 75 uses **`difflib.SequenceMatcher`** to find name pairs with similarity ≥ 0.75, then `_ask_llm_which_are_duplicates()` at line 107 sends up to 8 suspicious pairs to LLM in one call asking for a JSON array of which pair numbers are duplicates
  - LLM call uses **LangChain `HumanMessage`** same as other modules

---

### 8. `agents/scout/scout_critic.py`: Quality Scoring + Source Learning

- `evaluate_quality(companies)` at line 45: pure math, no LLM. Scores 0–10 based on `website` (5pts), `city` (3pts), `phone` (2pts) field presence rates
- `update_source_performance(...)` at line 72: **upsert** to `source_performance` table, with rolling average `(old_avg * old_runs + new_score) / (old_runs + 1)`
- `rank_sources(industry, location, sources, db)` at line 132: SQLAlchemy query on `SourcePerformance` table, sorts by `avg_quality_score` descending.
This is the **self-learning loop**: sources that historically perform better get tried first

---

## Execution Phases (`scout_agent.run()`)

```
1. LLM Query Planning    → 3–5 diverse search queries (not hardcoded strings)
2. Source Ranking        → order API sources by past performance from DB
3. Phase 0: News Scout   → finds companies IN THE NEWS with buying signals
4. Phase 1: Directory    → scrapes configured DB sources (Yellow Pages etc.)
5. Phase 2: Tavily       → AI-powered web search using planned queries
6. Phase 3: API Sources  → Google Maps + Yelp, one call per planned query
7. LLM Deduplication     → removes near-duplicates from the API batch
8. Quality Retry         → if <80% of target found, generates NEW queries and retries
9. Source Performance    → writes results back to DB so future runs learn
```

---

## Scout Critic Quality Rubric

After each source, the Critic scores the batch **0.0–10.0**:

| Field | Points |
|---|---|
| Website present | 5.0 |
| City present | 3.0 |
| Phone present | 2.0 |

- Score **≥ 6.0** = good quality
- Score **< 6.0** = try another source

The Critic also writes to the `source_performance` table: a rolling average per `(source, industry, location)`. Next run, `rank_sources()` reads this to put the best-performing source first.
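
The rubric and the learning loop are plain arithmetic, so a small worked example may help. The sketch below is illustrative only and is not the code in `scout_critic.py`: the function names `score_batch`, `rolling_average`, and `order_sources` are hypothetical, the batch is made-up data, and the in-memory dict stands in for the `source_performance` table. It assumes each field contributes its points multiplied by the fraction of companies that have that field.

```python
from typing import Iterable, Mapping

# Field weights and threshold taken from the rubric above.
WEIGHTS = {"website": 5.0, "city": 3.0, "phone": 2.0}
GOOD_THRESHOLD = 6.0


def score_batch(companies: Iterable[Mapping[str, object]]) -> float:
    """Score a batch 0.0-10.0 from field presence rates (hypothetical helper)."""
    batch = list(companies)
    if not batch:
        return 0.0  # assumption: an empty batch scores zero
    score = 0.0
    for field, points in WEIGHTS.items():
        present = sum(1 for company in batch if company.get(field))
        score += points * present / len(batch)
    return round(score, 2)


def rolling_average(old_avg: float, old_runs: int, new_score: float) -> float:
    """Rolling average formula quoted for the source_performance upsert."""
    return (old_avg * old_runs + new_score) / (old_runs + 1)


def order_sources(sources, avg_scores):
    """Best historical average first; unseen sources default to 0.0 (assumption)."""
    return sorted(sources, key=lambda s: avg_scores.get(s, 0.0), reverse=True)


# Made-up batch: website on 1 of 2, city on 2 of 2, phone on 1 of 2.
batch = [
    {"name": "A", "website": "https://a.example", "city": "Buffalo", "phone": None},
    {"name": "B", "website": None, "city": "Rochester", "phone": "555-0100"},
]
score = score_batch(batch)                    # 2.5 + 3.0 + 1.0 = 6.5
is_good = score >= GOOD_THRESHOLD             # True, so no fallback source needed
new_avg = rolling_average(8.0, 3, score)      # (8.0 * 3 + 6.5) / 4 = 7.625
ranked = order_sources(["google_maps", "yelp"], {"google_maps": 6.2, "yelp": new_avg})
print(score, is_good, new_avg, ranked)
```

With these numbers `yelp` moves ahead of `google_maps` on the next run, which is the reordering the flow diagram describes as "reversed if yelp historically better".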

---

## Key Agentic Concepts Used

| Concept | Tool / Tech | Where |
|---|---|---|
| Intent-Based Prospecting | `news_scout_client` | Phase 0: finds warm leads from news |
| LLM Query Planning | Claude via `llm_query_planner` | Step 1: diverse query generation |
| Adaptive Source Ranking | `SourcePerformance` DB table | `rank_sources()`: learns over time |
| LLM Deduplication | Claude via `llm_deduplicator` | After API batch collection |
| Quality-gated Retry | `llm_query_planner.plan_retry_queries` | If <80% target hit |
| Website Signal Enrichment | `website_crawler` | Crawls each company's site for employee/location signals |

---

## What Gets Saved

- **News companies**: name + industry minimum (LLM already classified), with `intent_signal` field
- **API companies**: name + industry + city minimum (no website required, since Google Maps/Yelp are trusted sources)
- **Directory companies**: must pass `_validate_scraped()`, which requires name + website + reachable site

---

## Full Data Flow

```
User request: "find 10 healthcare companies in Rochester NY"
        ↓
llm_query_planner.plan_queries()           ← LangChain → Ollama/OpenAI
        ↓  4 diverse query strings
scout_critic.rank_sources()                ← SQLAlchemy reads source_performance
        ↓  ordered: [google_maps, yelp] or reversed if yelp historically better
news_scout_client.find_companies_in_news()
    → _generate_news_queries()             ← LangChain → Ollama/OpenAI
    → _search_news() × 3                   ← Tavily API (topic=news)
    → _extract_companies_from_snippets()   ← LangChain → Ollama/OpenAI
    → saved with intent_signal field
        ↓
directory_scraper.scrape_directory()       ← BeautifulSoup4 + requests (paginated)
    → company_extractor.extract_all_fields()
    → website_crawler.crawl_company_site()
        ↓
search_client.search_with_queries()        ← Tavily API (web mode)
    → directory_scraper.scrape_directory() per found URL
        ↓
google_maps_client.search_companies() × 4 queries   ← Google Places API v1
yelp_client.search_companies()             ← Yelp API
        ↓
llm_deduplicator.deduplicate()
    Pass 1: domain exact match
    Pass 2: SequenceMatcher similarity → LangChain → Ollama/OpenAI
        ↓
If <80% found: llm_query_planner.plan_retry_queries() → retry loop
        ↓
scout_critic.update_source_performance() × per source   ← SQLAlchemy upsert
        ↓
return saved company IDs
```
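
The two deduplication passes near the end of the flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `llm_deduplicator.py`: the helper names are hypothetical, the companies are made-up data on `.example` domains, and the LLM call is replaced by a plain callable (`judge_pairs`) so the example runs offline. Only the 0.75 threshold, the cap of 8 pairs, and the use of `difflib.SequenceMatcher` come from the description above.

```python
from difflib import SequenceMatcher
from itertools import combinations
from urllib.parse import urlparse

SIMILARITY_THRESHOLD = 0.75  # threshold quoted in the module notes
MAX_PAIRS_PER_CALL = 8       # pair cap quoted in the module notes


def extract_domain(url):
    """Normalize a URL to a bare host: 'https://www.acme.example/x' -> 'acme.example'."""
    if not url:
        return None
    host = urlparse(url if "://" in url else "//" + url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host or None


def dedupe_by_domain(companies):
    """Pass 1: keep the first company seen for each website domain."""
    seen, kept = set(), []
    for company in companies:
        domain = extract_domain(company.get("website"))
        if domain is not None and domain in seen:
            continue  # exact domain match, treat as duplicate
        if domain is not None:
            seen.add(domain)
        kept.append(company)
    return kept


def find_suspicious_pairs(companies):
    """Pass 2a: index pairs whose lowercased names are at least 75% similar."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(companies), 2):
        ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            pairs.append((i, j))
    return pairs[:MAX_PAIRS_PER_CALL]


def deduplicate(companies, judge_pairs):
    """Run both passes; judge_pairs is a stand-in for the LLM call (assumption)."""
    kept = dedupe_by_domain(companies)
    pairs = find_suspicious_pairs(kept)
    confirmed = judge_pairs(kept, pairs) if pairs else []
    drop = {j for _, j in confirmed}  # drop the later entry of each confirmed pair
    return [c for idx, c in enumerate(kept) if idx not in drop]


companies = [
    {"name": "Lakeside Medical Group", "website": "https://www.lakeside.example"},
    {"name": "Lakeside Medical", "website": "http://lakeside.example/contact"},
    {"name": "Lakeside Medical Group LLC", "website": None},
    {"name": "Northgate Dental", "website": "https://northgate.example"},
]
# Stub judge that confirms every suspicious pair; a real judge would ask the LLM.
result = deduplicate(companies, lambda kept, pairs: pairs)
print([c["name"] for c in result])
```

Pass 1 removes the second entry because it shares a domain with the first. Pass 2 flags the entry with no website because its name is a near match, and the stub judge confirms it. Keeping the judge as an injected callable is a design choice for this sketch only; it makes the fuzzy-matching step testable without a model.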