Scout Agent#
Tech Stack Used#
Tech |
Purpose |
|---|---|
LangChain ( |
Wraps LLM calls in all 3 LLM modules |
Ollama / OpenAI |
LLM provider (configured via |
Tavily API |
Web search + news search (HTTP |
Google Places API v1 |
|
Yelp Business Search API |
Via |
BeautifulSoup4 |
HTML parsing in |
SQLAlchemy ORM |
All DB reads/writes ( |
Python |
HTTP calls to all external APIs and web pages |
|
Name similarity scoring in |
|
Memoizes Tavily search results in |
File-by-File Breakdown#
1. agents/scout/scout_agent.py — Coordinator / Orchestrator#
Entry point: run(industry, location, count, db_session, run_id) at line 75
This is a sequential multi-source loop — not a graph, not an async pipeline. Pure Python for loops with early-exit:
if len(saved_ids) >= count: break
Writes live progress to agent_run_logs table via _log_progress() at line 38 — that’s what the UI polls for the live status feed.
2. agents/scout/llm_query_planner.py — LLM Query Generation#
Agentic concept: Dynamic Query Planning
plan_queries(industry, location)at line 106 — sends prompt to LLM via LangChain’sHumanMessage, asks for a JSON array of 4 search queriesplan_retry_queries(...)at line 154 — called only if <80% of target found; sends what was already tried so LLM avoids repetition_call_llm(prompt)at line 28 — branches onLLM_PROVIDER: Ollama usesllm.invoke([HumanMessage(...)]), OpenAI uses rawchat.completions.create()Falls back to hardcoded queries
_fallback_queries()at line 80 if LLM fails
3. agents/scout/news_scout_client.py — Intent-Based Lead Discovery#
Agentic concept: Intent-Based Prospecting
find_companies_in_news(industry, location)at line 254 — public entry point_generate_news_queries()at line 82 — LLM generates news-specific queries (looks for events, not directories)_search_news(query)at line 216 — calls Tavily API with"topic": "news"to get article snippets (not web pages)_extract_companies_from_snippets()at line 126 — feeds snippets to LLM via LangChainHumanMessage, LLM extracts company name + signal type (expansion,new_facility,cost_pressure, etc.) + detail as structured JSONReturns
intent_signalfield that gets stored in theCompanyDB row
4. agents/scout/search_client.py — Tavily Directory Discovery#
search_with_queries(queries, location)at line 163 — takes LLM-planned queries and searches Tavily for directory URLs to scrape_cached_tavily_search()at line 65 —@lru_cache(maxsize=64)prevents redundant Tavily calls within same process runFilters out 19 known-unscrappable domains (LinkedIn, Glassdoor, ZoomInfo etc.) via
_UNSCRAPPABLE_DOMAINSat line 34Discovered URLs get saved to
directory_sourcestable viadirectory_scraper.save_directory_sources()
5. agents/scout/directory_scraper.py — HTML Scraping with Pagination#
scrape_directory(url)at line 30 — pagination loop: keeps callingget_next_page()until no next pagefetch_page(url)at line 126 — retry loop (up toMAX_RETRIES), realistic browser headers, optional proxy viaget_proxy_url()_find_listing_elements(soup)at line 266 — uses BeautifulSoup4 to find<article>,<div>,<li>tags with CSS class/id hints like"listing","card","member"parse_listing(tag)at line 58 — extracts name (triesh1-h4,[itemprop=name],a[title]), website (first absolute href), city/category by keyword
6. agents/scout/google_maps_client.py — Google Places API#
search_companies(industry, location, limit, query_text)at line 65Uses Google Places API v1 (
places:searchText) withX-Goog-FieldMaskheader to request only the fields we needquery_textcomes from LLM query planner — overrides default query string_map_industry(raw_type, fallback)at line 161 — maps Google place types ("hospital","lodging"etc.) to our 6-bucket industry taxonomy_parse_city_state(formatted_address)at line 170 — parses"123 Main St, Buffalo, NY 14201, USA"by splitting on commas
7. agents/scout/llm_deduplicator.py — Two-Pass Deduplication#
Agentic concept: LLM-assisted fuzzy matching
deduplicate(companies)at line 167Pass 1 — exact domain matching using
_extract_domain()at line 61. Fast, handles ~80% of duplicatesPass 2 —
_find_suspicious_pairs()at line 75 usesdifflib.SequenceMatcherto find name pairs with similarity ≥ 0.75, then_ask_llm_which_are_duplicates()at line 107 sends up to 8 suspicious pairs to LLM in one call asking for a JSON array of which pair numbers are duplicatesLLM call uses LangChain
HumanMessagesame as other modules
8. agents/scout/scout_critic.py — Quality Scoring + Source Learning#
evaluate_quality(companies)at line 45 — pure math, no LLM. Scores 0–10 based onwebsite(5pts),city(3pts),phone(2pts) field presence ratesupdate_source_performance(...)at line 72 — upsert tosource_performancetable: rolling average(old_avg * old_runs + new_score) / (old_runs + 1)rank_sources(industry, location, sources, db)at line 132 — SQLAlchemy query onSourcePerformancetable, sorts byavg_quality_scoredescending. This is the self-learning loop — sources that historically perform better get tried first
Execution Phases (scout_agent.run())#
1. LLM Query Planning → 3–5 diverse search queries (not hardcoded strings)
2. Source Ranking → order API sources by past performance from DB
3. Phase 0: News Scout → finds companies IN THE NEWS with buying signals
4. Phase 1: Directory → scrapes configured DB sources (Yellow Pages etc.)
5. Phase 2: Tavily → AI-powered web search using planned queries
6. Phase 3: API Sources → Google Maps + Yelp, one call per planned query
7. LLM Deduplication → removes near-duplicates from the API batch
8. Quality Retry → if <80% of target found, generates NEW queries and retries
9. Source Performance → writes results back to DB so future runs learn
Scout Critic Quality Rubric#
After each source, the Critic scores the batch 0.0–10.0:
Field |
Points |
|---|---|
Website present |
5.0 |
City present |
3.0 |
Phone present |
2.0 |
Score ≥ 6.0 = good quality
Score < 6.0 = try another source
The Critic also writes to the source_performance table — a rolling average per (source, industry, location). Next run, rank_sources() reads this to put the best-performing source first.
Key Agentic Concepts Used#
Concept |
Tool / Tech |
Where |
|---|---|---|
Intent-Based Prospecting |
|
Phase 0 — finds warm leads from news |
LLM Query Planning |
Claude via |
Step 1 — diverse query generation |
Adaptive Source Ranking |
|
|
LLM Deduplication |
Claude via |
After API batch collection |
Quality-gated Retry |
|
If <80% target hit |
Website Signal Enrichment |
|
Crawls each company’s site for employee/location signals |
What Gets Saved#
News companies: name + industry minimum (LLM already classified), with
intent_signalfieldAPI companies: name + industry + city minimum (no website required — Google Maps/Yelp are trusted sources)
Directory companies: must pass
_validate_scraped()— requires name + website + reachable site
Full Data Flow#
User request: "find 10 healthcare companies in Rochester NY"
↓
llm_query_planner.plan_queries() ← LangChain → Ollama/OpenAI
↓ 4 diverse query strings
scout_critic.rank_sources() ← SQLAlchemy reads source_performance
↓ ordered: [google_maps, yelp] or reversed if yelp historically better
news_scout_client.find_companies_in_news()
→ _generate_news_queries() ← LangChain → Ollama/OpenAI
→ _search_news() × 3 ← Tavily API (topic=news)
→ _extract_companies_from_snippets() ← LangChain → Ollama/OpenAI
→ saved with intent_signal field
↓
directory_scraper.scrape_directory() ← BeautifulSoup4 + requests (paginated)
→ company_extractor.extract_all_fields()
→ website_crawler.crawl_company_site()
↓
search_client.search_with_queries() ← Tavily API (web mode)
→ directory_scraper.scrape_directory() per found URL
↓
google_maps_client.search_companies() × 4 queries ← Google Places API v1
yelp_client.search_companies() ← Yelp API
↓
llm_deduplicator.deduplicate()
Pass 1: domain exact match
Pass 2: SequenceMatcher similarity → LangChain → Ollama/OpenAI
↓
If <80% found: llm_query_planner.plan_retry_queries() → retry loop
↓
scout_critic.update_source_performance() × per source ← SQLAlchemy upsert
↓
return saved company IDs