Analyst Agent

Tech Stack Used#

Tech	Purpose
LangChain (`langchain_core.messages.HumanMessage`)	LLM calls in `llm_inspector.py`
Ollama / OpenAI	LLM provider via `LLM_PROVIDER` env var
Apollo API (`api.apollo.io`)	Company enrichment (employee count) + contact finding
Hunter API	Contact/email finding for decision-makers
SQLAlchemy ORM	All DB reads/writes — `Company`, `CompanyFeature`, `LeadScore`, `Contact`, `AgentRunLog`
JSON seed file (`industry_benchmarks.json`)	Benchmark data for spend estimation — loaded once, cached in memory
Python `requests`	HTTP calls to Apollo + Hunter APIs

File-by-File Breakdown#

1. `agents/analyst/analyst_agent.py` — Coordinator#

Entry point: run(company_ids, db_session, run_id, on_progress) at line 75

Loops over each company ID, calls process_one_company(), reports progress through the on_progress callback (used by the UI for live updates), and writes to agent_run_logs after each company.
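
The loop described above can be sketched roughly as follows. This is a simplified illustration, not the actual coordinator: the callback signature `on_progress(done, total, company_id)` and the stubbed `process_one_company()` body are assumptions, and the DB session, `run_id`, and the `agent_run_logs` write are left out.

```python
from typing import Callable, Iterable, Optional


def process_one_company(company_id: int) -> dict:
    """Stand-in for the real per-company pipeline (hypothetical body)."""
    return {"company_id": company_id, "status": "scored"}


def run(company_ids: Iterable[int],
        on_progress: Optional[Callable[[int, int, int], None]] = None) -> list:
    """Process each company in order and report progress after each one."""
    ids = list(company_ids)
    results = []
    for done, company_id in enumerate(ids, start=1):
        results.append(process_one_company(company_id))
        if on_progress is not None:
            # Assumed signature: (companies finished, total, current id).
            on_progress(done, len(ids), company_id)
    return results
```

A UI could pass a callback that updates a progress bar, for example `run([7, 9], on_progress=lambda done, total, cid: print(done, total, cid))`.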

Full pipeline per company — process_one_company() at line 176:

gather_company_data()        → enrichment loop (crawl → Apollo → LLM → re-enrich)
spend_calculator             → utility + telecom spend estimates
savings_calculator           → low/mid/high savings range
score_engine.compute_score() → 0–100 composite score
score_engine.assign_tier()   → high / medium / low
llm_inspector.generate_score_narrative() → 1-sentence human explanation
save_features()              → writes CompanyFeature row
save_score()                 → writes LeadScore row
company.status = "scored"    → updates Company row

2. `agents/analyst/analyst_agent.py` — `gather_company_data()` at line 279#

Agentic concept: Adaptive Re-enrichment Loop

This is the data-gathering phase. Rather than crawling once, it checks which fields are still missing and decides whether another enrichment pass is worthwhile:

Step 1: website_crawler.crawl_company_site()    ← only if site_count or employee_count missing
Step 2: enrichment_client.enrich_company_data() ← Apollo API fallback if employee_count still 0
Step 3: llm_inspector.inspect_company()         ← LLM decides: "score_now" OR "enrich_before_scoring"
Step 4: Re-enrichment loop (max 2 attempts)     ← only if LLM says action="enrich_before_scoring"

The LLM is skipped entirely when industry is known AND employee_count > 0 AND site_count > 0, so no tokens are spent on records that are already complete.
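
The control flow of the four steps can be sketched as below. The crawler, Apollo client, and LLM inspector are passed in as plain callables with hypothetical signatures (each takes the data dict and returns a dict), and what one re-enrichment attempt actually does is an assumption: here each attempt re-runs the crawl and the Apollo lookup, then asks the LLM again.

```python
MAX_REENRICH_ATTEMPTS = 2  # "max 2 attempts" from Step 4


def is_complete(data: dict) -> bool:
    """True when industry, employee_count and site_count are all known."""
    return (data.get("industry", "unknown") != "unknown"
            and data.get("employee_count", 0) > 0
            and data.get("site_count", 0) > 0)


def gather_company_data(data: dict, crawl, enrich, inspect) -> dict:
    # Step 1: crawl only when site_count or employee_count is missing.
    if not data.get("site_count") or not data.get("employee_count"):
        data.update(crawl(data))
    # Step 2: Apollo fallback when employee_count is still 0.
    if not data.get("employee_count"):
        data.update(enrich(data))
    # Skip the LLM when the record is already complete.
    if is_complete(data):
        return data
    # Step 3: the LLM chooses "score_now" or "enrich_before_scoring".
    verdict = inspect(data)
    # Step 4: bounded re-enrichment loop (assumed behaviour per attempt).
    for _ in range(MAX_REENRICH_ATTEMPTS):
        if verdict.get("action") != "enrich_before_scoring":
            break
        data.update(crawl(data))
        data.update(enrich(data))
        verdict = inspect(data)
    return data
```

Bounding the loop with a fixed attempt count keeps the cost predictable even if the LLM keeps asking for more data.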

3. `agents/analyst/llm_inspector.py` — Two LLM Jobs#

Agentic concept: LLM as Data Quality Judge + Narrative Generator

Job 1 — inspect_company() at line 81:

Sends the company name, website, industry, employee count, site count, and a 600-character excerpt of the crawled text to the LLM via a LangChain HumanMessage
The LLM returns structured JSON:

{
  "inferred_industry": "healthcare",
  "data_gaps": ["employee_count"],
  "action": "enrich_before_scoring",
  "confidence": "high"
}

If inferred_industry is returned and the DB value was "unknown", the inferred value overwrites it
action = "enrich_before_scoring" triggers the re-enrichment loop in gather_company_data()
Falls back to {"action": "score_now", "confidence": "low"} on any LLM failure
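
One way to handle the reply defensively is sketched below. The helper names `parse_inspection` and `apply_inferred_industry` are hypothetical, and extracting the outermost `{...}` from a chatty reply is an assumption about how the raw text might look. The fallback dictionary and the "only overwrite unknown" rule come from the description above.

```python
import json

FALLBACK = {"action": "score_now", "confidence": "low"}
VALID_ACTIONS = {"score_now", "enrich_before_scoring"}


def parse_inspection(raw: str) -> dict:
    """Parse the LLM reply; return the fallback on any malformed output."""
    try:
        # Tolerate text around the JSON object (assumed model behaviour).
        start, end = raw.index("{"), raw.rindex("}") + 1
        parsed = json.loads(raw[start:end])
    except ValueError:  # missing braces or json.JSONDecodeError
        return dict(FALLBACK)
    if not isinstance(parsed, dict) or parsed.get("action") not in VALID_ACTIONS:
        return dict(FALLBACK)
    return parsed


def apply_inferred_industry(current: str, parsed: dict) -> str:
    """Overwrite the stored industry only when it was "unknown"."""
    inferred = parsed.get("inferred_industry")
    return inferred if inferred and current == "unknown" else current
```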

Job 2 — generate_score_narrative() at line 183:

Sends score, tier, savings estimate, industry, employee count, sites, state, deregulated flag to LLM
LLM writes a single sentence (max 25 words) explaining why this company scored the way it did
Example output: “5-site healthcare group in NY’s deregulated market with $420k in recoverable utility savings.”
Falls back to _fallback_narrative() (rule-based template) if LLM fails
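
A rule-based fallback of this kind might look like the sketch below. The template wording and the parameter list are assumptions for illustration; only the idea of a template filled from the score inputs, capped at 25 words, comes from the description above.

```python
def fallback_narrative(score: float, tier: str, industry: str,
                       site_count: int, state: str, savings_mid: float) -> str:
    """Build a one-sentence explanation without calling the LLM."""
    sentence = (
        f"{site_count}-site {industry} company in {state} scored {score} "
        f"({tier} tier) with about ${savings_mid:,.0f} in estimated savings."
    )
    # Enforce the 25-word cap that the LLM prompt also asks for.
    return " ".join(sentence.split()[:25])


print(fallback_narrative(82, "high", "healthcare", 5, "NY", 420_000))
```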

Both use LangChain HumanMessage → Ollama or OpenAI via _call_llm() at line 38.

4. `agents/analyst/enrichment_client.py` — Apollo + Hunter#

Two jobs:

enrich_company_data(domain) at line 64 — Apollo organization enrichment:

POST https://api.apollo.io/api/v1/organizations/enrich with {"domain": "example.com"}
Returns employee_count, city, state
Silently returns {} if APOLLO_API_KEY is missing or the domain is unknown

find_contacts(company_name, domain, db) — Hunter + Apollo people search:

Targets decision-maker titles: CFO, VP Finance, Director of Facilities, Energy Manager, etc.
Title priority ranking: CFO=1 → VP Finance=2 → Director of Facilities=3 → VP Operations=4
Module-level flags _hunter_blocked / _apollo_blocked skip providers for the rest of the run if rate-limited
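
The title ranking can be illustrated with a small sorter. The substring matching and the `UNRANKED` value for titles outside the listed four are assumptions; the real `find_contacts()` targets more titles than the four ranks shown in the text.

```python
TITLE_PRIORITY = {
    "cfo": 1,
    "vp finance": 2,
    "director of facilities": 3,
    "vp operations": 4,
}
UNRANKED = 99  # assumption: unlisted titles sort last


def title_rank(title: str) -> int:
    """Return the best (lowest) priority whose keyword appears in the title."""
    lowered = title.lower()
    return min(
        (rank for keyword, rank in TITLE_PRIORITY.items() if keyword in lowered),
        default=UNRANKED,
    )


def rank_contacts(contacts: list) -> list:
    """Sort contact dicts so the highest-priority decision-maker comes first."""
    return sorted(contacts, key=lambda c: title_rank(c.get("title", "")))
```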

5. `agents/analyst/spend_calculator.py` — Spend Estimation#

No LLM. Pure math from benchmark data.

utility_spend = site_count × avg_sqft_per_site × kwh_per_sqft_per_year × electricity_rate
telecom_spend = employee_count × telecom_per_employee
total_spend   = utility_spend + telecom_spend

Key functions:

Function	Line	Purpose
`calculate_utility_spend(site_count, industry, state)`	13	Site count × sqft × kWh × rate
`calculate_telecom_spend(employee_count, industry)`	25	Employee count × telecom benchmark
`calculate_total_spend(utility, telecom)`	32	Sums both

All benchmark values come from benchmarks_loader.get_benchmark(industry, state).
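
A worked example of the formulas, using made-up benchmark numbers. The values below are illustrative only and are not taken from `industry_benchmarks.json`; the helper names differ from the real functions, which look the benchmark up from `industry` and `state` themselves.

```python
def utility_spend(site_count: int, benchmark: dict) -> float:
    """site_count x avg_sqft_per_site x kwh_per_sqft_per_year x electricity_rate"""
    return (site_count
            * benchmark["avg_sqft_per_site"]
            * benchmark["kwh_per_sqft_per_year"]
            * benchmark["electricity_rate"])


def telecom_spend(employee_count: int, benchmark: dict) -> float:
    """employee_count x telecom_per_employee"""
    return employee_count * benchmark["telecom_per_employee"]


# Hypothetical benchmark row for illustration.
benchmark = {
    "avg_sqft_per_site": 20_000,
    "kwh_per_sqft_per_year": 25,
    "electricity_rate": 0.12,       # $/kWh
    "telecom_per_employee": 1_000,  # $/year
}

utility = utility_spend(5, benchmark)    # 5 sites
telecom = telecom_spend(200, benchmark)  # 200 employees
total = utility + telecom
print(f"utility=${utility:,.0f} telecom=${telecom:,.0f} total=${total:,.0f}")
```

With these inputs the utility estimate is about $300,000 and the telecom estimate $200,000, for a total of about $500,000 per year.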

6. `agents/analyst/benchmarks_loader.py` — Benchmark Data#

Loads database/seed_data/industry_benchmarks.json once at startup, caches in _BENCHMARKS_CACHE
get_benchmark(industry, state) at line 34 — returns avg_sqft_per_site, kwh_per_sqft_per_year, telecom_per_employee, electricity_rate
get_electricity_rate(state) at line 65 — state-level $/kWh rates; defaults to 0.12 if state unknown
refresh_benchmarks() at line 74 — clears cache to force reload (used in tests)
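
The load-once-and-cache pattern can be sketched as follows. The JSON layout (a top-level `electricity_rates` mapping) and the explicit `path` argument are assumptions made to keep the example self-contained; the 0.12 default comes from the description above.

```python
import json
from pathlib import Path

DEFAULT_ELECTRICITY_RATE = 0.12  # $/kWh when the state is unknown
_BENCHMARKS_CACHE = None


def load_benchmarks(path) -> dict:
    """Read the seed file on first use, then serve it from memory."""
    global _BENCHMARKS_CACHE
    if _BENCHMARKS_CACHE is None:
        _BENCHMARKS_CACHE = json.loads(Path(path).read_text(encoding="utf-8"))
    return _BENCHMARKS_CACHE


def get_electricity_rate(state: str, path) -> float:
    """State-level rate, with a default for unknown states."""
    rates = load_benchmarks(path).get("electricity_rates", {})  # assumed key
    return rates.get(state, DEFAULT_ELECTRICITY_RATE)


def refresh_benchmarks() -> None:
    """Drop the cache so the next call re-reads the file."""
    global _BENCHMARKS_CACHE
    _BENCHMARKS_CACHE = None
```

Because the cache lives at module level, tests that swap in a different seed file need to call `refresh_benchmarks()` first, which matches its stated purpose.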

7. `agents/analyst/savings_calculator.py` — Savings Range#

Three-bracket estimate from total spend:

Bracket	Rate	Formula
Low	10%	`total_spend × 0.10`
Mid	13.5%	`total_spend × 0.135`
High	17%	`total_spend × 0.17`

calculate_tb_revenue(savings_mid) — multiplies savings_mid × TB_CONTINGENCY_FEE (default 24%) to get Troy & Banks expected revenue.
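
The bracket arithmetic is simple enough to show directly. The rates and the 24% default come from the text above; returning the three brackets as a dict is an assumption about the shape of `calculate_all_savings()`.

```python
SAVINGS_RATES = {"low": 0.10, "mid": 0.135, "high": 0.17}
TB_CONTINGENCY_FEE = 0.24  # default contingency fee


def calculate_all_savings(total_spend: float) -> dict:
    """Return the low / mid / high savings estimates for a total spend."""
    return {bracket: total_spend * rate for bracket, rate in SAVINGS_RATES.items()}


def calculate_tb_revenue(savings_mid: float, fee: float = TB_CONTINGENCY_FEE) -> float:
    """Expected revenue as a share of the mid-bracket savings."""
    return savings_mid * fee


savings = calculate_all_savings(1_000_000)
revenue = calculate_tb_revenue(savings["mid"])
print(f"mid savings=${savings['mid']:,.0f} expected revenue=${revenue:,.0f}")
```

For $1,000,000 of total spend, the mid bracket is about $135,000 and the expected revenue about $32,400.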

8. `agents/analyst/score_engine.py` — Composite Scoring#

compute_score() at line 38 — weighted 0–100 formula:

Component	Max Points	Driver
Recovery (savings_mid)	40 pts	sub-score by savings: ≥$2M=100, ≥$1M=85, ≥$500k=70, ≥$250k=55, below=40
Industry fit	25 pts	sub-score by industry: healthcare/hospitality/manufacturing/retail=90, public_sector/office=70, unknown=45
Multi-site	20 pts	≥20 sites=20, ≥10=17, ≥5=13, ≥2=8, 1 site=3
Data quality	15 pts	0–10 score → mapped to 1/4/8/12/15 pts

Weights are configurable via settings.SCORE_WEIGHT_RECOVERY, SCORE_WEIGHT_INDUSTRY, etc.
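
A rough sketch of how the four components could combine. Several details are assumptions rather than facts from `score_engine.py`: that the recovery and industry values are 0–100 sub-scores scaled by their weights, that multi-site and data-quality values are added as points directly, that unlisted industries are treated as unknown, and the thresholds used to map the 0–10 quality score onto 1/4/8/12/15.

```python
WEIGHT_RECOVERY = 40  # max points, per the table
WEIGHT_INDUSTRY = 25

INDUSTRY_SUBSCORE = {
    "healthcare": 90, "hospitality": 90, "manufacturing": 90, "retail": 90,
    "public_sector": 70, "office": 70,
}


def recovery_subscore(savings_mid: float) -> int:
    for floor, sub in ((2_000_000, 100), (1_000_000, 85), (500_000, 70), (250_000, 55)):
        if savings_mid >= floor:
            return sub
    return 40


def multisite_points(site_count: int) -> int:
    for floor, pts in ((20, 20), (10, 17), (5, 13), (2, 8)):
        if site_count >= floor:
            return pts
    return 3


def quality_points(quality: int) -> int:
    # Hypothetical thresholds; the source only lists the five point values.
    for ceiling, pts in ((2, 1), (4, 4), (6, 8), (8, 12)):
        if quality <= ceiling:
            return pts
    return 15


def compute_score(savings_mid: float, industry: str, site_count: int, quality: int) -> float:
    recovery = recovery_subscore(savings_mid) * WEIGHT_RECOVERY / 100
    fit = INDUSTRY_SUBSCORE.get(industry, 45) * WEIGHT_INDUSTRY / 100
    return round(recovery + fit + multisite_points(site_count) + quality_points(quality), 1)
```

Under these assumptions, a 5-site healthcare company with $600k mid savings and a quality score of 8 gets 28 + 22.5 + 13 + 12 = 75.5.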

assign_tier() at line 61:

Score	Tier
≥ `HIGH_SCORE_THRESHOLD`	`high`
≥ `MEDIUM_SCORE_THRESHOLD`	`medium`
Below	`low`
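
The tier lookup reduces to two comparisons. The threshold values 70 and 40 below are placeholders chosen for illustration; the real values come from `HIGH_SCORE_THRESHOLD` and `MEDIUM_SCORE_THRESHOLD` in settings.

```python
HIGH_SCORE_THRESHOLD = 70    # hypothetical value
MEDIUM_SCORE_THRESHOLD = 40  # hypothetical value


def assign_tier(score: float,
                high: float = HIGH_SCORE_THRESHOLD,
                medium: float = MEDIUM_SCORE_THRESHOLD) -> str:
    """Map a 0-100 score to "high", "medium" or "low"."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"
```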

assess_data_quality() at line 106 — 0–10 quality signal:

Signal	Points
Has website	+2
Has locations page	+2
site_count > 0	+2
employee_count > 0	+2
Contact found in DB	+2
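
Since each signal is worth 2 points, the quality score is twice the number of signals present. The sketch below uses a hypothetical flat argument list; the real function presumably reads these values from the company record and the DB.

```python
def assess_data_quality(has_website: bool, has_locations_page: bool,
                        site_count: int, employee_count: int,
                        contact_found: bool) -> int:
    """Return a 0-10 quality signal: +2 for each of the five checks that passes."""
    signals = (
        has_website,
        has_locations_page,
        site_count > 0,
        employee_count > 0,
        contact_found,
    )
    return 2 * sum(bool(s) for s in signals)
```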

Key Agentic Concepts Used#

Concept	Tool / Tech	Where
Adaptive Re-enrichment Loop	`website_crawler` + Apollo API + LLM	`gather_company_data()` — loops up to 2x
LLM as Data Quality Judge	LangChain + Ollama/OpenAI	`llm_inspector.inspect_company()`
LLM Narrative Generation	LangChain + Ollama/OpenAI	`llm_inspector.generate_score_narrative()`
Benchmark-driven Spend Estimation	JSON seed file + `benchmarks_loader`	`spend_calculator.py`
Contact Targeting by Title Priority	Apollo + Hunter APIs	`enrichment_client.find_contacts()`
Progress Streaming	`on_progress` callback → UI	`analyst_agent.run()`

What Gets Written to DB#

Table	Written by	Contents
`company_features`	`save_features()`	site_count, utility_spend, telecom_spend, savings low/mid/high, industry_fit_score, deregulated_state, data_quality_score
`lead_scores`	`save_score()`	score (0–100), tier, score_reason (LLM narrative), `approved_human=False`
`companies`	`process_one_company()`	status → `"scored"`
`contacts`	`enrichment_client.find_contacts()`	decision-maker emails from Hunter/Apollo
`agent_run_logs`	`_log_action()`	per-company action log for UI

Full Data Flow#

run(company_ids)
  └─ for each company_id:
       process_one_company()
         │
         ├─ gather_company_data()
         │    ├─ website_crawler.crawl_company_site()    ← reuses Scout's crawler
         │    ├─ enrichment_client.enrich_company_data() ← Apollo API (employee_count)
         │    ├─ llm_inspector.inspect_company()         ← LangChain → Ollama/OpenAI
         │    └─ re-enrichment loop (max 2x) if LLM says "enrich_before_scoring"
         │
         ├─ spend_calculator.calculate_utility_spend()   ← benchmark JSON × site_count
         ├─ spend_calculator.calculate_telecom_spend()   ← benchmark JSON × employee_count
         ├─ savings_calculator.calculate_all_savings()   ← 10% / 13.5% / 17% of total_spend
         │
         ├─ score_engine.compute_score()                 ← weighted 4-component formula
         ├─ score_engine.assign_tier()                   ← high / medium / low
         │
         ├─ llm_inspector.generate_score_narrative()     ← LangChain → 1-sentence explanation
         │
         ├─ save_features()  → CompanyFeature DB row
         ├─ save_score()     → LeadScore DB row
         └─ company.status = "scored"
