Use LLM Agents to research government data at scale¶
This notebook demonstrates everyrow's rank() utility, combined with web research, to gather and rank real-world data that isn't published in any structured format.
Use Case: Real estate investors need permit processing timelines to evaluate markets—delays directly impact holding costs. But municipalities publish this data inconsistently: some on websites, some in PDFs, some not at all.
Why everyrow? The rank() function can perform web research to find permit processing times from official sources, contractor reports, and comparable city data—then rank cities by speed.
In [6]:
import pandas as pd
from dotenv import load_dotenv
from everyrow import create_session
from everyrow.ops import rank

load_dotenv()  # load API credentials before creating a session
Load Texas Cities Data¶
In [7]:
texas_cities_df = pd.read_csv("../data/texas_cities.csv")
print(f"Analyzing {len(texas_cities_df)} Texas cities")
texas_cities_df.head(10)
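If the CSV isn't on hand, the expected shape can be sketched directly. The columns below (city, region, population) are assumed from how the results are used later in the notebook; the rows and numbers are purely illustrative:

```python
import pandas as pd

# Illustrative stand-in for texas_cities.csv
# (assumed columns: city, region, population; values are made up).
sample = pd.DataFrame(
    {
        "city": ["Austin", "Lubbock", "El Paso"],
        "region": ["Central", "West", "West"],
        "population": [961855, 257141, 678815],
    }
)
print(sample.head())
```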
Out[7]:
Define Research & Ranking Task¶
The task prompt instructs everyrow to research permit processing times, prioritizing official sources over contractor reports and estimates.
In [8]:
RANKING_TASK = """
Research and score each Texas city by their RESIDENTIAL BUILDING PERMIT processing time.
The score should represent the NUMBER OF BUSINESS DAYS for typical residential permit approval.
Lower numbers = faster = better for real estate investors.
RESEARCH PRIORITIES (in order):
1. Official city development services performance metrics
2. City-stated standard processing times from permit office websites
3. Contractor reports and local builder forum discussions
4. Comparable city estimates if no direct data available
For cities without published data, estimate based on:
- City size (smaller cities often faster)
- Region patterns (some Texas regions known for faster permitting)
- Recent development activity levels
Output the score as estimated business days (e.g., 5 = 5 business days, 30 = 30 business days).
Include the source of information in your reasoning.
"""
Run the Research & Ranking¶
In [9]:
async def run_ranking():
    async with create_session(name="Texas Permit Times Research") as session:
        print(f"Session URL: {session.get_url()}")
        print("\nResearching permit processing times (this may take a few minutes)...\n")
        result = await rank(
            session=session,
            task=RANKING_TASK,
            input=texas_cities_df,
            field_name="score",
        )
        return result.data

results_df = await run_ranking()
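Because the score column is produced by model research rather than a database query, it can be worth a defensive check before sorting. This is a sketch on a toy frame, not part of everyrow's API: coerce the column to numeric and flag rows that didn't parse.

```python
import pandas as pd

# Toy frame standing in for the ranked results.
df = pd.DataFrame({"city": ["Austin", "Waco"], "score": ["12", "bad"]})

# Coerce to numeric; unparseable entries become NaN instead of raising.
df["score"] = pd.to_numeric(df["score"], errors="coerce")
bad = df[df["score"].isna()]
print(f"{len(bad)} rows with unparseable scores")
```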
Analyze Results¶
In [10]:
# Rename score to permit_days for clarity
results_df = results_df.rename(columns={"score": "permit_days"})
# Sort by permit time (ascending = fastest first)
results_df = results_df.sort_values("permit_days", ascending=True)
print(f"\n{'='*60}")
print("TEXAS CITIES BY PERMIT PROCESSING TIME")
print("(Fastest to Slowest)")
print(f"{'='*60}\n")
In [11]:
# Top 10 fastest
print("TOP 10 FASTEST (Best for Investors):")
print("-" * 50)
for i, (_, row) in enumerate(results_df.head(10).iterrows(), 1):
    print(f"{i:2}. {row['city']:20} | {row['permit_days']:3} days | Pop: {row['population']:,}")
    if 'research' in row and pd.notna(row['research']):
        print(f"    Source: {str(row['research'])[:60]}...")
    print()
In [12]:
# Bottom 10 slowest
print("\nTOP 10 SLOWEST (Highest Holding Costs):")
print("-" * 50)
for i, (_, row) in enumerate(results_df.tail(10).iloc[::-1].iterrows(), 1):
    print(f"{i:2}. {row['city']:20} | {row['permit_days']:3} days | Pop: {row['population']:,}")
In [13]:
# Average by region
print("\nAVERAGE PERMIT TIME BY REGION:")
print(results_df.groupby("region")["permit_days"].mean().sort_values().to_string())
In [14]:
# Summary stats
print("\nSUMMARY STATISTICS:")
print(f" Fastest city: {results_df.iloc[0]['city']} ({results_df.iloc[0]['permit_days']} days)")
print(f" Slowest city: {results_df.iloc[-1]['city']} ({results_df.iloc[-1]['permit_days']} days)")
print(f" Average: {results_df['permit_days'].mean():.1f} days")
print(f" Median: {results_df['permit_days'].median():.1f} days")
In [15]:
# Full results
results_df[["city", "region", "population", "permit_days", "research"]]
Out[15]:
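Since the research run takes minutes and costs API calls, persisting the ranked table is a natural last step. A minimal sketch (the output path and toy data here are illustrative):

```python
import pandas as pd

# Toy stand-in for the ranked results frame.
results_df = pd.DataFrame(
    {"city": ["Lubbock", "Austin"], "permit_days": [7, 45]}
)

out_path = "texas_permit_times_ranked.csv"  # illustrative path
results_df.to_csv(out_path, index=False)
```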