everyrow docs · by futuresearch

Understanding Costs and Speed for Merge

Every data engineer has faced the challenge: you have two tables that should join, but the keys don't quite match. Company names are spelled differently. Subsidiaries need to map to parents. Typos have crept in. Abbreviations vary.

The everyrow.merge() operation solves this by using a cost-optimized cascade of matching strategies:

Strategy        Cost          Speed      Example
Exact match     Free          Instant    "Apple Inc" → "Apple Inc"
Fuzzy match     Free          Fast       "Microsft Corp" → "Microsoft Corp"
LLM reasoning   ~$0.002/row   ~1s/row    "Instagram" → "Meta Platforms"
Web search      ~$0.01/row    ~5s/row    Obscure or stale data

The key insight: most real-world matches are cheap or free. The expensive LLM-based matching only kicks in when simpler methods fail.

This notebook empirically tests these claims with increasing levels of matching difficulty, measuring actual costs and timing at each step.
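The tiers above can be sketched locally. This is a toy illustration of the cascade pattern, not everyrow's actual implementation: try exact equality first, fall back to fuzzy similarity, and only escalate the leftovers to the paid tiers.

```python
from difflib import SequenceMatcher

def cascade_match(left: str, candidates: list[str], fuzzy_threshold: float = 0.85):
    """Toy cascade: exact -> fuzzy -> escalate (None means 'needs LLM/web tier')."""
    # Tier 1: exact equality, free and instant
    if left in candidates:
        return left, "exact"
    # Tier 2: case-insensitive fuzzy similarity, free and fast
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, left.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score >= fuzzy_threshold:
        return best, "fuzzy"
    # Tier 3: the cheap tiers failed; this row would be escalated (charged)
    return None, "escalate"

names = ["Apple Inc", "Microsoft Corp", "Meta Platforms"]
print(cascade_match("Apple Inc", names))      # exact tier
print(cascade_match("Microsft Corp", names))  # fuzzy tier catches the typo
print(cascade_match("Instagram", names))      # needs semantic knowledge, escalates
```

The economics follow directly from this shape: per-row cost is driven by how many rows fall through to the last tier.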

In [ ]:
# Setup: install everyrow if needed and configure API key
try:
    import everyrow
except ImportError:
    %pip install everyrow

import os
if "EVERYROW_API_KEY" not in os.environ:
    os.environ["EVERYROW_API_KEY"] = "your-api-key-here"  # Get one at everyrow.io

Setup

First, let's set up our imports and create helper functions for measuring costs and timing.

In [19]:
import asyncio
import time
from dataclasses import dataclass
from typing import Literal
from dotenv import load_dotenv
load_dotenv()

import pandas as pd
import numpy as np
from everyrow import create_session, get_billing_balance
from everyrow.ops import merge
In [20]:
@dataclass
class ExperimentResult:
    """Results from a merge experiment."""
    name: str
    rows: int
    cost_dollars: float
    duration_seconds: float
    accuracy_pct: float | None = None
    
    def __repr__(self):
        acc = f", accuracy={self.accuracy_pct:.1f}%" if self.accuracy_pct is not None else ""
        cost_per_row = self.cost_dollars / self.rows if self.rows > 0 else 0
        return (f"ExperimentResult({self.name}: {self.rows} rows, "
                f"${self.cost_dollars:.4f} (${cost_per_row:.5f}/row), "
                f"{self.duration_seconds:.1f}s{acc})")

# Store all experiment results for final comparison
all_results: list[ExperimentResult] = []
In [ ]:
async def measure_merge(
    name: str,
    task: str,
    left_table: pd.DataFrame,
    right_table: pd.DataFrame,
    merge_on_left: str | None = None,
    merge_on_right: str | None = None,
    expected_matches: dict[str, str] | None = None,
    use_web_search: Literal['auto', 'yes', 'no'] | None = None,
) -> tuple[pd.DataFrame, ExperimentResult]:
    """
    Run a merge operation and measure its cost, duration, and accuracy.
    
    Args:
        name: Experiment name for logging
        task: The merge task description
        left_table: Left DataFrame (all rows preserved)
        right_table: Right DataFrame to match from
        merge_on_left: Column name in left table (optional)
        merge_on_right: Column name in right table (optional)
        expected_matches: Dict mapping left values to expected right values (for accuracy)
        use_web_search: "auto", "yes", or "no"
    
    Returns:
        Tuple of (result DataFrame, ExperimentResult)
    """
    # Measure billing before
    balance_before = await get_billing_balance()
    start_time = time.time()
    
    # Run the merge inside a named session
    async with create_session(name=name) as session:
        print(f"Session URL: {session.get_url()}")
        result = await merge(
            task=task,
            session=session,
            left_table=left_table,
            right_table=right_table,
            merge_on_left=merge_on_left,
            merge_on_right=merge_on_right,
            use_web_search=use_web_search,
        )
    
    # Measure billing after
    end_time = time.time()
    await asyncio.sleep(60) # wait for billing to update
    balance_after = await get_billing_balance()
    
    cost = balance_before.current_balance_dollars - balance_after.current_balance_dollars
    duration = end_time - start_time
    
    # Calculate accuracy if expected matches provided
    accuracy = None
    if expected_matches and merge_on_left and merge_on_right:
        correct = 0
        total = len(expected_matches)
        for left_val, expected_right in expected_matches.items():
            row = result.data[result.data[merge_on_left] == left_val]
            if len(row) > 0:
                actual_right = row[merge_on_right].iloc[0]
                if pd.notna(actual_right) and expected_right in str(actual_right):
                    correct += 1
        accuracy = (correct / total) * 100 if total > 0 else None
    
    exp_result = ExperimentResult(
        name=name,
        rows=len(left_table),
        cost_dollars=cost,
        duration_seconds=duration,
        accuracy_pct=accuracy,
    )
    all_results.append(exp_result)
    
    print(f"\n{exp_result}")
    return result.data, exp_result

Experiment 1: Exact String Matches Only

Let's start with the simplest case: both tables use identical strings. This should be instant and free since the system can do a simple string comparison.

We'll create a realistic scenario: matching a list of Fortune 500 companies to their revenue data.

In [40]:
# Fortune 500-style company data with EXACT matching names
companies_exact = pd.DataFrame([
    {"company": "Apple Inc.", "sector": "Technology"},
    {"company": "Microsoft Corporation", "sector": "Technology"},
    {"company": "Amazon.com Inc.", "sector": "Consumer Cyclical"},
    {"company": "Alphabet Inc.", "sector": "Technology"},
    {"company": "Meta Platforms Inc.", "sector": "Technology"},
    {"company": "Tesla Inc.", "sector": "Consumer Cyclical"},
    {"company": "NVIDIA Corporation", "sector": "Technology"},
    {"company": "JPMorgan Chase & Co.", "sector": "Financial Services"},
    {"company": "Johnson & Johnson", "sector": "Healthcare"},
    {"company": "Visa Inc.", "sector": "Financial Services"},
])

revenue_exact = pd.DataFrame([
    {"company_name": "Apple Inc.", "revenue_billions": 394},
    {"company_name": "Microsoft Corporation", "revenue_billions": 211},
    {"company_name": "Amazon.com Inc.", "revenue_billions": 574},
    {"company_name": "Alphabet Inc.", "revenue_billions": 307},
    {"company_name": "Meta Platforms Inc.", "revenue_billions": 134},
    {"company_name": "Tesla Inc.", "revenue_billions": 96},
    {"company_name": "NVIDIA Corporation", "revenue_billions": 61},
    {"company_name": "JPMorgan Chase & Co.", "revenue_billions": 158},
    {"company_name": "Johnson & Johnson", "revenue_billions": 95},
    {"company_name": "Visa Inc.", "revenue_billions": 32},
])

expected_exact = {row["company"]: row["company"] for _, row in companies_exact.iterrows()}

print(f"Left table: {len(companies_exact)} rows")
print(f"Right table: {len(revenue_exact)} rows")
companies_exact.head(3)
Left table: 10 rows
Right table: 10 rows
Out[40]:
company sector
0 Apple Inc. Technology
1 Microsoft Corporation Technology
2 Amazon.com Inc. Consumer Cyclical
In [41]:
result_exact, stats_exact = await measure_merge(
    name="Exact matches only",
    task="Match companies by name. Names are identical in both tables.",
    left_table=companies_exact,
    right_table=revenue_exact,
    merge_on_left="company",
    merge_on_right="company_name",
    expected_matches={c: c for c in companies_exact["company"]},
)

result_exact[["company", "sector", "company_name", "revenue_billions"]].head(5)
ExperimentResult(Exact matches only: 10 rows, $0.0000 ($0.00000/row), 12.9s, accuracy=100.0%)
Out[41]:
company sector company_name revenue_billions
0 Apple Inc. Technology Apple Inc. 394
1 Microsoft Corporation Technology Microsoft Corporation 211
2 Amazon.com Inc. Consumer Cyclical Amazon.com Inc. 574
3 Alphabet Inc. Technology Alphabet Inc. 307
4 Meta Platforms Inc. Technology Meta Platforms Inc. 134
In [42]:
result_exact, stats_exact = await measure_merge(
    name="Exact matches only",
    task="Match companies by name. Names are identical in both tables.",
    left_table=companies_exact,
    right_table=revenue_exact.iloc[:-2],
    merge_on_left="company",
    merge_on_right="company_name",
    expected_matches={c: c for c in companies_exact["company"].iloc[:-2]},
)

result_exact[["company", "sector", "company_name", "revenue_billions"]].head(5)
ExperimentResult(Exact matches only: 10 rows, $0.0100 ($0.00100/row), 31.4s, accuracy=100.0%)
Out[42]:
company sector company_name revenue_billions
0 Apple Inc. Technology Apple Inc. 394.0
1 Microsoft Corporation Technology Microsoft Corporation 211.0
2 Amazon.com Inc. Consumer Cyclical Amazon.com Inc. 574.0
3 Alphabet Inc. Technology Alphabet Inc. 307.0
4 Meta Platforms Inc. Technology Meta Platforms Inc. 134.0
In [43]:
result_exact
Out[43]:
company sector company_name revenue_billions research
0 Apple Inc. Technology Apple Inc. 394.0 {'company_name': 'This row was matched due to ...
1 Microsoft Corporation Technology Microsoft Corporation 211.0 {'company_name': 'This row was matched due to ...
2 Amazon.com Inc. Consumer Cyclical Amazon.com Inc. 574.0 {'company_name': 'This row was matched due to ...
3 Alphabet Inc. Technology Alphabet Inc. 307.0 {'company_name': 'This row was matched due to ...
4 Meta Platforms Inc. Technology Meta Platforms Inc. 134.0 {'company_name': 'This row was matched due to ...
5 Tesla Inc. Consumer Cyclical Tesla Inc. 96.0 {'company_name': 'This row was matched due to ...
6 NVIDIA Corporation Technology NVIDIA Corporation 61.0 {'company_name': 'This row was matched due to ...
7 JPMorgan Chase & Co. Financial Services JPMorgan Chase & Co. 158.0 {'company_name': 'This row was matched due to ...
8 Johnson & Johnson Healthcare NaN NaN NaN
9 Visa Inc. Financial Services NaN NaN NaN

As expected: zero cost for exact string matches; the cascade never needed to invoke LLM reasoning. The second run, where two companies had no counterpart in the right table, took longer and incurred a small charge, which suggests the cascade escalated the unmatched rows before leaving them as NaN.


Experiment 2: Exact + Fuzzy Matches (Typos & Variations)

Real-world data is messy. Let's introduce realistic variations:

  • Typos: "Microsft" instead of "Microsoft"
  • Case differences: "APPLE INC" vs "Apple Inc."
  • Missing punctuation: "Johnson Johnson" vs "Johnson & Johnson"
  • Spacing issues: "JP Morgan" vs "JPMorgan"

These should all be handled by fuzzy matching, which is still free.
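As a rough sanity check with the standard library (not everyrow's internal matcher), plain difflib similarity on normalized strings already scores these variations highly:

```python
import re
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalize case and strip punctuation before comparing
    clean = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()

pairs = [
    ("APPLE INC", "Apple Inc."),                        # case difference
    ("Microsft Corporation", "Microsoft Corporation"),  # typo
    ("Johnson Johnson", "Johnson & Johnson"),           # missing punctuation
    ("JP Morgan Chase", "JPMorgan Chase & Co."),        # spacing
]

for left, right in pairs:
    print(f"{left!r} -> {right!r}: {similarity(left, right):.2f}")
```

All four pairs score well above a typical fuzzy-match threshold, so none of them should need the LLM tier.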

In [44]:
# Same companies but with realistic typos and variations
companies_fuzzy = pd.DataFrame([
    {"company": "APPLE INC", "sector": "Technology"},  # Case difference
    {"company": "Microsft Corporation", "sector": "Technology"},  # Typo
    {"company": "Amazon Inc", "sector": "Consumer Cyclical"},  # Missing .com
    {"company": "Alphabet", "sector": "Technology"},  # Missing Inc.
    {"company": "Meta Platforms", "sector": "Technology"},  # Missing Inc.
    {"company": "Telsa Inc.", "sector": "Consumer Cyclical"},  # Typo (Telsa)
    {"company": "Nvidia Corp", "sector": "Technology"},  # Abbreviation
    {"company": "JP Morgan Chase", "sector": "Financial Services"},  # Spacing
    {"company": "Johnson Johnson", "sector": "Healthcare"},  # Missing &
    {"company": "Visa", "sector": "Financial Services"},  # Missing Inc.
])

# Expected matches (left company -> right company_name)
expected_fuzzy = {
    "APPLE INC": "Apple Inc.",
    "Microsft Corporation": "Microsoft Corporation",
    "Amazon Inc": "Amazon.com Inc.",
    "Alphabet": "Alphabet Inc.",
    "Meta Platforms": "Meta Platforms Inc.",
    "Telsa Inc.": "Tesla Inc.",
    "Nvidia Corp": "NVIDIA Corporation",
    "JP Morgan Chase": "JPMorgan Chase & Co.",
    "Johnson Johnson": "Johnson & Johnson",
    "Visa": "Visa Inc.",
}

print("Sample variations:")
for left, right in list(expected_fuzzy.items())[:5]:
    print(f"  '{left}' → '{right}'")
Sample variations:
  'APPLE INC' → 'Apple Inc.'
  'Microsft Corporation' → 'Microsoft Corporation'
  'Amazon Inc' → 'Amazon.com Inc.'
  'Alphabet' → 'Alphabet Inc.'
  'Meta Platforms' → 'Meta Platforms Inc.'
In [45]:
result_fuzzy, stats_fuzzy = await measure_merge(
    name="Exact + fuzzy (typos)",
    task="Match companies by name. Handle typos, case differences, and minor variations.",
    left_table=companies_fuzzy,
    right_table=revenue_exact,
    merge_on_left="company",
    merge_on_right="company_name",
    expected_matches=expected_fuzzy,
)

result_fuzzy[["company", "company_name", "revenue_billions"]]
ExperimentResult(Exact + fuzzy (typos): 10 rows, $0.0000 ($0.00000/row), 19.6s, accuracy=100.0%)
Out[45]:
company company_name revenue_billions
0 APPLE INC Apple Inc. 394
1 Microsft Corporation Microsoft Corporation 211
2 Amazon Inc Amazon.com Inc. 574
3 Alphabet Alphabet Inc. 307
4 Meta Platforms Meta Platforms Inc. 134
5 Telsa Inc. Tesla Inc. 96
6 Nvidia Corp NVIDIA Corporation 61
7 JP Morgan Chase JPMorgan Chase & Co. 158
8 Johnson Johnson Johnson & Johnson 95
9 Visa Visa Inc. 32

Still zero (or near-zero) cost! Fuzzy string matching handles all these variations without needing LLM reasoning.


Experiment 3: Mostly Exact + Few LLM Matches (Semantic Relationships)

Now let's introduce cases that require semantic understanding:

  • Subsidiaries: "Instagram" should match "Meta Platforms"
  • Parent companies: "YouTube" should match "Alphabet"
  • Acquisitions: "LinkedIn" should match "Microsoft"
  • Regional names: "MSD" is Merck's name outside the US

These can't be solved by string matching alone—the LLM needs to know that Instagram is owned by Meta.

Hypothesis: With mostly exact matches and only a few semantic ones, costs should be minimal since only the semantic matches invoke the LLM.
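The same stdlib check shows why string similarity can't bridge these pairs: the scores are far below anything a fuzzy matcher would accept (an illustration, not everyrow's internals).

```python
from difflib import SequenceMatcher

# Subsidiary/parent pairs: correct matches, but near-zero string similarity
semantic_pairs = [
    ("Instagram", "Meta Platforms Inc."),
    ("YouTube", "Alphabet Inc."),
    ("LinkedIn", "Microsoft Corporation"),
]

for left, right in semantic_pairs:
    score = SequenceMatcher(None, left.lower(), right.lower()).ratio()
    print(f"{left!r} vs {right!r}: {score:.2f}")
```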

In [46]:
# Mix of exact matches and semantic relationships
companies_semantic = pd.DataFrame([
    # Exact matches (7 rows - should be free)
    {"company": "Apple Inc.", "sector": "Technology"},
    {"company": "Microsoft Corporation", "sector": "Technology"},
    {"company": "Amazon.com Inc.", "sector": "Consumer Cyclical"},
    {"company": "Tesla Inc.", "sector": "Consumer Cyclical"},
    {"company": "NVIDIA Corporation", "sector": "Technology"},
    {"company": "JPMorgan Chase & Co.", "sector": "Financial Services"},
    {"company": "Visa Inc.", "sector": "Financial Services"},
    # Semantic matches (3 rows - require LLM)
    {"company": "Instagram", "sector": "Technology"},  # → Meta Platforms Inc.
    {"company": "YouTube", "sector": "Technology"},  # → Alphabet Inc.
    {"company": "WhatsApp", "sector": "Technology"},  # → Meta Platforms Inc.
])

expected_semantic = {
    "Apple Inc.": "Apple Inc.",
    "Microsoft Corporation": "Microsoft Corporation",
    "Amazon.com Inc.": "Amazon.com Inc.",
    "Tesla Inc.": "Tesla Inc.",
    "NVIDIA Corporation": "NVIDIA Corporation",
    "JPMorgan Chase & Co.": "JPMorgan Chase & Co.",
    "Visa Inc.": "Visa Inc.",
    "Instagram": "Meta Platforms Inc.",
    "YouTube": "Alphabet Inc.",
    "WhatsApp": "Meta Platforms Inc.",
}

print(f"Total rows: {len(companies_semantic)}")
print(f"  - Exact matches expected: 7 (free)")
print(f"  - Semantic matches expected: 3 (LLM required)")
Total rows: 10
  - Exact matches expected: 7 (free)
  - Semantic matches expected: 3 (LLM required)
In [47]:
result_semantic, stats_semantic = await measure_merge(
    name="Mostly exact + semantic",
    task="""Match companies. Note:
    - Instagram and WhatsApp are owned by Meta Platforms
    - YouTube is owned by Alphabet (Google's parent)
    """,
    left_table=companies_semantic,
    right_table=revenue_exact,
    merge_on_left="company",
    merge_on_right="company_name",
    expected_matches=expected_semantic,
)

result_semantic[["company", "company_name", "revenue_billions"]]
ExperimentResult(Mostly exact + semantic: 10 rows, $0.0300 ($0.00300/row), 67.3s, accuracy=100.0%)
Out[47]:
company company_name revenue_billions
0 Apple Inc. Apple Inc. 394
1 Microsoft Corporation Microsoft Corporation 211
2 Amazon.com Inc. Amazon.com Inc. 574
3 Tesla Inc. Tesla Inc. 96
4 NVIDIA Corporation NVIDIA Corporation 61
5 JPMorgan Chase & Co. JPMorgan Chase & Co. 158
6 Visa Inc. Visa Inc. 32
7 Instagram Meta Platforms Inc. 134
8 YouTube Alphabet Inc. 307
9 WhatsApp Meta Platforms Inc. 134
In [48]:
# Calculate estimated per-semantic-match cost
if stats_semantic.cost_dollars > 0:
    semantic_matches = 3  # Instagram, YouTube, WhatsApp
    cost_per_llm_match = stats_semantic.cost_dollars / semantic_matches
    print(f"Cost per LLM match: ${cost_per_llm_match:.4f}")
    print(f"Total cost for {semantic_matches} LLM matches: ${stats_semantic.cost_dollars:.4f}")
    print(f"\nThe 7 exact matches were FREE.")
Cost per LLM match: $0.0100
Total cost for 3 LLM matches: $0.0300

The 7 exact matches were FREE.

This demonstrates the cascade in action: 70% of rows matched for free (exact matches), while only 30% required LLM reasoning.


Experiment 4: Non-Trivial Matching (Breakdown by Match Type)

Let's test a more realistic scenario where we need to match pharmaceutical company subsidiaries and regional variations to their parent companies. This is a common real-world challenge in clinical trial data.

We'll create data that tests the full cascade:

  • Exact matches: Identical names
  • Fuzzy matches: Typos and variations
  • LLM matches: Subsidiaries, regional names, abbreviations
In [49]:
# Clinical trial sponsors (left table)
trial_sponsors = pd.DataFrame([
    # Exact matches (should be free)
    {"sponsor": "Pfizer Inc.", "trial_count": 150},
    {"sponsor": "Novartis AG", "trial_count": 120},
    {"sponsor": "Sanofi S.A.", "trial_count": 100},
    {"sponsor": "AstraZeneca PLC", "trial_count": 95},
    
    # Fuzzy matches (should still be free)
    {"sponsor": "Pfzer Inc", "trial_count": 5},  # Typo
    {"sponsor": "NOVARTIS", "trial_count": 8},  # Case
    {"sponsor": "Astra Zeneca", "trial_count": 12},  # Spacing
    
    # LLM matches - subsidiaries and regional names
    {"sponsor": "Genentech", "trial_count": 45},  # → Roche
    {"sponsor": "MSD", "trial_count": 80},  # → Merck (regional name)
    {"sponsor": "BMS", "trial_count": 60},  # → Bristol-Myers Squibb
    {"sponsor": "AbbVie", "trial_count": 70},  # Was part of Abbott
    {"sponsor": "Genzyme", "trial_count": 25},  # → Sanofi (acquired)
    {"sponsor": "Medimmune", "trial_count": 20},  # → AstraZeneca
])

# Parent pharma companies (right table)
pharma_parents = pd.DataFrame([
    {"company": "Pfizer Inc.", "hq_country": "USA", "market_cap_b": 250},
    {"company": "Novartis AG", "hq_country": "Switzerland", "market_cap_b": 200},
    {"company": "Roche Holding AG", "hq_country": "Switzerland", "market_cap_b": 280},
    {"company": "Merck & Co.", "hq_country": "USA", "market_cap_b": 270},
    {"company": "Bristol-Myers Squibb", "hq_country": "USA", "market_cap_b": 150},
    {"company": "AbbVie Inc.", "hq_country": "USA", "market_cap_b": 260},
    {"company": "Sanofi S.A.", "hq_country": "France", "market_cap_b": 130},
    {"company": "AstraZeneca PLC", "hq_country": "UK", "market_cap_b": 220},
])

expected_pharma = {
    "Pfizer Inc.": "Pfizer", "Novartis AG": "Novartis", "Sanofi S.A.": "Sanofi",
    "AstraZeneca PLC": "AstraZeneca", "Pfzer Inc": "Pfizer", "NOVARTIS": "Novartis",
    "Astra Zeneca": "AstraZeneca", "Genentech": "Roche", "MSD": "Merck",
    "BMS": "Bristol-Myers", "AbbVie": "AbbVie", "Genzyme": "Sanofi",
    "Medimmune": "AstraZeneca",
}

print(f"Total sponsor records: {len(trial_sponsors)}")
print(f"\nExpected match breakdown:")
print(f"  - Exact matches: 4 rows (free)")
print(f"  - Fuzzy matches: 3 rows (free)")
print(f"  - LLM matches: 6 rows (charged)")
Total sponsor records: 13

Expected match breakdown:
  - Exact matches: 4 rows (free)
  - Fuzzy matches: 3 rows (free)
  - LLM matches: 6 rows (charged)
In [50]:
result_pharma, stats_pharma = await measure_merge(
    name="Pharma non-trivial",
    task="""Match clinical trial sponsors to their parent pharmaceutical company.
    
    Key relationships to know:
    - Genentech is a subsidiary of Roche
    - MSD is Merck's name outside the United States
    - BMS is the abbreviation for Bristol-Myers Squibb
    - Genzyme was acquired by Sanofi
    - MedImmune is a subsidiary of AstraZeneca
    """,
    left_table=trial_sponsors,
    right_table=pharma_parents,
    merge_on_left="sponsor",
    merge_on_right="company",
    expected_matches=expected_pharma,
)

result_pharma[["sponsor", "trial_count", "company", "hq_country", "market_cap_b"]]
ExperimentResult(Pharma non-trivial: 13 rows, $0.0000 ($0.00000/row), 51.6s, accuracy=61.5%)
Out[50]:
sponsor trial_count company hq_country market_cap_b
0 Pfizer Inc. 150 Pfizer Inc. USA 250.0
1 Novartis AG 120 Novartis AG Switzerland 200.0
2 Sanofi S.A. 100 Sanofi S.A. France 130.0
3 AstraZeneca PLC 95 AstraZeneca PLC UK 220.0
4 Pfzer Inc 5 NaN NaN NaN
5 NOVARTIS 8 NaN NaN NaN
6 Astra Zeneca 12 NaN NaN NaN
7 Genentech 45 Roche Holding AG Switzerland 280.0
8 MSD 80 Merck & Co. USA 270.0
9 BMS 60 Bristol-Myers Squibb USA 150.0
10 AbbVie 70 AbbVie Inc. USA 260.0
11 Genzyme 25 NaN NaN NaN
12 Medimmune 20 NaN NaN NaN
In [51]:
# Expected match-type breakdown (based on how the test data was constructed)
exact_matches = 4
fuzzy_matches = 3  
llm_matches = 6
total = exact_matches + fuzzy_matches + llm_matches

print("Match Type Breakdown:")
print(f"  Exact matches:  {exact_matches:2d} ({exact_matches/total*100:.0f}%) - FREE")
print(f"  Fuzzy matches:  {fuzzy_matches:2d} ({fuzzy_matches/total*100:.0f}%) - FREE")
print(f"  LLM matches:    {llm_matches:2d} ({llm_matches/total*100:.0f}%) - CHARGED")
print(f"  ─────────────────────")
print(f"  Total:         {total:2d}")
print(f"\nFree matches: {(exact_matches + fuzzy_matches)/total*100:.0f}%")
print(f"Paid matches: {llm_matches/total*100:.0f}%")

if stats_pharma.cost_dollars > 0:
    print(f"\nActual cost: ${stats_pharma.cost_dollars:.4f}")
    print(f"Cost per LLM match: ${stats_pharma.cost_dollars/llm_matches:.4f}")
Match Type Breakdown:
  Exact matches:   4 (31%) - FREE
  Fuzzy matches:   3 (23%) - FREE
  LLM matches:     6 (46%) - CHARGED
  ─────────────────────
  Total:         13

Free matches: 54%
Paid matches: 46%

In this run the results were messier than the hypothesis: the exact matches and several subsidiary matches (Genentech, MSD, BMS, AbbVie) came through at zero measured cost, but the typo variants and two acquisitions (Genzyme, MedImmune) came back unmatched, for 61.5% accuracy. The broader point still holds: cost scales with the number of rows requiring semantic understanding, not the total row count. But harder inputs can also surface as non-matches rather than charges, so spot-check results on messy data.


Experiment 5: LLM-Only Matching (No merge_on Parameters)

What happens when you don't specify which columns to match? The system must:

  1. Analyze both tables to guess which columns are relevant
  2. Use LLM reasoning for every row

This is more expensive but useful when:

  • You're not sure which columns should match
  • Multiple columns might be relevant
  • The matching logic is complex
In [ ]:
# Contact data without clear merge keys
contacts = pd.DataFrame([
    {"name": "John Smith", "email": "jsmith@acme.com", "title": "VP Sales"},
    {"name": "Sarah Johnson", "email": "sarah.j@techcorp.io", "title": "CTO"},
    {"name": "Mike Chen", "email": "m.chen@globalinc.com", "title": "Director"},
    {"name": "Emily Davis", "email": "emily@startup.co", "title": "CEO"},
    {"name": "Tom Wilson", "email": "twilson@bigco.com", "title": "Manager"},
])

# Company data to match against
companies = pd.DataFrame([
    {"company_name": "Acme Corporation", "domain": "acme.com", "industry": "Manufacturing"},
    {"company_name": "TechCorp Solutions", "domain": "techcorp.io", "industry": "Software"},
    {"company_name": "Global Industries Inc", "domain": "globalinc.com", "industry": "Consulting"},
    {"company_name": "Startup Co", "domain": "startup.co", "industry": "Technology"},
    {"company_name": "BigCo Enterprises", "domain": "bigco.com", "industry": "Finance"},
])

print("Contacts:")
print(contacts.to_string(index=False))
print("\nCompanies:")
print(companies.to_string(index=False))
In [ ]:
# Run WITHOUT specifying merge_on columns
result_nokeys, stats_nokeys = await measure_merge(
    name="LLM-only (no merge_on)",
    task="""Match each contact to their company.
    Use the email domain to identify which company each person works for.
    For example, jsmith@acme.com works at Acme Corporation.
    """,
    left_table=contacts,
    right_table=companies,
    # Note: No merge_on_left or merge_on_right specified!
)

result_nokeys
In [ ]:
# Compare: same data but WITH merge hints
result_withkeys, stats_withkeys = await measure_merge(
    name="With merge_on hints",
    task="""Match contacts to companies by email domain.""",
    left_table=contacts,
    right_table=companies,
    merge_on_left="email",
    merge_on_right="domain",
)

print(f"\nComparison:")
print(f"  Without merge_on: ${stats_nokeys.cost_dollars:.4f}, {stats_nokeys.duration_seconds:.1f}s")
print(f"  With merge_on:    ${stats_withkeys.cost_dollars:.4f}, {stats_withkeys.duration_seconds:.1f}s")

if stats_nokeys.cost_dollars > 0 and stats_withkeys.cost_dollars > 0:
    ratio = stats_nokeys.cost_dollars / stats_withkeys.cost_dollars
    print(f"\n  LLM-only is {ratio:.1f}x more expensive")

Takeaway: Providing merge_on hints significantly reduces costs when the matching columns are known.
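When the key is cleanly derivable, you can also do the extraction yourself before merging. The hypothetical sketch below recreates mini versions of the contacts/companies frames and computes the domain column locally, so an ordinary pandas join suffices:

```python
import pandas as pd

# Hypothetical mini versions of the contacts/companies tables above
contacts = pd.DataFrame([
    {"name": "John Smith", "email": "jsmith@acme.com"},
    {"name": "Sarah Johnson", "email": "sarah.j@techcorp.io"},
])
companies = pd.DataFrame([
    {"company_name": "Acme Corporation", "domain": "acme.com"},
    {"company_name": "TechCorp Solutions", "domain": "techcorp.io"},
])

# Derive the merge key locally: everything after the "@" in the email
contacts["domain"] = contacts["email"].str.split("@").str[1]

# With an explicit shared key, a plain pandas left join does the work for free
joined = contacts.merge(companies, on="domain", how="left")
print(joined[["name", "company_name"]])
```

If the relationship is messier than a clean domain (aliases, subsidiaries, vanity domains), that is exactly when the LLM tier earns its cost.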


Experiment 6: Scaling Analysis

How do costs scale as we increase:

  1. Number of rows (10 → 50 → 100 → 200)
  2. Content per row (more columns, longer text)

For this experiment, we'll generate synthetic data with controllable characteristics and measure the cost/time relationship.
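Before running a large merge, the intro table's approximate prices (~$0.002/row for LLM reasoning, ~$0.01/row for web search) allow a back-of-envelope estimate. This is a rough sketch; the actual tier fractions depend on your data.

```python
def estimate_merge_cost(n_rows: int, frac_llm: float, frac_web: float = 0.0,
                        llm_price: float = 0.002, web_price: float = 0.01) -> float:
    """Rough estimate: exact/fuzzy rows are free, only LLM/web rows are charged.
    Prices default to the approximate figures quoted in the intro table."""
    return n_rows * (frac_llm * llm_price + frac_web * web_price)

# e.g. 10,000 rows where ~30% need LLM reasoning and ~5% need web search
print(f"${estimate_merge_cost(10_000, frac_llm=0.30, frac_web=0.05):.2f}")  # $11.00
```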

In [ ]:
def generate_company_data(n_rows: int, add_description: bool = False) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Generate synthetic company data for scaling tests.
    
    Returns left_table (with variations) and right_table (canonical names).
    Mix includes roughly equal thirds of exact, fuzzy, and semantic variants.
    """
    base_companies = [
        ("Apple Inc.", "AAPL", "Technology"),
        ("Microsoft Corporation", "MSFT", "Technology"),
        ("Amazon.com Inc.", "AMZN", "E-commerce"),
        ("Alphabet Inc.", "GOOGL", "Technology"),
        ("Meta Platforms Inc.", "META", "Technology"),
        ("Tesla Inc.", "TSLA", "Automotive"),
        ("NVIDIA Corporation", "NVDA", "Technology"),
        ("JPMorgan Chase & Co.", "JPM", "Finance"),
        ("Johnson & Johnson", "JNJ", "Healthcare"),
        ("Visa Inc.", "V", "Finance"),
    ]
    
    # Variations for left table
    variations = {
        "Apple Inc.": ["Apple Inc.", "APPLE INC", "Apple"],  # exact, case, short
        "Microsoft Corporation": ["Microsoft Corporation", "Microsft Corp", "MSFT"],
        "Amazon.com Inc.": ["Amazon.com Inc.", "Amazon Inc", "AWS"],  # exact, fuzzy, semantic
        "Alphabet Inc.": ["Alphabet Inc.", "Alphabet", "Google"],
        "Meta Platforms Inc.": ["Meta Platforms Inc.", "Meta Platforms", "Facebook"],
        "Tesla Inc.": ["Tesla Inc.", "Telsa Inc", "Tesla Motors"],
        "NVIDIA Corporation": ["NVIDIA Corporation", "Nvidia Corp", "GeForce"],
        "JPMorgan Chase & Co.": ["JPMorgan Chase & Co.", "JP Morgan", "Chase Bank"],
        "Johnson & Johnson": ["Johnson & Johnson", "Johnson Johnson", "J&J"],
        "Visa Inc.": ["Visa Inc.", "Visa", "Visa Card"],
    }
    
    left_rows = []
    for i in range(n_rows):
        base = base_companies[i % len(base_companies)]
        company_name = base[0]
        var_list = variations[company_name]
        # Cycle through variant types in roughly equal thirds
        var_idx = i % 3  # 0=exact, 1=fuzzy, 2=semantic
        var_name = var_list[min(var_idx, len(var_list)-1)]
        
        row = {
            "company": var_name,
            "record_id": f"REC-{i:04d}",
            "sector": base[2],
        }
        if add_description:
            row["description"] = f"Company record {i} for {var_name}. " * 5
        left_rows.append(row)
    
    right_rows = [
        {"company_name": c[0], "ticker": c[1], "industry": c[2], "employees": (i+1)*10000}
        for i, c in enumerate(base_companies)
    ]
    
    return pd.DataFrame(left_rows), pd.DataFrame(right_rows)

# Test the generator
test_left, test_right = generate_company_data(10)
print("Sample left table:")
print(test_left.head())

6.1 Scaling with Number of Rows

Let's measure how costs grow as we increase row count.

In [ ]:
row_counts = [10, 30, 50, 100]
scaling_results = []

for n_rows in row_counts:
    left_df, right_df = generate_company_data(n_rows)
    
    _, result = await measure_merge(
        name=f"Scale test: {n_rows} rows",
        task="""Match companies. Handle variations like:
        - Google is Alphabet's main product
        - Facebook is now Meta Platforms
        - AWS is part of Amazon
        - Chase Bank is part of JPMorgan
        """,
        left_table=left_df,
        right_table=right_df,
        merge_on_left="company",
        merge_on_right="company_name",
    )
    scaling_results.append(result)
In [ ]:
# Analyze scaling results
print("\n" + "="*60)
print("ROW SCALING ANALYSIS")
print("="*60)
print(f"{'Rows':<10} {'Cost':>10} {'Time (s)':>10} {'$/row':>12}")
print("-"*42)

for r in scaling_results:
    cost_per_row = r.cost_dollars / r.rows if r.rows > 0 else 0
    print(f"{r.rows:<10} ${r.cost_dollars:>8.4f} {r.duration_seconds:>10.1f} ${cost_per_row:>10.5f}")

# Check if cost scales linearly
if len(scaling_results) >= 2 and scaling_results[0].cost_dollars > 0:
    first = scaling_results[0]
    last = scaling_results[-1]
    row_ratio = last.rows / first.rows
    cost_ratio = last.cost_dollars / first.cost_dollars if first.cost_dollars > 0 else 0
    print(f"\nScaling factor: {row_ratio:.0f}x rows → {cost_ratio:.1f}x cost")
    if cost_ratio > 0:
        print(f"Cost scales {'linearly' if 0.8 < cost_ratio/row_ratio < 1.2 else 'sub-linearly' if cost_ratio/row_ratio < 0.8 else 'super-linearly'}")

6.2 Scaling with Content per Row

Does adding more columns or longer text fields affect costs?

In [ ]:
# Compare: minimal columns vs rich content
n_rows = 20

# Minimal content
left_minimal, right_minimal = generate_company_data(n_rows, add_description=False)
_, result_minimal = await measure_merge(
    name=f"Minimal content ({n_rows} rows)",
    task="Match companies. Google→Alphabet, Facebook→Meta, AWS→Amazon.",
    left_table=left_minimal,
    right_table=right_minimal,
    merge_on_left="company",
    merge_on_right="company_name",
)

# Rich content
left_rich, right_rich = generate_company_data(n_rows, add_description=True)
_, result_rich = await measure_merge(
    name=f"Rich content ({n_rows} rows)",
    task="Match companies. Google→Alphabet, Facebook→Meta, AWS→Amazon.",
    left_table=left_rich,
    right_table=right_rich,
    merge_on_left="company",
    merge_on_right="company_name",
)

print(f"\nContent comparison ({n_rows} rows):")
print(f"  Minimal ({len(left_minimal.columns)} cols): ${result_minimal.cost_dollars:.4f}, {result_minimal.duration_seconds:.1f}s")
print(f"  Rich ({len(left_rich.columns)} cols):    ${result_rich.cost_dollars:.4f}, {result_rich.duration_seconds:.1f}s")

Summary: Cost & Performance Findings¶

Let's compile all our experimental results into a final comparison.

In [ ]:
# Create summary DataFrame
summary_data = []
for r in all_results:
    cost_per_row = r.cost_dollars / r.rows if r.rows > 0 else 0
    summary_data.append({
        "Experiment": r.name,
        "Rows": r.rows,
        "Cost ($)": f"${r.cost_dollars:.4f}",
        "Time (s)": f"{r.duration_seconds:.1f}",
        "$/Row": f"${cost_per_row:.5f}",
        "Accuracy": f"{r.accuracy_pct:.0f}%" if r.accuracy_pct else "N/A",
    })

summary_df = pd.DataFrame(summary_data)
print("\n" + "="*80)
print("COMPLETE EXPERIMENT SUMMARY")
print("="*80)
print(summary_df.to_string(index=False))
In [ ]:
# Calculate key findings
total_cost = sum(r.cost_dollars for r in all_results)
total_rows = sum(r.rows for r in all_results)
total_time = sum(r.duration_seconds for r in all_results)

# Find zero-cost experiments
zero_cost = [r for r in all_results if r.cost_dollars < 0.001]
low_cost = [r for r in all_results if 0.001 <= r.cost_dollars < 0.01]

print("\n" + "="*60)
print("KEY FINDINGS")
print("="*60)
print(f"\nTotal rows processed: {total_rows}")
print(f"Total cost: ${total_cost:.4f}")
print(f"Total time: {total_time:.1f}s")
print(f"Average cost per row: ${total_cost/total_rows:.5f}")

print(f"\nExperiments with zero/near-zero cost: {len(zero_cost)}")
for r in zero_cost:
    print(f"  - {r.name}")

print(f"\nCost Optimization Strategies:")
print(f"  1. Use merge_on parameters when you know the columns")
print(f"  2. Clean data for fuzzy matching (typos are free to resolve)")
print(f"  3. Provide context in task description for semantic matches")
print(f"  4. LLM costs scale with semantic matches, not total rows")

Conclusion¶

The everyrow.merge() operation uses a cost-optimized cascade that makes intelligent merging surprisingly affordable:

  1. Exact and fuzzy matches are free - typos, case differences, and minor variations don't cost anything

  2. Only semantic matches incur costs - the LLM only processes rows that truly need reasoning (subsidiaries, acquisitions, regional names)

  3. Providing merge_on hints reduces costs - when you know which columns to match, specify them

  4. Costs scale with complexity, not size - a 1000-row dataset with clean data costs less than a 100-row dataset requiring semantic reasoning

For most real-world use cases, the majority of matches fall into the free tiers, making intelligent merging practical even for large datasets.
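The cascade above can also be approximated locally: since only semantic matches incur LLM costs, a quick pre-screen with Python's difflib can estimate how many rows would fall through the free tiers before you run a merge. This is a sketch of the idea, not the SDK's actual matching logic — the cutoff and tiering below are assumptions:

```python
import difflib

# Hypothetical cost pre-screen: count left-table keys that would fall through
# the free exact/fuzzy tiers to the paid semantic tier. Illustrative only;
# the SDK's real cascade and thresholds may differ.
def estimate_semantic_rows(left_keys, right_keys, cutoff=0.8):
    right_set = set(right_keys)
    semantic = []
    for key in left_keys:
        if key in right_set:  # exact tier (free)
            continue
        if difflib.get_close_matches(key, right_keys, n=1, cutoff=cutoff):
            continue          # fuzzy tier (free)
        semantic.append(key)  # would need LLM reasoning
    return semantic

left = ["Google", "Gooogle", "Facebook", "AWS"]
right = ["Google", "Alphabet", "Meta Platforms", "Amazon"]
estimate_semantic_rows(left, right)  # → ["Facebook", "AWS"]
```

Here "Google" matches exactly, the typo "Gooogle" resolves in the fuzzy tier, and only the subsidiary/rename cases would reach the LLM — mirroring why cost tracks complexity rather than row count.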

In [64]:
unicorn_companies = pd.read_csv("~/Downloads/unicorn_companies.csv")
investment_vcs = pd.read_csv("/Users/peter/Downloads/investments_VC.csv")

# Baseline: plain exact inner join on company name
merged_unicorns = pd.merge(
    unicorn_companies,
    investment_vcs,
    left_on="Company",
    right_on="name",
    how="inner",
    suffixes=("_unicorn", "_vc"),
)

result_exact, stats_exact = await measure_merge(
    name="crunchbase merge",
    task="Match companies by name.",
    left_table=unicorn_companies.iloc[:5000],
    right_table=investment_vcs.iloc[:5000],
)

result_exact.head(5)
In [ ]:
df = pd.read_csv("/Users/peter/Downloads/fda_product_recalls.csv")
df[["recalling_firm_name", "product_type", "distribution_pattern", ""]]
Out[ ]:
fei_number recalling_firm_name product_type product_classification status distribution_pattern recalling_firm_city recalling_firm_state recalling_firm_country center_classification_date reason_for_recall product_description event_id event_classification product_id center recall_details
0 3.002602e+09 Lamb Weston Sales Food/Cosmetics Class I Ongoing Distributed in CA, IA, IL, KS, LA MO, MS, NM, ... Kennewick Washington United States 2023-04-21 Undeclared Wheat in foodservice item Hashbrown... G5300 Lamb's Supreme Hash Brown Patties, Froze... 92014 Class I 199418 CFSAN https://www.accessdata.fda.gov/scripts/ires/?P...
1 3.012438e+09 Fresh Express Incorpated Food/Cosmetics Class I Ongoing Product was shipped to the following states: F... Windermere Florida United States 2023-04-21 The firm was notified by one of their customer... Fresh EXPRESS Chopped Kit Caesar Romaine Lettu... 92068 Class I 199573 CFSAN https://www.accessdata.fda.gov/scripts/ires/?P...
2 3.012438e+09 Fresh Express Incorpated Food/Cosmetics Class I Ongoing Product was shipped to the following states: F... Windermere Florida United States 2023-04-21 The firm was notified by one of their customer... Fresh Express Chopped Kit Chipotle Cheddar TOT... 92068 Class I 199574 CFSAN https://www.accessdata.fda.gov/scripts/ires/?P...
3 3.012438e+09 Fresh Express Incorpated Food/Cosmetics Class I Ongoing Product was shipped to the following states: F... Windermere Florida United States 2023-04-21 The firm was notified by one of their customer... PREMIUM MAKOTO HONEY GINGER SALAD KIT TOTAL NE... 92068 Class I 199575 CFSAN https://www.accessdata.fda.gov/scripts/ires/?P...
4 1.000222e+09 Blood Bank Computer Systems, Inc Biologics Class II Terminated GA, DE, TX, MO, PA, CA, FL, KY, IA, MI, IL, an... Auburn Washington United States 2023-04-21 Blood Bank Computer Systems has discovered in ... ABO Wheels, Version 1.1.0 91219 Class II 197268 CBER https://www.accessdata.fda.gov/scripts/ires/?P...
In [75]:
from pandas import DataFrame

df["center_classification_date"] = pd.to_datetime(df["center_classification_date"])
df_2021: DataFrame = df[df["center_classification_date"] >= pd.Timestamp("2021-08-01")]

df_2021.head()
df_2021.tail()
df_2021.shape
Out[75]:
(9949, 17)
In [ ]:
from everyrow.ops import screen
async with create_session(name="FDA Recall Screening") as session:
    print(f"Session URL: {session.get_url()}")
    await screen(
        session=session,
        task="Find recalls of products that I might have used for my child born on 2021-08-01.",
        input=df_2021,
    )
In [22]:
import numpy as np
import pandas as pd
correct_df = pd.read_csv("/Users/peter/Downloads/merge_websites_correct_output_2246.csv")
def get_correct_website_for_name(name: str) -> str:
    return correct_df[correct_df["name"] == name]["personal_website_url"].values[0]
In [ ]:
async with create_session(name="Website Matching (n=100)") as session:
    print(f"Session URL: {session.get_url()}")
    result = await merge(
        session=session,
        task="Match each person to their website(s).",
        left_table=pd.read_csv("/Users/peter/Downloads/merge_websites_input_left_100.csv"),
        right_table=pd.read_csv("/Users/peter/Downloads/merge_websites_input_right_100.csv"),
    )
In [ ]:
print("num of matched rows:", len(result.data))
num_of_llm_matches = sum([1 if r["personal_website_url"] == 'This row was matched due to the information in both tables' else 0 for r in result.data.research])
num_of_web_search_matches = sum([1 if r["personal_website_url"].startswith('This row was matched due to the following information found in the web:') else 0 for r in result.data.research])
fraction_of_correct_matches = np.mean([1 if url == get_correct_website_for_name(name) else 0 for name, url in zip(result.data.name, result.data.personal_website_url)])
print("num of LLM matches:", num_of_llm_matches)
print("num of web search matches:", num_of_web_search_matches)
print("fraction of correct matches:", fraction_of_correct_matches)
num of matched rows: 100
num of LLM matches: 95
num of web search matches: 5
fraction of correct matches: 0.97
In [ ]:
for n in [200, 400, 800, 1600, 2246]:
    async with create_session(name=f"Website Matching (n={n})") as session:
        print(f"Session URL: {session.get_url()}")
        result = await merge(
            session=session,
            task="Match each person to their website(s).",
            left_table=pd.read_csv(f"/Users/peter/Downloads/merge_websites_input_left_{n}.csv"),
            right_table=pd.read_csv(f"/Users/peter/Downloads/merge_websites_input_right_{n}.csv"),
        )
    print(f"n={n}")
    print("num of matched rows:", len(result.data))
    num_of_llm_matches = sum([1 if r["personal_website_url"] == 'This row was matched due to the information in both tables' else 0 for r in result.data.research])
    num_of_web_search_matches = sum([1 if r["personal_website_url"].startswith('This row was matched due to the following information found in the web:') else 0 for r in result.data.research])
    fraction_of_correct_matches = np.mean([1 if url == get_correct_website_for_name(name) else 0 for name, url in zip(result.data.name, result.data.personal_website_url)])
    print("num of LLM matches:", num_of_llm_matches)
    print("num of web search matches:", num_of_web_search_matches)
    print("fraction of correct matches:", fraction_of_correct_matches)
    print("-"*100)
    print()
In [17]:
import json
result = pd.read_csv("/Users/peter/Downloads/merge_websites_output_800.csv")
result.research = [json.loads(r) for r in result.research]
print(f"n=800")
print("num of matched rows:", len(result))
num_of_llm_matches = sum([1 if r["personal_website_url"] == 'This row was matched due to the information in both tables' else 0 for r in result.research])
num_of_web_search_matches = sum([1 if r["personal_website_url"].startswith('This row was matched due to the following information found in the web:') else 0 for r in result.research])
fraction_of_correct_matches = np.mean([1 if url == get_correct_website_for_name(name) else 0 for name, url in zip(result.name, result.personal_website_url)])
print("num of LLM matches:", num_of_llm_matches)
print("num of web search matches:", num_of_web_search_matches)
print("fraction of correct matches:", fraction_of_correct_matches)
print("-"*100)
print()
n=800
num of matched rows: 800
num of LLM matches: 780
num of web search matches: 20
fraction of correct matches: 0.77625
----------------------------------------------------------------------------------------------------

In [ ]:
for n in [1600, 2246]:
    result = pd.read_csv(f"/Users/peter/Downloads/merge_websites_output_{n}.csv")
    result.research = [json.loads(r) for r in result.research]
    print(f"n={n}")
    print("num of matched rows:", len(result))
    num_of_llm_matches = sum([1 if r["personal_website_url"] == 'This row was matched due to the information in both tables' else 0 for r in result.research])
    num_of_web_search_matches = sum([1 if r["personal_website_url"].startswith('This row was matched due to the following information found in the web:') else 0 for r in result.research])
    fraction_of_correct_matches = np.mean([1 if url == get_correct_website_for_name(name) else 0 for name, url in zip(result.name, result.personal_website_url)])
    print("num of LLM matches:", num_of_llm_matches)
    print("num of web search matches:", num_of_web_search_matches)
    print("fraction of correct matches:", fraction_of_correct_matches)
    print("-"*100)
    print()
In [ ]:
import asyncio

async def run_merge_and_report(n):
    async with create_session(name=f"Website Matching (n={n})") as session:
        print(f"Session URL: {session.get_url()}")
        result = await merge(
            session=session,
            task="Match each person to their website(s).",
            left_table=pd.read_csv(f"/Users/peter/Downloads/merge_websites_input_left_{n}.csv"),
            right_table=pd.read_csv(f"/Users/peter/Downloads/merge_websites_input_right_{n}.csv"),
        )
    print(f"n={n}")
    print("num of matched rows:", len(result.data))
    num_of_llm_matches = sum([1 if r["personal_website_url"] == 'This row was matched due to the information in both tables' else 0 for r in result.data.research])
    num_of_web_search_matches = sum([1 if r["personal_website_url"].startswith('This row was matched due to the following information found in the web:') else 0 for r in result.data.research])
    fraction_of_correct_matches = np.mean([1 if url == get_correct_website_for_name(name) else 0 for name, url in zip(result.data.name, result.data.personal_website_url)])
    print("num of LLM matches:", num_of_llm_matches)
    print("num of web search matches:", num_of_web_search_matches)
    print("fraction of correct matches:", fraction_of_correct_matches)
    print("-"*100)
    print()

await asyncio.gather(*(run_merge_and_report(n) for n in [1600, 2246]))