Fuzzy join two Pandas DataFrames using LLMs¶

This notebook demonstrates the merge capabilities of the everyrow.io SDK:

  1. Fuzzy String Matching - Handling typos and corrupted data
  2. LLM Merge - Matching without common columns (company ↔ ticker)
  3. Web Merge - Dynamic data requiring real-time verification (CEO matching)

The SDK implements a cascade: Exact → Fuzzy → LLM → Web, using the simplest method that works.
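
For intuition only, here is a rough sketch of what such a cascade looks like using plain Python and difflib. This is not how the SDK is implemented; it just illustrates the idea of trying the cheapest tier first and escalating the leftovers to an LLM or web agent.

In [ ]:
import difflib

def cascade_match(left_keys, right_keys, fuzzy_cutoff=0.85):
    """Toy cascade: exact match, then fuzzy similarity, then escalate."""
    right_set = set(right_keys)
    tiers = {}
    for key in left_keys:
        if key in right_set:  # tier 1: exact string equality
            tiers[key] = (key, "exact")
            continue
        close = difflib.get_close_matches(key, right_keys, n=1, cutoff=fuzzy_cutoff)
        if close:             # tier 2: fuzzy string similarity
            tiers[key] = (close[0], "fuzzy")
        else:                 # tier 3+: hand off to an LLM / web agent
            tiers[key] = (None, "escalate")
    return tiers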

Setup¶

Install the SDK and set your API key:

In [ ]:
!pip install everyrow
%env EVERYROW_API_KEY=your_api_key
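
If you are running outside a notebook (or prefer not to use the %env magic), you can set the key from Python instead; replace the placeholder with your real key:

In [ ]:
import os

os.environ["EVERYROW_API_KEY"] = "your_api_key"  # placeholder, not a real key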

Load the data:

In [3]:
import pandas as pd
from pandas import DataFrame

from everyrow import create_session
from everyrow.ops import merge
from everyrow.generated.models import LLMEnum

# Load dataset: 438 S&P 500 companies
data = pd.read_csv("../data/companies.csv")
print(f"Dataset: {data.shape[0]} companies, {data.shape[1]} columns")
data.head()
Dataset: 438 companies, 7 columns
Out[3]:
|   | company             | ticker | fair_value | price  | mkt_cap     | shares     | CEO               |
| 0 | 3M                  | MMM    | 39.18      | 101.74 | 61.70678828 | 606514530  | William M. Brown  |
| 1 | A. O. Smith         | AOS    | 6.59       | 32.38  | 4.904416495 | 151464376  | Stephen Shafer    |
| 2 | Abbott Laboratories | ABT    | 119.19     | 34.87  | 51.22933139 | 1469152033 | Robert B. Ford    |
| 3 | AbbVie              | ABBV   | 180.95     | 38.06  | 61.55666858 | 1617358607 | Robert A. Michael |
| 4 | Accenture           | ACN    | 107.79     | 97.84  | 79.53540176 | 812912937  | Julie Sweet       |

Experiment 1: Fuzzy String Matching Under Noise¶

We corrupt company names with increasing noise levels to test the merge cascade.

Setup:

  • Left table: all columns except fair_value
  • Right table: company (corrupted) + fair_value
  • Merge on company name columns

Let us define a function that adds noise to a string column of a DataFrame:

In [4]:
import random
import string

def randomize_string_column(df: DataFrame, column_name: str, p: float = 0.1) -> DataFrame:
    """Return a copy of df where each character in column_name is replaced by a random letter with probability p."""
    def randomize_string(text):
        if pd.isna(text) or not isinstance(text, str):
            return text
        return ''.join(
            random.choice(string.ascii_letters) if random.random() < p else char
            for char in text
        )
    df_copy: DataFrame = df.copy()
    df_copy[column_name] = df_copy[column_name].apply(randomize_string)
    return df_copy
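
As a quick sanity check, compare a few original names to their corrupted versions at p=0.1 (illustrative only; the corruption is random, so your output will differ):

In [ ]:
original = data[["company"]].head()
corrupted = randomize_string_column(original, "company", p=0.1)
pd.concat(
    [original.rename(columns={"company": "original"}),
     corrupted.rename(columns={"company": "corrupted"})],
    axis=1,
)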

Experiment 1a: 0% Noise (Baseline)¶

Here we select the right-hand columns from the ground truth, apply the noise (none for this baseline), and call the everyrow.io SDK to perform the merge:

In [ ]:
# Prepare tables
left_table = data.drop(columns=["fair_value"])
right_table = data[["company", "fair_value"]].copy()  # No noise

async with create_session(name="Fuzzy Match p=0") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()

Results (0% Noise):

| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
| 100%    | 100%  | 0%    | 0%  | 0%  | 100%     | 0%        | $0.13 |

View session

Experiment 1b: 5% Noise¶

In [ ]:
left_table = data.drop(columns=["fair_value"])
right_table = randomize_string_column(data[["company", "fair_value"]].copy(), "company", p=0.05)

async with create_session(name="Fuzzy Match p=0.05") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()

Results (5% Noise):

| Matched | Exact | Fuzzy | LLM   | Web | Accuracy | False Pos | Price |
| 100%    | 49.8% | 30.6% | 19.6% | 0%  | 100%     | 0%        | $0.32 |

View session

Experiment 1c: 10% Noise¶

In [ ]:
left_table = data.drop(columns=["fair_value"])
right_table = randomize_string_column(data[["company", "fair_value"]].copy(), "company", p=0.1)

async with create_session(name="Fuzzy Match p=0.1") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()

Results (10% Noise):

| Matched | Exact | Fuzzy | LLM   | Web | Accuracy | False Pos | Price |
| 100%    | 26.5% | 30.8% | 42.7% | 0%  | 100%     | 0%        | $0.44 |

At 10% corruption, exact matching handles only about 27% of rows, and the cascade escalates roughly 43% of rows to LLM matching.

View session
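
If you want to verify the accuracy figure yourself, one option is to join the merged output back to the ground truth and compare fair_value. The snippet below is a sketch under the assumption that result.data keeps the left table's uncorrupted company column alongside the merged fair_value; rename the columns to whatever your result actually contains.

In [ ]:
import numpy as np

# Hedged sketch: assumes result.data has the original "company" column and a
# merged "fair_value" column; adjust names to your actual result schema.
check = result.data.merge(
    data[["company", "fair_value"]],
    on="company",
    suffixes=("_merged", "_truth"),
)
accuracy = np.isclose(check["fair_value_merged"], check["fair_value_truth"]).mean()
print(f"Row-level accuracy vs. ground truth: {accuracy:.1%}")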

Experiment 2: LLM Merge for Semantic Relationships¶

This experiment matches tables with no common columns: the LLM must use world knowledge to map company names to tickers.

Setup:

  • Left table: company info (no ticker)
  • Right table: ticker + fair_value only
  • No merge columns specified → skips string matching, goes straight to LLM
In [ ]:
left_table = data.drop(columns=["ticker", "fair_value"])
right_table = data[["ticker", "fair_value"]]

async with create_session(name="LLM Match (company to ticker)") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables based on company name and ticker",
        left_table=left_table,
        right_table=right_table,
        # No merge columns → LLM matching
    )
result.data.head()

Results:

| Matched | Exact | Fuzzy | LLM   | Web  | Accuracy | False Pos | Price |
| 100%    | 0%    | 0%    | 99.8% | 0.2% | 100%     | 0%        | $1.00 |

437 of 438 rows were matched by pure LLM reasoning; the one remaining row fell back to web search. Company-ticker mappings are stable and well represented in LLM training data.

View session

Experiment 3: Web Merge for Dynamic Data¶

CEO information changes frequently, so LLM training data goes stale and web verification is needed.

Setup:

  • Left table: company info (no CEO)
  • Right table: CEO names only
  • Task instructs to use web search if unsure
In [ ]:
left_table = data.drop(columns=["CEO"])
right_table = data[["CEO"]]

async with create_session(name="Web Match (CEO)") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the CEO to the company information, use web search if unsure",
        left_table=left_table,
        right_table=right_table,
    )
result.data.head()

Results:

| Matched | LLM   | Web   | Accuracy | False Pos | Cost  |
| 95.7%   | 59.8% | 35.8% | 96.7%    | 3.2%      | $3.69 |

CEO matching requires web verification since leadership changes frequently.

View session
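
To inspect the roughly 4% of rows that did not get a CEO, one approach is to look for missing values in the merged column. This is a sketch under the assumption that unmatched rows come back with an empty CEO field and that result.data still carries the company column; adapt it to however your result actually represents misses.

In [ ]:
# Hedged sketch: assumes unmatched rows have a missing value in "CEO".
unmatched = result.data[result.data["CEO"].isna()]
print(f"{len(unmatched)} companies left without a CEO match")
print(unmatched["company"].tolist())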