Fuzzy join two Pandas DataFrames using LLMs
This notebook demonstrates the merge capabilities of the everyrow.io SDK:
- Fuzzy String Matching - Handling typos and corrupted data
- LLM Merge - Matching without common columns (company ↔ ticker)
- Web Merge - Dynamic data requiring real-time verification (CEO matching)
The SDK implements a cascade: Exact → Fuzzy → LLM → Web, using the simplest method that works.
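The cascade idea can be sketched in a few lines. This is an illustrative stand-in, not the SDK's actual implementation: `fuzzy_score` here uses `difflib`, the 0.85 threshold is arbitrary, and `llm_match`/`web_match` are caller-supplied stubs for the expensive stages.

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    # Cheap string-similarity stand-in; the SDK's actual scorer may differ.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cascade_match(left_key, right_keys, llm_match=None, web_match=None):
    """Try the cheapest method first, escalating only on failure."""
    # 1. Exact match: free and unambiguous.
    if left_key in right_keys:
        return left_key, "exact"
    # 2. Fuzzy match: accept a sufficiently close string-similarity hit.
    best = max(right_keys, key=lambda k: fuzzy_score(left_key, k))
    if fuzzy_score(left_key, best) >= 0.85:
        return best, "fuzzy"
    # 3. LLM match: semantic matching with no shared surface form.
    if llm_match is not None and (hit := llm_match(left_key, right_keys)):
        return hit, "llm"
    # 4. Web match: real-time verification for dynamic facts.
    if web_match is not None and (hit := web_match(left_key, right_keys)):
        return hit, "web"
    return None, "unmatched"
```

Each stage only sees the rows the previous stage failed on, which is why the per-run prices below grow with noise.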
Setup
Install the SDK and set your API key:
!pip install everyrow
%env EVERYROW_API_KEY=your_api_key
Load the data:
import pandas as pd
from pandas import DataFrame
from everyrow import create_session
from everyrow.ops import merge
from everyrow.generated.models import LLMEnum
# Load dataset: 438 S&P 500 companies
data = pd.read_csv("../data/companies.csv")
print(f"Dataset: {data.shape[0]} companies, {data.shape[1]} columns")
data.head()
Experiment 1: Fuzzy String Matching Under Noise
We corrupt company names with increasing noise levels to test the merge cascade.
Setup:
- Left table: all columns except `fair_value`
- Right table: `company` (corrupted) + `fair_value`
- Merge on the company name columns
Let us define a function that adds noise to a string column of a DataFrame:
import random
import string
def randomize_string_column(df: DataFrame, column_name: str, p: float = 0.1) -> DataFrame:
    """Replace each character of a string column with a random letter with probability p."""
    def randomize_string(text):
        if pd.isna(text) or not isinstance(text, str):
            return text
        return ''.join(
            random.choice(string.ascii_letters) if random.random() < p else char
            for char in text
        )

    df_copy: DataFrame = df.copy()
    df_copy[column_name] = df_copy[column_name].apply(randomize_string)
    return df_copy
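A quick sanity check of the corruption helper (the function is re-stated here so the snippet runs standalone; it is identical to the one above). Seeding the RNG makes a given corruption reproducible, and replacement never changes string length:

```python
import random
import string
import pandas as pd

def randomize_string_column(df, column_name, p=0.1):
    # Same helper as above, restated for a self-contained demo.
    def randomize_string(text):
        if pd.isna(text) or not isinstance(text, str):
            return text
        return ''.join(
            random.choice(string.ascii_letters) if random.random() < p else char
            for char in text
        )
    df_copy = df.copy()
    df_copy[column_name] = df_copy[column_name].apply(randomize_string)
    return df_copy

random.seed(42)  # reproducible corruption
demo = pd.DataFrame({"company": ["Apple Inc.", "Microsoft Corporation"]})
print(randomize_string_column(demo, "company", p=0.3)["company"].tolist())
```

At `p=0` the column comes back unchanged; at `p=1` every character (including spaces and punctuation) is replaced by a random letter.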
Experiment 1a: 0% Noise (Baseline)
Here we pick the right columns from the ground truth, apply the noise (none in this baseline), and call the everyrow.io SDK to perform the merge:
# Prepare tables
left_table = data.drop(columns=["fair_value"])
right_table = data[["company", "fair_value"]].copy() # No noise
async with create_session(name="Fuzzy Match p=0") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()
Results (0% Noise):
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 100% | 0% | 0% | 0% | 100% | 0% | $0.13 |
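Because the right table is carved out of the same ground-truth frame, accuracy can be verified offline. A sketch, under the assumption that `result.data` still carries the `company` and `fair_value` columns:

```python
import pandas as pd

def score_merge(merged: pd.DataFrame, truth: pd.DataFrame,
                key: str = "company", value: str = "fair_value") -> float:
    """Fraction of rows whose merged value agrees with the ground truth."""
    check = merged[[key, value]].merge(
        truth[[key, value]], on=key, suffixes=("_merged", "_true")
    )
    return (check[f"{value}_merged"] == check[f"{value}_true"]).mean()
```

For example, `score_merge(result.data, data)` should return `1.0` for this baseline run.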
Experiment 1b: 5% Noise
left_table = data.drop(columns=["fair_value"])
right_table = randomize_string_column(data[["company", "fair_value"]].copy(), "company", p=0.05)
async with create_session(name="Fuzzy Match p=0.05") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()
Results (5% Noise):
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 49.8% | 30.6% | 19.6% | 0% | 100% | 0% | $0.32 |
Experiment 1c: 10% Noise
left_table = data.drop(columns=["fair_value"])
right_table = randomize_string_column(data[["company", "fair_value"]].copy(), "company", p=0.1)
async with create_session(name="Fuzzy Match p=0.1") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()
Results (10% Noise):
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 26.5% | 30.8% | 42.7% | 0% | 100% | 0% | $0.44 |
At 10% corruption, exact matching handles only 27% of rows. The cascade escalates to LLM matching for 43%.
Experiment 2: LLM Merge for Semantic Relationships
Matching tables with no common columns — the LLM must use world knowledge to match company names to tickers.
Setup:
- Left table: company info (no ticker)
- Right table: ticker + fair_value only
- No merge columns specified → skips string matching, goes straight to LLM
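For contrast, a conventional join has nothing to work with here: pandas refuses to merge two frames that share no columns unless you explicitly name the keys to align.

```python
import pandas as pd

left = pd.DataFrame({"company": ["Apple Inc.", "Microsoft Corporation"]})
right = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "fair_value": [150.0, 300.0]})

try:
    pd.merge(left, right)  # no common columns, no `on=` given
except pd.errors.MergeError as e:
    print(f"pandas cannot merge: {e}")
```

Bridging `company` ↔ `ticker` requires knowledge that lives outside both tables, which is exactly what the LLM stage supplies.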
left_table = data.drop(columns=["ticker", "fair_value"])
right_table = data[["ticker", "fair_value"]]
async with create_session(name="LLM Match (company to ticker)") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables based on company name and ticker",
        left_table=left_table,
        right_table=right_table,
        # No merge columns → LLM matching
    )
result.data.head()
Results:
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 0% | 0% | 99.8% | 0.2% | 100% | 0% | $1.00 |
437/438 rows matched via pure LLM reasoning. Company-ticker mappings are stable and well-learned.
Experiment 3: Web Merge for Dynamic Data
CEO information changes frequently — LLM training data becomes stale. Web verification is needed.
Setup:
- Left table: company info (no CEO)
- Right table: CEO names only
- Task instructs to use web search if unsure
left_table = data.drop(columns=["CEO"])
right_table = data[["CEO"]]
async with create_session(name="Web Match (CEO)") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the CEO to the company information, use web search if unsure",
        left_table=left_table,
        right_table=right_table,
    )
result.data.head()
Results:
| Matched | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|
| 95.7% | 59.8% | 35.8% | 96.7% | 3.2% | $3.69 |
CEO matching requires web verification since leadership changes frequently.
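Pulling the per-run prices from the result tables above (438 companies per run), the cost per row grows with how far the cascade has to escalate:

```python
n = 438  # companies in the dataset
runs = {  # per-run prices from the result tables above
    "exact only (0% noise)": 0.13,
    "mixed cascade (10% noise)": 0.44,
    "pure LLM (tickers)": 1.00,
    "LLM + web (CEOs)": 3.69,
}
for name, price in runs.items():
    print(f"{name}: ${price / n * 100:.2f} per 100 rows")
```

Roughly $0.03 per 100 rows when exact matching suffices, versus about $0.84 per 100 rows once web verification is in play, an order-of-magnitude spread that motivates trying the cheap stages first.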