Fuzzy join two Pandas DataFrames using LLMs
This notebook demonstrates the merge capabilities of the everyrow.io SDK:
- Fuzzy String Matching - Handling typos and corrupted data
- LLM Merge - Matching without common columns (company ↔ ticker)
- Web Merge - Dynamic data requiring real-time verification (CEO matching)
The SDK implements a cascade: Exact → Fuzzy → LLM → Web, using the simplest method that works.
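The cascade idea can be sketched in a few lines. This is an illustrative stand-in, not the SDK's actual implementation: `fuzzy_score` here uses `difflib`, the 0.85 threshold is arbitrary, and `llm_match`/`web_match` are caller-supplied stubs for the expensive stages.

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    # Cheap string-similarity stand-in; the SDK's actual scorer may differ.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cascade_match(left_key, right_keys, llm_match=None, web_match=None):
    """Try the cheapest method first, escalating only on failure."""
    # 1. Exact match: free and unambiguous.
    if left_key in right_keys:
        return left_key, "exact"
    # 2. Fuzzy match: accept a sufficiently close string-similarity hit.
    best = max(right_keys, key=lambda k: fuzzy_score(left_key, k))
    if fuzzy_score(left_key, best) >= 0.85:
        return best, "fuzzy"
    # 3. LLM match: semantic matching with no shared surface form.
    if llm_match is not None and (hit := llm_match(left_key, right_keys)):
        return hit, "llm"
    # 4. Web match: real-time verification for dynamic facts.
    if web_match is not None and (hit := web_match(left_key, right_keys)):
        return hit, "web"
    return None, "unmatched"
```

Each stage only sees the rows the previous stage failed on, which is why the per-run prices below grow with noise.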
Setup
Install the SDK and set your API key:
!pip install everyrow
%env EVERYROW_API_KEY=your_api_key
Load the data:
import pandas as pd
from pandas import DataFrame
from everyrow import create_session
from everyrow.ops import merge
from everyrow.generated.models import LLMEnum
# Load dataset: 438 S&P 500 companies
data = pd.read_csv("../data/companies.csv")
print(f"Dataset: {data.shape[0]} companies, {data.shape[1]} columns")
data.head()
Experiment 1: Fuzzy String Matching Under Noise
We corrupt company names with increasing noise levels to test the merge cascade.
Setup:
- Left table: all columns except `fair_value`
- Right table: `company` (corrupted) + `fair_value`
- Merge on the company name columns
Let us define a function that adds noise to a string column of a DataFrame:
import random
import string
def randomize_string_column(df: DataFrame, column_name: str, p: float = 0.1) -> DataFrame:
    """Replace each character of a string column with a random letter with probability p."""
    def randomize_string(text):
        if pd.isna(text) or not isinstance(text, str):
            return text
        return ''.join(
            random.choice(string.ascii_letters) if random.random() < p else char
            for char in text
        )

    df_copy: DataFrame = df.copy()
    df_copy[column_name] = df_copy[column_name].apply(randomize_string)
    return df_copy
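A quick sanity check of the corruption helper (the function is re-stated here so the snippet runs standalone; it is identical to the one above). Seeding the RNG makes a given corruption reproducible, and replacement never changes string length:

```python
import random
import string
import pandas as pd

def randomize_string_column(df, column_name, p=0.1):
    # Same helper as above, restated for a self-contained demo.
    def randomize_string(text):
        if pd.isna(text) or not isinstance(text, str):
            return text
        return ''.join(
            random.choice(string.ascii_letters) if random.random() < p else char
            for char in text
        )
    df_copy = df.copy()
    df_copy[column_name] = df_copy[column_name].apply(randomize_string)
    return df_copy

random.seed(42)  # reproducible corruption
demo = pd.DataFrame({"company": ["Apple Inc.", "Microsoft Corporation"]})
print(randomize_string_column(demo, "company", p=0.3)["company"].tolist())
```

At `p=0` the column comes back unchanged; at `p=1` every character (including spaces and punctuation) is replaced by a random letter.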
Experiment 1a: 0% Noise (Baseline)
Here we pick the right columns from the ground truth, apply the noise (none in this baseline), and call the everyrow.io SDK to perform the merge:
# Prepare tables
left_table = data.drop(columns=["fair_value"])
right_table = data[["company", "fair_value"]].copy() # No noise
async with create_session(name="Fuzzy Match p=0") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()
Results (0% Noise):
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 100% | 0% | 0% | 0% | 100% | 0% | $0.13 |
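Because the right table is carved out of the same ground-truth frame, accuracy can be verified offline. A sketch, under the assumption that `result.data` still carries the `company` and `fair_value` columns:

```python
import pandas as pd

def score_merge(merged: pd.DataFrame, truth: pd.DataFrame,
                key: str = "company", value: str = "fair_value") -> float:
    """Fraction of rows whose merged value agrees with the ground truth."""
    check = merged[[key, value]].merge(
        truth[[key, value]], on=key, suffixes=("_merged", "_true")
    )
    return (check[f"{value}_merged"] == check[f"{value}_true"]).mean()
```

For example, `score_merge(result.data, data)` should return `1.0` for this baseline run.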
Experiment 1b: 5% Noise
left_table = data.drop(columns=["fair_value"])
right_table = randomize_string_column(data[["company", "fair_value"]].copy(), "company", p=0.05)
async with create_session(name="Fuzzy Match p=0.05") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()
Results (5% Noise):
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 49.8% | 30.6% | 19.6% | 0% | 100% | 0% | $0.32 |
Experiment 1c: 10% Noise
left_table = data.drop(columns=["fair_value"])
right_table = randomize_string_column(data[["company", "fair_value"]].copy(), "company", p=0.1)
async with create_session(name="Fuzzy Match p=0.1") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables on company name",
        left_table=left_table,
        right_table=right_table,
        merge_on_left="company",
        merge_on_right="company",
    )
result.data.head()
Results (10% Noise):
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 26.5% | 30.8% | 42.7% | 0% | 100% | 0% | $0.44 |
At 10% corruption, exact matching handles only 27% of rows. The cascade escalates to LLM matching for 43%.
Experiment 2: LLM Merge for Semantic Relationships
Matching tables with no common columns — the LLM must use world knowledge to match company names to tickers.
Setup:
- Left table: company info (no ticker)
- Right table: ticker + fair_value only
- No merge columns specified → skips string matching, goes straight to LLM
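For contrast, a conventional join has nothing to work with here: pandas refuses to merge two frames that share no columns unless you explicitly name the keys to align.

```python
import pandas as pd

left = pd.DataFrame({"company": ["Apple Inc.", "Microsoft Corporation"]})
right = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "fair_value": [150.0, 300.0]})

try:
    pd.merge(left, right)  # no common columns, no `on=` given
except pd.errors.MergeError as e:
    print(f"pandas cannot merge: {e}")
```

Bridging `company` ↔ `ticker` requires knowledge that lives outside both tables, which is exactly what the LLM stage supplies.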
left_table = data.drop(columns=["ticker", "fair_value"])
right_table = data[["ticker", "fair_value"]]
async with create_session(name="LLM Match (company to ticker)") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the tables based on company name and ticker",
        left_table=left_table,
        right_table=right_table,
        # No merge columns → LLM matching
    )
result.data.head()
Results:
| Matched | Exact | Fuzzy | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|---|---|
| 100% | 0% | 0% | 99.8% | 0.2% | 100% | 0% | $1.00 |
437/438 rows matched via pure LLM reasoning. Company-ticker mappings are stable and well-learned.
Experiment 3: Web Merge for Dynamic Data
CEO information changes frequently — LLM training data becomes stale. Web verification is needed.
Setup:
- Left table: company info (no CEO)
- Right table: CEO names only
- Task instructs to use web search if unsure
left_table = data.drop(columns=["CEO"])
right_table = data[["CEO"]]
async with create_session(name="Web Match (CEO)") as session:
    print(f"Session: {session.get_url()}")
    result = await merge(
        session=session,
        task="Merge the CEO to the company information, use web search if unsure",
        left_table=left_table,
        right_table=right_table,
    )
result.data.head()
Results:
| Matched | LLM | Web | Accuracy | False Pos | Price |
|---|---|---|---|---|---|
| 95.7% | 59.8% | 35.8% | 96.7% | 3.2% | $3.69 |
CEO matching requires web verification since leadership changes frequently.
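Pulling the per-run prices from the result tables above (438 companies per run), the cost per row grows with how far the cascade has to escalate:

```python
n = 438  # companies in the dataset
runs = {  # per-run prices from the result tables above
    "exact only (0% noise)": 0.13,
    "mixed cascade (10% noise)": 0.44,
    "pure LLM (tickers)": 1.00,
    "LLM + web (CEOs)": 3.69,
}
for name, price in runs.items():
    print(f"{name}: ${price / n * 100:.2f} per 100 rows")
```

Roughly $0.03 per 100 rows when exact matching suffices, versus about $0.84 per 100 rows once web verification is in play, an order-of-magnitude spread that motivates trying the cheap stages first.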