LLM-powered Merging at Scale¶

The everyrow merge() function joins two tables using LLMs and LLM research agents to identify matching rows with high accuracy. This notebook demonstrates how the approach scales to two tables of 2,246 rows each: every row gets LLM-level reasoning and, where needed, web research to find its most likely match among the 2,246 rows of the other table.

Cost grows superlinearly with the number of rows, since each row must be matched against a growing pool of candidates in the other table. At small scale (100 to 400 rows) the cost is negligible; at 2,246 × 2,246 rows it comes to $26.80.

Example: Matching 2,246 People to Personal Websites¶

This example takes two tables: one with people's names and professional information (position, university, email), and another with a shuffled list of personal website URLs. The task is to determine which website belongs to which person.

Most matches can be resolved by comparing names and emails against URL patterns. But some require web search to confirm ownership when the connection is not obvious from the data alone.
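To build intuition for the kind of signal being compared, here is a purely illustrative sketch of a name-and-email-versus-URL heuristic. This is not how merge() works internally; url_match_score is a hypothetical stand-in for the LLM's judgement, and the second example uses an empty email string because Charles London's email is not shown in the data.

In [ ]:
import re

def url_match_score(name: str, emails: str, url: str) -> int:
    """Hypothetical heuristic: count name tokens and email handles that appear in the URL."""
    tokens = set(re.findall(r"[a-z]+", name.lower()))
    handles = set(re.findall(r"([a-z0-9.]+)@", emails.lower()))
    url = url.lower()
    return sum(tok in url for tok in tokens) + sum(h.split(".")[0] in url for h in handles)

# Easy case: the surname appears directly in the domain.
url_match_score("Stefan Heimersheim", "stefan@heimersheim.eu", "https://heimersheim.eu/")
# Hard case: nothing in the name points at the URL, so web research is needed.
url_match_score("Charles London", "", "https://le-big-mac.github.io/")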

Load Data¶

In [ ]:
import numpy as np
import pandas as pd
from everyrow.ops import merge

pd.set_option("display.max_colwidth", None)
In [ ]:
left_df = pd.read_csv("merge_websites_input_left_2246.csv")
right_df = pd.read_csv("merge_websites_input_right_2246.csv")

print(f"Left table: {len(left_df)} rows")
left_df.head(3)
Left table: 2246 rows
name position university email_address organization
0 Stefan Heimersheim Research Scientist NaN 1. stefan@heimersheim.eu \n2. heimersheim@ast.cam.ac.uk \n3. sh2061@cam.ac.uk Apollo Research
1 Dr Nikola Simidjievski Postdoctoral Researcher University of Cambridge 1. nikola.simidjievski@cl.cam.ac.uk Artificial Intelligence Group
2 Ruotong Wang PhD Student University of Washington 1. ruotongw@cs.washington.edu Social Futures Lab
In [ ]:
print(f"Right table: {len(right_df)} rows")
right_df.head(3)
Right table: 2246 rows
personal_website_url
0 1. https://beau-coup.github.io/
1 1. https://nair-p.github.io/ \n2. https://nair-p.github.io/contact/
2 1. https://www.murtylab.com \n2. http://ratan.mit.edu

Run Merge¶

Run the merge at increasing scales to see how it behaves.

In [ ]:
for n in [100, 200, 400, 800, 1600, 2246]:
    result = await merge(
        task="Match each person to their website(s).",
        left_table=pd.read_csv(f"merge_websites_input_left_{n}.csv"),
        right_table=pd.read_csv(f"merge_websites_input_right_{n}.csv"),
    )
    # Count how each match was resolved, based on the explanation in the "research" column.
    research = result.data["research"]
    print(f"n={n}")
    print("num of matched rows:", len(result.data))
    print("num of LLM matches:", research.str.contains("information in both tables", na=False).sum())
    print("num of web search matches:", research.str.contains("information found in the web", na=False).sum())
    print("-" * 100)
    print()
n=100
num of matched rows: 100
num of LLM matches: 95
num of web search matches: 5
----------------------------------------------------------------------------------------------------

n=200
num of matched rows: 200
num of LLM matches: 196
num of web search matches: 4
----------------------------------------------------------------------------------------------------

n=400
num of matched rows: 400
num of LLM matches: 386
num of web search matches: 14
----------------------------------------------------------------------------------------------------

n=800
num of matched rows: 800
num of LLM matches: 780
num of web search matches: 20
----------------------------------------------------------------------------------------------------

n=1600
num of matched rows: 1600
----------------------------------------------------------------------------------------------------

n=2246
num of matched rows: 2246
num of LLM matches: 2228
num of web search matches: 18
----------------------------------------------------------------------------------------------------
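The printed counts show that only a small share of rows ever falls back to web search. A quick tally of the fractions, using the counts above (the n=1600 breakdown was not printed, so it is omitted):

In [ ]:
# Web-search share at each scale, from the (LLM matches, web search matches) counts printed above.
counts = {100: (95, 5), 200: (196, 4), 400: (386, 14), 800: (780, 20), 2246: (2228, 18)}
for n, (llm, web) in counts.items():
    print(f"n={n}: {web / (llm + web):.1%} of matches needed web search")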

Cost¶

In [ ]:
import matplotlib.pyplot as plt

rows = [100, 200, 400, 800, 1600, 2246]
costs = [0.000465, 0.142, 0.293, 2.32, 16.6, 26.8]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(rows, costs, "o-", color="#2563eb", linewidth=2, markersize=8)
for x, y in zip(rows, costs):
    ax.annotate(f"${y:.2f}", (x, y), textcoords="offset points", xytext=(0, 12), ha="center", fontsize=9)
ax.set_xlabel("Number of rows")
ax.set_ylabel("Cost (USD)")
ax.set_title("Merge cost vs. number of rows")
ax.set_xticks(rows)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Cost grows superlinearly with the number of rows. As the number of rows increases, each match becomes harder because the LLM has more candidates to consider, and more rows require web search to resolve ambiguity. At small scale (100 to 400 rows) the cost is negligible; at 2,246 rows it is $26.80.
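To put a rough number on "superlinear", one can fit a power law to the measured costs; an exponent above 1 on a log-log scale means the growth is faster than linear. A minimal sketch over the six data points plotted above (note the near-zero 100-row cost pulls the fitted exponent upward):

In [ ]:
# Fit cost ≈ a * rows^b on a log-log scale; b > 1 indicates superlinear growth.
rows = np.array([100, 200, 400, 800, 1600, 2246])
costs = np.array([0.000465, 0.142, 0.293, 2.32, 16.6, 26.8])
b, log_a = np.polyfit(np.log(rows), np.log(costs), 1)
print(f"cost ~ rows^{b:.2f}")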

Inspecting Results¶

Sample matches from the n=800 run.

In [ ]:
results_df = pd.read_csv("merge_websites_output_800.csv")

Most matches are resolved by the LLM alone. It can often match a person to their website by comparing names, emails, and URL patterns without any web search.

In [ ]:
llm_matches = results_df[results_df["research"].str.contains("information in both tables", na=False)]
llm_matches[["name", "email_address", "personal_website_url", "research"]].head(2)
name email_address personal_website_url research
0 Stefan Heimersheim 1. stefan@heimersheim.eu \n2. heimersheim@ast.cam.ac.uk \n3. sh2061@cam.ac.uk 1. https://ndaheim.github.io/ \n2. https://ndaheim.github.io/publications/ \n3. https://ndaheim.github.io/aboutme/ {"personal_website_url":"This row was matched due to the information in both tables"}
1 Dr Nikola Simidjievski 1. nikola.simidjievski@cl.cam.ac.uk 1. https://www.cl.cam.ac.uk/~btd26/ {"personal_website_url":"This row was matched due to the information in both tables"}

For harder cases where the LLM cannot confidently match from the table data alone, everyrow automatically falls back to web search.

In [ ]:
web_matches = results_df[results_df["research"].str.contains("information found in the web", na=False)]
web_matches[["name", "organization", "personal_website_url", "research"]].head(1)
name organization personal_website_url research
10 Charles London NaN 1. https://le-big-mac.github.io/ {"personal_website_url":"This row was matched due to the following information found in the web:\n\nBased on the provided information for Charles London (PhD Student at the University of Oxford, Department of Computer Science), the following personal website URL and identifiers were found to assist in matching:\n\n- **Personal Website URL:** https://le-big-mac.github.io/\n- **Official University Profile:** https://www.cs.ox.ac.uk/people/charles.london/\n- **Google Scholar Profile:** https://scholar.google.com/citations?user=ghU-4hUAAAAJ\n- **LinkedIn Profile:** https://uk.linkedin.com/in/charles-london\n- **GitHub Username:** le-big-mac (associated with the personal website)\n\nThe entity is confirmed as a DPhil (PhD) student in the Artificial Intelligence and Machine Learning research theme at the University of Oxford, supervised by Prof. Varun Kanade. His research focuses on machine learning theory, LLMs, and reinforcement learning."}

In this case, there is no obvious connection between "Charles London" and le-big-mac.github.io from the table data alone. everyrow searched the web, found his Oxford profile and GitHub username, and confirmed the match.
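To quantify how often the web-search fallback kicks in for this run, the same substring checks used above can tally each match type across the full n=800 results:

In [ ]:
# Tally match types in the n=800 results using the same "research" column patterns as above.
llm_count = results_df["research"].str.contains("information in both tables", na=False).sum()
web_count = results_df["research"].str.contains("information found in the web", na=False).sum()
print(f"LLM-only matches: {llm_count}, web search matches: {web_count}")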