LLM-powered Merging at Scale¶
The everyrow merge() function joins two tables by using LLMs, backed by LLM research agents, to identify matching rows at high accuracy. This notebook demonstrates how the merge scales to two tables of 2,246 rows each: every row gets LLM-level reasoning and research to find its most likely match among the 2,246 rows in the other table.
Cost grows superlinearly with the number of rows. At small scale (100 to 400 rows) the cost is negligible; at 2,246 × 2,246 rows, it is $26.80.
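Before the full example, here is a minimal sketch of the call pattern. The one-row DataFrames and column names are invented for illustration; the sections below run the same call on the real data files.
import pandas as pd
from everyrow.ops import merge

# Hypothetical one-row tables, just to show the shape of the API.
left = pd.DataFrame({"name": ["Jane Doe"], "email": ["jane@uni.edu"]})
right = pd.DataFrame({"personal_website_url": ["https://janedoe.github.io"]})

result = await merge(
    task="Match each person to their website(s).",
    left_table=left,
    right_table=right,
)
result.data  # DataFrame of matched rows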
Example: Matching 2,246 People to Personal Websites¶
This example takes two tables: one with people's names and professional information (position, university, email), and another with a shuffled list of personal website URLs. The task is to determine which website belongs to which person.
Most matches can be resolved by comparing names and emails against URL patterns. But some require web search to confirm ownership when the connection is not obvious from the data alone.
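To build intuition for the easy cases, here is a toy heuristic showing the kind of name-to-URL signal that makes many pairs resolvable without search. This is not everyrow's actual mechanism, and the example name and URL are hypothetical.
import re

def name_tokens_in_url(name: str, url: str) -> bool:
    # Toy signal: do all of the person's name tokens appear in the URL?
    # everyrow's LLM reasons over all columns rather than applying a fixed rule.
    tokens = re.findall(r"[a-z]+", name.lower())
    return all(t in url.lower() for t in tokens)

print(name_tokens_in_url("Jane Doe", "https://janedoe.github.io"))  # True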
Load Data¶
import numpy as np
import pandas as pd
from everyrow.ops import merge
pd.set_option("display.max_colwidth", None)
left_df = pd.read_csv("merge_websites_input_left_2246.csv")
right_df = pd.read_csv("merge_websites_input_right_2246.csv")
print(f"Left table: {len(left_df)} rows")
left_df.head(3)
print(f"Right table: {len(right_df)} rows")
right_df.head(3)
Run Merge¶
Run the merge at increasing scales to see how it behaves.
for n in [100, 200, 400, 800, 1600, 2246]:
    result = await merge(
        task="Match each person to their website(s).",
        left_table=pd.read_csv(f"merge_websites_input_left_{n}.csv"),
        right_table=pd.read_csv(f"merge_websites_input_right_{n}.csv"),
    )
    print(f"n={n}")
    print("num of matched rows:", len(result.data))
    print("-" * 100)
    print()
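The inspection section below reads saved output files, so presumably each run's matches were written out along these lines (a sketch, assuming result.data is the pandas DataFrame of matched rows, as used above):
# Inside the loop above: persist each run's matches for later inspection.
result.data.to_csv(f"merge_websites_output_{n}.csv", index=False)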
Cost¶
import matplotlib.pyplot as plt
rows = [100, 200, 400, 800, 1600, 2246]
costs = [0.000465, 0.142, 0.293, 2.32, 16.6, 26.8]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(rows, costs, "o-", color="#2563eb", linewidth=2, markersize=8)
for x, y in zip(rows, costs):
    ax.annotate(f"${y:.2f}", (x, y), textcoords="offset points", xytext=(0, 12), ha="center", fontsize=9)
ax.set_xlabel("Number of rows")
ax.set_ylabel("Cost (USD)")
ax.set_title("Merge cost vs. number of rows")
ax.set_xticks(rows)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Cost grows superlinearly with the number of rows. As the row count increases, each match becomes harder: the LLM has more candidates to consider, and more rows require web search to resolve ambiguity. At small scale (100 to 400 rows) the cost is negligible; at 2,246 rows it is $26.80.
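Another way to see the superlinear growth is per-row cost: under linear scaling it would stay flat, but here it rises by orders of magnitude across these scales. This reuses the rows and costs lists from the plotting cell above.
# Per-row cost: flat under linear scaling, rising sharply here.
for n, c in zip(rows, costs):
    print(f"n={n:>4}  total=${c:.4f}  per row=${c / n:.6f}")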
Inspecting Results¶
Sample matches from the n=800 run.
results_df = pd.read_csv("merge_websites_output_800.csv")
Most matches are resolved by the LLM alone. It can often match a person to their website by comparing names, emails, and URL patterns without any web search.
llm_matches = results_df[results_df["research"].str.contains("information in both tables", na=False)]
llm_matches[["name", "email_address", "personal_website_url", "research"]].head(2)
For harder cases where the LLM cannot confidently match from the table data alone, everyrow automatically falls back to web search.
web_matches = results_df[results_df["research"].str.contains("information found in the web", na=False)]
web_matches[["name", "organization", "personal_website_url", "research"]].head(1)
In this case, there is no obvious connection between "Charles London" and le-big-mac.github.io from the table data alone. everyrow searched the web, found his Oxford profile and GitHub username, and confirmed the match.
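As a rough breakdown of how the 800-row run's matches were resolved (assuming the two substring filters above capture the two kinds of research notes; rows matching neither are excluded):
print(f"Resolved from table data alone: {len(llm_matches)}")
print(f"Required web search: {len(web_matches)}")
print(f"Total matched rows: {len(results_df)}")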