LLM-powered Merging at Scale¶
The everyrow merge() function joins two tables by using LLMs, backed by LLM research agents, to identify matching rows at high accuracy. This notebook demonstrates how the merge scales to two tables of 2,246 rows each: every row gets LLM-level reasoning and research to find its most likely match among the 2,246 rows in the other table.
Cost grows superlinearly with the number of rows. At small scale (100 to 400 rows) the cost is negligible; at 2,246 × 2,246 rows, it is $26.80.
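Before the full example, here is a minimal sketch of the call pattern. The one-row DataFrames and column names are invented for illustration; the sections below run the same call on the real data files.
import pandas as pd
from everyrow.ops import merge

# Hypothetical one-row tables, just to show the shape of the API.
left = pd.DataFrame({"name": ["Jane Doe"], "email": ["jane@uni.edu"]})
right = pd.DataFrame({"personal_website_url": ["https://janedoe.github.io"]})

result = await merge(
    task="Match each person to their website(s).",
    left_table=left,
    right_table=right,
)
result.data  # DataFrame of matched rows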
Example: Matching 2,246 People to Personal Websites¶
This example takes two tables: one with people's names and professional information (position, university, email), and another with a shuffled list of personal website URLs. The task is to determine which website belongs to which person.
Most matches can be resolved by comparing names and emails against URL patterns. But some require web search to confirm ownership when the connection is not obvious from the data alone.
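To build intuition for the easy cases, here is a toy heuristic showing the kind of name-to-URL signal that makes many pairs resolvable without search. This is not everyrow's actual mechanism, and the example name and URL are hypothetical.
import re

def name_tokens_in_url(name: str, url: str) -> bool:
    # Toy signal: do all of the person's name tokens appear in the URL?
    # everyrow's LLM reasons over all columns rather than applying a fixed rule.
    tokens = re.findall(r"[a-z]+", name.lower())
    return all(t in url.lower() for t in tokens)

print(name_tokens_in_url("Jane Doe", "https://janedoe.github.io"))  # True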
Load Data¶
import numpy as np
import pandas as pd
from everyrow.ops import merge
pd.set_option("display.max_colwidth", None)
left_df = pd.read_csv("merge_websites_input_left_2246.csv")
right_df = pd.read_csv("merge_websites_input_right_2246.csv")
print(f"Left table: {len(left_df)} rows")
left_df.head(3)
print(f"Right table: {len(right_df)} rows")
right_df.head(3)
Run Merge¶
Run the merge at increasing scales to see how it behaves.
for n in [100, 200, 400, 800, 1600, 2246]:
    result = await merge(
        task="Match each person to their website(s).",
        left_table=pd.read_csv(f"merge_websites_input_left_{n}.csv"),
        right_table=pd.read_csv(f"merge_websites_input_right_{n}.csv"),
    )
    print(f"n={n}")
    print("num of matched rows:", len(result.data))
    print("-" * 100)
    print()
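The inspection section below reads saved output files, so presumably each run's matches were written out along these lines (a sketch, assuming result.data is the pandas DataFrame of matched rows, as used above):
# Inside the loop above: persist each run's matches for later inspection.
result.data.to_csv(f"merge_websites_output_{n}.csv", index=False)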
Cost¶
import matplotlib.pyplot as plt
rows = [100, 200, 400, 800, 1600, 2246]
costs = [0.000465, 0.142, 0.293, 2.32, 16.6, 26.8]
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(rows, costs, "o-", color="#2563eb", linewidth=2, markersize=8)
for x, y in zip(rows, costs):
    ax.annotate(f"${y:.2f}", (x, y), textcoords="offset points", xytext=(0, 12), ha="center", fontsize=9)
ax.set_xlabel("Number of rows")
ax.set_ylabel("Cost (USD)")
ax.set_title("Merge cost vs. number of rows")
ax.set_xticks(rows)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Cost grows superlinearly with the number of rows. As the row count increases, each match becomes harder: the LLM has more candidates to consider, and more rows require web search to resolve ambiguity. At small scale (100 to 400 rows) the cost is negligible; at 2,246 rows it is $26.80.
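Another way to see the superlinear growth is per-row cost: under linear scaling it would stay flat, but here it rises by orders of magnitude across these scales. This reuses the rows and costs lists from the plotting cell above.
# Per-row cost: flat under linear scaling, rising sharply here.
for n, c in zip(rows, costs):
    print(f"n={n:>4}  total=${c:.4f}  per row=${c / n:.6f}")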
Inspecting Results¶
Sample matches from the n=800 run.
results_df = pd.read_csv("merge_websites_output_800.csv")
Most matches are resolved by the LLM alone. It can often match a person to their website by comparing names, emails, and URL patterns without any web search.
llm_matches = results_df[results_df["research"].str.contains("information in both tables", na=False)]
llm_matches[["name", "email_address", "personal_website_url", "research"]].head(2)
For harder cases where the LLM cannot confidently match from the table data alone, everyrow automatically falls back to web search.
web_matches = results_df[results_df["research"].str.contains("information found in the web", na=False)]
web_matches[["name", "organization", "personal_website_url", "research"]].head(1)
In this case, there is no obvious connection between "Charles London" and le-big-mac.github.io from the table data alone. everyrow searched the web, found his Oxford profile and GitHub username, and confirmed the match.
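As a rough breakdown of how the 800-row run's matches were resolved (assuming the two substring filters above capture the two kinds of research notes; rows matching neither are excluded):
print(f"Resolved from table data alone: {len(llm_matches)}")
print(f"Required web search: {len(web_matches)}")
print(f"Total matched rows: {len(results_df)}")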