Fuzzy match and merge contact lists in Python¶
This notebook demonstrates using everyrow's merge() utility to combine two overlapping contact lists where records lack exact matches.
Use Case: You have candidate lists from two different sources and need to merge them to avoid sending duplicate recruiting emails. The challenge: less than 50% match exactly by name or email due to typos, nicknames, different email domains, and incomplete data.
Why everyrow? Traditional approaches (VLOOKUP, fuzzy matching) fail on semantic variations. everyrow's merge() uses LLM-powered matching to intelligently identify duplicates despite significant data variations.
In [1]:
import asyncio
from dotenv import load_dotenv
load_dotenv()
import pandas as pd
from everyrow import create_session
from everyrow.ops import merge
Load Contact Lists¶
In [2]:
# List A: From a conference attendee export
list_a = pd.read_csv("../data/contacts_list_a.csv").fillna("")
print(f"List A: {len(list_a)} contacts")
list_a
Out[2]:
In [3]:
# List B: From a research collaboration database
list_b = pd.read_csv("../data/contacts_list_b.csv").fillna("")
print(f"List B: {len(list_b)} contacts")
list_b
Out[3]:
Define Merge Task¶
In [4]:
MERGE_TASK = """
Match contacts between two lists to identify the same person.
Two records represent the SAME PERSON if:
- Names match (accounting for nicknames: Bob/Robert, Mike/Michael, Tom/Thomas)
- Names match with initials (S. Chen = Sarah Chen)
- Same institution/lab despite different name formats
- Email domains suggest same organization
Do NOT match if:
- Only first names match but institutions differ
- Names are completely different people
When in doubt, favor false negatives over false positives (better to not match than to wrongly match).
"""
Run the Merge¶
In [5]:
async def run_merge():
async with create_session(name="Contact List Merge") as session:
print(f"Session URL: {session.get_url()}")
print("\nMerging contact lists...\n")
result = await merge(
session=session,
task=MERGE_TASK,
left_table=list_a,
right_table=list_b,
merge_on_left="name",
merge_on_right="full_name",
)
return result.data
results_df = await run_merge()
Analyze Results¶
In [6]:
# Count matches
matched = results_df[results_df["full_name"].notna()]
unmatched_a = results_df[results_df["full_name"].isna()]
print(f"\n{'='*60}")
print(f"MERGE RESULTS")
print(f"{'='*60}")
print(f" List A contacts: {len(list_a)}")
print(f" List B contacts: {len(list_b)}")
print(f" Matched pairs: {len(matched)}")
print(f" List A only: {len(unmatched_a)}")
In [7]:
# Show matched pairs
print("\nMATCHED CONTACTS:")
print("-" * 70)
for _, row in matched.iterrows():
print(f" List A: {row['name']:25} | List B: {row['full_name']}")
print(f" {row['affiliation']:25} | {row['lab']}")
print()
In [8]:
# Show unmatched from List A
if len(unmatched_a) > 0:
print("\nUNMATCHED FROM LIST A (unique to conference):")
print("-" * 50)
for _, row in unmatched_a.iterrows():
print(f" {row['name']} - {row['affiliation']}")
In [9]:
# Find contacts unique to List B
matched_from_b = set(matched["full_name"].dropna())
unique_to_b = list_b[~list_b["full_name"].isin(matched_from_b)]
if len(unique_to_b) > 0:
print("\nUNIQUE TO LIST B (not at conference):")
print("-" * 50)
for _, row in unique_to_b.iterrows():
print(f" {row['full_name']} - {row['lab']}")
In [10]:
# Full merged results
results_df
Out[10]: