Fuzzy match and merge contact lists in Python¶

This notebook demonstrates using everyrow's merge() utility to combine two overlapping contact lists where records lack exact matches.

Use Case: You have candidate lists from two different sources and need to merge them to avoid sending duplicate recruiting emails. The challenge: fewer than 50% of records match exactly by name or email because of typos, nicknames, different email domains, and incomplete data.

Why everyrow? Exact-match lookups (VLOOKUP) and string-similarity fuzzy matching struggle with semantic variations such as nicknames, initials, and differently named labs at the same institution. everyrow's merge() uses LLM-powered matching to identify the same person despite these variations.
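
To make the limitation concrete, here is a minimal sketch using only the standard library and a few name pairs taken from the two lists loaded later in this notebook. It is an illustrative baseline, not part of everyrow: an equality check finds no matches, and a character-level similarity score (difflib's SequenceMatcher) gives each pair a different score, so any single cutoff is a judgment call.

```python
from difflib import SequenceMatcher

# Name pairs copied from the two lists in this notebook (same person per pair).
pairs = [
    ("Dr. Sarah Chen", "S. Chen"),
    ("Michael O'Brien", "Mike O'Brien"),
    ("Robert Johnson", "Bob Johnson"),
    ("Thomas Lee", "Tom Lee"),
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# An exact join (what VLOOKUP or an equality merge does) finds none of them.
exact_hits = [(a, b) for a, b in pairs if a == b]

# String similarity yields a score per pair, but the cutoff must be chosen by hand.
scores = {(a, b): round(similarity(a, b), 2) for a, b in pairs}

print("exact matches:", len(exact_hits))
for (a, b), score in scores.items():
    print(f"{a!r} vs {b!r}: {score}")
```

A threshold loose enough to accept the lowest-scoring pair may also accept unrelated people who happen to share a surname, which is the gap the task description below tries to close with nickname and institution rules.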

In [1]:

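import asyncio  # unused under Jupyter's top-level await; needed only in a plain script
from dotenv import load_dotenv

# Read environment variables from a local .env file before everyrow is imported
load_dotenv()

import pandas as pd
from everyrow import create_session
from everyrow.ops import merge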

Load Contact Lists¶

In [2]:

# List A: From a conference attendee export
list_a = pd.read_csv("../data/contacts_list_a.csv").fillna("")

print(f"List A: {len(list_a)} contacts")
list_a

List A: 12 contacts

Out[2]:

	name	email	affiliation	title
0	Dr. Sarah Chen	sarah.chen@stanford.edu	Stanford University	Assistant Professor
1	Michael O'Brien	mobrien@mit.edu	MIT	PhD Candidate
2	Priya Sharma	p.sharma@berkeley.edu	UC Berkeley	Postdoc
3	James Wilson	jwilson@cmu.edu	Carnegie Mellon	Professor
4	Elena Rodriguez	elena.r@caltech.edu	Caltech	Research Scientist
5	David Kim	dkim@uw.edu	University of Washington	Associate Professor
6	Anna Kowalski	akowalski@gatech.edu	Georgia Tech	PhD Student
7	Robert Johnson	rjohnson@princeton.edu	Princeton	Senior Researcher
8	Maria Santos	msantos@columbia.edu	Columbia University	Assistant Professor
9	Thomas Lee	tlee@harvard.edu	Harvard	Postdoc
10	Jennifer Park	jpark@yale.edu	Yale	PhD Candidate
11	Christopher Davis	cdavis@upenn.edu	UPenn	Professor

In [3]:

# List B: From a research collaboration database
list_b = pd.read_csv("../data/contacts_list_b.csv").fillna("")

print(f"List B: {len(list_b)} contacts")
list_b

List B: 10 contacts

Out[3]:

	full_name	personal_email	github	lab
0	S. Chen	sarahchen@gmail.com	sarahchen-ml	Stanford AI Lab
1	Mike O'Brien	mike.obrien@gmail.com	mikeob	MIT CSAIL
2	Priya S.		priyasharma	Berkeley AI Research
3	Alexandra Petrov	apetrov@gmail.com	alex-petrov	Oxford ML Group
4	James R. Wilson	james.wilson@cmu.edu	jrwilson	CMU Robotics
5	Wei Zhang	wzhang@outlook.com	weizhang-ai	Tsinghua University
6	Elena R.	elena.rodriguez@gmail.com		Caltech Computing
7	Bob Johnson	bob.j@princeton.edu	bobjohnson	Princeton NLP
8	Yuki Tanaka	ytanaka@u-tokyo.ac.jp	yukitanaka	University of Tokyo
9	Tom Lee	thomaslee@fas.harvard.edu	tomlee-research	Harvard SEAS
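
The email columns show why an exact email join does not help here: List A holds institutional addresses, while List B mostly holds personal ones. As a rough sketch of the "email domains suggest same organization" signal, the hypothetical helper below (not part of everyrow) compares domains and ignores public mail providers. The list of public providers is an assumption chosen to fit this sample data.

```python
# Assumed set of public mail providers; extend as needed for real data.
PUBLIC_DOMAINS = {"gmail.com", "outlook.com"}


def email_domain(address: str) -> str:
    """Return the lowercased part after the last '@', or '' if there is none."""
    return address.rsplit("@", 1)[-1].lower() if "@" in address else ""


def same_org_domain(a: str, b: str) -> bool:
    """True if both addresses use the same non-public domain or a subdomain of it."""
    da, db = email_domain(a), email_domain(b)
    if not da or not db or da in PUBLIC_DOMAINS or db in PUBLIC_DOMAINS:
        return False
    return da == db or da.endswith("." + db) or db.endswith("." + da)


print(same_org_domain("tlee@harvard.edu", "thomaslee@fas.harvard.edu"))
print(same_org_domain("sarah.chen@stanford.edu", "sarahchen@gmail.com"))
```

A shared domain is only weak evidence, since many people share an institution, so it is best treated as a supporting signal next to the name comparison.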

Define Merge Task¶

In [4]:

MERGE_TASK = """
Match contacts between two lists to identify the same person.

Two records represent the SAME PERSON if:
- Names match (accounting for nicknames: Bob/Robert, Mike/Michael, Tom/Thomas)
- Names match with initials (S. Chen = Sarah Chen)
- Same institution/lab despite different name formats
- Email domains suggest same organization

Do NOT match if:
- Only first names match but institutions differ
- Names are completely different people

When in doubt, favor false negatives over false positives (better to not match than to wrongly match).
"""

Run the Merge¶

In [5]:

async def run_merge():
    async with create_session(name="Contact List Merge") as session:
        print(f"Session URL: {session.get_url()}")
        print("\nMerging contact lists...\n")
        
        result = await merge(
            session=session,
            task=MERGE_TASK,
            left_table=list_a,
            right_table=list_b,
            merge_on_left="name",
            merge_on_right="full_name",
        )
        
        return result.data

results_df = await run_merge()

Session URL: https://everyrow.io/sessions/d3692fc5-1f81-4633-b64f-0f905235e894

Merging contact lists...
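
The cell above relies on top-level await, which Jupyter and IPython support but a plain Python script does not. A minimal sketch of the script equivalent follows; run_merge_stub is a hypothetical stand-in for the notebook's run_merge() so that the example runs without an everyrow session.

```python
import asyncio


async def run_merge_stub() -> str:
    """Hypothetical stand-in for the notebook's run_merge() coroutine."""
    await asyncio.sleep(0)  # placeholder for the awaited merge() call
    return "done"


def main() -> str:
    # asyncio.run() creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(run_merge_stub())


result = main()
print(result)
```

In a script, the same pattern would wrap the real coroutine, for example asyncio.run(run_merge()), in place of the bare await.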

Analyze Results¶

In [6]:

# Count matches: the merge keeps every List A row, and rows without a
# List B match have NaN in the List B columns such as "full_name"
matched = results_df[results_df["full_name"].notna()]
unmatched_a = results_df[results_df["full_name"].isna()]

print(f"\n{'='*60}")
print("MERGE RESULTS")
print(f"{'='*60}")
print(f"  List A contacts:    {len(list_a)}")
print(f"  List B contacts:    {len(list_b)}")
print(f"  Matched pairs:      {len(matched)}")
print(f"  List A only:        {len(unmatched_a)}")

============================================================
MERGE RESULTS
============================================================
  List A contacts:    12
  List B contacts:    10
  Matched pairs:      7
  List A only:        5

In [7]:

# Show matched pairs
print("\nMATCHED CONTACTS:")
print("-" * 70)
for _, row in matched.iterrows():
    print(f"  List A: {row['name']:25} | List B: {row['full_name']}")
    print(f"          {row['affiliation']:25} |         {row['lab']}")
    print()

MATCHED CONTACTS:
----------------------------------------------------------------------
  List A: Dr. Sarah Chen            | List B: S. Chen
          Stanford University       |         Stanford AI Lab

  List A: Michael O'Brien           | List B: Mike O'Brien
          MIT                       |         MIT CSAIL

  List A: Priya Sharma              | List B: Priya S.
          UC Berkeley               |         Berkeley AI Research

  List A: James Wilson              | List B: James R. Wilson
          Carnegie Mellon           |         CMU Robotics

  List A: Elena Rodriguez           | List B: Elena R.
          Caltech                   |         Caltech Computing

  List A: Robert Johnson            | List B: Bob Johnson
          Princeton                 |         Princeton NLP

  List A: Thomas Lee                | List B: Tom Lee
          Harvard                   |         Harvard SEAS

In [8]:

# Show unmatched from List A
if len(unmatched_a) > 0:
    print("\nUNMATCHED FROM LIST A (unique to conference):")
    print("-" * 50)
    for _, row in unmatched_a.iterrows():
        print(f"  {row['name']} - {row['affiliation']}")

UNMATCHED FROM LIST A (unique to conference):
--------------------------------------------------
  David Kim - University of Washington
  Anna Kowalski - Georgia Tech
  Maria Santos - Columbia University
  Jennifer Park - Yale
  Christopher Davis - UPenn

In [9]:

# Find contacts unique to List B
# (compares on full_name, so this assumes names are unique within List B)
matched_from_b = set(matched["full_name"].dropna())
unique_to_b = list_b[~list_b["full_name"].isin(matched_from_b)]

if len(unique_to_b) > 0:
    print("\nUNIQUE TO LIST B (not at conference):")
    print("-" * 50)
    for _, row in unique_to_b.iterrows():
        print(f"  {row['full_name']} - {row['lab']}")

UNIQUE TO LIST B (not at conference):
--------------------------------------------------
  Alexandra Petrov - Oxford ML Group
  Wei Zhang - Tsinghua University
  Yuki Tanaka - University of Tokyo
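
With the three groups identified, a deduplicated recipient list is every List A row plus the List B rows that were never matched. The sketch below shows one way to assemble it on tiny stand-in frames with the same column names as the notebook. Preferring the List A email for matched contacts is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Tiny stand-ins for the notebook's results_df and list_b (values copied from the outputs above).
results_df = pd.DataFrame({
    "name": ["Robert Johnson", "David Kim"],
    "email": ["rjohnson@princeton.edu", "dkim@uw.edu"],
    "full_name": ["Bob Johnson", None],
    "personal_email": ["bob.j@princeton.edu", None],
})
list_b = pd.DataFrame({
    "full_name": ["Bob Johnson", "Wei Zhang"],
    "personal_email": ["bob.j@princeton.edu", "wzhang@outlook.com"],
})

# One row per List A contact, matched or not, keeping the List A email (assumption).
from_a = results_df[["name", "email"]]

# List B contacts that were never matched to a List A row.
matched_b = set(results_df["full_name"].dropna())
b_only = list_b[~list_b["full_name"].isin(matched_b)]
from_b = b_only.rename(columns={"full_name": "name", "personal_email": "email"})[["name", "email"]]

# Each person appears once: all of List A, then the List B-only contacts.
recipients = pd.concat([from_a, from_b], ignore_index=True)
print(recipients)
```

Applied to the full lists in this notebook, the same steps would yield 12 + 3 = 15 recipients, which agrees with 12 + 10 - 7 matched pairs.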

In [10]:

# Full merged results
results_df

Out[10]:

	name	email	affiliation	title	full_name	personal_email	github	lab	research
0	Dr. Sarah Chen	sarah.chen@stanford.edu	Stanford University	Assistant Professor	S. Chen	sarahchen@gmail.com	sarahchen-ml	Stanford AI Lab	{'full_name': 'This row was matched due to the...
1	Michael O'Brien	mobrien@mit.edu	MIT	PhD Candidate	Mike O'Brien	mike.obrien@gmail.com	mikeob	MIT CSAIL	{'full_name': 'This row was matched due to the...
2	Priya Sharma	p.sharma@berkeley.edu	UC Berkeley	Postdoc	Priya S.	None	priyasharma	Berkeley AI Research	{'full_name': 'This row was matched due to the...
3	James Wilson	jwilson@cmu.edu	Carnegie Mellon	Professor	James R. Wilson	james.wilson@cmu.edu	jrwilson	CMU Robotics	{'full_name': 'This row was matched due to the...
4	Elena Rodriguez	elena.r@caltech.edu	Caltech	Research Scientist	Elena R.	elena.rodriguez@gmail.com	None	Caltech Computing	{'full_name': 'This row was matched due to the...
5	David Kim	dkim@uw.edu	University of Washington	Associate Professor	NaN	NaN	NaN	NaN	NaN
6	Anna Kowalski	akowalski@gatech.edu	Georgia Tech	PhD Student	NaN	NaN	NaN	NaN	NaN
7	Robert Johnson	rjohnson@princeton.edu	Princeton	Senior Researcher	Bob Johnson	bob.j@princeton.edu	bobjohnson	Princeton NLP	{'full_name': 'This row was matched due to the...
8	Maria Santos	msantos@columbia.edu	Columbia University	Assistant Professor	NaN	NaN	NaN	NaN	NaN
9	Thomas Lee	tlee@harvard.edu	Harvard	Postdoc	Tom Lee	thomaslee@fas.harvard.edu	tomlee-research	Harvard SEAS	{'full_name': 'This row was matched due to the...
10	Jennifer Park	jpark@yale.edu	Yale	PhD Candidate	NaN	NaN	NaN	NaN	NaN
11	Christopher Davis	cdavis@upenn.edu	UPenn	Professor	NaN	NaN	NaN	NaN	NaN
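
The research column appears to hold, for each matched row, a dict keyed by column name with a short explanation of the match, and NaN for unmatched rows. Assuming that structure (it is inferred from the truncated display above, not from everyrow's documentation), a small hypothetical helper can pull out the explanation text:

```python
def match_reason(research, column="full_name"):
    """Return the explanation stored for `column`, or None for unmatched rows."""
    if isinstance(research, dict):
        return research.get(column)
    return None  # NaN (a float) or any other non-dict value


# Sample cells mirroring the display above; the explanation text is truncated there too.
sample = [
    {"full_name": "This row was matched due to the..."},
    float("nan"),
]
reasons = [match_reason(cell) for cell in sample]
print(reasons)
```

In the notebook this could be applied with results_df["research"].map(match_reason) to review why each pair was matched before sending any email.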
