Fuzzy match and merge contact lists in Python

This notebook shows how to use everyrow's merge() utility to combine two overlapping contact lists whose records do not match exactly.

Use Case: You have candidate lists from two different sources and need to merge them to avoid sending duplicate recruiting emails. The challenge: fewer than 50% of the records match exactly by name or email, because of typos, nicknames, different email domains, and incomplete data.

Why everyrow? Traditional approaches such as VLOOKUP or string-similarity fuzzy matching compare characters, so they miss semantic variations like nicknames (Bob/Robert) and initials (S. Chen for Sarah Chen). everyrow's merge() uses LLM-powered matching to identify the same person despite these variations.
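
To make the limitation concrete, here is a minimal sketch of a character-level baseline using Python's standard-library difflib. It is not part of everyrow; the helper names (name_similarity, exact_matches) are introduced here for illustration only. The name pairs are the ones the merge reports as matches later in this notebook.

```python
from difflib import SequenceMatcher

# Name pairs (List A, List B) that the merge later reports as the same person.
MATCHED_PAIRS = [
    ("Dr. Sarah Chen", "S. Chen"),
    ("Michael O'Brien", "Mike O'Brien"),
    ("Priya Sharma", "Priya S."),
    ("James Wilson", "James R. Wilson"),
    ("Elena Rodriguez", "Elena R."),
    ("Robert Johnson", "Bob Johnson"),
    ("Thomas Lee", "Tom Lee"),
]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_matches(pairs):
    """Return the pairs whose names are identical after trimming and lowercasing."""
    return [(a, b) for a, b in pairs if a.strip().lower() == b.strip().lower()]


if __name__ == "__main__":
    found = exact_matches(MATCHED_PAIRS)
    print(f"Exact name matches: {len(found)} of {len(MATCHED_PAIRS)}")
    for a, b in MATCHED_PAIRS:
        print(f"{a:18} | {b:16} | similarity = {name_similarity(a, b):.2f}")
```

None of these pairs would be found by an exact join on name, and the similarity scores vary from pair to pair, so any single cutoff is a trade-off between missed matches and false positives.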

In [1]:
import asyncio
from dotenv import load_dotenv
load_dotenv()

import pandas as pd
from everyrow import create_session
from everyrow.ops import merge

Load Contact Lists

In [2]:
# List A: From a conference attendee export
list_a = pd.read_csv("../data/contacts_list_a.csv").fillna("")

print(f"List A: {len(list_a)} contacts")
list_a
List A: 12 contacts
Out[2]:
    name               email                    affiliation               title
0   Dr. Sarah Chen     sarah.chen@stanford.edu  Stanford University       Assistant Professor
1   Michael O'Brien    mobrien@mit.edu          MIT                       PhD Candidate
2   Priya Sharma       p.sharma@berkeley.edu    UC Berkeley               Postdoc
3   James Wilson       jwilson@cmu.edu          Carnegie Mellon           Professor
4   Elena Rodriguez    elena.r@caltech.edu      Caltech                   Research Scientist
5   David Kim          dkim@uw.edu              University of Washington  Associate Professor
6   Anna Kowalski      akowalski@gatech.edu     Georgia Tech              PhD Student
7   Robert Johnson     rjohnson@princeton.edu   Princeton                 Senior Researcher
8   Maria Santos       msantos@columbia.edu     Columbia University       Assistant Professor
9   Thomas Lee         tlee@harvard.edu         Harvard                   Postdoc
10  Jennifer Park      jpark@yale.edu           Yale                      PhD Candidate
11  Christopher Davis  cdavis@upenn.edu         UPenn                     Professor
In [3]:
# List B: From a research collaboration database
list_b = pd.read_csv("../data/contacts_list_b.csv").fillna("")

print(f"List B: {len(list_b)} contacts")
list_b
List B: 10 contacts
Out[3]:
    full_name         personal_email             github           lab
0   S. Chen           sarahchen@gmail.com        sarahchen-ml     Stanford AI Lab
1   Mike O'Brien      mike.obrien@gmail.com      mikeob           MIT CSAIL
2   Priya S.                                     priyasharma      Berkeley AI Research
3   Alexandra Petrov  apetrov@gmail.com          alex-petrov      Oxford ML Group
4   James R. Wilson   james.wilson@cmu.edu       jrwilson         CMU Robotics
5   Wei Zhang         wzhang@outlook.com         weizhang-ai      Tsinghua University
6   Elena R.          elena.rodriguez@gmail.com                   Caltech Computing
7   Bob Johnson       bob.j@princeton.edu        bobjohnson       Princeton NLP
8   Yuki Tanaka       ytanaka@u-tokyo.ac.jp      yukitanaka       University of Tokyo
9   Tom Lee           thomaslee@fas.harvard.edu  tomlee-research  Harvard SEAS

Define Merge Task

In [4]:
MERGE_TASK = """
Match contacts between two lists to identify the same person.

Two records represent the SAME PERSON if:
- Names match (accounting for nicknames: Bob/Robert, Mike/Michael, Tom/Thomas)
- Names match with initials (S. Chen = Sarah Chen)
- Same institution/lab despite different name formats
- Email domains suggest same organization

Do NOT match if:
- Only first names match but institutions differ
- Names are completely different people

When in doubt, favor false negatives over false positives (better to not match than to wrongly match).
"""

Run the Merge

In [5]:
async def run_merge():
    async with create_session(name="Contact List Merge") as session:
        print(f"Session URL: {session.get_url()}")
        print("\nMerging contact lists...\n")
        
        result = await merge(
            session=session,
            task=MERGE_TASK,
            left_table=list_a,
            right_table=list_b,
            merge_on_left="name",
            merge_on_right="full_name",
        )
        
        return result.data

results_df = await run_merge()
Session URL: https://everyrow.io/sessions/d3692fc5-1f81-4633-b64f-0f905235e894

Merging contact lists...
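
The cell above relies on top-level await, which Jupyter and IPython support. In a plain Python script, await is not allowed at module level, so the coroutine has to be driven by asyncio.run() instead. A minimal sketch follows; run_merge_stub is a hypothetical stand-in for run_merge() so that the example runs without an everyrow session or API key.

```python
import asyncio


async def run_merge_stub():
    """Hypothetical stand-in for run_merge(); returns a tiny fake result."""
    await asyncio.sleep(0)  # yield to the event loop once, like a real async call would
    return [{"name": "Thomas Lee", "full_name": "Tom Lee"}]


def main():
    # asyncio.run() creates an event loop, runs the coroutine to completion,
    # and closes the loop. In the notebook's script equivalent this would be
    # asyncio.run(run_merge()).
    return asyncio.run(run_merge_stub())


if __name__ == "__main__":
    print(main())
```

Note that asyncio.run() cannot be called from inside an already running event loop, which is why the notebook uses await directly.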

Analyze Results

In [6]:
# Count matches
matched = results_df[results_df["full_name"].notna()]
unmatched_a = results_df[results_df["full_name"].isna()]

print(f"\n{'='*60}")
print("MERGE RESULTS")
print(f"{'='*60}")
print(f"  List A contacts:    {len(list_a)}")
print(f"  List B contacts:    {len(list_b)}")
print(f"  Matched pairs:      {len(matched)}")
print(f"  List A only:        {len(unmatched_a)}")
============================================================
MERGE RESULTS
============================================================
  List A contacts:    12
  List B contacts:    10
  Matched pairs:      7
  List A only:        5
In [7]:
# Show matched pairs
print("\nMATCHED CONTACTS:")
print("-" * 70)
for _, row in matched.iterrows():
    print(f"  List A: {row['name']:25} | List B: {row['full_name']}")
    print(f"          {row['affiliation']:25} |         {row['lab']}")
    print()
MATCHED CONTACTS:
----------------------------------------------------------------------
  List A: Dr. Sarah Chen            | List B: S. Chen
          Stanford University       |         Stanford AI Lab

  List A: Michael O'Brien           | List B: Mike O'Brien
          MIT                       |         MIT CSAIL

  List A: Priya Sharma              | List B: Priya S.
          UC Berkeley               |         Berkeley AI Research

  List A: James Wilson              | List B: James R. Wilson
          Carnegie Mellon           |         CMU Robotics

  List A: Elena Rodriguez           | List B: Elena R.
          Caltech                   |         Caltech Computing

  List A: Robert Johnson            | List B: Bob Johnson
          Princeton                 |         Princeton NLP

  List A: Thomas Lee                | List B: Tom Lee
          Harvard                   |         Harvard SEAS

In [8]:
# Show unmatched from List A
if len(unmatched_a) > 0:
    print("\nUNMATCHED FROM LIST A (unique to conference):")
    print("-" * 50)
    for _, row in unmatched_a.iterrows():
        print(f"  {row['name']} - {row['affiliation']}")
UNMATCHED FROM LIST A (unique to conference):
--------------------------------------------------
  David Kim - University of Washington
  Anna Kowalski - Georgia Tech
  Maria Santos - Columbia University
  Jennifer Park - Yale
  Christopher Davis - UPenn
In [9]:
# Find contacts unique to List B
matched_from_b = set(matched["full_name"].dropna())
unique_to_b = list_b[~list_b["full_name"].isin(matched_from_b)]

if len(unique_to_b) > 0:
    print("\nUNIQUE TO LIST B (not at conference):")
    print("-" * 50)
    for _, row in unique_to_b.iterrows():
        print(f"  {row['full_name']} - {row['lab']}")
UNIQUE TO LIST B (not at conference):
--------------------------------------------------
  Alexandra Petrov - Oxford ML Group
  Wei Zhang - Tsinghua University
  Yuki Tanaka - University of Tokyo
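
As a quick sanity check, the counts printed above should reconcile: matched pairs plus List A only should equal the size of List A, and matched pairs plus List B only should equal the size of List B. The sketch below assumes the merge is one-to-one (each List B contact matches at most one List A contact); check_merge_accounting is a hypothetical helper, not an everyrow function.

```python
def check_merge_accounting(n_a, n_b, n_matched, n_a_only, n_b_only):
    """Return True if every row of both lists is accounted for exactly once.

    Assumes one-to-one matching between the two lists.
    """
    return (n_matched + n_a_only == n_a) and (n_matched + n_b_only == n_b)


# Counts from this notebook: 12 and 10 contacts, 7 matched, 5 A-only, 3 B-only.
print(check_merge_accounting(n_a=12, n_b=10, n_matched=7, n_a_only=5, n_b_only=3))
```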
In [10]:
# Full merged results
results_df
Out[10]:
name email affiliation title full_name personal_email github lab research
0 Dr. Sarah Chen sarah.chen@stanford.edu Stanford University Assistant Professor S. Chen sarahchen@gmail.com sarahchen-ml Stanford AI Lab {'full_name': 'This row was matched due to the...
1 Michael O'Brien mobrien@mit.edu MIT PhD Candidate Mike O'Brien mike.obrien@gmail.com mikeob MIT CSAIL {'full_name': 'This row was matched due to the...
2 Priya Sharma p.sharma@berkeley.edu UC Berkeley Postdoc Priya S. None priyasharma Berkeley AI Research {'full_name': 'This row was matched due to the...
3 James Wilson jwilson@cmu.edu Carnegie Mellon Professor James R. Wilson james.wilson@cmu.edu jrwilson CMU Robotics {'full_name': 'This row was matched due to the...
4 Elena Rodriguez elena.r@caltech.edu Caltech Research Scientist Elena R. elena.rodriguez@gmail.com None Caltech Computing {'full_name': 'This row was matched due to the...
5 David Kim dkim@uw.edu University of Washington Associate Professor NaN NaN NaN NaN NaN
6 Anna Kowalski akowalski@gatech.edu Georgia Tech PhD Student NaN NaN NaN NaN NaN
7 Robert Johnson rjohnson@princeton.edu Princeton Senior Researcher Bob Johnson bob.j@princeton.edu bobjohnson Princeton NLP {'full_name': 'This row was matched due to the...
8 Maria Santos msantos@columbia.edu Columbia University Assistant Professor NaN NaN NaN NaN NaN
9 Thomas Lee tlee@harvard.edu Harvard Postdoc Tom Lee thomaslee@fas.harvard.edu tomlee-research Harvard SEAS {'full_name': 'This row was matched due to the...
10 Jennifer Park jpark@yale.edu Yale PhD Candidate NaN NaN NaN NaN NaN
11 Christopher Davis cdavis@upenn.edu UPenn Professor NaN NaN NaN NaN NaN
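
To close the loop on the use case (one recruiting email per person), the merged table and the List B leftovers can be combined into a single outreach list. The sketch below uses small stand-in frames with the notebook's column names rather than the real results. Preferring the List A email for matched contacts is an assumption made for illustration, and build_outreach_list is a hypothetical helper.

```python
import pandas as pd

# Stand-in data mirroring the notebook's columns (not the full results).
results_df = pd.DataFrame({
    "name": ["Thomas Lee", "David Kim"],
    "email": ["tlee@harvard.edu", "dkim@uw.edu"],
    "full_name": ["Tom Lee", None],
    "personal_email": ["thomaslee@fas.harvard.edu", None],
})
unique_to_b = pd.DataFrame({
    "full_name": ["Wei Zhang"],
    "personal_email": ["wzhang@outlook.com"],
})


def build_outreach_list(results_df, unique_to_b):
    """Return one (name, email) row per person across both lists."""
    # Every row of the merged table is one List A person (matched or not);
    # assumption: use the List A email for them.
    from_a = results_df[["name", "email"]].copy()
    # People found only in List B keep their List B name and email.
    from_b = unique_to_b.rename(
        columns={"full_name": "name", "personal_email": "email"}
    )[["name", "email"]]
    outreach = pd.concat([from_a, from_b], ignore_index=True)
    # Drop rows without a usable email, then guard against repeated addresses.
    outreach = outreach[outreach["email"].fillna("").str.strip() != ""]
    return outreach.drop_duplicates(subset="email").reset_index(drop=True)


if __name__ == "__main__":
    print(build_outreach_list(results_df, unique_to_b))
```

In the notebook itself, the same function could be called with the real results_df and unique_to_b from the cells above.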