
How to use LLMs to deduplicate CRM Data¶

This notebook demonstrates how to use the everyrow SDK's dedupe operation to deduplicate messy CRM data using AI-powered semantic matching.

Setup¶

First, let's import the required libraries and load environment variables. You should have a COHORT_API_KEY environment variable set in your .env file.

In [ ]:
from textwrap import dedent

import pandas as pd
from dotenv import load_dotenv

from everyrow.ops import dedupe

load_dotenv()

Load the Data¶

Let's load the messy CRM data and take a look at some examples of duplicate entries.

In [13]:
data = pd.read_csv("../data/case_01_crm_data.csv", engine="pyarrow")

print(f"Total records: {len(data)}")
data.sort_values(by="company_name").head(15)
Total records: 500
Out[13]:
company_name contact_name email_address
289 AAPL Sarah Jobs contact@apple-corp.com
197 AB InBev None ops@abinbev-beer.com
257 AB Inbev Inc Michel Doukeris ops@abinbev-beer.com
490 AMAT Gary sales@amat-semi.com
167 AMD Corp None tech@amd-processors.com
56 AMD Inc Lisa Su-Wang tech@amd-processors.com
372 AMSL None litho@asml-semi.com
86 ARM Inc Rene Haas-V None
128 ASML Holding N.V. Peter Van-Lith None
486 ASML Inc Peter Van-Lith litho@asml-semi.com
334 ASML NV Peter Van-Lith None
291 AbbVie Inc. Richard Gonzales info@abbvie-bio.com
355 AbbVie Pharmaceutical Richard Gonzales None
419 Abbvie None info@abbvie-bio.com
300 Abvie Inc Richard Gonzales None

Notice how the data has several issues:

  • company names are inconsistent, e.g. AbbVie Inc., Abbvie, AbbVie Pharmaceutical, Abvie Inc
  • contact names and email addresses are sometimes missing

Traditional deduplication would miss these semantic duplicates in the company name field.
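To see why, here is a minimal sketch using only the standard library's `difflib`. A typo-level variant scores high on plain string similarity, but a semantic duplicate like a ticker symbol versus the full company name scores low, so any similarity threshold that catches one will miss the other. (This is an illustration of the limitation, not part of the everyrow SDK.)

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Plain string similarity, as a naive fuzzy matcher would compute it."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A typo-level duplicate scores high on string similarity...
print(similarity("Abvie Inc", "AbbVie Inc."))

# ...but a semantic duplicate (ticker vs. company name) scores low,
# so threshold-based fuzzy matching treats these as distinct entities.
print(similarity("AAPL", "Apple Inc"))
```

No fixed threshold separates the two cases cleanly, which is why a semantic notion of equivalence is needed here.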

Run everyrow.io/dedupe¶

Now let's use the everyrow SDK to deduplicate the data. We define an equivalence relation in natural language that tells the AI what makes two entries duplicates.

In [ ]:
equivalence_relation = dedent("""
    Two entries are duplicates if they include data for the same legal entity.
""")

print("Deduplicating CRM data...\n")

result = await dedupe(
    input=data,
    equivalence_relation=equivalence_relation,
)

This returns a DataFrame that assigns each row to a duplicate cluster and marks which row in each cluster should be kept and which should be removed as duplicates.

everyrow.io/dedupe automatically selects the best and most complete entry in each cluster.

In [14]:
result.data.sort_values(
    by=["equivalence_class_name", "equivalence_class_id", "selected"],
    ascending=[True, True, False]
).head(10)
Out[14]:
company_name contact_name email_address equivalence_class_id selected equivalence_class_name
486 ASML Inc Peter Van-Lith litho@asml-semi.com 7a7cd704-d5dc-4830-85b8-e346b50fa2a6 True ASML
128 ASML Holding N.V. Peter Van-Lith None 7a7cd704-d5dc-4830-85b8-e346b50fa2a6 False ASML
334 ASML NV Peter Van-Lith None 7a7cd704-d5dc-4830-85b8-e346b50fa2a6 False ASML
372 AMSL None litho@asml-semi.com 7a7cd704-d5dc-4830-85b8-e346b50fa2a6 False ASML
291 AbbVie Inc. Richard Gonzales info@abbvie-bio.com e407c0b4-300f-4ef3-8644-3aa0f65b6daa True AbbVie Inc
300 Abvie Inc Richard Gonzales None e407c0b4-300f-4ef3-8644-3aa0f65b6daa False AbbVie Inc
355 AbbVie Pharmaceutical Richard Gonzales None e407c0b4-300f-4ef3-8644-3aa0f65b6daa False AbbVie Inc
419 Abbvie None info@abbvie-bio.com e407c0b4-300f-4ef3-8644-3aa0f65b6daa False AbbVie Inc
170 Accenture plc Julie Sweetland consult@accenture-global.com 224e560a-238f-4baf-be35-19494b4870f2 True Accenture - Julie Sweetland
225 Accenture Corp Julie Sweetland consult@accenture-global.com 224e560a-238f-4baf-be35-19494b4870f2 False Accenture - Julie Sweetland

We can now easily filter out all duplicates. You could even go further and use Python to merge each cluster of duplicates into a single, more complete entry.
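As a sketch of that merging step, the snippet below collapses each cluster to one row by taking the first non-null value per column. The toy DataFrame mimics the result columns shown above; the merge logic itself is an illustration using plain pandas, not part of the everyrow SDK.

```python
import pandas as pd

# Toy stand-in for dedupe output: one cluster whose rows have
# complementary gaps (column names mirror the result above).
clustered = pd.DataFrame({
    "equivalence_class_id": ["c1", "c1", "c1"],
    "company_name": ["ASML Inc", "ASML Holding N.V.", None],
    "contact_name": ["Peter Van-Lith", None, "Peter Van-Lith"],
    "email_address": [None, None, "litho@asml-semi.com"],
})

# GroupBy.first() returns the first non-null value in each column,
# so each cluster collapses to its most complete combined record.
merged = clustered.groupby("equivalence_class_id", as_index=False).first()
print(merged)
```

This keeps the email address from the third row even though the selected row lacked one, which is the main reason to merge rather than simply drop unselected rows.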

View Results¶

Let's examine the deduplicated data.

In [ ]:
deduped = result.data[result.data["selected"]]
deduped[["company_name", "contact_name", "email_address"]]
Out[ ]:
company_name contact_name email_address
2 Enel S.p.A. Flavio Cattaneo None
5 UPS Inc Carol Tome-Lee shipping@ups-logistics.com
10 Honeywell Inc Darius Adamczyk ops@honeywell-intl.com
12 Intuit Inc. None tax@intuit-finance.com
14 Marvell Technology, Inc. Matt Murphy-Jr None
... ... ... ...
482 General Motors Inc Mary Barra-Smith service@gm-auto.com
486 ASML Inc Peter Van-Lith litho@asml-semi.com
487 Airbus SE Guillaume Faury fleet@airbus-aero.com
489 Brookfield Corporation Bruce Flatt-Jr invest@brookfield-corp.com
491 Sony Group Corporation Kenichiro Yoshida media@sony-electronics.com

124 rows × 3 columns

Cost and Runtime¶

Running this cost $3.52 and took 102 seconds.

Summary¶

The everyrow.io/dedupe operation:

  1. Takes a DataFrame with potential duplicates
  2. Uses AI to understand semantic similarity based on your equivalence relation
  3. Returns a DataFrame with clusters of duplicates and the best candidate selected for each cluster

This approach catches duplicates that traditional fuzzy matching would miss, such as:

  • Company name variations (abbreviations, typos, legal suffixes)
  • Partial data matches (when some fields are missing)
  • Semantic equivalence ("Google" vs "Alphabet Inc.")