How to use LLMs to deduplicate CRM Data¶
This notebook demonstrates how to use the everyrow SDK's dedupe operation to deduplicate messy CRM data using AI-powered semantic matching.
Setup¶
First, let's import the required libraries and load environment variables. You should have a COHORT_API_KEY environment variable set in your .env file.
from textwrap import dedent
import pandas as pd
from dotenv import load_dotenv
from everyrow.ops import dedupe
load_dotenv()
Load the Data¶
Let's load the messy CRM data and take a look at some examples of duplicate entries.
data = pd.read_csv("../data/case_01_crm_data.csv", engine="pyarrow")
print(f"Total records: {len(data)}")
data.sort_values(by="company_name").head(15)
Notice how the data has several issues:
- companies have inconsistent names, e.g. AbbVie Inc, AbbVie, AbbVie Pharmaceutical
- contacts and email addresses are sometimes missing
Traditional deduplication would miss these semantic duplicates in the company name field.
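To see why, here is a minimal sketch (with made-up rows mirroring the examples above) of exact-match deduplication failing on these name variants:

```python
import pandas as pd

# Hypothetical sample mirroring the CRM data above
crm = pd.DataFrame({
    "company_name": ["AbbVie Inc", "AbbVie", "AbbVie Pharmaceutical", "Zoetis"],
    "contact_name": ["Ann Lee", None, "Ann Lee", "Bo Chan"],
})

# Exact-match deduplication keeps all three AbbVie variants,
# because the strings differ character-for-character
exact = crm.drop_duplicates(subset="company_name")
print(len(exact))  # 4 — no rows dropped
```

All three AbbVie rows survive, even though they plainly refer to the same company.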
Run everyrow.io/dedupe¶
Now let's use the everyrow SDK to deduplicate the data. We define an equivalence relation in natural language that tells the AI what makes two entries duplicates.
equivalence_relation = dedent("""
Two entries are duplicates if they include data for the same legal entity.
""")
print("Deduplicating CRM data...\n")
result = await dedupe(
input=data,
equivalence_relation=equivalence_relation,
)
This returns a dataframe that assigns each row to a duplicate cluster and flags which row in each cluster should be kept and which should be removed as duplicates.
everyrow.io/dedupe automatically selects the best and most complete entry in each cluster.
result.data.sort_values(
by=["equivalence_class_name", "equivalence_class_id", "selected"],
ascending=[True, True, False]
).head(10)
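Because the result is a plain DataFrame, you can inspect the clusters with ordinary pandas. A small sketch, using mock values but the same column names as the cell above:

```python
import pandas as pd

# Mock of the dedupe output schema (illustrative values only)
res = pd.DataFrame({
    "equivalence_class_id": [0, 0, 0, 1],
    "equivalence_class_name": ["AbbVie", "AbbVie", "AbbVie", "Zoetis"],
    "selected": [True, False, False, True],
})

# Per cluster: how many rows it contains, and how many were selected
sizes = res.groupby("equivalence_class_id")["selected"].agg(["size", "sum"])
print(sizes)  # cluster 0 has 3 rows, exactly 1 selected
```

Exactly one row per cluster carries `selected=True`, so cluster sizes tell you how many duplicates were collapsed.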
We can now easily filter out all duplicates. You could even go further and use Python to merge the duplicates that were grouped together into a single entry.
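A minimal sketch of such a merge, on hypothetical clustered rows: instead of keeping only the selected row, take the first non-null value in each cluster so missing fields get filled in from the duplicates.

```python
import pandas as pd

# Hypothetical clustered output; the second AbbVie row supplies the
# email address that the first one is missing
res = pd.DataFrame({
    "equivalence_class_id": [0, 0, 1],
    "company_name": ["AbbVie Inc", "AbbVie", "Zoetis"],
    "email_address": [None, "ann@abbvie.com", "bo@zoetis.com"],
})

# GroupBy.first() skips nulls, so each cluster collapses to one row
# with the first non-null value per column
merged = res.groupby("equivalence_class_id", as_index=False).first()
print(merged["email_address"].tolist())  # ['ann@abbvie.com', 'bo@zoetis.com']
```

This yields one row per cluster that is at least as complete as any single duplicate.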
View Results¶
Let's examine the deduplicated data.
deduped = result.data[result.data["selected"]]
deduped[["company_name", "contact_name", "email_address"]]
Cost and Runtime¶
Running this cost $3.52 and took 102 seconds.
Summary¶
The everyrow.io/dedupe operation:
- Takes a DataFrame with potential duplicates
- Uses AI to understand semantic similarity based on your equivalence relation
- Returns a DataFrame with clusters of duplicates and the best candidate selected for each cluster
This approach catches duplicates that traditional fuzzy matching would miss, such as:
- Company name variations (abbreviations, typos, legal suffixes)
- Partial data matches (when some fields are missing)
- Semantic equivalence ("Google" vs "Alphabet Inc.")