How to use LLMs to deduplicate CRM Data¶
This notebook demonstrates how to use the everyrow SDK's dedupe operation to deduplicate messy CRM data using AI-powered semantic matching.
Setup¶
First, let's import the required libraries and load environment variables. You should have a COHORT_API_KEY environment variable set in your .env file.
from textwrap import dedent
import pandas as pd
from dotenv import load_dotenv
from everyrow.ops import dedupe
load_dotenv()
Load the Data¶
Let's load the messy CRM data and take a look at some examples of duplicate entries.
data = pd.read_csv("../data/case_01_crm_data.csv", engine="pyarrow")
print(f"Total records: {len(data)}")
data.sort_values(by="company_name").head(15)
Notice how the data has several issues:
- companies have inconsistent names, e.g. AbbVie Inc, AbbVie, AbbVie Pharmaceutical
- contacts and email addresses are sometimes missing
Traditional deduplication would miss these semantic duplicates in the company name field.
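To see why, here is a minimal sketch (with made-up rows mirroring the examples above) of exact-match deduplication failing on these name variants:

```python
import pandas as pd

# Hypothetical sample mirroring the CRM data above
crm = pd.DataFrame({
    "company_name": ["AbbVie Inc", "AbbVie", "AbbVie Pharmaceutical", "Zoetis"],
    "contact_name": ["Ann Lee", None, "Ann Lee", "Bo Chan"],
})

# Exact-match deduplication keeps all three AbbVie variants,
# because the strings differ character-for-character
exact = crm.drop_duplicates(subset="company_name")
print(len(exact))  # 4 — no rows dropped
```

All three AbbVie rows survive, even though they plainly refer to the same company.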
Run everyrow.io/dedupe¶
Now let's use the everyrow SDK to deduplicate the data. We define an equivalence relation in natural language that tells the AI what makes two entries duplicates.
equivalence_relation = dedent("""
Two entries are duplicates if they include data for the same legal entity.
""")
print("Deduplicating CRM data...\n")
result = await dedupe(
input=data,
equivalence_relation=equivalence_relation,
)
This returns a dataframe that assigns each row to a duplicate cluster and flags which row in each cluster should be kept and which should be removed as duplicates.
everyrow.io/dedupe automatically selects the best and most complete entry in each cluster.
result.data.sort_values(
by=["equivalence_class_name", "equivalence_class_id", "selected"],
ascending=[True, True, False]
).head(10)
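Because the result is a plain DataFrame, you can inspect the clusters with ordinary pandas. A small sketch, using mock values but the same column names as the cell above:

```python
import pandas as pd

# Mock of the dedupe output schema (illustrative values only)
res = pd.DataFrame({
    "equivalence_class_id": [0, 0, 0, 1],
    "equivalence_class_name": ["AbbVie", "AbbVie", "AbbVie", "Zoetis"],
    "selected": [True, False, False, True],
})

# Per cluster: how many rows it contains, and how many were selected
sizes = res.groupby("equivalence_class_id")["selected"].agg(["size", "sum"])
print(sizes)  # cluster 0 has 3 rows, exactly 1 selected
```

Exactly one row per cluster carries `selected=True`, so cluster sizes tell you how many duplicates were collapsed.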
We can now easily filter out all duplicates. You could even go further and use Python to merge the duplicates that were grouped together into a single entry.
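A minimal sketch of such a merge, on hypothetical clustered rows: instead of keeping only the selected row, take the first non-null value in each cluster so missing fields get filled in from the duplicates.

```python
import pandas as pd

# Hypothetical clustered output; the second AbbVie row supplies the
# email address that the first one is missing
res = pd.DataFrame({
    "equivalence_class_id": [0, 0, 1],
    "company_name": ["AbbVie Inc", "AbbVie", "Zoetis"],
    "email_address": [None, "ann@abbvie.com", "bo@zoetis.com"],
})

# GroupBy.first() skips nulls, so each cluster collapses to one row
# with the first non-null value per column
merged = res.groupby("equivalence_class_id", as_index=False).first()
print(merged["email_address"].tolist())  # ['ann@abbvie.com', 'bo@zoetis.com']
```

This yields one row per cluster that is at least as complete as any single duplicate.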
View Results¶
Let's examine the deduplicated data.
deduped = result.data[result.data["selected"]]
deduped[["company_name", "contact_name", "email_address"]]
Cost and Runtime¶
Running this cost $3.52 and took 102 seconds.
Summary¶
The everyrow.io/dedupe operation:
- Takes a DataFrame with potential duplicates
- Uses AI to understand semantic similarity based on your equivalence relation
- Returns a DataFrame with clusters of duplicates and the best candidate selected for each cluster
This approach catches duplicates that traditional fuzzy matching would miss, such as:
- Company name variations (abbreviations, typos, legal suffixes)
- Partial data matches (when some fields are missing)
- Semantic equivalence ("Google" vs "Alphabet Inc.")