Dedupe

dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. The duplicate criterion is semantic and LLM-powered: agents reason over the data and, when needed, search the web for external information to establish equivalence. This handles abbreviations, name variations, job changes, and entity relationships that string-similarity thresholds typically miss.

Examples

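from everyrow.ops import dedupe

# crm_data is a DataFrame that may contain duplicate rows.
# Top-level await works in a notebook; in a script, call dedupe
# from inside an async function instead.
result = await dedupe(
    input=crm_data,
    equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())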

The equivalence_relation is natural language. Be as specific as you need:

result = await dedupe(
    input=researchers,
    equivalence_relation="""
        Two rows are duplicates if they're the same person, even if:
        - They changed jobs (different org/email)
        - Name is abbreviated (A. Smith vs Alex Smith)
        - There are typos (Naomi vs Namoi)
        - They use a nickname (Bob vs Robert)
    """,
)
print(result.data.head())
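
When the list of exceptions grows, it can be easier to keep the rules in a Python list and assemble the relation string from it. The following is only a sketch of that idea: the `rules` list and the way it is joined are illustrative choices, not something everyrow requires, and the resulting string would be passed as `equivalence_relation` exactly as in the call above.

```python
# Illustrative only: the rules are copied from the example above.
rules = [
    "They changed jobs (different org/email)",
    "Name is abbreviated (A. Smith vs Alex Smith)",
    "There are typos (Naomi vs Namoi)",
    "They use a nickname (Bob vs Robert)",
]

# Build one natural-language relation with a bullet per rule.
header = "Two rows are duplicates if they're the same person, even if:"
equivalence_relation = header + "\n" + "\n".join(f"- {rule}" for rule in rules)

print(equivalence_relation)
```

Keeping the rules in a list makes it straightforward to add, remove, or reorder conditions between runs without editing a long string by hand.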

What you get back

Three columns are added to your data:

equivalence_class_id — rows with the same ID are duplicates of each other
equivalence_class_name — human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
selected — True for the canonical record in each cluster (usually the most complete one)

To get just the deduplicated rows:

deduped = result.data[result.data["selected"]]
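
To make the three columns concrete, here is a small sketch that works on a hand-built stand-in for `result.data`. The company names and values are made up, and the integer cluster IDs and boolean `selected` column are assumptions about the column types; only the column names come from the description above.

```python
import pandas as pd

# Hand-built stand-in for result.data; all values are invented for illustration.
data = pd.DataFrame(
    {
        "name": ["Acme Corp", "ACME Corporation", "Globex", "Acme Corp."],
        "equivalence_class_id": [0, 0, 1, 0],  # assumed to be integers
        "equivalence_class_name": ["Acme Corp", "Acme Corp", "Globex", "Acme Corp"],
        "selected": [False, True, True, False],  # assumed to be booleans
    }
)

# Keep only the canonical row of each cluster.
deduped = data[data["selected"]]

# Count how many input rows fell into each cluster.
cluster_sizes = data.groupby("equivalence_class_name").size()

# Share of input rows that were dropped as duplicates.
duplicate_share = 1 - len(deduped) / len(data)

print(deduped[["name", "equivalence_class_name"]])
print(cluster_sizes)
print(f"{duplicate_share:.0%} of rows were duplicates")
```

Grouping by `equivalence_class_id` or `equivalence_class_name` before filtering is a quick way to review which rows were merged into each cluster.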

Example

Input:

name	org	email
A. Butoi	Rycolab	a.butoi@edu
Alexandra Butoi	Ryoclab	—
Namoi Saphra	—	nsaphra@alumni
Naomi Saphra	Harvard	nsaphra@harvard.edu

Output (selected rows only):

name	org	email
Alexandra Butoi	Rycolab	a.butoi@edu
Naomi Saphra	Harvard	nsaphra@harvard.edu

Parameters

Name	Type	Description
`input`	DataFrame	Data with potential duplicates
`equivalence_relation`	str	What makes two rows duplicates
`session`	Session	Optional, auto-created if omitted

Performance

Rows	Time	Cost
200	~90 sec	~$0.40
500	~2 min	~$1.67
2,000	~8 min	~$7
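
For a rough budget at other sizes, one option is to interpolate linearly between the three data points in the table. This is a back-of-the-envelope sketch, not an official pricing formula: `estimate_cost` is a hypothetical helper, and actual cost will depend on the data and on how much web research the agents need.

```python
# Reference points from the table above: (rows, approximate cost in USD).
REFERENCE_POINTS = [(200, 0.40), (500, 1.67), (2000, 7.00)]


def estimate_cost(rows: int) -> float:
    """Rough cost estimate by linear interpolation between the reference points."""
    first_rows, first_cost = REFERENCE_POINTS[0]
    if rows <= first_rows:
        # Below the smallest data point: scale the first point proportionally.
        return first_cost * rows / first_rows
    for (r0, c0), (r1, c1) in zip(REFERENCE_POINTS, REFERENCE_POINTS[1:]):
        if rows <= r1:
            return c0 + (c1 - c0) * (rows - r0) / (r1 - r0)
    # Beyond the largest data point: extend at that point's per-row rate.
    last_rows, last_cost = REFERENCE_POINTS[-1]
    return last_cost * rows / last_rows


for n in (200, 1000, 2000):
    print(n, round(estimate_cost(n), 2))
```

The table implies a per-row cost that rises with size (about $0.002 per row at 200 rows and $0.0035 at 2,000), so estimates beyond 2,000 rows are the least reliable.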

Case studies

CRM Deduplication — 500 rows down to 124 (75% were duplicates)
Researcher Deduplication — 98% accuracy handling career changes and typos