Dedupe

dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. The duplicate criterion is semantic and LLM-powered: agents reason over the data and, when needed, search the web for external information to establish equivalence. This handles abbreviations, name variations, job changes, and entity relationships that no string similarity threshold can capture.

Examples

from everyrow.ops import dedupe

result = await dedupe(
    input=crm_data,
    equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())

The equivalence_relation is natural language. Be as specific as you need:

result = await dedupe(
    input=researchers,
    equivalence_relation="""
        Two rows are duplicates if they're the same person, even if:
        - They changed jobs (different org/email)
        - Name is abbreviated (A. Smith vs Alex Smith)
        - There are typos (Naomi vs Namoi)
        - They use a nickname (Bob vs Robert)
    """,
)
print(result.data.head())
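The snippets above use top-level await, which works in a notebook or an async REPL but is a syntax error in a plain Python script. A minimal sketch of wrapping the same call for script use follows. The function name dedupe_crm and the CSV path are placeholders rather than part of the library, and the sketch assumes dedupe needs nothing beyond the arguments shown above.

```python
import asyncio

import pandas as pd


async def dedupe_crm(crm_data: pd.DataFrame) -> pd.DataFrame:
    """Run dedupe on a CRM table and return the annotated DataFrame."""
    # Imported lazily so this module loads even where everyrow is not installed
    from everyrow.ops import dedupe

    result = await dedupe(
        input=crm_data,
        equivalence_relation="Two entries are duplicates if they represent the same legal entity",
    )
    return result.data


# In a script, drive the coroutine with asyncio.run, for example:
# deduped = asyncio.run(dedupe_crm(pd.read_csv("crm.csv")))  # "crm.csv" is a placeholder path
```

asyncio.run starts an event loop, runs the coroutine to completion, and closes the loop, so it should be called once at the entry point of the script rather than inside other async code.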

What you get back

Three columns added to your data:

  • equivalence_class_id — rows with the same ID are duplicates of each other
  • equivalence_class_name — human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
  • selected — True for the canonical record in each cluster (usually the most complete one)

To get just the deduplicated rows:

deduped = result.data[result.data["selected"] == True]
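Before dropping rows, it can help to review what was grouped together. The sketch below uses a small hand-built DataFrame as a stand-in for result.data: the three column names come from the list above, while the ID values and the selected flags are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for result.data; ID values and selected flags are illustrative only
data = pd.DataFrame(
    {
        "name": ["A. Butoi", "Alexandra Butoi", "Namoi Saphra", "Naomi Saphra"],
        "equivalence_class_id": [0, 0, 1, 1],
        "equivalence_class_name": [
            "Alexandra Butoi",
            "Alexandra Butoi",
            "Naomi Saphra",
            "Naomi Saphra",
        ],
        "selected": [False, True, False, True],
    }
)

# Number of rows in each cluster, keyed by ID and human-readable label
cluster_sizes = data.groupby(["equivalence_class_id", "equivalence_class_name"]).size()
print(cluster_sizes)

# Rows that filtering on `selected` would drop, with the cluster each one belongs to
dropped = data.loc[~data["selected"], ["name", "equivalence_class_name"]]
print(dropped)

# Sanity check: each cluster should contribute exactly one canonical row
selected_per_cluster = data.groupby("equivalence_class_id")["selected"].sum()
assert (selected_per_cluster == 1).all()
```

Scanning the dropped rows next to their cluster labels is a quick way to spot a merge you disagree with before committing to the deduplicated table.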

Example

Input:

name             org      email
A. Butoi         Rycolab  a.butoi@edu
Alexandra Butoi  Ryoclab
Namoi Saphra              nsaphra@alumni
Naomi Saphra     Harvard  nsaphra@harvard.edu

Output (selected rows only):

name             org      email
Alexandra Butoi  Rycolab  a.butoi@edu
Naomi Saphra     Harvard  nsaphra@harvard.edu

Parameters

Name                  Type       Description
input                 DataFrame  Data with potential duplicates
equivalence_relation  str        What makes two rows duplicates
session               Session    Optional, auto-created if omitted

Performance

Rows   Time     Cost
200    ~90 sec  ~$0.40
500    ~2 min   ~$1.67
2,000  ~8 min   ~$7
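For a table size that falls between the measured points, a rough budget can be read off by linear interpolation. This is only a sketch: the helper estimate_cost is hypothetical, the figures are the approximate ones from the table above, and nothing here says how cost actually scales, so the helper refuses to extrapolate outside the measured range.

```python
# (rows, approximate cost in USD) copied from the Performance table
MEASURED = [(200, 0.40), (500, 1.67), (2000, 7.00)]


def estimate_cost(rows: int) -> float:
    """Linearly interpolate cost between the two nearest measured row counts."""
    lo_rows, hi_rows = MEASURED[0][0], MEASURED[-1][0]
    if not lo_rows <= rows <= hi_rows:
        raise ValueError(f"no measurements outside {lo_rows}-{hi_rows} rows")
    # Walk consecutive pairs of measurements and interpolate within the matching one
    for (r0, c0), (r1, c1) in zip(MEASURED, MEASURED[1:]):
        if r0 <= rows <= r1:
            fraction = (rows - r0) / (r1 - r0)
            return c0 + fraction * (c1 - c0)
    raise AssertionError("unreachable: the range check above guarantees a match")


print(f"~${estimate_cost(1000):.2f}")
```

Cost per row is not constant across the table (about $0.002 per row at 200 rows and about $0.0035 at 2,000), so a single per-row rate would misestimate small or large tables.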

Case studies