Dedupe
dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. The duplicate criterion is semantic and LLM-powered: agents reason over the data and, when needed, search the web for external information to establish equivalence. This handles abbreviations, name variations, job changes, and entity relationships that no string similarity threshold can capture.
Examples
from everyrow.ops import dedupe
result = await dedupe(
input=crm_data,
equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())
The equivalence_relation is natural language. Be as specific as you need:
result = await dedupe(
input=researchers,
equivalence_relation="""
Two rows are duplicates if they're the same person, even if:
- They changed jobs (different org/email)
- Name is abbreviated (A. Smith vs Alex Smith)
- There are typos (Naomi vs Namoi)
- They use a nickname (Bob vs Robert)
""",
)
print(result.data.head())
What you get back
Three columns added to your data:
equivalence_class_id— rows with the same ID are duplicates of each otherequivalence_class_name— human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)selected— True for the canonical record in each cluster (usually the most complete one)
To get just the deduplicated rows:
deduped = result.data[result.data["selected"] == True]
Example
Input:
| name | org | |
|---|---|---|
| A. Butoi | Rycolab | a.butoi@edu |
| Alexandra Butoi | Ryoclab | — |
| Namoi Saphra | — | nsaphra@alumni |
| Naomi Saphra | Harvard | nsaphra@harvard.edu |
Output (selected rows only):
| name | org | |
|---|---|---|
| Alexandra Butoi | Rycolab | a.butoi@edu |
| Naomi Saphra | Harvard | nsaphra@harvard.edu |
Parameters
| Name | Type | Description |
|---|---|---|
input |
DataFrame | Data with potential duplicates |
equivalence_relation |
str | What makes two rows duplicates |
session |
Session | Optional, auto-created if omitted |
Performance
| Rows | Time | Cost |
|---|---|---|
| 200 | ~90 sec | ~$0.40 |
| 500 | ~2 min | ~$1.67 |
| 2,000 | ~8 min | ~$7 |
Case studies
- CRM Deduplication — 500 rows down to 124 (75% were duplicates)
- Researcher Deduplication — 98% accuracy handling career changes and typos