How to merge datasets without common ID in Python¶
This notebook demonstrates using everyrow's merge() utility to join contact-level data with organization-level data before CRM upload.
Use Case: Your data lives across multiple tables—contacts in one, company information in another. Before uploading to HubSpot/Salesforce, you need a flattened export where each contact row includes the associated company context.
Why everyrow? Company names may not match exactly between tables ("Acme Corp" vs "Acme Corporation" vs "ACME"). The merge() function handles these variations semantically.
In [1]:
import asyncio
from dotenv import load_dotenv
load_dotenv()
import pandas as pd
from everyrow import create_session
from everyrow.ops import merge
Load Data¶
In [2]:
# Contacts table
contacts_df = pd.read_csv("../data/crm_contacts.csv")
print(f"Contacts: {len(contacts_df)}")
contacts_df
Out[2]:
In [3]:
# Funds/Companies table with enriched information
funds_df = pd.read_csv("../data/crm_funds.csv")
print(f"Funds: {len(funds_df)}")
funds_df
Out[3]:
Define Merge Task¶
In [4]:
MERGE_TASK = """
Match contacts to their associated fund/company.
Company names may vary between tables:
- "Bridgewater" should match "Bridgewater Associates"
- "Citadel LLC" should match "Citadel"
- "Two Sigma Investments" should match "Two Sigma"
- "D.E. Shaw" should match "D. E. Shaw & Co."
- "Point72 Asset Mgmt" should match "Point72 Asset Management"
- "Renaissance Tech" should match "Renaissance Technologies"
- "Elliott Mgmt" should match "Elliott Management"
Match based on the core company name, ignoring variations in:
- Legal suffixes (LLC, Inc, LP, & Co.)
- Abbreviations (Mgmt = Management, Tech = Technologies)
- Additional descriptors (Associates, Capital, Asset Management)
"""
Run the Merge¶
In [5]:
async def run_merge():
async with create_session(name="CRM Merge Workflow") as session:
print(f"Session URL: {session.get_url()}")
print("\nMerging contacts with fund data...\n")
result = await merge(
session=session,
task=MERGE_TASK,
left_table=contacts_df,
right_table=funds_df,
merge_on_left="company_name",
merge_on_right="fund_name",
)
return result.data
results_df = await run_merge()
Review Merged Data¶
In [6]:
# Count successful merges
merged = results_df[results_df["fund_name"].notna()]
unmerged = results_df[results_df["fund_name"].isna()]
print(f"\n{'='*60}")
print(f"MERGE RESULTS")
print(f"{'='*60}")
print(f" Total contacts: {len(results_df)}")
print(f" Successfully merged: {len(merged)}")
print(f" Unmatched: {len(unmerged)}")
In [7]:
# Show CRM-ready output
print("\nCRM-READY CONTACT LIST:")
print("-" * 80)
for _, row in merged.iterrows():
print(f"\n{row['contact_name']} - {row['title']}")
print(f" Email: {row['email']}")
print(f" Fund: {row['fund_name']} (${row['aum_billions']}B AUM)")
print(f" Strategy: {row['strategy']} | Research Intensity: {row['research_intensity']}")
print(f" Outreach Hook: {row['outreach_hook']}")
In [8]:
# Filter to high-priority contacts (high research intensity)
high_priority = merged[merged["research_intensity"].isin(["High", "Very High"])]
print(f"\n{'='*60}")
print(f"HIGH PRIORITY CONTACTS ({len(high_priority)} contacts)")
print(f"{'='*60}")
for _, row in high_priority.iterrows():
print(f" {row['contact_name']:20} | {row['fund_name']:25} | {row['research_intensity']}")
In [9]:
# Export CRM-ready file
crm_export = merged[["contact_name", "title", "email", "fund_name", "aum_billions", "strategy", "research_intensity", "outreach_hook"]]
crm_export.to_csv("crm_ready_contacts.csv", index=False)
print(f"\nExported {len(crm_export)} contacts to crm_ready_contacts.csv")
In [10]:
# Full results
results_df
Out[10]: