
Understanding agent_map Effort Levels

Every call to agent_map dispatches an AI agent per row of your DataFrame. The effort_level parameter controls what kind of agent runs and which model powers it:

| Effort | Model | Iterations | Web Research | What it does |
|---|---|---|---|---|
| LOW (default) | Gemini 3 Flash (minimal) | 0 | No | Single LLM call — quick classification or extraction |
| MEDIUM | Gemini 3 Flash (low) | up to 5 | Yes | Agentic research — searches the web, reads pages, synthesises |
| HIGH | Claude 4.6 Opus (low) | up to 10 | Yes | Deep research — more iterations, stronger model |

LOW is fundamentally different from the other two: it's a single model call with no tool use, no web browsing, and no iterative reasoning. It's the right choice when you just need an LLM to look at a row and produce a quick answer.

MEDIUM and HIGH run full research agents that search, read, and cross-reference sources. For these, model selection matters a lot — and we choose models based on their position on the Pareto frontier of accuracy, cost, and speed. This notebook shows how.
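
For orientation, here's a minimal sketch of what the two modes look like from Python. The companies_df DataFrame is a placeholder, and the "medium" value passed to effort_level is an assumed spelling of the preset (the override example at the end of this notebook only shows effort_level=None); check the agent_map API reference for the exact accepted values.

from everyrow.ops import agent_map

# LOW (the default): one plain LLM call per row, no tools, no web research.
labels = await agent_map(
    task="Classify each company as B2B or B2C",
    input=companies_df,  # placeholder DataFrame
)

# MEDIUM: a per-row research agent (Gemini 3 Flash (low), up to 5 iterations).
funding = await agent_map(
    task="Find each company's most recent funding round",
    input=companies_df,
    effort_level="medium",  # assumed spelling of the preset; see the API reference
)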

1. Fetch benchmark data

The Deep Research Bench (DRB) evaluates models on agentic web-research tasks: finding datasets, compiling numbers, validating claims, and other information-retrieval-heavy work. This is exactly what MEDIUM and HIGH effort agents do, which makes DRB the right benchmark for picking those models.

(DRB scores are less meaningful for LOW, since that effort level doesn't do any research — it's just a single LLM call.)

In [1]:
import requests
import pandas as pd

url = "https://rguraxphqescakvvzmju.supabase.co/rest/v1/rpc/get_average_scores_by_model"

PUBLIC_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InJndXJheHBocWVzY2FrdnZ6bWp1Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDAwNjkyODUsImV4cCI6MjA1NTY0NTI4NX0.8O7hbuvWSGGSuvjANtgrf3BihWP-aLdan3x48sc7kHk"

headers = {
    "apikey": PUBLIC_API_KEY,
    "authorization": f"Bearer {PUBLIC_API_KEY}",
    "content-type": "application/json",
    "content-profile": "public",
}

response = requests.post(url, headers=headers, json={"min_num_of_distinct_instances": 150})
response.raise_for_status()

df = pd.DataFrame(response.json())

# Friendly display names (same mapping used on evals.futuresearch.ai)
MODEL_NAME_MAP = {
    "GROK_4": "Grok 4",
    "CLAUDE_3_7_SONNET_THINKING": "Claude 3.7 Sonnet (medium)",
    "CLAUDE_4_SONNET_THINKING": "Claude 4 Sonnet (medium)",
    "CLAUDE_4_OPUS_THINKING": "Claude 4 Opus (medium)",
    "CLAUDE_4_1_OPUS_THINKING": "Claude 4.1 Opus (medium)",
    "CLAUDE_4_5_OPUS_LOW": "Claude 4.5 Opus (low)",
    "CLAUDE_4_6_OPUS_LOW": "Claude 4.6 Opus (low)",
    "CLAUDE_4_6_OPUS_HIGH": "Claude 4.6 Opus (high)",
    "CLAUDE_4_5_OPUS_HIGH": "Claude 4.5 Opus (high)",
    "CLAUDE_4_5_OPUS_THINKING": "Claude 4.5 Opus (medium)",
    "CLAUDE_4_5_SONNET_MINIMAL": "Claude 4.5 Sonnet (low)",
    "CLAUDE_4_5_HAIKU_MINIMAL": "Claude 4.5 Haiku (low)",
    "GEMINI_2_5_PRO": "Gemini 2.5 Pro (dynamic)",
    "GEMINI_3_FLASH_MINIMAL": "Gemini 3 Flash (minimal)",
    "GEMINI_3_FLASH_LOW": "Gemini 3 Flash (low)",
    "GEMINI_3_FLASH_HIGH": "Gemini 3 Flash (high)",
    "GEMINI_3_PREVIEW_LOW": "Gemini 3 Pro (low)",
    "O3": "o3",
    "GPT_5_MINIMAL": "GPT-5 (minimal)",
    "GPT_5_LOW": "GPT-5 (low)",
    "GPT_5": "GPT-5 (medium)",
    "GPT_5_HIGH": "GPT-5 (high)",
    "GPT_5_1_LOW": "GPT-5.1 (low)",
    "GPT_5_2_LOW": "GPT-5.2 (low)",
}
df["model_name"] = df["model_name"].map(lambda x: MODEL_NAME_MAP.get(x, x))

print(f"Loaded {len(df)} model configurations from Deep Research Bench")
df.sort_values("average_score", ascending=False).reset_index(drop=True)
Loaded 22 model configurations from Deep Research Bench
Out[1]:
| | model_name | average_score | total_runs | average_spend | average_runtime_seconds |
|---|---|---|---|---|---|
| 0 | Claude 4.6 Opus (high) | 0.550008 | 335 | 0.552964 | 460.783344 |
| 1 | Claude 4.5 Opus (low) | 0.548509 | 329 | 0.312314 | 412.932676 |
| 2 | Claude 4.5 Opus (high) | 0.547884 | 337 | 0.457780 | 375.158008 |
| 3 | Claude 4.6 Opus (low) | 0.530857 | 330 | 0.242862 | 158.676590 |
| 4 | Gemini 3 Flash (minimal) | 0.503644 | 338 | 0.102952 | 680.680925 |
| 5 | Gemini 3 Flash (low) | 0.499314 | 330 | 0.050674 | 389.948922 |
| 6 | GPT-5 (low) | 0.496311 | 330 | 0.251314 | 525.672959 |
| 7 | GPT-5 (minimal) | 0.488744 | 329 | 0.235946 | 502.485720 |
| 8 | GPT-5 (medium) | 0.485694 | 299 | 0.345617 | 468.550553 |
| 9 | Claude 4.1 Opus (medium) | 0.483028 | 329 | 1.203487 | 246.791378 |
| 10 | GPT-5 (high) | 0.481404 | 327 | 0.388968 | 582.383317 |
| 11 | Gemini 3 Flash (high) | 0.478604 | 338 | 0.141666 | 630.898910 |
| 12 | Claude 4.5 Sonnet (low) | 0.475172 | 294 | 0.356755 | 625.251226 |
| 13 | Grok 4 | 0.472558 | 320 | 0.485631 | 726.129258 |
| 14 | Claude 4 Opus (medium) | 0.467930 | 329 | 1.635857 | 284.182346 |
| 15 | Claude 4 Sonnet (medium) | 0.465519 | 329 | 0.454494 | 280.194073 |
| 16 | Gemini 3 Pro (low) | 0.462971 | 330 | NaN | 392.279528 |
| 17 | Claude 4.5 Haiku (low) | 0.454502 | 329 | 0.098306 | 431.052201 |
| 18 | o3 | 0.452155 | 329 | 0.118776 | 240.988206 |
| 19 | GPT-5.1 (low) | 0.427902 | 327 | 0.039637 | 271.585667 |
| 20 | Gemini 2.5 Pro (dynamic) | 0.415192 | 310 | 0.110789 | 240.193460 |
| 21 | GPT-5.2 (low) | 0.411199 | 322 | 0.261318 | 483.193635 |

2. Compute Pareto frontiers

A model sits on the Pareto frontier when no other model is both cheaper (or faster) and more accurate. Everything off the frontier is strictly dominated.

After computing the raw frontier we apply a pruning step: if moving to a cheaper/faster model only saves a tiny amount on the x-axis but sacrifices a disproportionate amount of accuracy, we drop it. This uses a normalised-slope threshold (the same approach used on the DRB dashboard) so the frontier traces a meaningful trade-off curve.

In [2]:
import numpy as np


def pareto_frontier(df, x_col, y_col, max_normalised_slope=4):
    """Compute and prune a Pareto frontier (lower x is better, higher y is better)."""
    valid = df.dropna(subset=[x_col, y_col]).copy()
    if valid.empty:
        return pd.DataFrame(), pd.DataFrame()

    # Raw frontier: sort by x ascending, keep points that set a new best y
    sorted_df = valid.sort_values(x_col)
    raw_frontier = []
    best_y = -np.inf
    for _, row in sorted_df.iterrows():
        if row[y_col] > best_y:
            best_y = row[y_col]
            raw_frontier.append(row)
    raw_frontier_df = pd.DataFrame(raw_frontier)

    # Prune: walk from best score downward, drop points whose normalised
    # score-loss / x-savings ratio exceeds the threshold
    x_range = valid[x_col].max() - valid[x_col].min() or 1
    y_range = valid[y_col].max() - valid[y_col].min() or 1

    descending = raw_frontier_df.sort_values(x_col, ascending=False)
    kept = [descending.iloc[0]]
    for i in range(1, len(descending)):
        prev = kept[-1]
        curr = descending.iloc[i]
        dx = (prev[x_col] - curr[x_col]) / x_range
        dy = (prev[y_col] - curr[y_col]) / y_range
        if dx > 0 and dy / dx <= max_normalised_slope:
            kept.append(curr)
    pruned = pd.DataFrame(kept).sort_values(x_col)

    return valid, pruned
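
As a quick sanity check, here's the pruning behaviour on a tiny hypothetical frame (the points and names below are made up for illustration): "C" is marginally cheaper than "B" but gives up a disproportionate amount of accuracy, so the pruning step drops it, while "D" is dominated and never makes the raw frontier.

toy = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "average_spend": [1.00, 0.50, 0.45, 0.60],
    "average_score": [0.90, 0.80, 0.40, 0.50],
})
_, toy_frontier = pareto_frontier(toy, "average_spend", "average_score")
print(toy_frontier["model_name"].tolist())  # ['B', 'A']: C is pruned, D is dominated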

3. Best accuracy per dollar

The green rings mark the models used by everyrow's MEDIUM and HIGH effort levels. Both sit on or near the Pareto frontier — you get the best accuracy available at their price points.

In [3]:
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

EVERYROW_MODELS = {
    "Gemini 3 Flash (low)": "MEDIUM",
    "Claude 4.6 Opus (low)": "HIGH",
}


def plot_pareto(all_pts, frontier, x_col, y_col, x_label, y_label, title):
    fig, ax = plt.subplots(figsize=(11, 7))

    frontier_names = set(frontier["model_name"])
    non_frontier = all_pts[~all_pts["model_name"].isin(frontier_names)]

    # Non-frontier points
    ax.scatter(
        non_frontier[x_col], non_frontier[y_col],
        color="#636EFA", s=60, alpha=0.7, zorder=2,
    )
    # Frontier points + line
    ax.plot(
        frontier[x_col].values, frontier[y_col].values,
        color="#EF553B", linestyle="--", linewidth=2, zorder=3,
    )
    ax.scatter(
        frontier[x_col], frontier[y_col],
        color="#EF553B", s=90, marker="D", zorder=4,
    )

    # Highlight everyrow defaults
    for model_name, effort in EVERYROW_MODELS.items():
        row = all_pts[all_pts["model_name"] == model_name]
        if not row.empty:
            ax.scatter(
                row[x_col], row[y_col],
                facecolors="none", edgecolors="#2CA02C", s=260, linewidths=2.5,
                marker="o", zorder=5,
            )

    # Labels
    x_min, x_max = all_pts[x_col].min(), all_pts[x_col].max()
    x_range = x_max - x_min or 1
    for _, row in all_pts.iterrows():
        frac = (row[x_col] - x_min) / x_range
        ha = "left" if frac < 0.25 else ("right" if frac > 0.75 else "center")
        label = row["model_name"]
        if label in EVERYROW_MODELS:
            label = f"{label}  [{EVERYROW_MODELS[label]}]"
        ax.annotate(
            label, (row[x_col], row[y_col]),
            fontsize=7.5, ha=ha, va="bottom",
            xytext=(0, 6), textcoords="offset points",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=1),
        )

    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

    legend_elements = [
        Line2D([0], [0], marker="o", color="#636EFA", linestyle="None", markersize=7, label="Models"),
        Line2D([0], [0], marker="D", color="#EF553B", linestyle="--", markersize=7, label="Pareto frontier"),
        Line2D([0], [0], marker="o", color="#2CA02C", linestyle="None",
               markersize=10, markerfacecolor="none", markeredgewidth=2,
               label="everyrow defaults (MEDIUM / HIGH)"),
    ]
    ax.legend(handles=legend_elements, loc="lower right")
    fig.tight_layout()
    plt.show()
In [5]:
all_spend, frontier_spend = pareto_frontier(df, "average_spend", "average_score")

plot_pareto(
    all_spend, frontier_spend,
    x_col="average_spend", y_col="average_score",
    x_label="Average Cost ($)", y_label="Average Score",
    title="Best Accuracy per Dollar on DRB",
)
[Figure: "Best Accuracy per Dollar on DRB" scatter plot of Average Score vs. Average Cost ($), showing the pruned Pareto frontier and the everyrow MEDIUM/HIGH defaults circled in green]

4. Best accuracy per second

When you're running agent_map over thousands of rows, wall-clock time matters as much as cost. Claude 4.6 Opus (low) — the HIGH default — is the fastest high-accuracy model on DRB, finishing in under 3 minutes on average.

Caveats:

  1. Wall-clock time depends on some fairly arbitrary factors: which deployment served the request, our rate limits and TPM quotas, how much traffic we happened to have when the evals ran, and so on. Treat it as somewhat noisy, and weigh it less heavily than cost.
  2. Lower-reasoning models may answer a single query faster, but being less clever can mean more tool calls and hence more iterations. So while it may look counterintuitive that some "slow" configurations (higher reasoning effort, or Opus vs. Sonnet) finish sooner, they may simply be making better use of each tool call.
In [7]:
all_time, frontier_time = pareto_frontier(df, "average_runtime_seconds", "average_score")

plot_pareto(
    all_time, frontier_time,
    x_col="average_runtime_seconds", y_col="average_score",
    x_label="Average Runtime (s)", y_label="Average Score",
    title="Best Accuracy per Second on DRB",
)
[Figure: "Best Accuracy per Second on DRB" scatter plot of Average Score vs. Average Runtime (s), showing the pruned Pareto frontier and the everyrow MEDIUM/HIGH defaults circled in green]

5. Comparing the effort levels

Let's pull out the numbers for the three default models side-by-side, alongside Claude 4.6 Opus (high) — the strongest model on DRB — to show what you'd gain (and pay) by overriding.

In [8]:
EFFORT_MODELS = [
    ("LOW", "Gemini 3 Flash (minimal)"),
    ("MEDIUM", "Gemini 3 Flash (low)"),
    ("HIGH", "Claude 4.6 Opus (low)"),
    ("(override)", "Claude 4.6 Opus (high)"),
]

rows = []
for effort, model in EFFORT_MODELS:
    match = df[df["model_name"] == model]
    if not match.empty:
        r = match.iloc[0]
        rows.append({
            "Effort Level": effort,
            "Model": model,
            "DRB Score": f"{r['average_score']:.3f}",
            "Avg Cost": f"${r['average_spend']:.3f}" if pd.notna(r.get("average_spend")) else "—",
            "Avg Runtime": f"{r['average_runtime_seconds']:.0f}s" if pd.notna(r.get("average_runtime_seconds")) else "—",
        })

comparison = pd.DataFrame(rows)
comparison.style.hide(axis="index")
Out[8]:
| Effort Level | Model | DRB Score | Avg Cost | Avg Runtime |
|---|---|---|---|---|
| LOW | Gemini 3 Flash (minimal) | 0.504 | $0.103 | 681s |
| MEDIUM | Gemini 3 Flash (low) | 0.499 | $0.051 | 390s |
| HIGH | Claude 4.6 Opus (low) | 0.531 | $0.243 | 159s |
| (override) | Claude 4.6 Opus (high) | 0.550 | $0.553 | 461s |

6. Choosing the right effort level

LOW is the default, and it's the right choice for most tasks that don't require web research — classifying rows, extracting fields, reformatting data. It runs a single LLM call with no tool use, so it's fast and cheap. Because DRB measures agentic information retrieval, the DRB score for the LOW model isn't very meaningful here: in practice LOW doesn't do research at all.

MEDIUM turns on the research agent. Gemini 3 Flash (low) sits on the cost Pareto frontier — it's the cheapest model that delivers strong research accuracy. Use this when you need agents to look things up on the web but want to keep costs down.

HIGH uses Claude 4.6 Opus (low), which sits on both the cost and speed Pareto frontiers. It's the fastest high-accuracy model on DRB and delivers the best score-per-dollar among top-tier models. Use this when accuracy matters and you're willing to pay more per row.

Want the absolute best accuracy? You can override the model directly by setting effort_level=None and specifying all parameters explicitly:

from everyrow.ops import agent_map
from everyrow.task import LLM

result = await agent_map(
    task="Find each company's latest funding round",
    input=companies_df,
    effort_level=None,
    llm=LLM.CLAUDE_4_6_OPUS_HIGH,
    iteration_budget=10,
    include_research=True,
)

Claude 4.6 Opus (high) is the top-scoring model on DRB, but it costs more than twice as much ($0.553 vs. $0.243 on average) and takes roughly three times as long (461s vs. 159s) as the HIGH default. For most workloads the HIGH preset already captures the bulk of that accuracy at a fraction of the price — but the option is there when you need it.

We re-run these benchmarks as new models launch, so the model behind each effort level may change over time. You always get the current best trade-off without changing your code.