
Understanding agent_map Effort Levels

Every call to agent_map dispatches an AI agent per row of your DataFrame. The effort_level parameter controls what kind of agent runs and which model powers it:

| Effort | Model | Iterations | Web Research | What it does |
|---|---|---|---|---|
| LOW (default) | Gemini 3 Flash (minimal) | 0 | No | Single LLM call — quick classification or extraction |
| MEDIUM | Gemini 3 Flash (low) | up to 5 | Yes | Agentic research — searches the web, reads pages, synthesises |
| HIGH | Claude 4.6 Opus (low) | up to 10 | Yes | Deep research — more iterations, stronger model |

LOW is fundamentally different from the other two: it's a single model call with no tool use, no web browsing, and no iterative reasoning. It's the right choice when you just need an LLM to look at a row and produce a quick answer.

MEDIUM and HIGH run full research agents that search, read, and cross-reference sources. For these, model selection matters a lot — and we choose models based on their position on the Pareto frontier of accuracy, cost, and speed. This notebook shows how.
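
For orientation, here's a minimal sketch of what the two modes look like from Python. The companies_df DataFrame is a placeholder, and the "medium" value passed to effort_level is an assumed spelling of the preset (the override example at the end of this notebook only shows effort_level=None); check the agent_map API reference for the exact accepted values.

from everyrow.ops import agent_map

# LOW (the default): one plain LLM call per row, no tools, no web research.
labels = await agent_map(
    task="Classify each company as B2B or B2C",
    input=companies_df,  # placeholder DataFrame
)

# MEDIUM: a per-row research agent (Gemini 3 Flash (low), up to 5 iterations).
funding = await agent_map(
    task="Find each company's most recent funding round",
    input=companies_df,
    effort_level="medium",  # assumed spelling of the preset; see the API reference
)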

1. Fetch benchmark data

The Deep Research Bench (DRB) evaluates models on agentic web-research tasks: finding datasets, compiling numbers, validating claims, and other information-retrieval-heavy work. This is exactly what MEDIUM and HIGH effort agents do, which makes DRB the right benchmark for picking those models.

(DRB scores are less meaningful for LOW, since that effort level doesn't do any research — it's just a single LLM call.)

In [1]:
import requests
import pandas as pd

url = "https://rguraxphqescakvvzmju.supabase.co/rest/v1/rpc/get_average_scores_by_model"

PUBLIC_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InJndXJheHBocWVzY2FrdnZ6bWp1Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDAwNjkyODUsImV4cCI6MjA1NTY0NTI4NX0.8O7hbuvWSGGSuvjANtgrf3BihWP-aLdan3x48sc7kHk"

headers = {
    "apikey": PUBLIC_API_KEY,
    "authorization": f"Bearer {PUBLIC_API_KEY}",
    "content-type": "application/json",
    "content-profile": "public",
}

response = requests.post(url, headers=headers, json={"min_num_of_distinct_instances": 150})
response.raise_for_status()

df = pd.DataFrame(response.json())

# Friendly display names (same mapping used on evals.futuresearch.ai)
MODEL_NAME_MAP = {
    "GROK_4": "Grok 4",
    "CLAUDE_3_7_SONNET_THINKING": "Claude 3.7 Sonnet (medium)",
    "CLAUDE_4_SONNET_THINKING": "Claude 4 Sonnet (medium)",
    "CLAUDE_4_OPUS_THINKING": "Claude 4 Opus (medium)",
    "CLAUDE_4_1_OPUS_THINKING": "Claude 4.1 Opus (medium)",
    "CLAUDE_4_5_OPUS_LOW": "Claude 4.5 Opus (low)",
    "CLAUDE_4_6_OPUS_LOW": "Claude 4.6 Opus (low)",
    "CLAUDE_4_6_OPUS_HIGH": "Claude 4.6 Opus (high)",
    "CLAUDE_4_5_OPUS_HIGH": "Claude 4.5 Opus (high)",
    "CLAUDE_4_5_OPUS_THINKING": "Claude 4.5 Opus (medium)",
    "CLAUDE_4_5_SONNET_MINIMAL": "Claude 4.5 Sonnet (low)",
    "CLAUDE_4_5_HAIKU_MINIMAL": "Claude 4.5 Haiku (low)",
    "GEMINI_2_5_PRO": "Gemini 2.5 Pro (dynamic)",
    "GEMINI_3_FLASH_MINIMAL": "Gemini 3 Flash (minimal)",
    "GEMINI_3_FLASH_LOW": "Gemini 3 Flash (low)",
    "GEMINI_3_FLASH_HIGH": "Gemini 3 Flash (high)",
    "GEMINI_3_PREVIEW_LOW": "Gemini 3 Pro (low)",
    "O3": "o3",
    "GPT_5_MINIMAL": "GPT-5 (minimal)",
    "GPT_5_LOW": "GPT-5 (low)",
    "GPT_5": "GPT-5 (medium)",
    "GPT_5_HIGH": "GPT-5 (high)",
    "GPT_5_1_LOW": "GPT-5.1 (low)",
    "GPT_5_2_LOW": "GPT-5.2 (low)",
}
df["model_name"] = df["model_name"].map(lambda x: MODEL_NAME_MAP.get(x, x))

print(f"Loaded {len(df)} model configurations from Deep Research Bench")
df.sort_values("average_score", ascending=False).reset_index(drop=True)
Loaded 22 model configurations from Deep Research Bench
Out[1]:
| | model_name | average_score | total_runs | average_spend | average_runtime_seconds |
|---|---|---|---|---|---|
| 0 | Claude 4.6 Opus (high) | 0.550008 | 335 | 0.552964 | 460.783344 |
| 1 | Claude 4.5 Opus (low) | 0.548509 | 329 | 0.312314 | 412.932676 |
| 2 | Claude 4.5 Opus (high) | 0.547884 | 337 | 0.457780 | 375.158008 |
| 3 | Claude 4.6 Opus (low) | 0.530857 | 330 | 0.242862 | 158.676590 |
| 4 | Gemini 3 Flash (minimal) | 0.503644 | 338 | 0.102952 | 680.680925 |
| 5 | Gemini 3 Flash (low) | 0.499314 | 330 | 0.050674 | 389.948922 |
| 6 | GPT-5 (low) | 0.496311 | 330 | 0.251314 | 525.672959 |
| 7 | GPT-5 (minimal) | 0.488744 | 329 | 0.235946 | 502.485720 |
| 8 | GPT-5 (medium) | 0.485694 | 299 | 0.345617 | 468.550553 |
| 9 | Claude 4.1 Opus (medium) | 0.483028 | 329 | 1.203487 | 246.791378 |
| 10 | GPT-5 (high) | 0.481404 | 327 | 0.388968 | 582.383317 |
| 11 | Gemini 3 Flash (high) | 0.478604 | 338 | 0.141666 | 630.898910 |
| 12 | Claude 4.5 Sonnet (low) | 0.475172 | 294 | 0.356755 | 625.251226 |
| 13 | Grok 4 | 0.472558 | 320 | 0.485631 | 726.129258 |
| 14 | Claude 4 Opus (medium) | 0.467930 | 329 | 1.635857 | 284.182346 |
| 15 | Claude 4 Sonnet (medium) | 0.465519 | 329 | 0.454494 | 280.194073 |
| 16 | Gemini 3 Pro (low) | 0.462971 | 330 | NaN | 392.279528 |
| 17 | Claude 4.5 Haiku (low) | 0.454502 | 329 | 0.098306 | 431.052201 |
| 18 | o3 | 0.452155 | 329 | 0.118776 | 240.988206 |
| 19 | GPT-5.1 (low) | 0.427902 | 327 | 0.039637 | 271.585667 |
| 20 | Gemini 2.5 Pro (dynamic) | 0.415192 | 310 | 0.110789 | 240.193460 |
| 21 | GPT-5.2 (low) | 0.411199 | 322 | 0.261318 | 483.193635 |

2. Compute Pareto frontiers

A model sits on the Pareto frontier when no other model is both cheaper (or faster) and more accurate. Everything off the frontier is strictly dominated.

After computing the raw frontier we apply a pruning step: if moving to a cheaper/faster model only saves a tiny amount on the x-axis but sacrifices a disproportionate amount of accuracy, we drop it. This uses a normalised-slope threshold (the same approach used on the DRB dashboard) so the frontier traces a meaningful trade-off curve.

In [2]:
import numpy as np


def pareto_frontier(df, x_col, y_col, max_normalised_slope=4):
    """Compute and prune a Pareto frontier (lower x is better, higher y is better)."""
    valid = df.dropna(subset=[x_col, y_col]).copy()
    if valid.empty:
        return pd.DataFrame(), pd.DataFrame()

    # Raw frontier: sort by x ascending, keep points that set a new best y
    sorted_df = valid.sort_values(x_col)
    raw_frontier = []
    best_y = -np.inf
    for _, row in sorted_df.iterrows():
        if row[y_col] > best_y:
            best_y = row[y_col]
            raw_frontier.append(row)
    raw_frontier_df = pd.DataFrame(raw_frontier)

    # Prune: walk from best score downward, drop points whose normalised
    # score-loss / x-savings ratio exceeds the threshold
    x_range = valid[x_col].max() - valid[x_col].min() or 1
    y_range = valid[y_col].max() - valid[y_col].min() or 1

    descending = raw_frontier_df.sort_values(x_col, ascending=False)
    kept = [descending.iloc[0]]
    for i in range(1, len(descending)):
        prev = kept[-1]
        curr = descending.iloc[i]
        dx = (prev[x_col] - curr[x_col]) / x_range
        dy = (prev[y_col] - curr[y_col]) / y_range
        if dx > 0 and dy / dx <= max_normalised_slope:
            kept.append(curr)
    pruned = pd.DataFrame(kept).sort_values(x_col)

    return valid, pruned
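
As a quick sanity check, here's the pruning behaviour on a tiny hypothetical frame (the points and names below are made up for illustration): "C" is marginally cheaper than "B" but gives up a disproportionate amount of accuracy, so the pruning step drops it, while "D" is dominated and never makes the raw frontier.

toy = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "average_spend": [1.00, 0.50, 0.45, 0.60],
    "average_score": [0.90, 0.80, 0.40, 0.50],
})
_, toy_frontier = pareto_frontier(toy, "average_spend", "average_score")
print(toy_frontier["model_name"].tolist())  # ['B', 'A']: C is pruned, D is dominated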

3. Best accuracy per dollar

The green rings mark the models used by everyrow's MEDIUM and HIGH effort levels. Both sit on or near the Pareto frontier — you get the best accuracy available at their price points.

In [3]:
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

EVERYROW_MODELS = {
    "Gemini 3 Flash (low)": "MEDIUM",
    "Claude 4.6 Opus (low)": "HIGH",
}


def plot_pareto(all_pts, frontier, x_col, y_col, x_label, y_label, title):
    fig, ax = plt.subplots(figsize=(11, 7))

    frontier_names = set(frontier["model_name"])
    non_frontier = all_pts[~all_pts["model_name"].isin(frontier_names)]

    # Non-frontier points
    ax.scatter(
        non_frontier[x_col], non_frontier[y_col],
        color="#636EFA", s=60, alpha=0.7, zorder=2,
    )
    # Frontier points + line
    ax.plot(
        frontier[x_col].values, frontier[y_col].values,
        color="#EF553B", linestyle="--", linewidth=2, zorder=3,
    )
    ax.scatter(
        frontier[x_col], frontier[y_col],
        color="#EF553B", s=90, marker="D", zorder=4,
    )

    # Highlight everyrow defaults
    for model_name, effort in EVERYROW_MODELS.items():
        row = all_pts[all_pts["model_name"] == model_name]
        if not row.empty:
            ax.scatter(
                row[x_col], row[y_col],
                facecolors="none", edgecolors="#2CA02C", s=260, linewidths=2.5,
                marker="o", zorder=5,
            )

    # Labels
    x_min, x_max = all_pts[x_col].min(), all_pts[x_col].max()
    x_range = x_max - x_min or 1
    for _, row in all_pts.iterrows():
        frac = (row[x_col] - x_min) / x_range
        ha = "left" if frac < 0.25 else ("right" if frac > 0.75 else "center")
        label = row["model_name"]
        if label in EVERYROW_MODELS:
            label = f"{label}  [{EVERYROW_MODELS[label]}]"
        ax.annotate(
            label, (row[x_col], row[y_col]),
            fontsize=7.5, ha=ha, va="bottom",
            xytext=(0, 6), textcoords="offset points",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=1),
        )

    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

    legend_elements = [
        Line2D([0], [0], marker="o", color="#636EFA", linestyle="None", markersize=7, label="Models"),
        Line2D([0], [0], marker="D", color="#EF553B", linestyle="--", markersize=7, label="Pareto frontier"),
        Line2D([0], [0], marker="o", color="#2CA02C", linestyle="None",
               markersize=10, markerfacecolor="none", markeredgewidth=2,
               label="everyrow defaults (MEDIUM / HIGH)"),
    ]
    ax.legend(handles=legend_elements, loc="lower right")
    fig.tight_layout()
    plt.show()
In [5]:
all_spend, frontier_spend = pareto_frontier(df, "average_spend", "average_score")

plot_pareto(
    all_spend, frontier_spend,
    x_col="average_spend", y_col="average_score",
    x_label="Average Cost ($)", y_label="Average Score",
    title="Best Accuracy per Dollar on DRB",
)
[Figure: "Best Accuracy per Dollar on DRB" scatter plot of Average Score vs. Average Cost ($), showing the pruned Pareto frontier and the everyrow MEDIUM/HIGH defaults circled in green]

4. Best accuracy per second

When you're running agent_map over thousands of rows, wall-clock time matters as much as cost. Claude 4.6 Opus (low) — the HIGH default — is the fastest high-accuracy model on DRB, finishing in under 3 minutes on average.

Caveats:

  1. Wall-clock time depends on some fairly arbitrary factors: which deployment served the request, our rate limits and TPM quotas, how much traffic we happened to have when the evals ran, and so on. Treat it as somewhat noisy, and weigh it less heavily than cost.
  2. Lower-reasoning models may answer a single query faster, but being less clever can mean more tool calls and hence more iterations. So while it may look counterintuitive that some "slow" configurations (higher reasoning effort, or Opus vs. Sonnet) finish sooner, they may simply be making better use of each tool call.
In [7]:
all_time, frontier_time = pareto_frontier(df, "average_runtime_seconds", "average_score")

plot_pareto(
    all_time, frontier_time,
    x_col="average_runtime_seconds", y_col="average_score",
    x_label="Average Runtime (s)", y_label="Average Score",
    title="Best Accuracy per Second on DRB",
)
[Figure: "Best Accuracy per Second on DRB" scatter plot of Average Score vs. Average Runtime (s), showing the pruned Pareto frontier and the everyrow MEDIUM/HIGH defaults circled in green]

5. Comparing the effort levels

Let's pull out the numbers for the three default models side-by-side, alongside Claude 4.6 Opus (high) — the strongest model on DRB — to show what you'd gain (and pay) by overriding.

In [8]:
EFFORT_MODELS = [
    ("LOW", "Gemini 3 Flash (minimal)"),
    ("MEDIUM", "Gemini 3 Flash (low)"),
    ("HIGH", "Claude 4.6 Opus (low)"),
    ("(override)", "Claude 4.6 Opus (high)"),
]

rows = []
for effort, model in EFFORT_MODELS:
    match = df[df["model_name"] == model]
    if not match.empty:
        r = match.iloc[0]
        rows.append({
            "Effort Level": effort,
            "Model": model,
            "DRB Score": f"{r['average_score']:.3f}",
            "Avg Cost": f"${r['average_spend']:.3f}" if pd.notna(r.get("average_spend")) else "—",
            "Avg Runtime": f"{r['average_runtime_seconds']:.0f}s" if pd.notna(r.get("average_runtime_seconds")) else "—",
        })

comparison = pd.DataFrame(rows)
comparison.style.hide(axis="index")
Out[8]:
| Effort Level | Model | DRB Score | Avg Cost | Avg Runtime |
|---|---|---|---|---|
| LOW | Gemini 3 Flash (minimal) | 0.504 | $0.103 | 681s |
| MEDIUM | Gemini 3 Flash (low) | 0.499 | $0.051 | 390s |
| HIGH | Claude 4.6 Opus (low) | 0.531 | $0.243 | 159s |
| (override) | Claude 4.6 Opus (high) | 0.550 | $0.553 | 461s |

6. Choosing the right effort level

LOW is the default, and it's the right choice for most tasks that don't require web research — classifying rows, extracting fields, reformatting data. It runs a single LLM call with no tool use, so it's fast and cheap. Because DRB measures agentic information retrieval, the DRB score for the LOW model isn't very meaningful here: in practice LOW doesn't do research at all.

MEDIUM turns on the research agent. Gemini 3 Flash (low) sits on the cost Pareto frontier — it's the cheapest model that delivers strong research accuracy. Use this when you need agents to look things up on the web but want to keep costs down.

HIGH uses Claude 4.6 Opus (low), which sits on both the cost and speed Pareto frontiers. It's the fastest high-accuracy model on DRB and delivers the best score-per-dollar among top-tier models. Use this when accuracy matters and you're willing to pay more per row.

Want the absolute best accuracy? You can override the model directly by setting effort_level=None and specifying all parameters explicitly:

from everyrow.ops import agent_map
from everyrow.task import LLM

result = await agent_map(
    task="Find each company's latest funding round",
    input=companies_df,
    effort_level=None,
    llm=LLM.CLAUDE_4_6_OPUS_HIGH,
    iteration_budget=10,
    include_research=True,
)

Claude 4.6 Opus (high) is the top-scoring model on DRB, but it costs more than twice as much ($0.553 vs. $0.243 on average) and takes roughly three times as long (461s vs. 159s) as the HIGH default. For most workloads the HIGH preset already captures the bulk of that accuracy at a fraction of the price — but the option is there when you need it.

We re-run these benchmarks as new models launch, so the model behind each effort level may change over time. You always get the current best trade-off without changing your code.