Understanding agent_map Effort Levels¶
Every call to agent_map dispatches an AI agent per row of your DataFrame. The effort_level parameter controls what kind of agent runs and which model powers it:
| Effort | Model | Iterations | Web Research | What it does |
|---|---|---|---|---|
| LOW (default) | Gemini 3 Flash (minimal) | 0 | No | Single LLM call — quick classification or extraction |
| MEDIUM | Gemini 3 Flash (low) | up to 5 | Yes | Agentic research — searches the web, reads pages, synthesises |
| HIGH | Claude 4.6 Opus (low) | up to 10 | Yes | Deep research — more iterations, stronger model |
LOW is fundamentally different from the other two: it's a single model call with no tool use, no web browsing, and no iterative reasoning. It's the right choice when you just need an LLM to look at a row and produce a quick answer.
MEDIUM and HIGH run full research agents that search, read, and cross-reference sources. For these, model selection matters a lot — and we choose models based on their position on the Pareto frontier of accuracy, cost, and speed. This notebook shows how.
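As a minimal sketch of a per-row dispatch (the task string and companies_df DataFrame are purely illustrative; omitting effort_level falls back to the LOW default):

from everyrow.ops import agent_map
import pandas as pd

# Illustrative input: one row per entity you want an agent to look at.
companies_df = pd.DataFrame({"company": ["Acme Corp", "Globex", "Initech"]})

# Omitting effort_level uses the LOW default: one quick LLM call per row,
# no tool use or web research.
result = await agent_map(
    task="Classify each company's primary industry",
    input=companies_df,
)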
1. Fetch benchmark data¶
The Deep Research Bench (DRB) evaluates models on agentic web-research tasks: finding datasets, compiling numbers, validating claims, and other information-retrieval-heavy work. This is exactly what MEDIUM and HIGH effort agents do, which makes DRB the right benchmark for picking those models.
(DRB scores are less meaningful for LOW, since that effort level doesn't do any research — it's just a single LLM call.)
import requests
import pandas as pd
url = "https://rguraxphqescakvvzmju.supabase.co/rest/v1/rpc/get_average_scores_by_model"
PUBLIC_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InJndXJheHBocWVzY2FrdnZ6bWp1Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NDAwNjkyODUsImV4cCI6MjA1NTY0NTI4NX0.8O7hbuvWSGGSuvjANtgrf3BihWP-aLdan3x48sc7kHk"
headers = {
    "apikey": PUBLIC_API_KEY,
    "authorization": f"Bearer {PUBLIC_API_KEY}",
    "content-type": "application/json",
    "content-profile": "public",
}
response = requests.post(url, headers=headers, json={"min_num_of_distinct_instances": 150})
response.raise_for_status()
df = pd.DataFrame(response.json())
# Friendly display names (same mapping used on evals.futuresearch.ai)
MODEL_NAME_MAP = {
    "GROK_4": "Grok 4",
    "CLAUDE_3_7_SONNET_THINKING": "Claude 3.7 Sonnet (medium)",
    "CLAUDE_4_SONNET_THINKING": "Claude 4 Sonnet (medium)",
    "CLAUDE_4_OPUS_THINKING": "Claude 4 Opus (medium)",
    "CLAUDE_4_1_OPUS_THINKING": "Claude 4.1 Opus (medium)",
    "CLAUDE_4_5_OPUS_LOW": "Claude 4.5 Opus (low)",
    "CLAUDE_4_6_OPUS_LOW": "Claude 4.6 Opus (low)",
    "CLAUDE_4_6_OPUS_HIGH": "Claude 4.6 Opus (high)",
    "CLAUDE_4_5_OPUS_HIGH": "Claude 4.5 Opus (high)",
    "CLAUDE_4_5_OPUS_THINKING": "Claude 4.5 Opus (medium)",
    "CLAUDE_4_5_SONNET_MINIMAL": "Claude 4.5 Sonnet (low)",
    "CLAUDE_4_5_HAIKU_MINIMAL": "Claude 4.5 Haiku (low)",
    "GEMINI_2_5_PRO": "Gemini 2.5 Pro (dynamic)",
    "GEMINI_3_FLASH_MINIMAL": "Gemini 3 Flash (minimal)",
    "GEMINI_3_FLASH_LOW": "Gemini 3 Flash (low)",
    "GEMINI_3_FLASH_HIGH": "Gemini 3 Flash (high)",
    "GEMINI_3_PREVIEW_LOW": "Gemini 3 Pro (low)",
    "O3": "o3",
    "GPT_5_MINIMAL": "GPT-5 (minimal)",
    "GPT_5_LOW": "GPT-5 (low)",
    "GPT_5": "GPT-5 (medium)",
    "GPT_5_HIGH": "GPT-5 (high)",
    "GPT_5_1_LOW": "GPT-5.1 (low)",
    "GPT_5_2_LOW": "GPT-5.2 (low)",
}
df["model_name"] = df["model_name"].map(lambda x: MODEL_NAME_MAP.get(x, x))
print(f"Loaded {len(df)} model configurations from Deep Research Bench")
df.sort_values("average_score", ascending=False).reset_index(drop=True)
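Optionally, a quick sanity check that the response contains the columns used in the rest of this notebook:

# Columns referenced in the Pareto and comparison cells below; fail fast if the
# endpoint's schema ever changes.
expected_cols = {"model_name", "average_score", "average_spend", "average_runtime_seconds"}
missing = expected_cols - set(df.columns)
assert not missing, f"DRB response is missing expected columns: {missing}"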
2. Compute Pareto frontiers¶
A model sits on the Pareto frontier when no other model is both cheaper (or faster) and more accurate. Everything off the frontier is strictly dominated.
After computing the raw frontier we apply a pruning step: if moving to a cheaper/faster model only saves a tiny amount on the x-axis but sacrifices a disproportionate amount of accuracy, we drop it. This uses a normalised-slope threshold (the same approach used on the DRB dashboard) so the frontier traces a meaningful trade-off curve. For example, with a threshold of 4, a step that saves 1% of the cost range is only kept if it gives up at most 4% of the score range.
import numpy as np
def pareto_frontier(df, x_col, y_col, max_normalised_slope=4):
    """Compute and prune a Pareto frontier (lower x is better, higher y is better)."""
    valid = df.dropna(subset=[x_col, y_col]).copy()
    if valid.empty:
        return pd.DataFrame(), pd.DataFrame()

    # Raw frontier: sort by x ascending, keep points that set a new best y
    sorted_df = valid.sort_values(x_col)
    raw_frontier = []
    best_y = -np.inf
    for _, row in sorted_df.iterrows():
        if row[y_col] > best_y:
            best_y = row[y_col]
            raw_frontier.append(row)
    raw_frontier_df = pd.DataFrame(raw_frontier)

    # Prune: walk from best score downward, drop points whose normalised
    # score-loss / x-savings ratio exceeds the threshold
    x_range = valid[x_col].max() - valid[x_col].min() or 1
    y_range = valid[y_col].max() - valid[y_col].min() or 1
    descending = raw_frontier_df.sort_values(x_col, ascending=False)
    kept = [descending.iloc[0]]
    for i in range(1, len(descending)):
        prev = kept[-1]
        curr = descending.iloc[i]
        dx = (prev[x_col] - curr[x_col]) / x_range
        dy = (prev[y_col] - curr[y_col]) / y_range
        if dx > 0 and dy / dx <= max_normalised_slope:
            kept.append(curr)
    pruned = pd.DataFrame(kept).sort_values(x_col)
    return valid, pruned
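To see the pruning step in action before pointing it at the benchmark data, here is a tiny synthetic example (the numbers below are made up purely for illustration). C sits on the raw frontier, but stepping down to it from B saves only about 1% of the cost range while giving up 20% of the score range, so it gets pruned:

# Synthetic illustration only -- these are not DRB numbers.
toy = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "cost": [10.0, 2.0, 1.9, 0.5],
    "score": [0.80, 0.75, 0.73, 0.70],
})
_, toy_frontier = pareto_frontier(toy, "cost", "score")
print(toy_frontier["model_name"].tolist())  # ['D', 'B', 'A'] -- C is pruned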
3. Best accuracy per dollar¶
The green rings mark the models used by everyrow's MEDIUM and HIGH effort levels. Both sit on or near the Pareto frontier — you get the best accuracy available at their price points.
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
EVERYROW_MODELS = {
    "Gemini 3 Flash (low)": "MEDIUM",
    "Claude 4.6 Opus (low)": "HIGH",
}
def plot_pareto(all_pts, frontier, x_col, y_col, x_label, y_label, title):
    fig, ax = plt.subplots(figsize=(11, 7))
    frontier_names = set(frontier["model_name"])
    non_frontier = all_pts[~all_pts["model_name"].isin(frontier_names)]

    # Non-frontier points
    ax.scatter(
        non_frontier[x_col], non_frontier[y_col],
        color="#636EFA", s=60, alpha=0.7, zorder=2,
    )

    # Frontier points + line
    ax.plot(
        frontier[x_col].values, frontier[y_col].values,
        color="#EF553B", linestyle="--", linewidth=2, zorder=3,
    )
    ax.scatter(
        frontier[x_col], frontier[y_col],
        color="#EF553B", s=90, marker="D", zorder=4,
    )

    # Highlight everyrow defaults
    for model_name, effort in EVERYROW_MODELS.items():
        row = all_pts[all_pts["model_name"] == model_name]
        if not row.empty:
            ax.scatter(
                row[x_col], row[y_col],
                facecolors="none", edgecolors="#2CA02C", s=260, linewidths=2.5,
                marker="o", zorder=5,
            )

    # Labels
    x_min, x_max = all_pts[x_col].min(), all_pts[x_col].max()
    x_range = x_max - x_min or 1
    for _, row in all_pts.iterrows():
        frac = (row[x_col] - x_min) / x_range
        ha = "left" if frac < 0.25 else ("right" if frac > 0.75 else "center")
        label = row["model_name"]
        if label in EVERYROW_MODELS:
            label = f"{label} [{EVERYROW_MODELS[label]}]"
        ax.annotate(
            label, (row[x_col], row[y_col]),
            fontsize=7.5, ha=ha, va="bottom",
            xytext=(0, 6), textcoords="offset points",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=1),
        )

    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

    legend_elements = [
        Line2D([0], [0], marker="o", color="#636EFA", linestyle="None", markersize=7, label="Models"),
        Line2D([0], [0], marker="D", color="#EF553B", linestyle="--", markersize=7, label="Pareto frontier"),
        Line2D([0], [0], marker="o", color="#2CA02C", linestyle="None",
               markersize=10, markerfacecolor="none", markeredgewidth=2,
               label="everyrow defaults (MEDIUM / HIGH)"),
    ]
    ax.legend(handles=legend_elements, loc="lower right")
    fig.tight_layout()
    plt.show()
all_spend, frontier_spend = pareto_frontier(df, "average_spend", "average_score")
plot_pareto(
    all_spend, frontier_spend,
    x_col="average_spend", y_col="average_score",
    x_label="Average Cost ($)", y_label="Average Score",
    title="Best Accuracy per Dollar on DRB",
)
4. Best accuracy per second¶
When you're running agent_map over thousands of rows, wall-clock time matters as much as cost. Claude 4.6 Opus (low) — the HIGH default — is the fastest high-accuracy model on DRB, finishing in under 3 minutes on average.
Caveats:
- Wall-clock time is influenced by relatively arbitrary factors: which deployment was used, our rate limits and TPM quotas, how much traffic we happened to have when the evals ran, and so on. Treat it as somewhat noisy, and weight it less heavily than cost.
- Lower-reasoning models take less time to answer a single query, but being less clever can mean using more tools and hence more iterations. So while it may look counterintuitive that some "slow" models (higher reasoning effort, or Opus vs Sonnet) finish faster, they may simply make better use of their tool calls.
all_time, frontier_time = pareto_frontier(df, "average_runtime_seconds", "average_score")
plot_pareto(
    all_time, frontier_time,
    x_col="average_runtime_seconds", y_col="average_score",
    x_label="Average Runtime (s)", y_label="Average Score",
    title="Best Accuracy per Second on DRB",
)
5. Comparing the effort levels¶
Let's pull out the numbers for the three default models side-by-side, alongside Claude 4.6 Opus (high) — the strongest model on DRB — to show what you'd gain (and pay) by overriding.
EFFORT_MODELS = [
    ("LOW", "Gemini 3 Flash (minimal)"),
    ("MEDIUM", "Gemini 3 Flash (low)"),
    ("HIGH", "Claude 4.6 Opus (low)"),
    ("(override)", "Claude 4.6 Opus (high)"),
]
rows = []
for effort, model in EFFORT_MODELS:
    match = df[df["model_name"] == model]
    if not match.empty:
        r = match.iloc[0]
        rows.append({
            "Effort Level": effort,
            "Model": model,
            "DRB Score": f"{r['average_score']:.3f}",
            "Avg Cost": f"${r['average_spend']:.3f}" if pd.notna(r.get("average_spend")) else "—",
            "Avg Runtime": f"{r['average_runtime_seconds']:.0f}s" if pd.notna(r.get("average_runtime_seconds")) else "—",
        })
comparison = pd.DataFrame(rows)
comparison.style.hide(axis="index")
6. Choosing the right effort level¶
LOW is the default, and it's the right choice for most tasks that don't require web research — classifying rows, extracting fields, reformatting data. It runs a single LLM call with no tool use, so it's fast and cheap. Because DRB measures agentic information retrieval, the DRB score for the LOW model isn't very meaningful here: in practice LOW doesn't do research at all.
MEDIUM turns on the research agent. Gemini 3 Flash (low) sits on the cost Pareto frontier — it's the cheapest model that delivers strong research accuracy. Use this when you need agents to look things up on the web but want to keep costs down.
HIGH uses Claude 4.6 Opus (low), which sits on both the cost and speed Pareto frontiers. It's the fastest high-accuracy model on DRB and delivers the best score-per-dollar among top-tier models. Use this when accuracy matters and you're willing to pay more per row.
Want the absolute best accuracy? You can override the model directly by setting effort_level=None and specifying all parameters explicitly:
from everyrow.ops import agent_map
from everyrow.task import LLM
result = await agent_map(
    task="Find each company's latest funding round",
    input=companies_df,
    effort_level=None,
    llm=LLM.CLAUDE_4_6_OPUS_HIGH,
    iteration_budget=10,
    include_research=True,
)
Claude 4.6 Opus (high) is the top-scoring model on DRB, but it costs roughly twice as much and takes about three times as long as the HIGH default. For most workloads the HIGH preset already captures the bulk of that accuracy at a fraction of the price — but the option is there when you need it.
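If you want to check that trade-off against the current benchmark data, you can compute the ratios directly from the table loaded above (assuming both Claude 4.6 Opus configurations are present in df):

# Compare the HIGH default against the strongest override using the DRB data
# loaded earlier. Assumes both configurations appear in df.
high_default = df[df["model_name"] == "Claude 4.6 Opus (low)"].iloc[0]
override = df[df["model_name"] == "Claude 4.6 Opus (high)"].iloc[0]
print(f"Cost ratio:    {override['average_spend'] / high_default['average_spend']:.1f}x")
print(f"Runtime ratio: {override['average_runtime_seconds'] / high_default['average_runtime_seconds']:.1f}x")
print(f"Score gain:    {override['average_score'] - high_default['average_score']:+.3f}")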
We re-run these benchmarks as new models launch, so the model behind each effort level may change over time. You always get the current best trade-off without changing your code.