How to Replace Human Data Annotators with LLMs in Active Learning

Human data labeling is slow and expensive. We replaced the human annotator with an LLM oracle in an active learning loop and matched the human-labeled baseline's classifier performance — 200 labels in under 5 minutes for $0.26.
Install
```bash
pip install everyrow
export EVERYROW_API_KEY=your_key_here  # Get one at everyrow.io/api-key
```
Experiment design
Active learning reduces labeling costs by letting the model choose which examples to label next, focusing on the ones it is most uncertain about. But you still need an oracle to provide those labels, traditionally a human annotator.
We used a TF-IDF + LightGBM classifier with entropy-based uncertainty sampling. Each iteration selects the 20 examples the classifier is most uncertain about, sends them to the LLM for annotation, and retrains. Ten iterations, 200 labels total.
We ran 10 independent repeats with different random seeds. In each repeat we ran both a ground-truth oracle (the dataset's human labels) and the LLM oracle with the same seed, giving a direct, controlled comparison.
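The selection step is standard entropy-based uncertainty sampling. A rough sketch (the helper below is illustrative, assuming a fitted scikit-learn pipeline of TfidfVectorizer + LGBMClassifier, not the exact experiment code):
```python
import numpy as np

def select_most_uncertain(model, pool_texts: list[str], batch_size: int = 20) -> np.ndarray:
    """Return the indices of the pool examples with the highest predictive entropy."""
    proba = model.predict_proba(pool_texts)                   # shape: (n_pool, 14)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # higher = more uncertain
    return np.argsort(entropy)[-batch_size:]                  # top-k most uncertain
```
The oracle that labels the selected rows is a single agent_map call, with a Pydantic Literal constraining the response to the 14 DBpedia category names: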
```python
from typing import Literal

import pandas as pd
from pydantic import BaseModel, Field

from everyrow import create_session
from everyrow.ops import agent_map
from everyrow.task import EffortLevel

# DBpedia-14 class IDs and their human-readable category names.
LABEL_NAMES = {
    0: "Company", 1: "Educational Institution", 2: "Artist",
    3: "Athlete", 4: "Office Holder", 5: "Mean Of Transportation",
    6: "Building", 7: "Natural Place", 8: "Village",
    9: "Animal", 10: "Plant", 11: "Album", 12: "Film", 13: "Written Work",
}
CATEGORY_TO_ID = {v: k for k, v in LABEL_NAMES.items()}


class DBpediaClassification(BaseModel):
    category: Literal[
        "Company", "Educational Institution", "Artist",
        "Athlete", "Office Holder", "Mean Of Transportation",
        "Building", "Natural Place", "Village",
        "Animal", "Plant", "Album", "Film", "Written Work",
    ] = Field(description="The DBpedia ontology category")


async def query_llm_oracle(texts_df: pd.DataFrame) -> list[int]:
    """Label a batch of texts with the LLM and map category names back to integer IDs."""
    async with create_session(name="Active Learning Oracle") as session:
        result = await agent_map(
            session=session,
            task="Classify this text into exactly one DBpedia ontology category.",
            input=texts_df[["text"]],
            response_model=DBpediaClassification,
            effort_level=EffortLevel.LOW,
        )
    # Unknown categories map to -1 so bad responses are easy to spot downstream.
    return [
        CATEGORY_TO_ID.get(result.data["category"].iloc[i], -1)
        for i in range(len(texts_df))
    ]
```
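Putting the pieces together, here is a minimal sketch of the outer loop, assuming the select_most_uncertain helper above and seed/pool/test DataFrames with text and label columns (variable names are illustrative, not the exact experiment code):
```python
import asyncio

from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

async def run_active_learning(seed_df, pool_df, test_df, n_iterations=10, batch_size=20, seed=0):
    labeled = seed_df.copy()
    for iteration in range(n_iterations):
        # Retrain TF-IDF + LightGBM on everything labeled so far.
        model = make_pipeline(TfidfVectorizer(), LGBMClassifier(random_state=seed))
        model.fit(labeled["text"], labeled["label"])
        acc = model.score(test_df["text"], test_df["label"])
        print(f"iteration {iteration}: test accuracy {acc:.3f}")

        # Pick the batch_size pool examples the classifier is least sure about.
        idx = select_most_uncertain(model, pool_df["text"].tolist(), batch_size)

        # Ask the LLM oracle for labels and fold them into the training set.
        batch = pool_df.iloc[idx].assign(label=await query_llm_oracle(pool_df.iloc[idx]))
        labeled = pd.concat([labeled, batch], ignore_index=True)
        pool_df = pool_df.drop(pool_df.index[idx]).reset_index(drop=True)
    return labeled

# e.g. labeled = asyncio.run(run_active_learning(seed_df, pool_df, test_df))
```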
Results
| Metric | Value |
|---|---|
| Labels per run | 200 |
| Cost per run | $0.26 |
| Cost per labeled item | $0.0013 |
| Final accuracy (LLM) | 80.7% ± 0.8% |
| Final accuracy (human) | 80.6% ± 1.0% |
| LLM–human label agreement | 96.1% ± 1.6% |
| Repeats | 10 |
| Dataset | DBpedia-14 (14-class text classification) |
The learning curves overlap almost perfectly. The shaded bands show ±1 standard deviation across 10 repeats — the LLM oracle tracks the ground truth oracle at every iteration.
Final test accuracies averaged over 10 repeats:
| Data Labeling Method | Final Accuracy (mean ± std) |
|---|---|
| Human annotation (ground truth) | 80.6% ± 1.0% |
| LLM annotation (everyrow) | 80.7% ± 0.8% |
The LLM oracle is within noise of the ground-truth baseline. Automated data labeling produced classifiers just as good as those trained on human-labeled data.
The LLM agreed with ground truth labels 96.1% ± 1.6% of the time. Roughly 1 in 25 labels disagrees with the human annotation, but that does not hurt the downstream classifier.
| Metric | Value |
|---|---|
| Cost per run (200 labels) | $0.26 |
| Cost per labeled item | $0.0013 |
| Total (10 repeats) | $2.58 |
200 labels in under 5 minutes for $0.26, fully automated.
Limitations
We tested on one dataset with well-separated categories. More ambiguous labeling tasks may see a gap between human and LLM annotation quality. We also used a simple classifier (TF-IDF + LightGBM); neural models that overfit individual examples may be less noise-tolerant.
The low cost in this experiment comes from using EffortLevel.LOW, which selects a small, fast model and doesn't use web research to improve label quality. For simple classification tasks with well-separated categories, this is sufficient.
For more ambiguous labeling tasks, you can use EffortLevel.MEDIUM or EffortLevel.HIGH to get higher-quality labels from smarter models that can use the web. The cost scales accordingly, but even at higher effort levels, LLM labeling remains cheaper and faster than human annotation.
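Switching effort levels is a one-argument change to the agent_map call inside the oracle above, for example:
```python
result = await agent_map(
    session=session,
    task="Classify this text into exactly one DBpedia ontology category.",
    input=texts_df[["text"]],
    response_model=DBpediaClassification,
    effort_level=EffortLevel.HIGH,  # smarter model, may use web research; higher cost per label
)
```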
Reproduce this experiment
The full pipeline is available as a companion notebook on Kaggle. The experiment uses the DBpedia-14 dataset, a 14-class text classification benchmark. See also the full blog post for additional discussion.
Related
- How to Classify DataFrame Rows with an LLM — label data at scale with agent_map
- How to Deduplicate Training Data in Python — clean ML datasets before training