
The Hidden Incompatibilities Between LLM Providers

What we learned building a system that interchangeably supports Anthropic, Google, and OpenAI

February 24, 2026 · Robert Gambee

If you've ever tried to swap one LLM provider for another, you've probably noticed that things aren't as interchangeable as the docs suggest. On paper, the APIs are converging: they all accept messages, they all support tool calling, they all accept JSON Schemas for structured output. In practice, each provider has opinions about what "valid" means, and those opinions don't always agree with each other or with the relevant specs.

At FutureSearch, our everyrow.io app is powered by tens of thousands of LLM calls per day across Anthropic, Google, and OpenAI. Our system needs to be able to route any task to any provider depending on performance, cost, and availability. There are middleware services that aim to abstract away the provider-specific differences, but they aren't perfect, so we often need to address these quirks ourselves.

Over the past year, we've accumulated a collection of provider-specific patches and workarounds. None of them are individually complicated, but together they paint a picture of an ecosystem that's less standardized than it appears.

This post documents the quirks we've found. Some of these might save you a debugging session.

JSON Schema: three providers, three interpretations

Structured output is one of the most useful features of modern LLM APIs. You pass a JSON Schema, and the model returns output that conforms to it. Except "conforms to it" means different things to different providers.

Gemini requires array items to have a type

Consider this JSON Schema:

{
  "type": "array",
  "items": {}
}

In JSON Schema, items: {} means "the array can contain elements of any type." If you write a Pydantic model with a bare list annotation, it will produce a schema like this. However, Gemini requires items to have an explicit type field, so this gets rejected at the API level. Unfortunately, the structured output documentation doesn't mention this requirement.

Our fix replaces empty items with {"type": "string"} for Gemini requests. It's not semantically identical, but string is the most universal type and it works in every case we've encountered:

def fix_empty_items(schema: dict) -> dict:
    """Replace empty array-item schemas with {"type": "string"} so Gemini accepts them."""
    if schema.get("items") == {}:
        schema["items"] = {"type": "string"}
    # Recurse into nested schemas: properties, $defs, anyOf lists, etc.
    for key, value in schema.items():
        if isinstance(value, dict):
            schema[key] = fix_empty_items(value)
        elif isinstance(value, list):
            schema[key] = [
                fix_empty_items(item)
                if isinstance(item, dict) else item
                for item in value
            ]
    return schema

The fix needs to be applied recursively to the entire schema, and in two places: when generating structured output and when specifying tool signatures. The second one is easy to miss.

Interestingly, I can reproduce the problem when using their API directly but not with their genai Python package. Perhaps they're aware of this issue, but they decided to address it in the Python package rather than in their backend. Or more likely: the team that maintains the Python package got fed up waiting for the backend team to do something about this.

OpenAI can't agree with itself on structured output

When generating structured output with OpenAI's chat/completions API, you need to pass the schema like this:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "Response",
      "description": "A response to the user's question",
      "schema": {
        "type": "object",
        "properties": {
          "answer": {
            "type": "string"
          }
        },
        "required": ["answer"]
      }
    }
  }
}

But the responses API requires this instead:

{
  "text": {
    "format": {
      "type": "json_schema",
      "name": "Response",
      "description": "A response to the user's question",
      "schema": {
        "type": "object",
        "properties": {
          "answer": {
            "type": "string"
          }
        },
        "required": ["answer"]
      }
    }
  }
}

The key names differ (format vs. response_format), and the name, description, and schema fields sit at different nesting levels. These differences are easy to overlook, and the documentation doesn't make them clear. The only way to figure it out is to read the source code. It's obvious these two APIs were designed by different teams.
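If you maintain a single internal representation, a small adapter can translate between the two shapes. Here's a minimal sketch; the function name is ours, and it assumes you start from the chat/completions-style payload:

```python
def to_responses_text_format(response_format: dict) -> dict:
    """Convert a chat/completions-style response_format into the
    equivalent text.format payload for the responses API."""
    inner = response_format["json_schema"]  # name/description/schema live here
    return {
        "text": {
            "format": {
                "type": "json_schema",
                # In the responses API these fields are flattened one level up
                "name": inner["name"],
                "description": inner.get("description", ""),
                "schema": inner["schema"],
            }
        }
    }
```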

No provider supports top-level $ref

JSON Schema has the concept of definitions and references, which allows you to define a schema in one place and reuse it in another. These are helpful if you want to reuse a subschema without repeating yourself. And Pydantic automatically generates them for certain model hierarchies. Here's an example:

{
  "$ref": "#/$defs/Response",
  "$defs": {
    "Response": {
      "type": "object",
      "properties": {
        "answer": {
          "type": "string"
        }
      },
      "required": ["answer"]
    }
  }
}

That's a perfectly valid JSON Schema. But none of the major LLM providers accept this. They don't allow the root to be a $ref.

The solution is straightforward enough: find the $ref, look it up in $defs, and use the resolved definition as the top-level schema. But it's the kind of thing that works fine in tests with simple response formats, only to break in production when your schema happens to be structured differently.
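A sketch of that resolution step, assuming a local reference of the form #/$defs/Name (remote refs and recursively nested top-level refs are out of scope here):

```python
def inline_top_level_ref(schema: dict) -> dict:
    """If the schema root is a $ref into $defs, replace it with the
    resolved definition, keeping $defs for any remaining references."""
    ref = schema.get("$ref")
    if not ref or not ref.startswith("#/$defs/"):
        return schema
    name = ref.split("/")[-1]
    defs = schema.get("$defs", {})
    resolved = dict(defs[name])  # copy so we don't mutate the definition
    if defs:
        resolved["$defs"] = defs  # keep defs: nested subschemas may still $ref them
    return resolved
```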

Tool calling: Anthropic can't force tool calls when thinking

When you set tool_choice: {"type": "any"} on a Claude model with extended thinking enabled, Anthropic returns an error. You're not allowed to force tool use while the model is in thinking mode.

This is a real problem if you have a system that relies on structured tool call outputs, because reasoning models are often the ones you want for complex tasks. Our workaround has two steps: first, call the thinking model with tool_choice: {"type": "auto"}. If the model happens to return tool calls, great. If it doesn't, make a second call with extended thinking disabled and force tool use with tool_choice: {"type": "any"}.

It's not ideal since it sometimes means an extra API call, but it works. OpenAI and Gemini don't have this restriction. To be fair to Anthropic, at least they document this clearly.
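The two-step fallback can be sketched as a wrapper around whatever client function you use. Here, call_claude, the thinking flag, and the tool_calls response key are hypothetical placeholders for your own client code:

```python
def call_with_forced_tools(call_claude, messages, tools):
    """Try a thinking call with tool_choice auto; if no tool calls come
    back, retry with thinking disabled and force tool use."""
    response = call_claude(
        messages, tools, thinking=True, tool_choice={"type": "auto"}
    )
    if response.get("tool_calls"):
        return response
    # Second attempt: extended thinking off, so forcing tool use is allowed
    return call_claude(
        messages, tools, thinking=False, tool_choice={"type": "any"}
    )
```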

Prompt caching: only Anthropic makes you do the work

Anthropic is the only major provider where prompt caching requires explicit markers in your request. You add cache_control: {"type": "ephemeral"} to specific messages that mark cache breakpoints, and Anthropic caches everything up to that point. On cache hits, you pay 90% less for the cached tokens (but 25% more on the initial write).

OpenAI and Google handle caching automatically on their end. You don't need to do anything.

The upside of Anthropic's approach is control: you can be strategic about where your cache boundaries fall. The downside is that it's one more thing to get right. And at scale, the cost of getting it wrong can be huge! At one point, we were wasting thousands of dollars per month because of incorrect cache markers.

In our implementation, we identify the last message in the first contiguous block of cacheable messages and add the cache marker there. It's only a few dozen lines of code. But it's logic that the other providers have internalized on your behalf.
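That placement logic can be sketched as follows. This is simplified to message-level markers (Anthropic actually attaches cache_control to content blocks), and is_cacheable is your own predicate, e.g. "large and stable across requests":

```python
def add_cache_marker(messages: list[dict], is_cacheable) -> list[dict]:
    """Mark the last message of the first contiguous cacheable block
    with Anthropic's cache_control breakpoint."""
    last = None
    for i, msg in enumerate(messages):
        if is_cacheable(msg):
            last = i
        elif last is not None:
            break  # the first contiguous block has ended
    if last is not None:
        messages[last] = {
            **messages[last],
            "cache_control": {"type": "ephemeral"},
        }
    return messages
```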

Temperature constraints for reasoning models

Back in the early days of LLMs, temperature was an important parameter for controlling the "creativity" of the model's output. It was fun to turn it up and watch the model's response unravel into utter nonsense.

But with the rise of reasoning models, temperature is effectively deprecated. The big three providers restrict it to varying degrees.

  • Google still supports temperature for Gemini 3 models, but it strongly discourages setting it to anything other than 1.
  • Anthropic requires temperature to be set to 1 when extended thinking is enabled.
  • OpenAI's o3 and GPT-5 families don't support temperature at all. Yet the responses API docs still list temperature as a parameter, with no mention of this limitation.
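In practice this means stripping or pinning temperature per provider before each call. A sketch, where the model-name checks and the thinking flag are simplified assumptions standing in for real capability detection:

```python
def sanitize_temperature(provider: str, model: str, params: dict) -> dict:
    """Adjust the temperature parameter to satisfy each provider's
    restrictions on reasoning models."""
    params = dict(params)  # don't mutate the caller's dict
    if provider == "openai" and (model.startswith("o3") or model.startswith("gpt-5")):
        params.pop("temperature", None)  # not supported at all
    elif provider == "anthropic" and params.get("thinking"):
        params["temperature"] = 1  # required when extended thinking is on
    return params
```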

Living with the mess

None of these issues are showstoppers. They're all fixable with a few lines of code each. But they add up. Our LLM client module has accumulated hundreds of lines of provider-specific logic, and new patches keep coming.

There are solutions out there which claim to abstract away all these silly details and provide a clean, unified interface. But in our experience, these glue layers are highly imperfect and introduce their own set of bugs and edge cases, requiring even more patches and workarounds.

But this is what it's like to work at the bleeding edge. If you're building multi-provider LLM infrastructure and have encountered other quirks not covered here, we'd love to hear about them. Our team at FutureSearch works on this stuff daily to make everyrow.io the best it can possibly be.
