
Coinbase Slashed AI Costs by 50% Using This Routing Strategy

Published: Jun 29, 2026 | 5 min read

Coinbase has successfully halved its AI expenditure by implementing an automated model routing layer and strategic caching. Discover how this shift in procurement is forcing a pricing stress test on Western AI labs.

Coinbase has cut its AI spending by 50% — not by using less AI, but by using it smarter. The cryptocurrency exchange has joined a growing wave of major tech companies rerouting workloads away from premium Western models toward lower-cost Chinese alternatives, deploying automated model selection and strategic caching to maintain — and even grow — token usage while dramatically shrinking the bill.

The move signals something larger than a single company's budget decision. It marks a structural shift in how enterprises think about AI procurement, and it puts Western AI labs on notice.

The Mechanics: How Coinbase Halved Its AI Bill

The core strategy isn't complicated, but executing it well requires real infrastructure investment. Coinbase built an automated model routing layer that evaluates each incoming task and selects the most cost-effective model capable of handling it. Not every task needs GPT-4-class reasoning. A classification job, a summarization request, or a structured data extraction task can often be handled just as effectively by a cheaper model — and the router knows the difference.
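A minimal sketch of that routing idea, assuming a small model catalogue and a hand-written table of task requirements. The model names, prices, and tier numbers are hypothetical placeholders, not details of Coinbase's actual system:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_million_tokens: float  # hypothetical blended price in USD
    capability: int                 # hypothetical tier: 1 = basic, 3 = frontier


# Illustrative catalogue; names and prices are placeholders, not real quotes.
CATALOGUE = [
    Model("budget-model", 0.50, 1),
    Model("mid-model", 3.00, 2),
    Model("frontier-model", 15.00, 3),
]

# Hypothetical mapping from task type to the minimum tier that can handle it.
REQUIRED_TIER = {
    "classification": 1,
    "summarization": 1,
    "extraction": 1,
    "code_review": 2,
    "multi_step_reasoning": 3,
}


def route(task_type: str, catalogue=CATALOGUE) -> Model:
    """Return the cheapest model whose tier meets the task's requirement."""
    # Unknown task types fall back to the highest tier to stay on the safe side.
    top_tier = max(m.capability for m in catalogue)
    required = REQUIRED_TIER.get(task_type, top_tier)
    eligible = [m for m in catalogue if m.capability >= required]
    return min(eligible, key=lambda m: m.cost_per_million_tokens)


print(route("classification").name)        # budget-model
print(route("multi_step_reasoning").name)  # frontier-model
```

A production router would likely classify tasks automatically rather than rely on a static table, but the selection rule stays the same: filter by capability, then pick the cheapest.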

Two Chinese models now feature prominently in Coinbase's stack: GLM 5.2, developed by Zhipu AI, and Kimi 2.7, from Moonshot AI. Both have emerged as credible performers on standard benchmarks at price points that significantly undercut their Western counterparts. By routing appropriate workloads to these models rather than defaulting to a single premium provider, Coinbase is essentially practicing AI labor arbitrage — matching task complexity to model cost in real time.

The second lever is strategic caching. Repeated or structurally similar prompts don't need to be re-processed from scratch each time. By caching responses to common query patterns, Coinbase reduces redundant inference calls — one of the most wasteful cost drivers in production AI deployments. Together, routing and caching have allowed the company to keep its token consumption growing while its spend falls.
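As a rough illustration of response caching, the sketch below keys an in-memory cache on the model name plus a normalized prompt. The `ResponseCache` class and the `call_model` callable are hypothetical stand-ins for whatever client a real deployment uses, and the normalization (lowercasing and collapsing whitespace) is only a simple proxy for "structurally similar":

```python
import hashlib


class ResponseCache:
    """In-memory response cache keyed on model name and normalized prompt."""

    def __init__(self, call_model):
        # call_model(model, prompt) -> str is a hypothetical inference function.
        self._call_model = call_model
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Collapse whitespace and case so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for one inference call and remember the result.
        self.misses += 1
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response
```

Lowercasing is not safe for every workload (case can carry meaning), and a real cache would also need expiry and a size limit. The point is only that a cache hit costs no inference call.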


A Pricing Stress Test for Western Labs

Coinbase is not alone. According to reporting by The Decoder, a broad cohort of major tech companies is pivoting toward Chinese AI models as cost pressures intensify. The pattern is consistent enough that analysts are now framing it as a "pricing stress test" for Western AI labs — OpenAI, Anthropic, Google, and others who have built their business models around premium inference pricing.

The stress test has two dimensions. First, Chinese labs have aggressively undercut Western pricing, partly enabled by lower operational costs and, in some cases, government subsidization. Second, the performance gap that once justified Western premium pricing has narrowed considerably. Models like GLM 5.2 and Kimi 2.7 now perform competitively on many enterprise workloads, even if they don't match frontier models on every benchmark.

For a company like Coinbase — which runs AI across customer support, fraud detection, developer tooling, and internal operations — even a modest per-token price difference compounds into millions of dollars at scale. The 50% spending reduction isn't a rounding error; it's a material line-item shift.
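To see how the arithmetic can work out, here is a small worked example. Every figure in it (token volume, prices, and the share of traffic routed to the cheaper model) is hypothetical and chosen only to make the numbers easy to follow; none of it comes from Coinbase:

```python
def monthly_cost(tokens: float, price_per_million: float) -> float:
    """Cost in USD for a token volume at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


def blended_cost(tokens, premium_price, budget_price, budget_share):
    """Cost when budget_share of the tokens go to the cheaper model."""
    return (monthly_cost(tokens * (1 - budget_share), premium_price)
            + monthly_cost(tokens * budget_share, budget_price))


# All figures below are hypothetical.
TOKENS_PER_MONTH = 500_000_000_000  # 500 billion tokens
PREMIUM_PRICE = 10.0                # USD per million tokens
BUDGET_PRICE = 2.0                  # USD per million tokens

all_premium = monthly_cost(TOKENS_PER_MONTH, PREMIUM_PRICE)
routed = blended_cost(TOKENS_PER_MONTH, PREMIUM_PRICE, BUDGET_PRICE, 0.625)

print(f"all premium: ${all_premium:,.0f}")         # all premium: $5,000,000
print(f"routed:      ${routed:,.0f}")              # routed:      $2,500,000
print(f"saving:      {1 - routed / all_premium:.0%}")  # saving:      50%
```

Under these made-up inputs, sending 62.5% of tokens to a model that costs one fifth as much halves the monthly bill without reducing token volume at all.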

What This Means for Reducing Operational Costs with AI

The Coinbase playbook offers a replicable template for any organization serious about reducing operational costs with AI at scale:

1. Audit your model usage by task type. Most enterprise AI deployments are over-engineered for a significant portion of their workloads. A frontier model handling routine text classification is like using a Formula 1 car for a grocery run.

2. Build or adopt a routing layer. Model routing doesn't require building from scratch. Tools like LiteLLM, Portkey, and several enterprise gateway products now offer multi-provider routing with cost optimization logic built in.

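3. Implement prompt caching aggressively. OpenAI, Anthropic, and most Chinese providers now offer native prompt caching, which bills repeated prompt prefixes at a discounted rate and is separate from caching whole responses on your own side. For applications with repetitive system prompts or high query overlap, caching alone can cut costs by an estimated 30–60%, depending on how much of each prompt repeats.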

4. Treat model selection as a dynamic decision. The model landscape is moving fast. Locking into a single provider means missing cost improvements as new models enter the market — exactly the flexibility Coinbase's routing architecture is designed to exploit.
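The audit in step 1 can start as a simple aggregation over usage logs. The sketch below assumes each log record carries a task type, a model name, and a token count; the field names, prices, and the set of "simple" task types are all hypothetical:

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens.
PRICES = {"frontier-model": 15.0, "budget-model": 0.5}

# Hypothetical set of task types a cheaper model could plausibly handle.
SIMPLE_TASKS = {"classification", "summarization", "extraction"}


def spend_by_task_and_model(records, prices=PRICES):
    """Sum estimated spend for each (task_type, model) pair."""
    totals = defaultdict(float)
    for r in records:
        cost = r["tokens"] / 1_000_000 * prices[r["model"]]
        totals[(r["task_type"], r["model"])] += cost
    return dict(totals)


def rerouting_candidates(totals, expensive_models=frozenset({"frontier-model"})):
    """List simple tasks currently served by expensive models, largest spend first."""
    rows = [
        (task, model, cost)
        for (task, model), cost in totals.items()
        if task in SIMPLE_TASKS and model in expensive_models
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

The rows at the top of `rerouting_candidates` are where a routing layer is likely to pay off first, since they combine high spend with low task complexity.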

The Geopolitical Wrinkle

The shift toward Chinese AI models isn't without friction. Enterprise procurement teams, legal departments, and security-conscious CTOs are asking hard questions about data residency, model transparency, and the regulatory environment surrounding Chinese technology companies. For some industries, including defense, healthcare, and certain financial services, the risk calculus may rule out Chinese model providers entirely, regardless of price.

Coinbase's willingness to move in this direction suggests the company has concluded that, for at least a portion of its workloads, those risks are manageable. It's a judgment call that other enterprises will need to make for themselves, and the answer will vary considerably by sector, jurisdiction, and the sensitivity of the data involved.

What to Watch

The immediate pressure point is on Western AI labs' pricing strategies. If enterprise customers can achieve equivalent output quality at half the cost by routing to Chinese alternatives, the premium pricing model becomes harder to defend. Expect to see further price reductions from OpenAI, Anthropic, and Google — a dynamic that, ironically, benefits all AI buyers regardless of which models they ultimately use.

Longer term, the rise of automated model routing as standard enterprise infrastructure is worth tracking. As routing layers become more sophisticated — incorporating quality scoring, latency requirements, compliance filters, and real-time cost signals — the notion of a single "preferred AI provider" may give way to a fluid, multi-model architecture where procurement is continuous and algorithmic.
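One way such a constraint-aware router might look is sketched below: compliance, quality, and latency act as hard filters, and cost decides among whatever remains. The `Candidate` and `Request` types, their fields, and the idea of a single numeric quality score are assumptions made for illustration only:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    price_per_million: float  # hypothetical price in USD
    quality: float            # hypothetical 0..1 score from offline evaluations
    p95_latency_ms: int       # hypothetical measured latency
    region: str               # where the provider processes data


@dataclass(frozen=True)
class Request:
    min_quality: float
    max_latency_ms: int
    allowed_regions: frozenset  # compliance filter, e.g. data residency rules


def select(request: Request, candidates) -> Optional[Candidate]:
    """Apply hard constraints first, then pick the cheapest remaining model."""
    eligible = [
        c for c in candidates
        if c.region in request.allowed_regions
        and c.quality >= request.min_quality
        and c.p95_latency_ms <= request.max_latency_ms
    ]
    # Returns None when nothing satisfies the constraints.
    return min(eligible, key=lambda c: c.price_per_million, default=None)
```

Treating compliance as a filter rather than a weighted score means a cheaper model can never win a request it is not permitted to serve, which matches the risk concerns raised earlier in this article.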

Coinbase's 50% cost reduction is a proof point, not an outlier. The firms that build this kind of infrastructure now will have a structural cost advantage over those still paying premium rates for undifferentiated inference.
