KV cache memory is a major bottleneck for long-context LLMs. Together AI's new OSCAR framework offers an 8× memory reduction through attention-aware 2-bit quantization, promising a shift in production serving economics.
KV cache memory is quietly becoming one of the most consequential bottlenecks in production large language model (LLM) deployment. As context windows stretch to 100K tokens and beyond, the memory required to store key-value attention states grows linearly — and at full precision, that growth becomes prohibitive. A single long-context inference pass can consume tens of gigabytes of GPU memory just in KV cache alone, leaving little room for batch sizes large enough to make serving economically viable.
Together AI's newly open-sourced OSCAR system attacks this problem directly. OSCAR is an attention-aware INT2 KV cache quantization framework that compresses KV cache entries to approximately 2.28 bits per KV element, achieving roughly 8× KV memory reduction and up to 3× decode speedup at 100K context length. The release marks a meaningful shift in how the industry can approach long-context serving infrastructure — not by throwing more hardware at the problem, but by rethinking the precision-efficiency tradeoff at the attention layer.
The KV Cache Problem at Scale
To understand why OSCAR matters, it helps to quantify the memory pressure that long-context deployments create. In a standard transformer, the KV cache stores intermediate attention keys and values for every token in the context window across every layer. For a model with 32 layers, 32 attention heads, and a 128-dimensional head size operating at FP16 precision, a single sequence of 100K tokens requires approximately:
100,000 tokens × 32 layers × 2 (K and V) × 32 heads × 128 dims × 2 bytes (FP16) ≈ 52 GB of KV cache memory per sequence
Even with grouped-query attention (GQA) reducing the number of KV heads, these figures remain staggering. At FP16, serving multiple concurrent users at 100K context on a single A100 node is essentially impossible unless batch size is tightly capped.
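The estimate above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` is made up for illustration, and the 8-KV-head GQA variant is a hypothetical configuration chosen only to show how sharing KV heads scales the figure.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Bytes needed to hold keys and values for one sequence."""
    elements = tokens * layers * 2 * kv_heads * head_dim  # 2 = K and V
    return elements * bits_per_element / 8

GB = 1e9  # decimal gigabytes, matching the estimate in the text

# Full multi-head attention: 32 KV heads, FP16 (16 bits per element).
mha_fp16 = kv_cache_bytes(100_000, 32, 32, 128, 16)
# Hypothetical GQA variant: the same model sharing 8 KV heads.
gqa_fp16 = kv_cache_bytes(100_000, 32, 8, 128, 16)

print(f"MHA FP16: {mha_fp16 / GB:.1f} GB")
print(f"GQA FP16: {gqa_fp16 / GB:.1f} GB")
```

Under these assumptions the multi-head case comes to about 52.4 GB and the 8-head GQA case to about 13.1 GB per 100K-token sequence, which is still a large share of a single GPU's memory.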
Existing mitigation strategies — FP8 quantization, KV eviction policies, sliding window attention — each carry tradeoffs. FP8 reduces memory by 2× but doesn't move the needle enough for extreme context lengths. Eviction-based approaches like H2O and StreamingLLM sacrifice recall quality for tokens outside a fixed budget. Sliding window attention fundamentally limits what the model can attend to.
INT2 quantization offers a more aggressive path: a theoretical 8× reduction from FP16 baseline. The challenge has always been that naive 2-bit quantization destroys attention quality — the dynamic range of KV tensors is too wide, and outlier values dominate the quantization error budget.
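The outlier problem is easy to see on toy data. The sketch below uses a generic min-max (asymmetric) 2-bit quantizer, which is an illustrative stand-in and not OSCAR's scheme; the sample values are invented.

```python
def quantize_2bit(values):
    """Uniform asymmetric 2-bit quantization: 4 levels spread across [min, max]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3  # 3 steps between 4 levels
    if scale == 0:
        scale = 1.0  # all values identical; any scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return [lo + c * scale for c in codes]

def mean_abs_error(values):
    codes, scale, lo = quantize_2bit(values)
    restored = dequantize(codes, scale, lo)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Toy data: eight small values, then the same values plus one outlier.
typical = [-0.4, -0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4]
with_outlier = typical + [12.0]

print(f"no outlier:   {mean_abs_error(typical):.3f}")
print(f"with outlier: {mean_abs_error(with_outlier):.3f}")
```

With the outlier present, the four levels are stretched across a much wider range, so every small value collapses onto the same code and the average error is several times larger.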
How OSCAR's Attention-Aware Design Works
OSCAR's core insight is that not all KV cache entries are equally sensitive to quantization error, and that sensitivity is predictable from attention patterns. Rather than applying uniform 2-bit quantization across all tokens and layers, OSCAR uses an attention-aware allocation strategy that identifies which tokens are likely to receive high attention weights and preserves higher precision for those entries.
The system operates at approximately 2.28 bits per KV element — slightly above the theoretical INT2 floor — because it maintains a small fraction of entries at higher precision based on per-head importance scores derived from attention distributions. This mixed-precision approach within the INT2 budget is what separates OSCAR from earlier naive quantization attempts.
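One way to read the 2.28-bit figure is as a weighted average of a 2-bit base and a small higher-precision share. The sketch below solves for that share under assumed high-precision formats; the source does not state which format is used or how metadata such as scales is counted, so the fractions are illustrative only.

```python
def protected_fraction(avg_bits, low_bits=2, high_bits=16):
    """Fraction f solving avg_bits = (1 - f) * low_bits + f * high_bits."""
    return (avg_bits - low_bits) / (high_bits - low_bits)

def compression_ratio(avg_bits, baseline_bits=16):
    """Size reduction relative to a baseline_bits-per-element cache."""
    return baseline_bits / avg_bits

# Assumed candidates for the higher-precision format (not from the source).
for high_bits in (4, 8, 16):
    f = protected_fraction(2.28, high_bits=high_bits)
    print(f"high precision = {high_bits:>2} bits -> {f:.1%} of entries protected")

print(f"compression vs FP16: {compression_ratio(2.28):.2f}x")
```

If the protected entries were kept at FP16, about 2% of the cache would be protected. Note also that 16 / 2.28 is roughly 7.0, so the "roughly 8×" headline corresponds to the nominal 2-bit ratio rather than the exact ratio at 2.28 bits.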
Several architectural decisions underpin this design:
Per-Head Quantization Granularity
Rather than quantizing at the layer level, OSCAR applies quantization decisions at the attention head level. Different heads in the same layer exhibit dramatically different attention entropy and sparsity patterns. A head with sharp, concentrated attention (attending to a small set of highly relevant tokens) has a very different sensitivity profile than a diffuse head that distributes attention broadly. Per-head granularity lets OSCAR tune precision allocation accordingly.
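Attention entropy gives a simple way to tell the two kinds of head apart. The following sketch computes it for two invented score vectors; how OSCAR actually turns per-head statistics into precision decisions is not described in the source, so this only illustrates the signal itself.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy of an attention distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented logits over 8 cached tokens for two heads in the same layer.
sharp_head = softmax([9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
diffuse_head = softmax([0.2, 0.1, 0.0, 0.1, 0.2, 0.0, 0.1, 0.1])

print(f"sharp head:   {entropy_bits(sharp_head):.3f} bits")
print(f"diffuse head: {entropy_bits(diffuse_head):.3f} bits")
```

The sharp head's entropy is close to 0 bits, while the diffuse head sits near the 3-bit maximum for 8 tokens. A low-entropy head depends on a handful of entries, so errors in those entries matter far more than errors elsewhere.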
Dynamic Importance Scoring
OSCAR computes lightweight importance scores for KV entries as the context is processed, using accumulated attention weights as a proxy for which tokens are likely to matter in future decoding steps. Tokens that have received consistently high attention are flagged for precision protection. This is conceptually related to the eviction heuristics in H2O (Heavy Hitter Oracle), but instead of evicting low-importance tokens, OSCAR retains them at INT2 while preserving high-importance tokens at higher effective precision.
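The scoring idea can be sketched as a running sum of attention weights followed by a top-fraction selection. The function names, the `keep_fraction` parameter, and the toy weights are all hypothetical; OSCAR's actual scoring rule and thresholds are not given in the source.

```python
def accumulate_importance(attention_rows, num_tokens):
    """Sum the attention each cached token receives across query steps.

    Rows may be shorter than num_tokens (causal attention); missing
    positions simply receive no weight from that step.
    """
    scores = [0.0] * num_tokens
    for row in attention_rows:
        for token_idx, weight in enumerate(row):
            scores[token_idx] += weight
    return scores

def protected_tokens(scores, keep_fraction):
    """Indices of the highest-scoring tokens to hold at higher precision."""
    keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy attention weights: 3 query steps over 5 cached tokens.
attention_rows = [
    [0.70, 0.05, 0.05, 0.15, 0.05],
    [0.60, 0.05, 0.10, 0.20, 0.05],
    [0.65, 0.05, 0.05, 0.20, 0.05],
]
scores = accumulate_importance(attention_rows, num_tokens=5)
high_precision = protected_tokens(scores, keep_fraction=0.4)
# Unlike eviction, the remaining tokens stay in the cache at low precision.
low_precision = [i for i in range(5) if i not in high_precision]

print("protected:", high_precision)
print("kept at INT2:", low_precision)
```

Here tokens 0 and 3 accumulate the most attention and are protected, while tokens 1, 2, and 4 are retained at 2 bits instead of being dropped.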
Kernel-Level Dequantization
The 3× decode speedup isn't just a consequence of moving less data; it also reflects optimized CUDA kernels that fuse dequantization with attention computation. At 100K context, memory bandwidth is the dominant bottleneck during autoregressive decoding, because each decode step reads the entire KV cache once. For the example model above, 2.28 bits per element shrinks the KV cache from ~52 GB to ~7.5 GB (an exact 8× reduction would be ~6.5 GB), which means proportionally less data movement per decode step and translates directly to throughput gains on memory-bandwidth-bound hardware like A100s and H100s.
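A back-of-the-envelope bound makes the bandwidth argument concrete: if each decode step must stream the whole KV cache once, the step cannot finish faster than cache size divided by memory bandwidth. The bandwidth values below are approximate vendor peak figures used as assumptions, not measurements, and the calculation ignores weight reads and compute.

```python
def kv_read_ms(kv_bytes, bandwidth_bytes_per_s):
    """Lower bound on per-token decode time from streaming the KV cache once."""
    return kv_bytes / bandwidth_bytes_per_s * 1e3

FP16_KV_BYTES = 52.4e9                      # per-sequence estimate from the text
OSCAR_KV_BYTES = FP16_KV_BYTES * 2.28 / 16  # same elements at 2.28 bits each

# Approximate peak HBM bandwidths in bytes/s (assumed round figures).
gpus = {"A100 80GB": 2.0e12, "H100 SXM": 3.35e12}

for name, bandwidth in gpus.items():
    fp16_ms = kv_read_ms(FP16_KV_BYTES, bandwidth)
    oscar_ms = kv_read_ms(OSCAR_KV_BYTES, bandwidth)
    print(f"{name}: FP16 {fp16_ms:.1f} ms/token, 2.28-bit {oscar_ms:.1f} ms/token")
```

On this idealized model the KV read time drops by about 7×. The reported 3× end-to-end speedup is lower, which would be consistent with costs that do not shrink with the cache, such as reading model weights and the dequantization work itself.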
Benchmark Performance and Validated Models
Together AI has validated OSCAR against Qwen3-4B-Thinking-2507 and Qwen3-8B, two models that represent the current generation of reasoning-capable open-weight architectures. These models are particularly relevant test cases because thinking/reasoning models tend to generate very long chains of thought, making long-context efficiency directly impactful for their practical utility.
The headline figures from Together AI's release:
- 8× KV memory reduction at 100K context length versus FP16 baseline
- 3× decode speedup at 100K context length
- Quality preservation sufficient for production use cases across the validated model families
The quality preservation claim deserves scrutiny. At 2-bit quantization, the expected degradation on standard benchmarks (MMLU, HumanEval, long-context needle-in-a-haystack evaluations) is a critical datapoint. Together AI's framing as production-ready implies the quality loss is within acceptable bounds for deployment — a threshold that varies by use case but generally means less than 1-2% degradation on core benchmarks.
For context on where this sits in the broader quantization landscape: FP8 KV cache (already widely deployed) achieves roughly 2× memory reduction with near-zero quality loss. INT4 KV cache achieves 4× reduction with modest degradation. OSCAR's 8× reduction at INT2 represents the aggressive end of the spectrum, and the attention-aware mechanism is specifically designed to recover the quality that naive INT2 would sacrifice.
Infrastructure Implications for LLM Deployment
The practical consequences of OSCAR's efficiency gains extend beyond raw benchmark numbers. Consider the deployment economics at scale:
Batch size expansion: With an 8× smaller KV cache, an operator serving 100K-context requests can fit proportionally more concurrent sequences in GPU memory. For a system that was previously memory-limited to batch size 4 at 100K context on an 8×A100 node, OSCAR could theoretically enable batch sizes approaching 32 (about 28 if the cache shrinks by 16/2.28 ≈ 7× rather than a full 8×), a transformation in throughput and cost-per-token.
Hardware tier reduction: Long-context workloads that previously required H100 80GB instances (or multi-GPU setups) may become viable on A100 40GB or even consumer-grade hardware with sufficient VRAM. This has direct implications for inference cost curves and the accessibility of long-context capabilities.
Latency profiles: The 3× decode speedup at 100K context is particularly valuable for interactive applications. Reasoning models generating 10K-20K token chains of thought become meaningfully more responsive — the difference between a 30-second wait and a 10-second wait is often the difference between a usable and unusable product experience.
Multi-tenant serving: In shared inference infrastructure, KV cache memory is a primary constraint on how many users can be served simultaneously. OSCAR's compression directly increases the effective capacity of existing hardware fleets.
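The batch-size and multi-tenant points both reduce to the same division: memory left for KV cache over KV bytes per sequence. The sketch below assumes a hypothetical 210 GB KV budget, picked so that exactly four FP16 sequences fit, and reuses the 52.4 GB per-sequence estimate from earlier in the article.

```python
import math

def max_concurrent_sequences(kv_budget_bytes, kv_bytes_per_sequence):
    """How many sequences fit in the memory left over for KV cache."""
    return math.floor(kv_budget_bytes / kv_bytes_per_sequence)

FP16_PER_SEQ = 52.4e9  # 100K-token sequence at FP16, from the earlier estimate
KV_BUDGET = 210e9      # hypothetical budget after weights and activations

for label, bits in [("FP16", 16), ("nominal INT2", 2), ("2.28-bit", 2.28)]:
    per_seq = FP16_PER_SEQ * bits / 16
    fits = max_concurrent_sequences(KV_BUDGET, per_seq)
    print(f"{label:>12}: {per_seq / 1e9:5.1f} GB/seq -> {fits} sequences")
```

Under these assumptions the budget holds 4 sequences at FP16, 32 at an idealized 2 bits, and 28 at 2.28 bits per element.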
Situating OSCAR in the Quantization Ecosystem
OSCAR arrives in a crowded but still-evolving space. Several prior systems have addressed KV cache compression with different approaches:
| System | Method | Compression | Quality Tradeoff |
|---|---|---|---|
| FP8 KV (vLLM, TensorRT-LLM) | Uniform FP8 quantization | ~2× | Near-zero |
| KIVI | INT2/INT4, per-channel keys and per-token values | 4-8× | Moderate |
| H2O | Attention-based eviction | Variable | Task-dependent |
| SnapKV | Observation-based compression | Variable | Moderate |
| OSCAR | Attention-aware INT2 | ~8× | Production-viable |
What distinguishes OSCAR is the combination of aggressive compression ratio and the attention-aware mechanism designed to maintain quality. The open-source release also matters: production teams can inspect, adapt, and integrate the system rather than relying on a black-box service.
The choice to validate on Qwen3 models is strategically significant. Qwen3's architecture (with its GQA configuration and strong long-context performance) represents a popular foundation for enterprise deployments. Demonstrating OSCAR's effectiveness on these specific models gives practitioners a concrete starting point for integration.
What the Open-Source Release Means
Together AI's decision to open-source OSCAR reflects a broader pattern in the inference optimization space: companies that build infrastructure capabilities are increasingly releasing them publicly, both to drive adoption of their platforms and to establish technical credibility. The release on GitHub (alongside the MarkTechPost coverage at marktechpost.com) makes OSCAR available for the broader community to evaluate and extend.
For practitioners evaluating LLM deployment infrastructure, the open-source availability means:
- Independent quality validation is possible — teams can run their own needle-in-a-haystack and task-specific benchmarks rather than relying solely on Together AI's reported figures
- Integration into existing serving frameworks (vLLM, SGLang, TGI) becomes a community engineering effort rather than waiting for vendor support
- The attention-aware quantization logic can be studied and potentially applied to other compression techniques or model architectures
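For the independent validation point, a needle-in-a-haystack check needs very little scaffolding. The sketch below is a hypothetical harness: `generate` stands for whatever callable wraps a team's own model endpoint (it is not an OSCAR or Together AI API), and `echo_model` is a stand-in so the example runs without a model.

```python
def build_prompt(needle, filler_sentence, total_sentences, depth):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    position = round(depth * total_sentences)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences) + "\nQuestion: What is the secret code?\nAnswer:"

def retrieval_accuracy(generate, secret, depths, total_sentences=1000):
    """Fraction of depths at which the model's answer contains the secret."""
    needle = f"The secret code is {secret}."
    hits = 0
    for depth in depths:
        prompt = build_prompt(needle, "The sky was grey that day.",
                              total_sentences, depth)
        hits += secret in generate(prompt)
    return hits / len(depths)

def echo_model(prompt):
    """Stand-in model that reads the answer straight out of the prompt."""
    marker = "The secret code is "
    start = prompt.find(marker)
    return prompt[start + len(marker):].split(".")[0] if start >= 0 else ""

depths = [0.0, 0.25, 0.5, 0.75, 1.0]
print("accuracy:", retrieval_accuracy(echo_model, "7421", depths))
```

In practice the same `depths` sweep would be run twice, once against an FP16 KV cache and once against the quantized one, and the two accuracy curves compared at each context length of interest.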
The Road Ahead for Long-Context Serving
OSCAR represents a technically credible answer to a real infrastructure problem, but it's worth noting what remains unresolved. The 2.28 bits per KV element figure implies a mixed-precision scheme that adds implementation complexity. The attention-aware importance scoring introduces a small computational overhead during prefill. And the quality validation, while promising, covers a limited set of model families — generalization to architectures with different attention patterns (MLA in DeepSeek, for instance) remains an open question.
The broader trajectory is clear: as context windows push toward 1M tokens — a direction already signaled by Gemini 1.5 Pro and Claude's extended context offerings — the memory pressure on KV cache will continue to intensify. Techniques like OSCAR are not a final answer but a necessary step in a longer progression toward context-efficient inference.
For teams running long-context workloads today, OSCAR's 8× memory reduction and 3× decode speedup at 100K context length represent gains that are difficult to achieve through any other single intervention. The attention-aware INT2 approach is technically sound, the open-source release enables independent validation, and the validated performance on current-generation reasoning models makes it immediately relevant to production deployment decisions.
The KV cache bottleneck isn't solved — but OSCAR moves the frontier meaningfully forward.
Sources
- Together AI / MarkTechPost: Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
Last reviewed: May 26, 2026



