For the past few years, Retrieval-Augmented Generation (RAG) has been the cornerstone of scaling Large Language Models (LLMs) to massive knowledge bases. Because early LLMs had tightly limited input lengths (GPT-4, for example, initially handled only about 8,192 tokens, roughly 12 pages of text), RAG provided an elegant, if complex, workaround: retrieve the most relevant fragments and feed only those to the LLM. However, the rapid evolution of LLMs and their specialized ranking capabilities, combined with exploding context windows, suggests that the traditional RAG architecture we built and optimized is in fundamental decline.
| Release Date | Model Family | Model Name | Max Context Tokens |
|---|---|---|---|
| Nov 30, 2022 | OpenAI | GPT-3.5 Turbo | 16,385 (16K) |
| Mar 14, 2023 | OpenAI | GPT-4 | 8,192 (8K) |
| Sep 27, 2023 | Mistral | Mistral 7B | 32,768 (32K) |
| Nov 6, 2023 | OpenAI | GPT-4 Turbo | 128,000 (128K) |
| Mar 4, 2024 | Anthropic | Claude 3 (Opus, Sonnet, Haiku) | 200,000 (200K) |
| Apr 18, 2024 | Meta | Llama 3 | 8,192 (8K) |
| May 13, 2024 | OpenAI | GPT-4o | 128,000 (128K) |
| Jun 20, 2024 | Anthropic | Claude 3.5 Sonnet | 200,000 (200K) |
| Jul 23, 2024 | Meta | Llama 3.1 | 128,000 (128K) |
| Jul 24, 2024 | Mistral | Mistral Large 2 | 128,000 (128K) |
| Jun 17, 2025 | Google | Gemini 2.5 Pro | 1,048,576 (1M) |
- This table is automatically generated by Gemini 2.5 Pro.
The table shows a clear and dramatic trend: the exponential growth of LLM context windows, driven by intense and rapid competition. In less than three years, the standard for flagship models exploded from 16K tokens (GPT-3.5 Turbo) to a competitive baseline of 128K-200K across all major labs in 2024. The race then escalated into a “million-token” era in mid-2025, with Google’s Gemini 2.5 series setting a 1M-token precedent and demonstrating that massive context has become a critical, rapidly advancing frontier of model capability. If the trajectory holds, 10M+ context windows are plausible by 2027, and Sam Altman has hinted at billions of context tokens on the horizon.
As a result, the future is shifting away from fragmented retrieval pipelines and moving toward highly efficient, comprehensive LLM ranking and intelligent navigation.
The Unbearable Burden of Traditional RAG
RAG was a brilliant band-aid, but it relies on a complex, multi-step pipeline fraught with points of failure and computational bottlenecks:
- The Chunking Challenge: Long documents must be broken into digestible pieces, typically 400-1,000 tokens each. Even with sophisticated techniques to preserve hierarchical structure and table integrity, chunking permanently destroys context and cross-chunk semantic relationships, leading to fragmented understanding (a minimal chunking sketch follows this list).
- The Hybrid Search Nightmare: Accurate retrieval requires combining keyword search (BM25) with semantic search (embeddings), which in turn demands sophisticated engineering for parallel processing, dynamic weighting, and score normalization using methods like Reciprocal Rank Fusion (RRF); a sketch of the RRF fusion step also follows this list.
- The Reranking Bottleneck: After all that work, an expensive second stage, reranking, is needed to cut the candidate set down to what a context-poor LLM can hold. This adds significant latency (roughly 300 ms to 2,000 ms per query) and increases API costs. The multi-stage pipeline also creates a “cascading failure problem”: errors compound from chunking to embedding to fusion to reranking. And because RAG treats long documents as independent paragraphs, it fundamentally fails on complex analysis that requires causal understanding and tracing cross-references.
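To make the chunking problem concrete, below is a minimal sketch of naive fixed-size chunking with overlap. It is illustrative only: the `chunk_document` helper, the whitespace-based token approximation, and the 800/100 parameters are assumptions for this example rather than any particular framework's API; production pipelines would use the model's tokenizer and structure-aware splitting, yet still face the same boundary problem.

```python
# A minimal, illustrative chunker: fixed-size windows with a fixed overlap.
# NOTE: this is a sketch; token counts are approximated by whitespace splitting,
# whereas a real pipeline would use the model's tokenizer and structure-aware splitting.

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping chunks of roughly `chunk_size` tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks

# Example usage with a synthetic 2,000-token document:
doc = " ".join(f"word{i}" for i in range(2000))
pieces = chunk_document(doc)
print(len(pieces), [len(p.split()) for p in pieces])  # 3 chunks of 800, 800, 600 tokens
```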
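The hybrid-search fusion step can likewise be sketched in a few lines. The snippet below assumes two ranked lists of document IDs have already been returned by a BM25 index and a vector index (the retrieval calls themselves are not shown); `k = 60` is the constant commonly used in the standard RRF formula.

```python
# A minimal sketch of Reciprocal Rank Fusion (RRF): merge ranked lists from
# keyword (BM25) and semantic (embedding) retrieval into one ranking.

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list by summed RRF score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: documents that both retrievers agree on float to the top.
bm25_hits = ["doc3", "doc1", "doc7"]
embedding_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, embedding_hits]))
# -> ['doc1', 'doc3', 'doc9', 'doc7'] (doc1 and doc3 appear in both lists)
```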
In-context Ranking: An Emerging Paradigm
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR) that uses the comprehensive contextual understanding of Large Language Models (LLMs) to perform listwise ranking. In this setup, the LLM processes a query and a list of candidate documents simultaneously and directly outputs a ranked list. There is active research in this area, for example:
- LATTICE is a training-free, LLM-guided hierarchical retrieval framework engineered for complex, reasoning-intensive queries, offering an alternative to the limitations of both the retrieve-then-rerank and long-context paradigms. The system organizes the document corpus into a semantic tree offline, built through LLM-driven strategies such as bottom-up clustering or top-down divisive summarization. This hierarchy allows the framework to achieve search complexity that is logarithmic in the number of documents. During online query processing, an LLM navigates the tree using a greedy, best-first traversal. A crucial component of the traversal algorithm is the estimation of calibrated latent relevance scores, which are aggregated into a path relevance metric to reliably guide the search across different branches and levels, mitigating the challenge of noisy, context-dependent LLM relevance judgments (a simplified traversal sketch appears after this list).
- This paper presents a comprehensive study of long-context LLMs for listwise passage ranking, contrasting the traditional, inefficient sliding window strategy with a more efficient full ranking strategy. The sliding window approach, used historically because of limited context length, incurs redundant API costs and serialized processing; full ranking instead processes all passages in a single inference step, yielding superior efficiency and roughly 50% lower API costs. While full ranking is more efficient but less effective in the zero-shot setting, the paper shows that with supervised fine-tuning the full ranking model outperforms the sliding window model. To fine-tune it effectively, the authors propose two key innovations: a multi-pass sliding window procedure to generate complete listwise labels, and an importance-aware learning objective that assigns greater weight to top-ranked passage IDs during loss calculation. The resulting full ranking model beats the baselines in both ranking effectiveness and efficiency (a minimal sketch of single-pass listwise ranking follows this list).
- The paper presents BlockRank (Blockwise In-context Ranking), a novel and efficient method developed to address the significant efficiency challenges of using generative LLMs for In-context Ranking, particularly the quadratic scaling of attention with increasing context length. BlockRank is motivated by an analysis of LLMs fine-tuned for ICR, which revealed inherent structure in their attention patterns: inter-document block sparsity (document tokens attend mostly locally) and query-document block relevance (specific query tokens develop strong retrieval signals toward relevant documents in the middle layers). Building on these insights, BlockRank introduces two key modifications: (1) a structured sparse attention mechanism that architecturally enforces this sparsity, reducing attention complexity from quadratic to linear in the number of documents; and (2) an auxiliary contrastive learning objective applied at a middle layer to explicitly optimize these internal attention scores to reflect relevance. This optimization enables a highly efficient attention-based inference method that bypasses iterative auto-regressive decoding.
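To make the ICR setup concrete, here is a minimal sketch of single-pass listwise ranking in the spirit of the “full ranking” strategy described above. `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the prompt format and output parsing are illustrative assumptions rather than a specific paper's protocol.

```python
# A minimal sketch of single-pass listwise In-context Ranking ("full ranking"):
# the query and every candidate passage go into one prompt, and the model replies
# with a ranked list of passage numbers. `call_llm` is a hypothetical stand-in for
# any chat-completion client; the prompt and parsing below are illustrative only.

def build_listwise_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Rank the passages below by relevance to the query.\n"
        f"Query: {query}\n\n"
        f"Passages:\n{numbered}\n\n"
        "Answer with the passage numbers only, most relevant first, "
        "comma-separated (e.g. 3,1,2)."
    )

def in_context_rank(query: str, passages: list[str], call_llm) -> list[int]:
    """One inference over all candidates -- no sliding window, no reranker stage."""
    reply = call_llm(build_listwise_prompt(query, passages))
    raw = [tok for tok in reply.replace(" ", "").split(",") if tok.isdigit()]
    seen, ranking = set(), []
    for tok in raw:
        idx = int(tok)
        # Drop hallucinated or duplicate indices; LLM output can be noisy.
        if 1 <= idx <= len(passages) and idx not in seen:
            seen.add(idx)
            ranking.append(idx)
    return ranking
```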
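And here is a simplified sketch of LLM-guided hierarchical navigation in the spirit of LATTICE. It is not the paper's implementation: `score_relevance` stands in for the LLM's calibrated relevance judgment, the path score is aggregated with a simple mean, and `budget` caps the number of node expansions purely for illustration.

```python
import heapq
from dataclasses import dataclass, field

# A simplified illustration of LLM-guided hierarchical retrieval: greedy best-first
# traversal of a semantic tree whose nodes hold LLM-written summaries. This is a
# sketch, not LATTICE's implementation; `score_relevance` stands in for the LLM's
# calibrated relevance judgment and the path score is a simple mean.

@dataclass
class Node:
    summary: str                               # LLM-generated summary of this subtree
    children: list["Node"] = field(default_factory=list)
    doc_id: str | None = None                  # set only on leaf nodes (real documents)

def hierarchical_search(root: Node, query: str, score_relevance,
                        budget: int = 20, top_k: int = 5) -> list[tuple[str, float]]:
    """Expand the most promising node first until the expansion budget runs out."""
    root_score = score_relevance(query, root.summary)
    frontier = [(-root_score, 0, root, [root_score])]  # (negated path score, tiebreak, node, scores along path)
    tiebreak = 1
    results: list[tuple[str, float]] = []
    while frontier and budget > 0:
        neg_path_score, _, node, path = heapq.heappop(frontier)
        budget -= 1
        if node.doc_id is not None:             # reached a leaf: record the document
            results.append((node.doc_id, -neg_path_score))
            continue
        for child in node.children:
            s = score_relevance(query, child.summary)       # one LLM judgment per child
            child_path = path + [s]
            path_score = sum(child_path) / len(child_path)  # aggregate relevance along the path
            heapq.heappush(frontier, (-path_score, tiebreak, child, child_path))
            tiebreak += 1
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top_k]
```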
Conclusion: The Post-Retrieval Age is Here
RAG was an indispensable training wheel for the context-poor era. It allowed us to solve problems that exceeded the short attention span of early LLMs. Now, however, we have crossed a threshold. We are entering the post-retrieval age where LLM-native approaches offer superior performance, lower cost, and streamlined infrastructure:
| RAG (The Old Way) | LLM-Native Ranking (The Future) |
|---|---|
| Fragmented chunking | Full document context/Hierarchical summaries |
| Latency and cost multiplication | Linear scaling, massive efficiency gains |
| Relies on similarity (embeddings) | Relies on precision and reasoning |
| Complex, brittle pipeline | Simple, agentic navigation |
The shift is clear: the LLM is no longer just a summarizer waiting for fragments; it is the core ranking engine and the intelligent search agent. The cumbersome infrastructure of RAG is being replaced by the sheer power and efficiency of LLM ranking and reasoning in abundant context.