RAG in the Era of LLMs with 10 Million Token Context Windows

F5 Ecosystem | April 09, 2025

Hunter Smit, Senior Product Marketing Manager

Meta recently unveiled the Llama 4 herd of LLMs—Scout, Maverick, and Behemoth preview—featuring a 10 million token context window with Scout. Soon afterward on X, LinkedIn, and other forums, comments around retrieval-augmented generation (RAG) becoming obsolete were gaining momentum, suggesting such an expansive context window could render RAG useless. However, we believe RAG will continue to be a foundational generative AI design pattern given the nuances of context windows, ever-changing corporate data, distributed data stores, regulatory concerns, model performance, and the relevance of enterprise-scale AI applications.

RAG is critical architecture for enterprises

Despite Llama 4’s achievement in supporting 10 million token context windows, RAG remains a critical component in enterprise AI applications. Enterprises often operate with dynamic, ever-changing data sets stored across distributed systems. RAG enables models to fetch and incorporate the most current and relevant information from these vast data stores in real time, ensuring AI outputs are both accurate and contextually relevant. What counts as relevant differs for every organization, team, and user. Real-time retrieval is vital for applications requiring up-to-date knowledge, such as customer support, market analysis, and knowledge bases.

Relying solely on large context windows without external retrieval can be both inefficient and a security liability. When data is continuously fed into a model, it becomes harder to control who can access that data, whether it’s stored securely, and how it might be inadvertently exposed through logs or model outputs. Insider threats, malicious prompts, or accidental leaks become more likely as the data volume grows, and organizations risk violating privacy or compliance mandates if confidential records are mishandled.

By adopting RAG, enterprises can retrieve only the most pertinent data for each query, aligning with regional and industry-specific regulatory constraints that often necessitate highly correlated data selection. This approach reduces the attack surface while ensuring consistent enforcement of policies like role-based access controls, encryption in transit, and detailed audit mechanisms. This selective retrieval not only cuts down on computational overhead but also enforces a robust security posture by limiting exposure of sensitive assets to precisely what is needed at the time of inference.
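To make the idea of selective, permission-aware retrieval concrete, here is a minimal sketch. It assumes a tiny in-memory corpus, a toy keyword-overlap score in place of a real vector search, and hypothetical names (`Document`, `CORPUS`, `retrieve`, the role labels) that are purely illustrative. The point it shows is the ordering: access rights are checked before ranking, so documents a user cannot see never reach the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_roles: frozenset  # roles permitted to read this document (hypothetical labels)


# Hypothetical corpus used only for illustration.
CORPUS = [
    Document("hr-1", "salary bands for engineering roles", frozenset({"hr"})),
    Document("kb-1", "how to reset a customer password", frozenset({"support", "hr"})),
    Document("kb-2", "refund policy for customer orders", frozenset({"support"})),
]


def score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    words = set(text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def retrieve(query, user_roles, corpus, top_k=2):
    """Return up to top_k relevant documents the user is allowed to read."""
    # Enforce access control first, so restricted documents are never ranked.
    permitted = [d for d in corpus if d.allowed_roles & user_roles]
    ranked = sorted(permitted, key=lambda d: score(query, d.text), reverse=True)
    # Drop documents with no overlap at all; they add tokens but no value.
    return [d for d in ranked[:top_k] if score(query, d.text) > 0]


support_hits = retrieve("customer refund policy", {"support"}, CORPUS)
print([d.doc_id for d in support_hits])
```

In this sketch a support user asking about salary bands gets nothing back, because the only matching document is restricted to the `hr` role. A production system would apply the same filter inside the vector store or retrieval service rather than in application code.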

Context windows and implications

In LLMs, the context window denotes the maximum number of tokens the model can process in a single input. Expanding this window allows the model to consider more extensive information simultaneously, resulting in more detailed conversations, more comprehensive analysis, and improved personalization. For perspective, raw text composed of 100,000 tokens is approximately 325 KB in size; a 10 million token context would equate to roughly 32 MB of text data. This capacity enables Llama 4 Scout to handle large amounts of information in a single query.
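The size estimate above is simple proportional arithmetic, sketched below. The bytes-per-token ratio is derived only from the figure quoted in this article (100,000 tokens is approximately 325 KB) and assumes 1 KB = 1,000 bytes. Real ratios vary by tokenizer and language, so treat this as a rough rule of thumb.

```python
# Ratio implied by the article's figure: 100,000 tokens is roughly 325 KB of raw text.
# Assumes 1 KB = 1,000 bytes; actual ratios depend on the tokenizer and the text.
BYTES_PER_TOKEN = 325_000 / 100_000  # about 3.25 bytes per token


def estimated_text_size_mb(num_tokens: int) -> float:
    """Approximate raw-text size in megabytes for a given token count."""
    return num_tokens * BYTES_PER_TOKEN / 1_000_000


print(estimated_text_size_mb(10_000_000))  # 32.5
```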

While an extended context window offers the advantage of processing more data at once, it introduces challenges related to model performance, accuracy, and efficiency. Processing millions of tokens demands substantial computational resources, leading to increased latency and higher operational costs. As context length grows, models can experience difficulties in maintaining attention and relevance across the entire input, potentially impacting the quality of AI outputs. On this topic, Andriy Burkov, Ph.D., an author and recognized AI expert, wrote on X, “The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time.”

While larger context windows present new opportunities, the need to balance performance and resource utilization is critical. The optimal scenario is to present all the relevant information, but nothing that is not needed. In fact, some studies seem to indicate that, just as with humans, feeding too much information to an LLM detracts from its ability to identify and focus on what matters. For those interested, the paper Lost in the Middle: How Language Models Use Long Contexts explores this topic in depth.
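One way to approximate "all the relevant information, but nothing that is not needed" is to pack retrieved chunks into a fixed token budget, most relevant first. The sketch below assumes each chunk already carries a relevance score and a token count from an upstream retriever. The function name `pack_context`, the 0.5 threshold, and the sample numbers are hypothetical choices for illustration.

```python
def pack_context(chunks, token_budget, min_score=0.5):
    """Greedily select chunks by relevance until the token budget is used.

    chunks: list of (text, relevance_score, token_count) tuples.
    Returns (selected_texts, tokens_used).
    """
    selected = []
    used = 0
    for text, relevance, tokens in sorted(chunks, key=lambda c: c[1], reverse=True):
        if relevance < min_score:
            break  # everything after this point is even less relevant
        if used + tokens > token_budget:
            continue  # too large for the remaining budget; try a smaller chunk
        selected.append(text)
        used += tokens
    return selected, used


# Hypothetical retrieved chunks: (text, relevance, tokens).
chunks = [("A", 0.9, 300), ("B", 0.2, 100), ("C", 0.7, 500), ("D", 0.6, 250)]
print(pack_context(chunks, token_budget=600))  # (['A', 'D'], 550)
```

Chunk C is relevant but does not fit after A, and chunk B fits but falls below the relevance threshold, so neither is sent to the model.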

Infrastructure considerations for deploying advanced AI models

Many enterprises find it daunting to securely connect hundreds or thousands of widely dispersed data stores for RAG without compromising performance or the security of data in transit. Consolidating on-premises, hybrid, and multicloud storage locations requires a high-performance global interconnect fabric such as that provided by F5 Distributed Cloud Services. By ensuring that only authorized LLM endpoints can access the data, using an integrated WAF and policy-based controls, enterprises can dramatically reduce the risks and overhead associated with managing multiple gateways or VPNs.

By providing a unified approach to networking and security, F5 Distributed Cloud Network Connect streamlines RAG implementations, allowing organizations to seamlessly connect distributed data sources for more accurate and timely LLM-driven outputs. Additionally, with F5 AI Gateway, organizations can protect against prompt injection attacks that could violate data security boundaries, adding a defense-in-depth layer at inference time.

Deploying models like Llama 4 Scout, with its expansive context window, necessitates robust and efficient infrastructure. High-performance proxies capable of managing substantial data throughput are essential to maintain low latency and ensure seamless operation. F5 BIG-IP Next for Kubernetes deployed on NVIDIA BlueField-3 DPUs offers a compelling solution in this context, providing high-performance traffic management and security tailored for cloud-scale AI infrastructure and AI factories.

By offloading data-intensive tasks to DPUs, CPU resources are freed up for core application processes, enhancing overall system efficiency. With multi-tenancy support, multiple AI workloads can operate securely and efficiently within the same infrastructure, which aligns well with AI clouds, hyperscalers, and service providers. Such capabilities are indispensable for AI factories aiming to leverage models with extensive context windows while maintaining optimal performance and security.

Another important consideration is that large and highly variable context windows can drive significant fluctuations in resource consumption. This places greater emphasis on intelligently balancing incoming requests to match available compute capacity. Advanced, adaptive load balancing solutions help distribute these large queries across multiple clusters or regions, mitigating bottlenecks and maintaining overall performance in complex AI deployments, even if they don’t directly reduce computing costs.
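As a rough illustration of matching requests to available compute capacity, the sketch below routes each request to the cluster with the most free token headroom. It is a simplified, hypothetical model and does not describe how any F5 product works. The class name, cluster names, and capacity figures are all assumptions made for the example.

```python
class TokenAwareBalancer:
    """Toy balancer that tracks tokens in flight per cluster (hypothetical model)."""

    def __init__(self, capacities):
        # capacities: cluster name -> maximum tokens it can process concurrently
        self.capacities = dict(capacities)
        self.in_flight = {name: 0 for name in self.capacities}

    def route(self, request_tokens):
        """Pick the cluster with the most free headroom that fits the request."""
        candidates = [
            (self.capacities[name] - self.in_flight[name], name)
            for name in self.capacities
            if self.capacities[name] - self.in_flight[name] >= request_tokens
        ]
        if not candidates:
            return None  # no cluster can take it now; caller should queue or reject
        _, chosen = max(candidates)
        self.in_flight[chosen] += request_tokens
        return chosen

    def complete(self, cluster, request_tokens):
        """Release capacity once a request finishes."""
        self.in_flight[cluster] = max(0, self.in_flight[cluster] - request_tokens)


balancer = TokenAwareBalancer({"us-east": 1_000_000, "eu-west": 400_000})
print(balancer.route(800_000))  # us-east
print(balancer.route(300_000))  # eu-west
print(balancer.route(250_000))  # None
```

Because a single long-context request can consume most of a cluster's capacity, routing on estimated tokens rather than request count keeps one very large prompt from starving many small ones.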

RAG is still here to stay

RAG is just as relevant today as it has ever been, for reasons that go beyond the scaling of context windows. One key benefit is its ability to customize data retrieval based on the access rights of the user. Another is its capability to incorporate timely information without requiring model retraining or fine-tuning. This becomes especially important when considering the vast size of corporate data that enterprises may seek to integrate with AI models, which often spans terabytes or even petabytes, far more than the roughly 32 MB of text that even a 10 million token context window can hold.

The impressive innovations in increasing context window size, such as Llama 4 Scout’s 10 million token context window, are a significant leap forward in LLMs, but context still needs to be used thoughtfully. Large context sizes increase cost and latency and can even, in some cases, reduce the quality of the final response. Equally important are the robust infrastructure and security controls required to ensure high performance as organizations scale their AI applications.

F5’s focus on AI doesn’t stop here—explore how F5 secures and delivers AI apps everywhere.
