What if the next AI infrastructure challenge is not simply obtaining more GPUs, but getting more value from every GPU already deployed?
That question is becoming more important as organizations move AI from experimentation into production.
Early AI projects often begin with access to compute. Teams need GPU capacity to test models, build applications, and prove value. But as AI services scale across more users, applications, tenants, and models, the infrastructure challenge changes.
Production AI is not just a compute problem. It is an efficiency problem, a governance problem, a routing problem, and a cost problem. And increasingly, it is a business problem.
That is the focus of the recent F5 BIG-IP Next for Kubernetes 2.3.0 release.
With this release, F5 introduced enhanced AI optimization for production-scale AI inference. BIG-IP Next for Kubernetes 2.3.0 brings key AI traffic optimization capabilities into native, production-ready infrastructure, helping organizations reduce redundant AI compute, improve GPU utilization, route requests intelligently across models and services, and govern token consumption across users and tenants.
The release also expands support for next-generation NVIDIA accelerated networking platforms, including ConnectX-8 SuperNIC support on x86 systems, along with support for ConnectX-7 and BlueField-3 running in SuperNIC mode.
AI traffic is becoming a control point
As AI moves into production, traffic management becomes much more than connectivity.
A single request may generate thousands of tokens. Similar prompts may appear repeatedly across users or applications. Some requests may need the most capable model available, while others may be better served by a lower-cost model or specialized AI service.
Every AI request becomes a decision. Where should it go? Which model should handle it? How much will it cost? How much infrastructure capacity will it consume? And who gets to decide?
These decisions need to happen consistently, intelligently, and close to the traffic itself. BIG-IP Next for Kubernetes 2.3.0 helps create that smarter control point, optimizing how AI requests are handled before they consume expensive compute resources. Because in production AI, the unit of competition is not just the model. It is the system around the model.
Preparing for more dynamic AI workflows
This becomes even more important as AI applications evolve from single prompt-and-response interactions toward more dynamic agentic workflows.
In these environments, AI traffic may move across agents, tools, inference services, and application components as part of a multi-step process. That creates new requirements for traffic control, identity context, policy enforcement, session continuity, and resiliency.
For organizations, the implication is clear: production AI infrastructure needs to do more than connect applications to models. It needs to provide a consistent control point for how AI interactions are routed, governed, authenticated, and delivered across increasingly distributed workflows.
BIG-IP Next for Kubernetes 2.3.0 helps establish that foundation by bringing enhanced AI optimization capabilities into native, production-ready infrastructure. By improving how AI requests are cached, routed, load balanced, governed, and accelerated, this release helps organizations prepare for a future where AI traffic becomes more distributed, more autonomous, and more business-critical.
Reducing redundant AI compute
One of the most direct ways to improve AI economics is to avoid doing the same expensive work over and over again.
As AI adoption grows, users and applications often generate similar or repeated prompts. Without optimization, each of those requests may consume additional inference capacity, even when a similar response has already been generated. That is waste. And at AI scale, waste gets expensive fast.
BIG-IP Next for Kubernetes 2.3.0 introduces semantic AI model caching as a generally available capability. It helps reduce redundant GPU compute for similar or repeated prompts, which improves token economics, lowers latency, and increases GPU efficiency.
At production scale, redundant AI compute can become a significant source of unnecessary cost. This release helps address that challenge by making AI inference more efficient at the traffic layer.
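To make the idea concrete, here is a minimal sketch of semantic caching in general. It does not show how BIG-IP Next for Kubernetes implements the feature. The names `embed`, `SemanticCache`, and `answer`, the bag-of-words "embedding," and the 0.8 similarity threshold are all assumptions chosen for illustration; a production system would typically use a real embedding model and a vector index.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A stand-in (assumption) for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold=0.8):  # threshold is an illustrative value
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, prompt):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))


def answer(prompt, cache, run_inference):
    """Serve from cache when possible; otherwise call the (expensive) model."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, True  # cache hit: no GPU work needed
    response = run_inference(prompt)
    cache.put(prompt, response)
    return response, False
```

In this sketch, a second prompt that differs only slightly from an earlier one is served from the cache, so `run_inference` is never called for it. The threshold is the key design choice: set it too low and users receive answers to questions they did not ask; set it too high and the cache rarely hits.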
Improving GPU utilization and model selection
Traditional load balancing was not designed for the complexity of modern AI inference.
AI workloads introduce new variables, including GPU utilization, queue depth, model availability, backend performance, and infrastructure capacity. Sending traffic to the next available endpoint is not enough. AI requests should be directed to the best available resource based on real-time conditions.
BIG-IP Next for Kubernetes 2.3.0 introduces intelligent AI load balancing as a generally available capability. It dynamically distributes AI inference traffic using real-time telemetry and infrastructure awareness to help optimize GPU utilization, reduce queue bottlenecks, and improve response times.
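As a rough illustration of telemetry-aware selection, the sketch below picks the least-loaded healthy backend using GPU utilization and queue depth. The `Backend` fields, the `load_score` formula, and the `queue_weight` value are hypothetical and are not taken from the product; real systems would likely weigh more signals and refresh them continuously.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    gpu_utilization: float  # fraction of GPU in use, 0.0 to 1.0
    queue_depth: int        # requests waiting on this backend
    healthy: bool = True


def load_score(backend, queue_weight=0.05):
    """Lower is better. The weighting is an assumption for illustration."""
    return backend.gpu_utilization + queue_weight * backend.queue_depth


def pick_backend(backends):
    """Choose the healthy backend with the lowest combined load score."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy inference backends available")
    return min(candidates, key=load_score)
```

Compared with round-robin, this approach can steer a request away from an endpoint that is technically available but already saturated or backed up.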
The release also introduces large language model (LLM) routing integration as a generally available capability, enabling intelligent routing across different LLMs and AI services based on policies such as cost, performance, model specialization, or operational requirements.
Most organizations will not run production AI on a single model. The right model depends on the use case, the performance requirement, the business policy, and the cost profile.
The most powerful model is not always the right model. Sometimes the smartest AI decision is choosing the model that is good enough, fast enough, and cost-efficient enough for the job. That requires infrastructure that can put policy behind the decision.
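One way to picture "policy behind the decision" is a router that selects the cheapest model satisfying a request's requirements. The sketch below is an assumption-laden example, not the product's routing logic: the `Model` and `Policy` types, their fields, and any prices or latencies you plug in are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # illustrative unit cost
    p95_latency_ms: int        # illustrative latency figure
    capabilities: frozenset    # e.g. frozenset({"chat", "code"})


@dataclass(frozen=True)
class Policy:
    required_capability: str
    max_latency_ms: int


def route(models, policy):
    """Return the cheapest model that meets the policy's capability and latency limits."""
    eligible = [
        m for m in models
        if policy.required_capability in m.capabilities
        and m.p95_latency_ms <= policy.max_latency_ms
    ]
    if not eligible:
        raise LookupError("no model satisfies the routing policy")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)
```

With this shape, a simple chat request lands on a small, inexpensive model, while a request that needs a capability only the largest model offers is routed there regardless of cost.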
Governing token consumption
Tokens are becoming one of the most important units of AI infrastructure economics.
They influence cost, capacity planning, tenant usage, and performance. They are also where AI enthusiasm starts to meet financial reality.
BIG-IP Next for Kubernetes 2.3.0 introduces token governance as a generally available capability, providing token monitoring, accounting, and rate limiting. This helps organizations better manage AI infrastructure costs and enforce usage policies across users, applications, and tenants.
For business leaders, this helps make AI consumption more transparent. For platform teams, it provides a practical way to govern shared AI infrastructure.
In production AI environments, token governance is how organizations bring discipline to one of the fastest-growing cost drivers in modern infrastructure.
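For readers who want a feel for the mechanics, the following is a minimal sketch of per-tenant token accounting with a fixed-window rate limit. It is a generic pattern and not F5's implementation. The `TokenGovernor` class, its parameters, and the window-based budgeting scheme are assumptions for illustration only.

```python
import time
from collections import defaultdict


class TokenGovernor:
    """Tracks token usage per tenant and enforces a budget per time window."""

    def __init__(self, tokens_per_window, window_seconds=60.0, clock=time.monotonic):
        self.limit = tokens_per_window
        self.window = window_seconds
        self.clock = clock                       # injectable for testing
        self.window_start = {}                   # tenant -> start time of current window
        self.used_in_window = defaultdict(int)   # tenant -> tokens used this window
        self.total_used = defaultdict(int)       # tenant -> lifetime tokens (accounting)

    def allow(self, tenant, tokens):
        """Return True and record usage if the request fits the tenant's budget."""
        now = self.clock()
        start = self.window_start.get(tenant)
        if start is None or now - start >= self.window:
            # Start a fresh window for this tenant.
            self.window_start[tenant] = now
            self.used_in_window[tenant] = 0
        if self.used_in_window[tenant] + tokens > self.limit:
            return False  # over budget: reject or defer the request
        self.used_in_window[tenant] += tokens
        self.total_used[tenant] += tokens
        return True
```

In this sketch, `total_used` serves the monitoring and accounting side, while `allow` enforces the rate limit. Each tenant is tracked independently, so one heavy consumer cannot exhaust another tenant's budget.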
Expanding infrastructure flexibility
AI infrastructure is evolving quickly. Some organizations are moving toward DPU-based architectures. Others are optimizing x86 systems with next-generation network acceleration. Many are modernizing in phases, balancing performance, cost, and operational readiness.
BIG-IP Next for Kubernetes 2.3.0 expands hardware support with ConnectX-8 SuperNIC support on x86 systems. The release also supports ConnectX-7 and BlueField-3 running in SuperNIC mode.
This gives organizations more flexibility as they optimize AI traffic processing performance and CPU efficiency. For teams not yet ready to adopt DPUs, SuperNIC support provides another path to improve infrastructure efficiency while maintaining alignment with next-generation NVIDIA networking platforms.
AI infrastructure modernization is not a single step. Organizations need options that support where they are today, while preserving a path toward more advanced accelerated architectures over time.
Why it matters
BIG-IP Next for Kubernetes 2.3.0 reflects a broader shift in how organizations need to operate AI infrastructure.
The first phase of AI was about access to compute. The next phase is about control.
As AI services move into production, teams need to engineer performance, cost efficiency, policy, and trust together. They need to optimize requests before they consume expensive infrastructure, govern usage across tenants and applications, route traffic intelligently across models and services, and support accelerated AI performance today and tomorrow.
That is the value of enhanced AI optimization in BIG-IP Next for Kubernetes 2.3.0.
This recent release helps organizations make AI infrastructure more efficient, more controlled, and more ready for production scale.
The future of AI will not be won by raw compute alone, but by organizations that can make every request smarter, every token more accountable, and every GPU work harder. F5 BIG-IP Next for Kubernetes 2.3.0 helps customers take that next step.
To learn more, visit the F5 BIG-IP Next for Kubernetes webpage.