The influence of inference: APIs, DPUs, and context chaos

Industry Trends | November 13, 2025

We keep pretending models are just endpoints. It’s as if the servers hosting the engines that encode input into a series of incomprehensible numbers don’t exist.

Ah, my friend, they exist, and they’re going to be as important as web and app servers have been to delivering applications. They will also, once again, force you to reimagine your infrastructure architecture.

You see, the moment you put a model behind an endpoint (you know, deploy an inference server), it steps onto the same layer 4 through layer 7 (L4–L7) highways that run your business. Our latest research into AI security shows 80% of organizations are already running inference in the cloud or on-prem, which means your model is now another production workload riding through load balancers, gateways, WAFs, and API brokers.

We also learned a number of possibly uncomfortable truths.

Inference multiplies APIs

First, operating inference multiplies APIs. Every new assistant, scoring service, and feature flag wrapped around a model becomes another ingress. That creates more places for prompt injection, leakage, and output manipulation to enter, and more east-west paths for context to slosh around. Deploying inference correlates with better runtime hygiene, but the gap is still a problem. Among teams that operate inference, 14% still let prompts pass uninspected. Among SaaS-only consumers, the number jumps to 40%. Skipping inspection is not “keeping it simple.” It is leaving the barn door open and complimenting the horse on its cardio as it bolts.
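To make "inspect prompts" concrete, here is a minimal sketch of ingress-side inspection in Python. The endpoint URL, size limit, and deny patterns are illustrative assumptions, not any specific product's policy set; the point is simply that the check happens in-path, before the model ever sees the request.

```python
import re
import requests  # assumes the requests library is available

# Illustrative deny patterns; a real deployment would use a managed policy set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

MODEL_ENDPOINT = "https://inference.internal.example/v1/generate"  # hypothetical endpoint
MAX_PROMPT_CHARS = 8_000  # illustrative size policy

def inspect_prompt(prompt: str) -> None:
    """Raise if the prompt violates basic ingress policy."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds size policy")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            raise ValueError(f"prompt matched deny pattern: {pattern.pattern}")

def call_model(prompt: str) -> str:
    inspect_prompt(prompt)  # inspection happens in-path, before forwarding
    response = requests.post(MODEL_ENDPOINT, json={"prompt": prompt}, timeout=30)
    response.raise_for_status()
    return response.json()["output"]
```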

This is why the control point matters. Inference rides the same pipes as your APIs, so policy has to live in-path. Deployers act like they know this. Sixty-one percent use an AI gateway for input and output inspection, 39% add firewalls, and 35% use custom proxies. SaaS-only consumers lean on vendor defaults, and it shows. They are three times more likely to skip testing or monitoring entirely, and 30% do not log prompts or completions for audits, compared with only 8% of inference deployers who skip logging. That is a maturity divide, but it is also an opportunity. If your apps consume third-party AI, an AI gateway becomes the insertion point for continuous testing, filtering, and policy. Do not rely on a polite prompt and a prayer.
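And because "log prompts and completions for audits" is the part teams skip, here is a minimal sketch of a gateway-style wrapper that records both sides of every call. The field names and the redaction rule are assumptions for illustration; a real gateway would apply far richer filtering and ship these records somewhere durable.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("ai_gateway.audit")

def redact_output(completion: str) -> str:
    # Illustrative output filter; real policies would be richer than string matching.
    return completion.replace("internal-use-only", "[REDACTED]")

def gateway_call(model_fn, prompt: str, app_id: str, user_id: str) -> str:
    """Wrap a model call so every prompt and completion is logged for audit."""
    request_id = str(uuid.uuid4())
    started = time.time()
    completion = model_fn(prompt)
    filtered = redact_output(completion)
    audit_log.info(json.dumps({
        "request_id": request_id,
        "app_id": app_id,
        "user_id": user_id,
        "prompt": prompt,
        "completion": filtered,
        "latency_ms": round((time.time() - started) * 1000, 1),
    }))
    return filtered
```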

Inference increases the importance of DPUs

Second, inference is pushing DPUs from nice-to-have to necessary. Among teams that run inference, 71% consider or adopt DPUs when buying hardware or choosing instances. In the non-deploying cohort, that number sits at 43%. When you push encrypted traffic, inline inspection, and policy evaluation through the same boxes that keep your apps alive, offload is not a performance party trick. It is how you keep latency tolerable while you decrypt, classify, inspect, and enforce at scale. DPUs accelerate the exact chores inference needs most: Transport Layer Security (TLS), telemetry, fine-grained policy, and per-request checks. Put simply, more inference means more work on the wire, and DPUs carry that weight.
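To see why offload matters, consider a rough per-request budget. The numbers below are assumptions for the sake of arithmetic, not benchmarks, but they show how quickly inline decryption, inspection, and telemetry eat into latency headroom when the host CPU does all of it.

```python
# Illustrative per-request overhead; every number here is an assumption, not a benchmark.
BUDGET_MS = 50.0  # overhead a team might tolerate on top of inference itself

ON_HOST_CPU_MS = {
    "tls_terminate": 4.0,
    "classify_traffic": 6.0,
    "inspect_payload": 12.0,
    "enforce_policy": 3.0,
    "telemetry": 2.0,
}

# Rough assumption: a DPU absorbs most of the transport and telemetry work.
WITH_DPU_OFFLOAD_MS = {**ON_HOST_CPU_MS, "tls_terminate": 0.5, "telemetry": 0.3}

def overhead(costs: dict[str, float]) -> float:
    return sum(costs.values())

print(f"host-only overhead: {overhead(ON_HOST_CPU_MS):.1f} ms of {BUDGET_MS} ms budget")
print(f"with DPU offload:   {overhead(WITH_DPU_OFFLOAD_MS):.1f} ms of {BUDGET_MS} ms budget")
```

Multiply that difference by every request, every retrieval, and every agent hop, and the case for moving wire-level work off the host makes itself.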

Inference implies attention to context

Lastly, context is the second rail most teams forget to respect. Inference is not only inputs and outputs. It is history, retrievals, tools, and agent-to-agent chatter. As organizations eye Model Context Protocol (MCP) adoption, we already see the same pattern. Deployers plan to standardize, with 58% signaling MCP adoption against 21% in the non-deploying group. The draw is obvious. MCP promises structure for context flows.

The risk is also obvious. Opaque context can become a side channel that bypasses your pretty policies. That calls for inspection that understands prompts and completions, but also the provenance and policy of the context itself. If you would not let an unaudited data feed into your trading engine, do not let unaudited context into your model.
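As a sketch of what "provenance and policy of the context itself" can look like, the following treats every piece of context as a labeled item that must clear an allowlist before assembly. The source names, fields, and classifications are illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative allowlist of approved context sources.
APPROVED_SOURCES = {"crm.accounts", "kb.product-docs", "session.history"}

@dataclass
class ContextItem:
    source: str                       # provenance label: where this context came from
    content: str                      # the text that will land in the model's context
    classification: str = "internal"  # data-handling label

def validate_context(items: list[ContextItem]) -> list[ContextItem]:
    """Admit only context whose provenance is known and approved."""
    admitted = []
    for item in items:
        if item.source not in APPROVED_SOURCES:
            raise ValueError(f"unapproved context source: {item.source}")
        if item.classification == "restricted":
            raise ValueError(f"restricted data cannot enter the context: {item.source}")
        admitted.append(item)
    return admitted

def assemble_context(items: list[ContextItem]) -> str:
    # Lineage travels with the content so downstream logs can reconstruct what the model saw.
    return "\n\n".join(f"[{item.source}] {item.content}" for item in validate_context(items))
```

The same lineage labels should flow into your audit logs, so you can reconstruct exactly what the model saw and where it came from.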

The impact on infrastructure architecture

So what does all this mean for your architecture? Because you know it means something.

Key takeaways
  1. Put policy in the path. Gateways and L7 proxies are already in the flow. Make them inference-aware. Inspect prompts, filter outputs, tag data, and enforce per-app and per-user limits that actually matter for models (a minimal rate-limit sketch follows this list).
  2. Use DPUs to hold the line on latency. If you expect to decrypt, classify, inspect, and enforce at runtime without offload, you will trade security for performance on a Friday night and pretend it was a thoughtful decision on Monday.
  3. Treat context like a supply chain. Standardize how it is assembled, label it, validate it before use, and record the lineage. MCP helps, but only if you pair it with observability and enforcement that can see and stop context abuse.
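On that first takeaway, per-app and per-user limits are easy to reason about in code. The following is a minimal sliding-window sketch; the limits and key scheme are assumptions, and in production the enforcement would live in the gateway or proxy itself, not in application code.

```python
import time
from collections import defaultdict, deque

# Illustrative limits; real values come from policy, not code.
LIMITS = {"per_user": 30, "per_app": 300}  # requests per minute
WINDOW_SECONDS = 60

_requests: dict[str, deque] = defaultdict(deque)

def _within_limit(key: str, limit: int, now: float) -> bool:
    window = _requests[key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop requests that fell out of the window
    return len(window) < limit

def admit(app_id: str, user_id: str) -> bool:
    """Per-app and per-user admission check for model calls."""
    now = time.time()
    app_key = f"app:{app_id}"
    user_key = f"user:{app_id}:{user_id}"
    if not (_within_limit(app_key, LIMITS["per_app"], now)
            and _within_limit(user_key, LIMITS["per_user"], now)):
        return False
    _requests[app_key].append(now)
    _requests[user_key].append(now)
    return True
```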

The most profound finding in our latest research is simply that inference is now a first-class workload. It sits on the same pipes that run everything else, but it brings new risks that your “API reflexes” do not fully cover.

Make the pipes smarter, move heavy lifting to DPUs, and govern the context, not just the call. The teams that do this will ship faster and sleep better. The rest will learn resilience the hard way.

About the Author

Lori Mac Vittie
Distinguished Engineer and Chief Evangelist
