MTTR is not "Mean Time to Reboot"

Lori MacVittie
Published March 14, 2019

Fail fast is the mantra of speed today. Whether DevOps or business, the premise of operating in a digital economy demands uptime as close to perfect as you can get it.

While the theory of this philosophy is good, in practice the result is often just more failure. By focusing only on assuring availability (uptime) instead of on finding and resolving the root cause (MTTR), we're losing valuable data at unprecedented rates. Face it - when uptime is all you care about, MTTR becomes Mean-Time-to-Reboot instead of Mean-Time-to-Resolution. And without a resolution - a reason for the downtime - you can't prevent it from happening again.

This approach is detrimental to the business.

You see, you aren't dropping packets, you're dropping parts of pennies. And as the classic criminal trope of siphoning off fractions of pennies from transactions to build up millions teaches us, every fraction of a penny counts. Every second in which a component, a service, or a server fails to respond, you're losing value - both experiential and existential. Consumers won't stand for poor performance or downtime, and business ledgers can't tolerate them, either.

And if you know anything about throughput and bandwidth, you know that the basis for both calculations lies in the packets per second that can be processed by the underlying system. That's not just true in the network, but for every component that interacts with a transaction. The app. The application services. Routers. Switches. Databases. If it has a network connection, it is bound to this same calculation and constrained by its capacity to pass packets.


The speed of today's networks ensures that we're doing just that at a rate of millions of packets per second. Business transactions, of course, are primarily conducted via a (literal) web of HTTP transactions, each one passing information crucial to conducting business. The number of packets required to conduct a transaction depends on the amount of data required. A standard Ethernet packet carries up to 1500 bytes of data (that's the MTU). So if an HTTP-based message carrying a JSON payload that represents a transaction requires 4500 bytes (after encryption, of course), that's about three packets. So let's be generous and say a typical digital business transaction requires five packets. A 10Gbps network can process just under 15M packets per second at the minimum frame size. Assuming enough compute capacity is available, you could then say that equates to 3M transactions per second. Let's assume every transaction is worth a fraction (one-third) of a penny. That's 1M pennies, or $10,000, per second.

Now, no one actually processes transactions at that speed or volume. Even Visa - which inarguably processes data at rates most enterprises don't require - claims its capacity is about 24,000 transactions per second. Assuming the same value of those transactions - one-third of a penny - that's still 8,000 pennies, or $80, per second.
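For anyone who wants to check the back-of-the-envelope math, here is a minimal sketch of it in Python. The function names are mine, purely for illustration, and the inputs are the same assumptions used above: a 1500-byte MTU, five packets per transaction, roughly 15M packets per second, and one-third of a penny per transaction. Working in pennies first and converting to dollars last keeps the units honest.

```python
import math

MTU_BYTES = 1500  # payload bytes per full-size packet, as assumed in the text


def packets_needed(payload_bytes: int, mtu: int = MTU_BYTES) -> int:
    """Number of packets required to carry a payload of the given size."""
    return math.ceil(payload_bytes / mtu)


def transactions_per_second(packets_per_second: int, packets_per_txn: int) -> int:
    """Upper bound on transactions per second for a given packet rate."""
    return packets_per_second // packets_per_txn


def dollars_per_second(txn_per_second: int, txns_per_penny: int = 3) -> float:
    """Value per second when each transaction is worth 1/txns_per_penny of a penny."""
    pennies = txn_per_second / txns_per_penny
    return pennies / 100  # 100 pennies to the dollar


print(packets_needed(4500))                               # 3
print(transactions_per_second(15_000_000, 5))             # 3000000
print(f"${dollars_per_second(3_000_000):,.0f} per second")  # $10,000 per second
print(f"${dollars_per_second(24_000):,.0f} per second")     # $80 per second
```

Three million transactions at one-third of a penny each is one million pennies, which is $10,000 per second; the Visa-scale rate of 24,000 transactions per second works out to $80 per second.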

The point is that failure in the transaction chain composed of routers, switches, network and application service infrastructure, app infrastructure, and components is A Very Bad Thing™. It's costly, because a failure means packets aren't being processed, and neither are the pennies they represent. And there is no part of the digital economy that does not rely on packets being passed.

The answer thus far is in the "fail fast" mantra - just spin up a new instance of X or Y or Z or whatever component failed. But that component failed *for a reason* and it is of the utmost importance that the reason is uncovered and addressed. Quickly. Because there are still expensive seconds between failure and restoration that cost business value. If it failed once, it's likely to fail again. And again.
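To make the "and again" part concrete, here is a rough sketch of how the losses add up when the same failure keeps recurring. Every number in it is hypothetical (the loss rate, the restore time, and the number of recurrences are placeholders, not measurements), and the function name is mine. The only point is that restore time gets paid once per incident, so an unresolved cause multiplies the bill.

```python
def cumulative_loss(loss_per_second: float, restore_seconds: float, incidents: int) -> float:
    """Total value lost across repeated incidents of the same failure."""
    return loss_per_second * restore_seconds * incidents


LOSS_PER_SECOND = 80.0  # hypothetical dollars lost per second of downtime
RESTORE_SECONDS = 30    # hypothetical time to spin up a replacement instance

# Reboot only: the cause is never found, so the failure recurs (10 times here, hypothetically).
reboot_only = cumulative_loss(LOSS_PER_SECOND, RESTORE_SECONDS, incidents=10)

# Resolve: the cause is found after the first incident, so it does not recur.
resolved = cumulative_loss(LOSS_PER_SECOND, RESTORE_SECONDS, incidents=1)

print(reboot_only)  # 24000.0
print(resolved)     # 2400.0
```

The sketch deliberately ignores the cost of the investigation itself; under these assumptions, resolution pays for itself once the failure would otherwise have recurred often enough to exceed that cost.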


This is why visibility is so critical to success in the digital economy. Because it is visibility that enables all the ops to find and remediate the cause of failure. Unfortunately, it is visibility that is often sacrificed for speed. Not literal speed of transactions, but time to value. In our rush to get apps to market faster and more frequently, we have not adequately invested in enabling the visibility necessary to mitigate failure.

In fact, one might argue that the "fail fast" philosophy of DevOps is a response to that failure. Without the ability to find and address the cause of failure, DevOps has determined it's better to restore availability than waste time. That ability is growing more and more elusive as organizations adopt multi-cloud approaches to deploying applications.

In 2018, visibility was cited as a multi-cloud challenge by fewer than one-third (31%) of respondents. In 2019, that jumped to more than one-third (39%), tying with performance and security as the top challenges for multi-cloud. Visibility is a critical component of the over-arching "observability" that brings together monitoring, analytics, and alerting to provide valuable insights into the state of a system at any time. That's particularly important during a failure, because the state of the system is spread across multiple IT fiefdoms that may or may not enable sharing of information that can quickly lead to a resolution instead of just a reboot.

The ability of a service mesh to add value through distributed tracing is an excellent example of enabling visibility. But we need to extend that to include the entire chain of application services that scale and secure the applications executing in a containerized world. And that includes distributed components and applications running in public cloud that may be part of the execution chain. Visibility across environments, infrastructure, and applications is required to find and address issues that cause downtime or poor performance.

Visibility is imperative to enable organizations to return to measuring success on MTTResolution rather than MTTReboot.