Top 3 Takeaways from the Site Reliability Engineering Panel with LinkedIn, Dell, and Gremlin at NGINX Conf 2019

F5 Ecosystem | November 12, 2019

At NGINX Conf 2019, we hosted more than 50 recorded sessions on a range of subjects. In this blog post, I’ll share takeaways from one of the hottest topics in the industry: Site Reliability Engineering (SRE), along with the related topic of Chaos Engineering. I’ll focus on three key takeaways, but you’re encouraged to watch the entire session here.

1. SRE Definition

The conversation started with how the panelists define the term Site Reliability Engineering. The consistent answer was that it is essentially “anything to make sure a site is up and running.” Beyond that, they also emphasized “going really deep and fixing the issue as quickly as possible when any problem occurs” and “empowering development teams with a customer-centric mindset.” Did you notice some similarities with traditional Network Operations teams in those descriptions? I did too, and one panelist read my mind by pointing out that “Some organizations establish an SRE team just by renaming their Network Ops team, but that is not the best way.” There was some discussion on this, but my takeaway is that the biggest difference between SRE and NetOps is that SRE personnel “sit on a Dev team or customer-facing team and truly focus on business goals.”

2. Chaos Engineering and Failure Injection

One of the key topics for an SRE function is Chaos Engineering. I will defer the detailed explanation of Chaos Engineering to this article, but in this session it was described as “an approach to identify critical failures and get them fixed quickly,” something similar to a fire drill. The goal of Chaos Engineering is broader than a fire drill, though, in that it focuses on quantitatively analyzing recovery, durability, and availability metrics.

Failure Injection is a fairly common method, popularized by Netflix with its Failure Injection Testing (FIT) platform back in 2014. It is a testing approach that pushes failure simulation metadata into the production environment in a controlled way, so that only a chosen scope of traffic is affected. These efforts are typically led by SRE teams in order to ensure higher availability and reliability of the service (or site).
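To make the idea of failing "with control" more concrete, here is a minimal Python sketch of one way a controlled failure rule could be applied. It is not how Netflix or any panelist's company implements it. The names `FailureRule`, `should_fail`, `call_dependency`, and the dependency name `"recommendations"` are hypothetical and exist only for illustration. The sketch assumes each request carries an ID, and it hashes that ID so the same request is always either affected or unaffected.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised instead of calling the real dependency when a failure is injected."""


@dataclass(frozen=True)
class FailureRule:
    """Hypothetical failure-simulation metadata for one dependency."""
    target: str           # name of the dependency to fail
    percentage: float     # share of requests to affect, from 0 to 100
    enabled: bool = True  # kill switch to stop the experiment immediately


def should_fail(rule: FailureRule, target: str, request_id: str) -> bool:
    """Decide deterministically whether this request falls inside the blast radius."""
    if not rule.enabled or rule.target != target:
        return False
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < rule.percentage * 100


def call_dependency(rule: FailureRule, target: str, request_id: str,
                    func: Callable[[], T]) -> T:
    """Call func() unless the rule selects this request for a simulated failure."""
    if should_fail(rule, target, request_id):
        raise InjectedFailure(f"injected failure for {target}")
    return func()
```

The two controls here are the `percentage`, which limits how much traffic is affected, and the `enabled` flag, which lets a team end the experiment without a deploy. A real setup would also need monitoring and automatic abort conditions, which this sketch leaves out.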

3. KPI and Skillset of SRE

There was some interesting discussion around how SRE should be measured. While several points were made about MTTD (Mean Time to Detect) and MTTR (Mean Time to Respond) being significant metrics, all panelists agreed that metrics will differ depending on the industry you’re in, as well as the systems or sites you operate. A good suggestion captured from the discussion was, “You can start by asking this question: ‘What are your top 5 most critical systems?’ and that will help you prioritize things.”
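As a rough illustration of the two metrics mentioned above, the following Python sketch computes MTTD and MTTR from a list of incident records. The `Incident` fields and the function names are hypothetical. It also assumes that detection time is measured from when the fault began and that response time is measured from detection, and teams define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with three timestamps."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    responded: datetime  # when someone started working on it


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Average a collection of durations."""
    items = list(deltas)
    if not items:
        raise ValueError("at least one incident is required")
    return sum(items, timedelta()) / len(items)


def mttd(incidents: Iterable[Incident]) -> timedelta:
    """Mean Time to Detect: average of (detected - started)."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents: Iterable[Incident]) -> timedelta:
    """Mean Time to Respond: average of (responded - detected), by assumption."""
    return mean_delta(i.responded - i.detected for i in incidents)
```

Computing these per system rather than across the whole company fits the panel's suggestion to start from your most critical systems, since an average over everything can hide a slow response on the one system that matters most.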

The preferred skillset for an SRE position was another topic covered. According to the panelists, this also depends on what system you run. (For example, if you are running NGINX, then NGINX experience would be crucial for an SRE hire.) A great suggestion from the group was to rotate SRE personnel across different areas of the company and its systems, both to scale and to better equip SRE resources. Another was to make sure your SRE teams participate in SRE community events and activities such as training, offsites, dedicated Slack channels, and ‘game days.’

Conclusion – Is 2020 the Time to Define your own SRE Strategy?

In a nutshell, the discussion revealed that many organizations are still learning how to define and leverage the concept and role of SRE. As the panelists reiterated, both will often vary depending on industries and systems (and even individual companies). Chaos Engineering will continue to be a focus next year, so maybe this is a perfect time to start thinking through what it means for you and your organization.
