Yes, we do SRE!

How do you think we keep our reference instance up and running?

We, the Open Build Service contributors, practice Site Reliability Engineering (SRE) to guarantee the service’s health and availability. In the last couple of weeks, we have paid even more attention to this vital part of our job. As result, we now have our SRE process better documented, we do new measurements and we are better informed about problems thanks to newly implemented alerts.

Operations & SRE

Deployment, continuous integration or monitoring are well-known terms we immediately connect with operations. They are part of OBS life cycle as we follow DevOps philosophy. But apart from that, we also do practice Site Reliability Engineering.

We can say SRE tries to manage IT operations efficiently. Where DevOps focuses on ensuring an effective deployment and avoiding system downtime, among others, SRE focuses on the post-failure situation analyzing the causes and aiming to maximize uptime and stability.

After deployment, SRE makes use of monitoring tools to find out where automation is crucial to improving a system’s health and availability.

SRE in OBS

Introducing SRE into the OBS operations makes it possible for us to know and manage the state of the deployments. As we aim to perform continuous deployment, we will have quicker feedback loops about the code we deploy and hence avoid downtime in the future.

It also puts the OBS on the path of observability where the team is able to ask arbitrary questions about our app/deployment without - and this is the key part - having to know ahead of time what we need to ask for. The deeper we drill into the OBS code base, the more complex the issues will become, the more we will encounter unknown unknowns. Problems we never have faced before that require the team to ask questions they have never asked before. And the infrastructure needs to support this.

With that goal in mind, during the last few weeks, we have deeply analyzed what we are measuring and we have extended our documentation regarding this topic. A proper documentation will help us to get more valuable information from the tracked data, as well as synchronize all the contributors knowledge about how we do SRE.

As result of that work, we have also set up some new alerts, so from now on we are better informed when something goes wrong in our reference instance.

We take the data we need to do SRE from five different points:

  • Logging
  • Exception Tracking
  • System/Services Monitoring
  • Application Performance Monitoring (APM)
  • Application Health Monitoring (AHM)

You can read more about all of them starting in our GitHub Wiki.

Screenshot of Grafana panel

Stay tuned, we will share with you any achievements on this area.

How To Give Us Feedback

There are two ways to reach us:

Please note that we favor GitHub to gather feedback as it allows us to easily keep track of the discussions.