Severe Service Degradation: OBS Unreliable/Unavailable

There was a service degradation of our reference server.

On Wednesday, October 15, starting at 05:00 UTC, OBS response times were slow for anyone trying to use the server, and in many cases connections were dropped completely. Additionally, at 14:10 UTC the service went offline entirely for 84 minutes.

build.opensuse.org was back to normal operation at 15:37 UTC, so users were impacted for roughly 10.5 hours.

We want to give you some insight into what happened and what we are doing to avoid similar problems in the future.

Detection

Our monitoring system automatically alerted us about increased traffic, high failure rates and unavailable services.

Root Cause

Yet another malicious (AI?) bot crawling OBS resources, and us failing to handle this traffic well. The outage itself was caused by a failure in our firewall configuration: apparently firewalld/nftables does not accept overlapping IP ranges.

Resolution

  • We blocked another 85 IP ranges that we saw illegitimate traffic from.
  • We fixed our firewall config and wrote a test to make sure we don’t deploy overlapping IP ranges (a sketch of such a check is shown below).
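
The overlap check itself does not need to be firewall-specific. Here is a minimal sketch of such a test in Python, assuming the blocked ranges are available as a plain list of CIDR strings; the list contents and names are made up for illustration, not taken from our configuration:

```python
import ipaddress
from itertools import combinations

# Hypothetical list of blocked ranges as they would be fed to the firewall;
# the last entry overlaps with the /25 above it and should fail the check.
BLOCKED_RANGES = [
    "192.0.2.0/24",
    "198.51.100.0/25",
    "198.51.100.0/24",
]

def find_overlaps(cidrs):
    """Return all pairs of CIDR ranges that overlap each other."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.version == b.version and a.overlaps(b)
    ]

def test_no_overlapping_ranges():
    overlaps = find_overlaps(BLOCKED_RANGES)
    assert not overlaps, f"overlapping IP ranges found: {overlaps}"
```

Running a check like this in CI before deploying the rule set catches exactly the kind of mistake that caused the downtime.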

Action Items

Scaling OBS has been an ongoing effort for us for a while. We are working on figuring out which of the classic scaling approaches are useful for us:

  • scale up (more hardware)
  • scale out (more nodes + load balance)
  • caching
  • optimizing performance

But we are also mending the symptoms by introducing traffic prioritization, “blocking” illegitimate traffic via rate limiting and proof-of-work/JavaScript gating (berghain), and in general getting faster at identifying and blocking sources.
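
To illustrate the rate-limiting idea: a per-network token bucket is one common approach. This is only a sketch, not our actual setup; keying buckets on the client’s /24 prefix and the chosen rate and capacity are assumptions for illustration:

```python
import ipaddress
import time
from collections import defaultdict

class TokenBucket:
    """Simple token bucket: refills `rate` tokens per second, holds up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Key buckets on the client's /24 so a crawler spread over many IPs in one
# subnet still shares a single budget (an assumption, not OBS's actual policy).
buckets = defaultdict(lambda: TokenBucket(rate=5, capacity=20))

def allow_request(client_ip: str) -> bool:
    prefix = ipaddress.ip_network(f"{client_ip}/24", strict=False)
    return buckets[str(prefix)].allow()
```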

Lessons Learned

  • We need to put more effort into scaling and handling illegitimate traffic.

Timeline (All Time in UTC)

  • 2025-10-15 05:00 The system started to experience increased traffic on certain OBS routes (requests)
  • 2025-10-15 09:37 First alerts about earlyoom killing processes arrive
  • 2025-10-15 09:54 The team starts to investigate the source of the traffic
  • 2025-10-15 10:05 The team identifies a specific cloud provider via IP ranges and starts gathering data
  • 2025-10-15 10:41 We issue a first warning to users about potential problems ahead
  • 2025-10-15 10:43 As this involves over 1.5K individual IPs, we start looking into ways to block all IP ranges belonging to the cloud provider
  • 2025-10-15 13:00 First alerts about general memory consumption arrive, some OBS threads and support processes get killed by earlyoom
  • 2025-10-15 14:01 We finally identify all subnets from which we’ve seen traffic, with the help of https://www.openrdap.org/ (a lookup sketch follows this timeline)
  • 2025-10-15 14:10 We block all subnets we have seen illegitimate traffic from
  • 2025-10-15 14:13 Applying changes to (one of our) firewall(s) goes wrong and leaves the OBS network unreachable
  • 2025-10-15 15:02 We discover that firewalld/nftables does not accept overlapping IP ranges
  • 2025-10-15 15:37 We fix the firewall configuration; systems are back online
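
For reference, mapping a single IP back to its registered network block can be done with a plain RDAP query. The sketch below uses the public redirector at https://rdap.org/ rather than the tool linked above (an assumption for illustration) and relies only on the standard startAddress/endAddress fields of an RDAP IP network response:

```python
import sys
import requests

def lookup_network(ip: str) -> dict:
    """Query RDAP for the registered network block that contains `ip`.

    https://rdap.org/ redirects to the RDAP server of the responsible RIR.
    """
    resp = requests.get(f"https://rdap.org/ip/{ip}", timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # handle, name, startAddress and endAddress are standard fields of the
    # RDAP IP network object (RFC 9083).
    return {
        "handle": data.get("handle"),
        "name": data.get("name"),
        "range": f"{data.get('startAddress')} - {data.get('endAddress')}",
    }

if __name__ == "__main__":
    for ip in sys.argv[1:]:
        print(ip, lookup_network(ip))
```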