
Unplanned outage on June 03

On June 03, 2020, at around 16:15 local time (14:15 UTC), disk accesses on ftp.fau.de became extremely slow. Less than an hour later, all attempts to access the disks holding the FTP data failed. Investigation revealed that the big RAID controller that manages all the external disk enclosures for the data had stopped responding completely.
While the failed controller could temporarily be brought back by a power cycle at around 18:00 local time (16:00 UTC), it failed again within 10 minutes of booting the machine.
Unfortunately, there was no compatible replacement for the failed controller onsite. A replacement has been ordered and shipped, but has not arrived yet. As all parcel services are severely overloaded due to the Corona crisis, it is currently unclear when it will arrive.

We were able to bring ftp.fau.de partially back online after noon on June 04: the broken controller does not seem to crash as long as it is not put under too much load. We have therefore had to disable automatic updates of all mirrors for now. Most of them will remain at the state they had at around 2020-06-03 14:30 UTC. We have, however, updated a few select mirrors manually.

The controller is also significantly slower than normal, even though its workload is much lower than usual. This is mostly because we have disabled all write caching on it, which slows the throughput it can achieve to a crawl. Many accesses can be handled by the big SSD that serves as a cache (and is working perfectly fine), but whenever data is not in the cache and has to be fetched from the spinning hard disks, the access will be significantly slower than usual.
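
To illustrate why the cache misses matter so much, here is a rough sketch of the usual weighted-average estimate for access time behind a cache. The function name, the hit ratio and all latencies are made-up assumptions for illustration, not measurements from ftp.fau.de.

```python
def expected_access_ms(hit_ratio: float, ssd_ms: float, hdd_ms: float) -> float:
    """Average access time when `hit_ratio` of all requests is served from the SSD cache.

    Hits cost `ssd_ms`, misses cost `hdd_ms` (both in milliseconds).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * ssd_ms + (1.0 - hit_ratio) * hdd_ms


# Hypothetical numbers: 90% cache hits, 0.2 ms per SSD access,
# 10 ms per disk access normally and 100 ms in the degraded state.
normal = expected_access_ms(0.9, 0.2, 10.0)
degraded = expected_access_ms(0.9, 0.2, 100.0)

print(f"normal:   {normal:.2f} ms on average")
print(f"degraded: {degraded:.2f} ms on average")
```

Even with a high hit ratio, the average in this sketch is dominated by the misses, so a slowdown of the disk path shows up almost one-to-one in the average access time.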

We are sorry for the inconvenience and are trying our best to return to full service as soon as possible.

We will update this article as needed.

Update 1 @2020-06-06 08:30: Parcel tracking now says that our delivery has arrived in our city and will be delivered to us on Monday, so we expect to be back in business by Monday evening.

Update 2 @2020-06-07 08:00: While we are not back to our usual sync schedules yet, all mirrors should be updated at least once a day.

Update 3 @2020-06-08 14:00: The replacement controller has arrived.

Update 4 @2020-06-08 22:00: The replacement controller is working fine. All mirrors are current again, and normal update intervals have been resumed.
While we were working on the machine anyway, we also upgraded the main memory from 64 to 128 GB.

4 Responses to Unplanned outage on June 03

  1. mindbyte says:

    Thanks for being so open about what happened, what you did and keeping us updated on the issue!
    Sadly though, I was pretty much cuffed as all my servers rely on http://ftp.fau.de as a mirror. I’d be glad to see a backup server on your side, for redundancy. But, for a free service, that would be too much to ask for.
    Last but not least, I want to express that I’m very grateful for your super fast, https backed, Germany based and free mirror! Thank you!

  2. Michael Meier says:

    Sadly, a second server for redundancy is not going to happen, as the FTP-server is not considered a mission-critical service and the cost for doing that would be substantial (far north of 10000 EUR).
    However, there already is a lot of redundancy within the one server – e.g. RAID, redundant power supplies, all the external disk shelves have two controllers, etc.

  3. Hans Schulze says:

    Thanks for mirroring, good job.
    The symlinks at ftp: /opensuse/tumbleweed/repo/ are broken…
    Regards, Hans

  4. Michael Meier says:

    (edited to correct the module mirrored)
    The symlinks are broken because they point to directories that we do not mirror. openSUSE offers different modules that mirrors can choose to sync, ranging in size from 160 GB (with only the most frequently requested files) to 22 TB (everything). The ‘everything’ option also generates loads of traffic for syncing, because it contains experimental package builds that are by their nature very volatile – if we synced that, we would waste far more traffic on syncing than would ever be requested from the mirror. The list of the available modules is at https://en.opensuse.org/openSUSE:Mirror_infrastructure#rsync_modules . We have always synced opensuse-full, and around 2017 we switched to opensuse-full-with-factory to include Tumbleweed, because that seems to be the best compromise between “amount of space and traffic used for syncing” and “usefulness”. That module does not contain the directory /source to which the broken symlinks in /tumbleweed/repo point; it is probably only available in the ‘everything’ or separate ‘opensuse-source’ repo.
    Very few mirrors seem to mirror that /source directory – currently 2 of the 9 in Germany. I also cannot see a huge demand for it, which is why we do not intend to start mirroring it.
