
Unplanned outage on June 03

On June 03, 2020, at around 16:15 local time (14:15 UTC), disk accesses on ftp.fau.de became extremely slow. Less than an hour later, all attempts to access the disks holding the FTP data failed. Investigation revealed that the big RAID controller that manages all the external disk enclosures for the data had stopped responding completely.
While the failed controller could temporarily be brought back by a power cycle at around 18:00 local time (16:00 UTC), it failed again within 10 minutes of booting the machine.
Unfortunately, there was no compatible replacement for the failed controller onsite. A replacement has been ordered and shipped, but has not arrived yet. As all parcel services are severely overloaded due to the Corona crisis, it is currently unclear when it will arrive.

We were able to bring ftp.fau.de partially back after noon on June 04: the broken controller does not seem to crash as long as it is not put under too much load. We have therefore disabled automatic updates of all mirrors for now. Most of them will remain at the version they had at around 2020-06-03 14:30 UTC. We have, however, updated a few select mirrors manually.

The controller is also significantly slower than normal, even though its workload is much lower than usual. This is mostly because we have disabled all write caching on it, which as a side effect slows the throughput it can achieve to a crawl. Many accesses can be handled by the big SSD that serves as a cache (and is working perfectly fine), but whenever data is not in the cache and has to be fetched from the spinning hard disks, accesses will be significantly slower than usual.

We are sorry for the inconvenience and are trying our best to return to full service as soon as possible.

We will update this article as needed.

Update 1 @2020-06-06 08:30: Parcel tracking now says that our parcel has arrived in our city and will be delivered to us on Monday, so we expect to be back in business by Monday evening.

Update 2 @2020-06-07 08:00: While we are not back to our usual sync schedules yet, all mirrors should be updated at least once a day.

Update 3 @2020-06-08 14:00: The replacement controller has arrived.

Update 4 @2020-06-08 22:00: The replacement controller is working fine. All mirrors are current again, and normal update intervals have been resumed.
While we were working on the machine anyway, we also upgraded the main memory from 64 GB to 128 GB.