A memory leak in the firmware of the central switch/router serving the HPC systems had been causing significant packet loss since around midnight on Friday, June 8th. The switch/router therefore had to be rebooted without prior announcement on the morning of June 8th at around 8:30. The resulting five-minute network outage caused a severe hiccup on the servers providing /home/hpc and /home/vault, leaving these filesystems unavailable until around 9:20. HSM on /home/vault was not back until around 10:30.
Almost all jobs running on any of RRZE’s HPC clusters between 0:00 and 9:30 on Friday, June 8th, will have experienced problems. Check your jobs and resubmit them if necessary. Regular batch processing is gradually resuming.