Downtime on January 12

We had a little downtime today between around 12:10 and 13:20. About 5 minutes of this downtime were planned – we simply wanted to reboot the machine. Unfortunately, the machine did not come back up after the reboot, and it took a while to figure out what the problem was.

As it turns out, the machine was unable to mount the huge volume with all our mirrors on it during boot. Manually mounting the disk failed as well because the device just wasn’t there. In lvdisplay the volume was listed as unavailable, and trying to set it available with lvchange failed with the message
/usr/sbin/cache_check: execvp failed: No such file or directory
Check of pool bigdata/ftpcachedata failed (status:2). Manual repair required!

As written in a previous post, we nowadays have a cache SSD in As it turns out, you can create a cached volume and use it without any issues at all, until you try to reboot – because then LVM suddenly decides it needs a cache_check binary that naturally isn’t shipped in the LVM package. Of course, this is not a new problem: There is a bug report about that in Debian since 2014. Of course, slightly more than 3 years later, the problem still isn’t fixed (e.g. by checking if the cache_check-binary is available on creation of a cached volume). The problem is that the missing binary is in the thin-provisioning-tools-package, which to maximise confusion belongs to LVM, but doesn’t have LVM anywhere in its name. I also wouldn’t exactly associate caching with thinly provisioning volumes, but maybe that’s just me. The LVM2 package does not depend on thin-provisioning-tools, it only “suggests” it, so it doesn’t get installed automatically in any sane APT config for servers.

So once the problem was clear, it was at least easy to fix: We installed the missing package, rebooted, and was back in action.

Changing filesystems: From XFS to EXT4

Since we moved the ftp to the (then) new hardware in October 2013, we had been using XFS as the filesystem on which we store all our mirror trees. All data resided in one single large XFS filesystem of 35 TB. That worked rather well until we updated the operating system on the machine in January.

After the update, the machine behaved extremely unstable – and in crass contrast to its 460 days uptime before the update. Sometimes all file I/O would stop for a few minutes (up to 30), and then suddenly continue as if nothing had happened. Those hangs happened in average twice a day. Sometimes the machine would also lock up completely and need to be reset.
During the hangs, the machine would log many of the following error-messages to the kernel log:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)

The problems soon were traced to the new kernel 3.13 that came with the update. Simply booting the old 3.2 Kernel from before the update again would get rid of the weird hangs and lockups. We first tried to update the kernel to 3.16, but to no avail – it showed exactly the same problem.

Apparently, there was a bug in newer XFS versions. So I took the problem to the XFS mailing list. The responses were surprisingly quick and competent: Apparently there are situations where XFS will need large amounts of unfragmented kernel memory, and when such memory is not available, it will essentially block until there is. Which might be “when hell freezes over”. Until then there is no I/O to that filesystem anymore. The suggested workaround of increasing vm.min_free_kbytes and similar settings so the kernel would be more likely to have enough memory immediately available for XFS did not work out, hangs were still happening at an unacceptable rate – a FTP server that is unavailble for 2 times 10 minutes a day isn’t exactly my idea of a reliable service for the public. The XFS developers seemed to have some ideas on how to “do better”, but it would not be simple. I did not want to wait for them to implement something – and even after they did I would still have to run a handpatched kernel all the time, which I was trying to avoid. So a switch of filesystems was in order. That was probably a good idea, because according to a recent post on LKML the situation is still unchanged and no fix has been implemented yet.

We decided to switch to EXT4 in 64bit mode. Classic EXT4 would not be able to handle a filesystem of 35 TB, but in 64bit mode it can. That mode was implemented some time ago, but until recently the e2fsprogs-versions shipped with distributions were unable to create or handle filesystems with the 64bit option.

To switch filesystems with as little downtime as possible, we had to pull a few tricks.
Unfortunately, it is not possible to shrink XFS-filesystems, so we could not simply shrink the XFS filesystem to make room for an EXT4 filesystem. So first, we temporarily attached another storage-box with enough space for the new filesystem. We created a LVM-volume on it and an EXT4-filesystem in it. We then started to rsync all files over. That took about 3 days, but of course happened in the background without downtime. Next step was to temporarily stop all cronjobs that would update the mirrors, do another final rsync run, and then unmount the old filesystem and mount the new one. That naturally caused a few minutes of downtime, but still went without too many users noticing. We then killed off the old filesystem, and used LVM’s pvmove to move the data from the temporary storage-box back to the space that was previously occupied by the old filesystem. This again happened in the background and completed in about a day. We could then remove the temporary storage-box. So far, we had done the whole move with less than an hour of downtime. The only thing left to do was that we still had to resize the EXT4 to fill all the available space – the one created on the temporary storage-box had been smaller because it had a smaller capacity than our regular RAIDs.

This was where we hit the next snatch: Running resize2fs to do an online-resize of the filesystem would just send resize2fs into an endless loop. It turned out this is another known bug for large 64bit EXT4 filesystems: Apparently nobody had ever tested resizing 64bit EXT4 filesystems that actually used block numbers larger than 2^32, which is why both online- and offline-resize-functions would try to stick a 64 bit block number into 32 bit and then naturally explode. Lucky for us, it just went into an endless-loop instead of corrupting the filesystem by trimming some 64 bit block numbers to 32 bit…

As fixing the online-resize would again mean to compile and run a handpatched kernel, the only option really was to do an offline-resize with a recent (>= 1.42.12) e2fstools-version. Unfortunately, that meant another downtime of a little over an hour for resize and fsck. But in the end, it was successful.

After 8 days and numerous obstacles, we have successfully moved from XFS to EXT4 without data loss.