We had a little downtime today between around 12:10 and 13:20. About 5 minutes of this downtime were planned – we simply wanted to reboot the machine. Unfortunately, the machine did not come back up after the reboot, and it took a while to figure out what the problem was.
As it turns out, the machine was unable to mount the huge volume with all our mirrors on it during boot. Manually mounting the disk failed as well because the device just wasn’t there. In lvdisplay the volume was listed as unavailable, and trying to set it available with lvchange failed with the message
/usr/sbin/cache_check: execvp failed: No such file or directory
Check of pool bigdata/ftpcachedata failed (status:2). Manual repair required!
As written in a previous post, we nowadays have a cache SSD in ftp.fau.de. As it turns out, you can create a cached volume and use it without any issues at all, until you try to reboot – because then LVM suddenly decides it needs a cache_check binary that naturally isn’t shipped in the LVM package. Of course, this is not a new problem: There is a bug report about that in Debian since 2014. Of course, slightly more than 3 years later, the problem still isn’t fixed (e.g. by checking if the cache_check-binary is available on creation of a cached volume). The problem is that the missing binary is in the thin-provisioning-tools-package, which to maximise confusion belongs to LVM, but doesn’t have LVM anywhere in its name. I also wouldn’t exactly associate caching with thinly provisioning volumes, but maybe that’s just me. The LVM2 package does not depend on thin-provisioning-tools, it only “suggests” it, so it doesn’t get installed automatically in any sane APT config for servers.
So once the problem was clear, it was at least easy to fix: We installed the missing package, rebooted, and ftp.fau.de was back in action.