Sunday, June 16, 2019

KExec on modern distributions

Many years ago, I mentioned that KExec is a good way to reduce server downtime due to reboots. This is still the case, and it is especially true for hosters such as OVH and Hetzner, which set their dedicated servers to boot from the network (so that one can activate their rescue system from the web) before passing control to the boot loader installed on the hard disk. It takes 1.5 minutes to reboot an OVH server, and on some server models you can reduce this to 15 seconds by skipping all the time spent in the BIOS and in their netbooted micro-OS! To be fair, on OVH you can also avoid the netbooted environment by changing the boot priorities in the BIOS.

The problem is that distributions make KExec too complicated to use. For example, when one installs kexec-tools on Debian or Ubuntu, a prompt asks whether KExec should handle reboots. The catch is that this works only with sysvinit, not with systemd. With systemd, you are supposed to remember to type "systemctl kexec" whenever you want a reboot to be handled by KExec. And it's not only the distributions: since systemd version 236, KExec is supported only together with UEFI and the "sd-boot" boot loader, while the majority of hosters still stick with the legacy boot process and the majority of Linux distributions still use GRUB2 as their boot loader. An attempt to run "systemctl kexec" on an unsupported setup results in this error message:

Cannot find the ESP partition mount point.

Or, if /boot is on mdadm-based RAID1, another, equally stupid and unhelpful, error:

Failed to probe partition scheme of "/dev/block/9:2": Input/output error

Switching to UEFI and sd-boot is viable in some cases, but not always. Fortunately, there is a way to override the systemd developers' stance on what is supported, and even make the "reboot" command invoke KExec. Note that this setup is a big bad unsupported hack: there are no guarantees that it will work with future versions of systemd.
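To see which situation a given machine is in, a quick check along these lines may help. It is only a sketch: the boot_mode helper is a hypothetical name introduced here, and it relies on the kernel exposing /sys/firmware/efi only when the system was booted through UEFI.

```shell
#!/bin/sh
# Report whether the running system was booted via UEFI or legacy BIOS.
# The directory is a parameter only to make the function testable;
# on a real system the default /sys/firmware/efi is the one that matters.
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "legacy BIOS"
    fi
}

echo "Boot mode: $(boot_mode)"

# The first line of "systemctl --version" contains the systemd version.
if command -v systemctl >/dev/null 2>&1; then
    systemctl --version | head -n 1
fi
```

A "legacy BIOS" result together with systemd 236 or newer is the combination for which the workaround in this post is meant.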

The trick is to create a service that loads the new kernel, and to override the commands that systemd executes when doing the actual reboot.

Here is the unit that loads the new kernel.

# File: /etc/systemd/system/kexec-load.service
[Unit]
Description=Loading new kernel into memory
Documentation=man:kexec(8)
DefaultDependencies=no
Before=reboot.target
RequiresMountsFor=/boot

[Service]
Type=oneshot
RemainAfterExit=yes
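# There is no ExecStart: the unit simply stays active until shutdown,
# and the kernel is loaded when the unit is stopped.
ExecStop=/sbin/kexec -d -l /vmlinuz --initrd=/initrd.img --reuse-cmdline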

[Install]
WantedBy=default.target

It assumes that symlinks to the installed kernel and the initrd are available in the root directory, which is true for Debian and Ubuntu. On other systems please adjust the paths as appropriate. E.g., on Arch Linux, the correct paths are /boot/vmlinuz-linux and /boot/initramfs-linux.img.
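Before enabling the unit, it may be worth confirming that the paths it refers to actually exist. The sketch below does that; check_kexec_files is a hypothetical helper name, and the two paths passed to it are the Debian/Ubuntu ones assumed above.

```shell
#!/bin/sh
# Check that each given file exists and show what it resolves to.
# Returns non-zero if at least one file is missing.
check_kexec_files() {
    status=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "ok: $f -> $(readlink -f "$f")"
        else
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}

# Paths as used in kexec-load.service; change them for other distributions.
check_kexec_files /vmlinuz /initrd.img ||
    echo "Adjust the paths in kexec-load.service before enabling it."
```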

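Other variants of this unit are circulating. A common mistake in them is that nothing ensures that the new kernel is loaded from /boot before the /boot partition is unmounted. The unit above does not have this race condition: RequiresMountsFor=/boot orders it after the /boot mount at startup, and since systemd stops units in the reverse order at shutdown, the ExecStop command runs while /boot is still mounted.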
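Independently of the unit, one can check by hand that the kernel can be staged at all. The following is a sketch under the same Debian/Ubuntu path assumptions; kexec_is_loaded and dry_run are hypothetical helper names. It relies on /sys/kernel/kexec_loaded containing 1 when a kernel is staged, and on "kexec -u" unloading it again, so nothing is rebooted.

```shell
#!/bin/sh
# Succeed if a kernel is currently staged for kexec.
# The state file is a parameter only to make the function testable.
kexec_is_loaded() {
    state_file="${1:-/sys/kernel/kexec_loaded}"
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}

# Manual dry run, to be called as root: stage the kernel, report, unload.
dry_run() {
    /sbin/kexec -l /vmlinuz --initrd=/initrd.img --reuse-cmdline || return 1
    if kexec_is_loaded; then
        echo "kernel staged successfully"
    else
        echo "kexec reported success, but no kernel is staged"
    fi
    /sbin/kexec -u
}

# Only report the current state here; dry_run is not called automatically.
if kexec_is_loaded; then
    echo "A kernel is currently staged."
else
    echo "No kernel is currently staged."
fi
```

With the unit from this post, "No kernel is currently staged" is the expected answer during normal operation, because the kernel is loaded only at shutdown.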

The second part of the puzzle is an override file that replaces the commands that reboot the system.

# File: /etc/systemd/system/systemd-reboot.service.d/override.conf
[Service]
Type=oneshot
ExecStart=
ExecStart=-/bin/systemctl --force kexec
ExecStart=/bin/systemctl --force reboot

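That's it: the empty ExecStart= line clears the command that the stock unit would run, and the two lines after it are executed in order. Try to kexec, and hopefully it does not return. If it does, the "-" prefix makes systemd ignore the error, and the regular reboot is tried.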
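To double-check that the drop-in contains what is intended, its ExecStart lines can be listed in order. This is a small sketch; list_execstart is a hypothetical helper, and the path is the one used for the override file above.

```shell
#!/bin/sh
# Print the ExecStart= lines of a unit or drop-in file, in file order.
list_execstart() {
    grep '^ExecStart=' "$1"
}

override=/etc/systemd/system/systemd-reboot.service.d/override.conf
if [ -r "$override" ]; then
    # Expected: an empty ExecStart=, then the kexec line, then the reboot line.
    list_execstart "$override"
else
    echo "Override file not found: $override"
fi
```

After "systemctl daemon-reload", running "systemctl cat systemd-reboot.service" shows the stock unit together with the drop-in, which is another way to confirm that systemd has picked the override up.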

For safety, let's also create a script that temporarily disables the override and thus performs one normal BIOS-based reboot.

#!/bin/sh
# File: /usr/local/bin/normal-reboot
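# Mask the override and the loader unit for this boot only: /run is a tmpfs,
# so both masks disappear once the machine has rebooted.
mkdir -p /run/systemd/transient/systemd-reboot.service.d/
ln -sf /dev/null /run/systemd/transient/systemd-reboot.service.d/override.conf
ln -sf /dev/null /run/systemd/transient/kexec-load.service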
systemctl daemon-reload
reboot

Give the script proper permissions and enable the service:

chmod 0755 /usr/local/bin/normal-reboot
systemctl enable kexec-load.service
reboot

If everything goes well, this will be the last BIOS-based reboot. Further reboots will be handled by KExec, even if you type "reboot".

This blog post would be incomplete without instructions on what to do if the setup fails. And it can fail for various reasons, e.g. due to incompatible hardware or some driver assuming that its device has been properly reset by the BIOS.

Well, the most common problem is a corrupted graphical framebuffer console. In this case, it may be sufficient to add "nomodeset" to the kernel command line.
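Since the unit above reuses the running kernel's command line, the permanent fix is to add the option to the boot loader configuration. For a one-off experiment, the command line can also be built by hand and passed to kexec explicitly. The sketch below only prints the resulting command; add_kernel_arg is a hypothetical helper, and the kernel and initrd paths are again the Debian/Ubuntu ones.

```shell
#!/bin/sh
# Append an argument to a kernel command line unless it is already present.
add_kernel_arg() {
    cmdline="$1"
    arg="$2"
    case " $cmdline " in
        *" $arg "*) echo "$cmdline" ;;
        *) echo "$cmdline $arg" ;;
    esac
}

if [ -r /proc/cmdline ]; then
    new_cmdline="$(add_kernel_arg "$(cat /proc/cmdline)" nomodeset)"
    # Printed rather than executed, so the result can be reviewed first.
    echo "/sbin/kexec -l /vmlinuz --initrd=/initrd.img --command-line=\"$new_cmdline\""
fi
```

If the printed command looks right, it can be run as root in place of the "--reuse-cmdline" variant to test whether "nomodeset" cures the console.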

Other systems may not be fixable so easily, or at all. E.g., on some OVH dedicated servers (in particular, on their "EG-32" product which is based on the Intel Corporation S1200SPL board), the kexec-ed kernel cannot properly route IRQs, and therefore does not detect SATA disks, and the on-board Ethernet adapter also becomes non-functional. In such cases, it is necessary to hard-reset the server and undo the setup. Here is how:

systemctl disable kexec-load.service
rm -rf /etc/systemd/system/systemd-reboot.service.d
rm -f /etc/systemd/system/kexec-load.service
rm -f /usr/local/bin/normal-reboot
systemctl daemon-reload
reboot

This reboot, and all further reboots, will go through the BIOS.