Wednesday, July 10, 2019

How to work around slow IPMI virtual media

Sometimes, you need to perform a custom installation of an operating system (likely Linux) on a dedicated server. This happens, e.g., if the hoster does not support your preferred Linux distribution, or if you want to make some non-default decisions during the installation that are not easy to change later, like custom partitioning or other storage-related settings. For this use case, some hosters offer IPMI access. Using a Java applet, you can remotely look at the display of the server and send keystrokes and mouse input - even to the BIOS. Also, you can connect a local ISO image as a virtual USB CD-ROM and install an operating system from it.

The problem is that such a virtual CD-ROM is unbearably slow if your server is not in the same city. The CD-ROM access pattern when loading the installer is dominated by random reads, because it consists mostly of loading (i.e., page-faulting) executable programs and libraries from a squashfs image. Therefore, there is an inherent speed limit determined by the read access size and the latency between the IPMI viewer and the server. Even for linear reads, where the readahead mechanism should help, the virtual CD-ROM read speed is limited to something like 250 kilobytes per second at 100 ms of latency. This is lower than 2x CD speed. If you are working from home over an ADSL or LTE connection, the speed is even worse.
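As a back-of-the-envelope illustration (the per-request transfer size is an assumed figure for the sake of the example, not something taken from the USB specification): if each virtual-media request moves about 25 KB and costs one full round trip, then at 100 ms of latency:

```shell
# throughput = kilobytes per request / round-trip time
REQUEST_KB=25   # assumed kilobytes transferred per virtual-media request
RTT_MS=100      # viewer-to-server round-trip time, in milliseconds
echo "$(( REQUEST_KB * 1000 / RTT_MS )) KB/s"   # prints "250 KB/s"
```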

With distant servers or slow internet links on the viewer side, Linux installers sometimes time out loading their own components from the virtual media. In fact, when trying to install CentOS 7 over IPMI on a server located in Latvia, I always got a disconnection before it could load 100 megabytes from the installation CD. Therefore, this installation method was unusable for me, even with a good internet connection. This demonstrates why it is a good idea to avoid installing operating systems over IPMI from large ISO images. Even "minimal" or "netinstall" images are typically too large!

So, what are the available alternatives?

Having another host near the server being installed, with a remote desktop and a good internet connection, definitely helps, and is the best option if available.

Debian and Ubuntu offer special well-hidden "netboot" CD images which are even smaller than "netinstall". They are less than 75 MB in size and therefore have a better chance of success. In fact, even when not constrained by IPMI, I never use anything else to install these Linux distributions.

And it turns out that, if your server obtains its IP address via DHCP, there is an even easier option: load the installer from the network. There is a special website, netboot.xyz, that hosts iPXE configuration files. All you need to download is a very small (1 MB) ISO with iPXE itself. Boot the server from it, and you will get a menu with many Linux variants available for installation.
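For example, the image can be fetched like this and then attached as virtual media (the URL is the download location at the time of writing; check netboot.xyz for the current one):

```shell
# Download the ~1 MB netboot.xyz iPXE boot image (legacy BIOS variant).
wget https://boot.netboot.xyz/ipxe/netboot.xyz.iso
```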

There are two limitations with the netboot.xyz approach. First, to install some Linux variants, you do need DHCP, and not all hosting providers use it for their dedicated servers. Even though there is a failsafe menu where you can set a static IP address, some installers, including CentOS, assume DHCP. Second, you will not get UEFI, because their UEFI ISO is broken. For both limitations, bugs are filed and hopefully will get fixed soon.

A last-resort option, but still an option, would also be to install the system as you need it in a local virtual machine, and then transfer the block device contents to the server's disks over the network. To do so, you need a temporary lightweight Linux distribution. Tiny Core Linux (especially the Pure 64 port) fits the bill quite well. You can boot from their ISO image (which is currently only 26 MB in size, and supports UEFI) without installing it, configure the network if needed, install openssh using the "tce-load" tool, set the "tc" user password, and do what you need over ssh.
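As a rough sketch of this workflow (the "tc" user is Tiny Core's default; the host name, the image path, and the target device /dev/sda are assumptions to double-check, since the dd command overwrites the target disk):

```shell
# On the server, booted into the Tiny Core ISO:
tce-load -wi openssh                      # fetch and install the openssh extension
sudo /usr/local/etc/init.d/openssh start  # you may first need to create
                                          # sshd_config from the shipped example
passwd                                    # set a password for the "tc" user

# On the machine that holds the freshly installed virtual machine image
# (WARNING: this overwrites the server's disk):
dd if=vda.img bs=1M status=progress | \
    ssh tc@server.example.com "sudo dd of=/dev/sda bs=1M"
```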

With the three techniques demonstrated so far, I can now say with confidence that slow IPMI virtual media will no longer stop me from remotely installing the Linux distribution that I need, exactly as I need.

Sunday, June 16, 2019

KExec on modern distributions

Many years ago, I mentioned that KExec is a good way to reduce server downtime due to reboots. It is still the case, and it is especially true for hosters such as OVH and Hetzner that set their dedicated servers to boot from the network (so that one can activate their rescue system from the web) before passing control to the boot loader installed on the hard disk. Look, it takes 1.5 minutes to reboot an OVH server, and, on some server models, you can reduce this time to 15 seconds by avoiding all the time spent in the BIOS and in their netbooted micro-OS! Well, to be fair, on OVH you can avoid the netbooted environment by changing the boot priorities from the BIOS.

The problem is, distributions make KExec too complicated to use. E.g., Debian and Ubuntu, when one installs kexec-tools, present a prompt asking whether KExec should handle reboots. The catch is that this works only with sysvinit, not with systemd. With systemd, you are supposed to remember to type "systemctl kexec" if you want the next reboot to be handled by KExec. And it's not only distributions: since systemd version 236, KExec is supported only together with UEFI and the "sd-boot" boot loader, while the majority of hosters still stick with the legacy boot process and the majority of Linux distributions still use GRUB2 as their boot loader. An attempt to run "systemctl kexec" on an unsupported setup results in this error message:

Cannot find the ESP partition mount point.

Or, if /boot is on mdadm-based RAID1, another, equally stupid and unhelpful, error:

Failed to probe partition scheme of "/dev/block/9:2": Input/output error

While switching to UEFI and sd-boot is viable in some cases, it is not always the case. Fortunately, there is a way to override systemd developers' stance on what's supported, and even make the "reboot" command invoke KExec. Note that the setup is a big bad unsupported hack. There are no guarantees that the setup below will work with future versions of systemd.

The trick is to create the service that loads the new kernel and to override the commands that systemd executes when doing the actual reboot.

Here is the unit that loads the new kernel.

# File: /etc/systemd/system/kexec-load.service
[Unit]
Description=Loading new kernel into memory
Documentation=man:kexec(8)
DefaultDependencies=no
Before=reboot.target
RequiresMountsFor=/boot

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/sbin/kexec -d -l /vmlinuz --initrd=/initrd.img --reuse-cmdline

[Install]
WantedBy=default.target

It assumes that symlinks to the installed kernel and the initrd are available in the root directory, which is true for Debian and Ubuntu. On other systems please adjust the paths as appropriate. E.g., on Arch Linux, the correct paths are /boot/vmlinuz-linux and /boot/initramfs-linux.img.

There are other variants of this unit circulating around. A common mistake is that nothing ensures that the attempt to load the new kernel from /boot happens before the /boot partition is unmounted. The unit above does not have this race condition issue.

The second part of the puzzle is an override file that replaces the commands that reboot the system.

# File: /etc/systemd/system/systemd-reboot.service.d/override.conf
[Service]
Type=oneshot
ExecStart=
ExecStart=-/bin/systemctl --force kexec
ExecStart=/bin/systemctl --force reboot

That's it: try to kexec, and hopefully it does not return. If it does, then ignore the error and try the regular reboot.
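If you want to check first whether your hardware survives KExec at all, you can try it once by hand, outside of systemd. Note that "kexec -e" jumps into the new kernel immediately, without cleanly stopping services or unmounting filesystems, so treat this as a crash test on a machine you can afford to fsck afterwards:

```shell
# One-off manual KExec test (Debian/Ubuntu symlink paths; adjust on
# other distributions as discussed above).
# Warning: "kexec -e" does not shut anything down cleanly.
kexec -l /vmlinuz --initrd=/initrd.img --reuse-cmdline
sync
kexec -e
```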

For safety, let's also create a script that temporarily disables the override and thus performs one normal BIOS-based reboot.

#!/bin/sh
# File: /usr/local/bin/normal-reboot
mkdir -p /run/systemd/transient/systemd-reboot.service.d/
ln -sf /dev/null /run/systemd/transient/systemd-reboot.service.d/override.conf
ln -sf /dev/null /run/systemd/transient/kexec-load.service
systemctl daemon-reload
reboot

Give the script proper permissions and enable the service:

chmod 0755 /usr/local/bin/normal-reboot
systemctl enable kexec-load.service
reboot

If everything goes well, this will be the last BIOS-based reboot. Further reboots will be handled by KExec, even if you type "reboot".

This blog post would be incomplete without instructions on what to do if the setup fails. And it can fail for various reasons, e.g., due to incompatible hardware or a driver assuming that its device has been properly reset by the BIOS.

Well, the most common problem is with a corrupted graphical framebuffer console. In this case, it may be sufficient to add "nomodeset" to the kernel command line.
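On Debian or Ubuntu with GRUB2, this can be done as follows (the sed expression assumes the stock GRUB_CMDLINE_LINUX_DEFAULT line; review /etc/default/grub afterwards). Because the kexec-load unit uses --reuse-cmdline, the new option takes effect only after one normal BIOS-based reboot, e.g. via the normal-reboot script above:

```shell
# Prepend "nomodeset" to the default kernel command line and
# regenerate the GRUB configuration.
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&nomodeset /' /etc/default/grub
update-grub
```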

Other systems may not be fixable so easily, or at all. E.g., on some OVH dedicated servers (in particular, on their "EG-32" product which is based on the Intel Corporation S1200SPL board), the kexec-ed kernel cannot properly route IRQs, and therefore does not detect SATA disks, and the on-board Ethernet adapter also becomes non-functional. In such cases, it is necessary to hard-reset the server and undo the setup. Here is how:

systemctl disable kexec-load.service
rm -rf /etc/systemd/system/systemd-reboot.service.d
rm -f /etc/systemd/system/kexec-load.service
rm -f /usr/local/bin/normal-reboot
systemctl daemon-reload
reboot

This reboot, and all further reboots, will go through the BIOS.

Friday, January 18, 2019

dynv6.com: IPv6 dynamic DNS done right

Sometimes, your home PC or router does not have a static IP address. In this case, if you want to access your home network remotely, a common solution is to use a dynamic DNS provider, and configure your router to update its A record (a DNS record that holds an IPv4 address) each time the external address changes. Then you can be sure that your domain name always points to the current external IPv4 address of the router.

For accessing PCs behind the router, some trickery is needed, because usually there is only one public IPv4 address available for the whole network. Therefore, the router performs network address translation (NAT), and home PCs are not directly reachable. So, you have to either use port forwarding, or a full-blown VPN. Still, you need only one dynamic DNS record, because there is only one dynamic IP — the external IP of your router, and that's where you connect to.

Enter the IPv6 world. There is no NAT anymore, the ISP allocates a whole /64 (or maybe larger) prefix for your home network(s), and every home PC becomes reachable using its individual IPv6 address. Except, now all addresses are dynamic. On "Rostelecom" ISP in Yekaterinburg, Russia, they are dynamic even if you order a static IP address, i.e. only IPv4 is static then, and there is no way to get a statically allocated IPv6 network.

A typical IPv6 network has a prefix length of 64. This means that the first 64 bits denote the network and are assigned (dynamically) by the ISP, while the lower 64 bits refer to the host and do not change when the ISP assigns a new prefix. Often, but not always, the host part is just the MAC address with the second-lowest bit of the first octet inverted and ff:fe inserted into the middle. This mechanism is often called EUI-64. For privacy reasons, there are typically also other, short-lived IPv6 addresses on the interface, but let's ignore them.
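The transformation is easy to reproduce by hand; here is an illustrative sketch (the MAC address is an example):

```shell
# Derive the EUI-64 host part of an IPv6 address from a MAC address.
mac="00:06:29:6c:f3:e5"
IFS=: read -r o1 o2 o3 o4 o5 o6 <<EOF
$mac
EOF
# Invert the second-lowest bit of the first octet (the universal/local bit)
o1=$(printf '%02x' $(( 0x$o1 ^ 0x02 )))
# Insert ff:fe between the two halves of the MAC address
printf '%s%s:%sff:fe%s:%s%s\n' "$o1" "$o2" "$o3" "$o4" "$o5" "$o6"
# prints "0206:29ff:fe6c:f3e5"
```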

Unfortunately, many dynamic DNS providers have implemented their IPv6 support as a direct copy of the IPv4 scheme, even though it does not really make sense. That is, a dynamic DNS client can update its own AAAA record using some web API call, and that's it. If you run a dynamic DNS client on the router, then only the router's DNS record is updated, and there is still no way to access home PCs individually, short of running dynamic DNS clients on all of them. In other words, the fact that the addresses in the LAN are related and should be updated as a group is usually completely ignored.

The dynv6.com dynamic DNS provider is a pleasant exception. After registration, you get a third-level domain corresponding to your home network. You can also add records to that domain, corresponding to each of your home PCs. And while doing so, you can either specify the full IPv6 address (as you can do with the traditional dynamic DNS providers), or only the host part, or the MAC address. The ability to specify only the host part (or infer it from the MAC address) is what makes their service useful. Indeed, if the parent record (corresponding to your whole network) changes, then its network part is reapplied to all host records that don't specify the network part of their IPv6 address explicitly. So, you can run only one dynamic DNS client on the router, and get domain names corresponding to all of your home PCs.

Let me illustrate this with an example.

Suppose that your router has obtained the following addresses from the ISP:

2001:db8:5:65bc:d0a2:1545:fbfe:d0b9/64 for the WAN interface
2001:db8:b:9a00::/56 as a delegated prefix

Then, it will (normally) use 2001:db8:b:9a00::1/64 as its LAN address, and PCs will get addresses from the 2001:db8:b:9a00::/64 network. You need to configure the router to update the AAAA record (let's use example.dynv6.net) with its LAN IPv6 address. Yes, LAN (and many router firmwares, including OpenWRT, get it wrong by default), because the WAN IPv6 address is completely unrelated to your home network. Then, using the web, create some additional AAAA records under the example.dynv6.net address:

desktop AAAA ::0206:29ff:fe6c:f3e5   # corresponds to MAC address 00:06:29:6c:f3:e5
qemu AAAA ::5054:00ff:fe12:3456   # corresponds to MAC address 52:54:00:12:34:56

Or, you could enter MAC addresses directly.

As I have already mentioned, the beauty of dynv6.com is that it does not interpret these addresses literally, but prepends the proper network part. That is, name resolution would actually yield reachable addresses:

example.dynv6.net. AAAA 2001:db8:b:9a00::1   # The only record that the router has to update
desktop.example.dynv6.net. AAAA 2001:db8:b:9a00:206:29ff:fe6c:f3e5   # Generated
qemu.example.dynv6.net. AAAA 2001:db8:b:9a00:5054:ff:fe12:3456    # Also generated

And you can, finally, connect remotely to any of those devices.
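On the router side, the single update can then be a plain HTTP request; dynv6 provides a simple HTTP update API roughly along these lines. The token is a placeholder, "br-lan" is an example LAN interface name (OpenWRT-style), and the exact parameter names should be checked against dynv6's own documentation:

```shell
# Update the zone's AAAA record with the router's LAN IPv6 address.
TOKEN=your-token-here           # obtained from the dynv6 web interface
ADDR=$(ip -6 addr show dev br-lan scope global | \
    awk '/inet6/ { sub(/\/.*/, "", $2); print $2; exit }')
curl -fsS "https://dynv6.com/api/update?hostname=example.dynv6.net&token=${TOKEN}&ipv6=${ADDR}"
```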

Sunday, January 6, 2019

Resizing Linux virtual machine disks

Sometimes, one runs out of disk space on a virtual machine, and realizes that it was a mistake to provide such a small disk to it in the first place. Fortunately, unlike real disks, the virtual ones can be resized at will. A handy command for this task comes with QEMU (and, if you are on Linux, why are you using anything else?). Here is how to extend a raw disk image to 10 GB:

qemu-img resize -f raw vda.img 10G

After running this command, the beginning of the disk will contain the old bytes that were there before, and at the end there will be a long run of zeroes. qemu-img is smart enough to avoid actually writing these zeroes to the disk image; it creates a sparse file instead.
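You can confirm this by comparing the file's apparent size with the space it actually occupies on the host filesystem (vda.img as in the example above):

```shell
# The apparent size is now 10G, but only the originally written data
# actually occupies disk space on the host filesystem.
ls -lh vda.img
du -h vda.img
```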

Resizing the disk image is only one-third of the job. The partition table still lists partitions of the old sizes, and the end of the disk is unused. Traditionally, fdisk has been the tool for altering the partition table. You can run it either from within your virtual machine, or directly on the disk image. All that is needed is to delete the last partition, and then recreate it with the same start sector, but with the correct size, so that it also covers the new part of the disk. Here is an example session with a simple MBR-based disk with two partitions:

# fdisk /dev/vda

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): p
Disk /dev/vda: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc02e3411

Device     Boot  Start      End  Sectors  Size Id Type
/dev/vda1  *      2048   997375   995328  486M 83 Linux
/dev/vda2       997376 12580863 11583488  5.5G 83 Linux

Command (m for help): d
Partition number (1,2, default 2): 2

Partition 2 has been deleted.

Command (m for help): n
Partition type
   p   primary (1 primary, 0 extended, 3 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (2-4, default 2): 2
First sector (997376-20971519, default 997376): 
Last sector, +sectors or +size{K,M,G,T,P} (997376-20971519, default 20971519): 

Created a new partition 2 of type 'Linux' and of size 9.5 GiB.
Partition #2 contains a ext4 signature.

Do you want to remove the signature? [Y]es/[N]o: n

Command (m for help): p

Disk /dev/vda: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc02e3411

Device     Boot  Start      End  Sectors  Size Id Type
/dev/vda1  *      2048   997375   995328  486M 83 Linux
/dev/vda2       997376 20971519 19974144  9.5G 83 Linux

Command (m for help): w
The partition table has been altered.
Syncing disks.

As you see, it went smoothly. The kernel will pick up the new partition table after a reboot, and then you will be able to resize the filesystem with resize2fs (or some other tool if you are not using ext4).
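For example, for the ext4 filesystem on /dev/vda2 from the session above, after the reboot:

```shell
# Grow the ext4 filesystem to the new size of the partition;
# resize2fs picks the target size up automatically.
resize2fs /dev/vda2
# Verify the new size (assuming /dev/vda2 is the root filesystem):
df -h /
```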

Things are not so simple if the virtual disk is partitioned with GPT, not MBR, to begin with. The complication stems from the fact that there is a backup copy of GPT at the end of the disk. When we added zeros to the end of the disk, the backup copy ended up in the middle of the disk, not to be found. Also, the protective MBR now covers only the first part of the disk. The kernel is able to deal with this, but some versions of fdisk (at least fdisk found in Ubuntu 18.04) cannot. What happens is that fdisk is not able to create partitions that extend beyond the end of the old disk. And saving anything (in fact, even saving what already exists) fails with a rather unhelpful error message:

# fdisk /dev/vda

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

GPT PMBR size mismatch (12582911 != 20971519) will be corrected by w(rite).
GPT PMBR size mismatch (12582911 != 20971519) will be corrected by w(rite).

Command (m for help): w
GPT PMBR size mismatch (12582911 != 20971519) will be corrected by w(rite).
fdisk: failed to write disklabel: Invalid argument

Modern versions of fdisk do not have this problem:
# fdisk /dev/vda

Welcome to fdisk (util-linux 2.33).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

GPT PMBR size mismatch (12582911 != 20971519) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.


Still, with GPT and not-so-recent versions of fdisk, it looks like we cannot use fdisk to take advantage of the newly added disk space. There is another tool, gdisk, that can manipulate GPT structures. However, it claims that there is almost no usable free space on the disk, and thus refuses to usefully resize the last partition by default.

# gdisk /dev/vda
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): p
Disk /dev/vda: 20971520 sectors, 10.0 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 61E744EF-1CD3-5145-BC59-4646E6CB03DE
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 12582878
Partitions will be aligned on 2048-sector boundaries
Total free space is 2015 sectors (1007.5 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            4095   1024.0 KiB  EF02  
   2            4096          999423   486.0 MiB   8300  
   3          999424        12580863   5.5 GiB     8300  


What we need to do is to use the "expert" functionality in order to move the backup GPT to the end of the disk. After that, new free space will be available, and we will be able to resize the last partition.

Command (? for help): x

Expert command (? for help): e
Relocating backup data structures to the end of the disk

Expert command (? for help): m

Command (? for help): d
Partition number (1-3): 3

Command (? for help): n
Partition number (3-128, default 3): 
First sector (999424-20969472, default = 999424) or {+-}size{KMGTP}: 
Last sector (999424-20969472, default = 20969472) or {+-}size{KMGTP}: 
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300): 
Changed type of partition to 'Linux filesystem'

Command (? for help): p
Disk /dev/vda: 20971520 sectors, 10.0 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 61E744EF-1CD3-5145-BC59-4646E6CB03DE
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 20969472
Partitions will be aligned on 2048-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            4095   1024.0 KiB  EF02  
   2            4096          999423   486.0 MiB   8300  
   3          999424        20969472   9.5 GiB     8300  Linux filesystem

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/vda.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.


OK, it worked, but it was complicated enough to make one consider sticking with MBR where possible. And it changed the UUID of the last partition, which may or may not be OK in your setup (it is definitely not OK if /etc/fstab or the kernel command line mentions PARTUUID). The same warning about PARTUUID applies to modern versions of fdisk, too.

Anyway, it turns out that gdisk is not the simplest solution to the problem of resizing a GPT-based disk. The "sfdisk" program that comes with util-linux (i.e. with the same package that provides fdisk, even with not-so-recent versions) works just as well. We need to dump the existing partitions, edit the resulting script, and feed it back to sfdisk so that it recreates these partitions for us from scratch, with the correct sizes, and we can preserve all partition UUIDs, too.

Here is what this dump looks like:

# sfdisk --dump /dev/vda > disk.dump
GPT PMBR size mismatch (12582911 != 20971519) will be corrected by w(rite).
# cat disk.dump
label: gpt
label-id: 61E744EF-1CD3-5145-BC59-4646E6CB03DE
device: /dev/vda
unit: sectors
first-lba: 2048
last-lba: 12582878

/dev/vda1 : start=        2048, size=        2048, type=[...], uuid=[...]
/dev/vda2 : start=        4096, size=      995328, type=[...], uuid=[...]
/dev/vda3 : start=      999424, size=    11581440, type=[...], uuid=[...]

We need to fix the "last-lba" parameter and change the size of the last partition. Fortunately, sfdisk has reasonable defaults (use as much space as possible) for both parameters, so we can just delete them instead. This is quite easy to do with sed:

# sed -i -e '/^last-lba:/d' -e '$s/size=[^,]*,//' disk.dump
# cat disk.dump
label: gpt
label-id: 61E744EF-1CD3-5145-BC59-4646E6CB03DE
device: /dev/vda
unit: sectors
first-lba: 2048

/dev/vda1 : start=        2048, size=        2048, type=[...], uuid=[...]
/dev/vda2 : start=        4096, size=      995328, type=[...], uuid=[...]
/dev/vda3 : start=      999424,  type=[...], uuid=[...]

Then, with some flags to turn off various checks, sfdisk loads the modified partition table dump:

# sfdisk --no-reread --no-tell-kernel -f --wipe never /dev/vda < disk.dump
GPT PMBR size mismatch (12582911 != 20971519) will be corrected by w(rite).
Disk /dev/vda: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 61E744EF-1CD3-5145-BC59-4646E6CB03DE

Old situation:

Device      Start      End  Sectors  Size Type
/dev/vda1    2048     4095     2048    1M BIOS boot
/dev/vda2    4096   999423   995328  486M Linux filesystem
/dev/vda3  999424 12580863 11581440  5.5G Linux filesystem

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 61E744EF-1CD3-5145-BC59-4646E6CB03DE).
/dev/vda1: Created a new partition 1 of type 'BIOS boot' and of size 1 MiB.
/dev/vda2: Created a new partition 2 of type 'Linux filesystem' and of size 486 MiB.
Partition #2 contains a ext4 signature.
/dev/vda3: Created a new partition 3 of type 'Linux filesystem' and of size 9.5 GiB.
Partition #3 contains a ext4 signature.
/dev/vda4: Done.

New situation:
Disklabel type: gpt
Disk identifier: 61E744EF-1CD3-5145-BC59-4646E6CB03DE

Device      Start      End  Sectors  Size Type
/dev/vda1    2048     4095     2048    1M BIOS boot
/dev/vda2    4096   999423   995328  486M Linux filesystem
/dev/vda3  999424 20971486 19972063  9.5G Linux filesystem

The partition table has been altered.


Let me repeat. The following lines, ready to be copy-pasted without much thinking except for the disk name, resize the last partition to occupy as much disk space as possible, and they work both on GPT and MBR:

DISK=/dev/vda
sfdisk --dump $DISK > disk.dump
sed -i -e '/^last-lba:/d' -e '$s/size=[^,]*,//' disk.dump
sfdisk --no-reread --no-tell-kernel -f --wipe never $DISK < disk.dump

On Ubuntu, there is also a "cloud-guest-utils" package that provides an even easier "growpart" command. Here is how it works:
# growpart /dev/vda 3
CHANGED: partition=3 start=999424 old: size=11581440 end=12580864 new: size=19972063,end=20971487

Just as with MBR, after resizing the partition, you have to reboot your virtual machine, so that the kernel picks up the new partition size, and then you can resize the filesystem to match the partition size.
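If a reboot is inconvenient, you can instead try to push the new partition size to the kernel in place; on reasonably recent kernels this works even for a partition that is in use (the device and partition number are from the example above):

```shell
# Ask the kernel to re-read the size of partition 3 only, then grow
# the filesystem. If the in-place update fails on an older kernel,
# fall back to a reboot.
partx --update --nr 3 /dev/vda
resize2fs /dev/vda3
```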

Thursday, January 3, 2019

Using Let's Encrypt certificates with GeoDNS

Let's Encrypt is a popular free TLS certificate authority. It currently issues certificates valid for only 90 days, and thus it is a good idea to automate their renewal. Fortunately, there are many tools to do so, including the official client called Certbot.

When Certbot or any other client asks Let's Encrypt for a certificate, it must prove that it indeed controls the domain names that are to be listed in the certificate. There are several ways to obtain such proof, by solving one of the possible challenges. The HTTP-01 challenge requires the client to make a plain-text file with a given name and content available under the domain in question via HTTP, on port 80. The DNS-01 challenge requires publishing a specific TXT record in DNS. There are other, less popular, kinds of challenges. HTTP-01 is the simplest challenge to use when you have only one server that needs a non-wildcard TLS certificate for a given domain name (or several domain names).

Sometimes, however, you need to have a certificate for a given domain name available on more than one server. Such a need arises, e.g., if you use GeoDNS or DNS-based load balancing, i.e., you answer DNS requests for your domain name (e.g., www.example.com) differently for different clients. E.g., you may want to have three servers, one in France, one in Singapore, and one in the USA, and respond based on the client's IP address by returning the address of the geographically closest server. However, this presents a problem when trying to obtain a Let's Encrypt certificate. E.g., the HTTP-01 challenge fails out of the box because Let's Encrypt will likely connect to a different node than the one asking for the certificate, and will not find the text file that it looks for.

A traditional solution to this problem would be to set up a central server, let it respond to the challenges, and copy the certificates from it periodically to all the nodes.

Making the central server solve DNS-01 challenges is trivial — all that is needed is an automated way to change DNS records in your zone, and scripts are available for many popular DNS providers. I am not really comfortable with this approach, because if an intruder gets access to your central server, they can not only get a certificate and a private key for www.example.com, but also take over the whole domain, i.e. point the DNS records (including non-www) to their own server. This security concern can be alleviated by the use of CNAMEs that point _acme-challenge to a separate DNS zone with separate permissions, but doing so breaks all Let's Encrypt clients known to me. Some links: two bug reports for the official client, and my own GitHub gist for a modified Dehydrated hook script for Amazon Route 53.

For HTTP-01, the setup is different: you need to make the central server available over HTTP on a separate domain name (e.g. auth.example.com), and configure all the nodes to issue redirects when Let's Encrypt tries to verify the challenge. E.g., http://www.example.com/.well-known/acme-challenge/anything must redirect to http://auth.example.com/.well-known/acme-challenge/anything, and then Certbot running on auth.example.com will be able to obtain certificates for www.example.com without the security risk inherent for DNS-01 challenges. Proxying the requests, instead of redirecting them, also works.

Scripting the process of certificate distribution back to cluster nodes, handling network errors, reloading Apache (while avoiding needless restarts) and monitoring the result is another matter.

So, I asked myself a question: would it be possible to simplify this setup, if there are only a few nodes in the cluster? In particular, avoid the need to copy files from server to server, and to get rid of the central server altogether. And ideally get rid of any fragile custom scripts. It turns out that, with a bit of Apache magic, you can do that. No custom scripts are needed, no ssh keys for unattended distribution of files, no central server, just some simple rewrite rules.

Each of the servers will run Certbot and request certificates independently. The idea is to have a server ask someone else when it doesn't know the answer to the HTTP-01 challenge.

To do so, we need to enable mod_rewrite, mod_proxy, and mod_proxy_http on each server. Also, I assume that you already have some separate domain names (not for the general public) pointing to each of the cluster nodes, just for the purpose of solving the challenges. E.g., www-fr.example.com, www-sg.example.com, and www-us.example.com.

So, here is the definition of the Apache virtual host that responds to unencrypted HTTP requests. The same configuration file works for all cluster nodes.

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias example.com
    ServerAlias www-fr.example.com
    ServerAlias www-sg.example.com
    ServerAlias www-us.example.com

    ProxyPreserveHost On
    RewriteEngine On

    # First block of rules - solving known challenges.
    RewriteCond /var/www/letsencrypt/.well-known/acme-challenge/$2 -f
    RewriteRule ^/\.well-known/acme-challenge(|-fr|-sg|-us)/(.*) \
        /var/www/letsencrypt/.well-known/acme-challenge/$2 [L]

    # Second block of rules - passing unknown challenges further.
    # Due to RewriteCond in the first block, we already know at this
    # point that the file does not exist locally.
    RewriteRule ^/\.well-known/acme-challenge/(.*) \
        http://www-fr.example.com/.well-known/acme-challenge-fr/$1 [P,L]
    RewriteRule ^/\.well-known/acme-challenge-fr/(.*) \
        http://www-sg.example.com/.well-known/acme-challenge-sg/$1 [P,L]
    RewriteRule ^/\.well-known/acme-challenge-sg/(.*) \
        http://www-us.example.com/.well-known/acme-challenge-us/$1 [P,L]
    RewriteRule ^/\.well-known/acme-challenge-us/(.*) - [R=404]

    # HTTP to HTTPS redirection for everything not matched above
    RewriteRule /?(.*) https://www.example.com/$1 [R=301,L]
</VirtualHost>

For a complete example, add a virtual host for port 443 that serves your web application on https://www.example.com.
<VirtualHost *:443>
    ServerName www.example.com
    ServerAlias example.com

    # You may want to have a separate virtual host or a RewriteRule
    # for redirecting browsers who visit https://example.com or any
    # other unwanted domain name to https://www.example.com.
    # E.g.:

    RewriteEngine On
    RewriteCond %{HTTP_HOST} !=www.example.com [NC]
    RewriteRule /?(.*) https://www.example.com/$1 [R=301,L]

    # Configure Apache to serve your content
    DocumentRoot /var/www/example

    SSLEngine on
    Include /etc/letsencrypt/options-ssl-apache.conf

    # Use any temporary certificate here, even a self-signed one works.
    # This piece of configuration will be replaced by Certbot.
    SSLCertificateFile /etc/ssl/certs/ssl-cert-snakeoil.pem
    SSLCertificateKeyFile /etc/ssl/private/ssl-cert-snakeoil.key
</VirtualHost>

Run Certbot like this, on all servers:
mkdir -p /var/www/letsencrypt/.well-known/acme-challenge
certbot -d example.com -d www.example.com -w /var/www/letsencrypt \
    --noninteractive --authenticator webroot --installer apache

Any other Let's Encrypt client that works by placing files into a directory will do as well. Apache's mod_md will not work, though, because it deliberately blocks all requests for unknown challenge files, which is exactly the opposite of what we need.

Let's see how it works.

Certbot asks Let's Encrypt for a certificate. Let's Encrypt tells Certbot the name of the file that it will try to fetch, and the expected contents. Certbot places this file under /var/www/letsencrypt/.well-known/acme-challenge and tells Let's Encrypt that it can verify that the file is there. Let's Encrypt resolves www.example.com (and example.com, but let's forget about it) in the DNS, and then requests this file under http://www.example.com/.well-known/acme-challenge/.

If their verifier is lucky enough to hit the same server that asked for the certificate, the RewriteCond for the first RewriteRule will be true (it just tests the file existence), and, due to this rule, Apache will serve the file. Note that the rule responds not only to acme-challenge URLs, but also to acme-challenge-fr, acme-challenge-sg, and acme-challenge-us URLs used internally by other servers.

If the verifier is unlucky, the challenge file will not be found, and the second block of RewriteRule directives comes into play. Let's say that it was the Singapore server that requested the certificate (and thus can respond), but Let's Encrypt has contacted the server in the USA.

For the request sent by the Let's Encrypt verifier, only the first rule in the second block will match. It will proxy the request from the server in the USA to the French server, using the "acme-challenge-fr" directory in the URL to record the fact that this is not the original request. (We already know at this point that the file does not exist locally, otherwise the first block would have served it and stopped.) The French server will not find the file either, so it will skip the first block and apply the second rule in the second block of RewriteRules (because it sees "acme-challenge-fr" in the URL). Thus, the request will be proxied again, this time to the Singapore server, with "acme-challenge-sg" in the URL. As it was the Singapore server that requested the certificate, it will find the file and respond with its contents, and, through the French and US servers, the Let's Encrypt verifier will get the response and issue the certificate.
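The chain of rewrites above can be sketched as a small shell function (this is a simulation to illustrate which node ends up answering, not Apache itself; the node names and the "holder" variable marking who has the challenge file are illustrative):

```shell
# Which node holds the challenge file (sg, as in the walkthrough above).
holder=sg

# resolve <node> <url-suffix>: prints the node that finally serves the
# file, or 404 if the chain is exhausted. The suffix mirrors the
# acme-challenge, acme-challenge-fr, -sg, -us directories in the URLs.
resolve() {
    local node=$1 suffix=$2
    if [ "$node" = "$holder" ]; then
        echo "$node"            # first block of rules: file found, serve it
        return 0
    fi
    case "$suffix" in           # second block: proxy further down the chain
        '')  resolve fr -fr ;;
        -fr) resolve sg -sg ;;
        -sg) resolve us -us ;;
        -us) echo 404 ;;        # end of chain: stray request, 404
    esac
}

resolve us ''    # Let's Encrypt hit the US node; prints: sg
```

On a real deployment, the same thing can be sanity-checked by placing a test file into /var/www/letsencrypt/.well-known/acme-challenge on one node and fetching it with curl through another node's plain acme-challenge URL.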

The last RewriteRule in the second block terminates the chain for stray requests not originating from Let's Encrypt. Such requests get proxied three times and finally get a 404.

The proposed scheme is, in theory, extensible to any number of servers — all that is needed is that they are all online, and that the chain of proxying the request through all of them is not too slow. But there is a limit on Let's Encrypt's side on the number of duplicate certificates: 5 per week. I would guess (but have not verified) that, in practice, both factors together mean that at most 5 servers in the cluster are safe — which is still good enough for some purposes.

Friday, December 28, 2018

LDAP with STARTTLS considered harmful

A few weeks ago, I hit a security issue on the company's LDAP server. Namely, it was not protected well enough against misconfigured clients that send passwords in cleartext.

There are two mechanisms defined in LDAP that protect passwords in transit:

  1. SASL binds
  2. SSL/TLS
SASL is an extensible framework that allows arbitrary authentication mechanisms. However, all of the widely-implemented ones are either not based on passwords at all (so not suitable for our use case), or send the password in cleartext (so no better than a simple non-SASL bind), or require the server to store the password in cleartext for verification (even worse). On top of that, web applications usually do not support LDAP with SASL binds.

SSL/TLS, on the other hand, is a widely-supported industry standard for encrypting and authenticating the data (including passwords) in transit.

OpenLDAP assigns so-called Security Strength Factor to each authentication mechanism, based on how well it protects authentication data on the network. For SSL, it is usually (but not always) the number of bits in the symmetric cipher key used during the session.

For LDAP, the "normal" way of implementing SSL is to support the "STARTTLS" request on port 389 (the same port as for unencrypted LDAP sessions). There is also a not-really-standard way, with TLS right from the start of the connection (i.e. no STARTTLS request), on port 636. This is called "ldaps".

OpenLDAP can require a certain minimum Security Strength Factor for authentication attempts. In slapd.conf, it is set like this: "security ssf=128". There are also related configuration directives, TLSProtocolMin, which sets the minimum SSL/TLS protocol version, and localSSF, which is the Security Strength Factor assumed on local unix-socket connections (ldapi:///).
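Put together, the relevant slapd.conf fragment could look like this (the values are illustrative; note that OpenLDAP numbers TLS versions in the SSL 3.x scheme, so 3.3 means TLS 1.2):

```
security ssf=128      # refuse binds unless the connection provides at least 128-bit protection
localSSF 128          # treat ldapi:/// (unix-socket) connections as equivalent to ssf=128
TLSProtocolMin 3.3    # require TLS 1.2 or newer
```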

So, we configure certificates, set an appropriate security strength factor, disable anonymous bind, and that's it? No. This still doesn't prevent the password leak.

Suppose that someone configures Apache like this:

    <Location />
        AuthType basic
        AuthName "example.com users only"
        AuthBasicProvider ldap
        AuthLDAPInitialBindAsUser on
        AuthLDAPInitialBindPattern (.+) uid=$1,ou=users,dc=example,dc=com
        AuthLDAPURL "ldap://ldap.example.com/ou=users,dc=example,dc=com?uid?sub?"
        Require valid-user
    </Location>

See the mistake? They forgot to make the client use STARTTLS (i.e., forgot to add the word "TLS" as the last parameter to AuthLDAPURL).

Let's look at what happens if a user tries to log in. Apache will connect to the LDAP server on port 389, successfully. Then, it will create an LDAP request for a simple bind, using the user's username and password, and send it. And it will be sent successfully, in cleartext, over the network. Of course, OpenLDAP will carefully receive the request, parse it, and then refuse to authenticate the user, but it's too late. The password has already been sent in cleartext, and somebody between the servers has already captured it with the government-mandated tcpdump equivalent.

This would not have happened if the LDAP server were listening on port 636 (SSL) only. In this case, requests to port 389 will get an RST before Apache gets a chance to send the password. And requests which use the ldaps:// scheme are always encrypted. An additional benefit is that PHP-based software that is not specifically coded to use STARTTLS for LDAP (i.e. does not contain ldap_start_tls() function call) will continue to work when given the ldaps:// URL, and I don't have to audit it for this specific issue. Isn't that wonderful? So, please (after reconfiguring all existing clients) make sure that your LDAP server does not listen on port 389, and listens securely on port 636, instead.
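For completeness, here is the same Apache fragment with the mistake fixed (a sketch based on the configuration above):

```apache
<Location />
    AuthType basic
    AuthName "example.com users only"
    AuthBasicProvider ldap
    AuthLDAPInitialBindAsUser on
    AuthLDAPInitialBindPattern (.+) uid=$1,ou=users,dc=example,dc=com
    # ldaps:// (port 636) is encrypted from the first byte. Alternatively,
    # keep ldap:// and add TLS after the URL to request STARTTLS, but that
    # only protects clients that are configured correctly.
    AuthLDAPURL "ldaps://ldap.example.com/ou=users,dc=example,dc=com?uid?sub?"
    Require valid-user
</Location>
```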

Friday, October 19, 2018

I received a fake job alert

Today I received a job alert from a Russian job site, https://moikrug.ru/, which I do use to get job alerts. The job title was "Application Security Engineer", and it looked like I almost qualified. So, I decided to visit the company page behind the offer and see what else they do.

Result: they do a lot of interesting research, and they also have a "jobs" page, which is empty. Moreover, the job offer that I received contained links to Google Docs and an email address not on the company's domain, which looked quite unprofessional for a company of that size.

So, I went to the "contact us" page and called their phone number. The secretary was unaware of the exact list of currently open positions, but told me that all official offers would be on a different job site, https://hh.ru/. It lists 8 positions, but nothing related to what I received, and nothing matching my skills. So, we concluded that I had received a fake job offer from impostors who illegally used the company's name.

Conclusion: please beware of such scams, even when they arrive via big and reputable job sites.