OpenStack
Kernel Modules
Overview
This is not strictly an OpenStack problem, but it is the fallout of many things, including having some old kit, and this page has to live somewhere:
I bought some Mellanox ConnectX dual 10G ethernet cards (off eBay) -- primarily so I could have 10G in the lab -- which work a treat in CentOS 7.x. Woo!
OpenStack gradually cycles its generations of releases and the old ones are archived meaning you cannot rely on being able to re-install them (should catastrophe occur). You need to keep up (or archive the repos or ... something).
I've chosen to gradually refine my OpenStack installation scripts every few releases by excising kit from the old OpenStack, installing the new OpenStack and then snapshotting instances in old OpenStack and copying them into new OpenStack.
This is made slightly more onerous by the lack of kit meaning that the underlying storage, ceph, is running on the same boxes. A bit painful. I really could do with the upgrade instructions!
There is no documented OpenStack upgrade mechanism to date. Or rather, there is, but it starts "if you used the Ansible installer..."
Red Hat dropped support for CentOS (downstream of RHEL) -- without even giving the name back to the community -- in favour of CentOS Stream (upstream of RHEL).
This caused plenty of well-documented rancour but the real doozy is that CentOS Stream cannot be relied upon to be binary compatible with RHEL. It's ahead of RHEL, not running in lockstep with RHEL. It's more like a slow-moving Fedora. That's not a bad thing (I do all my development on Fedora $LATEST) but it's not the same as RHEL.
The real problem, here, is support by third parties of a thing that isn't RHEL compatible. See below.
CentOS Stream does have ongoing support from the CentOS Cloud SIG -- the people creating the centos-openstack-release-<name> repos, so that's good.
CentOS Stream does not have support for Mellanox ConnectX 10G cards. Boo!
Too old! It seems.
Mellanox don't support CentOS Stream (and the latest code dropped support for the ConnectX cards anyway -- a long time ago).
So it's not great. I need to upgrade OpenStack, I need to switch to CentOS Stream and CentOS Stream doesn't support the network card I'm using. *shakes fist*
Evidence
Note
rel=$(uname -r)
is going to be very useful from now on!
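The build steps further down trim this string with shell parameter expansion, so it helps to see what each form yields. A minimal sketch, using the kernel release that appears later on this page as a sample value (your own uname -r will differ):

```shell
# Sample value only; on a live system use: rel=$(uname -r)
rel=4.18.0-305.el8.x86_64

echo "${rel}"      # full release, as used in /boot/config-${rel}
echo "${rel%.*}"   # drops the trailing ".x86_64" -> 4.18.0-305.el8
echo "${rel%-*}"   # drops from the last "-"       -> 4.18.0
```

The % operator removes the shortest matching suffix, which is why %.* only eats the architecture and %-* eats everything after the upstream version.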
The obvious bit of evidence is that the interfaces disappear. You can get more specific, though:
$ lspci -nn | grep Eth
...
2b:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
The -nn flag prints the numeric PCI vendor:device identifiers, here 15b3:6750, which will be important in a minute.
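If you would rather not pick the identifier out by eye, something like the following should pull it from the lspci output. This is only a sketch: pci_ids is a made-up helper name, and it assumes the [vvvv:dddd] bracket format shown in the line above.

```shell
# Print the vendor:device pair (e.g. 15b3:6750) from each lspci -nn line on stdin.
pci_ids() {
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}

# Intended use on the real machine:
#   lspci -nn | grep Eth | pci_ids
```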
This CentOS driver bug confirms that the kernel build flag CONFIG_MLX4_CORE_GEN2 is what is missing for (Gen 1) Mellanox card support. Was this build compiled with it?
$ grep MLX4 /boot/config-${rel}
CONFIG_MLX4_EN=m
CONFIG_MLX4_EN_DCB=y
CONFIG_MLX4_CORE=m
CONFIG_MLX4_DEBUG=y
# CONFIG_MLX4_CORE_GEN2 is not set
CONFIG_MLX4_INFINIBAND=m
OK, and on CentOS 7.9:
...
CONFIG_MLX4_CORE_GEN2=y
Right, we can take a hint.
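The same comparison can be scripted, which is handy when checking several boxes or kernels. A rough sketch, assuming the usual kernel config layout seen above (NAME=value lines, or "# NAME is not set"); config_state is a hypothetical helper, not a standard tool:

```shell
# Report how a kernel option is set in a config file: its value (y/m/...),
# "not set" for an explicit "# NAME is not set" line, or "absent".
config_state() {  # config_state <option> <config-file>
    if grep -q "^$1=" "$2"; then
        sed -n "s/^$1=//p" "$2"
    elif grep -q "^# $1 is not set" "$2"; then
        echo "not set"
    else
        echo "absent"
    fi
}

# Intended use on the real machine:
#   config_state CONFIG_MLX4_CORE_GEN2 "/boot/config-$(uname -r)"
```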
We can dig a bit deeper, though, just to be clear which driver support there is:
$ modinfo mlx4_core | grep -i 15b3 | grep -i 6750
(nada)
On CentOS 7.9:
$ modinfo mlx4_core | grep -i 15b3 | grep -i 6750
alias: pci:v000015B3d00006750sv*sd*bc*sc*i*
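The alias is just the lspci identifiers zero-padded to eight upper-case hex digits each, so the two greps can be replaced by a single exact match. A sketch under that assumption; pci_alias_prefix and driver_supports are made-up names, and modinfo -F alias is used to print only the alias fields:

```shell
# Turn "vendor:device" (as printed by lspci -nn) into the start of a PCI module alias.
pci_alias_prefix() {  # pci_alias_prefix 15b3:6750
    vendor=${1%%:*}
    device=${1##*:}
    printf 'pci:v%08Xd%08X\n' "0x${vendor}" "0x${device}"
}

# Succeeds if the named module advertises an alias for that card.
driver_supports() {  # driver_supports <module> <vendor:device>
    modinfo -F alias "$1" 2>/dev/null | grep -q "^$(pci_alias_prefix "$2")"
}

# Intended use on the real machine:
#   driver_supports mlx4_core 15b3:6750 && echo supported || echo "not supported"
```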
Kernel Modules
That driver bug page also includes a hint about how to build your own Mellanox driver.
Note
This is for CentOS Stream 8 -- YMMV
Cutting and pasting from the CentOS bug page we must:
$ sudo yum install audit-libs-devel binutils-devel elfutils-devel java-devel kabi-dw libcap-devel libcap-ng-devel llvm-toolset newt-devel pciutils-devel perl-devel python3-devel python3-docutils xmlto perl-ExtUtils-Embed
We're then instructed to run rpmbuild which suggests we install the following before it'll work:
$ sudo yum install rpm-build asciidoc bc bison libmnl-devel make ncurses-devel net-tools nss-tools numactl-devel openssl-devel perl-generators pesign
$ sudo yum install --enablerepo powertools dwarves libbabeltrace-devel libbpf-devel
Back on the trail, we now do the downloads and build as ourselves (not as root):
$ rpm -ivh https://vault.centos.org/8-stream/BaseOS/Source/SPackages/kernel-${rel%.*}.src.rpm
$ cd rpmbuild/SPECS/
$ rpmbuild -bp --target=$(uname -m) kernel.spec
$ cd ../BUILD/kernel-${rel%.*}/linux-${rel}
$ cp configs/kernel-${rel%-*}-$(uname -m).config .config
$ echo "CONFIG_MLX4_CORE_GEN2=y" >> .config
$ make -j 12 modules
...
Careful with the make -j N flag. On a reasonably sized box the load average went north of 3000 and the OOM killer kicked in big style. Something like make -j $(lscpu -p | grep '^[0-9]' | wc -l) modules (one job per CPU) should keep the machine maxed out for a few minutes.
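Since it was the OOM killer that bit, it may also be worth capping the job count by available memory as well as by CPU count. A minimal sketch: pick_jobs is a hypothetical helper, and the 2 GiB-per-job budget in the usage comment is purely an assumption, not a measured figure.

```shell
# Choose a job count: the smaller of the CPU count and (memory / per-job budget),
# but never less than 1.
pick_jobs() {  # pick_jobs <cpus> <mem-available-KiB> <KiB-per-job>
    n=$1
    by_mem=$(( $2 / $3 ))
    [ "$by_mem" -lt "$n" ] && n=$by_mem
    [ "$n" -lt 1 ] && n=1
    echo "$n"
}

# Intended use on the real machine (2097152 KiB = 2 GiB per job, an assumed budget):
#   mem_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
#   make -j "$(pick_jobs "$(nproc)" "$mem_kib" 2097152)" modules
```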
You might get away with:
$ make drivers/net/ethernet/mellanox/mlx4/mlx4_{core,en}.ko
which will save a lot of time, effort and carbon.
Installation
We now have the required kernel modules (somewhere) which we need to install, er, somewhere.
Looking at the ELRepo Files section for their MLX4 installation (it's down near the bottom of the page) it looks like we only need a couple of .ko files. Where they are put is a bit more interesting.
Hiding in the discussions https://forums.centos.org/viewtopic.php?t=11270 and https://access.redhat.com/discussions/3189552 we can glean that:
- the kernel will pick up drivers in the extra directory (and, indeed, subdirectories of that) and
- the weak-modules command will be run (by someone, see below) during an upgrade and any binary compatible drivers in the old kernel's extra directories will be linked into the new kernel's weak-updates directory.
- I assume, in fact, that even older kernels will also be rummaged through, so that any binary compatible driver is carried forward -- at least until the kernel it belongs to is removed.
ELRepo is suggesting:
$ sudo mkdir -p /lib/modules/${rel}/extra/mlx4
$ sudo cp drivers/net/ethernet/mellanox/mlx4/mlx4_{core,en}.ko /lib/modules/${rel}/extra/mlx4
and the (not especially obvious):
$ sudo depmod -a
to rebuild the module dependency map, so that the system knows it has new drivers.
Done! Go ahead and reboot.
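The copy step can be wrapped in a small function that refuses to do anything unless both modules were actually built. This is a sketch only: install_mlx4 is a made-up name, and the paths are the ones used in the commands above.

```shell
# Copy the two freshly built modules from a kernel build tree into
# <modules-dir>/extra/mlx4, failing early if either one is missing.
install_mlx4() {  # install_mlx4 <build-dir> <modules-dir>
    src=$1/drivers/net/ethernet/mellanox/mlx4
    dest=$2/extra/mlx4
    for ko in mlx4_core.ko mlx4_en.ko; do
        [ -f "$src/$ko" ] || { echo "missing $src/$ko" >&2; return 1; }
    done
    mkdir -p "$dest" && cp "$src/mlx4_core.ko" "$src/mlx4_en.ko" "$dest/"
}

# Intended use on the real machine, as root, from the top of the build tree:
#   install_mlx4 . "/lib/modules/${rel}" && depmod -a "${rel}"
```

Afterwards, modinfo -n mlx4_core prints the path of the file that will be loaded, which is a quick way to see which copy of the driver won.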
weak-modules
So, weak-modules will link our driver if it is compatible. What happens if it isn't? Erm, we don't get our driver. This, of course, happens on the first kernel update...
... finds no driver on reboot ...
# weak-modules --add-kernel --verbose --dry-run
Module mlx4_en.ko from kernel 4.18.0 is not compatible with kernel 4.18.0-305.el8.x86_64 in symbols:
... (long list of symbols) ...
Module mlx4_core.ko from kernel 4.18.0 is not compatible with kernel 4.18.0-305.el8.x86_64 in symbols:
... (another list including strcpy??) ...
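Before rebooting into a freshly installed kernel it may be worth checking whether a driver actually landed in that kernel's extra or weak-updates directory. A minimal sketch, assuming the mlx4 file names used above; mlx4_override_present is a made-up helper name:

```shell
# Print which override directory (extra or weak-updates) under a kernel's
# module tree holds an mlx4_core module; return non-zero if neither does.
mlx4_override_present() {  # mlx4_override_present <modules-dir>
    for sub in extra weak-updates; do
        if find "$1/$sub" -name 'mlx4_core.ko*' 2>/dev/null | grep -q .; then
            echo "$sub"
            return 0
        fi
    done
    return 1
}

# Intended use, where NEWREL is a placeholder for the new kernel's release string:
#   mlx4_override_present "/lib/modules/${NEWREL}" || echo "no mlx4 override: rebuild first"
```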