What should be done to a cluster running Bright Cluster Manager to deal with the Meltdown and Spectre vulnerabilities?
Action to take if running a Bright Cluster
If staying with vulnerable hardware, then software and firmware mitigation should be carried out as soon as patches become available. This requires following the firmware updates provided by the chip vendors and the software updates provided by the Linux distributors. Bright Computing does not provide these updates.
The overall method to carry out the updates is as follows:
* "yum update" is run on the head node and in the software images to get the latest kernel and software
* if there are any drivers that depend on the kernel, such as the OFED drivers, then these drivers should first be removed before carrying out the update. They should be re-installed after updating and booting the new kernel.
Details of the method are illustrated by an example, as tested on Bright Cluster Manager 7.3 installed on CentOS 7.3:
On the head node, run "yum update". For each software image, run:
yum update --installroot=/cm/images/<image-name>
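Where several software images exist, the same update can be looped over each image directory. The following is a minimal sketch rather than a Bright-provided tool: the function name update_images and the DRY_RUN variable are hypothetical, and it assumes that every subdirectory of /cm/images is a software image. With DRY_RUN left at its default of 1, the commands are only printed, so that the list can be reviewed first.

```shell
#!/bin/sh
# Sketch: run "yum update" against every software image directory.
# update_images and DRY_RUN are illustrative names, not part of Bright.
update_images() {
    image_root="${1:-/cm/images}"
    for image in "$image_root"/*/; do
        # Skip the unexpanded pattern when no image directory exists.
        [ -d "$image" ] || continue
        image="${image%/}"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: only show what would be executed.
            echo "yum update --installroot=$image"
        else
            yum update --installroot="$image"
        fi
    done
}
```

Calling update_images with no arguments prints one command per image under /cm/images. Setting DRY_RUN=0 before the call runs the updates for real.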
After installing the new kernel in the software images, the new kernel version must be set in Bright so that the nodes will boot using the updated kernel:
softwareimage use <image-name>
set kernelversion (hit tab twice for auto completion)
(wait until initrd is generated)
The yum update installs the latest kernel, which includes the fixes for these security issues. The head node and the compute nodes must therefore be rebooted to run the new kernel.
If kernel modules or applications that depend on the kernel are in use, then newer versions of those drivers or applications, compatible with the updated kernel, must be installed. For example, if using Mellanox OFED drivers for IB cards, then the OFED drivers must first be removed. Bright Cluster Manager is by default configured to block kernel updates if such drivers are installed. The new drivers should be installed after booting into the new kernel.
If the mlnx-ofedXX scripts from Bright are being used to install the Mellanox drivers, then the following commands can be run to remove the old drivers:
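After the reboot, "uname -r" shows which kernel a node is actually running. On kernels that expose the directory /sys/devices/system/cpu/vulnerabilities (mainline 4.15 and later; distribution kernels may or may not carry a backport), the kernel also reports its own mitigation status per vulnerability. The sketch below prints that report. The function name report_mitigations is hypothetical, and a missing directory is treated as "no report available", not as proof that the node is unpatched.

```shell
#!/bin/sh
# Sketch: print the kernel's own view of each CPU vulnerability.
# report_mitigations is an illustrative name. The directory argument
# defaults to the sysfs location used by kernels that provide it.
report_mitigations() {
    vuln_dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
    if [ ! -d "$vuln_dir" ]; then
        echo "no vulnerability report in $vuln_dir (kernel may predate it)"
        return 1
    fi
    for f in "$vuln_dir"/*; do
        [ -f "$f" ] || continue
        # One line per file: <name>: <status text from the kernel>
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Running report_mitigations on the head node and on a rebooted compute node gives a quick comparison of the two.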
/cm/local/apps/mlnx-ofedXX/current/bin/mlnx-ofedXX-install.sh -r -h
/cm/local/apps/mlnx-ofedXX/current/bin/mlnx-ofedXX-install.sh -r -s <image name>
To re-install the new drivers that are compatible with Bright, the relevant mlnx-ofedXX package must be installed first:
yum install mlnx-ofed42
The mlnx-ofedXX-install.sh script can then be run as follows:
/cm/local/apps/mlnx-ofed42/current/bin/mlnx-ofed42-install.sh -s <image name>
The XX in mlnx-ofedXX stands for the version of the OFED drivers, and should be replaced with the version that is to be installed. In the example above, the version is 42.
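Because the XX appears twice in the installer path, it can help to build the path from a single version value. The helper below is a small sketch: ofed_installer is a hypothetical name, and it only assembles the path following the pattern shown in the commands above, without checking that the package for that version is installed.

```shell
#!/bin/sh
# Sketch: derive the Bright mlnx-ofedXX installer path from a version.
# ofed_installer is an illustrative helper, not part of Bright.
ofed_installer() {
    version="$1"   # for example 42, matching the mlnx-ofed42 package
    echo "/cm/local/apps/mlnx-ofed${version}/current/bin/mlnx-ofed${version}-install.sh"
}
```

For example, "$(ofed_installer 42)" -s <image name> expands to the install command shown earlier for the mlnx-ofed42 package.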
The Meltdown and Spectre vulnerabilities are due to design flaws in chip hardware and microcode. Firmware and software updates to mitigate the issues are being made available by the chip vendors and Linux distribution vendors. The updates are expected to remain a work in progress for a long time, and to be contentious. The fixes are expected to reduce performance, perhaps by up to 30%.
New chips will eventually be released to address these vulnerabilities.
- Reference websites
Meltdown and Spectre websites: https://meltdownattack.com https://spectreattack.com
Meltdown paper: https://meltdownattack.com/meltdown.pdf
Spectre paper: https://spectreattack.com/spectre.pdf
- Excerpt summarizing the vulnerabilities, from https://www.itjungle.com/2018/01/10/power-systems-spectre-meltdown-threats/:
- Variant 1, CVE-2017-5753: Bounds check bypass. This vulnerability affects specific sequences within compiled applications, which must be addressed on a per-binary basis.
- Variant 2, CVE-2017-5715: Branch target injection. This variant may either be fixed by a CPU microcode update from the CPU vendor, or by applying a software mitigation technique called Retpoline to binaries where concern about information leakage is present. This mitigation may be applied to the operating system kernel, system programs and libraries, and individual software programs, as needed.
- Variant 3, CVE-2017-5754: Rogue data cache load. This may require patching the system’s operating system. For Linux there is a patchset called KPTI (Kernel Page Table Isolation) that helps mitigate Variant 3. Other operating systems may implement similar protections – check with your vendor for specifics.
- Illustration of controversy over approach in mitigation of issues: