diff options
Diffstat (limited to 'roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst')
-rw-r--r-- | roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst | 228 |
1 files changed, 228 insertions, 0 deletions
diff --git a/roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst b/roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst new file mode 100644 index 000000000..6591e27d5 --- /dev/null +++ b/roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst @@ -0,0 +1,228 @@ +.. _skiboot-6.3-rc3: + +skiboot-6.3-rc3 +=============== + +skiboot v6.3-rc3 was released on Thursday May 2nd 2019. It is the third +release candidate of skiboot 6.3, which will become the new stable release +of skiboot following the 6.2 release, first released December 14th 2018. + +Skiboot 6.3 will mark the basis for op-build v2.3. I expect to tag the final +skiboot 6.3 in the next week (I also predicted this last time, so take my +predictions with a large amount of sodium). + +skiboot v6.3-rc3 contains all bug fixes as of :ref:`skiboot-6.0.19`, +and :ref:`skiboot-6.2.3` (the currently maintained +stable releases). + +For how the skiboot stable releases work, see :ref:`stable-rules` for details. + +Over :ref:`skiboot-6.3-rc2`, we have the following changes: + + +- Expose PNOR Flash partitions to host MTD driver via devicetree + + This makes it possible for the host to directly address each + partition without requiring each application to directly parse + the FFS headers. This has been in use for some time already to + allow BOOTKERNFW partition updates from the host. + + All partitions except BOOTKERNFW are marked readonly. + + The BOOTKERNFW partition is currently exclusively used by the TalosII platform + +- Write boot progress to LPC port 80h + + This is an adaptation of what we currently do for op_display() on FSP + machines, inventing an encoding for what we can write into the single + byte at LPC port 80h. + + Port 80h is often used on x86 systems to indicate boot progress/status + and dates back a decent amount of time. Since a byte isn't exactly very + expressive for everything that can go on (and wrong) during boot, it's + all about compromise. + + Some systems (such as Zaius/Barreleye G2) have a physical dual 7 segment + display that display these codes. So far, this has only been driven by + hostboot (see hostboot commit 90ec2e65314c). + +- Write boot progress to LPC ports 81 and 82 + + There's a thought to write more extensive boot progress codes to LPC + ports 81 and 82 to supplement/replace any reliance on port 80. + + We want to still emit port 80 for platforms like Zaius and Barreleye + that have the physical display. Ports 81 and 82 can be monitored by a + BMC though. + +- Copy and convert Romulus descriptors to Talos + + Talos II has some hardware differences from Romulus, therefore + we cannot guarantee Talos II == Romulus in skiboot. Copy and + slightly modify the Romulus files for Talos II. + +- npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default + + V100 GPUs are known to violate NVLink2 protocol in some cases (one is when + memory was accessed by the CPU and they by GPU using so called block + linear mapping) and issue double probes to NPU which can cope with this + problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable Probe.I.MO + snarfing a cp_m") is not set in the CQ_SM Misc Config register #0. + If the bit is set (which is the case today), NPU issues the machine + check stop. + + The snarfing feature is designed to detect 2 probes in flight and combine + them into one. + + This adds a new "opal-npu2-snarf-cpm" nvram variable which controls + CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check + stop from happening. + + This disables snarfing by default as otherwise a broken GPU driver can + crash the entire box even when a GPU is passed through to a guest. + This provides a dial to allow regression tests (might be useful for + a bare metal). To enable snarfing, the user needs to run: :: + + sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable + + and reboot the host system. + +- hw/npu2: Show name of opencapi error interrupts +- core/pci: Use PHB io-base-location by default for PHB slots + + On witherspoon only the GPU slots and the three pluggable PCI slots + (SLOT0, 1, 2) have platform defined slot names. For builtin devices such + as the SATA controller or the PLX switch that fans out to the GPU slots + we have no location codes which some people consider an issue. + + This patch address the problem by making the ibm,slot-location-code for + the root port device default to the ibm,io-base-location-code which is + typically the location code for the system itself. + + e.g. :: + + pciex@600c3c0100000/ibm,loc-code + "UOPWR.0000000-Node0-Proc0" + + pciex@600c3c0100000/pci@0/ibm,loc-code + "UOPWR.0000000-Node0-Proc0" + + pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code + "UOPWR.0000000-Node0" + + The PHB node, and the root complex nodes have a loc code of the + processor they are attached to, while the usb-xhci device under the + root port has a location code of the system itself. + +- hw/phb4: Read ibm,loc-code from PBCQ node + + On P9 the PBCQs are subdivided by stacks which implement the PCI Express + logic. When phb4 was forked from phb3 most of the properties that were + in the pbcq node moved into the stack node, but ibm,loc-code was not one + of them. This patch fixes the phb4 init sequence to read the base + location code from the PBCQ node (parent of the stack node) rather than + the stack node itself. +- hw/xscom: add missing P9P chip name +- asm/head: balance branches to avoid link stack predictor mispredicts + + The Linux wrapper for OPAL call and return is arranged like this: :: + + __opal_call: + mflr r0 + std r0,PPC_STK_LROFF(r1) + LOAD_REG_ADDR(r11, opal_return) + mtlr r11 + hrfid -> OPAL + + opal_return: + ld r0,PPC_STK_LROFF(r1) + mtlr r0 + blr + + When skiboot returns to Linux, it branches to LR (i.e., opal_return) + with a blr. This unbalances the link stack predictor and will cause + mispredicts back up the return stack. +- external/mambo: also invoke readline for the non-autorun case +- asm/head.S: set POWER9 radix HID bit at entry + + When running in virtual memory mode, the radix MMU hid bit should not + be changed, so set this in the initial boot SPR setup. + + As a side effect, fast reboot also has HID0:RADIX bit set by the + shared spr init, so no need for an explicit call. +- opal-prd: Fix memory leak in is-fsp-system check +- opal-prd: Check malloc return value +- hw/phb4: Squash the IO bridge window + + The PCI-PCI bridge spec says that bridges that implement an IO window + should hardcode the IO base and limit registers to zero. + Unfortunately, these registers only define the upper bits of the IO + window and the low bits are assumed to be 0 for the base and 1 for the + limit address. As a result, setting both to zero can be mis-interpreted + as a 4K IO window. + + This patch fixes the problem the same way PHB3 does. It sets the IO base + and limit values to 0xf000 and 0x1000 respectively which most software + interprets as a disabled window. + + lspci before patch: :: + + 0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode]) + I/O behind bridge: 00000000-00000fff + + lspci after patch: :: + + 0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode]) + I/O behind bridge: None + +- build: link with --orphan-handling=warn + + The linker can warn when the linker script does not explicitly place + all sections. These orphan sections are placed according to + heuristics, which may not always be desirable. Enable this warning. +- build: -fno-asynchronous-unwind-tables + + skiboot does not use unwind tables, this option saves about 100kB, + mostly from .text. +- hw/xscom: Enable sw xstop by default on p9 + + This was disabled at some point during bringup to make life easier for + the lab folks trying to debug NVLink issues. This hack really should + have never made it out into the wild though, so we now have the + following situation occuring in the field: + + 1) A bad happens + 2) The host kernel recieves an unrecoverable HMI and calls into OPAL to + request a platform reboot. + 3) OPAL rejects the reboot attempt and returns to the kernel with + OPAL_PARAMETER. + 4) Kernel panics and attempts to kexec into a kdump kernel. + + A side effect of the HMI seems to be CPUs becoming stuck which results + in the initialisation of the kdump kernel taking a extremely long time + (6+ hours). It's also been observed that after performing a dump the + kdump kernel then crashes itself because OPAL has ended up in a bad + state as a side effect of the HMI. + + All up, it's not very good so re-enable the software checkstop by + default. If people still want to turn it off they can using the nvram + override. +- opal/hmi: Initialize the hmi event with old value of TFMR. + + Do this before we fix TFAC errors. Otherwise the event at host console + shows no thread error reported in TFMR register. + + Without this patch the console event show TFMR with no thread error: + (DEC parity error TFMR[59] injection) :: + + [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered] + [ 53.737596] Error detail: Timer facility experienced an error + [ 53.737611] HMER: 0840000000000000 + [ 53.737621] TFMR: 3212000870e04000 + + After this patch it shows old TFMR value on host console: :: + + [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered] + [ 2302.267305] Error detail: Timer facility experienced an error + [ 2302.267320] HMER: 0840000000000000 + [ 2302.267330] TFMR: 3212000870e14010 |