diff --git a/roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst b/roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst
new file mode 100644
index 000000000..6591e27d5
--- /dev/null
+++ b/roms/skiboot/doc/release-notes/skiboot-6.3-rc3.rst
@@ -0,0 +1,228 @@
+.. _skiboot-6.3-rc3:
+
+skiboot-6.3-rc3
+===============
+
+skiboot v6.3-rc3 was released on Thursday May 2nd 2019. It is the third
+release candidate of skiboot 6.3, which will become the new stable release
+of skiboot following the 6.2 release, first released December 14th 2018.
+
+Skiboot 6.3 will mark the basis for op-build v2.3. I expect to tag the final
+skiboot 6.3 in the next week (I also predicted this last time, so take my
+predictions with a large amount of sodium).
+
+skiboot v6.3-rc3 contains all bug fixes as of :ref:`skiboot-6.0.19`,
+and :ref:`skiboot-6.2.3` (the currently maintained
+stable releases).
+
+For details on how the skiboot stable releases work, see :ref:`stable-rules`.
+
+Over :ref:`skiboot-6.3-rc2`, we have the following changes:
+
+
+- Expose PNOR Flash partitions to host MTD driver via devicetree
+
+ This makes it possible for the host to directly address each
+ partition without requiring each application to directly parse
+ the FFS headers. This has been in use for some time already to
+ allow BOOTKERNFW partition updates from the host.
+
+ All partitions except BOOTKERNFW are marked readonly.
+
+ The BOOTKERNFW partition is currently used exclusively by the Talos II platform.
+
+- Write boot progress to LPC port 80h
+
+ This is an adaptation of what we currently do for op_display() on FSP
+ machines, inventing an encoding for what we can write into the single
+ byte at LPC port 80h.
+
+ Port 80h is often used on x86 systems to indicate boot progress/status
+ and dates back a decent amount of time. Since a byte isn't exactly very
+ expressive for everything that can go on (and wrong) during boot, it's
+ all about compromise.
+
+ Some systems (such as Zaius/Barreleye G2) have a physical dual 7-segment
+ display that displays these codes. So far, this has only been driven by
+ hostboot (see hostboot commit 90ec2e65314c).
+
+- Write boot progress to LPC ports 81 and 82
+
+ There's a thought to write more extensive boot progress codes to LPC
+ ports 81 and 82 to supplement/replace any reliance on port 80.
+
+ We want to still emit port 80 for platforms like Zaius and Barreleye
+ that have the physical display. Ports 81 and 82 can be monitored by a
+ BMC though.
+
+- Copy and convert Romulus descriptors to Talos
+
+ Talos II has some hardware differences from Romulus, therefore
+ we cannot guarantee Talos II == Romulus in skiboot. Copy and
+ slightly modify the Romulus files for Talos II.
+
+- npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default
+
+ V100 GPUs are known to violate the NVLink2 protocol in some cases (one is when
+ memory was accessed by the CPU and then by the GPU using a so-called block
+ linear mapping) and issue double probes to the NPU, which can cope with this
+ problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable Probe.I.MO
+ snarfing a cp_m") is not set in the CQ_SM Misc Config register #0.
+ If the bit is set (which is the case today), the NPU issues a machine
+ check stop.
+
+ The snarfing feature is designed to detect 2 probes in flight and combine
+ them into one.
+
+ This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
+ CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
+ stop from happening.
+
+ This disables snarfing by default, as otherwise a broken GPU driver can
+ crash the entire box even when a GPU is passed through to a guest.
+ The nvram variable provides a dial to allow regression tests (which might
+ be useful on bare metal). To enable snarfing, the user needs to run: ::
+
+ sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable
+
+ and reboot the host system.
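
As a rough illustration of the behaviour described above, the following Python sketch models how snarfing could be gated on a set of ``key=value`` nvram entries. The helper names ``parse_nvram_config`` and ``snarf_cpm_enabled`` are hypothetical and are not skiboot's implementation; only the variable name, the ``enable`` value, and the disabled-by-default behaviour come from the notes above.

```python
def parse_nvram_config(entries):
    """Turn an iterable of 'key=value' strings into a dict (hypothetical helper)."""
    config = {}
    for entry in entries:
        key, sep, value = entry.partition("=")
        if sep:  # ignore malformed entries that have no '='
            config[key.strip()] = value.strip()
    return config


def snarf_cpm_enabled(config):
    """Snarfing stays off unless opal-npu2-snarf-cpm is explicitly 'enable'."""
    return config.get("opal-npu2-snarf-cpm") == "enable"


# With no variable set, the default (snarfing disabled) applies.
print(snarf_cpm_enabled(parse_nvram_config([])))
# After the nvram command above, snarfing would be enabled on the next boot.
print(snarf_cpm_enabled(parse_nvram_config(["opal-npu2-snarf-cpm=enable"])))
```

Treating any value other than ``enable`` as "disabled" keeps the safe behaviour as the fallback in this sketch.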
+
+- hw/npu2: Show name of opencapi error interrupts
+- core/pci: Use PHB io-base-location by default for PHB slots
+
+ On witherspoon only the GPU slots and the three pluggable PCI slots
+ (SLOT0, 1, 2) have platform defined slot names. For builtin devices such
+ as the SATA controller or the PLX switch that fans out to the GPU slots
+ we have no location codes which some people consider an issue.
+
+ This patch addresses the problem by making the ibm,slot-location-code for
+ the root port device default to the ibm,io-base-location-code which is
+ typically the location code for the system itself.
+
+ e.g. ::
+
+ pciex@600c3c0100000/ibm,loc-code
+ "UOPWR.0000000-Node0-Proc0"
+
+ pciex@600c3c0100000/pci@0/ibm,loc-code
+ "UOPWR.0000000-Node0-Proc0"
+
+ pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code
+ "UOPWR.0000000-Node0"
+
+ The PHB node and the root complex nodes have the location code of the
+ processor they are attached to, while the usb-xhci device under the
+ root port has the location code of the system itself.
+
+- hw/phb4: Read ibm,loc-code from PBCQ node
+
+ On P9 the PBCQs are subdivided by stacks which implement the PCI Express
+ logic. When phb4 was forked from phb3 most of the properties that were
+ in the pbcq node moved into the stack node, but ibm,loc-code was not one
+ of them. This patch fixes the phb4 init sequence to read the base
+ location code from the PBCQ node (parent of the stack node) rather than
+ the stack node itself.
+- hw/xscom: add missing P9P chip name
+- asm/head: balance branches to avoid link stack predictor mispredicts
+
+ The Linux wrapper for OPAL call and return is arranged like this: ::
+
+ __opal_call:
+ mflr r0
+ std r0,PPC_STK_LROFF(r1)
+ LOAD_REG_ADDR(r11, opal_return)
+ mtlr r11
+ hrfid -> OPAL
+
+ opal_return:
+ ld r0,PPC_STK_LROFF(r1)
+ mtlr r0
+ blr
+
+ When skiboot returns to Linux, it branches to LR (i.e., opal_return)
+ with a blr. This unbalances the link stack predictor and will cause
+ mispredicts back up the return stack.
+- external/mambo: also invoke readline for the non-autorun case
+- asm/head.S: set POWER9 radix HID bit at entry
+
+ When running in virtual memory mode, the radix MMU hid bit should not
+ be changed, so set this in the initial boot SPR setup.
+
+ As a side effect, fast reboot also has HID0:RADIX bit set by the
+ shared spr init, so no need for an explicit call.
+- opal-prd: Fix memory leak in is-fsp-system check
+- opal-prd: Check malloc return value
+- hw/phb4: Squash the IO bridge window
+
+ The PCI-PCI bridge spec says that bridges that implement an IO window
+ should hardcode the IO base and limit registers to zero.
+ Unfortunately, these registers only define the upper bits of the IO
+ window and the low bits are assumed to be 0 for the base and 1 for the
+ limit address. As a result, setting both to zero can be mis-interpreted
+ as a 4K IO window.
+
+ This patch fixes the problem the same way PHB3 does. It sets the IO base
+ and limit values to 0xf000 and 0x1000 respectively which most software
+ interprets as a disabled window.
+
+ lspci before patch: ::
+
+ 0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
+ I/O behind bridge: 00000000-00000fff
+
+ lspci after patch: ::
+
+ 0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
+ I/O behind bridge: None
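
The arithmetic behind the two lspci outputs can be sketched in Python. This is a simplified model, assuming a 16-bit I/O window in which the upper four bits of the 8-bit I/O Base and I/O Limit registers hold address bits 15:12; the function name ``decode_io_window`` is made up for illustration, and the upper 16-bit base/limit registers are ignored.

```python
def decode_io_window(io_base_reg, io_limit_reg):
    """Decode 8-bit I/O Base / I/O Limit register values.

    Returns (base, limit) as inclusive addresses, or None when the
    window is disabled (base above limit).
    """
    # Upper nibble supplies address bits 15:12; low base bits are assumed 0.
    base = (io_base_reg & 0xF0) << 8
    # Low 12 bits of the limit are assumed to be all ones.
    limit = ((io_limit_reg & 0xF0) << 8) | 0x0FFF
    if base > limit:
        return None
    return (base, limit)


# Both registers zero: decoded as a 4K window, 0x0000-0x0fff.
print(decode_io_window(0x00, 0x00))
# Base 0xf000 and limit 0x1000: base is above limit, so the window is disabled.
print(decode_io_window(0xF0, 0x10))
```

The first call corresponds to the "before" output (``00000000-00000fff``) and the second to the "after" output (``None``).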
+
+- build: link with --orphan-handling=warn
+
+ The linker can warn when the linker script does not explicitly place
+ all sections. These orphan sections are placed according to
+ heuristics, which may not always be desirable. Enable this warning.
+- build: -fno-asynchronous-unwind-tables
+
+ skiboot does not use unwind tables, so this option saves about 100kB,
+ mostly from .text.
+- hw/xscom: Enable sw xstop by default on p9
+
+ This was disabled at some point during bringup to make life easier for
+ the lab folks trying to debug NVLink issues. This hack really should
+ have never made it out into the wild though, so we now have the
+ following situation occurring in the field:
+
+ 1) Something bad happens.
+ 2) The host kernel receives an unrecoverable HMI and calls into OPAL to
+ request a platform reboot.
+ 3) OPAL rejects the reboot attempt and returns to the kernel with
+ OPAL_PARAMETER.
+ 4) Kernel panics and attempts to kexec into a kdump kernel.
+
+ A side effect of the HMI seems to be CPUs becoming stuck which results
+ in the initialisation of the kdump kernel taking an extremely long time
+ (6+ hours). It's also been observed that after performing a dump the
+ kdump kernel then crashes itself because OPAL has ended up in a bad
+ state as a side effect of the HMI.
+
+ All up, it's not very good so re-enable the software checkstop by
+ default. If people still want to turn it off they can using the nvram
+ override.
+- opal/hmi: Initialize the hmi event with old value of TFMR.
+
+ Do this before we fix TFAC errors. Otherwise the event on the host console
+ shows no thread error reported in the TFMR register.
+
+ Without this patch the console event shows TFMR with no thread error
+ (DEC parity error TFMR[59] injection): ::
+
+ [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
+ [ 53.737596] Error detail: Timer facility experienced an error
+ [ 53.737611] HMER: 0840000000000000
+ [ 53.737621] TFMR: 3212000870e04000
+
+ After this patch it shows the old TFMR value on the host console: ::
+
+ [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
+ [ 2302.267305] Error detail: Timer facility experienced an error
+ [ 2302.267320] HMER: 0840000000000000
+ [ 2302.267330] TFMR: 3212000870e14010
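
To see which TFMR bits differ between the two logs above, the values can be XORed and the set bits listed. The Python sketch below assumes IBM bit numbering (bit 0 is the most significant bit of the 64-bit register), which is consistent with the ``TFMR[59]`` notation used above; the function name ``changed_bits_ibm`` is made up for illustration.

```python
def changed_bits_ibm(old, new, width=64):
    """Return the IBM-numbered bit positions (0 = MSB) that differ."""
    diff = old ^ new
    return [
        width - 1 - lsb              # convert LSB-0 index to IBM (MSB-0) index
        for lsb in range(width - 1, -1, -1)
        if (diff >> lsb) & 1
    ]


tfmr_without_patch = 0x3212000870E04000
tfmr_with_patch = 0x3212000870E14010

print(changed_bits_ibm(tfmr_without_patch, tfmr_with_patch))
```

Bit 59, the injected DEC parity error, is among the bits that only show up in the value logged with the patch applied.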