author | 2023-10-10 14:33:42 +0000 | |
---|---|---|
committer | 2023-10-10 14:33:42 +0000 | |
commit | af1a266670d040d2f4083ff309d732d648afba2a (patch) | |
tree | 2fc46203448ddcc6f81546d379abfaeb323575e9 /roms/skiboot/doc/release-notes/skiboot-6.1-rc1.rst | |
parent | e02cda008591317b1625707ff8e115a4841aa889 (diff) |
Change-Id: Iaf8d18082d3991dec7c0ebbea540f092188eb4ec
Diffstat (limited to 'roms/skiboot/doc/release-notes/skiboot-6.1-rc1.rst')
-rw-r--r-- | roms/skiboot/doc/release-notes/skiboot-6.1-rc1.rst | 466 |
1 files changed, 466 insertions, 0 deletions
diff --git a/roms/skiboot/doc/release-notes/skiboot-6.1-rc1.rst b/roms/skiboot/doc/release-notes/skiboot-6.1-rc1.rst
new file mode 100644
index 000000000..3ae436d9e
--- /dev/null
+++ b/roms/skiboot/doc/release-notes/skiboot-6.1-rc1.rst
@@ -0,0 +1,466 @@
+.. _skiboot-6.1-rc1:
+
+skiboot-6.1-rc1
+===============
+
+skiboot v6.1-rc1 was released on Friday June 22nd 2018. It is the first
+release candidate of skiboot 6.1, which will become the new stable release
+of skiboot following the 6.0 release, first released May 11th 2018.
+
+Skiboot 6.1 will mark the basis for op-build v2.1.
+
+skiboot v6.1-rc1 contains all bug fixes as of :ref:`skiboot-6.0.4`,
+and :ref:`skiboot-5.4.9` (the currently maintained
+stable releases).
+
+For how the skiboot stable releases work, see :ref:`stable-rules`.
+
+This release contains a lot of small cleanups and fixes all over the place,
+which is possibly a sign that we've shipped our big POWER9 GA release and
+now get to breathe for a moment to look at what we ended up with.
+Since this is a really small incremental release, there are unlikely to be
+many release candidates.
+
+Over skiboot 6.0, we have the following changes:
+
+General changes and bug fixes
+-----------------------------
+
+- GCC8 build fixes
+- Add prepare_hbrt_update to hbrt interfaces
+
+  Add placeholder support for prepare_hbrt_update call into
+  hostboot runtime (opal-prd) code. This interface is only
+  called as part of a concurrent code update on a FSP based
+  system.
+- cpu: Clear PCR SPR in opal_reinit_cpus()
+
+  Currently if Linux boots with a non-zero PCR, things can go bad where
+  some early userspace programs can take illegal instructions. This is
+  being fixed in Linux, but in the meantime, we should clean up in
+  skiboot also.
+- pci: Fix PCI_DEVICE_ID()
+
+  The vendor ID is 16 bits not 8. This error leaves the top of the vendor
+  ID in the bottom bits of the device ID, which resulted in e.g. a failure
+  to run the PCI quirk for the AST VGA device.
+- Quieten console output on boot
+
+  We print out a whole bunch of things on boot, most of which aren't
+  interesting, so we should *not* print them instead.
+
+  Printing things like what CPUs we found and what PCI devices we found
+  *are* useful, so continue to do that. But we don't need to splat out
+  a bunch of things that are always going to be true.
+- core/console: fix deadlock when printing with console lock held
+
+  Some debugging options will print while the console lock is held,
+  which is why the console lock is taken as a recursive lock.
+  However console_write calls __flush_console, which will drop and
+  re-take the lock non-recursively in some cases.
+
+  Just set con_need_flush and return from __flush_console if we are
+  holding the console lock already.
+
+  This stack usage message (taken with this patch applied) could lead
+  to a deadlock without this: ::
+
+    CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
+    CPU 0000 Backtrace:
+    S: 0000000031c03370 R: 00000000300cb808 .list_check_node+0x1c
+    S: 0000000031c03410 R: 00000000300cb910 .list_check+0x38
+    S: 0000000031c034b0 R: 00000000300190ac .try_lock_caller+0xb8
+    S: 0000000031c03540 R: 00000000300192e0 .lock_caller+0x80
+    S: 0000000031c03600 R: 0000000030012c70 .__flush_console+0x134
+    S: 0000000031c036d0 R: 00000000300130cc .console_write+0x68
+    S: 0000000031c03780 R: 00000000300347bc .vprlog+0xc8
+    S: 0000000031c03970 R: 0000000030034844 ._prlog+0x50
+    S: 0000000031c03a00 R: 00000000300364a4 .log_simple_error+0x74
+    S: 0000000031c03b90 R: 000000003004ab48 .occ_pstates_init+0x184
+    S: 0000000031c03d50 R: 000000003001480c .load_and_boot_kernel+0x38c
+    S: 0000000031c03e30 R: 000000003001571c .main_cpu_entry+0x62c
+    S: 0000000031c03f00 R: 0000000030002700 boot_entry+0x1c0
+- opal-prd: Do not error out on first failure for soft/hard offline.
+
+  The memory errors (CEs and UEs) that are detected as part of background
+  memory scrubbing are reported by PRD asynchronously to opal-prd along with
+  affected memory ranges. hservice_memory_error() converts these ranges into
+  page granularity before hooking them up to soft/hard offline-ing
+  infrastructure.
+
+  But the current implementation of hservice_memory_error() does not hook up
+  all the pages to soft/hard offline-ing if any of the page offline actions
+  fails. For example, hard offline can fail for:
+
+  - Pages that are not part of buddy managed pool.
+  - Pages that are reserved by kernel using memblock_reserved()
+  - Pages that are in use by kernel.
+
+  But for the pages that are in use by user space application, the hard
+  offline marks the page as hwpoison, sends SIGBUS signal to kill the
+  affected application as recovery action and returns success.
+
+  Hence, it is possible that some of the pages in that memory range are in
+  use by application or free. By stopping on first error we lose the
+  opportunity to hwpoison the subsequent pages which may be free or in use by
+  application. This patch fixes this issue.
+- libflash/blocklevel_write: Fix missing error handling
+
+  Caught by scan-build, we seem to trap the errors in rc, but
+  not take any recovery action during blocklevel_write.
+
+I2C
+^^^
+- p8-i2c: fix wrong request status when a reset is needed
+
+  If the bus is found in error state when starting a new request, the
+  engine is reset and we enter recovery. However, once complete, the
+  reset operation shows a status of complete in the status register. So
+  any badly-timed call to check_status() will think the current top
+  request is complete, even though it hasn't run yet.
+
+  So don't update any request status while we are in recovery, as
+  nothing useful for the request is supposed to happen in that state.
+- p8-i2c: Remove force reset
+
+  Force reset was added as an attempt to work around some issues with TPM
+  devices locking up their I2C bus. In that particular case the problem
+  was that the device would hold the SCL line down permanently due to a
+  device firmware bug. The force reset doesn't actually do anything to
+  alleviate the situation here, it just happens to reset the internal
+  master state enough to make the I2C driver appear to work until
+  something tries to access the bus again.
+
+  On P9 systems with secure boot enabled there is the added problem
+  of the "diagnostic mode" not being supported on I2C masters A, B, C and
+  D. Diagnostic mode allows the SCL and SDA lines to be driven directly
+  by software. Without this, force reset is impossible to implement.
+
+  This patch removes the force reset functionality entirely since:
+
+  a) it doesn't do what it's supposed to, and
+  b) it's butt ugly code
+
+  Additionally, turn p8_i2c_reset_engine() into p8_i2c_reset_port().
+  There's no need to reset every port on a master in response to an
+  error that occurred on a specific port.
+- libstb/i2c-driver: Bump max timeout
+
+  We have observed some TPMs clock stretching the I2C bus for significant
+  amounts of time when processing commands. The same TPMs also have
+  errata that can result in permanently locking up a bus in response to
+  an I2C transaction they don't understand. Use an excessively long
+  timeout to prevent this in the field.
+- hdata: Add TPM timeout workaround
+
+  Set the default timeout for any bus containing a TPM to one second. This
+  is needed to work around a bug in the firmware of certain TPMs that will
+  clock stretch the I2C port for up to a second. Additionally, when the
+  TPM is clock stretching it responds to a STOP condition on the bus by
+  bricking itself. Clearing this error requires a hard power cycle of the
+  system since the TPM is powered by standby power.
+- p8-i2c: Allow a per-port default timeout
+
+  Add support for setting a default timeout for the I2C port to the
+  device-tree. This is consumed by skiboot.
+
+IPMI Watchdog
+^^^^^^^^^^^^^
+- ipmi-watchdog: Support handling re-initialization
+
+  Watchdog resets can return an error code from the BMC indicating that
+  the BMC watchdog was not initialized. Currently we abort skiboot due to
+  a missing error handler. This patch implements handling
+  re-initialization for the watchdog, automatically saving the last
+  watchdog set values and re-issuing them if needed.
+- ipmi-watchdog: The stop action should disable reset
+
+  Otherwise it is possible for the reset timer to elapse and trigger the
+  watchdog to wake back up. This doesn't affect the behavior of the
+  system since we are providing a NONE action to the BMC. However we would
+  like to avoid the action from taking place if possible.
+- ipmi-watchdog: Add a flag to determine if we are still ticking
+
+  This makes it easier for future changes to ensure that the watchdog
+  stops ticking and doesn't requeue itself for execution in the
+  background. This way it is safe for resets to be performed after the
+  ticks are assumed to be stopped and it won't start the timer again.
+- ipmi-watchdog: (prepare for) not disabling at shutdown
+
+  The op-build linux kernel has been configured to support the ipmi
+  watchdog. This driver will always handle the watchdog by either leaving
+  it enabled if configured, or by disabling it during module load if no
+  configuration is provided. This increases the coverage of the watchdog
+  during the boot process. The watchdog should no longer be disabled at
+  any point during skiboot execution.
+
+  We're not enabling this by default yet as people can (and do, at least in
+  development) mix and match old BOOTKERNEL with new skiboot and we don't
+  want to break that too obviously.
+- ipmi-watchdog: Don't reset the watchdog twice
+
+  There is no clarification for why this change was needed, but presumably
+  this is due to a buggy BMC implementation where the Watchdog Set command
+  was processed concurrently or after the initial Watchdog Reset. This
+  inversion would cause the watchdog to stop since the DONT_STOP bit was
+  not set. Since we are now using the DONT_STOP bit during initialization,
+  the watchdog should not be stopped even if an inversion occurs.
+- ipmi-watchdog: Make it possible to set DONT_STOP
+
+  The IPMI standard supports setting a DONT_STOP bit during a Watchdog
+  Set operation. Most of the time we don't want to stop the Watchdog when
+  updating the settings so we should be using this bit. This patch makes
+  it possible for callers of set_wdt to prevent the watchdog from being
+  stopped. This only changes the behavior of the watchdog during the
+  initial settings update when initializing skiboot. The watchdog is no
+  longer disabled and then immediately re-enabled.
+- ipmi-watchdog: WD_POWER_CYCLE_ACTION -> WD_RESET_ACTION
+
+  The IPMI specification denotes that action 0x1 is Host Reset and 0x3 is
+  Host Power Cycle. Use the correct name for Reset in our watchdog code.
+
+
+POWER8 platforms
+----------------
+
+- astbmc: Enable mbox depending on scratch reg
+
+  P8 boxes can opt in for mbox pnor support if they set the scratch
+  register bit to indicate it is supported.
+
+Simulator platforms
+-------------------
+- plat/qemu: add PNOR support
+
+  To access the PNOR, OPAL/skiboot drives the BMC SPI controller using
+  the iLPC2AHB device of the BMC SuperIO controller and accesses the
+  flash contents using the LPC FW address space on which the PNOR is
+  remapped.
+
+  The QEMU PowerNV machine now integrates such models (SuperIO
+  controller, iLPC2AHB device) and also a pseudo Aspeed SoC AHB memory
+  space populated with the SPI controller registers (same model as for
+  ARM). The AHB window giving access to the contents of the BMC SPI
+  controller flash modules is mapped on the LPC FW address space.
+
+  The change should be compatible with machines without PNOR support.
+- external/mambo: Add support for readline if it exists
+
+  Add support for tclreadline package if it is present.
+  This patch loads the package and uses it when the
+  simulation stops for any reason.
+
+
+FSP based platforms
+-------------------
+
+- Disable fast reboot on FSP IPL side change
+
+  If FSP changes next IPL side, then disable fast reboot.
+
+  Sample output: ::
+
+    [ 620.196442259,5] FSP: Got sysparam update, param ID 0xf0000007
+    [ 620.196444501,5] CUPD: FW IPL side changed. Disable fast reboot
+    [ 620.196445389,5] CUPD: Next IPL side : perm
+- fsp/console: Always establish OPAL console API backend
+
+  Currently we only call set_opal_console() to establish the backend
+  used by the OPAL console API if we find at least one FSP serial
+  port in HDAT.
+
+  On systems where there is none (IPMI only), we fail to set it,
+  causing the console code to try to use the dummy console causing
+  an assertion failure during boot due to clashing on the device-tree
+  node names.
+
+  So always set it if an FSP is present.
+
+AST BMC based platforms
+-----------------------
+
+- AMI BMC: use 0x3a as OEM command
+
+  The 0x3a OEM command is for IBM commands, while 0x32 was for AMI ones.
+  Sometime in the P8 timeframe, AMI BMCs were changed to listen for our
+  commands on either 0x32 or 0x3a. Since 0x3a is the direction forward,
+  we'll use that, as P9 machines with AMI BMCs probably also want these
+  to work, and let's not bet that 0x32 will continue to be okay.
+- astbmc: Set romulus BMC type to OpenBMC
+- platform/astbmc: Do not delete compatible property
+
+  P9 onwards OPAL is building device tree for BMC based system using
+  HDAT. We are populating bmc/compatible node with bmc version. Hence
+  do not delete this property.
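The PCI_DEVICE_ID() fix listed under the general changes is what allowed the AST VGA quirk to run again on these BMC based platforms. A minimal Python sketch of the bit arithmetic follows. It assumes the pre-fix macro shifted the combined vendor/device word by 8 instead of 16 (the release note does not show the old code), and the helper names and the AST IDs 0x1a03/0x2000 are illustrative rather than taken from skiboot.

```python
# Illustrative IDs (assumption): ASPEED vendor 0x1a03, AST VGA device 0x2000.
AST_VENDOR_ID = 0x1A03
AST_VGA_DEVICE_ID = 0x2000


def pci_vendor_id(vdid):
    """Vendor ID lives in the low 16 bits of the combined word."""
    return vdid & 0xFFFF


def pci_device_id_buggy(vdid):
    """Hypothetical pre-fix behaviour: treats the vendor ID as 8 bits wide."""
    return (vdid >> 8) & 0xFFFF


def pci_device_id_fixed(vdid):
    """Device ID lives in the high 16 bits, so shift past all 16 vendor bits."""
    return (vdid >> 16) & 0xFFFF


# Combined word as read from PCI config space offset 0.
vdid = (AST_VGA_DEVICE_ID << 16) | AST_VENDOR_ID

print(hex(pci_vendor_id(vdid)))
print(hex(pci_device_id_buggy(vdid)))
print(hex(pci_device_id_fixed(vdid)))
```

With these values the hypothetical buggy helper returns 0x1a, the top byte of the vendor ID, so a quirk keyed on device ID 0x2000 would never match.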
+
+Utilities
+---------
+- external/xscom-utils: Add python library for xscom access
+
+  Patch adds a simple python library module for xscom access.
+  It directly manipulates the '/access' file for scom read
+  and write from debugfs 'scom' directory.
+
+  Example on how to generate a getscom using this module:
+
+  .. code-block:: python
+
+    from adu_scoms import *
+    getscom = GetSCom()
+    getscom.parse_args()
+    getscom.run_command()
+
+  Sample output for above getscom.py:
+
+  .. code-block:: console
+
+    # ./getscom.py -l
+    Chip ID  | Rev   | Chip type
+    ---------|-------|-----------
+    00000008 | DD2.0 | P9 (Nimbus) processor
+    00000000 | DD2.0 | P9 (Nimbus) processor
+- ffspart: Don't require user to create blank partitions manually
+
+  Add '--allow-empty' which allows the filename for a given partition to
+  be blank. If set ffspart will set that part of the PNOR file 'blank' and
+  set ECC bits if required.
+  Without this option behaviour is unchanged and ffspart will return an
+  error if it cannot find the partition file.
+- pflash: Use correct prefix when installing
+
+  pflash uses lowercase prefix when running make install in its
+  directory, but uppercase PREFIX when running it in shared. Use
+  lowercase everywhere.
+
+  With this the OpenBMC bitbake recipe can drop an out of tree patch it's
+  been carrying for years.
+
+
+POWER9
+------
+
+- occ-sensor: Avoid using uninitialised struct cpu_thread
+
+  When adding the sensors in occ_sensors_init, if the type is not
+  OCC_SENSOR_LOC_CORE, then the loop to find 'c' will not be executed.
+  Then c->pir is used for both of the add_sensor_node calls below.
+
+  This provides a default value of 0 instead.
+- NX: Add NX coprocessor init opal call
+
+  The read offset (4:11) in Receive FIFO control register is incremented
+  by FIFO size whenever a CRB is read by NX. But the index in RxFIFO has to
+  match with the corresponding entry in FIFO maintained by VAS in kernel.
+  VAS entry is reset to 0 when opening the receive window during driver
+  initialization. So when NX842 is reloaded or in kexec boot, there is a
+  possibility of mismatch between RxFIFO control register and VAS entries
+  in kernel. It could cause CRB failure / timeout from NX.
+
+  This patch adds nx_coproc_init opal call for kernel to initialize
+  readOffset (4:11) and Queued (15:23) in RxFIFO control register.
+- SLW: Remove stop1_lite and stop2_lite
+
+  stop1_lite has been removed since it adds no additional benefit
+  over stop0_lite. stop2_lite has been removed since currently it adds
+  minimal benefit over stop2. However, the benefit is eclipsed by the time
+  required to ungate the clocks.
+
+  Moreover, Lite states don't give up the SMT resources, and can potentially
+  have a performance impact on sibling threads.
+
+  Since current OSs (Linux) aren't smart enough to make good decisions
+  with these stop states, we're (temporarily) removing them from what
+  we expose to the OS, the idea being to bring them back in a new
+  DT representation so that only an OS that knows what to do will
+  do things with them.
+- cpu: Use STOP1 on POWER9 for idle/sleep inside OPAL
+
+  The current code requests STOP3, which means it gets STOP2 in practice.
+
+  STOP2 has proven to occasionally be unreliable depending on FW
+  version and chip revision, it also requires a functional CME,
+  so instead, let's use STOP1. The difference is rather minimal
+  for something that is only used a few seconds during boot.
+
+NPU2 (NVLink2 and OpenCAPI)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- npu2: Reset NVLinks on hot reset
+
+  This effectively fences GPU RAM on GPU reset so the host system
+  does not have to crash every time we stop a KVM guest with a GPU
+  passed through.
+- npu2-opencapi: reduce number of retries to train the link
+
+  We've been reliably training the opencapi link on the first attempt
+  for quite a while. Furthermore, if it doesn't train on the first
+  attempt, retries haven't been that useful. So let's reduce the number
+  of attempts we do to train the link.
+
+  2 retries = 3 attempts to train.
+
+  Each (failed) training sequence costs about 3 seconds.
+- opal/hmi: Display correct chip id while printing NPU FIRs.
+
+  HMIs for NPU xstops are broadcast to all chips. All cores on all the
+  chips receive HMI. HMI handler correctly identifies and extracts the
+  NPU FIR details from affected chip, but while printing FIR data it
+  prints chip id and location code details of this_cpu()->chip_id which
+  may not be correct. This patch fixes this issue.
+- npu2-opencapi: Fix link state to report link down
+
+  The PHB callback 'get_link_state' is always reporting the link width,
+  irrespective of the link status and even when the link is down. It is
+  causing too much work (and failures) when the PHB is probed during pci
+  init.
+  The fix is to look at the link status first and report the link as
+  down when appropriate.
+- npu2-opencapi: Cleanup traces printed during link training
+
+  Now that links may train in parallel, traces shown during training can
+  be all mixed up. So add a prefix to all the traces to clearly identify
+  the chip and link the trace refers to: ::
+
+    OCAPI[<chip id>:<link id>]: this is a very useful message
+
+  The lower-level hardware procedures (npu2-hw-procedures.c) also print
+  traces which would need work. But that code is being reworked to be
+  better integrated with opencapi and nvidia, so leave it alone for now.
+- npu2-opencapi: Train links on fundamental reset
+
+  Reorder our link training steps so that they are executed on
+  fundamental reset instead of during the initial setup. Skiboot always
+  calls a fundamental reset on all the PHBs during pci init.
+
+  It is done through a state machine, similarly to what is done for
+  'real' PHBs.
+
+  This is the first step for a longer term goal to be able to trigger an
+  adapter reset from linux. We'll need the reset callbacks of the PHB to
+  be defined. We have to handle the various delays differently, since a
+  linux thread shouldn't stay stuck waiting in opal for too long.
+- npu2-opencapi: Rework adapter reset
+
+  Rework the code to reset the opencapi adapter a bit:
+
+  - make clearer which i2c pin is resetting which device
+  - break the reset operation in smaller chunks. This is really to
+    prepare for a future patch.
+
+  No functional changes.
+- npu2-opencapi: Use presence detection
+
+  Presence detection is not part of the opencapi specification. So each
+  platform may choose to implement it the way it wants.
+
+  All current platforms implement it through an i2c device where we can
+  query a pin to know if a device is connected or not. ZZ and Zaius have
+  a similar design and even use the same i2c information and pin
+  numbers.
+  However, presence detection on older ZZ planar (older than v4) doesn't
+  work, so we don't activate it for now, until our lab systems are
+  upgraded and it's better tested.
+
+  Presence detection on witherspoon is still being worked on. It's
+  shaping up to be quite different, so we may have to revisit the topic
+  in a later patch.
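The retry arithmetic from the opencapi link-training change in this section (2 retries = 3 attempts, about 3 seconds per failed attempt) can be sketched in Python as follows. The function names and the `try_train` callback are hypothetical and only illustrate the counting; they are not skiboot APIs.

```python
def train_link(try_train, retries=2):
    """Call try_train() once, then up to `retries` more times.

    try_train is a hypothetical callback returning True when the link
    trained. Returns a (trained, attempts_made) tuple.
    """
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        if try_train():
            return True, attempts
    return False, attempts


def worst_case_seconds(retries=2, seconds_per_failed_attempt=3):
    """Approximate time spent when every training attempt fails."""
    return (retries + 1) * seconds_per_failed_attempt


# A link that never trains uses all three attempts.
print(train_link(lambda: False))
print(worst_case_seconds())
```

Under these assumptions a link that never trains costs roughly 9 seconds in total, which is why trimming the retry count shortens boot on systems with a bad or absent adapter.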