Diffstat (limited to 'roms/skiboot/doc/opal-api/opal-handle-hmi-98-166.rst')
-rw-r--r-- | roms/skiboot/doc/opal-api/opal-handle-hmi-98-166.rst | 155 |
1 files changed, 155 insertions, 0 deletions
diff --git a/roms/skiboot/doc/opal-api/opal-handle-hmi-98-166.rst b/roms/skiboot/doc/opal-api/opal-handle-hmi-98-166.rst
new file mode 100644
index 000000000..5b75144dc
--- /dev/null
+++ b/roms/skiboot/doc/opal-api/opal-handle-hmi-98-166.rst
@@ -0,0 +1,155 @@
+Hypervisor Maintenance Interrupt (HMI)
+======================================
+
+A Hypervisor Maintenance Interrupt usually reports errors related to processor
+recovery/checkstop, NX/NPU checkstop and the Timer facility. The hypervisor then
+takes this opportunity to analyze and recover from some of these errors.
+The hypervisor takes assistance from the OPAL layer to handle and recover from HMI.
+After handling an HMI, the OPAL layer sends a summary of the error report and the
+status of the recovery action using an HMI event. See :ref:`opal-messages` for the
+HMI event structure under the :ref:`OPAL_MSG_HMI_EVT` section.
+
+HMI is thread specific. The reason for an HMI is available in a per-thread
+Hypervisor Maintenance Exception Register (HMER). The Hypervisor Maintenance
+Exception Enable Register (HMEER) is per core. Bits from the HMER need to
+be enabled by the corresponding bits in the HMEER in order to cause an HMI.
+
+Several interrupt reasons are routed in parallel to each of the thread
+specific copies. Each thread can only clear bits in its own HMER. The OPAL
+handler on each thread clears the respective bit in the HMER register
+after handling the error.
+
+List of errors that cause HMI
+=============================
+
+  - CPU Errors
+
+    - Processor Core checkstop
+    - Processor retry recovery
+    - NX/NPU/CAPP checkstop
+
+  - Timer facility Errors
+
+    - ChipTOD Errors
+
+      - ChipTOD sync check and parity errors
+      - ChipTOD configuration register parity errors
+      - ChipTOD topology failover
+
+    - Timebase (TB) errors
+
+      - TB parity/residue error
+      - TFMR parity and firmware control error
+      - DEC/HDEC/PURR/SPURR parity errors
+
+HMI handling
+============
+
+Core/NX/NPU checkstops are reported as a malfunction alert (HMER bit 0).
+The OPAL handler scans through the Fault Isolation Register (FIR) for each
+core/NX/NPU to detect the exact reason for the checkstop and reports it back
+to the host along with the disposition.
+
+A processor recovery is reported through HMER bits 2, 3 and 11. These are
+just informational messages and no extra recovery is required.
+
+Timer facility errors are reported through HMER bit 4. These are all
+recoverable errors. The exact reason for the errors is stored in the
+Timer Facility Management Register (TFMR). Some of the Timer facility
+errors affect TB and some of them affect TOD. TOD is a per-chip
+Time-Of-Day logic that holds the actual time value of the chip and
+communicates with every TOD in the system to achieve a synchronized
+timer value within a system. TB is a per-core 64-bit register that derives its
+value from ChipTOD at startup and is then periodically incremented
+by the STEP signal provided by the TOD. In a multi-socket system, TODs are
+always configured as master/backup TOD under primary/secondary
+topology configuration respectively.
+
+A TB error generates an HMI on all threads of the affected core. TB errors,
+except DEC/HDEC/PURR/SPURR parity errors, cause TB to stop running,
+making it invalid. As part of TB recovery, the OPAL HMI handler synchronizes
+with all threads, clears the TB errors and then re-syncs the TB with the TOD
+value, putting it back in running state.
+
+TOD errors generate an HMI on every core/thread of the affected chip. The reason
+for TOD errors is stored in the TOD ERROR register (0x40030). As part of the
+recovery, the OPAL HMI handler clears the TOD error and then requests a new TOD
+value from another running chipTOD in the system. Sometimes, if a primary
+chipTOD is in error, it may need a TOD topology switch to recover from the
+error. A TOD topology switch basically makes a backup the new active master.
+
+.. _OPAL_HANDLE_HMI:
+
+OPAL_HANDLE_HMI
+===============
+
+.. code-block:: c
+
+   #define OPAL_HANDLE_HMI 98
+
+   int64_t opal_handle_hmi(void);
+
+
+Superseded by :ref:`OPAL_HANDLE_HMI2`, meaning that :ref:`OPAL_HANDLE_HMI`
+should only be called if :ref:`OPAL_HANDLE_HMI2` is not available.
+
+Since :ref:`OPAL_HANDLE_HMI2` has been available since the start of POWER9
+systems being supported, if you only target POWER9 and above, you can
+assume the presence of :ref:`OPAL_HANDLE_HMI2`.
+
+.. _OPAL_HANDLE_HMI2:
+
+OPAL_HANDLE_HMI2
+================
+
+.. code-block:: c
+
+   #define OPAL_HANDLE_HMI2 166
+
+   int64_t opal_handle_hmi2(__be64 *out_flags);
+
+When the host OS gets a Hypervisor Maintenance Interrupt (HMI), it must call
+:ref:`OPAL_HANDLE_HMI` or :ref:`OPAL_HANDLE_HMI2`. :ref:`OPAL_HANDLE_HMI`
+is the old interface. :ref:`OPAL_HANDLE_HMI2` is a newer OPAL call
+that returns direct info to the OS. It returns a 64-bit flag mask currently
+set to provide info about which timer facilities were lost, and whether an
+event was generated. This information helps the OS take the appropriate
+actions.
+
+In the case where the OPAL HMI handler is unable to recover from TOD or TB errors,
+it flags ``OPAL_HMI_FLAGS_TOD_TB_FAIL`` to indicate to the OS that TB is
+dead. This information can then be used by the OS to make sure that
+functions relying on the TB value (e.g. udelay()) are aware of TB not ticking.
+This prevents the OS from getting stuck or hanging on its way down the panic path.
+
+
+Parameters
+^^^^^^^^^^
+
+.. code-block:: c
+
+   __be64 *out_flags;
+
+Returns the 64-bit flag mask that provides info about which timer facilities
+were lost, and whether an event was generated.
+
+.. code-block:: c
+
+   /* OPAL_HANDLE_HMI2 out_flags */
+   enum {
+      OPAL_HMI_FLAGS_TB_RESYNC    = (1ull << 0), /* Timebase has been resynced */
+      OPAL_HMI_FLAGS_DEC_LOST     = (1ull << 1), /* DEC lost, needs to be reprogrammed */
+      OPAL_HMI_FLAGS_HDEC_LOST    = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
+      OPAL_HMI_FLAGS_TOD_TB_FAIL  = (1ull << 3), /* TOD/TB recovery failed. */
+      OPAL_HMI_FLAGS_NEW_EVENT    = (1ull << 63), /* An event has been created */
+   };
+
+.. _OPAL_HMI_FLAGS_TOD_TB_FAIL:
+
+OPAL_HMI_FLAGS_TOD_TB_FAIL
+  The Time of Day (TOD) / Timebase facility has failed. This is probably fatal
+  for the OS, and requires the OS to be very careful to not call any function
+  that may rely on it, usually as it heads down a ``panic()`` code path.
+  This code path should be :ref:`OPAL_CEC_REBOOT2` with the ``OPAL_REBOOT_PLATFORM_ERROR``
+  option. Details of the failure are likely delivered as part of HMI events if
+  ``OPAL_HMI_FLAGS_NEW_EVENT`` is set.
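
As a rough illustration of how a host OS might interpret ``out_flags``, the sketch below decodes the mask into a set of follow-up actions. The flag values are copied from the ``OPAL_HANDLE_HMI2`` enum in the patch. Everything else (``struct hmi_actions``, ``decode_hmi2_flags()``, ``be64_to_host()`` and ``host_to_be64()``) is a hypothetical helper invented for this example and is not part of OPAL or of any OS; a real kernel would use its own byte-order helpers and its own recovery logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values as documented for the OPAL_HANDLE_HMI2 out_flags mask. */
#define OPAL_HMI_FLAGS_TB_RESYNC   (1ull << 0)
#define OPAL_HMI_FLAGS_DEC_LOST    (1ull << 1)
#define OPAL_HMI_FLAGS_HDEC_LOST   (1ull << 2)
#define OPAL_HMI_FLAGS_TOD_TB_FAIL (1ull << 3)
#define OPAL_HMI_FLAGS_NEW_EVENT   (1ull << 63)

/* Hypothetical summary of follow-up work; not an OPAL-defined type. */
struct hmi_actions {
    bool tb_resynced;     /* timebase was resynced by firmware */
    bool reprogram_dec;   /* DEC was lost and must be reprogrammed */
    bool reprogram_hdec;  /* HDEC was lost and must be reprogrammed */
    bool tb_dead;         /* TOD/TB recovery failed, avoid TB users */
    bool fetch_hmi_event; /* an HMI event message has been created */
};

/* Read a big-endian 64-bit value byte by byte, whatever the host order. */
uint64_t be64_to_host(uint64_t be)
{
    uint8_t b[8];
    uint64_t v = 0;

    memcpy(b, &be, sizeof(b));
    for (size_t i = 0; i < sizeof(b); i++)
        v = (v << 8) | b[i];
    return v;
}

/* Inverse of be64_to_host(); used here only to build example inputs. */
uint64_t host_to_be64(uint64_t v)
{
    uint8_t b[8];
    uint64_t be;

    for (int i = 7; i >= 0; i--) {
        b[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
    memcpy(&be, b, sizeof(be));
    return be;
}

/* Decode the raw big-endian out_flags value into per-flag booleans. */
struct hmi_actions decode_hmi2_flags(const uint64_t *be_out_flags)
{
    assert(be_out_flags != NULL);

    uint64_t flags = be64_to_host(*be_out_flags);
    struct hmi_actions a = {
        .tb_resynced     = (flags & OPAL_HMI_FLAGS_TB_RESYNC) != 0,
        .reprogram_dec   = (flags & OPAL_HMI_FLAGS_DEC_LOST) != 0,
        .reprogram_hdec  = (flags & OPAL_HMI_FLAGS_HDEC_LOST) != 0,
        .tb_dead         = (flags & OPAL_HMI_FLAGS_TOD_TB_FAIL) != 0,
        .fetch_hmi_event = (flags & OPAL_HMI_FLAGS_NEW_EVENT) != 0,
    };
    return a;
}
```

A caller would pass the address of the value filled in by ``opal_handle_hmi2()`` to ``decode_hmi2_flags()`` and check ``tb_dead`` first, since the other fields matter little once the timebase can no longer be trusted.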