Diffstat (limited to 'roms/skiboot/doc/pci.rst')
-rw-r--r--  roms/skiboot/doc/pci.rst  183
1 file changed, 183 insertions, 0 deletions
diff --git a/roms/skiboot/doc/pci.rst b/roms/skiboot/doc/pci.rst
new file mode 100644
index 000000000..d18d35d8f
--- /dev/null
+++ b/roms/skiboot/doc/pci.rst
@@ -0,0 +1,183 @@
+PCI
+===
+
+Debugging
+---------
+
+There exist a couple of NVRAM options for enabling extra debug functionality
+to help debug PCI issues. These are not ABI and may be changed or removed at
+**any** time.
+
+Verbose EEH
+^^^^^^^^^^^
+
+::
+
+ nvram -p ibm,skiboot --update-config pci-eeh-verbose=true
+
+Disable EEH MMIO
+^^^^^^^^^^^^^^^^
+
+::
+
+  nvram -p ibm,skiboot --update-config pci-eeh-mmio=disabled
+
+
+Check for RX errors after link training
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Some PHB4 PHYs can get stuck in a bad state where they are constantly
+retraining the link. This happens transparently to skiboot and Linux
+but will cause PCIe to be slow. Resetting the PHB4 clears the
+problem.
+
+We can detect this case by looking at the RX error count when we
+check for link stability. Skiboot does this in the link optimal
+check: if RX errors are occurring, we retrain the link
+irrespective of the chip revision or card.
+
+Normally when this problem occurs, the RX error count is maxed out at
+255. When there is no problem, the count is 0. We chose 8 as the
+default maximum RX error value to give us some margin for a few benign
+errors. There is also an NVRAM knob that can be used to set the error
+threshold at which the link is retrained, e.g. ::
+
+ nvram -p ibm,skiboot --update-config phb-rx-err-max=8
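+
+The check itself boils down to something like the following sketch; the
+helper shown here is an assumption for illustration, not skiboot's
+actual code: ::
+
+  #include <stdint.h>
+  #include <stdbool.h>
+
+  /* rx_errs: per-link RX error count sampled when checking link
+   * stability; rx_err_max: threshold, default 8, overridable via the
+   * phb-rx-err-max NVRAM option. */
+  static bool link_needs_retrain(uint32_t rx_errs, uint32_t rx_err_max)
+  {
+          /* A stuck PHY typically saturates the counter at 255, while a
+           * healthy link reports 0, so a small threshold still leaves
+           * margin for the occasional benign error. */
+          return rx_errs >= rx_err_max;
+  }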
+
+Retrain link if degraded
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+On P9 Scale Out (Nimbus) DD2.0 and Scale Up (Cumulus) DD1.0 (and
+below) the PCIe PHY can lock up, causing training issues. This can cause
+a degradation in speed or width in ~5% of training cases (depending on
+the card). This is fixed in later chip revisions. This issue can also
+cause PCIe links to not train at all, but this case is already
+handled.
+
+There is code in skiboot that checks if the PCIe link has trained optimally
+and if not, does a full PHB reset (to fix the PHY lockup) and retrains
+the link.
+
+One complication is that some devices are known to train degraded unless
+device-specific configuration is performed. Because of this, we only
+retrain when the device is in a whitelist. All devices in the current
+whitelist have been tested on a P9DSU/Boston, ZZ and Witherspoon.
+
+We always gather information on the link and print it in the logs even
+if the card is not in the whitelist.
+
+For testing purposes, there is an NVRAM option to retry training on all
+PCIe cards and all P9 chips when a degraded link is detected. The
+option is 'pci-retry-all=true', which can be set using: ::
+
+ nvram -p ibm,skiboot --update-config pci-retry-all=true
+
+This option may increase the boot time if used on a badly behaving
+card.
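+
+The decision can be pictured roughly as in the sketch below; the helper
+names and structure are assumptions, not skiboot's actual
+implementation: ::
+
+  #include <stdint.h>
+  #include <stdbool.h>
+
+  /* A link is degraded if it trained below the best speed (GEN) or
+   * width (lanes) that the slot and device are capable of. */
+  static bool link_is_degraded(uint8_t speed, uint8_t width,
+                               uint8_t best_speed, uint8_t best_width)
+  {
+          return speed < best_speed || width < best_width;
+  }
+
+  /* Only whitelisted devices are retrained by default, since some cards
+   * legitimately train degraded; pci-retry-all=true overrides this. */
+  static bool should_retrain(bool degraded, bool whitelisted,
+                             bool pci_retry_all)
+  {
+          return degraded && (whitelisted || pci_retry_all);
+  }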
+
+Maximum link speed
+^^^^^^^^^^^^^^^^^^
+
+This option limits the maximum PCIe link speed. It was useful during
+bringup on P9 DD1. ::
+
+  nvram -p ibm,skiboot --update-config pcie-max-link-speed=4
+
+
+Ric Mata Mode
+^^^^^^^^^^^^^
+
+This mode (for PHB4) will trace the training process closely. This activates
+as soon as PERST is deasserted and produces human-readable output of
+the process.
+
+It also traces the PCIe Link Training and Status State Machine (LTSSM)
+and reports details of the link speed and width.
+
+Output looks a bit like this ::
+
+ [ 1.096995141,3] PHB#0000[0:0]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
+ [ 1.102849137,3] PHB#0000[0:0]: TRACE:0x0000102101000000 11ms presence GEN1:x16:polling
+ [ 1.104341838,3] PHB#0000[0:0]: TRACE:0x0000182101000000 14ms training GEN1:x16:polling
+ [ 1.104357444,3] PHB#0000[0:0]: TRACE:0x00001c5101000000 14ms training GEN1:x16:recovery
+ [ 1.104580394,3] PHB#0000[0:0]: TRACE:0x00001c5103000000 14ms training GEN3:x16:recovery
+ [ 1.123259359,3] PHB#0000[0:0]: TRACE:0x00001c5104000000 51ms training GEN4:x16:recovery
+ [ 1.141737656,3] PHB#0000[0:0]: TRACE:0x0000144104000000 87ms presence GEN4:x16:L0
+ [ 1.141752318,3] PHB#0000[0:0]: TRACE:0x0000154904000000 87ms trained GEN4:x16:L0
+ [ 1.141757964,3] PHB#0000[0:0]: TRACE: Link trained.
+ [ 1.096834019,3] PHB#0001[0:1]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
+ [ 1.105578525,3] PHB#0001[0:1]: TRACE:0x0000102101000000 17ms presence GEN1:x16:polling
+ [ 1.112763075,3] PHB#0001[0:1]: TRACE:0x0000183101000000 31ms training GEN1:x16:config
+ [ 1.112778956,3] PHB#0001[0:1]: TRACE:0x00001c5081000000 31ms training GEN1:x08:recovery
+ [ 1.113002083,3] PHB#0001[0:1]: TRACE:0x00001c5083000000 31ms training GEN3:x08:recovery
+ [ 1.114833873,3] PHB#0001[0:1]: TRACE:0x0000144083000000 35ms presence GEN3:x08:L0
+ [ 1.114848832,3] PHB#0001[0:1]: TRACE:0x0000154883000000 35ms trained GEN3:x08:L0
+ [ 1.114854650,3] PHB#0001[0:1]: TRACE: Link trained.
+
+Enabled via NVRAM: ::
+
+ nvram -p ibm,skiboot --update-config pci-tracing=true
+
+Named after the person the output of this mode is typically sent to.
+
+
+**WARNING**: The documentation below **urgently needs updating** and is *woefully* incomplete.
+
+IODA PE Setup Sequences
+-----------------------
+
+(**WARNING**: this was rescued from old internal documentation. Needs verification)
+
+To setup basic PE mappings, the host performs this basic sequence:
+
+For ``ibm,opal-ioda2`` PHBs, prior to allocating PHB resources to PEs, the
+host must allocate memory for the PE structures and then call
+``opal_pci_set_phb_table_memory( phb_id, rtt_addr, ivt_addr, ivt_len, rrba_addr, peltv_addr)`` to define them to the PHB. OPAL returns ``OPAL_UNSUPPORTED`` status for ``ibm,opal-ioda`` PHBs.
+
+The host calls ``opal_pci_set_pe( phb_id, pe_number, bus, dev, func, validate_mask, bus_mask, dev_mask, func_mask)`` to map a PE to a PCI RID or range of RIDs in the same PE domain.
+
+The host calls ``opal_pci_set_peltv(phb_id, parent_pe, child_pe, state)`` to
+set a parent PELT vector bit for the child PE argument to 1 (a child of the
+parent) or 0 (not in the parent PE domain).
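+
+A minimal sketch of the above sequence for an ``ibm,opal-ioda2`` PHB is
+shown below. The prototypes simply mirror the signatures quoted above;
+the parameter types, the mask values and the wrapper itself are
+illustrative assumptions, not verified against the current OPAL API: ::
+
+  #include <stdint.h>
+
+  /* Prototypes mirroring the signatures quoted above; assumptions only. */
+  extern int64_t opal_pci_set_phb_table_memory(uint64_t phb_id,
+          uint64_t rtt_addr, uint64_t ivt_addr, uint64_t ivt_len,
+          uint64_t rrba_addr, uint64_t peltv_addr);
+  extern int64_t opal_pci_set_pe(uint64_t phb_id, uint64_t pe_number,
+          uint64_t bus, uint64_t dev, uint64_t func, uint8_t validate_mask,
+          uint8_t bus_mask, uint8_t dev_mask, uint8_t func_mask);
+  extern int64_t opal_pci_set_peltv(uint64_t phb_id, uint32_t parent_pe,
+          uint32_t child_pe, uint8_t state);
+
+  static int64_t setup_basic_pe(uint64_t phb_id, uint64_t rtt_addr,
+                                uint64_t ivt_addr, uint64_t ivt_len,
+                                uint64_t rrba_addr, uint64_t peltv_addr,
+                                uint32_t parent_pe, uint32_t pe_number,
+                                uint64_t bus, uint64_t dev, uint64_t func)
+  {
+          int64_t rc;
+
+          /* Define the host-allocated PE structures to the PHB
+           * (returns OPAL_UNSUPPORTED on ibm,opal-ioda PHBs). */
+          rc = opal_pci_set_phb_table_memory(phb_id, rtt_addr, ivt_addr,
+                                             ivt_len, rrba_addr, peltv_addr);
+          if (rc)
+                  return rc;
+
+          /* Map the RID bus/dev/func to this PE; the masks shown here
+           * validate the full RID and are placeholders. */
+          rc = opal_pci_set_pe(phb_id, pe_number, bus, dev, func,
+                               0x7, 0xff, 0x1f, 0x7);
+          if (rc)
+                  return rc;
+
+          /* Mark pe_number as a child in parent_pe's PELT vector. */
+          return opal_pci_set_peltv(phb_id, parent_pe, pe_number, 1);
+  }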
+
+IODA MMIO Setup Sequences
+-------------------------
+
+(**WARNING**: this was rescued from old internal documentation. Needs verification)
+
+
+The host calls ``opal_pci_phb_mmio_enable( phb_id, window_type, window_num, 0x0)`` to disable the MMIO window.
+
+The host calls ``opal_pci_set_phb_mmio_window( phb_id, mmio_window, starting_real_address, starting_pci_address, segment_size)`` to change the MMIO window location in PCI and/or processor real address space, or to change the segment size -- and corresponding overall window size -- of a particular MMIO window.
+
+The host calls ``opal_pci_map_pe_mmio_window( pe_number, mmio_window, segment_number)`` to map PEs to window segments, for each segment mapped to each PE.
+
+The host calls ``opal_pci_phb_mmio_enable( phb_id, window_type, window_num, 0x1)`` to enable the MMIO window.
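+
+A minimal sketch of disabling, moving/resizing, segment-mapping and
+re-enabling one window is shown below, again just following the call
+names quoted above; the parameter types and the single-segment mapping
+are illustrative assumptions: ::
+
+  #include <stdint.h>
+
+  /* Prototypes mirroring the signatures quoted above; assumptions only. */
+  extern int64_t opal_pci_phb_mmio_enable(uint64_t phb_id,
+          uint16_t window_type, uint16_t window_num, uint16_t enable);
+  extern int64_t opal_pci_set_phb_mmio_window(uint64_t phb_id,
+          uint16_t mmio_window, uint64_t starting_real_address,
+          uint64_t starting_pci_address, uint64_t segment_size);
+  extern int64_t opal_pci_map_pe_mmio_window(uint64_t pe_number,
+          uint16_t mmio_window, uint16_t segment_number);
+
+  static int64_t remap_mmio_window(uint64_t phb_id, uint16_t window_type,
+                                   uint16_t window_num, uint64_t real_addr,
+                                   uint64_t pci_addr, uint64_t seg_size,
+                                   uint64_t pe_number, uint16_t segment)
+  {
+          int64_t rc;
+
+          /* Disable the window while it is being changed. */
+          rc = opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x0);
+          if (rc)
+                  return rc;
+
+          /* Move/resize the window in PCI and processor real address space. */
+          rc = opal_pci_set_phb_mmio_window(phb_id, window_num, real_addr,
+                                            pci_addr, seg_size);
+          if (rc)
+                  return rc;
+
+          /* Map one segment of the window to the PE (repeat per segment). */
+          rc = opal_pci_map_pe_mmio_window(pe_number, window_num, segment);
+          if (rc)
+                  return rc;
+
+          /* Re-enable the window. */
+          return opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x1);
+  }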
+
+IODA MSI Setup Sequences
+------------------------
+
+(**WARNING**: this was rescued from old internal documentation. Needs verification)
+
+To set up MSIs:
+
+1. For ``ibm,opal-ioda`` PHBs, the host chooses an MVE for a PE to use and calls ``opal_pci_set_mve( phb_id, mve_number, pe_number)`` to set up the MVE for the PE number. OPAL treats this call as a NOP and returns ``OPAL_SUCCESS`` status for ``ibm,opal-ioda2`` PHBs.
+2. The host chooses an XIVE to use with a PE and calls
+ a. ``opal_pci_set_xive_pe( phb_id, xive_number, pe_number)`` to authorize that PE to signal that XIVE as an interrupt. The host must call this function for each XIVE assigned to a particular PE, but may use this call for all XIVEs prior to calling ``opal_pci_set_mve()`` to bind the PE XIVEs to an MVE. For MSI conventional, the host must bind a unique MVE for each sequential set of 32 XIVEs.
+ b. The host forms the interrupt_source_number from the combination of the device tree MSI property base BUID and XIVE number, as an input to ``opal_set_xive(interrupt_source_number, server_number, priority)`` and ``opal_get_xive(interrupt_source_number, server_number, priority)`` to set or return the server and priority numbers within an XIVE.
+ c. ``opal_get_msi_64[32](phb_id, mve_number, xive_num, msi_range, msi_address, message_data)`` to determine the MSI DMA address (32 or 64 bit) and message data value for that xive.
+
+ For MSI conventional, the host uses this for each sequential power of 2 set of 1 to 32 MSIs, to determine the MSI DMA address and starting message data value for that MSI range. For MSI-X, the host calls this uniquely for each MSI interrupt with an msi_range input value of 1.
+3. For ``ibm,opal-ioda`` PHBs, once the MVE and XIVRs are set up for a PE, the host calls ``opal_pci_set_mve_enable( phb_id, mve_number, state)`` to enable that MVE to be a valid target of MSI DMAs. The host may also call this function to disable an MVE when changing PE domains or states.
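+
+A minimal sketch of this flow for a single MSI-X style interrupt
+(msi_range = 1) is shown below, following the call names quoted above.
+The parameter types, the way the interrupt source number is formed and
+the wrapper itself are illustrative assumptions: ::
+
+  #include <stdint.h>
+
+  /* Prototypes mirroring the signatures quoted above; assumptions only. */
+  extern int64_t opal_pci_set_mve(uint64_t phb_id, uint32_t mve_number,
+          uint32_t pe_number);
+  extern int64_t opal_pci_set_xive_pe(uint64_t phb_id, uint32_t xive_number,
+          uint32_t pe_number);
+  extern int64_t opal_set_xive(uint32_t interrupt_source_number,
+          uint16_t server_number, uint8_t priority);
+  extern int64_t opal_get_msi_64(uint64_t phb_id, uint32_t mve_number,
+          uint32_t xive_num, uint8_t msi_range, uint64_t *msi_address,
+          uint32_t *message_data);
+  extern int64_t opal_pci_set_mve_enable(uint64_t phb_id, uint32_t mve_number,
+          uint32_t state);
+
+  static int64_t setup_one_msi(uint64_t phb_id, uint32_t mve, uint32_t pe,
+                               uint32_t xive, uint32_t msi_base_buid,
+                               uint16_t server, uint8_t priority,
+                               uint64_t *msi_addr, uint32_t *msi_data)
+  {
+          int64_t rc;
+
+          /* ibm,opal-ioda only: bind the MVE to the PE (NOP on ioda2). */
+          rc = opal_pci_set_mve(phb_id, mve, pe);
+          if (rc)
+                  return rc;
+
+          /* Authorize the PE to signal this XIVE. */
+          rc = opal_pci_set_xive_pe(phb_id, xive, pe);
+          if (rc)
+                  return rc;
+
+          /* Route the interrupt; the source number combines the device
+           * tree MSI base BUID and the XIVE number (shown here as a
+           * simple OR, which is an assumption). */
+          rc = opal_set_xive(msi_base_buid | xive, server, priority);
+          if (rc)
+                  return rc;
+
+          /* Look up the MSI DMA address and message data for this XIVE. */
+          rc = opal_get_msi_64(phb_id, mve, xive, 1, msi_addr, msi_data);
+          if (rc)
+                  return rc;
+
+          /* ibm,opal-ioda only: enable the MVE as a valid MSI DMA target. */
+          return opal_pci_set_mve_enable(phb_id, mve, 1);
+  }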
+
+IODA DMA Setup Sequences
+------------------------
+
+(**WARNING**: this was rescued from old internal documentation. Needs verification)
+
+To manage DMA windows:
+
+1. The host calls ``opal_pci_map_pe_dma_window( phb_id, dma_window_number, pe_number, tce_levels, tce_table_addr, tce_table_size, tce_page_size, uint64_t* pci_start_addr )`` to set up a DMA window for a PE that is translated through a TCE table structure in host memory.
+2. The host calls ``opal_pci_map_pe_dma_window_real( phb_id, dma_window_number, pe_number, mem_low_addr, mem_high_addr)`` to set up a DMA window for a PE that is not translated (but is validated by the PHB as an untranslated address space authorized to this PE).
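+
+A minimal sketch of setting up both kinds of windows is shown below,
+following the signatures quoted above; the parameter types and the
+single-level TCE table are illustrative assumptions: ::
+
+  #include <stdint.h>
+
+  /* Prototypes mirroring the signatures quoted above; assumptions only. */
+  extern int64_t opal_pci_map_pe_dma_window(uint64_t phb_id,
+          uint16_t dma_window_number, uint32_t pe_number, uint16_t tce_levels,
+          uint64_t tce_table_addr, uint64_t tce_table_size,
+          uint64_t tce_page_size, uint64_t *pci_start_addr);
+  extern int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id,
+          uint16_t dma_window_number, uint32_t pe_number,
+          uint64_t mem_low_addr, uint64_t mem_high_addr);
+
+  /* Translated window: DMA is translated through a TCE table that the
+   * host has already allocated and filled in host memory. */
+  static int64_t setup_tce_window(uint64_t phb_id, uint16_t window,
+                                  uint32_t pe, uint64_t tce_table_addr,
+                                  uint64_t tce_table_size,
+                                  uint64_t tce_page_size,
+                                  uint64_t *pci_start_addr)
+  {
+          return opal_pci_map_pe_dma_window(phb_id, window, pe,
+                          1 /* single-level TCE table */, tce_table_addr,
+                          tce_table_size, tce_page_size, pci_start_addr);
+  }
+
+  /* Untranslated window: the PHB only validates that the address range
+   * is authorized for this PE. */
+  static int64_t setup_bypass_window(uint64_t phb_id, uint16_t window,
+                                     uint32_t pe, uint64_t mem_low,
+                                     uint64_t mem_high)
+  {
+          return opal_pci_map_pe_dma_window_real(phb_id, window, pe,
+                                                 mem_low, mem_high);
+  }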
+
+Device Tree Bindings
+--------------------
+
+See :doc:`device-tree/pci` for device tree information.