diff options
Diffstat (limited to 'docs/specs')
27 files changed, 5158 insertions, 0 deletions
diff --git a/docs/specs/acpi_cpu_hotplug.rst b/docs/specs/acpi_cpu_hotplug.rst new file mode 100644 index 000000000..351057c96 --- /dev/null +++ b/docs/specs/acpi_cpu_hotplug.rst @@ -0,0 +1,235 @@ +QEMU<->ACPI BIOS CPU hotplug interface +====================================== + +QEMU supports CPU hotplug via ACPI. This document +describes the interface between QEMU and the ACPI BIOS. + +ACPI BIOS GPE.2 handler is dedicated for notifying OS about CPU hot-add +and hot-remove events. + + +Legacy ACPI CPU hotplug interface registers +------------------------------------------- + +CPU present bitmap for: + +- ICH9-LPC (IO port 0x0cd8-0xcf7, 1-byte access) +- PIIX-PM (IO port 0xaf00-0xaf1f, 1-byte access) +- One bit per CPU. Bit position reflects corresponding CPU APIC ID. Read-only. +- The first DWORD in bitmap is used in write mode to switch from legacy + to modern CPU hotplug interface, write 0 into it to do switch. + +QEMU sets corresponding CPU bit on hot-add event and issues SCI +with GPE.2 event set. CPU present map is read by ACPI BIOS GPE.2 handler +to notify OS about CPU hot-add events. CPU hot-remove isn't supported. + + +Modern ACPI CPU hotplug interface registers +------------------------------------------- + +Register block base address: + +- ICH9-LPC IO port 0x0cd8 +- PIIX-PM IO port 0xaf00 + +Register block size: + +- ACPI_CPU_HOTPLUG_REG_LEN = 12 + +All accesses to registers described below, imply little-endian byte order. + +Reserved registers behavior: + +- write accesses are ignored +- read accesses return all bits set to 0. + +The last stored value in 'CPU selector' must refer to a possible CPU, otherwise + +- reads from any register return 0 +- writes to any other register are ignored until valid value is stored into it + +On QEMU start, 'CPU selector' is initialized to a valid value, on reset it +keeps the current value. + +Read access behavior +^^^^^^^^^^^^^^^^^^^^ + +offset [0x0-0x3] + Command data 2: (DWORD access) + + If value last stored in 'Command field' is: + + 0: + reads as 0x0 + 3: + upper 32 bits of architecture specific CPU ID value + other values: + reserved + +offset [0x4] + CPU device status fields: (1 byte access) + + bits: + + 0: + Device is enabled and may be used by guest + 1: + Device insert event, used to distinguish device for which + no device check event to OSPM was issued. + It's valid only when bit 0 is set. + 2: + Device remove event, used to distinguish device for which + no device eject request to OSPM was issued. Firmware must + ignore this bit. + 3: + reserved and should be ignored by OSPM + 4: + if set to 1, OSPM requests firmware to perform device eject. + 5-7: + reserved and should be ignored by OSPM + +offset [0x5-0x7] + reserved + +offset [0x8] + Command data: (DWORD access) + + If value last stored in 'Command field' is one of: + + 0: + contains 'CPU selector' value of a CPU with pending event[s] + 3: + lower 32 bits of architecture specific CPU ID value + (in x86 case: APIC ID) + otherwise: + contains 0 + +Write access behavior +^^^^^^^^^^^^^^^^^^^^^ + +offset [0x0-0x3] + CPU selector: (DWORD access) + + Selects active CPU device. All following accesses to other + registers will read/store data from/to selected CPU. + Valid values: [0 .. max_cpus) + +offset [0x4] + CPU device control fields: (1 byte access) + + bits: + + 0: + reserved, OSPM must clear it before writing to register. + 1: + if set to 1 clears device insert event, set by OSPM + after it has emitted device check event for the + selected CPU device + 2: + if set to 1 clears device remove event, set by OSPM + after it has emitted device eject request for the + selected CPU device. + 3: + if set to 1 initiates device eject, set by OSPM when it + triggers CPU device removal and calls _EJ0 method or by firmware + when bit #4 is set. In case bit #4 were set, it's cleared as + part of device eject. + 4: + if set to 1, OSPM hands over device eject to firmware. + Firmware shall issue device eject request as described above + (bit #3) and OSPM should not touch device eject bit (#3) in case + it's asked firmware to perform CPU device eject. + 5-7: + reserved, OSPM must clear them before writing to register + +offset[0x5] + Command field: (1 byte access) + + value: + + 0: + selects a CPU device with inserting/removing events and + following reads from 'Command data' register return + selected CPU ('CPU selector' value). + If no CPU with events found, the current 'CPU selector' doesn't + change and corresponding insert/remove event flags are not modified. + + 1: + following writes to 'Command data' register set OST event + register in QEMU + 2: + following writes to 'Command data' register set OST status + register in QEMU + 3: + following reads from 'Command data' and 'Command data 2' return + architecture specific CPU ID value for currently selected CPU. + other values: + reserved + +offset [0x6-0x7] + reserved + +offset [0x8] + Command data: (DWORD access) + + If last stored 'Command field' value is: + + 1: + stores value into OST event register + 2: + stores value into OST status register, triggers + ACPI_DEVICE_OST QMP event from QEMU to external applications + with current values of OST event and status registers. + other values: + reserved + +Typical usecases +---------------- + +(x86) Detecting and enabling modern CPU hotplug interface +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +QEMU starts with legacy CPU hotplug interface enabled. Detecting and +switching to modern interface is based on the 2 legacy CPU hotplug features: + +#. Writes into CPU bitmap are ignored. +#. CPU bitmap always has bit #0 set, corresponding to boot CPU. + +Use following steps to detect and enable modern CPU hotplug interface: + +#. Store 0x0 to the 'CPU selector' register, attempting to switch to modern mode +#. Store 0x0 to the 'CPU selector' register, to ensure valid selector value +#. Store 0x0 to the 'Command field' register +#. Read the 'Command data 2' register. + If read value is 0x0, the modern interface is enabled. + Otherwise legacy or no CPU hotplug interface available + +Get a cpu with pending event +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +#. Store 0x0 to the 'CPU selector' register. +#. Store 0x0 to the 'Command field' register. +#. Read the 'CPU device status fields' register. +#. If both bit #1 and bit #2 are clear in the value read, there is no CPU + with a pending event and selected CPU remains unchanged. +#. Otherwise, read the 'Command data' register. The value read is the + selector of the CPU with the pending event (which is already selected). + +Enumerate CPUs present/non present CPUs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +#. Set the present CPU count to 0. +#. Set the iterator to 0. +#. Store 0x0 to the 'CPU selector' register, to ensure that it's in + a valid state and that access to other registers won't be ignored. +#. Store 0x0 to the 'Command field' register to make 'Command data' + register return 'CPU selector' value of selected CPU +#. Read the 'CPU device status fields' register. +#. If bit #0 is set, increment the present CPU count. +#. Increment the iterator. +#. Store the iterator to the 'CPU selector' register. +#. Read the 'Command data' register. +#. If the value read is not zero, goto 05. +#. Otherwise store 0x0 to the 'CPU selector' register, to put it + into a valid state and exit. + The iterator at this point equals "max_cpus". diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst new file mode 100644 index 000000000..68f1fbe0a --- /dev/null +++ b/docs/specs/acpi_hest_ghes.rst @@ -0,0 +1,110 @@ +APEI tables generating and CPER record +====================================== + +.. + Copyright (c) 2020 HUAWEI TECHNOLOGIES CO., LTD. + + This work is licensed under the terms of the GNU GPL, version 2 or later. + See the COPYING file in the top-level directory. + +Design Details +-------------- + +:: + + etc/acpi/tables etc/hardware_errors + ==================== =============================== + + +--------------------------+ +----------------------------+ + | | HEST | +--------->| error_block_address1 |------+ + | +--------------------------+ | +----------------------------+ | + | | GHES1 | | +------->| error_block_address2 |------+-+ + | +--------------------------+ | | +----------------------------+ | | + | | ................. | | | | .............. | | | + | | error_status_address-----+-+ | -----------------------------+ | | + | | ................. | | +--->| error_block_addressN |------+-+---+ + | | read_ack_register--------+-+ | | +----------------------------+ | | | + | | read_ack_preserve | +-+---+--->| read_ack_register1 | | | | + | | read_ack_write | | | +----------------------------+ | | | + + +--------------------------+ | +-+--->| read_ack_register2 | | | | + | | GHES2 | | | | +----------------------------+ | | | + + +--------------------------+ | | | | ............. | | | | + | | ................. | | | | +----------------------------+ | | | + | | error_status_address-----+---+ | | +->| read_ack_registerN | | | | + | | ................. | | | | +----------------------------+ | | | + | | read_ack_register--------+-----+ | | |Generic Error Status Block 1|<-----+ | | + | | read_ack_preserve | | | |-+------------------------+-+ | | + | | read_ack_write | | | | | CPER | | | | + + +--------------------------| | | | | CPER | | | | + | | ............... | | | | | .... | | | | + + +--------------------------+ | | | | CPER | | | | + | | GHESN | | | |-+------------------------+-| | | + + +--------------------------+ | | |Generic Error Status Block 2|<-------+ | + | | ................. | | | |-+------------------------+-+ | + | | error_status_address-----+-------+ | | | CPER | | | + | | ................. | | | | CPER | | | + | | read_ack_register--------+---------+ | | .... | | | + | | read_ack_preserve | | | CPER | | | + | | read_ack_write | +-+------------------------+-+ | + + +--------------------------+ | .......... | | + |----------------------------+ | + |Generic Error Status Block N |<----------+ + |-+-------------------------+-+ + | | CPER | | + | | CPER | | + | | .... | | + | | CPER | | + +-+-------------------------+-+ + + +(1) QEMU generates the ACPI HEST table. This table goes in the current + "etc/acpi/tables" fw_cfg blob. Each error source has different + notification types. + +(2) A new fw_cfg blob called "etc/hardware_errors" is introduced. QEMU + also needs to populate this blob. The "etc/hardware_errors" fw_cfg blob + contains an address registers table and an Error Status Data Block table. + +(3) The address registers table contains N Error Block Address entries + and N Read Ack Register entries. The size for each entry is 8-byte. + The Error Status Data Block table contains N Error Status Data Block + entries. The size for each entry is 4096(0x1000) bytes. The total size + for the "etc/hardware_errors" fw_cfg blob is (N * 8 * 2 + N * 4096) bytes. + N is the number of the kinds of hardware error sources. + +(4) QEMU generates the ACPI linker/loader script for the firmware. The + firmware pre-allocates memory for "etc/acpi/tables", "etc/hardware_errors" + and copies blob contents there. + +(5) QEMU generates N ADD_POINTER commands, which patch addresses in the + "error_status_address" fields of the HEST table with a pointer to the + corresponding "address registers" in the "etc/hardware_errors" blob. + +(6) QEMU generates N ADD_POINTER commands, which patch addresses in the + "read_ack_register" fields of the HEST table with a pointer to the + corresponding "read_ack_register" within the "etc/hardware_errors" blob. + +(7) QEMU generates N ADD_POINTER commands for the firmware, which patch + addresses in the "error_block_address" fields with a pointer to the + respective "Error Status Data Block" in the "etc/hardware_errors" blob. + +(8) QEMU defines a third and write-only fw_cfg blob which is called + "etc/hardware_errors_addr". Through that blob, the firmware can send back + the guest-side allocation addresses to QEMU. The "etc/hardware_errors_addr" + blob contains a 8-byte entry. QEMU generates a single WRITE_POINTER command + for the firmware. The firmware will write back the start address of + "etc/hardware_errors" blob to the fw_cfg file "etc/hardware_errors_addr". + +(9) When QEMU gets a SIGBUS from the kernel, QEMU writes CPER into corresponding + "Error Status Data Block", guest memory, and then injects platform specific + interrupt (in case of arm/virt machine it's Synchronous External Abort) as a + notification which is necessary for notifying the guest. + +(10) This notification (in virtual hardware) will be handled by the guest + kernel, on receiving notification, guest APEI driver could read the CPER error + and take appropriate action. + +(11) kvm_arch_on_sigbus_vcpu() uses source_id as index in "etc/hardware_errors" to + find out "Error Status Data Block" entry corresponding to error source. So supported + source_id values should be assigned here and not be changed afterwards to make sure + that guest will write error into expected "Error Status Data Block" even if guest was + migrated to a newer QEMU. diff --git a/docs/specs/acpi_hw_reduced_hotplug.rst b/docs/specs/acpi_hw_reduced_hotplug.rst new file mode 100644 index 000000000..0bd3f9399 --- /dev/null +++ b/docs/specs/acpi_hw_reduced_hotplug.rst @@ -0,0 +1,71 @@ +================================================== +QEMU and ACPI BIOS Generic Event Device interface +================================================== + +The ACPI *Generic Event Device* (GED) is a HW reduced platform +specific device introduced in ACPI v6.1 that handles all platform +events, including the hotplug ones. GED is modelled as a device +in the namespace with a _HID defined to be ACPI0013. This document +describes the interface between QEMU and the ACPI BIOS. + +GED allows HW reduced platforms to handle interrupts in ACPI ASL +statements. It follows a very similar approach to the _EVT method +from GPIO events. All interrupts are listed in _CRS and the handler +is written in _EVT method. However, the QEMU implementation uses a +single interrupt for the GED device, relying on an IO memory region +to communicate the type of device affected by the interrupt. This way, +we can support up to 32 events with a unique interrupt. + +**Here is an example,** + +:: + + Device (\_SB.GED) + { + Name (_HID, "ACPI0013") + Name (_UID, Zero) + Name (_CRS, ResourceTemplate () + { + Interrupt (ResourceConsumer, Edge, ActiveHigh, Exclusive, ,, ) + { + 0x00000029, + } + }) + OperationRegion (EREG, SystemMemory, 0x09080000, 0x04) + Field (EREG, DWordAcc, NoLock, WriteAsZeros) + { + ESEL, 32 + } + Method (_EVT, 1, Serialized) + { + Local0 = ESEL // ESEL = IO memory region which specifies the + // device type. + If (((Local0 & One) == One)) + { + MethodEvent1() + } + If ((Local0 & 0x2) == 0x2) + { + MethodEvent2() + } + ... + } + } + +GED IO interface (4 byte access) +-------------------------------- +**read access:** + +:: + + [0x0-0x3] Event selector bit field (32 bit) set by QEMU. + + bits: + 0: Memory hotplug event + 1: System power down event + 2: NVDIMM hotplug event + 3-31: Reserved + +**write_access:** + +Nothing is expected to be written into GED IO memory diff --git a/docs/specs/acpi_mem_hotplug.rst b/docs/specs/acpi_mem_hotplug.rst new file mode 100644 index 000000000..069819bc3 --- /dev/null +++ b/docs/specs/acpi_mem_hotplug.rst @@ -0,0 +1,128 @@ +QEMU<->ACPI BIOS memory hotplug interface +========================================= + +ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add +and hot-remove events. + +Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access) +---------------------------------------------------------------- + +Read access behavior +^^^^^^^^^^^^^^^^^^^^ + +[0x0-0x3] + Lo part of memory device phys address +[0x4-0x7] + Hi part of memory device phys address +[0x8-0xb] + Lo part of memory device size in bytes +[0xc-0xf] + Hi part of memory device size in bytes +[0x10-0x13] + Memory device proximity domain +[0x14] + Memory device status fields + + bits: + + 0: + Device is enabled and may be used by guest + 1: + Device insert event, used to distinguish device for which + no device check event to OSPM was issued. + It's valid only when bit 1 is set. + 2: + Device remove event, used to distinguish device for which + no device eject request to OSPM was issued. + 3-7: + reserved and should be ignored by OSPM + +[0x15-0x17] + reserved + +Write access behavior +^^^^^^^^^^^^^^^^^^^^^ + + +[0x0-0x3] + Memory device slot selector, selects active memory device. + All following accesses to other registers in 0xa00-0xa17 + region will read/store data from/to selected memory device. +[0x4-0x7] + OST event code reported by OSPM +[0x8-0xb] + OST status code reported by OSPM +[0xc-0x13] + reserved, writes into it are ignored +[0x14] + Memory device control fields + + bits: + + 0: + reserved, OSPM must clear it before writing to register. + Due to BUG in versions prior 2.4 that field isn't cleared + when other fields are written. Keep it reserved and don't + try to reuse it. + 1: + if set to 1 clears device insert event, set by OSPM + after it has emitted device check event for the + selected memory device + 2: + if set to 1 clears device remove event, set by OSPM + after it has emitted device eject request for the + selected memory device + 3: + if set to 1 initiates device eject, set by OSPM when it + triggers memory device removal and calls _EJ0 method + 4-7: + reserved, OSPM must clear them before writing to register + +Selecting memory device slot beyond present range has no effect on platform: + +- write accesses to memory hot-plug registers not documented above are ignored +- read accesses to memory hot-plug registers not documented above return + all bits set to 1. + +Memory hot remove process diagram +--------------------------------- + +:: + + +-------------+ +-----------------------+ +------------------+ + | 1. QEMU | | 2. QEMU | |3. QEMU | + | device_del +---->+ device unplug request +----->+Send SCI to guest,| + | | | cb | |return control to | + | | | | |management | + +-------------+ +-----------------------+ +------------------+ + + +---------------------------------------------------------------------+ + + +---------------------+ +-------------------------+ + | OSPM: | remove event | OSPM: | + | send Eject Request, | | Scan memory devices | + | clear remove event +<-------------+ for event flags | + | | | | + +---------------------+ +-------------------------+ + | + | + +---------v--------+ +-----------------------+ + | Guest OS: | success | OSPM: | + | process Ejection +----------->+ Execute _EJ0 method, | + | request | | set eject bit in flags| + +------------------+ +-----------------------+ + |failure | + v v + +------------------------+ +-----------------------+ + | OSPM: | | QEMU: | + | set OST event & status | | call device unplug cb | + | fields | | | + +------------------------+ +-----------------------+ + | | + v v + +------------------+ +-------------------+ + |QEMU: | |QEMU: | + |Send OST QMP event| |Send device deleted| + | | |QMP event | + +------------------+ | | + +-------------------+ diff --git a/docs/specs/acpi_nvdimm.rst b/docs/specs/acpi_nvdimm.rst new file mode 100644 index 000000000..ab0335253 --- /dev/null +++ b/docs/specs/acpi_nvdimm.rst @@ -0,0 +1,228 @@ +QEMU<->ACPI BIOS NVDIMM interface +================================= + +QEMU supports NVDIMM via ACPI. This document describes the basic concepts of +NVDIMM ACPI and the interface between QEMU and the ACPI BIOS. + +NVDIMM ACPI Background +---------------------- + +NVDIMM is introduced in ACPI 6.0 which defines an NVDIMM root device under +_SB scope with a _HID of "ACPI0012". For each NVDIMM present or intended +to be supported by platform, platform firmware also exposes an ACPI +Namespace Device under the root device. + +The NVDIMM child devices under the NVDIMM root device are defined with _ADR +corresponding to the NFIT device handle. The NVDIMM root device and the +NVDIMM devices can have device specific methods (_DSM) to provide additional +functions specific to a particular NVDIMM implementation. + +This is an example from ACPI 6.0, a platform contains one NVDIMM:: + + Scope (\_SB){ + Device (NVDR) // Root device + { + Name (_HID, "ACPI0012") + Method (_STA) {...} + Method (_FIT) {...} + Method (_DSM, ...) {...} + Device (NVD) + { + Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM + Method (_DSM, ...) {...} + } + } + } + +Methods supported on both NVDIMM root device and NVDIMM device +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +_DSM (Device Specific Method) + It is a control method that enables devices to provide device specific + control functions that are consumed by the device driver. + The NVDIMM DSM specification can be found at + http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf + + Arguments: + + Arg0 + A Buffer containing a UUID (16 Bytes) + Arg1 + An Integer containing the Revision ID (4 Bytes) + Arg2 + An Integer containing the Function Index (4 Bytes) + Arg3 + A package containing parameters for the function specified by the + UUID, Revision ID, and Function Index + + Return Value: + + If Function Index = 0, a Buffer containing a function index bitfield. + Otherwise, the return value and type depends on the UUID, revision ID + and function index which are described in the DSM specification. + +Methods on NVDIMM ROOT Device +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +_FIT(Firmware Interface Table) + It evaluates to a buffer returning data in the format of a series of NFIT + Type Structure. + + Arguments: None + + Return Value: + A Buffer containing a list of NFIT Type structure entries. + + The detailed definition of the structure can be found at ACPI 6.0: 5.2.25 + NVDIMM Firmware Interface Table (NFIT). + +QEMU NVDIMM Implementation +-------------------------- + +QEMU uses 4 bytes IO Port starting from 0x0a18 and a RAM-based memory page +for NVDIMM ACPI. + +Memory: + QEMU uses BIOS Linker/loader feature to ask BIOS to allocate a memory + page and dynamically patch its address into an int32 object named "MEMA" + in ACPI. + + This page is RAM-based and it is used to transfer data between _DSM + method and QEMU. If ACPI has control, this pages is owned by ACPI which + writes _DSM input data to it, otherwise, it is owned by QEMU which + emulates _DSM access and writes the output data to it. + + ACPI writes _DSM Input Data (based on the offset in the page): + + [0x0 - 0x3] + 4 bytes, NVDIMM Device Handle. + + The handle is completely QEMU internal thing, the values in + range [1, 0xFFFF] indicate nvdimm device. Other values are + reserved for other purposes. + + Reserved handles: + + - 0 is reserved for nvdimm root device named NVDR. + - 0x10000 is reserved for QEMU internal DSM function called on + the root device. + + [0x4 - 0x7] + 4 bytes, Revision ID, that is the Arg1 of _DSM method. + + [0x8 - 0xB] + 4 bytes. Function Index, that is the Arg2 of _DSM method. + + [0xC - 0xFFF] + 4084 bytes, the Arg3 of _DSM method. + + QEMU writes Output Data (based on the offset in the page): + + [0x0 - 0x3] + 4 bytes, the length of result + + [0x4 - 0xFFF] + 4092 bytes, the DSM result filled by QEMU + +IO Port 0x0a18 - 0xa1b: + ACPI writes the address of the memory page allocated by BIOS to this + port then QEMU gets the control and fills the result in the memory page. + + Write Access: + + [0x0a18 - 0xa1b] + 4 bytes, the address of the memory page allocated by BIOS. + +_DSM process diagram +-------------------- + +"MEMA" indicates the address of memory page allocated by BIOS. + +:: + + +----------------------+ +-----------------------+ + | 1. OSPM | | 2. OSPM | + | save _DSM input data | | write "MEMA" to | Exit to QEMU + | to the page +----->| IO port 0x0a18 +------------+ + | indicated by "MEMA" | | | | + +----------------------+ +-----------------------+ | + | + v + +--------------------+ +-----------+ +------------------+--------+ + | 5 QEMU | | 4 QEMU | | 3. QEMU | + | write _DSM result | | emulate | | get _DSM input data from | + | to the page +<------+ _DSM +<-----+ the page indicated by the | + | | | | | value from the IO port | + +--------+-----------+ +-----------+ +---------------------------+ + | + | Enter Guest + | + v + +--------------------------+ +--------------+ + | 6 OSPM | | 7 OSPM | + | result size is returned | | _DSM return | + | by reading DSM +----->+ | + | result from the page | | | + +--------------------------+ +--------------+ + +NVDIMM hotplug +-------------- + +ACPI BIOS GPE.4 handler is dedicated for notifying OS about nvdimm device +hot-add event. + +QEMU internal use only _DSM functions +------------------------------------- + +Read FIT +^^^^^^^^ + +_FIT method uses _DSM method to fetch NFIT structures blob from QEMU +in 1 page sized increments which are then concatenated and returned +as _FIT method result. + +Input parameters: + +Arg0 + UUID {set to 648B9CF2-CDA1-4312-8AD9-49C4AF32BD62} +Arg1 + Revision ID (set to 1) +Arg2 + Function Index, 0x1 +Arg3 + A package containing a buffer whose layout is as follows: + + +----------+--------+--------+-------------------------------------------+ + | Field | Length | Offset | Description | + +----------+--------+--------+-------------------------------------------+ + | offset | 4 | 0 | offset in QEMU's NFIT structures blob to | + | | | | read from | + +----------+--------+--------+-------------------------------------------+ + +Output layout in the dsm memory page: + + +----------+--------+--------+-------------------------------------------+ + | Field | Length | Offset | Description | + +----------+--------+--------+-------------------------------------------+ + | length | 4 | 0 | length of entire returned data | + | | | | (including this header) | + +----------+--------+--------+-------------------------------------------+ + | | | | return status codes | + | | | | | + | | | | - 0x0 - success | + | | | | - 0x100 - error caused by NFIT update | + | status | 4 | 4 | while read by _FIT wasn't completed | + | | | | - other codes follow Chapter 3 in | + | | | | DSM Spec Rev1 | + +----------+--------+--------+-------------------------------------------+ + | fit data | Varies | 8 | contains FIT data. This field is present | + | | | | if status field is 0. | + +----------+--------+--------+-------------------------------------------+ + +The FIT offset is maintained by the OSPM itself, current offset plus +the size of the fit data returned by the function is the next offset +OSPM should read. When all FIT data has been read out, zero fit data +size is returned. + +If it returns status code 0x100, OSPM should restart to read FIT (read +from offset 0 again). diff --git a/docs/specs/acpi_pci_hotplug.rst b/docs/specs/acpi_pci_hotplug.rst new file mode 100644 index 000000000..685bc5c32 --- /dev/null +++ b/docs/specs/acpi_pci_hotplug.rst @@ -0,0 +1,48 @@ +QEMU<->ACPI BIOS PCI hotplug interface +====================================== + +QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document +describes the interface between QEMU and the ACPI BIOS. + +ACPI GPE block (IO ports 0xafe0-0xafe3, byte access) +---------------------------------------------------- + +Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject +event to ACPI BIOS, via SCI interrupt. + +PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access) +------------------------------------------------------------------------------ + +Slot injection notification pending. One bit per slot. + +Read by ACPI BIOS GPE.1 handler to notify OS of injection +events. Read-only. + +PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access) +-------------------------------------------------------------------- + +Slot removal notification pending. One bit per slot. + +Read by ACPI BIOS GPE.1 handler to notify OS of removal +events. Read-only. + +PCI device eject (IO port 0xae08-0xae0b, 4-byte access) +------------------------------------------------------- + +Write: Used by ACPI BIOS _EJ0 method to request device removal. +One bit per slot. + +Read: Hotplug features register. Used by platform to identify features +available. Current base feature set (no bits set): + +- Read-only "up" register @0xae00, 4-byte access, bit per slot +- Read-only "down" register @0xae04, 4-byte access, bit per slot +- Read/write "eject" register @0xae08, 4-byte access, + write: bit per slot eject, read: hotplug feature set +- Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot + +PCI removability status (IO port 0xae0c-0xae0f, 4-byte access) +-------------------------------------------------------------- + +Used by ACPI BIOS _RMV method to indicate removability status to OS. One +bit per slot. Read-only. diff --git a/docs/specs/edu.txt b/docs/specs/edu.txt new file mode 100644 index 000000000..087631080 --- /dev/null +++ b/docs/specs/edu.txt @@ -0,0 +1,115 @@ + +EDU device +========== + +Copyright (c) 2014-2015 Jiri Slaby + +This document is licensed under the GPLv2 (or later). + +This is an educational device for writing (kernel) drivers. Its original +intention was to support the Linux kernel lectures taught at the Masaryk +University. Students are given this virtual device and are expected to write a +driver with I/Os, IRQs, DMAs and such. + +The devices behaves very similar to the PCI bridge present in the COMBO6 cards +developed under the Liberouter wings. Both PCI device ID and PCI space is +inherited from that device. + +Command line switches: + -device edu[,dma_mask=mask] + + dma_mask makes the virtual device work with DMA addresses with the given + mask. For educational purposes, the device supports only 28 bits (256 MiB) + by default. Students shall set dma_mask for the device in the OS driver + properly. + +PCI specs +--------- + +PCI ID: 1234:11e8 + +PCI Region 0: + I/O memory, 1 MB in size. Users are supposed to communicate with the card + through this memory. + +MMIO area spec +-------------- + +Only size == 4 accesses are allowed for addresses < 0x80. size == 4 or +size == 8 for the rest. + +0x00 (RO) : identification (0xRRrr00edu) + RR -- major version + rr -- minor version + +0x04 (RW) : card liveness check + It is a simple value inversion (~ C operator). + +0x08 (RW) : factorial computation + The stored value is taken and factorial of it is put back here. + This happens only after factorial bit in the status register (0x20 + below) is cleared. + +0x20 (RW) : status register, bitwise OR + 0x01 -- computing factorial (RO) + 0x80 -- raise interrupt after finishing factorial computation + +0x24 (RO) : interrupt status register + It contains values which raised the interrupt (see interrupt raise + register below). + +0x60 (WO) : interrupt raise register + Raise an interrupt. The value will be put to the interrupt status + register (using bitwise OR). + +0x64 (WO) : interrupt acknowledge register + Clear an interrupt. The value will be cleared from the interrupt + status register. This needs to be done from the ISR to stop + generating interrupts. + +0x80 (RW) : DMA source address + Where to perform the DMA from. + +0x88 (RW) : DMA destination address + Where to perform the DMA to. + +0x90 (RW) : DMA transfer count + The size of the area to perform the DMA on. + +0x98 (RW) : DMA command register, bitwise OR + 0x01 -- start transfer + 0x02 -- direction (0: from RAM to EDU, 1: from EDU to RAM) + 0x04 -- raise interrupt 0x100 after finishing the DMA + +IRQ controller +-------------- +An IRQ is generated when written to the interrupt raise register. The value +appears in interrupt status register when the interrupt is raised and has to +be written to the interrupt acknowledge register to lower it. + +The device supports both INTx and MSI interrupt. By default, INTx is +used. Even if the driver disabled INTx and only uses MSI, it still +needs to update the acknowledge register at the end of the IRQ handler +routine. + +DMA controller +-------------- +One has to specify, source, destination, size, and start the transfer. One +4096 bytes long buffer at offset 0x40000 is available in the EDU device. I.e. +one can perform DMA to/from this space when programmed properly. + +Example of transferring a 100 byte block to and from the buffer using a given +PCI address 'addr': +addr -> DMA source address +0x40000 -> DMA destination address +100 -> DMA transfer count +1 -> DMA command register +while (DMA command register & 1) + ; + +0x40000 -> DMA source address +addr+100 -> DMA destination address +100 -> DMA transfer count +3 -> DMA command register +while (DMA command register & 1) + ; diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.txt new file mode 100644 index 000000000..3e6d586f6 --- /dev/null +++ b/docs/specs/fw_cfg.txt @@ -0,0 +1,265 @@ +QEMU Firmware Configuration (fw_cfg) Device +=========================================== + += Guest-side Hardware Interface = + +This hardware interface allows the guest to retrieve various data items +(blobs) that can influence how the firmware configures itself, or may +contain tables to be installed for the guest OS. Examples include device +boot order, ACPI and SMBIOS tables, virtual machine UUID, SMP and NUMA +information, kernel/initrd images for direct (Linux) kernel booting, etc. + +== Selector (Control) Register == + +* Write only +* Location: platform dependent (IOport or MMIO) +* Width: 16-bit +* Endianness: little-endian (if IOport), or big-endian (if MMIO) + +A write to this register sets the index of a firmware configuration +item which can subsequently be accessed via the data register. + +Setting the selector register will cause the data offset to be set +to zero. The data offset impacts which data is accessed via the data +register, and is explained below. + +Bit14 of the selector register indicates whether the configuration +setting is being written. A value of 0 means the item is only being +read, and all write access to the data port will be ignored. A value +of 1 means the item's data can be overwritten by writes to the data +register. In other words, configuration write mode is enabled when +the selector value is between 0x4000-0x7fff or 0xc000-0xffff. + +NOTE: As of QEMU v2.4, writes to the fw_cfg data register are no + longer supported, and will be ignored (treated as no-ops)! + +NOTE: As of QEMU v2.9, writes are reinstated, but only through the DMA + interface (see below). Furthermore, writeability of any specific item is + governed independently of Bit14 in the selector key value. + +Bit15 of the selector register indicates whether the configuration +setting is architecture specific. A value of 0 means the item is a +generic configuration item. A value of 1 means the item is specific +to a particular architecture. In other words, generic configuration +items are accessed with a selector value between 0x0000-0x7fff, and +architecture specific configuration items are accessed with a selector +value between 0x8000-0xffff. + +== Data Register == + +* Read/Write (writes ignored as of QEMU v2.4, but see the DMA interface) +* Location: platform dependent (IOport [*] or MMIO) +* Width: 8-bit (if IOport), 8/16/32/64-bit (if MMIO) +* Endianness: string-preserving + +[*] On platforms where the data register is exposed as an IOport, its +port number will always be one greater than the port number of the +selector register. In other words, the two ports overlap, and can not +be mapped separately. + +The data register allows access to an array of bytes for each firmware +configuration data item. The specific item is selected by writing to +the selector register, as described above. + +Initially following a write to the selector register, the data offset +will be set to zero. Each successful access to the data register will +increment the data offset by the appropriate access width. + +Each firmware configuration item has a maximum length of data +associated with the item. After the data offset has passed the +end of this maximum data length, then any reads will return a data +value of 0x00, and all writes will be ignored. + +An N-byte wide read of the data register will return the next available +N bytes of the selected firmware configuration item, as a substring, in +increasing address order, similar to memcpy(). + +== Register Locations == + +=== x86, x86_64 Register Locations === + +Selector Register IOport: 0x510 +Data Register IOport: 0x511 +DMA Address IOport: 0x514 + +=== Arm Register Locations === + +Selector Register address: Base + 8 (2 bytes) +Data Register address: Base + 0 (8 bytes) +DMA Address address: Base + 16 (8 bytes) + +== ACPI Interface == + +The fw_cfg device is defined with ACPI ID "QEMU0002". Since we expect +ACPI tables to be passed into the guest through the fw_cfg device itself, +the guest-side firmware can not use ACPI to find fw_cfg. However, once the +firmware is finished setting up ACPI tables and hands control over to the +guest kernel, the latter can use the fw_cfg ACPI node for a more accurate +inventory of in-use IOport or MMIO regions. + +== Firmware Configuration Items == + +=== Signature (Key 0x0000, FW_CFG_SIGNATURE) === + +The presence of the fw_cfg selector and data registers can be verified +by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE), +and reading four bytes from the data register. If the fw_cfg device is +present, the four bytes read will contain the characters "QEMU". + +If the DMA interface is available, then reading the DMA Address +Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format). + +=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) === + +A 32-bit little-endian unsigned int, this item is used to check for enabled +features. + - Bit 0: traditional interface. Always set. + - Bit 1: DMA interface. + +=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) === + +Firmware configuration items stored at selector keys 0x0020 or higher +(FW_CFG_FILE_FIRST or higher) have an associated entry in a directory +structure, which makes it easier for guest-side firmware to identify +and retrieve them. The format of this file directory (from fw_cfg.h in +the QEMU source tree) is shown here, slightly annotated for clarity: + +struct FWCfgFiles { /* the entire file directory fw_cfg item */ + uint32_t count; /* number of entries, in big-endian format */ + struct FWCfgFile f[]; /* array of file entries, see below */ +}; + +struct FWCfgFile { /* an individual file entry, 64 bytes total */ + uint32_t size; /* size of referenced fw_cfg item, big-endian */ + uint16_t select; /* selector key of fw_cfg item, big-endian */ + uint16_t reserved; + char name[56]; /* fw_cfg item name, NUL-terminated ascii */ +}; + +=== All Other Data Items === + +Please consult the QEMU source for the most up-to-date and authoritative list +of selector keys and their respective items' purpose, format and writeability. + +=== Ranges === + +Theoretically, there may be up to 0x4000 generic firmware configuration +items, and up to 0x4000 architecturally specific ones. + +Selector Reg. Range Usage +--------------- ----------- +0x0000 - 0x3fff Generic (0x0000 - 0x3fff, generally RO, possibly RW through + the DMA interface in QEMU v2.9+) +0x4000 - 0x7fff Generic (0x0000 - 0x3fff, RW, ignored in QEMU v2.4+) +0x8000 - 0xbfff Arch. Specific (0x0000 - 0x3fff, generally RO, possibly RW + through the DMA interface in QEMU v2.9+) +0xc000 - 0xffff Arch. Specific (0x0000 - 0x3fff, RW, ignored in v2.4+) + +In practice, the number of allowed firmware configuration items depends on the +machine type/version. + += Guest-side DMA Interface = + +If bit 1 of the feature bitmap is set, the DMA interface is present. This does +not replace the existing fw_cfg interface, it is an add-on. This interface +can be used through the 64-bit wide address register. + +The address register is in big-endian format. The value for the register is 0 +at startup and after an operation. A write to the least significant half (at +offset 4) triggers an operation. This means that operations with 32-bit +addresses can be triggered with just one write, whereas operations with +64-bit addresses can be triggered with one 64-bit write or two 32-bit writes, +starting with the most significant half (at offset 0). + +In this register, the physical address of a FWCfgDmaAccess structure in RAM +should be written. This is the format of the FWCfgDmaAccess structure: + +typedef struct FWCfgDmaAccess { + uint32_t control; + uint32_t length; + uint64_t address; +} FWCfgDmaAccess; + +The fields of the structure are in big endian mode, and the field at the lowest +address is the "control" field. + +The "control" field has the following bits: + - Bit 0: Error + - Bit 1: Read + - Bit 2: Skip + - Bit 3: Select. The upper 16 bits are the selected index. + - Bit 4: Write + +When an operation is triggered, if the "control" field has bit 3 set, the +upper 16 bits are interpreted as an index of a firmware configuration item. +This has the same effect as writing the selector register. + +If the "control" field has bit 1 set, a read operation will be performed. +"length" bytes for the current selector and offset will be copied into the +physical RAM address specified by the "address" field. + +If the "control" field has bit 4 set (and not bit 1), a write operation will be +performed. "length" bytes will be copied from the physical RAM address +specified by the "address" field to the current selector and offset. QEMU +prevents starting or finishing the write beyond the end of the item associated +with the current selector (i.e., the item cannot be resized). Truncated writes +are dropped entirely. Writes to read-only items are also rejected. All of these +write errors set bit 0 (the error bit) in the "control" field. + +If the "control" field has bit 2 set (and neither bit 1 nor bit 4), a skip +operation will be performed. The offset for the current selector will be +advanced "length" bytes. + +To check the result, read the "control" field: + error bit set -> something went wrong. + all bits cleared -> transfer finished successfully. + otherwise -> transfer still in progress (doesn't happen + today due to implementation not being async, + but may in the future). + += Externally Provided Items = + +Since v2.4, "file" fw_cfg items (i.e., items with selector keys above +FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file +directory structure) may be inserted via the QEMU command line, using +the following syntax: + + -fw_cfg [name=]<item_name>,file=<path> + +Or + + -fw_cfg [name=]<item_name>,string=<string> + +Since v5.1, QEMU allows some objects to generate fw_cfg-specific content, +the content is then associated with a "file" item using the 'gen_id' option +in the command line, using the following syntax: + + -object <generator-type>,id=<generated_id>,[generator-specific-options] \ + -fw_cfg [name=]<item_name>,gen_id=<generated_id> + +See QEMU man page for more documentation. + +Using item_name with plain ASCII characters only is recommended. + +Item names beginning with "opt/" are reserved for users. QEMU will +never create entries with such names unless explicitly ordered by the +user. + +To avoid clashes among different users, it is strongly recommended +that you use names beginning with opt/RFQDN/, where RFQDN is a reverse +fully qualified domain name you control. For instance, if SeaBIOS +wanted to define additional names, the prefix "opt/org.seabios/" would +be appropriate. + +For historical reasons, "opt/ovmf/" is reserved for OVMF firmware. + +Prefix "opt/org.qemu/" is reserved for QEMU itself. + +Use of names not beginning with "opt/" is potentially dangerous and +entirely unsupported. QEMU will warn if you try. + +Use of names not beginning with "opt/" is tolerated with 'gen_id' (that +is, the warning is suppressed), but you must know exactly what you're +doing. + +All externally provided fw_cfg items are read-only to the guest. diff --git a/docs/specs/index.rst b/docs/specs/index.rst new file mode 100644 index 000000000..ecc43896b --- /dev/null +++ b/docs/specs/index.rst @@ -0,0 +1,20 @@ +---------------------------------------------- +System Emulation Guest Hardware Specifications +---------------------------------------------- + +This section of the manual contains specifications of +guest hardware that is specific to QEMU. + +.. toctree:: + :maxdepth: 2 + + ppc-xive + ppc-spapr-xive + ppc-spapr-numa + acpi_hw_reduced_hotplug + tpm + acpi_hest_ghes + acpi_cpu_hotplug + acpi_mem_hotplug + acpi_pci_hotplug + acpi_nvdimm diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt new file mode 100644 index 000000000..1beb3a01e --- /dev/null +++ b/docs/specs/ivshmem-spec.txt @@ -0,0 +1,258 @@ += Device Specification for Inter-VM shared memory device = + +The Inter-VM shared memory device (ivshmem) is designed to share a +memory region between multiple QEMU processes running different guests +and the host. In order for all guests to be able to pick up the +shared memory area, it is modeled by QEMU as a PCI device exposing +said memory to the guest as a PCI BAR. + +The device can use a shared memory object on the host directly, or it +can obtain one from an ivshmem server. + +In the latter case, the device can additionally interrupt its peers, and +get interrupted by its peers. + + +== Configuring the ivshmem PCI device == + +There are two basic configurations: + +- Just shared memory: + + -device ivshmem-plain,memdev=HMB,... + + This uses host memory backend HMB. It should have option "share" + set. + +- Shared memory plus interrupts: + + -device ivshmem-doorbell,chardev=CHR,vectors=N,... + + An ivshmem server must already be running on the host. The device + connects to the server's UNIX domain socket via character device + CHR. + + Each peer gets assigned a unique ID by the server. IDs must be + between 0 and 65535. + + Interrupts are message-signaled (MSI-X). vectors=N configures the + number of vectors to use. + +For more details on ivshmem device properties, see the QEMU Emulator +user documentation. + + +== The ivshmem PCI device's guest interface == + +The device has vendor ID 1af4, device ID 1110, revision 1. Before +QEMU 2.6.0, it had revision 0. + +=== PCI BARs === + +The ivshmem PCI device has two or three BARs: + +- BAR0 holds device registers (256 Byte MMIO) +- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell) +- BAR2 maps the shared memory object + +There are two ways to use this device: + +- If you only need the shared memory part, BAR2 suffices. This way, + you have access to the shared memory in the guest and can use it as + you see fit. Memnic, for example, uses ivshmem this way from guest + user space (see http://dpdk.org/browse/memnic). + +- If you additionally need the capability for peers to interrupt each + other, you need BAR0 and BAR1. You will most likely want to write a + kernel driver to handle interrupts. Requires the device to be + configured for interrupts, obviously. + +Before QEMU 2.6.0, BAR2 can initially be invalid if the device is +configured for interrupts. It becomes safely accessible only after +the ivshmem server provided the shared memory. These devices have PCI +revision 0 rather than 1. Guest software should wait for the +IVPosition register (described below) to become non-negative before +accessing BAR2. + +Revision 0 of the device is not capable to tell guest software whether +it is configured for interrupts. + +=== PCI device registers === + +BAR 0 contains the following registers: + + Offset Size Access On reset Function + 0 4 read/write 0 Interrupt Mask + bit 0: peer interrupt (rev 0) + reserved (rev 1) + bit 1..31: reserved + 4 4 read/write 0 Interrupt Status + bit 0: peer interrupt (rev 0) + reserved (rev 1) + bit 1..31: reserved + 8 4 read-only 0 or ID IVPosition + 12 4 write-only N/A Doorbell + bit 0..15: vector + bit 16..31: peer ID + 16 240 none N/A reserved + +Software should only access the registers as specified in column +"Access". Reserved bits should be ignored on read, and preserved on +write. + +In revision 0 of the device, Interrupt Status and Mask Register +together control the legacy INTx interrupt when the device has no +MSI-X capability: INTx is asserted when the bit-wise AND of Status and +Mask is non-zero and the device has no MSI-X capability. Interrupt +Status Register bit 0 becomes 1 when an interrupt request from a peer +is received. Reading the register clears it. + +IVPosition Register: if the device is not configured for interrupts, +this is zero. Else, it is the device's ID (between 0 and 65535). + +Before QEMU 2.6.0, the register may read -1 for a short while after +reset. These devices have PCI revision 0 rather than 1. + +There is no good way for software to find out whether the device is +configured for interrupts. A positive IVPosition means interrupts, +but zero could be either. + +Doorbell Register: writing this register requests to interrupt a peer. +The written value's high 16 bits are the ID of the peer to interrupt, +and its low 16 bits select an interrupt vector. + +If the device is not configured for interrupts, the write is ignored. + +If the interrupt hasn't completed setup, the write is ignored. The +device is not capable to tell guest software whether setup is +complete. Interrupts can regress to this state on migration. + +If the peer with the requested ID isn't connected, or it has fewer +interrupt vectors connected, the write is ignored. The device is not +capable to tell guest software what peers are connected, or how many +interrupt vectors are connected. + +The peer's interrupt for this vector then becomes pending. There is +no way for software to clear the pending bit, and a polling mode of +operation is therefore impossible. + +If the peer is a revision 0 device without MSI-X capability, its +Interrupt Status register is set to 1. This asserts INTx unless +masked by the Interrupt Mask register. The device is not capable to +communicate the interrupt vector to guest software then. + +With multiple MSI-X vectors, different vectors can be used to indicate +different events have occurred. The semantics of interrupt vectors +are left to the application. + + +== Interrupt infrastructure == + +When configured for interrupts, the peers share eventfd objects in +addition to shared memory. The shared resources are managed by an +ivshmem server. + +=== The ivshmem server === + +The server listens on a UNIX domain socket. + +For each new client that connects to the server, the server +- picks an ID, +- creates eventfd file descriptors for the interrupt vectors, +- sends the ID and the file descriptor for the shared memory to the + new client, +- sends connect notifications for the new client to the other clients + (these contain file descriptors for sending interrupts), +- sends connect notifications for the other clients to the new client, + and +- sends interrupt setup messages to the new client (these contain file + descriptors for receiving interrupts). + +The first client to connect to the server receives ID zero. + +When a client disconnects from the server, the server sends disconnect +notifications to the other clients. + +The next section describes the protocol in detail. + +If the server terminates without sending disconnect notifications for +its connected clients, the clients can elect to continue. They can +communicate with each other normally, but won't receive disconnect +notification on disconnect, and no new clients can connect. There is +no way for the clients to connect to a restarted server. The device +is not capable to tell guest software whether the server is still up. + +Example server code is in contrib/ivshmem-server/. Not to be used in +production. It assumes all clients use the same number of interrupt +vectors. + +A standalone client is in contrib/ivshmem-client/. It can be useful +for debugging. + +=== The ivshmem Client-Server Protocol === + +An ivshmem device configured for interrupts connects to an ivshmem +server. This section details the protocol between the two. + +The connection is one-way: the server sends messages to the client. +Each message consists of a single 8 byte little-endian signed number, +and may be accompanied by a file descriptor via SCM_RIGHTS. Both +client and server close the connection on error. + +Note: QEMU currently doesn't close the connection right on error, but +only when the character device is destroyed. + +On connect, the server sends the following messages in order: + +1. The protocol version number, currently zero. The client should + close the connection on receipt of versions it can't handle. + +2. The client's ID. This is unique among all clients of this server. + IDs must be between 0 and 65535, because the Doorbell register + provides only 16 bits for them. + +3. The number -1, accompanied by the file descriptor for the shared + memory. + +4. Connect notifications for existing other clients, if any. This is + a peer ID (number between 0 and 65535 other than the client's ID), + repeated N times. Each repetition is accompanied by one file + descriptor. These are for interrupting the peer with that ID using + vector 0,..,N-1, in order. If the client is configured for fewer + vectors, it closes the extra file descriptors. If it is configured + for more, the extra vectors remain unconnected. + +5. Interrupt setup. This is the client's own ID, repeated N times. + Each repetition is accompanied by one file descriptor. These are + for receiving interrupts from peers using vector 0,..,N-1, in + order. If the client is configured for fewer vectors, it closes + the extra file descriptors. If it is configured for more, the + extra vectors remain unconnected. + +From then on, the server sends these kinds of messages: + +6. Connection / disconnection notification. This is a peer ID. + + - If the number comes with a file descriptor, it's a connection + notification, exactly like in step 4. + + - Else, it's a disconnection notification for the peer with that ID. + +Known bugs: + +* The protocol changed incompatibly in QEMU 2.5. Before, messages + were native endian long, and there was no version number. + +* The protocol is poorly designed. + +=== The ivshmem Client-Client Protocol === + +An ivshmem device configured for interrupts receives eventfd file +descriptors for interrupting peers and getting interrupted by peers +from the server, as explained in the previous section. + +To interrupt a peer, the device writes the 8-byte integer 1 in native +byte order to the respective file descriptor. + +To receive an interrupt, the device reads and discards as many 8-byte +integers as it can. diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt new file mode 100644 index 000000000..5e407a6f3 --- /dev/null +++ b/docs/specs/pci-ids.txt @@ -0,0 +1,71 @@ + +PCI IDs for qemu +================ + +Red Hat, Inc. donates a part of its device ID range to qemu, to be used for +virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36. + +Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned +for your devices. + +1af4 vendor ID +-------------- + +The 1000 -> 10ff device ID range is used as follows for virtio-pci devices. +Note that this allocation separate from the virtio device IDs, which are +maintained as part of the virtio specification. + +1af4:1000 network device (legacy) +1af4:1001 block device (legacy) +1af4:1002 balloon device (legacy) +1af4:1003 console device (legacy) +1af4:1004 SCSI host bus adapter device (legacy) +1af4:1005 entropy generator device (legacy) +1af4:1009 9p filesystem device (legacy) + +1af4:1041 network device (modern) +1af4:1042 block device (modern) +1af4:1043 console device (modern) +1af4:1044 entropy generator device (modern) +1af4:1045 balloon device (modern) +1af4:1048 SCSI host bus adapter device (modern) +1af4:1049 9p filesystem device (modern) +1af4:1050 virtio gpu device (modern) +1af4:1052 virtio input device (modern) + +1af4:10f0 Available for experimental usage without registration. Must get + to official ID when the code leaves the test lab (i.e. when seeking +1af4:10ff upstream merge or shipping a distro/product) to avoid conflicts. + +1af4:1100 Used as PCI Subsystem ID for existing hardware devices emulated + by qemu. + +1af4:1110 ivshmem device (shared memory, docs/specs/ivshmem-spec.txt) + +All other device IDs are reserved. + +1b36 vendor ID +-------------- + +The 0000 -> 00ff device ID range is used as follows for QEMU-specific +PCI devices (other than virtio): + +1b36:0001 PCI-PCI bridge +1b36:0002 PCI serial port (16550A) adapter (docs/specs/pci-serial.txt) +1b36:0003 PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt) +1b36:0004 PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt) +1b36:0005 PCI test device (docs/specs/pci-testdev.txt) +1b36:0006 PCI Rocker Ethernet switch device +1b36:0007 PCI SD Card Host Controller Interface (SDHCI) +1b36:0008 PCIe host bridge +1b36:0009 PCI Expander Bridge (-device pxb) +1b36:000a PCI-PCI bridge (multiseat) +1b36:000b PCIe Expander Bridge (-device pxb-pcie) +1b36:000d PCI xhci usb host adapter +1b36:000f mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c +1b36:0010 PCIe NVMe device (-device nvme) +1b36:0011 PCI PVPanic device (-device pvpanic-pci) + +All these devices are documented in docs/specs. + +The 0100 device ID is used for the QXL video card device. diff --git a/docs/specs/pci-serial.txt b/docs/specs/pci-serial.txt new file mode 100644 index 000000000..66c761f2b --- /dev/null +++ b/docs/specs/pci-serial.txt @@ -0,0 +1,34 @@ + +QEMU pci serial devices +======================= + +There is one single-port variant and two muliport-variants. Linux +guests out-of-the box with all cards. There is a Windows inf file +(docs/qemupciserial.inf) to setup the single-port card in Windows +guests. + + +single-port card +---------------- + +Name: pci-serial +PCI ID: 1b36:0002 + +PCI Region 0: + IO bar, 8 bytes long, with the 16550 uart mapped to it. + Interrupt is wired to pin A. + + +multiport cards +--------------- + +Name: pci-serial-2x +PCI ID: 1b36:0003 + +Name: pci-serial-4x +PCI ID: 1b36:0004 + +PCI Region 0: + IO bar, with two/four 16550 uart mapped after each other. + The first is at offset 0, second at offset 8, ... + Interrupt is wired to pin A. diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt new file mode 100644 index 000000000..4280a1e73 --- /dev/null +++ b/docs/specs/pci-testdev.txt @@ -0,0 +1,31 @@ +pci-test is a device used for testing low level IO + +device implements up to three BARs: BAR0, BAR1 and BAR2. +Each of BAR 0+1 can be memory or IO. Guests must detect +BAR types and act accordingly. + +BAR 0+1 size is up to 4K bytes each. +BAR 0+1 starts with the following header: + +typedef struct PCITestDevHdr { + uint8_t test; <- write-only, starts a given test number + uint8_t width_type; <- read-only, type and width of access for a given test. + 1,2,4 for byte,word or long write. + any other value if test not supported on this BAR + uint8_t pad0[2]; + uint32_t offset; <- read-only, offset in this BAR for a given test + uint32_t data; <- read-only, data to use for a given test + uint32_t count; <- for debugging. number of writes detected. + uint8_t name[]; <- for debugging. 0-terminated ASCII string. +} PCITestDevHdr; + +All registers are little endian. + +device is expected to always implement tests 0 to N on each BAR, and to add new +tests with higher numbers. In this way a guest can scan test numbers until it +detects an access type that it does not support on this BAR, then stop. + +BAR2 is a 64bit memory bar, without backing storage. It is disabled +by default and can be enabled using the membar=<size> property. This +can be used to test whether guests handle pci bars of a specific +(possibly quite large) size correctly. diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt new file mode 100644 index 000000000..93fe3da91 --- /dev/null +++ b/docs/specs/ppc-spapr-hcalls.txt @@ -0,0 +1,78 @@ +When used with the "pseries" machine type, QEMU-system-ppc64 implements +a set of hypervisor calls using a subset of the server "PAPR" specification +(IBM internal at this point), which is also what IBM's proprietary hypervisor +adheres too. + +The subset is selected based on the requirements of Linux as a guest. + +In addition to those calls, we have added our own private hypervisor +calls which are mostly used as a private interface between the firmware +running in the guest and QEMU. + +All those hypercalls start at hcall number 0xf000 which correspond +to an implementation specific range in PAPR. + +- H_RTAS (0xf000) + +RTAS is a set of runtime services generally provided by the firmware +inside the guest to the operating system. It predates the existence +of hypervisors (it was originally an extension to Open Firmware) and +is still used by PAPR to provide various services that aren't performance +sensitive. + +We currently implement the RTAS services in QEMU itself. The actual RTAS +"firmware" blob in the guest is a small stub of a few instructions which +calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU. + +Arguments: + + r3 : H_RTAS (0xf000) + r4 : Guest physical address of RTAS parameter block + +Returns: + + H_SUCCESS : Successfully called the RTAS function (RTAS result + will have been stored in the parameter block) + H_PARAMETER : Unknown token + +- H_LOGICAL_MEMOP (0xf001) + +When the guest runs in "real mode" (in powerpc lingua this means +with MMU disabled, ie guest effective == guest physical), it only +has access to a subset of memory and no IOs. + +PAPR provides a set of hypervisor calls to perform cacheable or +non-cacheable accesses to any guest physical addresses that the +guest can use in order to access IO devices while in real mode. + +This is typically used by the firmware running in the guest. + +However, doing a hypercall for each access is extremely inefficient +(even more so when running KVM) when accessing the frame buffer. In +that case, things like scrolling become unusably slow. + +This hypercall allows the guest to request a "memory op" to be applied +to memory. The supported memory ops at this point are to copy a range +of memory (supports overlap of source and destination) and XOR which +is used by our SLOF firmware to invert the screen. + +Arguments: + + r3: H_LOGICAL_MEMOP (0xf001) + r4: Guest physical address of destination + r5: Guest physical address of source + r6: Individual element size + 0 = 1 byte + 1 = 2 bytes + 2 = 4 bytes + 3 = 8 bytes + r7: Number of elements + r8: Operation + 0 = copy + 1 = xor + +Returns: + + H_SUCCESS : Success + H_PARAMETER : Invalid argument + diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt new file mode 100644 index 000000000..d4fb2d46d --- /dev/null +++ b/docs/specs/ppc-spapr-hotplug.txt @@ -0,0 +1,409 @@ += sPAPR Dynamic Reconfiguration = + +sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration +to handle hotplugging of dynamic "physical" resources like PCI cards, or +"logical"/paravirtual resources like memory, CPUs, and "physical" +host-bridges, which are generally managed by the host/hypervisor and provided +to guests as virtualized resources. The specifics of dynamic-reconfiguration +are documented extensively in PAPR+ v2.7, Section 13.1. This document +provides a summary of that information as it applies to the implementation +within QEMU. + +== Dynamic-reconfiguration Connectors == + +To manage hotplug/unplug of these resources, a firmware abstraction known as +a Dynamic Resource Connector (DRC) is used to assign a particular dynamic +resource to the guest, and provide an interface for the guest to manage +configuration/removal of the resource associated with it. + +== Device-tree description of DRCs == + +A set of 4 Open Firmware device tree array properties are used to describe +the name/index/power-domain/type of each DRC allocated to a guest at +boot-time. There may be multiple sets of these arrays, rooted at different +paths in the device tree depending on the type of resource the DRCs manage. + +In some cases, the DRCs themselves may be provided by a dynamic resource, +such as the DRCs managing PCI slots on a hotplugged PHB. In this case the +arrays would be fetched as part of the device tree retrieval interfaces +for hotplugged resources described under "Guest->Host interface". + +The array properties are described below. Each entry/element in an array +describes the DRC identified by the element in the corresponding position +of ibm,drc-indexes: + +ibm,drc-names: + first 4-bytes: BE-encoded integer denoting the number of entries + each entry: a NULL-terminated <name> string encoded as a byte array + + <name> values for logical/virtual resources are defined in PAPR+ v2.7, + Section 13.5.2.4, and basically consist of the type of the resource + followed by a space and a numerical value that's unique across resources + of that type. + + <name> values for "physical" resources such as PCI or VIO devices are + defined as being "location codes", which are the "location labels" of + each encapsulating device, starting from the chassis down to the + individual slot for the device, concatenated by a hyphen. This provides + a mapping of resources to a physical location in a chassis for debugging + purposes. For QEMU, this mapping is less important, so we assign a + location code that conforms to naming specifications, but is simply a + location label for the slot by itself to simplify the implementation. + The naming convention for location labels is documented in detail in + PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>" + for PCI/VIO device slots, where <n> is unique across all PCI/VIO + device slots. + +ibm,drc-indexes: + first 4-bytes: BE-encoded integer denoting the number of entries + each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs + in the machine + + <index> is arbitrary, but in the case of QEMU we try to maintain the + convention used to assign them to pSeries guests on pHyp: + + bit[31:28]: integer encoding of <type>, where <type> is: + 1 for CPU resource + 2 for PHB resource + 3 for VIO resource + 4 for PCI resource + 8 for Memory resource + bit[27:0]: integer encoding of <id>, where <id> is unique across + all resources of specified type + +ibm,drc-power-domains: + first 4-bytes: BE-encoded integer denoting the number of entries + each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the + power domain the resource will be assigned to. In the case of QEMU + we associated all resources with a "live insertion" domain, where the + power is assumed to be managed automatically. The integer value for + this domain is a special value of -1. + + +ibm,drc-types: + first 4-bytes: BE-encoded integer denoting the number of entries + each entry: a NULL-terminated <type> string encoded as a byte array + + <type> is assigned as follows: + "CPU" for a CPU + "PHB" for a physical host-bridge + "SLOT" for a VIO slot + "28" for a PCI slot + "MEM" for memory resource + +== Guest->Host interface to manage dynamic resources == + +Each DRC is given a globally unique DRC Index, and resources associated with +a particular DRC are configured/managed by the guest via a number of RTAS +calls which reference individual DRCs based on the DRC index. This can be +considered the guest->host interface. + +rtas-set-power-level: + arg[0]: integer identifying power domain + arg[1]: new power level for the domain, 0-100 + output[0]: status, 0 on success + output[1]: power level after command + + Set the power level for a specified power domain + +rtas-get-power-level: + arg[0]: integer identifying power domain + output[0]: status, 0 on success + output[1]: current power level + + Get the power level for a specified power domain + +rtas-set-indicator: + arg[0]: integer identifying sensor/indicator type + arg[1]: index of sensor, for DR-related sensors this is generally the + DRC index + arg[2]: desired sensor value + output[0]: status, 0 on success + + Set the state of an indicator or sensor. For the purpose of this document we + focus on the indicator/sensor types associated with a DRC. The types are: + + 9001: isolation-state, controls/indicates whether a device has been made + accessible to a guest + + supported sensor values: + 0: isolate, device is made unaccessible by guest OS + 1: unisolate, device is made available to guest OS + + 9002: dr-indicator, controls "visual" indicator associated with device + + supported sensor values: + 0: inactive, resource may be safely removed + 1: active, resource is in use and cannot be safely removed + 2: identify, used to visually identify slot for interactive hotplug + 3: action, in most cases, used in the same manner as identify + + 9003: allocation-state, generally only used for "logical" DR resources to + request the allocation/deallocation of a resource prior to acquiring + it via isolation-state->unisolate, or after releasing it via + isolation-state->isolate, respectively. for "physical" DR (like PCI + hotplug/unplug) the pre-allocation of the resource is implied and + this sensor is unused. + + supported sensor values: + 0: unusable, tell firmware/system the resource can be + unallocated/reclaimed and added back to the system resource pool + 1: usable, request the resource be allocated/reserved for use by + guest OS + 2: exchange, used to allocate a spare resource to use for fail-over + in certain situations. unused in QEMU + 3: recover, used to reclaim a previously allocated resource that's + not currently allocated to the guest OS. unused in QEMU + +rtas-get-sensor-state: + arg[0]: integer identifying sensor/indicator type + arg[1]: index of sensor, for DR-related sensors this is generally the + DRC index + output[0]: status, 0 on success + + Used to read an indicator or sensor value. + + For DR-related operations, the only noteworthy sensor is dr-entity-sense, + which has a type value of 9003, as allocation-state does in the case of + rtas-set-indicator. The semantics/encodings of the sensor values are distinct + however: + + supported sensor values for dr-entity-sense (9003) sensor: + 0: empty, + for physical resources: DRC/slot is empty + for logical resources: unused + 1: present, + for physical resources: DRC/slot is populated with a device/resource + for logical resources: resource has been allocated to the DRC + 2: unusable, + for physical resources: unused + for logical resources: DRC has no resource allocated to it + 3: exchange, + for physical resources: unused + for logical resources: resource available for exchange (see + allocation-state sensor semantics above) + 4: recovery, + for physical resources: unused + for logical resources: resource available for recovery (see + allocation-state sensor semantics above) + +rtas-ibm-configure-connector: + arg[0]: guest physical address of 4096-byte work area buffer + arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero + if a prior RTAS response indicated a need for additional memory + output[0]: status: + 0: completed transmittal of device-tree node + 1: instruct guest to prepare for next DT sibling node + 2: instruct guest to prepare for next DT child node + 3: instruct guest to prepare for next DT property + 4: instruct guest to ascend to parent DT node + 5: instruct guest to provide additional work-area buffer + via arg[1] + 990x: instruct guest that operation took too long and to try + again later + + Used to fetch an OF device-tree description of the resource associated with + a particular DRC. The DRC index is encoded in the first 4-bytes of the first + work area buffer. + + Work area layout, using 4-byte offsets: + wa[0]: DRC index of the DRC to fetch device-tree nodes from + wa[1]: 0 (hard-coded) + wa[2]: for next-sibling/next-child response: + wa offset of null-terminated string denoting the new node's name + for next-property response: + wa offset of null-terminated string denoting new property's name + wa[3]: for next-property response (unused otherwise): + byte-length of new property's value + wa[4]: for next-property response (unused otherwise): + new property's value, encoded as an OFDT-compatible byte array + +== hotplug/unplug events == + +For most DR operations, the hypervisor will issue host->guest add/remove events +using the EPOW/check-exception notification framework, where the host issues a +check-exception interrupt, then provides an RTAS event log via an +rtas-check-exception call issued by the guest in response. This framework is +documented by PAPR+ v2.7, and already use in by QEMU for generating powerdown +requests via EPOW events. + +For DR, this framework has been extended to include hotplug events, which were +previously unneeded due to direct manipulation of DR-related guest userspace +tools by host-level management such as an HMC. This level of management is not +applicable to PowerKVM, hence the reason for extending the notification +framework to support hotplug events. + +The format for these EPOW-signalled events is described below under +"hotplug/unplug event structure". Note that these events are not +formally part of the PAPR+ specification, and have been superseded by a +newer format, also described below under "hotplug/unplug event structure", +and so are now deemed a "legacy" format. The formats are similar, but the +"modern" format contains additional fields/flags, which are denoted for the +purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards. + +QEMU should assume support only for "legacy" fields/flags unless the guest +advertises support for the "modern" format via ibm,client-architecture-support +hcall by setting byte 5, bit 6 of it's ibm,architecture-vec-5 option vector +structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events, +"modern" format events are surfaced to the guest via check-exception RTAS calls, +but use a dedicated event source to signal the guest. This event source is +advertised to the guest by the addition of a "hot-plug-events" node under +"/event-sources" node of the guest's device tree using the standard format +described in LoPAPR v11, B.6.12.1. + +== hotplug/unplug event structure == + +The hotplug-specific payload in QEMU is implemented as follows (with all values +encoded in big-endian format): + +struct rtas_event_log_v6_hp { +#define SECTION_ID_HOTPLUG 0x4850 /* HP */ + struct section_header { + uint16_t section_id; /* set to SECTION_ID_HOTPLUG */ + uint16_t section_length; /* sizeof(rtas_event_log_v6_hp), + * plus the length of the DRC name + * if a DRC name identifier is + * specified for hotplug_identifier + */ + uint8_t section_version; /* version 1 */ + uint8_t section_subtype; /* unused */ + uint16_t creator_component_id; /* unused */ + } hdr; +#define RTAS_LOG_V6_HP_TYPE_CPU 1 +#define RTAS_LOG_V6_HP_TYPE_MEMORY 2 +#define RTAS_LOG_V6_HP_TYPE_SLOT 3 +#define RTAS_LOG_V6_HP_TYPE_PHB 4 +#define RTAS_LOG_V6_HP_TYPE_PCI 5 + uint8_t hotplug_type; /* type of resource/device */ +#define RTAS_LOG_V6_HP_ACTION_ADD 1 +#define RTAS_LOG_V6_HP_ACTION_REMOVE 2 + uint8_t hotplug_action; /* action (add/remove) */ +#define RTAS_LOG_V6_HP_ID_DRC_NAME 1 +#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2 +#define RTAS_LOG_V6_HP_ID_DRC_COUNT 3 +#ifdef GUEST_SUPPORTS_MODERN +#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4 +#endif + uint8_t hotplug_identifier; /* type of the resource identifier, + * which serves as the discriminator + * for the 'drc' union field below + */ +#ifdef GUEST_SUPPORTS_MODERN + uint8_t capabilities; /* capability flags, currently unused + * by QEMU + */ +#else + uint8_t reserved; +#endif + union { + uint32_t index; /* DRC index of resource to take action + * on + */ + uint32_t count; /* number of DR resources to take + * action on (guest chooses which) + */ +#ifdef GUEST_SUPPORTS_MODERN + struct { + uint32_t count; /* number of DR resources to take + * action on + */ + uint32_t index; /* DRC index of first resource to take + * action on. guest will take action + * on DRC index <index> through + * DRC index <index + count - 1> in + * sequential order + */ + } count_indexed; +#endif + char name[1]; /* string representing the name of the + * DRC to take action on + */ + } drc; +} QEMU_PACKED; + +== ibm,lrdr-capacity == + +ibm,lrdr-capacity is a property in the /rtas device tree node that identifies +the dynamic reconfiguration capabilities of the guest. It consists of a triple +consisting of <phys>, <size> and <maxcpus>. + + <phys>, encoded in BE format represents the maximum address in bytes and + hence the maximum memory that can be allocated to the guest. + + <size>, encoded in BE format represents the size increments in which + memory can be hot-plugged to the guest. + + <maxcpus>, a BE-encoded integer, represents the maximum number of + processors that the guest can have. + +pseries guests use this property to note the maximum allowed CPUs for the +guest. + +== ibm,dynamic-reconfiguration-memory == + +ibm,dynamic-reconfiguration-memory is a device tree node that represents +dynamically reconfigurable logical memory blocks (LMB). This node +is generated only when the guest advertises the support for it via +ibm,client-architecture-support call. Memory that is not dynamically +reconfigurable is represented by /memory nodes. The properties of this +node that are of interest to the sPAPR memory hotplug implementation +in QEMU are described here. + +ibm,lmb-size + +This 64bit integer defines the size of each dynamically reconfigurable LMB. + +ibm,associativity-lookup-arrays + +This property defines a lookup array in which the NUMA associativity +information for each LMB can be found. It is a property encoded array +that begins with an integer M, the number of associativity lists followed +by an integer N, the number of entries per associativity list and terminated +by M associativity lists each of length N integers. + +This property provides the same information as given by ibm,associativity +property in a /memory node. Each assigned LMB has an index value between +0 and M-1 which is used as an index into this table to select which +associativity list to use for the LMB. This index value for each LMB +is defined in ibm,dynamic-memory property. + +ibm,dynamic-memory + +This property describes the dynamically reconfigurable memory. It is a +property encoded array that has an integer N, the number of LMBs followed +by N LMB list entries. + +Each LMB list entry consists of the following elements: + +- Logical address of the start of the LMB encoded as a 64bit integer. This + corresponds to reg property in /memory node. +- DRC index of the LMB that corresponds to ibm,my-drc-index property + in a /memory node. +- Four bytes reserved for expansion. +- Associativity list index for the LMB that is used as an index into + ibm,associativity-lookup-arrays property described earlier. This + is used to retrieve the right associativity list to be used for this + LMB. +- A 32bit flags word. The bit at bit position 0x00000008 defines whether + the LMB is assigned to the partition as of boot time. + +ibm,dynamic-memory-v2 + +This property describes the dynamically reconfigurable memory. This is +an alternate and newer way to describe dynamically reconfigurable memory. +It is a property encoded array that has an integer N (the number of +LMB set entries) followed by N LMB set entries. There is an LMB set entry +for each sequential group of LMBs that share common attributes. + +Each LMB set entry consists of the following elements: + +- Number of sequential LMBs in the entry represented by a 32bit integer. +- Logical address of the first LMB in the set encoded as a 64bit integer. +- DRC index of the first LMB in the set. +- Associativity list index that is used as an index into + ibm,associativity-lookup-arrays property described earlier. This + is used to retrieve the right associativity list to be used for all + the LMBs in this set. +- A 32bit flags word that applies to all the LMBs in the set. + +[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867 diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst new file mode 100644 index 000000000..ffa687dc8 --- /dev/null +++ b/docs/specs/ppc-spapr-numa.rst @@ -0,0 +1,410 @@ + +NUMA mechanics for sPAPR (pseries machines) +============================================ + +NUMA in sPAPR works different than the System Locality Distance +Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR +1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This +document aims to complement this specification, providing details +of the elements that impacts how QEMU views NUMA in pseries. + +Associativity and ibm,associativity property +-------------------------------------------- + +Associativity is defined as a group of platform resources that has +similar mean performance (or in our context here, distance) relative to +everyone else outside of the group. + +The format of the ibm,associativity property varies with the value of +bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with +bit 0 equal to zero is deprecated. The current format, with the bit 0 +with the value of one, makes ibm,associativity property represent the +physical hierarchy of the platform, as one or more lists that starts +with the highest level grouping up to the smallest. Considering the +following topology: + +:: + + Mem M1 ---- Proc P1 | + ----------------- | Socket S1 ---| + chip C1 | | + | HW module 1 (MOD1) + Mem M2 ---- Proc P2 | | + ----------------- | Socket S2 ---| + chip C2 | + +The ibm,associativity property for the processors would be: + +* P1: {MOD1, S1, C1, P1} +* P2: {MOD1, S2, C2, P2} + +Each allocable resource has an ibm,associativity property. The LOPAPR +specification allows multiple lists to be present in this property, +considering that the same resource can have multiple connections to the +platform. + +Relative Performance Distance and ibm,associativity-reference-points +-------------------------------------------------------------------- + +The ibm,associativity-reference-points property is an array that is used +to define the relevant performance/distance related boundaries, defining +the NUMA levels for the platform. + +The definition of its elements also varies with the value of bit 0 of byte 5 +of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero +is also deprecated. With the current format, each integer of the +ibm,associativity-reference-points represents an 1 based ordinal index (i.e. +the first element is 1) of the ibm,associativity array. The first +boundary is the most significant to application performance, followed by +less significant boundaries. Allocated resources that belongs to the +same performance boundaries are expected to have relative NUMA distance +that matches the relevancy of the boundary itself. Resources that belongs +to the same first boundary will have the shortest distance from each +other. Subsequent boundaries represents greater distances and degraded +performance. + +Using the previous example, the following setting reference points defines +three NUMA levels: + +* ibm,associativity-reference-points = {0x3, 0x2, 0x1} + +The first NUMA level (0x3) is interpreted as the third element of each +ibm,associativity array, the second level is the second element and +the third level is the first element. Let's also consider that elements +belonging to the first NUMA level have distance equal to 10 from each +other, and each NUMA level doubles the distance from the previous. This +means that the second would be 20 and the third level 40. For the P1 and +P2 processors, we would have the following NUMA levels: + +:: + + * ibm,associativity-reference-points = {0x3, 0x2, 0x1} + + * P1: associativity{MOD1, S1, C1, P1} + + First NUMA level (0x3) => associativity[2] = C1 + Second NUMA level (0x2) => associativity[1] = S1 + Third NUMA level (0x1) => associativity[0] = MOD1 + + * P2: associativity{MOD1, S2, C2, P2} + + First NUMA level (0x3) => associativity[2] = C2 + Second NUMA level (0x2) => associativity[1] = S2 + Third NUMA level (0x1) => associativity[0] = MOD1 + + P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40 + +Changing the ibm,associativity-reference-points array changes the performance +distance attributes for the same associativity arrays, as the following +example illustrates: + +:: + + * ibm,associativity-reference-points = {0x2} + + * P1: associativity{MOD1, S1, C1, P1} + + First NUMA level (0x2) => associativity[1] = S1 + + * P2: associativity{MOD1, S2, C2, P2} + + First NUMA level (0x2) => associativity[1] = S2 + + P1 and P2 does not have a common performance boundary. Since this is a one level + NUMA configuration, distance between them is one boundary above the first + level, 20. + + +In a hypothetical platform where all resources inside the same hardware module +is considered to be on the same performance boundary: + +:: + + * ibm,associativity-reference-points = {0x1} + + * P1: associativity{MOD1, S1, C1, P1} + + First NUMA level (0x1) => associativity[0] = MOD0 + + * P2: associativity{MOD1, S2, C2, P2} + + First NUMA level (0x1) => associativity[0] = MOD0 + + P1 and P2 belongs to the same first order boundary. The distance between then + is 10. + + +How the pseries Linux guest calculates NUMA distances +===================================================== + +Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is +how the distances are expressed. The SLIT table provides the NUMA distance +value between the relevant resources. LOPAPR does not provide a standard +way to calculate it. We have the ibm,associativity for each resource, which +provides a common-performance hierarchy, and the ibm,associativity-reference-points +array that tells which level of associativity is considered to be relevant +or not. + +The result is that each OS is free to implement and to interpret the distance +as it sees fit. For the pseries Linux guest, each level of NUMA duplicates +the distance of the previous level, and the maximum amount of levels is +limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the +kernel tree). This results in the following distances: + +* both resources in the first NUMA level: 10 +* resources one NUMA level apart: 20 +* resources two NUMA levels apart: 40 +* resources three NUMA levels apart: 80 +* resources four NUMA levels apart: 160 + + +pseries NUMA mechanics +====================== + +Starting in QEMU 5.2, the pseries machine considers user input when setting NUMA +topology of the guest. The overall design is: + +* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing + for 4 distinct NUMA distance values based on the NUMA levels + +* ibm,max-associativity-domains supports multiple associativity domains in all + NUMA levels, granting user flexibility + +* ibm,associativity for all resources varies with user input + +These changes are only effective for pseries-5.2 and newer machines that are +created with more than one NUMA node (disconsidering NUMA nodes created by +the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been +around for such a long time, with users seeing NUMA distances 10 and 40 +(and 80 if using NVLink2 GPUs), and there is no need to disrupt the +existing experience of those guests. + +To bring the user experience x86 users have when tuning up NUMA, we had +to operate under the current pseries Linux kernel logic described in +`How the pseries Linux guest calculates NUMA distances`_. The result +is that we needed to translate NUMA distance user input to pseries +Linux kernel input. + +Translating user distance to kernel distance +-------------------------------------------- + +User input for NUMA distance can vary from 10 to 254. We need to translate +that to the values that the Linux kernel operates on (10, 20, 40, 80, 160). +This is how it is being done: + +* user distance 11 to 30 will be interpreted as 20 +* user distance 31 to 60 will be interpreted as 40 +* user distance 61 to 120 will be interpreted as 80 +* user distance 121 and beyond will be interpreted as 160 +* user distance 10 stays 10 + +The reasoning behind this approximation is to avoid any round up to the local +distance (10), keeping it exclusive to the 4th NUMA level (which is still +exclusive to the node_id). All other ranges were chosen under the developer +discretion of what would be (somewhat) sensible considering the user input. +Any other strategy can be used here, but in the end the reality is that we'll +have to accept that a large array of values will be translated to the same +NUMA topology in the guest, e.g. this user input: + +:: + + 0 1 2 + 0 10 31 120 + 1 31 10 30 + 2 120 30 10 + +And this other user input: + +:: + + 0 1 2 + 0 10 60 61 + 1 60 10 11 + 2 61 11 10 + +Will both be translated to the same values internally: + +:: + + 0 1 2 + 0 10 40 80 + 1 40 10 20 + 2 80 20 10 + +Users are encouraged to use only the kernel values in the NUMA definition to +avoid being taken by surprise with that the guest is actually seeing in the +topology. There are enough potential surprises that are inherent to the +associativity domain assignment process, discussed below. + + +How associativity domains are assigned +-------------------------------------- + +LOPAPR allows more than one associativity array (or 'string') per allocated +resource. This would be used to represent that the resource has multiple +connections with the board, and then the operational system, when deciding +NUMA distancing, should consider the associativity information that provides +the shortest distance. + +The spapr implementation does not support multiple associativity arrays per +resource, neither does the pseries Linux kernel. We'll have to represent the +NUMA topology using one associativity per resource, which means that choices +and compromises are going to be made. + +Consider the following NUMA topology entered by user input: + +:: + + 0 1 2 3 + 0 10 40 20 40 + 1 40 10 80 40 + 2 20 80 10 20 + 3 40 40 20 10 + +All the associativity arrays are initialized with NUMA id in all associativity +domains: + +* node 0: 0 0 0 0 +* node 1: 1 1 1 1 +* node 2: 2 2 2 2 +* node 3: 3 3 3 3 + + +Honoring just the relative distances of node 0 to every other node, we find the +NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for +each distance: + +* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match + at 0x2) +* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3) +* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match + at 0x2) + +We'll copy the associativity domains of node 0 to all other nodes, based on +the NUMA level matches. Between 0 and 1, a match in 0x2, we'll also copy +the domains 0x2 and 0x1 from 0 to 1 as well. This will give us: + +* node 0: 0 0 0 0 +* node 1: 0 0 1 1 + +Doing the same to node 2 and node 3, these are the associativity arrays +after considering all matches with node 0: + +* node 0: 0 0 0 0 +* node 1: 0 0 1 1 +* node 2: 0 0 0 2 +* node 3: 0 0 3 3 + +The distances related to node 0 are accounted for. For node 1, and keeping +in mind that we don't need to revisit node 0 again, the distance from +node 1 to 2 is 80, matching at 0x1, and distance from 1 to 3 is 40, +match in 0x2. Repeating the same logic of copying all domains up to +the NUMA level match: + +* node 0: 0 0 0 0 +* node 1: 1 0 1 1 +* node 2: 1 0 0 2 +* node 3: 1 0 3 3 + +In the last step we will analyze just nodes 2 and 3. The desired distance +between 2 and 3 is 20, i.e. a match in 0x3: + +* node 0: 0 0 0 0 +* node 1: 1 0 1 1 +* node 2: 1 0 0 2 +* node 3: 1 0 0 3 + + +The kernel will read these arrays and will calculate the following NUMA topology for +the guest: + +:: + + 0 1 2 3 + 0 10 40 20 20 + 1 40 10 40 40 + 2 20 40 10 20 + 3 20 40 20 10 + +Note that this is not what the user wanted - the desired distance between +0 and 3 is 40, we calculated it as 20. This is what the current logic and +implementation constraints of the kernel and QEMU will provide inside the +LOPAPR specification. + +Users are welcome to use this knowledge and experiment with the input to get +the NUMA topology they want, or as closer as they want. The important thing +is to keep expectations up to par with what we are capable of provide at this +moment: an approximation. + +Limitations of the implementation +--------------------------------- + +As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate +user choice. The Linux kernel, and PAPR itself, does not provide QEMU with the ways +to fully map user input to actual NUMA distance the guest will use. These limitations +creates two notable limitations in our support: + +* Asymmetrical topologies aren't supported. We only support NUMA topologies where + the distance from node A to B is always the same as B to A. We do not support + any A-B pair where the distance back and forth is asymmetric. For example, the + following topology isn't supported and the pSeries guest will not boot with this + user input: + +:: + + 0 1 + 0 10 40 + 1 20 10 + + +* 'non-transitive' topologies will be poorly translated to the guest. This is the + kind of topology where the distance from a node A to B is X, B to C is X, but + the distance A to C is not X. E.g.: + +:: + + 0 1 2 3 + 0 10 20 20 40 + 1 20 10 80 40 + 2 20 80 10 20 + 3 40 40 20 10 + + In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40. + The kernel will always match with the shortest associativity domain possible, + and we're attempting to retain the previous established relations between the + nodes. This means that a distance equal to 20 between nodes 0 and 2 and the + same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3 + to also be 20. + + +Legacy (5.1 and older) pseries NUMA mechanics +============================================= + +In short, we can summarize the NUMA distances seem in pseries Linux guests, using +QEMU up to 5.1, as follows: + +* local distance, i.e. the distance of the resource to its own NUMA node: 10 +* if it's a NVLink GPU device, distance: 80 +* every other resource, distance: 40 + +The way the pseries Linux guest calculates NUMA distances has a direct effect +on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is +the default ibm,associativity-reference-points being used in the pseries +machine: + +ibm,associativity-reference-points = {0x4, 0x4, 0x2} + +The first and second level are equal, 0x4, and a third one was added in +commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that +regardless of how the ibm,associativity properties are being created in +the device tree, the pseries Linux guest will only recognize three scenarios +as far as NUMA distance goes: + +* if the resources belongs to the same first NUMA level = 10 +* second level is skipped since it's equal to the first +* all resources that aren't a NVLink GPU, it is guaranteed that they will belong + to the same third NUMA level, having distance = 40 +* for NVLink GPUs, distance = 80 from everything else + +This also means that user input in QEMU command line does not change the +NUMA distancing inside the guest for the pseries machine. diff --git a/docs/specs/ppc-spapr-uv-hcalls.txt b/docs/specs/ppc-spapr-uv-hcalls.txt new file mode 100644 index 000000000..389c2740d --- /dev/null +++ b/docs/specs/ppc-spapr-uv-hcalls.txt @@ -0,0 +1,76 @@ +On PPC64 systems supporting Protected Execution Facility (PEF), system +memory can be placed in a secured region where only an "ultravisor" +running in firmware can provide to access it. pseries guests on such +systems can communicate with the ultravisor (via ultracalls) to switch to a +secure VM mode (SVM) where the guest's memory is relocated to this secured +region, making its memory inaccessible to normal processes/guests running on +the host. + +The various ultracalls/hypercalls relating to SVM mode are currently +only documented internally, but are planned for direct inclusion into the +public OpenPOWER version of the PAPR specification (LoPAPR/LoPAR). An internal +ACR has been filed to reserve a hypercall number range specific to this +use-case to avoid any future conflicts with the internally-maintained PAPR +specification. This document summarizes some of these details as they relate +to QEMU. + +== hypercalls needed by the ultravisor == + +Switching to SVM mode involves a number of hcalls issued by the ultravisor +to the hypervisor to orchestrate the movement of guest memory to secure +memory and various other aspects SVM mode. Numbers are assigned for these +hcalls within the reserved range 0xEF00-0xEF80. The below documents the +hcalls relevant to QEMU. + +- H_TPM_COMM (0xef10) + + For TPM_COMM_OP_EXECUTE operation: + Send a request to a TPM and receive a response, opening a new TPM session + if one has not already been opened. + + For TPM_COMM_OP_CLOSE_SESSION operation: + Close the existing TPM session, if any. + + Arguments: + + r3 : H_TPM_COMM (0xef10) + r4 : TPM operation, one of: + TPM_COMM_OP_EXECUTE (0x1) + TPM_COMM_OP_CLOSE_SESSION (0x2) + r5 : in_buffer, guest physical address of buffer containing the request + - Caller may use the same address for both request and response + r6 : in_size, size of the in buffer + - Must be less than or equal to 4KB + r7 : out_buffer, guest physical address of buffer to store the response + - Caller may use the same address for both request and response + r8 : out_size, size of the out buffer + - Must be at least 4KB, as this is the maximum request/response size + supported by most TPM implementations, including the TPM Resource + Manager in the linux kernel. + + Return values: + + r3 : H_Success request processed successfully + H_PARAMETER invalid TPM operation + H_P2 in_buffer is invalid + H_P3 in_size is invalid + H_P4 out_buffer is invalid + H_P5 out_size is invalid + H_RESOURCE problem communicating with TPM + H_FUNCTION TPM access is not currently allowed/configured + r4 : For TPM_COMM_OP_EXECUTE, the size of the response will be stored here + upon success. + + Use-case/notes: + + SVM filesystems are encrypted using a symmetric key. This key is then + wrapped/encrypted using the public key of a trusted system which has the + private key stored in the system's TPM. An Ultravisor will use this + hcall to unwrap/unseal the symmetric key using the system's TPM device + or a TPM Resource Manager associated with the device. + + The Ultravisor sets up a separate session key with the TPM in advance + during host system boot. All sensitive in and out values will be + encrypted using the session key. Though the hypervisor will see the 'in' + and 'out' buffers in raw form, any sensitive contents will generally be + encrypted using this session key. diff --git a/docs/specs/ppc-spapr-xive.rst b/docs/specs/ppc-spapr-xive.rst new file mode 100644 index 000000000..f47f739e0 --- /dev/null +++ b/docs/specs/ppc-spapr-xive.rst @@ -0,0 +1,282 @@ +XIVE for sPAPR (pseries machines) +================================= + +The POWER9 processor comes with a new interrupt controller +architecture, called XIVE as "eXternal Interrupt Virtualization +Engine". It supports a larger number of interrupt sources and offers +virtualization features which enables the HW to deliver interrupts +directly to virtual processors without hypervisor assistance. + +A QEMU ``pseries`` machine (which is PAPR compliant) using POWER9 +processors can run under two interrupt modes: + +- *Legacy Compatibility Mode* + + the hypervisor provides identical interfaces and similar + functionality to PAPR+ Version 2.7. This is the default mode + + It is also referred as *XICS* in QEMU. + +- *XIVE native exploitation mode* + + the hypervisor provides new interfaces to manage the XIVE control + structures, and provides direct control for interrupt management + through MMIO pages. + +Which interrupt modes can be used by the machine is negotiated with +the guest O/S during the Client Architecture Support negotiation +sequence. The two modes are mutually exclusive. + +Both interrupt mode share the same IRQ number space. See below for the +layout. + +CAS Negotiation +--------------- + +QEMU advertises the supported interrupt modes in the device tree +property ``ibm,arch-vec-5-platform-support`` in byte 23 and the OS +Selection for XIVE is indicated in the ``ibm,architecture-vec-5`` +property byte 23. + +The interrupt modes supported by the machine depend on the CPU type +(POWER9 is required for XIVE) but also on the machine property +``ic-mode`` which can be set on the command line. It can take the +following values: ``xics``, ``xive``, and ``dual`` which is the +default mode. ``dual`` means that both modes XICS **and** XIVE are +supported and if the guest OS supports XIVE, this mode will be +selected. + +The chosen interrupt mode is activated after a reconfiguration done +in a machine reset. + +KVM negotiation +--------------- + +When the guest starts under KVM, the capabilities of the host kernel +and QEMU are also negotiated. Depending on the version of the host +kernel, KVM will advertise the XIVE capability to QEMU or not. + +Nevertheless, the available interrupt modes in the machine should not +depend on the XIVE KVM capability of the host. On older kernels +without XIVE KVM support, QEMU will use the emulated XIVE device as a +fallback and on newer kernels (>=5.2), the KVM XIVE device. + +XIVE native exploitation mode is not supported for KVM nested guests, +VMs running under a L1 hypervisor (KVM on pSeries). In that case, the +hypervisor will not advertise the KVM capability and QEMU will use the +emulated XIVE device, same as for older versions of KVM. + +As a final refinement, the user can also switch the use of the KVM +device with the machine option ``kernel_irqchip``. + + +XIVE support in KVM +~~~~~~~~~~~~~~~~~~~ + +For guest OSes supporting XIVE, the resulting interrupt modes on host +kernels with XIVE KVM support are the following: + +============== ============= ============= ================ +ic-mode kernel_irqchip +-------------- ---------------------------------------------- +/ allowed off on + (default) +============== ============= ============= ================ +dual (default) XIVE KVM XIVE emul. XIVE KVM +xive XIVE KVM XIVE emul. XIVE KVM +xics XICS KVM XICS emul. XICS KVM +============== ============= ============= ================ + +For legacy guest OSes without XIVE support, the resulting interrupt +modes are the following: + +============== ============= ============= ================ +ic-mode kernel_irqchip +-------------- ---------------------------------------------- +/ allowed off on + (default) +============== ============= ============= ================ +dual (default) XICS KVM XICS emul. XICS KVM +xive QEMU error(3) QEMU error(3) QEMU error(3) +xics XICS KVM XICS emul. XICS KVM +============== ============= ============= ================ + +(3) QEMU fails at CAS with ``Guest requested unavailable interrupt + mode (XICS), either don't set the ic-mode machine property or try + ic-mode=xics or ic-mode=dual`` + + +No XIVE support in KVM +~~~~~~~~~~~~~~~~~~~~~~ + +For guest OSes supporting XIVE, the resulting interrupt modes on host +kernels without XIVE KVM support are the following: + +============== ============= ============= ================ +ic-mode kernel_irqchip +-------------- ---------------------------------------------- +/ allowed off on + (default) +============== ============= ============= ================ +dual (default) XIVE emul.(1) XIVE emul. QEMU error (2) +xive XIVE emul.(1) XIVE emul. QEMU error (2) +xics XICS KVM XICS emul. XICS KVM +============== ============= ============= ================ + + +(1) QEMU warns with ``warning: kernel_irqchip requested but unavailable: + IRQ_XIVE capability must be present for KVM`` + In some cases (old host kernels or KVM nested guests), one may hit a + QEMU/KVM incompatibility due to device destruction in reset. QEMU fails + with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on`` +(2) QEMU fails with ``kernel_irqchip requested but unavailable: + IRQ_XIVE capability must be present for KVM`` + + +For legacy guest OSes without XIVE support, the resulting interrupt +modes are the following: + +============== ============= ============= ================ +ic-mode kernel_irqchip +-------------- ---------------------------------------------- +/ allowed off on + (default) +============== ============= ============= ================ +dual (default) QEMU error(4) XICS emul. QEMU error(4) +xive QEMU error(3) QEMU error(3) QEMU error(3) +xics XICS KVM XICS emul. XICS KVM +============== ============= ============= ================ + +(3) QEMU fails at CAS with ``Guest requested unavailable interrupt + mode (XICS), either don't set the ic-mode machine property or try + ic-mode=xics or ic-mode=dual`` +(4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails + with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on`` + + +XIVE Device tree properties +--------------------------- + +The properties for the PAPR interrupt controller node when the *XIVE +native exploitation mode* is selected should contain: + +- ``device_type`` + + value should be "power-ivpe". + +- ``compatible`` + + value should be "ibm,power-ivpe". + +- ``reg`` + + contains the base address and size of the thread interrupt + managnement areas (TIMA), for the User level and for the Guest OS + level. Only the Guest OS level is taken into account today. + +- ``ibm,xive-eq-sizes`` + + the size of the event queues. One cell per size supported, contains + log2 of size, in ascending order. + +- ``ibm,xive-lisn-ranges`` + + the IRQ interrupt number ranges assigned to the guest for the IPIs. + +The root node also exports : + +- ``ibm,plat-res-int-priorities`` + + contains a list of priorities that the hypervisor has reserved for + its own use. + +IRQ number space +---------------- + +IRQ Number space of the ``pseries`` machine is 8K wide and is the same +for both interrupt mode. The different ranges are defined as follow : + +- ``0x0000 .. 0x0FFF`` 4K CPU IPIs (only used under XIVE) +- ``0x1000 .. 0x1000`` 1 EPOW +- ``0x1001 .. 0x1001`` 1 HOTPLUG +- ``0x1002 .. 0x10FF`` unused +- ``0x1100 .. 0x11FF`` 256 VIO devices +- ``0x1200 .. 0x127F`` 32x4 LSIs for PHB devices +- ``0x1280 .. 0x12FF`` unused +- ``0x1300 .. 0x1FFF`` PHB MSIs (dynamically allocated) + +Monitoring XIVE +--------------- + +The state of the XIVE interrupt controller can be queried through the +monitor commands ``info pic``. The output comes in two parts. + +First, the state of the thread interrupt context registers is dumped +for each CPU : + +:: + + (qemu) info pic + CPU[0000]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2 + CPU[0000]: USER 00 00 00 00 00 00 00 00 00000000 + CPU[0000]: OS 00 ff 00 00 ff 00 ff ff 80000400 + CPU[0000]: POOL 00 00 00 00 00 00 00 00 00000000 + CPU[0000]: PHYS 00 00 00 00 00 00 00 ff 00000000 + ... + +In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only +the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM +line which is set to the VP identifier. + +Then comes the routing information which aggregates the EAS and the +END configuration: + +:: + + ... + LISN PQ EISN CPU/PRIO EQ + 00000000 MSI -- 00000010 0/6 380/16384 @1fe3e0000 ^1 [ 80000010 ... ] + 00000001 MSI -- 00000010 1/6 305/16384 @1fc230000 ^1 [ 80000010 ... ] + 00000002 MSI -- 00000010 2/6 220/16384 @1fc2f0000 ^1 [ 80000010 ... ] + 00000003 MSI -- 00000010 3/6 201/16384 @1fc390000 ^1 [ 80000010 ... ] + 00000004 MSI -Q M 00000000 + 00000005 MSI -Q M 00000000 + 00000006 MSI -Q M 00000000 + 00000007 MSI -Q M 00000000 + 00001000 MSI -- 00000012 0/6 380/16384 @1fe3e0000 ^1 [ 80000010 ... ] + 00001001 MSI -- 00000013 0/6 380/16384 @1fe3e0000 ^1 [ 80000010 ... ] + 00001100 MSI -- 00000100 1/6 305/16384 @1fc230000 ^1 [ 80000010 ... ] + 00001101 MSI -Q M 00000000 + 00001200 LSI -Q M 00000000 + 00001201 LSI -Q M 00000000 + 00001202 LSI -Q M 00000000 + 00001203 LSI -Q M 00000000 + 00001300 MSI -- 00000102 1/6 305/16384 @1fc230000 ^1 [ 80000010 ... ] + 00001301 MSI -- 00000103 2/6 220/16384 @1fc2f0000 ^1 [ 80000010 ... ] + 00001302 MSI -- 00000104 3/6 201/16384 @1fc390000 ^1 [ 80000010 ... ] + +The source information and configuration: + +- The ``LISN`` column outputs the interrupt number of the source in + range ``[ 0x0 ... 0x1FFF ]`` and its type : ``MSI`` or ``LSI`` +- The ``PQ`` column reflects the state of the PQ bits of the source : + + - ``--`` source is ready to take events + - ``P-`` an event was sent and an EOI is PENDING + - ``PQ`` an event was QUEUED + - ``-Q`` source is OFF + + a ``M`` indicates that source is *MASKED* at the EAS level, + +The targeting configuration : + +- The ``EISN`` column is the event data that will be queued in the event + queue of the O/S. +- The ``CPU/PRIO`` column is the tuple defining the CPU number and + priority queue serving the source. +- The ``EQ`` column outputs : + + - the current index of the event queue/ the max number of entries + - the O/S event queue address + - the toggle bit + - the last entries that were pushed in the event queue. diff --git a/docs/specs/ppc-xive.rst b/docs/specs/ppc-xive.rst new file mode 100644 index 000000000..83d43f658 --- /dev/null +++ b/docs/specs/ppc-xive.rst @@ -0,0 +1,200 @@ +================================ +POWER9 XIVE interrupt controller +================================ + +The POWER9 processor comes with a new interrupt controller +architecture, called XIVE as "eXternal Interrupt Virtualization +Engine". + +Compared to the previous architecture, the main characteristics of +XIVE are to support a larger number of interrupt sources and to +deliver interrupts directly to virtual processors without hypervisor +assistance. This removes the context switches required for the +delivery process. + + +XIVE architecture +================= + +The XIVE IC is composed of three sub-engines, each taking care of a +processing layer of external interrupts: + +- Interrupt Virtualization Source Engine (IVSE), or Source Controller + (SC). These are found in PCI PHBs, in the Processor Service + Interface (PSI) host bridge Controller, but also inside the main + controller for the core IPIs and other sub-chips (NX, CAP, NPU) of + the chip/processor. They are configured to feed the IVRE with + events. +- Interrupt Virtualization Routing Engine (IVRE) or Virtualization + Controller (VC). It handles event coalescing and perform interrupt + routing by matching an event source number with an Event + Notification Descriptor (END). +- Interrupt Virtualization Presentation Engine (IVPE) or Presentation + Controller (PC). It maintains the interrupt context state of each + thread and handles the delivery of the external interrupt to the + thread. + +:: + + XIVE Interrupt Controller + +------------------------------------+ IPIs + | +---------+ +---------+ +--------+ | +-------+ + | |IVRE | |Common Q | |IVPE |----> | CORES | + | | esb | | | | |----> | | + | | eas | | Bridge | | tctx |----> | | + | |SC end | | | | nvt | | | | + +------+ | +---------+ +----+----+ +--------+ | +-+-+-+-+ + | RAM | +------------------|-----------------+ | | | + | | | | | | + | | | | | | + | | +--------------------v------------------------v-v-v--+ other + | <--+ Power Bus +--> chips + | esb | +---------+-----------------------+------------------+ + | eas | | | + | end | +--|------+ | + | nvt | +----+----+ | +----+----+ + +------+ |IVSE | | |IVSE | + | | | | | + | PQ-bits | | | PQ-bits | + | local |-+ | in VC | + +---------+ +---------+ + PCIe NX,NPU,CAPI + + + PQ-bits: 2 bits source state machine (P:pending Q:queued) + esb: Event State Buffer (Array of PQ bits in an IVSE) + eas: Event Assignment Structure + end: Event Notification Descriptor + nvt: Notification Virtual Target + tctx: Thread interrupt Context registers + + + +XIVE internal tables +-------------------- + +Each of the sub-engines uses a set of tables to redirect interrupts +from event sources to CPU threads. + +:: + + +-------+ + User or O/S | EQ | + or +------>|entries| + Hypervisor | | .. | + Memory | +-------+ + | ^ + | | + +-------------------------------------------------+ + | | + Hypervisor +------+ +---+--+ +---+--+ +------+ + Memory | ESB | | EAT | | ENDT | | NVTT | + (skiboot) +----+-+ +----+-+ +----+-+ +------+ + ^ | ^ | ^ | ^ + | | | | | | | + +-------------------------------------------------+ + | | | | | | | + | | | | | | | + +----|--|--------|--|--------|--|-+ +-|-----+ +------+ + | | | | | | | | | | tctx| |Thread| + IPI or ---+ + v + v + v |---| + .. |-----> | + HW events | | | | | | + | IVRE | | IVPE | +------+ + +---------------------------------+ +-------+ + + +The IVSE have a 2-bits state machine, P for pending and Q for queued, +for each source that allows events to be triggered. They are stored in +an Event State Buffer (ESB) array and can be controlled by MMIOs. + +If the event is let through, the IVRE looks up in the Event Assignment +Structure (EAS) table for an Event Notification Descriptor (END) +configured for the source. Each Event Notification Descriptor defines +a notification path to a CPU and an in-memory Event Queue, in which +will be enqueued an EQ data for the O/S to pull. + +The IVPE determines if a Notification Virtual Target (NVT) can handle +the event by scanning the thread contexts of the VCPUs dispatched on +the processor HW threads. It maintains the interrupt context state of +each thread in a NVT table. + +XIVE thread interrupt context +----------------------------- + +The XIVE presenter can generate four different exceptions to its +HW threads: + +- hypervisor exception +- O/S exception +- Event-Based Branch (user level) +- msgsnd (doorbell) + +Each exception has a state independent from the others called a Thread +Interrupt Management context. This context is a set of registers which +lets the thread handle priority management and interrupt +acknowledgment among other things. The most important ones being : + +- Interrupt Priority Register (PIPR) +- Interrupt Pending Buffer (IPB) +- Current Processor Priority (CPPR) +- Notification Source Register (NSR) + +TIMA +~~~~ + +The Thread Interrupt Management registers are accessible through a +specific MMIO region, called the Thread Interrupt Management Area +(TIMA), four aligned pages, each exposing a different view of the +registers. First page (page address ending in ``0b00``) gives access +to the entire context and is reserved for the ring 0 view for the +physical thread context. The second (page address ending in ``0b01``) +is for the hypervisor, ring 1 view. The third (page address ending in +``0b10``) is for the operating system, ring 2 view. The fourth (page +address ending in ``0b11``) is for user level, ring 3 view. + +Interrupt flow from an O/S perspective +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After an event data has been enqueued in the O/S Event Queue, the IVPE +raises the bit corresponding to the priority of the pending interrupt +in the register IBP (Interrupt Pending Buffer) to indicate that an +event is pending in one of the 8 priority queues. The Pending +Interrupt Priority Register (PIPR) is also updated using the IPB. This +register represent the priority of the most favored pending +notification. + +The PIPR is then compared to the Current Processor Priority +Register (CPPR). If it is more favored (numerically less than), the +CPU interrupt line is raised and the EO bit of the Notification Source +Register (NSR) is updated to notify the presence of an exception for +the O/S. The O/S acknowledges the interrupt with a special load in the +Thread Interrupt Management Area. + +The O/S handles the interrupt and when done, performs an EOI using a +MMIO operation on the ESB management page of the associate source. + +Overview of the QEMU models for XIVE +==================================== + +The XiveSource models the IVSE in general, internal and external. It +handles the source ESBs and the MMIO interface to control them. + +The XiveNotifier is a small helper interface interconnecting the +XiveSource to the XiveRouter. + +The XiveRouter is an abstract model acting as a combined IVRE and +IVPE. It routes event notifications using the EAS and END tables to +the IVPE sub-engine which does a CAM scan to find a CPU to deliver the +exception. Storage should be provided by the inheriting classes. + +XiveEnDSource is a special source object. It exposes the END ESB MMIOs +of the Event Queues which are used for coalescing event notifications +and for escalation. Not used on the field, only to sync the EQ cache +in OPAL. + +Finally, the XiveTCTX contains the interrupt state context of a thread, +four sets of registers, one for each exception that can be delivered +to a CPU. These contexts are scanned by the IVPE to find a matching VP +when a notification is triggered. It also models the Thread Interrupt +Management Area (TIMA), which exposes the thread context registers to +the CPU for interrupt management. diff --git a/docs/specs/pvpanic.txt b/docs/specs/pvpanic.txt new file mode 100644 index 000000000..8afcde11c --- /dev/null +++ b/docs/specs/pvpanic.txt @@ -0,0 +1,54 @@ +PVPANIC DEVICE +============== + +pvpanic device is a simulated device, through which a guest panic +event is sent to qemu, and a QMP event is generated. This allows +management apps (e.g. libvirt) to be notified and respond to the event. + +The management app has the option of waiting for GUEST_PANICKED events, +and/or polling for guest-panicked RunState, to learn when the pvpanic +device has fired a panic event. + +The pvpanic device can be implemented as an ISA device (using IOPORT) or as a +PCI device. + +ISA Interface +------------- + +pvpanic exposes a single I/O port, by default 0x505. On read, the bits +recognized by the device are set. Software should ignore bits it doesn't +recognize. On write, the bits not recognized by the device are ignored. +Software should set only bits both itself and the device recognize. + +Bit Definition +-------------- +bit 0: a guest panic has happened and should be processed by the host +bit 1: a guest panic has happened and will be handled by the guest; + the host should record it or report it, but should not affect + the execution of the guest. + +PCI Interface +------------- + +The PCI interface is similar to the ISA interface except that it uses an MMIO +address space provided by its BAR0, 1 byte long. Any machine with a PCI bus +can enable a pvpanic device by adding '-device pvpanic-pci' to the command +line. + +ACPI Interface +-------------- + +pvpanic device is defined with ACPI ID "QEMU0001". Custom methods: + +RDPT: To determine whether guest panic notification is supported. +Arguments: None +Return: Returns a byte, with the same semantics as the I/O port + interface. + +WRPT: To send a guest panic event +Arguments: Arg0 is a byte to be written, with the same semantics as + the I/O interface. +Return: None + +The ACPI device will automatically refer to the right port in case it +is modified. diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt new file mode 100644 index 000000000..1857b3170 --- /dev/null +++ b/docs/specs/rocker.txt @@ -0,0 +1,1014 @@ +Rocker Network Switch Register Programming Guide +Copyright (c) Scott Feldman <sfeldma@gmail.com> +Copyright (c) Neil Horman <nhorman@tuxdriver.com> +Version 0.11, 12/29/2014 + +LICENSE +======= + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 2 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +SECTION 1: Introduction +======================= + +Overview +-------- + +This document describes the hardware/software interface for the Rocker switch +device. The intended audience is authors of OS drivers and device emulation +software. + +Notations and Conventions +------------------------- + +o In register descriptions, [n:m] indicates a range from bit n to bit m, +inclusive. +o Use of leading 0x indicates a hexadecimal number. +o Use of leading 0b indicates a binary number. +o The use of RSVD or Reserved indicates that a bit or field is reserved for +future use. +o Field width is in bytes, unless otherwise noted. +o Register are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear +on read +o TLV values in network-byte-order are designated with (N). + + +SECTION 2: PCI Configuration Registers +====================================== + +PCI Configuration Space +----------------------- + +Each switch instance registers as a PCI device with PCI configuration space: + + offset width description value + --------------------------------------------- + 0x0 2 Vendor ID 0x1b36 + 0x2 2 Device ID 0x0006 + 0x4 4 Command/Status + 0x8 1 Revision ID 0x01 + 0x9 3 Class code 0x2800 + 0xC 1 Cache line size + 0xD 1 Latency timer + 0xE 1 Header type + 0xF 1 Built-in self test + 0x10 4 Base address low + 0x14 4 Base address high + 0x18-28 Reserved + 0x2C 2 Subsystem vendor ID * + 0x2E 2 Subsystem ID * + 0x30-38 Reserved + 0x3C 1 Interrupt line + 0x3D 1 Interrupt pin 0x00 + 0x3E 1 Min grant 0x00 + 0x3D 1 Max latency 0x00 + 0x40 1 TRDY timeout + 0x41 1 Retry count + 0x42 2 Reserved + + +* Assigned by sub-system implementation + +SECTION 3: Memory-Mapped Register Space +======================================= + +There are two memory-mapped BARs. BAR0 maps device register space and is +0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in +size, allowing for 256 MSI-X vectors. + +All registers are 4 or 8 bytes long. It is assumed host software will access 4 +byte registers with one 4-byte access, and 8 byte registers with either two +4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses, +access must be lower and then upper 4-bytes, in that order. + +BAR0 device register space is organized as follows: + + offset description + ------------------------------------------------------ + 0x0000-0x000f Bogus registers to catch misbehaving + drivers. Writes do nothing. Reads + back as 0xDEADBABE. + 0x0010-0x00ff Test registers + 0x0300-0x03ff General purpose registers + 0x1000-0x1fff Descriptor control + +Holes in register space are reserved. Writes to reserved registers do nothing. +Reads to reserved registers read back as 0. + +No fancy stuff like write-combining is enabled on any of the registers. + +BAR1 MSI-X register space is organized as follows: + + offset description + ------------------------------------------------------ + 0x0000-0x0fff MSI-X vector table (256 vectors total) + 0x1000-0x1fff MSI-X PBA table + + +SECTION 4: Interrupts, DMA, and Endianness +========================================== + +PCI Interrupts +-------------- + +The device supports only MSI-X interrupts. BAR1 memory-mapped region contains +the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors. + +The vector assignment is: + + vector description + ----------------------------------------------------- + 0 Command descriptor ring completion + 1 Event descriptor ring completion + 2 Test operation completion + 3 RSVD + 4-255 Tx and Rx descriptor ring completion + Tx vector is even + Rx vector is odd + +A MSI-X vector table entry is 16 bytes: + + field offset width description + ------------------------------------------------------------- + lower_addr 0x0 4 [31:2] message address[31:2] + [1:0] Rsvd (4 byte alignment + required) + upper_addr 0x4 4 [31:19] Rsvd + [14:0] message address[46:32] + data 0x8 4 message data[31:0] + control 0xc 4 [31:1] Rsvd + [0] mask (0 = enable, + 1 = masked) + +Software should install the Interrupt Service Routine (ISR) before any ports +are enabled or any commands are issued on the command ring. + +DMA Operations +-------------- + +DMA operations are used for packet DMA to/from the CPU, command and event +processing. Command processing includes statistical counters and table dumps, +table insertion/deletion, and more. Event processing provides an async +notification method for device-originating events. Each DMA operation has a +set of control registers to manage a descriptor ring. The descriptor rings are +allocated from contiguous host DMA-able memory and registers specify the rings +base address, size and current head and tail indices. Software always writes +the head, and hardware always writes the tail. + +The higher-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion +of a descriptor. Software will clear this bit when posting a descriptor to the +ring, and hardware will set this bit when the descriptor is complete. + +Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries. +Descriptor rings' base address must be 8-byte aligned. Descriptors must be +packed within ring. Each descriptor in each ring must also be aligned on an 8 +byte boundary. Each descriptor ring will have these registers: + + DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W) + DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W) + DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W) + DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R) + DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W) + DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W) + DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W) + +Where x is descriptor ring index: + + index ring + -------------------- + 0 CMD + 1 EVENT + 2 TX (port 0) + 3 RX (port 0) + 4 TX (port 1) + 5 RX (port 1) + . + . + . + 124 TX (port 61) + 125 RX (port 61) + 126 Resv + 127 Resv + +Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be +written past TAIL. To do so would wrap the ring. An empty ring is when HEAD +== TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and +TAIL increment and modulo wrap at the ring size. + +CTRL register bits: + + bit name description + ------------------------------------------------------------------------ + [0] CTRL_RESET Reset the descriptor ring + [1:31] Reserved + +All descriptor types share some common fields: + + field width description + ------------------------------------------------------------------- + DMA_DESC_BUF_ADDR 8 Phys addr of desc payload, 8-byte + aligned + DMA_DESC_COOKIE 8 Desc cookie for completion matching, + upper-most bit is reserved + DMA_DESC_BUF_SIZE 2 Desc payload size in bytes + DMA_DESC_TLV_SIZE 2 Desc payload total size in bytes + used for TLVs. Must be <= + DMA_DESC_BUF_SIZE. + DMA_DESC_COMP_ERR 2 Completion status of associated + desc payload. High order bit is + clear on new descs, toggled by + hw for completed items. + +To support forward- and backward-compatibility, descriptor and completion +payloads are specified in TLV format. Fields are packed with Type=field name, +Length=field length, and Value=field value. Software will ignore unknown fields +filled in by the switch. Likewise, the switch will ignore unknown fields +filled in by software. + +Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The +value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is: + + field width description + ----------------------------- + type 4 TLV type + len 2 TLV value length + pad 2 Reserved + +The alignment requirements for descriptors and TLVs are to avoid unaligned +access exceptions in software. Note that the payload for each TLV is also +8 byte aligned. + +Figure 1 shows an example descriptor buffer with two TLVs. + + <------- 8 bytes -------> + + 8-byte +––––+ +–––––––––––+–––––+–––––+ +–+ + align | type | len | pad | TLV#1 hdr | + +–––––––––––+–––––+–––––+ (len=22) | + | | | + | value | TVL#1 value | + | | (padded to 8-byte | + | +–––––+ alignment) | + | |/////| | + 8-byte +––––+ +–––––––––––+–––––––––––+ | + align | type | len | pad | TLV#2 hdr DESC_BUF_SIZE + +–––––+–––––+–––––+–––––+ (len=2) | + |value|/////////////////| TLV#2 value | + +–––––+/////////////////| | + |///////////////////////| | + |///////////////////////| | + |///////////////////////| | + |////////unused/////////| | + |////////space//////////| | + |///////////////////////| | + |///////////////////////| | + |///////////////////////| | + +–––––––––––––––––––––––+ +–+ + + fig. 1 + +TLVs can be nested within the NEST TLV type. + +Interrupt credits +^^^^^^^^^^^^^^^^^ + +MSI-X vectors used for descriptor ring completions use a credit mechanism for +efficient device, PCIe bus, OS and driver operations. Each descriptor ring has +a credit count which represents the number of outstanding descriptors to be +processed by the driver. As the device marks descriptors complete, the credit +count is incremented. As the driver processes those outstanding descriptors, +it returns credits back to the device. This way, the device knows the driver's +progress and can make decisions about when to fire the next interrupt or not. +When the credit count is zero, and the first descriptors are posted for the +driver, a single interrupt is fired. Once the interrupt is fired, the +interrupt is disabled (auto-masked*). In response to the interrupt, the driver +will process descriptors and PIO write a returned credit value for that +descriptor ring. If the driver returns all credits (the driver caught up with +the device and there is no outstanding work), then the interrupt is unmasked, +but not fired. If only partial credits are returned, the interrupt remains +masked but the device generates an interrupt, signaling the driver that more +outstanding work is available. + +(* this masking is unrelated to the MSI-X interrupt mask register) + +Endianness +---------- + +Device registers are hard-coded to little-endian (LE). The driver should +convert to/from host endianness to LE for device register accesses. + +Descriptors are LE. Descriptor buffer TLVs will have LE type and length +fields, but the value field can either be LE or network-byte-order, depending +on context. TLV values containing network packet data will be in network-byte +order. A TLV value containing a field or mask used to compare against network +packet data is network-byte order. For example, flow match fields (and masks) +are network-byte-order since they're matched directly, byte-by-byte, against +network packet data. All non-network-packet TLV multi-byte values will be LE. + +TLV values in network-byte-order are designated with (N). + + +SECTION 5: Test Registers +========================= + +Rocker has several test registers to support troubleshooting register access, +interrupt generation, and DMA operations: + + TEST_REG, offset 0x0010, 32-bit (R/W) + TEST_REG64, offset 0x0018, 64-bit (R/W) + TEST_IRQ, offset 0x0020, 32-bit (R/W) + TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W) + TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W) + TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W) + +Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last +value written to the register. The 32-bit and 64-bit versions are for testing +32-bit and 64-bit host accesses. + +A vector can be written to TEST_IRQ and the device will generate an interrupt +for that vector. + +To test basic DMA operations, allocate a DMA-able host buffer and put the +buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to +TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are: + + operation value description + ----------------------------------------------------------- + TEST_DMA_CTRL_CLEAR 1 clear buffer + TEST_DMA_CTRL_FILL 2 fill buffer bytes with 0x96 + TEST_DMA_CTRL_INVERT 4 invert bytes in buffer + +Various buffer address and sizes should be tested to verify no address boundary +issue exists. In particular, buffers that start on odd-8-byte boundary and/or +span multiple PAGE sizes should be tested. + + +SECTION 6: Ports +================ + +Physical and Logical Ports +------------------------------------ + +The switch supports up to 62 physical (front-panel) ports. Register +PORT_PHYS_COUNT returns the actual number of physical ports available: + + PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R) + +In addition to front-panel ports, the switch supports logical ports for +tunnels. + +Front-panel ports and logical tunnel ports are mapped into a single 32-bit port +space. A special CPU port is assigned port 0. The front-panel ports are +mapped to ports 1-62. A special loopback port is assigned port 63. Logical +tunnel ports are assigned ports 0x0001000-0x0001ffff. +To summarize the port assignments: + + port mapping + ------------------------------------------------------- + 0 CPU port (for packets to/from host CPU) + 1-62 front-panel physical ports + 63 loopback port + 64-0x0000ffff RSVD + 0x00010000-0x0001ffff logical tunnel ports + 0x00020000-0xffffffff RSVD + +Physical Port Mode +------------------ + +Switch front-panel ports operate in a mode. Currently, the only mode is +OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA) +Abstract Switch Specification, Version 1.0, from Broadcom Corporation. To +set/get the mode for front-panel ports, see port settings, below. + +Port Settings +------------- + +Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS: + + PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R) + + Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62 + read 1 for link UP and 0 for link DOWN for respective front-panel ports. + +Other properties for front-panel ports are available via DMA CMD descriptors: + + Get PORT_SETTINGS descriptor: + + field width description + ---------------------------------------------- + PORT_SETTINGS 2 CMD_GET + PPORT 4 Physical port # + + Get PORT_SETTINGS completion: + + field width description + ---------------------------------------------- + PPORT 4 Physical port # + SPEED 4 Current port interface speed, in Mbps + DUPLEX 1 1 = Full, 0 = Half + AUTONEG 1 1 = enabled, 0 = disabled + MACADDR 6 Port MAC address + MODE 1 0 = OF-DPA + LEARNING 1 MAC address learning on port + 1 = enabled + 0 = disabled + PHYS_NAME <var> Physical port name (string) + + Set PORT_SETTINGS descriptor: + + field width description + ---------------------------------------------- + PORT_SETTINGS 2 CMD_SET + PPORT 4 Physical port # + SPEED 4 Port interface speed, in Mbps + DUPLEX 1 1 = Full, 0 = Half + AUTONEG 1 1 = enabled, 0 = disabled + MACADDR 6 Port MAC address + MODE 1 0 = OF-DPA + +Port Enable +----------- + +Front-panel ports are initially disabled, which means port ingress and egress +packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE: + + PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W) + + Value is bitmap of first 64 ports. Bits 0 and 63 are ignored + and always read as 0. Write 1 to enable port; write 0 to disable it. + Default is 0. + + +SECTION 7: Switch Control +========================= + +This section covers switch-wide register settings. + +Control +------- + +This register is used for low level control of the switch. + + CONTROL: offset 0x0300, 32-bit, (W) + + bit name description + ------------------------------------------------------------------------ + [0] CONTROL_RESET If set, device will perform reset + [1:31] Reserved + +Switch ID +--------- + +The switch has a SWITCH_ID to be used by software to uniquely identify the +switch: + + SWITCH_ID: offset 0x0320, 64-bit, (R) + + Value is opaque to switch software and no special encoding is implied. + + +SECTION 8: Events +================= + +Non-I/O asynchronous events from the device are notified to the host using the +event ring. The TLV structure for events is: + + field width description + --------------------------------------------------- + TYPE 4 Event type, one of: + 1: LINK_CHANGED + 2: MAC_VLAN_SEEN + INFO <nest> Event info (details below) + +Link Changed Event +------------------ + +When link status changes on a physical port, this event is generated. + + field width description + --------------------------------------------------- + INFO <nest> + PPORT 4 Physical port + LINKUP 1 Link status: + 0: down + 1: up + +MAC VLAN Seen Event +------------------- + +When a packet ingresses on a port and the source MAC/VLAN isn't known to the +device, the device will generate this event. In response to the event, the +driver should install to the device the MAC/VLAN on the port into the bridge +table. Once installed, the MAC/VLAN is known on the port and this event will +no longer be generated. + + field width description + --------------------------------------------------- + INFO <nest> + PPORT 4 Physical port + MAC 6 MAC address + VLAN 2 VLAN ID + + +SECTION 9: CPU Packet Processing +================================ + +Ingress packets directed to the host CPU for further processing are delivered +in the DMA RX ring. Likewise, host CPU originating packets destined to egress +on switch ports are scheduled by software using the DMA TX ring. + +Tx Packet Processing +-------------------- + +Software schedules packets for egress on switch ports using the DMA TX ring. A +TX descriptor buffer describes the packet location and size in host DMA-able +memory, the destination port, and any hardware-offload functions (such as L3 +payload checksum offload). Software then bumps the descriptor head to signal +hardware of new Tx work. In response, hardware will DMA read Tx descriptors up +to head, DMA read descriptor buffer and packet data, perform offloading +functions, and finally frame packet on wire (network). Once packet processing +is complete, hardware will writeback status to descriptor(s) to signal to +software that Tx is complete and software resources (e.g. skb) backing packet +can be released. + +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A +TLV is used for each packet fragment. + + pkt frag 1 + +–––––––+ +–+ + +–––+ | | + desc buf | | | | + +––––––––+ | | | | + Tx ring +–––+ +–––––+ | | | + +–––––––––+ | | TLVs | +–––––––+ | + | +–––+ +––––––––+ pkt frag 2 | + | desc 0 | | +–––––+ +–––––––+ | + +–––––––––+ | TLVs | +–––+ | | + head+–+ | +––––––––+ | | | + | desc 1 | | +–––––+ +–––––––+ |pkt + +–––––––––+ | TLVs | | | + | | +––––––––+ | pkt frag 3 | + | | | +–––––––+ | + +–––––––––+ +–––+ | | + | | | | | + | | | | | + +–––––––––+ | | | + | | | | | + | | | | | + +–––––––––+ | | | + | | +–––––––+ +–+ + | | + +–––––––––+ + + fig 2. + +The TLVs for Tx descriptor buffer are: + + field width description + --------------------------------------------------------------------- + PPORT 4 Destination physical port # + TX_OFFLOAD 1 Hardware offload modes: + 0: no offload + 1: insert IP csum (ipv4 only) + 2: insert TCP/UDP csum + 3: L3 csum calc and insert + into csum offset (TX_L3_CSUM_OFF) + 16-bit 1's complement csum value. + IPv4 pseudo-header and IP + already calculated by OS + and inserted. + 4: TSO (TCP Segmentation Offload) + TX_L3_CSUM_OFF 2 For L3 csum offload mode, the offset, + from the beginning of the packet, + of the csum field in the L3 header + TX_TSO_MSS 2 For TSO offload mode, the + Maximum Segment Size in bytes + TX_TSO_HDR_LEN 2 For TSO offload mode, the + length of ethernet, IP, and + TCP/UDP headers, including IP + and TCP options. + TX_FRAGS <array> Packet fragments + TX_FRAG <nest> Packet fragment + TX_FRAG_ADDR 8 DMA address of packet fragment + TX_FRAG_LEN 2 Packet fragment length + +Possible status return codes in descriptor on completion are: + + DESC_COMP_ERR reason + -------------------------------------------------------------------- + 0 OK + -ROCKER_ENXIO address or data read err on desc buf or packet + fragment + -ROCKER_EINVAL bad pport or TSO or csum offloading error + -ROCKER_ENOMEM no memory for internal staging tx fragment + +Rx Packet Processing +-------------------- + +For packets ingressing on switch ports that are not forwarded by the switch but +rather directed to the host CPU for further processing are delivered in the DMA +RX ring. Rx descriptor buffers are allocated by software and placed on the +ring. Hardware will fill Rx descriptor buffers with packet data, write the +completion, and signal to software that a new packet is ready. Since Rx packet +size is not known a-priori, the Rx descriptor buffer must be allocated for +worst-case packet size. A single Rx descriptor will contain the entire Rx +packet data in one RX_FRAG. Other Rx TLVs describe and hardware offloads +performed on the packet, such as checksum validation. + +The TLVs for Rx descriptor buffer are: + + field width description + --------------------------------------------------- + PPORT 4 Source physical port # + RX_FLAGS 2 Packet parsing flags: + (1 << 0): IPv4 packet + (1 << 1): IPv6 packet + (1 << 2): csum calculated + (1 << 3): IPv4 csum good + (1 << 4): IP fragment + (1 << 5): TCP packet + (1 << 6): UDP packet + (1 << 7): TCP/UDP csum good + (1 << 8): Offload forward + RX_CSUM 2 IP calculated checksum: + IPv4: IP payload csum + IPv6: header and payload csum + (Only valid is RX_FLAGS:csum calc is set) + RX_FRAG_ADDR 8 DMA address of packet fragment + RX_FRAG_MAX_LEN 2 Packet maximum fragment length + RX_FRAG_LEN 2 Actual packet fragment length after receive + +Offload forward RX_FLAG indicates the device has already forwarded the packet +so the host CPU should not also forward the packet. + +Possible status return codes in descriptor on completion are: + + DESC_COMP_ERR reason + -------------------------------------------------------------------- + 0 OK + -ROCKER_ENXIO address or data read err on desc buf + -ROCKER_ENOMEM no memory for internal staging desc buf + -ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain + packet data TLV and other TLVs. + + +SECTION 10: OF-DPA Mode +====================== + +OF-DPA mode allows the switch to offload flow packet processing functions to +hardware. An OpenFlow controller would communicate with an OpenFlow agent +installed on the switch. The OpenFlow agent would (directly or indirectly) +communicate with the Rocker switch driver, which in turn would program switch +hardware with flow functionality, as defined in OF-DPA. The block diagram is: + + +–––––––––––––––----–––+ + | OF | + | Remote Controller | + +––––––––+––----–––––––+ + | + | + +––––––––+–––––––––+ + | OF | + | Local Agent | + +––––––––––––––––––+ + | | + | Rocker Driver | + +––––––––––––––––––+ + <this spec> + +––––––––––––––––––+ + | | + | Rocker Switch | + +––––––––––––––––––+ + +To participate in flow functions, ports must be configure for OF-DPA mode +during switch initialization. + +OF-DPA Flow Table Interface +--------------------------- + +There are commands to add, modify, delete, and get stats of flow table entries. +The commands are issued using the DMA CMD descriptor ring. The following +commands are defined: + + CMD_ADD: add an entry to flow table + CMD_MOD: modify an entry in flow table + CMD_DEL: delete an entry from flow table + CMD_GET_STATS: get stats for flow entry + +TLVs for add and modify commands are: + + field width description + ---------------------------------------------------- + OF_DPA_CMD 2 CMD_[ADD|MOD] + OF_DPA_TBL 2 Flow table ID + 0: ingress port + 10: vlan + 20: termination mac + 30: unicast routing + 40: multicast routing + 50: bridging + 60: ACL policy + OF_DPA_PRIORITY 4 Flow priority + OF_DPA_HARDTIME 4 Hard timeout for flow + OF_DPA_IDLETIME 4 Idle timeout for flow + OF_DPA_COOKIE 8 Cookie + +Additional TLVs based on flow table ID: + +Table ID 0: ingress port + + field width description + ---------------------------------------------------- + OF_DPA_IN_PPORT 4 ingress physical port number + OF_DPA_GOTO_TBL 2 goto table ID; zero to drop + +Table ID 10: vlan + + field width description + ---------------------------------------------------- + OF_DPA_IN_PPORT 4 ingress physical port number + OF_DPA_VLAN_ID 2 (N) vlan ID + OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask + OF_DPA_GOTO_TBL 2 goto table ID; zero to drop + OF_DPA_NEW_VLAN_ID 2 (N) new vlan ID + +Table ID 20: termination mac + + field width description + ---------------------------------------------------- + OF_DPA_IN_PPORT 4 ingress physical port number + OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask + OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd + OF_DPA_DST_MAC 6 (N) destination MAC + OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask + OF_DPA_VLAN_ID 2 (N) vlan ID + OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask + OF_DPA_GOTO_TBL 2 only acceptable values are + unicast or multicast routing + table IDs + OF_DPA_OUT_PPORT 2 if specified, must be + controller, set zero otherwise + +Table ID 30: unicast routing + + field width description + ---------------------------------------------------- + OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd + OF_DPA_DST_IP 4 (N) destination IPv4 address. + Must be unicast address + OF_DPA_DST_IP_MASK 4 (N) IP mask. Must be prefix mask + OF_DPA_DST_IPV6 16 (N) destination IPv6 address. + Must be unicast address + OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask. Must be prefix mask + OF_DPA_GOTO_TBL 2 goto table ID; zero to drop + OF_DPA_GROUP_ID 4 data for GROUP action must + be an L3 Unicast group entry + +Table ID 40: multicast routing + + field width description + ---------------------------------------------------- + OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd + OF_DPA_VLAN_ID 2 (N) vlan ID + OF_DPA_SRC_IP 4 (N) source IPv4. Optional, + can contain IPv4 address, + must be completely masked + if not used + OF_DPA_SRC_IP_MASK 4 (N) IP Mask + OF_DPA_DST_IP 4 (N) destination IPv4 address. + Must be multicast address + OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. + Can contain IPv6 address, + must be completely masked + if not used + OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask. + OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must + be multicast address + Must be multicast address + OF_DPA_GOTO_TBL 2 goto table ID; zero to drop + OF_DPA_GROUP_ID 4 data for GROUP action must + be an L3 multicast group entry + +Table ID 50: bridging + + field width description + ---------------------------------------------------- + OF_DPA_VLAN_ID 2 (N) vlan ID + OF_DPA_TUNNEL_ID 4 tunnel ID + OF_DPA_DST_MAC 6 (N) destination MAC + OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask + OF_DPA_GOTO_TBL 2 goto table ID; zero to drop + OF_DPA_GROUP_ID 4 data for GROUP action must + be a L2 Interface, L2 + Multicast, L2 Flood, + or L2 Overlay group entry + as appropriate + OF_DPA_TUNNEL_LPORT 4 unicast Tenant Bridging + flows specify a tunnel + logical port ID + OF_DPA_OUT_PPORT 2 data for OUTPUT action, + restricted to CONTROLLER, + set to 0 otherwise + +Table ID 60: acl policy + + field width description + ---------------------------------------------------- + OF_DPA_IN_PPORT 4 ingress physical port number + OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask + OF_DPA_ETHERTYPE 2 (N) ethertype + OF_DPA_VLAN_ID 2 (N) vlan ID + OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask + OF_DPA_VLAN_PCP 2 (N) vlan Priority Code Point + OF_DPA_VLAN_PCP_MASK 2 (N) vlan Priority Code Point mask + OF_DPA_SRC_MAC 6 (N) source MAC + OF_DPA_SRC_MAC_MASK 6 (N) source MAC mask + OF_DPA_DST_MAC 6 (N) destination MAC + OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask + OF_DPA_TUNNEL_ID 4 tunnel ID + OF_DPA_SRC_IP 4 (N) source IPv4. Optional, + can contain IPv4 address, + must be completely masked + if not used + OF_DPA_SRC_IP_MASK 4 (N) IP Mask + OF_DPA_DST_IP 4 (N) destination IPv4 address. + Must be multicast address + OF_DPA_DST_IP_MASK 4 (N) IP Mask + OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. + Can contain IPv6 address, + must be completely masked + if not used + OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask + OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must + be multicast address. + OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask + OF_DPA_SRC_ARP_IP 4 (N) source IPv4 address in the ARP + payload. Only used if ethertype + == 0x0806. + OF_DPA_SRC_ARP_IP_MASK 4 (N) IP Mask + OF_DPA_IP_PROTO 1 IP protocol + OF_DPA_IP_PROTO_MASK 1 IP protocol mask + OF_DPA_IP_DSCP 1 DSCP + OF_DPA_IP_DSCP_MASK 1 DSCP mask + OF_DPA_IP_ECN 1 ECN + OF_DPA_IP_ECN_MASK 1 ECN mask + OF_DPA_L4_SRC_PORT 2 (N) L4 source port, only for + TCP, UDP, or SCTP + OF_DPA_L4_SRC_PORT_MASK 2 (N) L4 source port mask + OF_DPA_L4_DST_PORT 2 (N) L4 source port, only for + TCP, UDP, or SCTP + OF_DPA_L4_DST_PORT_MASK 2 (N) L4 source port mask + OF_DPA_ICMP_TYPE 1 ICMP type, only if IP + protocol is 1 + OF_DPA_ICMP_TYPE_MASK 1 ICMP type mask + OF_DPA_ICMP_CODE 1 ICMP code + OF_DPA_ICMP_CODE_MASK 1 ICMP code mask + OF_DPA_IPV6_LABEL 4 (N) IPv6 flow label + OF_DPA_IPV6_LABEL_MASK 4 (N) IPv6 flow label mask + OF_DPA_GROUP_ID 4 data for GROUP action + OF_DPA_QUEUE_ID_ACTION 1 write the queue ID + OF_DPA_NEW_QUEUE_ID 1 queue ID + OF_DPA_VLAN_PCP_ACTION 1 write the VLAN priority + OF_DPA_NEW_VLAN_PCP 1 VLAN priority + OF_DPA_IP_DSCP_ACTION 1 write the DSCP + OF_DPA_NEW_IP_DSCP 1 new DSCP + OF_DPA_TUNNEL_LPORT 4 restrct to valid tunnel + logical port, set to 0 + otherwise. + OF_DPA_OUT_PPORT 2 data for OUTPUT action, + restricted to CONTROLLER, + set to 0 otherwise + OF_DPA_CLEAR_ACTIONS 4 if 1 packets matching flow are + dropped (all other instructions + ignored) + +TLVs for flow delete and get stats command are: + + field width description + --------------------------------------------------- + OF_DPA_CMD 2 CMD_[DEL|GET_STATS] + OF_DPA_COOKIE 8 Cookie + +On completion of get stats command, the descriptor buffer is written back with +the following TLVs: + + field width description + --------------------------------------------------- + OF_DPA_STAT_DURATION 4 Flow duration + OF_DPA_STAT_RX_PKTS 8 Received packets + OF_DPA_STAT_TX_PKTS 8 Transmit packets + +Possible status return codes in descriptor on completion are: + + DESC_COMP_ERR command reason + -------------------------------------------------------------------- + 0 all OK + -ROCKER_EFAULT all head or tail index outside + of ring + -ROCKER_ENXIO all address or data read err on + desc buf + -ROCKER_EMSGSIZE GET_STATS cmd descriptor buffer wasn't + big enough to contain write-back + TLVs + -ROCKER_EINVAL all invalid parameters passed in + -ROCKER_EEXIST ADD entry already exists + -ROCKER_ENOSPC ADD no space left in flow table + -ROCKER_ENOENT MOD|DEL|GET_STATS cookie invalid + +Group Table Interface +--------------------- + +There are commands to add, modify, delete, and get stats of group table +entries. The commands are issued using the DMA CMD descriptor ring. The +following commands are defined: + + CMD_ADD: add an entry to group table + CMD_MOD: modify an entry in group table + CMD_DEL: delete an entry from group table + CMD_GET_STATS: get stats for group entry + +TLVs for add and modify commands are: + + field width description + ----------------------------------------------------------- + FLOW_GROUP_CMD 2 CMD_[ADD|MOD] + FLOW_GROUP_ID 2 Flow group ID + FLOW_GROUP_TYPE 1 Group type: + 0: L2 interface + 1: L2 rewrite + 2: L3 unicast + 3: L2 multicast + 4: L2 flood + 5: L3 interface + 6: L3 multicast + 7: L3 ECMP + 8: L2 overlay + FLOW_VLAN_ID 2 Vlan ID (types 0, 3, 4, 6) + FLOW_L2_PORT 2 Port (types 0) + FLOW_INDEX 4 Index (all types but 0) + FLOW_OVERLAY_TYPE 1 Overlay sub-type (type 8): + 0: Flood unicast tunnel + 1: Flood multicast tunnel + 2: Multicast unicast tunnel + 3: Multicast multicast tunnel + FLOW_GROUP_ACTION nest + FLOW_GROUP_ID 2 next group ID in chain (all + types except 0) + FLOW_OUT_PORT 4 egress port (types 0, 8) + FLOW_POP_VLAN_TAG 1 strip outer VLAN tag (type 1 + only) + FLOW_VLAN_ID 2 (types 1, 5) + FLOW_SRC_MAC 6 (types 1, 2, 5) + FLOW_DST_MAC 6 (types 1, 2) + +TLVs for flow delete and get stats command are: + + field width description + ----------------------------------------------------------- + FLOW_GROUP_CMD 2 CMD_[DEL|GET_STATS] + FLOW_GROUP_ID 2 Flow group ID + +On completion of get stats command, the descriptor buffer is written back with +the following TLVs: + + field width description + --------------------------------------------------- + FLOW_GROUP_ID 2 Flow group ID + FLOW_STAT_DURATION 4 Flow duration + FLOW_STAT_REF_COUNT 4 Flow reference count + FLOW_STAT_BUCKET_COUNT 4 Flow bucket count + +Possible status return codes in descriptor on completion are: + + DESC_COMP_ERR command reason + -------------------------------------------------------------------- + 0 all OK + -ROCKER_EFAULT all head or tail index outside + of ring + -ROCKER_ENXIO all address or data read err on + desc buf + -ROCKER_ENOSPC GET_STATS cmd descriptor buffer wasn't + big enough to contain write-back + TLVs + -ROCKER_EINVAL ADD|MOD invalid parameters passed in + -ROCKER_EEXIST ADD entry already exists + -ROCKER_ENOSPC ADD no space left in flow table + -ROCKER_ENOENT MOD|DEL|GET_STATS group ID invalid + -ROCKER_EBUSY DEL group reference count non-zero + -ROCKER_ENODEV ADD next group ID doesn't exist + + + +References +========== + +[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification, +Version 1.0, from Broadcom Corporation, February 21, 2014. diff --git a/docs/specs/standard-vga.txt b/docs/specs/standard-vga.txt new file mode 100644 index 000000000..18f75f1b3 --- /dev/null +++ b/docs/specs/standard-vga.txt @@ -0,0 +1,81 @@ + +QEMU Standard VGA +================= + +Exists in two variants, for isa and pci. + +command line switches: + -vga std [ picks isa for -M isapc, otherwise pci ] + -device VGA [ pci variant ] + -device isa-vga [ isa variant ] + -device secondary-vga [ legacy-free pci variant ] + + +PCI spec +-------- + +Applies to the pci variant only for obvious reasons. + +PCI ID: 1234:1111 + +PCI Region 0: + Framebuffer memory, 16 MB in size (by default). + Size is tunable via vga_mem_mb property. + +PCI Region 1: + Reserved (so we have the option to make the framebuffer bar 64bit). + +PCI Region 2: + MMIO bar, 4096 bytes in size (qemu 1.3+) + +PCI ROM Region: + Holds the vgabios (qemu 0.14+). + + +The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER +instead of PCI_CLASS_DISPLAY_VGA. + + +IO ports used +------------- + +Doesn't apply to the legacy-free pci variant, use the MMIO bar instead. + +03c0 - 03df : standard vga ports +01ce : bochs vbe interface index port +01cf : bochs vbe interface data port (x86 only) +01d0 : bochs vbe interface data port + + +Memory regions used +------------------- + +0xe0000000 : Framebuffer memory, isa variant only. + +The pci variant used to mirror the framebuffer bar here, qemu 0.14+ +stops doing that (except when in -M pc-$old compat mode). + + +MMIO area spec +-------------- + +Likewise applies to the pci variant only for obvious reasons. + +0000 - 03ff : edid data blob. +0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1. + word access is supported, bytes are written + in little endia order (aka index port first), + so indexed registers can be updated with a + single mmio write (and thus only one vmexit). +0500 - 0515 : bochs dispi interface registers, mapped flat + without index/data ports. Use (index << 1) + as offset for (16bit) register access. + +0600 - 0607 : qemu extended registers. qemu 2.2+ only. + The pci revision is 2 (or greater) when + these registers are present. The registers + are 32bit. + 0600 : qemu extended register region size, in bytes. + 0604 : framebuffer endianness register. + - 0xbebebebe indicates big endian. + - 0x1e1e1e1e indicates little endian. diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst new file mode 100644 index 000000000..3be190343 --- /dev/null +++ b/docs/specs/tpm.rst @@ -0,0 +1,524 @@ +=============== +QEMU TPM Device +=============== + +Guest-side hardware interface +============================= + +TIS interface +------------- + +The QEMU TPM emulation implements a TPM TIS hardware interface +following the Trusted Computing Group's specification "TCG PC Client +Specific TPM Interface Specification (TIS)", Specification Version +1.3, 21 March 2013. (see the `TIS specification`_, or a later version +of it). + +The TIS interface makes a memory mapped IO region in the area +0xfed40000-0xfed44fff available to the guest operating system. + +QEMU files related to TPM TIS interface: + - ``hw/tpm/tpm_tis_common.c`` + - ``hw/tpm/tpm_tis_isa.c`` + - ``hw/tpm/tpm_tis_sysbus.c`` + - ``hw/tpm/tpm_tis.h`` + +Both an ISA device and a sysbus device are available. The former is +used with pc/q35 machine while the latter can be instantiated in the +Arm virt machine. + +CRB interface +------------- + +QEMU also implements a TPM CRB interface following the Trusted +Computing Group's specification "TCG PC Client Platform TPM Profile +(PTP) Specification", Family "2.0", Level 00 Revision 01.03 v22, May +22, 2017. (see the `CRB specification`_, or a later version of it) + +The CRB interface makes a memory mapped IO region in the area +0xfed40000-0xfed40fff (1 locality) available to the guest +operating system. + +QEMU files related to TPM CRB interface: + - ``hw/tpm/tpm_crb.c`` + +SPAPR interface +--------------- + +pSeries (ppc64) machines offer a tpm-spapr device model. + +QEMU files related to the SPAPR interface: + - ``hw/tpm/tpm_spapr.c`` + +fw_cfg interface +================ + +The bios/firmware may read the ``"etc/tpm/config"`` fw_cfg entry for +configuring the guest appropriately. + +The entry of 6 bytes has the following content, in little-endian: + +.. code-block:: c + + #define TPM_VERSION_UNSPEC 0 + #define TPM_VERSION_1_2 1 + #define TPM_VERSION_2_0 2 + + #define TPM_PPI_VERSION_NONE 0 + #define TPM_PPI_VERSION_1_30 1 + + struct FwCfgTPMConfig { + uint32_t tpmppi_address; /* PPI memory location */ + uint8_t tpm_version; /* TPM version */ + uint8_t tpmppi_version; /* PPI version */ + }; + +ACPI interface +============== + +The TPM device is defined with ACPI ID "PNP0C31". QEMU builds a SSDT +and passes it into the guest through the fw_cfg device. The device +description contains the base address of the TIS interface 0xfed40000 +and the size of the MMIO area (0x5000). In case a TPM2 is used by +QEMU, a TPM2 ACPI table is also provided. The device is described to +be used in polling mode rather than interrupt mode primarily because +no unused IRQ could be found. + +To support measurement logs to be written by the firmware, +e.g. SeaBIOS, a TCPA table is implemented. This table provides a 64kb +buffer where the firmware can write its log into. For TPM 2 only a +more recent version of the TPM2 table provides support for +measurements logs and a TCPA table does not need to be created. + +The TCPA and TPM2 ACPI tables follow the Trusted Computing Group +specification "TCG ACPI Specification" Family "1.2" and "2.0", Level +00 Revision 00.37. (see the `ACPI specification`_, or a later version +of it) + +ACPI PPI Interface +------------------ + +QEMU supports the Physical Presence Interface (PPI) for TPM 1.2 and +TPM 2. This interface requires ACPI and firmware support. (see the +`PPI specification`_) + +PPI enables a system administrator (root) to request a modification to +the TPM upon reboot. The PPI specification defines the operation +requests and the actions the firmware has to take. The system +administrator passes the operation request number to the firmware +through an ACPI interface which writes this number to a memory +location that the firmware knows. Upon reboot, the firmware finds the +number and sends commands to the TPM. The firmware writes the TPM +result code and the operation request number to a memory location that +ACPI can read from and pass the result on to the administrator. + +The PPI specification defines a set of mandatory and optional +operations for the firmware to implement. The ACPI interface also +allows an administrator to list the supported operations. In QEMU the +ACPI code is generated by QEMU, yet the firmware needs to implement +support on a per-operations basis, and different firmwares may support +a different subset. Therefore, QEMU introduces the virtual memory +device for PPI where the firmware can indicate which operations it +supports and ACPI can enable the ones that are supported and disable +all others. This interface lies in main memory and has the following +layout: + + +-------------+--------+--------+-------------------------------------------+ + | Field | Length | Offset | Description | + +=============+========+========+===========================================+ + | ``func`` | 0x100 | 0x000 | Firmware sets values for each supported | + | | | | operation. See defined values below. | + +-------------+--------+--------+-------------------------------------------+ + | ``ppin`` | 0x1 | 0x100 | SMI interrupt to use. Set by firmware. | + | | | | Not supported. | + +-------------+--------+--------+-------------------------------------------+ + | ``ppip`` | 0x4 | 0x101 | ACPI function index to pass to SMM code. | + | | | | Set by ACPI. Not supported. | + +-------------+--------+--------+-------------------------------------------+ + | ``pprp`` | 0x4 | 0x105 | Result of last executed operation. Set by | + | | | | firmware. See function index 5 for values.| + +-------------+--------+--------+-------------------------------------------+ + | ``pprq`` | 0x4 | 0x109 | Operation request number to execute. See | + | | | | 'Physical Presence Interface Operation | + | | | | Summary' tables in specs. Set by ACPI. | + +-------------+--------+--------+-------------------------------------------+ + | ``pprm`` | 0x4 | 0x10d | Operation request optional parameter. | + | | | | Values depend on operation. Set by ACPI. | + +-------------+--------+--------+-------------------------------------------+ + | ``lppr`` | 0x4 | 0x111 | Last executed operation request number. | + | | | | Copied from pprq field by firmware. | + +-------------+--------+--------+-------------------------------------------+ + | ``fret`` | 0x4 | 0x115 | Result code from SMM function. | + | | | | Not supported. | + +-------------+--------+--------+-------------------------------------------+ + | ``res1`` | 0x40 | 0x119 | Reserved for future use | + +-------------+--------+--------+-------------------------------------------+ + |``next_step``| 0x1 | 0x159 | Operation to execute after reboot by | + | | | | firmware. Used by firmware. | + +-------------+--------+--------+-------------------------------------------+ + | ``movv`` | 0x1 | 0x15a | Memory overwrite variable | + +-------------+--------+--------+-------------------------------------------+ + +The following values are supported for the ``func`` field. They +correspond to the values used by ACPI function index 8. + + +----------+-------------------------------------------------------------+ + | Value | Description | + +==========+=============================================================+ + | 0 | Operation is not implemented. | + +----------+-------------------------------------------------------------+ + | 1 | Operation is only accessible through firmware. | + +----------+-------------------------------------------------------------+ + | 2 | Operation is blocked for OS by firmware configuration. | + +----------+-------------------------------------------------------------+ + | 3 | Operation is allowed and physically present user required. | + +----------+-------------------------------------------------------------+ + | 4 | Operation is allowed and physically present user is not | + | | required. | + +----------+-------------------------------------------------------------+ + +The location of the table is given by the fw_cfg ``tpmppi_address`` +field. The PPI memory region size is 0x400 (``TPM_PPI_ADDR_SIZE``) to +leave enough room for future updates. + +QEMU files related to TPM ACPI tables: + - ``hw/i386/acpi-build.c`` + - ``include/hw/acpi/tpm.h`` + +TPM backend devices +=================== + +The TPM implementation is split into two parts, frontend and +backend. The frontend part is the hardware interface, such as the TPM +TIS interface described earlier, and the other part is the TPM backend +interface. The backend interfaces implement the interaction with a TPM +device, which may be a physical or an emulated device. The split +between the front- and backend devices allows a frontend to be +connected with any available backend. This enables the TIS interface +to be used with the passthrough backend or the swtpm backend. + +QEMU files related to TPM backends: + - ``backends/tpm.c`` + - ``include/sysemu/tpm.h`` + - ``include/sysemu/tpm_backend.h`` + +The QEMU TPM passthrough device +------------------------------- + +In case QEMU is run on Linux as the host operating system it is +possible to make the hardware TPM device available to a single QEMU +guest. In this case the user must make sure that no other program is +using the device, e.g., /dev/tpm0, before trying to start QEMU with +it. + +The passthrough driver uses the host's TPM device for sending TPM +commands and receiving responses from. Besides that it accesses the +TPM device's sysfs entry for support of command cancellation. Since +none of the state of a hardware TPM can be migrated between hosts, +virtual machine migration is disabled when the TPM passthrough driver +is used. + +Since the host's TPM device will already be initialized by the host's +firmware, certain commands, e.g. ``TPM_Startup()``, sent by the +virtual firmware for device initialization, will fail. In this case +the firmware should not use the TPM. + +Sharing the device with the host is generally not a recommended usage +scenario for a TPM device. The primary reason for this is that two +operating systems can then access the device's single set of +resources, such as platform configuration registers +(PCRs). Applications or kernel security subsystems, such as the Linux +Integrity Measurement Architecture (IMA), are not expecting to share +PCRs. + +QEMU files related to the TPM passthrough device: + - ``backends/tpm/tpm_passthrough.c`` + - ``backends/tpm/tpm_util.c`` + - ``include/sysemu/tpm_util.h`` + + +Command line to start QEMU with the TPM passthrough device using the host's +hardware TPM ``/dev/tpm0``: + +.. code-block:: console + + qemu-system-x86_64 -display sdl -accel kvm \ + -m 1024 -boot d -bios bios-256k.bin -boot menu=on \ + -tpmdev passthrough,id=tpm0,path=/dev/tpm0 \ + -device tpm-tis,tpmdev=tpm0 test.img + + +The following commands should result in similar output inside the VM +with a Linux kernel that either has the TPM TIS driver built-in or +available as a module: + +.. code-block:: console + + # dmesg | grep -i tpm + [ 0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1) + + # dmesg | grep TCPA + [ 0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS \ + BXPCTCPA 0000001 BXPC 00000001) + + # ls -l /dev/tpm* + crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0 + + # find /sys/devices/ | grep pcrs$ | xargs cat + PCR-00: 35 4E 3B CE 23 9F 38 59 ... + ... + PCR-23: 00 00 00 00 00 00 00 00 ... + +The QEMU TPM emulator device +---------------------------- + +The TPM emulator device uses an external TPM emulator called 'swtpm' +for sending TPM commands to and receiving responses from. The swtpm +program must have been started before trying to access it through the +TPM emulator with QEMU. + +The TPM emulator implements a command channel for transferring TPM +commands and responses as well as a control channel over which control +commands can be sent. (see the `SWTPM protocol`_ specification) + +The control channel serves the purpose of resetting, initializing, and +migrating the TPM state, among other things. + +The swtpm program behaves like a hardware TPM and therefore needs to +be initialized by the firmware running inside the QEMU virtual +machine. One necessary step for initializing the device is to send +the TPM_Startup command to it. SeaBIOS, for example, has been +instrumented to initialize a TPM 1.2 or TPM 2 device using this +command. + +QEMU files related to the TPM emulator device: + - ``backends/tpm/tpm_emulator.c`` + - ``backends/tpm/tpm_util.c`` + - ``include/sysemu/tpm_util.h`` + +The following commands start the swtpm with a UnixIO control channel over +a socket interface. They do not need to be run as root. + +.. code-block:: console + + mkdir /tmp/mytpm1 + swtpm socket --tpmstate dir=/tmp/mytpm1 \ + --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \ + --log level=20 + +Command line to start QEMU with the TPM emulator device communicating +with the swtpm (x86): + +.. code-block:: console + + qemu-system-x86_64 -display sdl -accel kvm \ + -m 1024 -boot d -bios bios-256k.bin -boot menu=on \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-tis,tpmdev=tpm0 test.img + +In case a pSeries machine is emulated, use the following command line: + +.. code-block:: console + + qemu-system-ppc64 -display sdl -machine pseries,accel=kvm \ + -m 1024 -bios slof.bin -boot menu=on \ + -nodefaults -device VGA -device pci-ohci -device usb-kbd \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-spapr,tpmdev=tpm0 \ + -device spapr-vscsi,id=scsi0,reg=0x00002000 \ + -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0 \ + -drive file=test.img,format=raw,if=none,id=drive-virtio-disk0 + +In case an Arm virt machine is emulated, use the following command line: + +.. code-block:: console + + qemu-system-aarch64 -machine virt,gic-version=3,accel=kvm \ + -cpu host -m 4G \ + -nographic -no-acpi \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-tis-device,tpmdev=tpm0 \ + -device virtio-blk-pci,drive=drv0 \ + -drive format=qcow2,file=hda.qcow2,if=none,id=drv0 \ + -drive if=pflash,format=raw,file=flash0.img,readonly=on \ + -drive if=pflash,format=raw,file=flash1.img + +In case SeaBIOS is used as firmware, it should show the TPM menu item +after entering the menu with 'ESC'. + +.. code-block:: console + + Select boot device: + 1. DVD/CD [ata1-0: QEMU DVD-ROM ATAPI-4 DVD/CD] + [...] + 5. Legacy option rom + + t. TPM Configuration + +The following commands should result in similar output inside the VM +with a Linux kernel that either has the TPM TIS driver built-in or +available as a module: + +.. code-block:: console + + # dmesg | grep -i tpm + [ 0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1) + + # dmesg | grep TCPA + [ 0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS \ + BXPCTCPA 0000001 BXPC 00000001) + + # ls -l /dev/tpm* + crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0 + + # find /sys/devices/ | grep pcrs$ | xargs cat + PCR-00: 35 4E 3B CE 23 9F 38 59 ... + ... + PCR-23: 00 00 00 00 00 00 00 00 ... + +Migration with the TPM emulator +=============================== + +The TPM emulator supports the following types of virtual machine +migration: + +- VM save / restore (migration into a file) +- Network migration +- Snapshotting (migration into storage like QoW2 or QED) + +The following command sequences can be used to test VM save / restore. + +In a 1st terminal start an instance of a swtpm using the following command: + +.. code-block:: console + + mkdir /tmp/mytpm1 + swtpm socket --tpmstate dir=/tmp/mytpm1 \ + --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \ + --log level=20 --tpm2 + +In a 2nd terminal start the VM: + +.. code-block:: console + + qemu-system-x86_64 -display sdl -accel kvm \ + -m 1024 -boot d -bios bios-256k.bin -boot menu=on \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-tis,tpmdev=tpm0 \ + -monitor stdio \ + test.img + +Verify that the attached TPM is working as expected using applications +inside the VM. + +To store the state of the VM use the following command in the QEMU +monitor in the 2nd terminal: + +.. code-block:: console + + (qemu) migrate "exec:cat > testvm.bin" + (qemu) quit + +At this point a file called ``testvm.bin`` should exists and the swtpm +and QEMU processes should have ended. + +To test 'VM restore' you have to start the swtpm with the same +parameters as before. If previously a TPM 2 [--tpm2] was saved, --tpm2 +must now be passed again on the command line. + +In the 1st terminal restart the swtpm with the same command line as +before: + +.. code-block:: console + + swtpm socket --tpmstate dir=/tmp/mytpm1 \ + --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \ + --log level=20 --tpm2 + +In the 2nd terminal restore the state of the VM using the additional +'-incoming' option. + +.. code-block:: console + + qemu-system-x86_64 -display sdl -accel kvm \ + -m 1024 -boot d -bios bios-256k.bin -boot menu=on \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-tis,tpmdev=tpm0 \ + -incoming "exec:cat < testvm.bin" \ + test.img + +Troubleshooting migration +------------------------- + +There are several reasons why migration may fail. In case of problems, +please ensure that the command lines adhere to the following rules +and, if possible, that identical versions of QEMU and swtpm are used +at all times. + +VM save and restore: + + - QEMU command line parameters should be identical apart from the + '-incoming' option on VM restore + + - swtpm command line parameters should be identical + +VM migration to 'localhost': + + - QEMU command line parameters should be identical apart from the + '-incoming' option on the destination side + + - swtpm command line parameters should point to two different + directories on the source and destination swtpm (--tpmstate dir=...) + (especially if different versions of libtpms were to be used on the + same machine). + +VM migration across the network: + + - QEMU command line parameters should be identical apart from the + '-incoming' option on the destination side + + - swtpm command line parameters should be identical + +VM Snapshotting: + - QEMU command line parameters should be identical + + - swtpm command line parameters should be identical + + +Besides that, migration failure reasons on the swtpm level may include +the following: + + - the versions of the swtpm on the source and destination sides are + incompatible + + - downgrading of TPM state may not be supported + + - the source and destination libtpms were compiled with different + compile-time options and the destination side refuses to accept the + state + + - different migration keys are used on the source and destination side + and the destination side cannot decrypt the migrated state + (swtpm ... --migration-key ... ) + + +.. _TIS specification: + https://trustedcomputinggroup.org/pc-client-work-group-pc-client-specific-tpm-interface-specification-tis/ + +.. _CRB specification: + https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/ + + +.. _ACPI specification: + https://trustedcomputinggroup.org/tcg-acpi-specification/ + +.. _PPI specification: + https://trustedcomputinggroup.org/resource/tcg-physical-presence-interface-specification/ + +.. _SWTPM protocol: + https://github.com/stefanberger/swtpm/blob/master/man/man3/swtpm_ioctls.pod diff --git a/docs/specs/virt-ctlr.txt b/docs/specs/virt-ctlr.txt new file mode 100644 index 000000000..24d38084f --- /dev/null +++ b/docs/specs/virt-ctlr.txt @@ -0,0 +1,26 @@ +Virtual System Controller +========================= + +This device is a simple interface defined for the pure virtual machine with no +hardware reference implementation to allow the guest kernel to send command +to the host hypervisor. + +The specification can evolve, the current state is defined as below. + +This is a MMIO mapped device using 256 bytes. + +Two 32bit registers are defined: + +1- the features register (read-only, address 0x00) + + This register allows the device to report features supported by the + controller. + The only feature supported for the moment is power control (0x01). + +2- the command register (write-only, address 0x04) + + This register allows the kernel to send the commands to the hypervisor. + The implemented commands are part of the power control feature and + are reset (1), halt (2) and panic (3). + A basic command, no-op (0), is always present and can be used to test the + register access. This command has no effect. diff --git a/docs/specs/vmcoreinfo.txt b/docs/specs/vmcoreinfo.txt new file mode 100644 index 000000000..bcbca6fe4 --- /dev/null +++ b/docs/specs/vmcoreinfo.txt @@ -0,0 +1,53 @@ +================= +VMCoreInfo device +================= + +The `-device vmcoreinfo` will create a fw_cfg entry for a guest to +store dump details. + +etc/vmcoreinfo +************** + +A guest may use this fw_cfg entry to add information details to qemu +dumps. + +The entry of 16 bytes has the following layout, in little-endian:: + +#define VMCOREINFO_FORMAT_NONE 0x0 +#define VMCOREINFO_FORMAT_ELF 0x1 + + struct FWCfgVMCoreInfo { + uint16_t host_format; /* formats host supports */ + uint16_t guest_format; /* format guest supplies */ + uint32_t size; /* size of vmcoreinfo region */ + uint64_t paddr; /* physical address of vmcoreinfo region */ + }; + +Only full write (of 16 bytes) are considered valid for further +processing of entry values. + +A write of 0 in guest_format will disable further processing of +vmcoreinfo entry values & content. + +You may write a guest_format that is not supported by the host, in +which case the entry data can be ignored by qemu (but you may still +access it through a debugger, via vmcoreinfo_realize::vmcoreinfo_state). + +Format & content +**************** + +As of qemu 2.11, only VMCOREINFO_FORMAT_ELF is supported. + +The entry gives location and size of an ELF note that is appended in +qemu dumps. + +The note format/class must be of the target bitness and the size must +be less than 1Mb. + +If the ELF note name is "VMCOREINFO", it is expected to be the Linux +vmcoreinfo note (see Documentation/ABI/testing/sysfs-kernel-vmcoreinfo +in Linux source). In this case, qemu dump code will read the content +as a key=value text file, looking for "NUMBER(phys_base)" key +value. The value is expected to be more accurate than architecture +guess of the value. This is useful for KASLR-enabled guest with +ancient tools not handling the VMCOREINFO note. diff --git a/docs/specs/vmgenid.txt b/docs/specs/vmgenid.txt new file mode 100644 index 000000000..aa9f51867 --- /dev/null +++ b/docs/specs/vmgenid.txt @@ -0,0 +1,245 @@ +VIRTUAL MACHINE GENERATION ID +============================= + +Copyright (C) 2016 Red Hat, Inc. +Copyright (C) 2017 Skyport Systems, Inc. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +=== + +The VM generation ID (vmgenid) device is an emulated device which +exposes a 128-bit, cryptographically random, integer value identifier, +referred to as a Globally Unique Identifier, or GUID. + +This allows management applications (e.g. libvirt) to notify the guest +operating system when the virtual machine is executed with a different +configuration (e.g. snapshot execution or creation from a template). The +guest operating system notices the change, and is then able to react as +appropriate by marking its copies of distributed databases as dirty, +re-initializing its random number generator etc. + + +Requirements +------------ + +These requirements are extracted from the "How to implement virtual machine +generation ID support in a virtualization platform" section of the +specification, dated August 1, 2012. + + +The document may be found on the web at: + http://go.microsoft.com/fwlink/?LinkId=260709 + +R1a. The generation ID shall live in an 8-byte aligned buffer. + +R1b. The buffer holding the generation ID shall be in guest RAM, ROM, or device + MMIO range. + +R1c. The buffer holding the generation ID shall be kept separate from areas + used by the operating system. + +R1d. The buffer shall not be covered by an AddressRangeMemory or + AddressRangeACPI entry in the E820 or UEFI memory map. + +R1e. The generation ID shall not live in a page frame that could be mapped with + caching disabled. (In other words, regardless of whether the generation ID + lives in RAM, ROM or MMIO, it shall only be mapped as cacheable.) + +R2 to R5. [These AML requirements are isolated well enough in the Microsoft + specification for us to simply refer to them here.] + +R6. The hypervisor shall expose a _HID (hardware identifier) object in the + VMGenId device's scope that is unique to the hypervisor vendor. + + +QEMU Implementation +------------------- + +The above-mentioned specification does not dictate which ACPI descriptor table +will contain the VM Generation ID device. Other implementations (Hyper-V and +Xen) put it in the main descriptor table (Differentiated System Description +Table or DSDT). For ease of debugging and implementation, we have decided to +put it in its own Secondary System Description Table, or SSDT. + +The following is a dump of the contents from a running system: + +# iasl -p ./SSDT -d /sys/firmware/acpi/tables/SSDT + +Intel ACPI Component Architecture +ASL+ Optimizing Compiler version 20150717-64 +Copyright (c) 2000 - 2015 Intel Corporation + +Reading ACPI table from file /sys/firmware/acpi/tables/SSDT - Length +00000198 (0x0000C6) +ACPI: SSDT 0x0000000000000000 0000C6 (v01 BOCHS VMGENID 00000001 BXPC +00000001) +Acpi table [SSDT] successfully installed and loaded +Pass 1 parse of [SSDT] +Pass 2 parse of [SSDT] +Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions) + +Parsing completed +Disassembly completed +ASL Output: ./SSDT.dsl - 1631 bytes +# cat SSDT.dsl +/* + * Intel ACPI Component Architecture + * AML/ASL+ Disassembler version 20150717-64 + * Copyright (c) 2000 - 2015 Intel Corporation + * + * Disassembling to symbolic ASL+ operators + * + * Disassembly of /sys/firmware/acpi/tables/SSDT, Sun Feb 5 00:19:37 2017 + * + * Original Table Header: + * Signature "SSDT" + * Length 0x000000CA (202) + * Revision 0x01 + * Checksum 0x4B + * OEM ID "BOCHS " + * OEM Table ID "VMGENID" + * OEM Revision 0x00000001 (1) + * Compiler ID "BXPC" + * Compiler Version 0x00000001 (1) + */ +DefinitionBlock ("/sys/firmware/acpi/tables/SSDT.aml", "SSDT", 1, "BOCHS ", +"VMGENID", 0x00000001) +{ + Name (VGIA, 0x07FFF000) + Scope (\_SB) + { + Device (VGEN) + { + Name (_HID, "QEMUVGID") // _HID: Hardware ID + Name (_CID, "VM_Gen_Counter") // _CID: Compatible ID + Name (_DDN, "VM_Gen_Counter") // _DDN: DOS Device Name + Method (_STA, 0, NotSerialized) // _STA: Status + { + Local0 = 0x0F + If ((VGIA == Zero)) + { + Local0 = Zero + } + + Return (Local0) + } + + Method (ADDR, 0, NotSerialized) + { + Local0 = Package (0x02) {} + Index (Local0, Zero) = (VGIA + 0x28) + Index (Local0, One) = Zero + Return (Local0) + } + } + } + + Method (\_GPE._E05, 0, NotSerialized) // _Exx: Edge-Triggered GPE + { + Notify (\_SB.VGEN, 0x80) // Status Change + } +} + + +Design Details: +--------------- + +Requirements R1a through R1e dictate that the memory holding the +VM Generation ID must be allocated and owned by the guest firmware, +in this case BIOS or UEFI. However, to be useful, QEMU must be able to +change the contents of the memory at runtime, specifically when starting a +backed-up or snapshotted image. In order to do this, QEMU must know the +address that has been allocated. + +The mechanism chosen for this memory sharing is writeable fw_cfg blobs. +These are data object that are visible to both QEMU and guests, and are +addressable as sequential files. + +More information about fw_cfg can be found in "docs/specs/fw_cfg.txt" + +Two fw_cfg blobs are used in this case: + +/etc/vmgenid_guid - contains the actual VM Generation ID GUID + - read-only to the guest +/etc/vmgenid_addr - contains the address of the downloaded vmgenid blob + - writeable by the guest + + +QEMU sends the following commands to the guest at startup: + +1. Allocate memory for vmgenid_guid fw_cfg blob. +2. Write the address of vmgenid_guid into the SSDT (VGIA ACPI variable as + shown above in the iasl dump). Note that this change is not propagated + back to QEMU. +3. Write the address of vmgenid_guid back to QEMU's copy of vmgenid_addr + via the fw_cfg DMA interface. + +After step 3, QEMU is able to update the contents of vmgenid_guid at will. + +Since BIOS or UEFI does not necessarily run when we wish to change the GUID, +the value of VGIA is persisted via the VMState mechanism. + +As spelled out in the specification, any change to the GUID executes an +ACPI notification. The exact handler to use is not specified, so the vmgenid +device uses the first unused one: \_GPE._E05. + + +Endian-ness Considerations: +--------------------------- + +Although not specified in Microsoft's document, it is assumed that the +device is expected to use little-endian format. + +All GUID passed in via command line or monitor are treated as big-endian. +GUID values displayed via monitor are shown in big-endian format. + + +GUID Storage Format: +-------------------- + +In order to implement an OVMF "SDT Header Probe Suppressor", the contents of +the vmgenid_guid fw_cfg blob are not simply a 128-bit GUID. There is also +significant padding in order to align and fill a memory page, as shown in the +following diagram: + ++----------------------------------+ +| SSDT with OEM Table ID = VMGENID | ++----------------------------------+ +| ... | TOP OF PAGE +| VGIA dword object ---------------|-----> +---------------------------+ +| ... | | fw-allocated array for | +| _STA method referring to VGIA | | "etc/vmgenid_guid" | +| ... | +---------------------------+ +| ADDR method referring to VGIA | | 0: OVMF SDT Header probe | +| ... | | suppressor | ++----------------------------------+ | 36: padding for 8-byte | + | alignment | + | 40: GUID | + | 56: padding to page size | + +---------------------------+ + END OF PAGE + + +Device Usage: +------------- + +The device has one property, which may be only be set using the command line: + + guid - sets the value of the GUID. A special value "auto" instructs + QEMU to generate a new random GUID. + +For example: + + QEMU -device vmgenid,guid="324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87" + QEMU -device vmgenid,guid=auto + +The property may be queried via QMP/HMP: + + (QEMU) query-vm-generation-id + {"return": {"guid": "324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"}} + +Setting of this parameter is intentionally left out from the QMP/HMP +interfaces. There are no known use cases for changing the GUID once QEMU is +running, and adding this capability would greatly increase the complexity. diff --git a/docs/specs/vmw_pvscsi-spec.txt b/docs/specs/vmw_pvscsi-spec.txt new file mode 100644 index 000000000..49affb2a4 --- /dev/null +++ b/docs/specs/vmw_pvscsi-spec.txt @@ -0,0 +1,92 @@ +General Description +=================== + +This document describes VMWare PVSCSI device interface specification. +Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD. +Based on source code of PVSCSI Linux driver from kernel 3.0.4 + +PVSCSI Device Interface Overview +================================ + +The interface is based on memory area shared between hypervisor and VM. +Memory area is obtained by driver as device IO memory resource of +PVSCSI_MEM_SPACE_SIZE length. +The shared memory consists of registers area and rings area. +The registers area is used to raise hypervisor interrupts and issue device +commands. The rings area is used to transfer data descriptors and SCSI +commands from VM to hypervisor and to transfer messages produced by +hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA. + +PVSCSI Device Registers +======================= + +The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES). +The structure of the registers area is described by the PVSCSIRegOffset enum. +There are registers to issue device command (with optional short data), +issue device interrupt, control interrupts masking. + +PVSCSI Device Rings +=================== + +There are three rings in shared memory: + + 1. Request ring (struct PVSCSIRingReqDesc *req_ring) + - ring for OS to device requests + 2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring) + - ring for device request completions + 3. Message ring (struct PVSCSIRingMsgDesc *msg_ring) + - ring for messages from device. + This ring is optional and the guest might not configure it. +There is a control area (struct PVSCSIRingsState *rings_state) used to control +rings operation. + +PVSCSI Device to Host Interrupts +================================ +There are following interrupt types supported by PVSCSI device: + 1. Completion interrupts (completion ring notifications): + PVSCSI_INTR_CMPL_0 + PVSCSI_INTR_CMPL_1 + 2. Message interrupts (message ring notifications): + PVSCSI_INTR_MSG_0 + PVSCSI_INTR_MSG_1 + +Interrupts are controlled via PVSCSI_REG_OFFSET_INTR_MASK register +Bit set means interrupt enabled, bit cleared - disabled + +Interrupt modes supported are legacy, MSI and MSI-X +In case of legacy interrupts, register PVSCSI_REG_OFFSET_INTR_STATUS +is used to check which interrupt has arrived. Interrupts are +acknowledged when the corresponding bit is written to the interrupt +status register. + +PVSCSI Device Operation Sequences +================================= + +1. Startup sequence: + a. Issue PVSCSI_CMD_ADAPTER_RESET command; + aa. Windows driver reads interrupt status register here; + b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data, + check status and disable device messages if error returned; + (Omitted if device messages disabled by driver configuration) + c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration + as struct PVSCSICmdDescSetupRings; + d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide + rings configuration as struct PVSCSICmdDescSetupMsgRing; + e. Unmask completion and message (if device messages enabled) interrupts. + +2. Shutdown sequences + a. Mask interrupts; + b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO; + c. Issue PVSCSI_CMD_ADAPTER_RESET command. + +3. Send request + a. Fill next free request ring descriptor; + b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations; + or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations. + +4. Abort command + a. Issue PVSCSI_CMD_ABORT_CMD command; + +5. Request completion processing + a. Upon completion interrupt arrival process completion + and message (if enabled) rings. |