27 files changed, 5158 insertions, 0 deletions
diff --git a/docs/specs/acpi_cpu_hotplug.rst b/docs/specs/acpi_cpu_hotplug.rst
new file mode 100644
index 000000000..351057c96
--- /dev/null
+++ b/docs/specs/acpi_cpu_hotplug.rst
@@ -0,0 +1,235 @@
+QEMU<->ACPI BIOS CPU hotplug interface
+======================================
+
+QEMU supports CPU hotplug via ACPI. This document
+describes the interface between QEMU and the ACPI BIOS.
+
+ACPI BIOS GPE.2 handler is dedicated for notifying OS about CPU hot-add
+and hot-remove events.
+
+
+Legacy ACPI CPU hotplug interface registers
+-------------------------------------------
+
+CPU present bitmap for:
+
+- ICH9-LPC (IO port 0x0cd8-0xcf7, 1-byte access)
+- PIIX-PM  (IO port 0xaf00-0xaf1f, 1-byte access)
+- One bit per CPU. Bit position reflects corresponding CPU APIC ID. Read-only.
+- The first DWORD in bitmap is used in write mode to switch from legacy
+  to modern CPU hotplug interface, write 0 into it to do switch.
+
+QEMU sets corresponding CPU bit on hot-add event and issues SCI
+with GPE.2 event set. CPU present map is read by ACPI BIOS GPE.2 handler
+to notify OS about CPU hot-add events. CPU hot-remove isn't supported.
+
+
+Modern ACPI CPU hotplug interface registers
+-------------------------------------------
+
+Register block base address:
+
+- ICH9-LPC IO port 0x0cd8
+- PIIX-PM  IO port 0xaf00
+
+Register block size:
+
+- ACPI_CPU_HOTPLUG_REG_LEN = 12
+
+All accesses to registers described below, imply little-endian byte order.
+
+Reserved registers behavior:
+
+- write accesses are ignored
+- read accesses return all bits set to 0.
+
+The last stored value in 'CPU selector' must refer to a possible CPU, otherwise
+
+- reads from any register return 0
+- writes to any other register are ignored until valid value is stored into it
+
+On QEMU start, 'CPU selector' is initialized to a valid value, on reset it
+keeps the current value.
+
+Read access behavior
+^^^^^^^^^^^^^^^^^^^^
+
+offset [0x0-0x3]
+  Command data 2: (DWORD access)
+
+  If value last stored in 'Command field' is:
+
+  0:
+    reads as 0x0
+  3:
+    upper 32 bits of architecture specific CPU ID value
+  other values:
+    reserved
+
+offset [0x4]
+  CPU device status fields: (1 byte access)
+
+  bits:
+
+  0:
+    Device is enabled and may be used by guest
+  1:
+    Device insert event, used to distinguish device for which
+    no device check event to OSPM was issued.
+    It's valid only when bit 0 is set.
+  2:
+    Device remove event, used to distinguish device for which
+    no device eject request to OSPM was issued. Firmware must
+    ignore this bit.
+  3:
+    reserved and should be ignored by OSPM
+  4:
+    if set to 1, OSPM requests firmware to perform device eject.
+  5-7:
+    reserved and should be ignored by OSPM
+
+offset [0x5-0x7]
+  reserved
+
+offset [0x8]
+  Command data: (DWORD access)
+
+  If value last stored in 'Command field' is one of:
+
+  0:
+    contains 'CPU selector' value of a CPU with pending event[s]
+  3:
+    lower 32 bits of architecture specific CPU ID value
+    (in x86 case: APIC ID)
+  otherwise:
+    contains 0
+
+Write access behavior
+^^^^^^^^^^^^^^^^^^^^^
+
+offset [0x0-0x3]
+  CPU selector: (DWORD access)
+
+  Selects active CPU device. All following accesses to other
+  registers will read/store data from/to selected CPU.
+  Valid values: [0 .. max_cpus)
+
+offset [0x4]
+  CPU device control fields: (1 byte access)
+
+  bits:
+
+  0:
+    reserved, OSPM must clear it before writing to register.
+  1:
+    if set to 1 clears device insert event, set by OSPM
+    after it has emitted device check event for the
+    selected CPU device
+  2:
+    if set to 1 clears device remove event, set by OSPM
+    after it has emitted device eject request for the
+    selected CPU device.
+  3:
+    if set to 1 initiates device eject, set by OSPM when it
+    triggers CPU device removal and calls _EJ0 method or by firmware
+    when bit #4 is set. In case bit #4 were set, it's cleared as
+    part of device eject.
+  4:
+    if set to 1, OSPM hands over device eject to firmware.
+    Firmware shall issue device eject request as described above
+    (bit #3) and OSPM should not touch device eject bit (#3) in case
+    it's asked firmware to perform CPU device eject.
+  5-7:
+    reserved, OSPM must clear them before writing to register
+
+offset[0x5]
+  Command field: (1 byte access)
+
+  value:
+
+  0:
+    selects a CPU device with inserting/removing events and
+    following reads from 'Command data' register return
+    selected CPU ('CPU selector' value).
+    If no CPU with events found, the current 'CPU selector' doesn't
+    change and corresponding insert/remove event flags are not modified.
+
+  1:
+    following writes to 'Command data' register set OST event
+    register in QEMU
+  2:
+    following writes to 'Command data' register set OST status
+    register in QEMU
+  3:
+    following reads from 'Command data' and 'Command data 2' return
+    architecture specific CPU ID value for currently selected CPU.
+  other values:
+    reserved
+
+offset [0x6-0x7]
+  reserved
+
+offset [0x8]
+  Command data: (DWORD access)
+
+  If last stored 'Command field' value is:
+
+  1:
+    stores value into OST event register
+  2:
+    stores value into OST status register, triggers
+    ACPI_DEVICE_OST QMP event from QEMU to external applications
+    with current values of OST event and status registers.
+  other values:
+    reserved
+
+Typical usecases
+----------------
+
+(x86) Detecting and enabling modern CPU hotplug interface
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+QEMU starts with legacy CPU hotplug interface enabled. Detecting and
+switching to modern interface is based on the 2 legacy CPU hotplug features:
+
+#. Writes into CPU bitmap are ignored.
+#. CPU bitmap always has bit #0 set, corresponding to boot CPU.
+
+Use following steps to detect and enable modern CPU hotplug interface:
+
+#. Store 0x0 to the 'CPU selector' register, attempting to switch to modern mode
+#. Store 0x0 to the 'CPU selector' register, to ensure valid selector value
+#. Store 0x0 to the 'Command field' register
+#. Read the 'Command data 2' register.
+   If read value is 0x0, the modern interface is enabled.
+   Otherwise legacy or no CPU hotplug interface available
+
+Get a cpu with pending event
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Store 0x0 to the 'CPU selector' register.
+#. Store 0x0 to the 'Command field' register.
+#. Read the 'CPU device status fields' register.
+#. If both bit #1 and bit #2 are clear in the value read, there is no CPU
+   with a pending event and selected CPU remains unchanged.
+#. Otherwise, read the 'Command data' register. The value read is the
+   selector of the CPU with the pending event (which is already selected).
+
+Enumerate CPUs present/non present CPUs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Set the present CPU count to 0.
+#. Set the iterator to 0.
+#. Store 0x0 to the 'CPU selector' register, to ensure that it's in
+   a valid state and that access to other registers won't be ignored.
+#. Store 0x0 to the 'Command field' register to make 'Command data'
+   register return 'CPU selector' value of selected CPU
+#. Read the 'CPU device status fields' register.
+#. If bit #0 is set, increment the present CPU count.
+#. Increment the iterator.
+#. Store the iterator to the 'CPU selector' register.
+#. Read the 'Command data' register.
+#. If the value read is not zero, goto 05.
+#. Otherwise store 0x0 to the 'CPU selector' register, to put it
+   into a valid state and exit.
+   The iterator at this point equals "max_cpus".
diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst
new file mode 100644
index 000000000..68f1fbe0a
--- /dev/null
+++ b/docs/specs/acpi_hest_ghes.rst
@@ -0,0 +1,110 @@
+APEI tables generating and CPER record
+======================================
+
+..
+   Copyright (c) 2020 HUAWEI TECHNOLOGIES CO., LTD.
+
+   This work is licensed under the terms of the GNU GPL, version 2 or later.
+   See the COPYING file in the top-level directory.
+
+Design Details
+--------------
+
+::
+
+         etc/acpi/tables                           etc/hardware_errors
+      ====================                   ===============================
+  + +--------------------------+            +----------------------------+
+  | | HEST                     | +--------->|    error_block_address1    |------+
+  | +--------------------------+ |          +----------------------------+      |
+  | | GHES1                    | | +------->|    error_block_address2    |------+-+
+  | +--------------------------+ | |        +----------------------------+      | |
+  | | .................        | | |        |      ..............        |      | |
+  | | error_status_address-----+-+ |        -----------------------------+      | |
+  | | .................        |   |   +--->|    error_block_addressN    |------+-+---+
+  | | read_ack_register--------+-+ |   |    +----------------------------+      | |   |
+  | | read_ack_preserve        | +-+---+--->|     read_ack_register1     |      | |   |
+  | | read_ack_write           |   |   |    +----------------------------+      | |   |
+  + +--------------------------+   | +-+--->|     read_ack_register2     |      | |   |
+  | | GHES2                    |   | | |    +----------------------------+      | |   |
+  + +--------------------------+   | | |    |       .............        |      | |   |
+  | | .................        |   | | |    +----------------------------+      | |   |
+  | | error_status_address-----+---+ | | +->|     read_ack_registerN     |      | |   |
+  | | .................        |     | | |  +----------------------------+      | |   |
+  | | read_ack_register--------+-----+ | |  |Generic Error Status Block 1|<-----+ |   |
+  | | read_ack_preserve        |       | |  |-+------------------------+-+        |   |
+  | | read_ack_write           |       | |  | |          CPER          | |        |   |
+  + +--------------------------|       | |  | |          CPER          | |        |   |
+  | | ...............          |       | |  | |          ....          | |        |   |
+  + +--------------------------+       | |  | |          CPER          | |        |   |
+  | | GHESN                    |       | |  |-+------------------------+-|        |   |
+  + +--------------------------+       | |  |Generic Error Status Block 2|<-------+   |
+  | | .................        |       | |  |-+------------------------+-+            |
+  | | error_status_address-----+-------+ |  | |           CPER         | |            |
+  | | .................        |         |  | |           CPER         | |            |
+  | | read_ack_register--------+---------+  | |           ....         | |            |
+  | | read_ack_preserve        |            | |           CPER         | |            |
+  | | read_ack_write           |            +-+------------------------+-+            |
+  + +--------------------------+            |         ..........         |            |
+                                            |----------------------------+            |
+                                            |Generic Error Status Block N |<----------+
+                                            |-+-------------------------+-+
+                                            | |          CPER           | |
+                                            | |          CPER           | |
+                                            | |          ....           | |
+                                            | |          CPER           | |
+                                            +-+-------------------------+-+
+
+
+(1) QEMU generates the ACPI HEST table. This table goes in the current
+    "etc/acpi/tables" fw_cfg blob. Each error source has different
+    notification types.
+
+(2) A new fw_cfg blob called "etc/hardware_errors" is introduced. QEMU
+    also needs to populate this blob. The "etc/hardware_errors" fw_cfg blob
+    contains an address registers table and an Error Status Data Block table.
+
+(3) The address registers table contains N Error Block Address entries
+    and N Read Ack Register entries. The size for each entry is 8-byte.
+    The Error Status Data Block table contains N Error Status Data Block
+    entries. The size for each entry is 4096(0x1000) bytes. The total size
+    for the "etc/hardware_errors" fw_cfg blob is (N * 8 * 2 + N * 4096) bytes.
+    N is the number of the kinds of hardware error sources.
+
+(4) QEMU generates the ACPI linker/loader script for the firmware. The
+    firmware pre-allocates memory for "etc/acpi/tables", "etc/hardware_errors"
+    and copies blob contents there.
+
+(5) QEMU generates N ADD_POINTER commands, which patch addresses in the
+    "error_status_address" fields of the HEST table with a pointer to the
+    corresponding "address registers" in the "etc/hardware_errors" blob.
+
+(6) QEMU generates N ADD_POINTER commands, which patch addresses in the
+    "read_ack_register" fields of the HEST table with a pointer to the
+    corresponding "read_ack_register" within the "etc/hardware_errors" blob.
+
+(7) QEMU generates N ADD_POINTER commands for the firmware, which patch
+    addresses in the "error_block_address" fields with a pointer to the
+    respective "Error Status Data Block" in the "etc/hardware_errors" blob.
+
+(8) QEMU defines a third and write-only fw_cfg blob which is called
+    "etc/hardware_errors_addr". Through that blob, the firmware can send back
+    the guest-side allocation addresses to QEMU. The "etc/hardware_errors_addr"
+    blob contains a 8-byte entry. QEMU generates a single WRITE_POINTER command
+    for the firmware. The firmware will write back the start address of
+    "etc/hardware_errors" blob to the fw_cfg file "etc/hardware_errors_addr".
+
+(9) When QEMU gets a SIGBUS from the kernel, QEMU writes CPER into corresponding
+    "Error Status Data Block", guest memory, and then injects platform specific
+    interrupt (in case of arm/virt machine it's Synchronous External Abort) as a
+    notification which is necessary for notifying the guest.
+
+(10) This notification (in virtual hardware) will be handled by the guest
+     kernel, on receiving notification, guest APEI driver could read the CPER error
+     and take appropriate action.
+
+(11) kvm_arch_on_sigbus_vcpu() uses source_id as index in "etc/hardware_errors" to
+     find out "Error Status Data Block" entry corresponding to error source. So supported
+     source_id values should be assigned here and not be changed afterwards to make sure
+     that guest will write error into expected "Error Status Data Block" even if guest was
+     migrated to a newer QEMU.
diff --git a/docs/specs/acpi_hw_reduced_hotplug.rst b/docs/specs/acpi_hw_reduced_hotplug.rst
new file mode 100644
index 000000000..0bd3f9399
--- /dev/null
+++ b/docs/specs/acpi_hw_reduced_hotplug.rst
@@ -0,0 +1,71 @@
+==================================================
+QEMU and ACPI BIOS Generic Event Device interface
+==================================================
+
+The ACPI *Generic Event Device* (GED) is a HW reduced platform
+specific device introduced in ACPI v6.1 that handles all platform
+events, including the hotplug ones. GED is modelled as a device
+in the namespace with a _HID defined to be ACPI0013. This document
+describes the interface between QEMU and the ACPI BIOS.
+
+GED allows HW reduced platforms to handle interrupts in ACPI ASL
+statements. It follows a very similar approach to the _EVT method
+from GPIO events. All interrupts are listed in  _CRS and the handler
+is written in _EVT method. However, the QEMU implementation uses a
+single interrupt for the GED device, relying on an IO memory region
+to communicate the type of device affected by the interrupt. This way,
+we can support up to 32 events with a unique interrupt.
+
+**Here is an example,**
+
+::
+
+   Device (\_SB.GED)
+   {
+       Name (_HID, "ACPI0013")
+       Name (_UID, Zero)
+       Name (_CRS, ResourceTemplate ()
+       {
+           Interrupt (ResourceConsumer, Edge, ActiveHigh, Exclusive, ,, )
+           {
+               0x00000029,
+           }
+       })
+       OperationRegion (EREG, SystemMemory, 0x09080000, 0x04)
+       Field (EREG, DWordAcc, NoLock, WriteAsZeros)
+       {
+           ESEL,   32
+       }
+       Method (_EVT, 1, Serialized)
+       {
+           Local0 = ESEL // ESEL = IO memory region which specifies the
+                         // device type.
+           If (((Local0 & One) == One))
+           {
+               MethodEvent1()
+           }
+           If ((Local0 & 0x2) == 0x2)
+           {
+               MethodEvent2()
+           }
+           ...
+       }
+   }
+
+GED IO interface (4 byte access)
+--------------------------------
+**read access:**
+
+::
+
+   [0x0-0x3] Event selector bit field (32 bit) set by QEMU.
+
+    bits:
+       0: Memory hotplug event
+       1: System power down event
+       2: NVDIMM hotplug event
+    3-31: Reserved
+
+**write_access:**
+
+Nothing is expected to be written into GED IO memory
diff --git a/docs/specs/acpi_mem_hotplug.rst b/docs/specs/acpi_mem_hotplug.rst
new file mode 100644
index 000000000..069819bc3
--- /dev/null
+++ b/docs/specs/acpi_mem_hotplug.rst
@@ -0,0 +1,128 @@
+QEMU<->ACPI BIOS memory hotplug interface
+=========================================
+
+ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add
+and hot-remove events.
+
+Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access)
+----------------------------------------------------------------
+
+Read access behavior
+^^^^^^^^^^^^^^^^^^^^
+
+[0x0-0x3]
+  Lo part of memory device phys address
+[0x4-0x7]
+  Hi part of memory device phys address
+[0x8-0xb]
+  Lo part of memory device size in bytes
+[0xc-0xf]
+  Hi part of memory device size in bytes
+[0x10-0x13]
+  Memory device proximity domain
+[0x14]
+  Memory device status fields
+
+  bits:
+
+  0:
+    Device is enabled and may be used by guest
+  1:
+    Device insert event, used to distinguish device for which
+    no device check event to OSPM was issued.
+    It's valid only when bit 1 is set.
+  2:
+    Device remove event, used to distinguish device for which
+    no device eject request to OSPM was issued.
+  3-7:
+    reserved and should be ignored by OSPM
+
+[0x15-0x17]
+  reserved
+
+Write access behavior
+^^^^^^^^^^^^^^^^^^^^^
+
+
+[0x0-0x3]
+  Memory device slot selector, selects active memory device.
+  All following accesses to other registers in 0xa00-0xa17
+  region will read/store data from/to selected memory device.
+[0x4-0x7]
+  OST event code reported by OSPM
+[0x8-0xb]
+  OST status code reported by OSPM
+[0xc-0x13]
+  reserved, writes into it are ignored
+[0x14]
+  Memory device control fields
+
+  bits:
+
+  0:
+    reserved, OSPM must clear it before writing to register.
+    Due to BUG in versions prior 2.4 that field isn't cleared
+    when other fields are written. Keep it reserved and don't
+    try to reuse it.
+  1:
+    if set to 1 clears device insert event, set by OSPM
+    after it has emitted device check event for the
+    selected memory device
+  2:
+    if set to 1 clears device remove event, set by OSPM
+    after it has emitted device eject request for the
+    selected memory device
+  3:
+    if set to 1 initiates device eject, set by OSPM when it
+    triggers memory device removal and calls _EJ0 method
+  4-7:
+    reserved, OSPM must clear them before writing to register
+
+Selecting memory device slot beyond present range has no effect on platform:
+
+- write accesses to memory hot-plug registers not documented above are ignored
+- read accesses to memory hot-plug registers not documented above return
+  all bits set to 1.
+
+Memory hot remove process diagram
+---------------------------------
+
+::
+
+   +-------------+     +-----------------------+      +------------------+
+   |  1. QEMU    |     | 2. QEMU               |      |3. QEMU           |
+   |  device_del +---->+ device unplug request +----->+Send SCI to guest,|
+   |             |     |         cb            |      |return control to |
+   |             |     |                       |      |management        |
+   +-------------+     +-----------------------+      +------------------+
+
+   +---------------------------------------------------------------------+
+
+   +---------------------+              +-------------------------+
+   | OSPM:               | remove event | OSPM:                   |
+   | send Eject Request, |              | Scan memory devices     |
+   | clear remove event  +<-------------+ for event flags         |
+   |                     |              |                         |
+   +---------------------+              +-------------------------+
+             |
+             |
+   +---------v--------+            +-----------------------+
+   | Guest OS:        |  success   | OSPM:                 |
+   | process Ejection +----------->+ Execute _EJ0 method,  |
+   | request          |            | set eject bit in flags|
+   +------------------+            +-----------------------+
+             |failure                         |
+             v                                v
+   +------------------------+      +-----------------------+
+   | OSPM:                  |      | QEMU:                 |
+   | set OST event & status |      | call device unplug cb |
+   | fields                 |      |                       |
+   +------------------------+      +-----------------------+
+            |                                  |
+            v                                  v
+   +------------------+              +-------------------+
+   |QEMU:             |              |QEMU:              |
+   |Send OST QMP event|              |Send device deleted|
+   |                  |              |QMP event          |
+   +------------------+              |                   |
+                                     +-------------------+
diff --git a/docs/specs/acpi_nvdimm.rst b/docs/specs/acpi_nvdimm.rst
new file mode 100644
index 000000000..ab0335253
--- /dev/null
+++ b/docs/specs/acpi_nvdimm.rst
@@ -0,0 +1,228 @@
+QEMU<->ACPI BIOS NVDIMM interface
+=================================
+
+QEMU supports NVDIMM via ACPI. This document describes the basic concepts of
+NVDIMM ACPI and the interface between QEMU and the ACPI BIOS.
+
+NVDIMM ACPI Background
+----------------------
+
+NVDIMM is introduced in ACPI 6.0 which defines an NVDIMM root device under
+_SB scope with a _HID of "ACPI0012". For each NVDIMM present or intended
+to be supported by platform, platform firmware also exposes an ACPI
+Namespace Device under the root device.
+
+The NVDIMM child devices under the NVDIMM root device are defined with _ADR
+corresponding to the NFIT device handle. The NVDIMM root device and the
+NVDIMM devices can have device specific methods (_DSM) to provide additional
+functions specific to a particular NVDIMM implementation.
+
+This is an example from ACPI 6.0, a platform contains one NVDIMM::
+
+  Scope (\_SB){
+     Device (NVDR) // Root device
+     {
+        Name (_HID, "ACPI0012")
+        Method (_STA) {...}
+        Method (_FIT) {...}
+        Method (_DSM, ...) {...}
+        Device (NVD)
+        {
+           Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
+           Method (_DSM, ...) {...}
+        }
+     }
+  }
+
+Methods supported on both NVDIMM root device and NVDIMM device
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+_DSM (Device Specific Method)
+   It is a control method that enables devices to provide device specific
+   control functions that are consumed by the device driver.
+   The NVDIMM DSM specification can be found at
+   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+
+   Arguments:
+
+   Arg0
+     A Buffer containing a UUID (16 Bytes)
+   Arg1
+     An Integer containing the Revision ID (4 Bytes)
+   Arg2
+     An Integer containing the Function Index (4 Bytes)
+   Arg3
+     A package containing parameters for the function specified by the
+     UUID, Revision ID, and Function Index
+
+   Return Value:
+
+   If Function Index = 0, a Buffer containing a function index bitfield.
+   Otherwise, the return value and type depends on the UUID, revision ID
+   and function index which are described in the DSM specification.
+
+Methods on NVDIMM ROOT Device
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+_FIT(Firmware Interface Table)
+   It evaluates to a buffer returning data in the format of a series of NFIT
+   Type Structure.
+
+   Arguments: None
+
+   Return Value:
+   A Buffer containing a list of NFIT Type structure entries.
+
+   The detailed definition of the structure can be found at ACPI 6.0: 5.2.25
+   NVDIMM Firmware Interface Table (NFIT).
+
+QEMU NVDIMM Implementation
+--------------------------
+
+QEMU uses 4 bytes IO Port starting from 0x0a18 and a RAM-based memory page
+for NVDIMM ACPI.
+
+Memory:
+   QEMU uses BIOS Linker/loader feature to ask BIOS to allocate a memory
+   page and dynamically patch its address into an int32 object named "MEMA"
+   in ACPI.
+
+   This page is RAM-based and it is used to transfer data between _DSM
+   method and QEMU. If ACPI has control, this pages is owned by ACPI which
+   writes _DSM input data to it, otherwise, it is owned by QEMU which
+   emulates _DSM access and writes the output data to it.
+
+   ACPI writes _DSM Input Data (based on the offset in the page):
+
+   [0x0 - 0x3]
+      4 bytes, NVDIMM Device Handle.
+
+      The handle is completely QEMU internal thing, the values in
+      range [1, 0xFFFF] indicate nvdimm device. Other values are
+      reserved for other purposes.
+
+      Reserved handles:
+
+      - 0 is reserved for nvdimm root device named NVDR.
+      - 0x10000 is reserved for QEMU internal DSM function called on
+        the root device.
+
+   [0x4 - 0x7]
+      4 bytes, Revision ID, that is the Arg1 of _DSM method.
+
+   [0x8 - 0xB]
+      4 bytes. Function Index, that is the Arg2 of _DSM method.
+
+   [0xC - 0xFFF]
+      4084 bytes, the Arg3 of _DSM method.
+
+   QEMU writes Output Data (based on the offset in the page):
+
+   [0x0 - 0x3]
+      4 bytes, the length of result
+
+   [0x4 - 0xFFF]
+      4092 bytes, the DSM result filled by QEMU
+
+IO Port 0x0a18 - 0xa1b:
+   ACPI writes the address of the memory page allocated by BIOS to this
+   port then QEMU gets the control and fills the result in the memory page.
+
+   Write Access:
+
+   [0x0a18 - 0xa1b]
+      4 bytes, the address of the memory page allocated by BIOS.
+
+_DSM process diagram
+--------------------
+
+"MEMA" indicates the address of memory page allocated by BIOS.
+
+::
+
+ +----------------------+      +-----------------------+
+ |    1. OSPM           |      |    2. OSPM            |
+ | save _DSM input data |      |  write "MEMA" to      | Exit to QEMU
+ | to the page          +----->|  IO port 0x0a18       +------------+
+ | indicated by "MEMA"  |      |                       |            |
+ +----------------------+      +-----------------------+            |
+                                                                    |
+                                                                    v
+ +--------------------+       +-----------+      +------------------+--------+
+ |      5 QEMU        |       | 4 QEMU    |      |        3. QEMU            |
+ | write _DSM result  |       |  emulate  |      | get _DSM input data from  |
+ | to the page        +<------+ _DSM      +<-----+ the page indicated by the |
+ |                    |       |           |      | value from the IO port    |
+ +--------+-----------+       +-----------+      +---------------------------+
+          |
+          | Enter Guest
+          |
+          v
+ +--------------------------+      +--------------+
+ |     6 OSPM               |      |   7 OSPM     |
+ | result size is returned  |      |  _DSM return |
+ | by reading  DSM          +----->+              |
+ | result from the page     |      |              |
+ +--------------------------+      +--------------+
+
+NVDIMM hotplug
+--------------
+
+ACPI BIOS GPE.4 handler is dedicated for notifying OS about nvdimm device
+hot-add event.
+
+QEMU internal use only _DSM functions
+-------------------------------------
+
+Read FIT
+^^^^^^^^
+
+_FIT method uses _DSM method to fetch NFIT structures blob from QEMU
+in 1 page sized increments which are then concatenated and returned
+as _FIT method result.
+
+Input parameters:
+
+Arg0
+  UUID {set to 648B9CF2-CDA1-4312-8AD9-49C4AF32BD62}
+Arg1
+  Revision ID (set to 1)
+Arg2
+  Function Index, 0x1
+Arg3
+  A package containing a buffer whose layout is as follows:
+
+   +----------+--------+--------+-------------------------------------------+
+   |  Field   | Length | Offset |                 Description               |
+   +----------+--------+--------+-------------------------------------------+
+   | offset   |   4    |   0    | offset in QEMU's NFIT structures blob to  |
+   |          |        |        | read from                                 |
+   +----------+--------+--------+-------------------------------------------+
+
+Output layout in the dsm memory page:
+
+   +----------+--------+--------+-------------------------------------------+
+   | Field    | Length | Offset | Description                               |
+   +----------+--------+--------+-------------------------------------------+
+   | length   | 4      | 0      | length of entire returned data            |
+   |          |        |        | (including this header)                   |
+   +----------+--------+--------+-------------------------------------------+
+   |          |        |        | return status codes                       |
+   |          |        |        |                                           |
+   |          |        |        | - 0x0 - success                           |
+   |          |        |        | - 0x100 - error caused by NFIT update     |
+   | status   | 4      | 4      |   while read by _FIT wasn't completed     |
+   |          |        |        | - other codes follow Chapter 3 in         |
+   |          |        |        |   DSM Spec Rev1                           |
+   +----------+--------+--------+-------------------------------------------+
+   | fit data | Varies | 8      | contains FIT data. This field is present  |
+   |          |        |        | if status field is 0.                     |
+   +----------+--------+--------+-------------------------------------------+
+
+The FIT offset is maintained by the OSPM itself, current offset plus
+the size of the fit data returned by the function is the next offset
+OSPM should read. When all FIT data has been read out, zero fit data
+size is returned.
+
+If it returns status code 0x100, OSPM should restart to read FIT (read
+from offset 0 again).
diff --git a/docs/specs/acpi_pci_hotplug.rst b/docs/specs/acpi_pci_hotplug.rst
new file mode 100644
index 000000000..685bc5c32
--- /dev/null
+++ b/docs/specs/acpi_pci_hotplug.rst
@@ -0,0 +1,48 @@
+QEMU<->ACPI BIOS PCI hotplug interface
+======================================
+
+QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document
+describes the interface between QEMU and the ACPI BIOS.
+
+ACPI GPE block (IO ports 0xafe0-0xafe3, byte access)
+----------------------------------------------------
+
+Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
+event to ACPI BIOS, via SCI interrupt.
+
+PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access)
+------------------------------------------------------------------------------
+
+Slot injection notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.1 handler to notify OS of injection
+events.  Read-only.
+
+PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access)
+--------------------------------------------------------------------
+
+Slot removal notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.1 handler to notify OS of removal
+events.  Read-only.
+
+PCI device eject (IO port 0xae08-0xae0b, 4-byte access)
+-------------------------------------------------------
+
+Write: Used by ACPI BIOS _EJ0 method to request device removal.
+One bit per slot.
+
+Read: Hotplug features register.  Used by platform to identify features
+available.  Current base feature set (no bits set):
+
+- Read-only "up" register @0xae00, 4-byte access, bit per slot
+- Read-only "down" register @0xae04, 4-byte access, bit per slot
+- Read/write "eject" register @0xae08, 4-byte access,
+  write: bit per slot eject, read: hotplug feature set
+- Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot
+
+PCI removability status (IO port 0xae0c-0xae0f, 4-byte access)
+--------------------------------------------------------------
+
+Used by ACPI BIOS _RMV method to indicate removability status to OS. One
+bit per slot.  Read-only.
diff --git a/docs/specs/edu.txt b/docs/specs/edu.txt
new file mode 100644
index 000000000..087631080
--- /dev/null
+++ b/docs/specs/edu.txt
@@ -0,0 +1,115 @@
+
+EDU device
+==========
+
+Copyright (c) 2014-2015 Jiri Slaby
+
+This document is licensed under the GPLv2 (or later).
+
+This is an educational device for writing (kernel) drivers. Its original
+intention was to support the Linux kernel lectures taught at the Masaryk
+University. Students are given this virtual device and are expected to write a
+driver with I/Os, IRQs, DMAs and such.
+
+The devices behaves very similar to the PCI bridge present in the COMBO6 cards
+developed under the Liberouter wings. Both PCI device ID and PCI space is
+inherited from that device.
+
+Command line switches:
+    -device edu[,dma_mask=mask]
+
+    dma_mask makes the virtual device work with DMA addresses with the given
+    mask. For educational purposes, the device supports only 28 bits (256 MiB)
+    by default. Students shall set dma_mask for the device in the OS driver
+    properly.
+
+PCI specs
+---------
+
+PCI ID: 1234:11e8
+
+PCI Region 0:
+   I/O memory, 1 MB in size. Users are supposed to communicate with the card
+   through this memory.
+
+MMIO area spec
+--------------
+
+Only size == 4 accesses are allowed for addresses < 0x80. size == 4 or
+size == 8 for the rest.
+
+0x00 (RO) : identification (0xRRrr00edu)
+	    RR -- major version
+	    rr -- minor version
+
+0x04 (RW) : card liveness check
+	    It is a simple value inversion (~ C operator).
+
+0x08 (RW) : factorial computation
+	    The stored value is taken and factorial of it is put back here.
+	    This happens only after factorial bit in the status register (0x20
+	    below) is cleared.
+
+0x20 (RW) : status register, bitwise OR
+	    0x01 -- computing factorial (RO)
+	    0x80 -- raise interrupt after finishing factorial computation
+
+0x24 (RO) : interrupt status register
+	    It contains values which raised the interrupt (see interrupt raise
+	    register below).
+
+0x60 (WO) : interrupt raise register
+	    Raise an interrupt. The value will be put to the interrupt status
+	    register (using bitwise OR).
+
+0x64 (WO) : interrupt acknowledge register
+	    Clear an interrupt. The value will be cleared from the interrupt
+	    status register. This needs to be done from the ISR to stop
+	    generating interrupts.
+
+0x80 (RW) : DMA source address
+	    Where to perform the DMA from.
+
+0x88 (RW) : DMA destination address
+	    Where to perform the DMA to.
+
+0x90 (RW) : DMA transfer count
+	    The size of the area to perform the DMA on.
+
+0x98 (RW) : DMA command register, bitwise OR
+	    0x01 -- start transfer
+	    0x02 -- direction (0: from RAM to EDU, 1: from EDU to RAM)
+	    0x04 -- raise interrupt 0x100 after finishing the DMA
+
+IRQ controller
+--------------
+An IRQ is generated when written to the interrupt raise register. The value
+appears in interrupt status register when the interrupt is raised and has to
+be written to the interrupt acknowledge register to lower it.
+
+The device supports both INTx and MSI interrupt. By default, INTx is
+used. Even if the driver disabled INTx and only uses MSI, it still
+needs to update the acknowledge register at the end of the IRQ handler
+routine.
+
+DMA controller
+--------------
+One has to specify, source, destination, size, and start the transfer. One
+4096 bytes long buffer at offset 0x40000 is available in the EDU device. I.e.
+one can perform DMA to/from this space when programmed properly.
+
+Example of transferring a 100 byte block to and from the buffer using a given
+PCI address 'addr':
+addr     -> DMA source address
+0x40000  -> DMA destination address
+100      -> DMA transfer count
+1        -> DMA command register
+while (DMA command register & 1)
+	;
+
+0x40000  -> DMA source address
+addr+100 -> DMA destination address
+100      -> DMA transfer count
+3        -> DMA command register
+while (DMA command register & 1)
+	;
diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.txt
new file mode 100644
index 000000000..3e6d586f6
--- /dev/null
+++ b/docs/specs/fw_cfg.txt
@@ -0,0 +1,265 @@
+QEMU Firmware Configuration (fw_cfg) Device
+===========================================
+
+= Guest-side Hardware Interface =
+
+This hardware interface allows the guest to retrieve various data items
+(blobs) that can influence how the firmware configures itself, or may
+contain tables to be installed for the guest OS. Examples include device
+boot order, ACPI and SMBIOS tables, virtual machine UUID, SMP and NUMA
+information, kernel/initrd images for direct (Linux) kernel booting, etc.
+
+== Selector (Control) Register ==
+
+* Write only
+* Location: platform dependent (IOport or MMIO)
+* Width: 16-bit
+* Endianness: little-endian (if IOport), or big-endian (if MMIO)
+
+A write to this register sets the index of a firmware configuration
+item which can subsequently be accessed via the data register.
+
+Setting the selector register will cause the data offset to be set
+to zero. The data offset impacts which data is accessed via the data
+register, and is explained below.
+
+Bit14 of the selector register indicates whether the configuration
+setting is being written. A value of 0 means the item is only being
+read, and all write access to the data port will be ignored. A value
+of 1 means the item's data can be overwritten by writes to the data
+register. In other words, configuration write mode is enabled when
+the selector value is between 0x4000-0x7fff or 0xc000-0xffff.
+
+NOTE: As of QEMU v2.4, writes to the fw_cfg data register are no
+      longer supported, and will be ignored (treated as no-ops)!
+
+NOTE: As of QEMU v2.9, writes are reinstated, but only through the DMA
+      interface (see below). Furthermore, writeability of any specific item is
+      governed independently of Bit14 in the selector key value.
+
+Bit15 of the selector register indicates whether the configuration
+setting is architecture specific. A value of 0 means the item is a
+generic configuration item. A value of 1 means the item is specific
+to a particular architecture. In other words, generic configuration
+items are accessed with a selector value between 0x0000-0x7fff, and
+architecture specific configuration items are accessed with a selector
+value between 0x8000-0xffff.
+
+== Data Register ==
+
+* Read/Write (writes ignored as of QEMU v2.4, but see the DMA interface)
+* Location: platform dependent (IOport [*] or MMIO)
+* Width: 8-bit (if IOport), 8/16/32/64-bit (if MMIO)
+* Endianness: string-preserving
+
+[*] On platforms where the data register is exposed as an IOport, its
+port number will always be one greater than the port number of the
+selector register. In other words, the two ports overlap, and can not
+be mapped separately.
+
+The data register allows access to an array of bytes for each firmware
+configuration data item. The specific item is selected by writing to
+the selector register, as described above.
+
+Initially following a write to the selector register, the data offset
+will be set to zero. Each successful access to the data register will
+increment the data offset by the appropriate access width.
+
+Each firmware configuration item has a maximum length of data
+associated with the item. After the data offset has passed the
+end of this maximum data length, then any reads will return a data
+value of 0x00, and all writes will be ignored.
+
+An N-byte wide read of the data register will return the next available
+N bytes of the selected firmware configuration item, as a substring, in
+increasing address order, similar to memcpy().
+
+== Register Locations ==
+
+=== x86, x86_64 Register Locations ===
+
+Selector Register IOport: 0x510
+Data Register IOport:     0x511
+DMA Address IOport:       0x514
+
+=== Arm Register Locations ===
+
+Selector Register address: Base + 8 (2 bytes)
+Data Register address:     Base + 0 (8 bytes)
+DMA Address address:       Base + 16 (8 bytes)
+
+== ACPI Interface ==
+
+The fw_cfg device is defined with ACPI ID "QEMU0002". Since we expect
+ACPI tables to be passed into the guest through the fw_cfg device itself,
+the guest-side firmware can not use ACPI to find fw_cfg. However, once the
+firmware is finished setting up ACPI tables and hands control over to the
+guest kernel, the latter can use the fw_cfg ACPI node for a more accurate
+inventory of in-use IOport or MMIO regions.
+
+== Firmware Configuration Items ==
+
+=== Signature (Key 0x0000, FW_CFG_SIGNATURE) ===
+
+The presence of the fw_cfg selector and data registers can be verified
+by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE),
+and reading four bytes from the data register. If the fw_cfg device is
+present, the four bytes read will contain the characters "QEMU".
+
+If the DMA interface is available, then reading the DMA Address
+Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format).
+
+=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) ===
+
+A 32-bit little-endian unsigned int, this item is used to check for enabled
+features.
+ - Bit 0: traditional interface. Always set.
+ - Bit 1: DMA interface.
+
+=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) ===
+
+Firmware configuration items stored at selector keys 0x0020 or higher
+(FW_CFG_FILE_FIRST or higher) have an associated entry in a directory
+structure, which makes it easier for guest-side firmware to identify
+and retrieve them. The format of this file directory (from fw_cfg.h in
+the QEMU source tree) is shown here, slightly annotated for clarity:
+
+struct FWCfgFiles {		/* the entire file directory fw_cfg item */
+    uint32_t count;		/* number of entries, in big-endian format */
+    struct FWCfgFile f[];	/* array of file entries, see below */
+};
+
+struct FWCfgFile {		/* an individual file entry, 64 bytes total */
+    uint32_t size;		/* size of referenced fw_cfg item, big-endian */
+    uint16_t select;		/* selector key of fw_cfg item, big-endian */
+    uint16_t reserved;
+    char name[56];		/* fw_cfg item name, NUL-terminated ascii */
+};
+
+=== All Other Data Items ===
+
+Please consult the QEMU source for the most up-to-date and authoritative list
+of selector keys and their respective items' purpose, format and writeability.
+
+=== Ranges ===
+
+Theoretically, there may be up to 0x4000 generic firmware configuration
+items, and up to 0x4000 architecturally specific ones.
+
+Selector Reg.    Range Usage
+---------------  -----------
+0x0000 - 0x3fff  Generic (0x0000 - 0x3fff, generally RO, possibly RW through
+                          the DMA interface in QEMU v2.9+)
+0x4000 - 0x7fff  Generic (0x0000 - 0x3fff, RW, ignored in QEMU v2.4+)
+0x8000 - 0xbfff  Arch. Specific (0x0000 - 0x3fff, generally RO, possibly RW
+                                 through the DMA interface in QEMU v2.9+)
+0xc000 - 0xffff  Arch. Specific (0x0000 - 0x3fff, RW, ignored in v2.4+)
+
+In practice, the number of allowed firmware configuration items depends on the
+machine type/version.
+
+= Guest-side DMA Interface =
+
+If bit 1 of the feature bitmap is set, the DMA interface is present. This does
+not replace the existing fw_cfg interface, it is an add-on. This interface
+can be used through the 64-bit wide address register.
+
+The address register is in big-endian format. The value for the register is 0
+at startup and after an operation. A write to the least significant half (at
+offset 4) triggers an operation. This means that operations with 32-bit
+addresses can be triggered with just one write, whereas operations with
+64-bit addresses can be triggered with one 64-bit write or two 32-bit writes,
+starting with the most significant half (at offset 0).
+
+In this register, the physical address of a FWCfgDmaAccess structure in RAM
+should be written. This is the format of the FWCfgDmaAccess structure:
+
+typedef struct FWCfgDmaAccess {
+    uint32_t control;
+    uint32_t length;
+    uint64_t address;
+} FWCfgDmaAccess;
+
+The fields of the structure are in big endian mode, and the field at the lowest
+address is the "control" field.
+
+The "control" field has the following bits:
+ - Bit 0: Error
+ - Bit 1: Read
+ - Bit 2: Skip
+ - Bit 3: Select. The upper 16 bits are the selected index.
+ - Bit 4: Write
+
+When an operation is triggered, if the "control" field has bit 3 set, the
+upper 16 bits are interpreted as an index of a firmware configuration item.
+This has the same effect as writing the selector register.
+
+If the "control" field has bit 1 set, a read operation will be performed.
+"length" bytes for the current selector and offset will be copied into the
+physical RAM address specified by the "address" field.
+
+If the "control" field has bit 4 set (and not bit 1), a write operation will be
+performed. "length" bytes will be copied from the physical RAM address
+specified by the "address" field to the current selector and offset. QEMU
+prevents starting or finishing the write beyond the end of the item associated
+with the current selector (i.e., the item cannot be resized). Truncated writes
+are dropped entirely. Writes to read-only items are also rejected. All of these
+write errors set bit 0 (the error bit) in the "control" field.
+
+If the "control" field has bit 2 set (and neither bit 1 nor bit 4), a skip
+operation will be performed. The offset for the current selector will be
+advanced "length" bytes.
+
+To check the result, read the "control" field:
+   error bit set        ->  something went wrong.
+   all bits cleared     ->  transfer finished successfully.
+   otherwise            ->  transfer still in progress (doesn't happen
+                            today due to implementation not being async,
+                            but may in the future).
+
+= Externally Provided Items =
+
+Since v2.4, "file" fw_cfg items (i.e., items with selector keys above
+FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file
+directory structure) may be inserted via the QEMU command line, using
+the following syntax:
+
+    -fw_cfg [name=]<item_name>,file=<path>
+
+Or
+
+    -fw_cfg [name=]<item_name>,string=<string>
+
+Since v5.1, QEMU allows some objects to generate fw_cfg-specific content,
+the content is then associated with a "file" item using the 'gen_id' option
+in the command line, using the following syntax:
+
+    -object <generator-type>,id=<generated_id>,[generator-specific-options] \
+    -fw_cfg [name=]<item_name>,gen_id=<generated_id>
+
+See QEMU man page for more documentation.
+
+Using item_name with plain ASCII characters only is recommended.
+
+Item names beginning with "opt/" are reserved for users.  QEMU will
+never create entries with such names unless explicitly ordered by the
+user.
+
+To avoid clashes among different users, it is strongly recommended
+that you use names beginning with opt/RFQDN/, where RFQDN is a reverse
+fully qualified domain name you control.  For instance, if SeaBIOS
+wanted to define additional names, the prefix "opt/org.seabios/" would
+be appropriate.
+
+For historical reasons, "opt/ovmf/" is reserved for OVMF firmware.
+
+Prefix "opt/org.qemu/" is reserved for QEMU itself.
+
+Use of names not beginning with "opt/" is potentially dangerous and
+entirely unsupported.  QEMU will warn if you try.
+
+Use of names not beginning with "opt/" is tolerated with 'gen_id' (that
+is, the warning is suppressed), but you must know exactly what you're
+doing.
+
+All externally provided fw_cfg items are read-only to the guest.
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
new file mode 100644
index 000000000..ecc43896b
--- /dev/null
+++ b/docs/specs/index.rst
@@ -0,0 +1,20 @@
+----------------------------------------------
+System Emulation Guest Hardware Specifications
+----------------------------------------------
+
+This section of the manual contains specifications of
+guest hardware that is specific to QEMU.
+
+.. toctree::
+   :maxdepth: 2
+
+   ppc-xive
+   ppc-spapr-xive
+   ppc-spapr-numa
+   acpi_hw_reduced_hotplug
+   tpm
+   acpi_hest_ghes
+   acpi_cpu_hotplug
+   acpi_mem_hotplug
+   acpi_pci_hotplug
+   acpi_nvdimm
diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
new file mode 100644
index 000000000..1beb3a01e
--- /dev/null
+++ b/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,258 @@
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host.  In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory:
+
+      -device ivshmem-plain,memdev=HMB,...
+
+  This uses host memory backend HMB.  It should have option "share"
+  set.
+
+- Shared memory plus interrupts:
+
+      -device ivshmem-doorbell,chardev=CHR,vectors=N,...
+
+  An ivshmem server must already be running on the host.  The device
+  connects to the server's UNIX domain socket via character device
+  CHR.
+
+  Each peer gets assigned a unique ID by the server.  IDs must be
+  between 0 and 65535.
+
+  Interrupts are message-signaled (MSI-X).  vectors=N configures the
+  number of vectors to use.
+
+For more details on ivshmem device properties, see the QEMU Emulator
+user documentation.
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 1.  Before
+QEMU 2.6.0, it had revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices.  This way,
+  you have access to the shared memory in the guest and can use it as
+  you see fit.  Memnic, for example, uses ivshmem this way from guest
+  user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+  other, you need BAR0 and BAR1.  You will most likely want to write a
+  kernel driver to handle interrupts.  Requires the device to be
+  configured for interrupts, obviously.
+
+Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
+configured for interrupts.  It becomes safely accessible only after
+the ivshmem server provided the shared memory.  These devices have PCI
+revision 0 rather than 1.  Guest software should wait for the
+IVPosition register (described below) to become non-negative before
+accessing BAR2.
+
+Revision 0 of the device is not capable to tell guest software whether
+it is configured for interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+    Offset  Size  Access      On reset  Function
+        0     4   read/write        0   Interrupt Mask
+                                        bit 0: peer interrupt (rev 0)
+                                               reserved       (rev 1)
+                                        bit 1..31: reserved
+        4     4   read/write        0   Interrupt Status
+                                        bit 0: peer interrupt (rev 0)
+                                               reserved       (rev 1)
+                                        bit 1..31: reserved
+        8     4   read-only   0 or ID   IVPosition
+       12     4   write-only      N/A   Doorbell
+                                        bit 0..15: vector
+                                        bit 16..31: peer ID
+       16   240   none            N/A   reserved
+
+Software should only access the registers as specified in column
+"Access".  Reserved bits should be ignored on read, and preserved on
+write.
+
+In revision 0 of the device, Interrupt Status and Mask Register
+together control the legacy INTx interrupt when the device has no
+MSI-X capability: INTx is asserted when the bit-wise AND of Status and
+Mask is non-zero and the device has no MSI-X capability.  Interrupt
+Status Register bit 0 becomes 1 when an interrupt request from a peer
+is received.  Reading the register clears it.
+
+IVPosition Register: if the device is not configured for interrupts,
+this is zero.  Else, it is the device's ID (between 0 and 65535).
+
+Before QEMU 2.6.0, the register may read -1 for a short while after
+reset.  These devices have PCI revision 0 rather than 1.
+
+There is no good way for software to find out whether the device is
+configured for interrupts.  A positive IVPosition means interrupts,
+but zero could be either.
+
+Doorbell Register: writing this register requests to interrupt a peer.
+The written value's high 16 bits are the ID of the peer to interrupt,
+and its low 16 bits select an interrupt vector.
+
+If the device is not configured for interrupts, the write is ignored.
+
+If the interrupt hasn't completed setup, the write is ignored.  The
+device is not capable to tell guest software whether setup is
+complete.  Interrupts can regress to this state on migration.
+
+If the peer with the requested ID isn't connected, or it has fewer
+interrupt vectors connected, the write is ignored.  The device is not
+capable to tell guest software what peers are connected, or how many
+interrupt vectors are connected.
+
+The peer's interrupt for this vector then becomes pending.  There is
+no way for software to clear the pending bit, and a polling mode of
+operation is therefore impossible.
+
+If the peer is a revision 0 device without MSI-X capability, its
+Interrupt Status register is set to 1.  This asserts INTx unless
+masked by the Interrupt Mask register.  The device is not capable to
+communicate the interrupt vector to guest software then.
+
+With multiple MSI-X vectors, different vectors can be used to indicate
+different events have occurred.  The semantics of interrupt vectors
+are left to the application.
+
+
+== Interrupt infrastructure ==
+
+When configured for interrupts, the peers share eventfd objects in
+addition to shared memory.  The shared resources are managed by an
+ivshmem server.
+
+=== The ivshmem server ===
+
+The server listens on a UNIX domain socket.
+
+For each new client that connects to the server, the server
+- picks an ID,
+- creates eventfd file descriptors for the interrupt vectors,
+- sends the ID and the file descriptor for the shared memory to the
+  new client,
+- sends connect notifications for the new client to the other clients
+  (these contain file descriptors for sending interrupts),
+- sends connect notifications for the other clients to the new client,
+  and
+- sends interrupt setup messages to the new client (these contain file
+  descriptors for receiving interrupts).
+
+The first client to connect to the server receives ID zero.
+
+When a client disconnects from the server, the server sends disconnect
+notifications to the other clients.
+
+The next section describes the protocol in detail.
+
+If the server terminates without sending disconnect notifications for
+its connected clients, the clients can elect to continue.  They can
+communicate with each other normally, but won't receive disconnect
+notification on disconnect, and no new clients can connect.  There is
+no way for the clients to connect to a restarted server.  The device
+is not capable to tell guest software whether the server is still up.
+
+Example server code is in contrib/ivshmem-server/.  Not to be used in
+production.  It assumes all clients use the same number of interrupt
+vectors.
+
+A standalone client is in contrib/ivshmem-client/.  It can be useful
+for debugging.
+
+=== The ivshmem Client-Server Protocol ===
+
+An ivshmem device configured for interrupts connects to an ivshmem
+server.  This section details the protocol between the two.
+
+The connection is one-way: the server sends messages to the client.
+Each message consists of a single 8 byte little-endian signed number,
+and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
+client and server close the connection on error.
+
+Note: QEMU currently doesn't close the connection right on error, but
+only when the character device is destroyed.
+
+On connect, the server sends the following messages in order:
+
+1. The protocol version number, currently zero.  The client should
+   close the connection on receipt of versions it can't handle.
+
+2. The client's ID.  This is unique among all clients of this server.
+   IDs must be between 0 and 65535, because the Doorbell register
+   provides only 16 bits for them.
+
+3. The number -1, accompanied by the file descriptor for the shared
+   memory.
+
+4. Connect notifications for existing other clients, if any.  This is
+   a peer ID (number between 0 and 65535 other than the client's ID),
+   repeated N times.  Each repetition is accompanied by one file
+   descriptor.  These are for interrupting the peer with that ID using
+   vector 0,..,N-1, in order.  If the client is configured for fewer
+   vectors, it closes the extra file descriptors.  If it is configured
+   for more, the extra vectors remain unconnected.
+
+5. Interrupt setup.  This is the client's own ID, repeated N times.
+   Each repetition is accompanied by one file descriptor.  These are
+   for receiving interrupts from peers using vector 0,..,N-1, in
+   order.  If the client is configured for fewer vectors, it closes
+   the extra file descriptors.  If it is configured for more, the
+   extra vectors remain unconnected.
+
+From then on, the server sends these kinds of messages:
+
+6. Connection / disconnection notification.  This is a peer ID.
+
+  - If the number comes with a file descriptor, it's a connection
+    notification, exactly like in step 4.
+
+  - Else, it's a disconnection notification for the peer with that ID.
+
+Known bugs:
+
+* The protocol changed incompatibly in QEMU 2.5.  Before, messages
+  were native endian long, and there was no version number.
+
+* The protocol is poorly designed.
+
+=== The ivshmem Client-Client Protocol ===
+
+An ivshmem device configured for interrupts receives eventfd file
+descriptors for interrupting peers and getting interrupted by peers
+from the server, as explained in the previous section.
+
+To interrupt a peer, the device writes the 8-byte integer 1 in native
+byte order to the respective file descriptor.
+
+To receive an interrupt, the device reads and discards as many 8-byte
+integers as it can.
diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
new file mode 100644
index 000000000..5e407a6f3
--- /dev/null
+++ b/docs/specs/pci-ids.txt
@@ -0,0 +1,71 @@
+
+PCI IDs for qemu
+================
+
+Red Hat, Inc. donates a part of its device ID range to qemu, to be used for
+virtual devices.  The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
+
+Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
+for your devices.
+
+1af4 vendor ID
+--------------
+
+The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
+Note that this allocation separate from the virtio device IDs, which are
+maintained as part of the virtio specification.
+
+1af4:1000  network device (legacy)
+1af4:1001  block device (legacy)
+1af4:1002  balloon device (legacy)
+1af4:1003  console device (legacy)
+1af4:1004  SCSI host bus adapter device (legacy)
+1af4:1005  entropy generator device (legacy)
+1af4:1009  9p filesystem device (legacy)
+
+1af4:1041  network device (modern)
+1af4:1042  block device (modern)
+1af4:1043  console device (modern)
+1af4:1044  entropy generator device (modern)
+1af4:1045  balloon device (modern)
+1af4:1048  SCSI host bus adapter device (modern)
+1af4:1049  9p filesystem device (modern)
+1af4:1050  virtio gpu device (modern)
+1af4:1052  virtio input device (modern)
+
+1af4:10f0  Available for experimental usage without registration.  Must get
+   to      official ID when the code leaves the test lab (i.e. when seeking
+1af4:10ff  upstream merge or shipping a distro/product) to avoid conflicts.
+
+1af4:1100  Used as PCI Subsystem ID for existing hardware devices emulated
+           by qemu.
+
+1af4:1110  ivshmem device (shared memory, docs/specs/ivshmem-spec.txt)
+
+All other device IDs are reserved.
+
+1b36 vendor ID
+--------------
+
+The 0000 -> 00ff device ID range is used as follows for QEMU-specific
+PCI devices (other than virtio):
+
+1b36:0001  PCI-PCI bridge
+1b36:0002  PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
+1b36:0003  PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
+1b36:0004  PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
+1b36:0005  PCI test device (docs/specs/pci-testdev.txt)
+1b36:0006  PCI Rocker Ethernet switch device
+1b36:0007  PCI SD Card Host Controller Interface (SDHCI)
+1b36:0008  PCIe host bridge
+1b36:0009  PCI Expander Bridge (-device pxb)
+1b36:000a  PCI-PCI bridge (multiseat)
+1b36:000b  PCIe Expander Bridge (-device pxb-pcie)
+1b36:000d  PCI xhci usb host adapter
+1b36:000f  mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
+1b36:0010  PCIe NVMe device (-device nvme)
+1b36:0011  PCI PVPanic device (-device pvpanic-pci)
+
+All these devices are documented in docs/specs.
+
+The 0100 device ID is used for the QXL video card device.
diff --git a/docs/specs/pci-serial.txt b/docs/specs/pci-serial.txt
new file mode 100644
index 000000000..66c761f2b
--- /dev/null
+++ b/docs/specs/pci-serial.txt
@@ -0,0 +1,34 @@
+
+QEMU pci serial devices
+=======================
+
+There is one single-port variant and two muliport-variants.  Linux
+guests out-of-the box with all cards.  There is a Windows inf file
+(docs/qemupciserial.inf) to setup the single-port card in Windows
+guests.
+
+
+single-port card
+----------------
+
+Name:   pci-serial
+PCI ID: 1b36:0002
+
+PCI Region 0:
+   IO bar, 8 bytes long, with the 16550 uart mapped to it.
+   Interrupt is wired to pin A.
+
+
+multiport cards
+---------------
+
+Name:   pci-serial-2x
+PCI ID: 1b36:0003
+
+Name:   pci-serial-4x
+PCI ID: 1b36:0004
+
+PCI Region 0:
+   IO bar, with two/four 16550 uart mapped after each other.
+   The first is at offset 0, second at offset 8, ...
+   Interrupt is wired to pin A.
diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt
new file mode 100644
index 000000000..4280a1e73
--- /dev/null
+++ b/docs/specs/pci-testdev.txt
@@ -0,0 +1,31 @@
+pci-test is a device used for testing low level IO
+
+device implements up to three BARs: BAR0, BAR1 and BAR2.
+Each of BAR 0+1 can be memory or IO. Guests must detect
+BAR types and act accordingly.
+
+BAR 0+1 size is up to 4K bytes each.
+BAR 0+1 starts with the following header:
+
+typedef struct PCITestDevHdr {
+    uint8_t test;  <- write-only, starts a given test number
+    uint8_t width_type; <- read-only, type and width of access for a given test.
+                           1,2,4 for byte,word or long write.
+                           any other value if test not supported on this BAR
+    uint8_t pad0[2];
+    uint32_t offset; <- read-only, offset in this BAR for a given test
+    uint32_t data;    <- read-only, data to use for a given test
+    uint32_t count;  <- for debugging. number of writes detected.
+    uint8_t name[]; <- for debugging. 0-terminated ASCII string.
+} PCITestDevHdr;
+
+All registers are little endian.
+
+device is expected to always implement tests 0 to N on each BAR, and to add new
+tests with higher numbers.  In this way a guest can scan test numbers until it
+detects an access type that it does not support on this BAR, then stop.
+
+BAR2 is a 64bit memory bar, without backing storage.  It is disabled
+by default and can be enabled using the membar=<size> property.  This
+can be used to test whether guests handle pci bars of a specific
+(possibly quite large) size correctly.
diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt
new file mode 100644
index 000000000..93fe3da91
--- /dev/null
+++ b/docs/specs/ppc-spapr-hcalls.txt
@@ -0,0 +1,78 @@
+When used with the "pseries" machine type, QEMU-system-ppc64 implements
+a set of hypervisor calls using a subset of the server "PAPR" specification
+(IBM internal at this point), which is also what IBM's proprietary hypervisor
+adheres too.
+
+The subset is selected based on the requirements of Linux as a guest.
+
+In addition to those calls, we have added our own private hypervisor
+calls which are mostly used as a private interface between the firmware
+running in the guest and QEMU.
+
+All those hypercalls start at hcall number 0xf000 which correspond
+to an implementation specific range in PAPR.
+
+- H_RTAS (0xf000)
+
+RTAS is a set of runtime services generally provided by the firmware
+inside the guest to the operating system. It predates the existence
+of hypervisors (it was originally an extension to Open Firmware) and
+is still used by PAPR to provide various services that aren't performance
+sensitive.
+
+We currently implement the RTAS services in QEMU itself. The actual RTAS
+"firmware" blob in the guest is a small stub of a few instructions which
+calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
+
+Arguments:
+
+  r3 : H_RTAS (0xf000)
+  r4 : Guest physical address of RTAS parameter block
+
+Returns:
+
+  H_SUCCESS   : Successfully called the RTAS function (RTAS result
+                will have been stored in the parameter block)
+  H_PARAMETER : Unknown token
+
+- H_LOGICAL_MEMOP (0xf001)
+
+When the guest runs in "real mode" (in powerpc lingua this means
+with MMU disabled, ie guest effective == guest physical), it only
+has access to a subset of memory and no IOs.
+
+PAPR provides a set of hypervisor calls to perform cacheable or
+non-cacheable accesses to any guest physical addresses that the
+guest can use in order to access IO devices while in real mode.
+
+This is typically used by the firmware running in the guest.
+
+However, doing a hypercall for each access is extremely inefficient
+(even more so when running KVM) when accessing the frame buffer. In
+that case, things like scrolling become unusably slow.
+
+This hypercall allows the guest to request a "memory op" to be applied
+to memory. The supported memory ops at this point are to copy a range
+of memory (supports overlap of source and destination) and XOR which
+is used by our SLOF firmware to invert the screen.
+
+Arguments:
+
+  r3: H_LOGICAL_MEMOP (0xf001)
+  r4: Guest physical address of destination
+  r5: Guest physical address of source
+  r6: Individual element size
+        0 = 1 byte
+        1 = 2 bytes
+        2 = 4 bytes
+        3 = 8 bytes
+  r7: Number of elements
+  r8: Operation
+        0 = copy
+        1 = xor
+
+Returns:
+
+  H_SUCCESS   : Success
+  H_PARAMETER : Invalid argument
+
diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt
new file mode 100644
index 000000000..d4fb2d46d
--- /dev/null
+++ b/docs/specs/ppc-spapr-hotplug.txt
@@ -0,0 +1,409 @@
+= sPAPR Dynamic Reconfiguration =
+
+sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
+to handle hotplugging of dynamic "physical" resources like PCI cards, or
+"logical"/paravirtual resources like memory, CPUs, and "physical"
+host-bridges, which are generally managed by the host/hypervisor and provided
+to guests as virtualized resources. The specifics of dynamic-reconfiguration
+are documented extensively in PAPR+ v2.7, Section 13.1. This document
+provides a summary of that information as it applies to the implementation
+within QEMU.
+
+== Dynamic-reconfiguration Connectors ==
+
+To manage hotplug/unplug of these resources, a firmware abstraction known as
+a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
+resource to the guest, and provide an interface for the guest to manage
+configuration/removal of the resource associated with it.
+
+== Device-tree description of DRCs ==
+
+A set of 4 Open Firmware device tree array properties are used to describe
+the name/index/power-domain/type of each DRC allocated to a guest at
+boot-time. There may be multiple sets of these arrays, rooted at different
+paths in the device tree depending on the type of resource the DRCs manage.
+
+In some cases, the DRCs themselves may be provided by a dynamic resource,
+such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
+arrays would be fetched as part of the device tree retrieval interfaces
+for hotplugged resources described under "Guest->Host interface".
+
+The array properties are described below. Each entry/element in an array
+describes the DRC identified by the element in the corresponding position
+of ibm,drc-indexes:
+
+ibm,drc-names:
+  first 4-bytes: BE-encoded integer denoting the number of entries
+  each entry: a NULL-terminated <name> string encoded as a byte array
+
+  <name> values for logical/virtual resources are defined in PAPR+ v2.7,
+  Section 13.5.2.4, and basically consist of the type of the resource
+  followed by a space and a numerical value that's unique across resources
+  of that type.
+
+  <name> values for "physical" resources such as PCI or VIO devices are
+  defined as being "location codes", which are the "location labels" of
+  each encapsulating device, starting from the chassis down to the
+  individual slot for the device, concatenated by a hyphen. This provides
+  a mapping of resources to a physical location in a chassis for debugging
+  purposes. For QEMU, this mapping is less important, so we assign a
+  location code that conforms to naming specifications, but is simply a
+  location label for the slot by itself to simplify the implementation.
+  The naming convention for location labels is documented in detail in
+  PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
+  for PCI/VIO device slots, where <n> is unique across all PCI/VIO
+  device slots.
+
+ibm,drc-indexes:
+  first 4-bytes: BE-encoded integer denoting the number of entries
+  each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
+    in the machine
+
+  <index> is arbitrary, but in the case of QEMU we try to maintain the
+  convention used to assign them to pSeries guests on pHyp:
+
+    bit[31:28]: integer encoding of <type>, where <type> is:
+                  1 for CPU resource
+                  2 for PHB resource
+                  3 for VIO resource
+                  4 for PCI resource
+                  8 for Memory resource
+    bit[27:0]: integer encoding of <id>, where <id> is unique across
+                 all resources of specified type
+
+ibm,drc-power-domains:
+  first 4-bytes: BE-encoded integer denoting the number of entries
+  each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
+    power domain the resource will be assigned to. In the case of QEMU
+    we associated all resources with a "live insertion" domain, where the
+    power is assumed to be managed automatically. The integer value for
+    this domain is a special value of -1.
+
+
+ibm,drc-types:
+  first 4-bytes: BE-encoded integer denoting the number of entries
+  each entry: a NULL-terminated <type> string encoded as a byte array
+
+  <type> is assigned as follows:
+    "CPU" for a CPU
+    "PHB" for a physical host-bridge
+    "SLOT" for a VIO slot
+    "28" for a PCI slot
+    "MEM" for memory resource
+
+== Guest->Host interface to manage dynamic resources ==
+
+Each DRC is given a globally unique DRC Index, and resources associated with
+a particular DRC are configured/managed by the guest via a number of RTAS
+calls which reference individual DRCs based on the DRC index. This can be
+considered the guest->host interface.
+
+rtas-set-power-level:
+  arg[0]: integer identifying power domain
+  arg[1]: new power level for the domain, 0-100
+  output[0]: status, 0 on success
+  output[1]: power level after command
+
+  Set the power level for a specified power domain
+
+rtas-get-power-level:
+  arg[0]: integer identifying power domain
+  output[0]: status, 0 on success
+  output[1]: current power level
+
+  Get the power level for a specified power domain
+
+rtas-set-indicator:
+  arg[0]: integer identifying sensor/indicator type
+  arg[1]: index of sensor, for DR-related sensors this is generally the
+          DRC index
+  arg[2]: desired sensor value
+  output[0]: status, 0 on success
+
+  Set the state of an indicator or sensor. For the purpose of this document we
+  focus on the indicator/sensor types associated with a DRC. The types are:
+
+    9001: isolation-state, controls/indicates whether a device has been made
+          accessible to a guest
+
+          supported sensor values:
+            0: isolate, device is made unaccessible by guest OS
+            1: unisolate, device is made available to guest OS
+
+    9002: dr-indicator, controls "visual" indicator associated with device
+
+          supported sensor values:
+            0: inactive, resource may be safely removed
+            1: active, resource is in use and cannot be safely removed
+            2: identify, used to visually identify slot for interactive hotplug
+            3: action, in most cases, used in the same manner as identify
+
+    9003: allocation-state, generally only used for "logical" DR resources to
+          request the allocation/deallocation of a resource prior to acquiring
+          it via isolation-state->unisolate, or after releasing it via
+          isolation-state->isolate, respectively. for "physical" DR (like PCI
+          hotplug/unplug) the pre-allocation of the resource is implied and
+          this sensor is unused.
+
+          supported sensor values:
+            0: unusable, tell firmware/system the resource can be
+               unallocated/reclaimed and added back to the system resource pool
+            1: usable, request the resource be allocated/reserved for use by
+               guest OS
+            2: exchange, used to allocate a spare resource to use for fail-over
+               in certain situations. unused in QEMU
+            3: recover, used to reclaim a previously allocated resource that's
+               not currently allocated to the guest OS. unused in QEMU
+
+rtas-get-sensor-state:
+  arg[0]: integer identifying sensor/indicator type
+  arg[1]: index of sensor, for DR-related sensors this is generally the
+          DRC index
+  output[0]: status, 0 on success
+
+  Used to read an indicator or sensor value.
+
+  For DR-related operations, the only noteworthy sensor is dr-entity-sense,
+  which has a type value of 9003, as allocation-state does in the case of
+  rtas-set-indicator. The semantics/encodings of the sensor values are distinct
+  however:
+
+  supported sensor values for dr-entity-sense (9003) sensor:
+    0: empty,
+         for physical resources: DRC/slot is empty
+         for logical resources: unused
+    1: present,
+         for physical resources: DRC/slot is populated with a device/resource
+         for logical resources: resource has been allocated to the DRC
+    2: unusable,
+         for physical resources: unused
+         for logical resources: DRC has no resource allocated to it
+    3: exchange,
+         for physical resources: unused
+         for logical resources: resource available for exchange (see
+           allocation-state sensor semantics above)
+    4: recovery,
+         for physical resources: unused
+         for logical resources: resource available for recovery (see
+           allocation-state sensor semantics above)
+
+rtas-ibm-configure-connector:
+  arg[0]: guest physical address of 4096-byte work area buffer
+  arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
+          if a prior RTAS response indicated a need for additional memory
+  output[0]: status:
+               0: completed transmittal of device-tree node
+               1: instruct guest to prepare for next DT sibling node
+               2: instruct guest to prepare for next DT child node
+               3: instruct guest to prepare for next DT property
+               4: instruct guest to ascend to parent DT node
+               5: instruct guest to provide additional work-area buffer
+                  via arg[1]
+            990x: instruct guest that operation took too long and to try
+                  again later
+
+  Used to fetch an OF device-tree description of the resource associated with
+  a particular DRC. The DRC index is encoded in the first 4-bytes of the first
+  work area buffer.
+
+  Work area layout, using 4-byte offsets:
+    wa[0]: DRC index of the DRC to fetch device-tree nodes from
+    wa[1]: 0 (hard-coded)
+    wa[2]: for next-sibling/next-child response:
+             wa offset of null-terminated string denoting the new node's name
+           for next-property response:
+             wa offset of null-terminated string denoting new property's name
+    wa[3]: for next-property response (unused otherwise):
+             byte-length of new property's value
+    wa[4]: for next-property response (unused otherwise):
+             new property's value, encoded as an OFDT-compatible byte array
+
+== hotplug/unplug events ==
+
+For most DR operations, the hypervisor will issue host->guest add/remove events
+using the EPOW/check-exception notification framework, where the host issues a
+check-exception interrupt, then provides an RTAS event log via an
+rtas-check-exception call issued by the guest in response. This framework is
+documented by PAPR+ v2.7, and already use in by QEMU for generating powerdown
+requests via EPOW events.
+
+For DR, this framework has been extended to include hotplug events, which were
+previously unneeded due to direct manipulation of DR-related guest userspace
+tools by host-level management such as an HMC. This level of management is not
+applicable to PowerKVM, hence the reason for extending the notification
+framework to support hotplug events.
+
+The format for these EPOW-signalled events is described below under
+"hotplug/unplug event structure". Note that these events are not
+formally part of the PAPR+ specification, and have been superseded by a
+newer format, also described below under "hotplug/unplug event structure",
+and so are now deemed a "legacy" format. The formats are similar, but the
+"modern" format contains additional fields/flags, which are denoted for the
+purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards.
+
+QEMU should assume support only for "legacy" fields/flags unless the guest
+advertises support for the "modern" format via ibm,client-architecture-support
+hcall by setting byte 5, bit 6 of it's ibm,architecture-vec-5 option vector
+structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events,
+"modern" format events are surfaced to the guest via check-exception RTAS calls,
+but use a dedicated event source to signal the guest. This event source is
+advertised to the guest by the addition of a "hot-plug-events" node under
+"/event-sources" node of the guest's device tree using the standard format
+described in LoPAPR v11, B.6.12.1.
+
+== hotplug/unplug event structure ==
+
+The hotplug-specific payload in QEMU is implemented as follows (with all values
+encoded in big-endian format):
+
+struct rtas_event_log_v6_hp {
+#define SECTION_ID_HOTPLUG              0x4850 /* HP */
+    struct section_header {
+        uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
+        uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
+                                         * plus the length of the DRC name
+                                         * if a DRC name identifier is
+                                         * specified for hotplug_identifier
+                                         */
+        uint8_t section_version;        /* version 1 */
+        uint8_t section_subtype;        /* unused */
+        uint16_t creator_component_id;  /* unused */
+    } hdr;
+#define RTAS_LOG_V6_HP_TYPE_CPU         1
+#define RTAS_LOG_V6_HP_TYPE_MEMORY      2
+#define RTAS_LOG_V6_HP_TYPE_SLOT        3
+#define RTAS_LOG_V6_HP_TYPE_PHB         4
+#define RTAS_LOG_V6_HP_TYPE_PCI         5
+    uint8_t hotplug_type;               /* type of resource/device */
+#define RTAS_LOG_V6_HP_ACTION_ADD       1
+#define RTAS_LOG_V6_HP_ACTION_REMOVE    2
+    uint8_t hotplug_action;             /* action (add/remove) */
+#define RTAS_LOG_V6_HP_ID_DRC_NAME          1
+#define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
+#define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
+#ifdef GUEST_SUPPORTS_MODERN
+#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
+#endif
+    uint8_t hotplug_identifier;         /* type of the resource identifier,
+                                         * which serves as the discriminator
+                                         * for the 'drc' union field below
+                                         */
+#ifdef GUEST_SUPPORTS_MODERN
+    uint8_t capabilities;               /* capability flags, currently unused
+                                         * by QEMU
+                                         */
+#else
+    uint8_t reserved;
+#endif
+    union {
+        uint32_t index;                 /* DRC index of resource to take action
+                                         * on
+                                         */
+        uint32_t count;                 /* number of DR resources to take
+                                         * action on (guest chooses which)
+                                         */
+#ifdef GUEST_SUPPORTS_MODERN
+        struct {
+            uint32_t count;             /* number of DR resources to take
+                                         * action on
+                                         */
+            uint32_t index;             /* DRC index of first resource to take
+                                         * action on. guest will take action
+                                         * on DRC index <index> through
+                                         * DRC index <index + count - 1> in
+                                         * sequential order
+                                         */
+        } count_indexed;
+#endif
+        char name[1];                   /* string representing the name of the
+                                         * DRC to take action on
+                                         */
+    } drc;
+} QEMU_PACKED;
+
+== ibm,lrdr-capacity ==
+
+ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
+the dynamic reconfiguration capabilities of the guest. It consists of a triple
+consisting of <phys>, <size> and <maxcpus>.
+
+  <phys>, encoded in BE format represents the maximum address in bytes and
+  hence the maximum memory that can be allocated to the guest.
+
+  <size>, encoded in BE format represents the size increments in which
+  memory can be hot-plugged to the guest.
+
+  <maxcpus>, a BE-encoded integer, represents the maximum number of
+  processors that the guest can have.
+
+pseries guests use this property to note the maximum allowed CPUs for the
+guest.
+
+== ibm,dynamic-reconfiguration-memory ==
+
+ibm,dynamic-reconfiguration-memory is a device tree node that represents
+dynamically reconfigurable logical memory blocks (LMB). This node
+is generated only when the guest advertises the support for it via
+ibm,client-architecture-support call. Memory that is not dynamically
+reconfigurable is represented by /memory nodes. The properties of this
+node that are of interest to the sPAPR memory hotplug implementation
+in QEMU are described here.
+
+ibm,lmb-size
+
+This 64bit integer defines the size of each dynamically reconfigurable LMB.
+
+ibm,associativity-lookup-arrays
+
+This property defines a lookup array in which the NUMA associativity
+information for each LMB can be found. It is a property encoded array
+that begins with an integer M, the number of associativity lists followed
+by an integer N, the number of entries per associativity list and terminated
+by M associativity lists each of length N integers.
+
+This property provides the same information as given by ibm,associativity
+property in a /memory node. Each assigned LMB has an index value between
+0 and M-1 which is used as an index into this table to select which
+associativity list to use for the LMB. This index value for each LMB
+is defined in ibm,dynamic-memory property.
+
+ibm,dynamic-memory
+
+This property describes the dynamically reconfigurable memory. It is a
+property encoded array that has an integer N, the number of LMBs followed
+by N LMB list entries.
+
+Each LMB list entry consists of the following elements:
+
+- Logical address of the start of the LMB encoded as a 64bit integer. This
+  corresponds to reg property in /memory node.
+- DRC index of the LMB that corresponds to ibm,my-drc-index property
+  in a /memory node.
+- Four bytes reserved for expansion.
+- Associativity list index for the LMB that is used as an index into
+  ibm,associativity-lookup-arrays property described earlier. This
+  is used to retrieve the right associativity list to be used for this
+  LMB.
+- A 32bit flags word. The bit at bit position 0x00000008 defines whether
+  the LMB is assigned to the partition as of boot time.
+
+ibm,dynamic-memory-v2
+
+This property describes the dynamically reconfigurable memory. This is
+an alternate and newer way to describe dynamically reconfigurable memory.
+It is a property encoded array that has an integer N (the number of
+LMB set entries) followed by N LMB set entries. There is an LMB set entry
+for each sequential group of LMBs that share common attributes.
+
+Each LMB set entry consists of the following elements:
+
+- Number of sequential LMBs in the entry represented by a 32bit integer.
+- Logical address of the first LMB in the set encoded as a 64bit integer.
+- DRC index of the first LMB in the set.
+- Associativity list index that is used as an index into
+  ibm,associativity-lookup-arrays property described earlier. This
+  is used to retrieve the right associativity list to be used for all
+  the LMBs in this set.
+- A 32bit flags word that applies to all the LMBs in the set.
+
+[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
new file mode 100644
index 000000000..ffa687dc8
--- /dev/null
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -0,0 +1,410 @@
+
+NUMA mechanics for sPAPR (pseries machines)
+============================================
+
+NUMA in sPAPR works different than the System Locality Distance
+Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
+1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
+document aims to complement this specification, providing details
+of the elements that impacts how QEMU views NUMA in pseries.
+
+Associativity and ibm,associativity property
+--------------------------------------------
+
+Associativity is defined as a group of platform resources that has
+similar mean performance (or in our context here, distance) relative to
+everyone else outside of the group.
+
+The format of the ibm,associativity property varies with the value of
+bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
+bit 0 equal to zero is deprecated. The current format, with the bit 0
+with the value of one, makes ibm,associativity property represent the
+physical hierarchy of the platform, as one or more lists that starts
+with the highest level grouping up to the smallest. Considering the
+following topology:
+
+::
+
+    Mem M1 ---- Proc P1    |
+    -----------------      | Socket S1  ---|
+          chip C1          |               |
+                                           | HW module 1 (MOD1)
+    Mem M2 ---- Proc P2    |               |
+    -----------------      | Socket S2  ---|
+          chip C2          |
+
+The ibm,associativity property for the processors would be:
+
+* P1: {MOD1, S1, C1, P1}
+* P2: {MOD1, S2, C2, P2}
+
+Each allocable resource has an ibm,associativity property. The LOPAPR
+specification allows multiple lists to be present in this property,
+considering that the same resource can have multiple connections to the
+platform.
+
+Relative Performance Distance and ibm,associativity-reference-points
+--------------------------------------------------------------------
+
+The ibm,associativity-reference-points property is an array that is used
+to define the relevant performance/distance  related boundaries, defining
+the NUMA levels for the platform.
+
+The definition of its elements also varies with the value of bit 0 of byte 5
+of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
+is also deprecated. With the current format, each integer of the
+ibm,associativity-reference-points represents an 1 based ordinal index (i.e.
+the first element is 1) of the ibm,associativity array. The first
+boundary is the most significant to application performance, followed by
+less significant boundaries. Allocated resources that belongs to the
+same performance boundaries are expected to have relative NUMA distance
+that matches the relevancy of the boundary itself. Resources that belongs
+to the same first boundary will have the shortest distance from each
+other. Subsequent boundaries represents greater distances and degraded
+performance.
+
+Using the previous example, the following setting reference points defines
+three NUMA levels:
+
+* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
+
+The first NUMA level (0x3) is interpreted as the third element of each
+ibm,associativity array, the second level is the second element and
+the third level is the first element. Let's also consider that elements
+belonging to the first NUMA level have distance equal to 10 from each
+other, and each NUMA level doubles the distance from the previous. This
+means that the second would be 20 and the third level 40. For the P1 and
+P2 processors, we would have the following NUMA levels:
+
+::
+
+  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+  First NUMA level (0x3) => associativity[2] = C1
+  Second NUMA level (0x2) => associativity[1] = S1
+  Third NUMA level (0x1) => associativity[0] = MOD1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+  First NUMA level (0x3) => associativity[2] = C2
+  Second NUMA level (0x2) => associativity[1] = S2
+  Third NUMA level (0x1) => associativity[0] = MOD1
+
+  P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40
+
+Changing the ibm,associativity-reference-points array changes the performance
+distance attributes for the same associativity arrays, as the following
+example illustrates:
+
+::
+
+  * ibm,associativity-reference-points = {0x2}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+  First NUMA level (0x2) => associativity[1] = S1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+  First NUMA level (0x2) => associativity[1] = S2
+
+  P1 and P2 does not have a common performance boundary. Since this is a one level
+  NUMA configuration, distance between them is one boundary above the first
+  level, 20.
+
+
+In a hypothetical platform where all resources inside the same hardware module
+is considered to be on the same performance boundary:
+
+::
+
+  * ibm,associativity-reference-points = {0x1}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+  First NUMA level (0x1) => associativity[0] = MOD0
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+  First NUMA level (0x1) => associativity[0] = MOD0
+
+  P1 and P2 belongs to the same first order boundary. The distance between then
+  is 10.
+
+
+How the pseries Linux guest calculates NUMA distances
+=====================================================
+
+Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is
+how the distances are expressed. The SLIT table provides the NUMA distance
+value between the relevant resources. LOPAPR does not provide a standard
+way to calculate it. We have the ibm,associativity for each resource, which
+provides a common-performance hierarchy,  and the ibm,associativity-reference-points
+array that tells which level of associativity is considered to be relevant
+or not.
+
+The result is that each OS is free to implement and to interpret the distance
+as it sees fit. For the pseries Linux guest, each level of NUMA duplicates
+the distance of the previous level, and the maximum amount of levels is
+limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
+kernel tree). This results in the following distances:
+
+* both resources in the first NUMA level: 10
+* resources one NUMA level apart: 20
+* resources two NUMA levels apart: 40
+* resources three NUMA levels apart: 80
+* resources four NUMA levels apart: 160
+
+
+pseries NUMA mechanics
+======================
+
+Starting in QEMU 5.2, the pseries machine considers user input when setting NUMA
+topology of the guest. The overall design is:
+
+* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing
+  for 4 distinct NUMA distance values based on the NUMA levels
+
+* ibm,max-associativity-domains supports multiple associativity domains in all
+  NUMA levels, granting user flexibility
+
+* ibm,associativity for all resources varies with user input
+
+These changes are only effective for pseries-5.2 and newer machines that are
+created with more than one NUMA node (disconsidering NUMA nodes created by
+the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been
+around for such a long time, with users seeing NUMA distances 10 and 40
+(and 80 if using NVLink2 GPUs), and there is no need to disrupt the
+existing experience of those guests.
+
+To bring the user experience x86 users have when tuning up NUMA, we had
+to operate under the current pseries Linux kernel logic described in
+`How the pseries Linux guest calculates NUMA distances`_. The result
+is that we needed to translate NUMA distance user input to pseries
+Linux kernel input.
+
+Translating user distance to kernel distance
+--------------------------------------------
+
+User input for NUMA distance can vary from 10 to 254. We need to translate
+that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
+This is how it is being done:
+
+* user distance 11 to 30 will be interpreted as 20
+* user distance 31 to 60 will be interpreted as 40
+* user distance 61 to 120 will be interpreted as 80
+* user distance 121 and beyond will be interpreted as 160
+* user distance 10 stays 10
+
+The reasoning behind this approximation is to avoid any round up to the local
+distance (10), keeping it exclusive to the 4th NUMA level (which is still
+exclusive to the node_id). All other ranges were chosen under the developer
+discretion of what would be (somewhat) sensible considering the user input.
+Any other strategy can be used here, but in the end the reality is that we'll
+have to accept that a large array of values will be translated to the same
+NUMA topology in the guest, e.g. this user input:
+
+::
+
+      0   1   2
+  0  10  31 120
+  1  31  10  30
+  2 120  30  10
+
+And this other user input:
+
+::
+
+      0   1   2
+  0  10  60  61
+  1  60  10  11
+  2  61  11  10
+
+Will both be translated to the same values internally:
+
+::
+
+      0   1   2
+  0  10  40  80
+  1  40  10  20
+  2  80  20  10
+
+Users are encouraged to use only the kernel values in the NUMA definition to
+avoid being taken by surprise with that the guest is actually seeing in the
+topology. There are enough potential surprises that are inherent to the
+associativity domain assignment process, discussed below.
+
+
+How associativity domains are assigned
+--------------------------------------
+
+LOPAPR allows more than one associativity array (or 'string') per allocated
+resource. This would be used to represent that the resource has multiple
+connections with the board, and then the operational system, when deciding
+NUMA distancing, should consider the associativity information that provides
+the shortest distance.
+
+The spapr implementation does not support multiple associativity arrays per
+resource, neither does the pseries Linux kernel. We'll have to represent the
+NUMA topology using one associativity per resource, which means that choices
+and compromises are going to be made.
+
+Consider the following NUMA topology entered by user input:
+
+::
+
+      0   1   2   3
+  0  10  40  20  40
+  1  40  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+All the associativity arrays are initialized with NUMA id in all associativity
+domains:
+
+* node 0: 0 0 0 0
+* node 1: 1 1 1 1
+* node 2: 2 2 2 2
+* node 3: 3 3 3 3
+
+
+Honoring just the relative distances of node 0 to every other node, we find the
+NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for
+each distance:
+
+* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
+* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+
+We'll copy the associativity domains of node 0 to all other nodes, based on
+the NUMA level matches. Between 0 and 1, a match in 0x2, we'll also copy
+the domains 0x2 and 0x1 from 0 to 1 as well. This will give us:
+
+* node 0: 0 0 0 0
+* node 1: 0 0 1 1
+
+Doing the same to node 2 and node 3, these are the associativity arrays
+after considering all matches with node 0:
+
+* node 0: 0 0 0 0
+* node 1: 0 0 1 1
+* node 2: 0 0 0 2
+* node 3: 0 0 3 3
+
+The distances related to node 0 are accounted for. For node 1, and keeping
+in mind that we don't need to revisit node 0 again, the distance from
+node 1 to 2 is 80, matching at 0x1, and distance from 1 to 3 is 40,
+match in 0x2. Repeating the same logic of copying all domains up to
+the NUMA level match:
+
+* node 0: 0 0 0 0
+* node 1: 1 0 1 1
+* node 2: 1 0 0 2
+* node 3: 1 0 3 3
+
+In the last step we will analyze just nodes 2 and 3. The desired distance
+between 2 and 3 is 20, i.e. a match in 0x3:
+
+* node 0: 0 0 0 0
+* node 1: 1 0 1 1
+* node 2: 1 0 0 2
+* node 3: 1 0 0 3
+
+
+The kernel will read these arrays and will calculate the following NUMA topology for
+the guest:
+
+::
+
+      0   1   2   3
+  0  10  40  20  20
+  1  40  10  40  40
+  2  20  40  10  20
+  3  20  40  20  10
+
+Note that this is not what the user wanted - the desired distance between
+0 and 3 is 40, we calculated it as 20. This is what the current logic and
+implementation constraints of the kernel and QEMU will provide inside the
+LOPAPR specification.
+
+Users are welcome to use this knowledge and experiment with the input to get
+the NUMA topology they want, or as closer as they want. The important thing
+is to keep expectations up to par with what we are capable of provide at this
+moment: an approximation.
+
+Limitations of the implementation
+---------------------------------
+
+As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate
+user choice. The Linux kernel, and PAPR itself, does not provide QEMU with the ways
+to fully map user input to actual NUMA distance the guest will use. These limitations
+creates two notable limitations in our support:
+
+* Asymmetrical topologies aren't supported. We only support NUMA topologies where
+  the distance from node A to B is always the same as B to A. We do not support
+  any A-B pair where the distance back and forth is asymmetric. For example, the
+  following topology isn't supported and the pSeries guest will not boot with this
+  user input:
+
+::
+
+      0   1
+  0  10  40
+  1  20  10
+
+
+* 'non-transitive' topologies will be poorly translated to the guest. This is the
+  kind of topology where the distance from a node A to B is X, B to C is X, but
+  the distance A to C is not X. E.g.:
+
+::
+
+      0   1   2   3
+  0  10  20  20  40
+  1  20  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+  In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40.
+  The kernel will always match with the shortest associativity domain possible,
+  and we're attempting to retain the previous established relations between the
+  nodes. This means that a distance equal to 20 between nodes 0 and 2 and the
+  same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3
+  to also be 20.
+
+
+Legacy (5.1 and older) pseries NUMA mechanics
+=============================================
+
+In short, we can summarize the NUMA distances seem in pseries Linux guests, using
+QEMU up to 5.1, as follows:
+
+* local distance, i.e. the distance of the resource to its own NUMA node: 10
+* if it's a NVLink GPU device, distance: 80
+* every other resource, distance: 40
+
+The way the pseries Linux guest calculates NUMA distances has a direct effect
+on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
+the default ibm,associativity-reference-points being used in the pseries
+machine:
+
+ibm,associativity-reference-points = {0x4, 0x4, 0x2}
+
+The first and second level are equal, 0x4, and a third one was added in
+commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that
+regardless of how the ibm,associativity properties are being created in
+the device tree, the pseries Linux guest will only recognize three scenarios
+as far as NUMA distance goes:
+
+* if the resources belongs to the same first NUMA level = 10
+* second level is skipped since it's equal to the first
+* all resources that aren't a NVLink GPU, it is guaranteed that they will belong
+  to the same third NUMA level, having distance = 40
+* for NVLink GPUs, distance = 80 from everything else
+
+This also means that user input in QEMU command line does not change the
+NUMA distancing inside the guest for the pseries machine.
diff --git a/docs/specs/ppc-spapr-uv-hcalls.txt b/docs/specs/ppc-spapr-uv-hcalls.txt
new file mode 100644
index 000000000..389c2740d
--- /dev/null
+++ b/docs/specs/ppc-spapr-uv-hcalls.txt
@@ -0,0 +1,76 @@
+On PPC64 systems supporting Protected Execution Facility (PEF), system
+memory can be placed in a secured region where only an "ultravisor"
+running in firmware can provide to access it. pseries guests on such
+systems can communicate with the ultravisor (via ultracalls) to switch to a
+secure VM mode (SVM) where the guest's memory is relocated to this secured
+region, making its memory inaccessible to normal processes/guests running on
+the host.
+
+The various ultracalls/hypercalls relating to SVM mode are currently
+only documented internally, but are planned for direct inclusion into the
+public OpenPOWER version of the PAPR specification (LoPAPR/LoPAR). An internal
+ACR has been filed to reserve a hypercall number range specific to this
+use-case to avoid any future conflicts with the internally-maintained PAPR
+specification. This document summarizes some of these details as they relate
+to QEMU.
+
+== hypercalls needed by the ultravisor ==
+
+Switching to SVM mode involves a number of hcalls issued by the ultravisor
+to the hypervisor to orchestrate the movement of guest memory to secure
+memory and various other aspects SVM mode. Numbers are assigned for these
+hcalls within the reserved range 0xEF00-0xEF80. The below documents the
+hcalls relevant to QEMU.
+
+- H_TPM_COMM (0xef10)
+
+  For TPM_COMM_OP_EXECUTE operation:
+    Send a request to a TPM and receive a response, opening a new TPM session
+    if one has not already been opened.
+
+  For TPM_COMM_OP_CLOSE_SESSION operation:
+    Close the existing TPM session, if any.
+
+  Arguments:
+
+    r3 : H_TPM_COMM (0xef10)
+    r4 : TPM operation, one of:
+         TPM_COMM_OP_EXECUTE (0x1)
+         TPM_COMM_OP_CLOSE_SESSION (0x2)
+    r5 : in_buffer, guest physical address of buffer containing the request
+         - Caller may use the same address for both request and response
+    r6 : in_size, size of the in buffer
+         - Must be less than or equal to 4KB
+    r7 : out_buffer, guest physical address of buffer to store the response
+         - Caller may use the same address for both request and response
+    r8 : out_size, size of the out buffer
+         - Must be at least 4KB, as this is the maximum request/response size
+           supported by most TPM implementations, including the TPM Resource
+           Manager in the linux kernel.
+
+  Return values:
+
+    r3 : H_Success    request processed successfully
+         H_PARAMETER  invalid TPM operation
+         H_P2         in_buffer is invalid
+         H_P3         in_size is invalid
+         H_P4         out_buffer is invalid
+         H_P5         out_size is invalid
+         H_RESOURCE   problem communicating with TPM
+         H_FUNCTION   TPM access is not currently allowed/configured
+    r4 : For TPM_COMM_OP_EXECUTE, the size of the response will be stored here
+         upon success.
+
+  Use-case/notes:
+
+    SVM filesystems are encrypted using a symmetric key. This key is then
+    wrapped/encrypted using the public key of a trusted system which has the
+    private key stored in the system's TPM. An Ultravisor will use this
+    hcall to unwrap/unseal the symmetric key using the system's TPM device
+    or a TPM Resource Manager associated with the device.
+
+    The Ultravisor sets up a separate session key with the TPM in advance
+    during host system boot. All sensitive in and out values will be
+    encrypted using the session key. Though the hypervisor will see the 'in'
+    and 'out' buffers in raw form, any sensitive contents will generally be
+    encrypted using this session key.
diff --git a/docs/specs/ppc-spapr-xive.rst b/docs/specs/ppc-spapr-xive.rst
new file mode 100644
index 000000000..f47f739e0
--- /dev/null
+++ b/docs/specs/ppc-spapr-xive.rst
@@ -0,0 +1,282 @@
+XIVE for sPAPR (pseries machines)
+=================================
+
+The POWER9 processor comes with a new interrupt controller
+architecture, called XIVE as "eXternal Interrupt Virtualization
+Engine". It supports a larger number of interrupt sources and offers
+virtualization features which enables the HW to deliver interrupts
+directly to virtual processors without hypervisor assistance.
+
+A QEMU ``pseries`` machine (which is PAPR compliant) using POWER9
+processors can run under two interrupt modes:
+
+- *Legacy Compatibility Mode*
+
+  the hypervisor provides identical interfaces and similar
+  functionality to PAPR+ Version 2.7.  This is the default mode
+
+  It is also referred as *XICS* in QEMU.
+
+- *XIVE native exploitation mode*
+
+  the hypervisor provides new interfaces to manage the XIVE control
+  structures, and provides direct control for interrupt management
+  through MMIO pages.
+
+Which interrupt modes can be used by the machine is negotiated with
+the guest O/S during the Client Architecture Support negotiation
+sequence. The two modes are mutually exclusive.
+
+Both interrupt mode share the same IRQ number space. See below for the
+layout.
+
+CAS Negotiation
+---------------
+
+QEMU advertises the supported interrupt modes in the device tree
+property ``ibm,arch-vec-5-platform-support`` in byte 23 and the OS
+Selection for XIVE is indicated in the ``ibm,architecture-vec-5``
+property byte 23.
+
+The interrupt modes supported by the machine depend on the CPU type
+(POWER9 is required for XIVE) but also on the machine property
+``ic-mode`` which can be set on the command line. It can take the
+following values: ``xics``, ``xive``, and ``dual`` which is the
+default mode. ``dual`` means that both modes XICS **and** XIVE are
+supported and if the guest OS supports XIVE, this mode will be
+selected.
+
+The chosen interrupt mode is activated after a reconfiguration done
+in a machine reset.
+
+KVM negotiation
+---------------
+
+When the guest starts under KVM, the capabilities of the host kernel
+and QEMU are also negotiated. Depending on the version of the host
+kernel, KVM will advertise the XIVE capability to QEMU or not.
+
+Nevertheless, the available interrupt modes in the machine should not
+depend on the XIVE KVM capability of the host. On older kernels
+without XIVE KVM support, QEMU will use the emulated XIVE device as a
+fallback and on newer kernels (>=5.2), the KVM XIVE device.
+
+XIVE native exploitation mode is not supported for KVM nested guests,
+VMs running under a L1 hypervisor (KVM on pSeries). In that case, the
+hypervisor will not advertise the KVM capability and QEMU will use the
+emulated XIVE device, same as for older versions of KVM.
+
+As a final refinement, the user can also switch the use of the KVM
+device with the machine option ``kernel_irqchip``.
+
+
+XIVE support in KVM
+~~~~~~~~~~~~~~~~~~~
+
+For guest OSes supporting XIVE, the resulting interrupt modes on host
+kernels with XIVE KVM support are the following:
+
+==============  =============  =============  ================
+ic-mode                            kernel_irqchip
+--------------  ----------------------------------------------
+/               allowed        off            on
+                (default)
+==============  =============  =============  ================
+dual (default)  XIVE KVM       XIVE emul.     XIVE KVM
+xive            XIVE KVM       XIVE emul.     XIVE KVM
+xics            XICS KVM       XICS emul.     XICS KVM
+==============  =============  =============  ================
+
+For legacy guest OSes without XIVE support, the resulting interrupt
+modes are the following:
+
+==============  =============  =============  ================
+ic-mode                            kernel_irqchip
+--------------  ----------------------------------------------
+/               allowed        off            on
+                (default)
+==============  =============  =============  ================
+dual (default)  XICS KVM       XICS emul.     XICS KVM
+xive            QEMU error(3)  QEMU error(3)  QEMU error(3)
+xics            XICS KVM       XICS emul.     XICS KVM
+==============  =============  =============  ================
+
+(3) QEMU fails at CAS with ``Guest requested unavailable interrupt
+    mode (XICS), either don't set the ic-mode machine property or try
+    ic-mode=xics or ic-mode=dual``
+
+
+No XIVE support in KVM
+~~~~~~~~~~~~~~~~~~~~~~
+
+For guest OSes supporting XIVE, the resulting interrupt modes on host
+kernels without XIVE KVM support are the following:
+
+==============  =============  =============  ================
+ic-mode                            kernel_irqchip
+--------------  ----------------------------------------------
+/               allowed        off            on
+                (default)
+==============  =============  =============  ================
+dual (default)  XIVE emul.(1)  XIVE emul.     QEMU error (2)
+xive            XIVE emul.(1)  XIVE emul.     QEMU error (2)
+xics            XICS KVM       XICS emul.     XICS KVM
+==============  =============  =============  ================
+
+
+(1) QEMU warns with ``warning: kernel_irqchip requested but unavailable:
+    IRQ_XIVE capability must be present for KVM``
+    In some cases (old host kernels or KVM nested guests), one may hit a
+    QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
+(2) QEMU fails with ``kernel_irqchip requested but unavailable:
+    IRQ_XIVE capability must be present for KVM``
+
+
+For legacy guest OSes without XIVE support, the resulting interrupt
+modes are the following:
+
+==============  =============  =============  ================
+ic-mode                            kernel_irqchip
+--------------  ----------------------------------------------
+/               allowed        off            on
+                (default)
+==============  =============  =============  ================
+dual (default)  QEMU error(4)  XICS emul.     QEMU error(4)
+xive            QEMU error(3)  QEMU error(3)  QEMU error(3)
+xics            XICS KVM       XICS emul.     XICS KVM
+==============  =============  =============  ================
+
+(3) QEMU fails at CAS with ``Guest requested unavailable interrupt
+    mode (XICS), either don't set the ic-mode machine property or try
+    ic-mode=xics or ic-mode=dual``
+(4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
+
+
+XIVE Device tree properties
+---------------------------
+
+The properties for the PAPR interrupt controller node when the *XIVE
+native exploitation mode* is selected should contain:
+
+- ``device_type``
+
+  value should be "power-ivpe".
+
+- ``compatible``
+
+  value should be "ibm,power-ivpe".
+
+- ``reg``
+
+  contains the base address and size of the thread interrupt
+  managnement areas (TIMA), for the User level and for the Guest OS
+  level. Only the Guest OS level is taken into account today.
+
+- ``ibm,xive-eq-sizes``
+
+  the size of the event queues. One cell per size supported, contains
+  log2 of size, in ascending order.
+
+- ``ibm,xive-lisn-ranges``
+
+  the IRQ interrupt number ranges assigned to the guest for the IPIs.
+
+The root node also exports :
+
+- ``ibm,plat-res-int-priorities``
+
+  contains a list of priorities that the hypervisor has reserved for
+  its own use.
+
+IRQ number space
+----------------
+
+IRQ Number space of the ``pseries`` machine is 8K wide and is the same
+for both interrupt mode. The different ranges are defined as follow :
+
+- ``0x0000 .. 0x0FFF`` 4K CPU IPIs (only used under XIVE)
+- ``0x1000 .. 0x1000`` 1 EPOW
+- ``0x1001 .. 0x1001`` 1 HOTPLUG
+- ``0x1002 .. 0x10FF`` unused
+- ``0x1100 .. 0x11FF`` 256 VIO devices
+- ``0x1200 .. 0x127F`` 32x4 LSIs for PHB devices
+- ``0x1280 .. 0x12FF`` unused
+- ``0x1300 .. 0x1FFF`` PHB MSIs (dynamically allocated)
+
+Monitoring XIVE
+---------------
+
+The state of the XIVE interrupt controller can be queried through the
+monitor commands ``info pic``. The output comes in two parts.
+
+First, the state of the thread interrupt context registers is dumped
+for each CPU :
+
+::
+
+   (qemu) info pic
+   CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
+   CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
+   CPU[0000]:   OS    00   ff  00    00   ff  00  ff   ff  80000400
+   CPU[0000]: POOL    00   00  00    00   00  00  00   00  00000000
+   CPU[0000]: PHYS    00   00  00    00   00  00  00   ff  00000000
+   ...
+
+In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only
+the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM
+line which is set to the VP identifier.
+
+Then comes the routing information which aggregates the EAS and the
+END configuration:
+
+::
+
+   ...
+   LISN         PQ    EISN     CPU/PRIO EQ
+   00000000 MSI --    00000010   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00000001 MSI --    00000010   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00000002 MSI --    00000010   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
+   00000003 MSI --    00000010   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
+   00000004 MSI -Q  M 00000000
+   00000005 MSI -Q  M 00000000
+   00000006 MSI -Q  M 00000000
+   00000007 MSI -Q  M 00000000
+   00001000 MSI --    00000012   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00001001 MSI --    00000013   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00001100 MSI --    00000100   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00001101 MSI -Q  M 00000000
+   00001200 LSI -Q  M 00000000
+   00001201 LSI -Q  M 00000000
+   00001202 LSI -Q  M 00000000
+   00001203 LSI -Q  M 00000000
+   00001300 MSI --    00000102   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00001301 MSI --    00000103   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
+   00001302 MSI --    00000104   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
+
+The source information and configuration:
+
+- The ``LISN`` column outputs the interrupt number of the source in
+  range ``[ 0x0 ... 0x1FFF ]`` and its type : ``MSI`` or ``LSI``
+- The ``PQ`` column reflects the state of the PQ bits of the source :
+
+  - ``--`` source is ready to take events
+  - ``P-`` an event was sent and an EOI is PENDING
+  - ``PQ`` an event was QUEUED
+  - ``-Q`` source is OFF
+
+  a ``M`` indicates that source is *MASKED* at the EAS level,
+
+The targeting configuration :
+
+- The ``EISN`` column is the event data that will be queued in the event
+  queue of the O/S.
+- The ``CPU/PRIO`` column is the tuple defining the CPU number and
+  priority queue serving the source.
+- The ``EQ`` column outputs :
+
+  - the current index of the event queue/ the max number of entries
+  - the O/S event queue address
+  - the toggle bit
+  - the last entries that were pushed in the event queue.
diff --git a/docs/specs/ppc-xive.rst b/docs/specs/ppc-xive.rst
new file mode 100644
index 000000000..83d43f658
--- /dev/null
+++ b/docs/specs/ppc-xive.rst
@@ -0,0 +1,200 @@
+================================
+POWER9 XIVE interrupt controller
+================================
+
+The POWER9 processor comes with a new interrupt controller
+architecture, called XIVE as "eXternal Interrupt Virtualization
+Engine".
+
+Compared to the previous architecture, the main characteristics of
+XIVE are to support a larger number of interrupt sources and to
+deliver interrupts directly to virtual processors without hypervisor
+assistance. This removes the context switches required for the
+delivery process.
+
+
+XIVE architecture
+=================
+
+The XIVE IC is composed of three sub-engines, each taking care of a
+processing layer of external interrupts:
+
+- Interrupt Virtualization Source Engine (IVSE), or Source Controller
+  (SC). These are found in PCI PHBs, in the Processor Service
+  Interface (PSI) host bridge Controller, but also inside the main
+  controller for the core IPIs and other sub-chips (NX, CAP, NPU) of
+  the chip/processor. They are configured to feed the IVRE with
+  events.
+- Interrupt Virtualization Routing Engine (IVRE) or Virtualization
+  Controller (VC). It handles event coalescing and perform interrupt
+  routing by matching an event source number with an Event
+  Notification Descriptor (END).
+- Interrupt Virtualization Presentation Engine (IVPE) or Presentation
+  Controller (PC). It maintains the interrupt context state of each
+  thread and handles the delivery of the external interrupt to the
+  thread.
+
+::
+
+                XIVE Interrupt Controller
+                +------------------------------------+      IPIs
+                | +---------+ +---------+ +--------+ |    +-------+
+                | |IVRE     | |Common Q | |IVPE    |----> | CORES |
+                | |     esb | |         | |        |----> |       |
+                | |     eas | |  Bridge | |   tctx |----> |       |
+                | |SC   end | |         | |    nvt | |    |       |
+    +------+    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
+    | RAM  |    +------------------|-----------------+      | | |
+    |      |                       |                        | | |
+    |      |                       |                        | | |
+    |      |  +--------------------v------------------------v-v-v--+    other
+    |      <--+                     Power Bus                      +--> chips
+    |  esb |  +---------+-----------------------+------------------+
+    |  eas |            |                       |
+    |  end |         +--|------+                |
+    |  nvt |       +----+----+ |           +----+----+
+    +------+       |IVSE     | |           |IVSE     |
+                   |         | |           |         |
+                   | PQ-bits | |           | PQ-bits |
+                   | local   |-+           |  in VC  |
+                   +---------+             +---------+
+                      PCIe                 NX,NPU,CAPI
+
+
+    PQ-bits: 2 bits source state machine (P:pending Q:queued)
+    esb: Event State Buffer (Array of PQ bits in an IVSE)
+    eas: Event Assignment Structure
+    end: Event Notification Descriptor
+    nvt: Notification Virtual Target
+    tctx: Thread interrupt Context registers
+
+
+
+XIVE internal tables
+--------------------
+
+Each of the sub-engines uses a set of tables to redirect interrupts
+from event sources to CPU threads.
+
+::
+
+                                            +-------+
+    User or O/S                             |  EQ   |
+        or                          +------>|entries|
+    Hypervisor                      |       |  ..   |
+      Memory                        |       +-------+
+                                    |           ^
+                                    |           |
+               +-------------------------------------------------+
+                                    |           |
+    Hypervisor      +------+    +---+--+    +---+--+   +------+
+      Memory        | ESB  |    | EAT  |    | ENDT |   | NVTT |
+     (skiboot)      +----+-+    +----+-+    +----+-+   +------+
+                      ^  |        ^  |        ^  |       ^
+                      |  |        |  |        |  |       |
+               +-------------------------------------------------+
+                      |  |        |  |        |  |       |
+                      |  |        |  |        |  |       |
+                 +----|--|--------|--|--------|--|-+   +-|-----+    +------+
+                 |    |  |        |  |        |  | |   | | tctx|    |Thread|
+     IPI or   ---+    +  v        +  v        +  v |---| +  .. |----->     |
+    HW events    |                                 |   |       |    |      |
+                 |             IVRE                |   | IVPE  |    +------+
+                 +---------------------------------+   +-------+
+
+
+The IVSE have a 2-bits state machine, P for pending and Q for queued,
+for each source that allows events to be triggered. They are stored in
+an Event State Buffer (ESB) array and can be controlled by MMIOs.
+
+If the event is let through, the IVRE looks up in the Event Assignment
+Structure (EAS) table for an Event Notification Descriptor (END)
+configured for the source. Each Event Notification Descriptor defines
+a notification path to a CPU and an in-memory Event Queue, in which
+will be enqueued an EQ data for the O/S to pull.
+
+The IVPE determines if a Notification Virtual Target (NVT) can handle
+the event by scanning the thread contexts of the VCPUs dispatched on
+the processor HW threads. It maintains the interrupt context state of
+each thread in a NVT table.
+
+XIVE thread interrupt context
+-----------------------------
+
+The XIVE presenter can generate four different exceptions to its
+HW threads:
+
+- hypervisor exception
+- O/S exception
+- Event-Based Branch (user level)
+- msgsnd (doorbell)
+
+Each exception has a state independent from the others called a Thread
+Interrupt Management context. This context is a set of registers which
+lets the thread handle priority management and interrupt
+acknowledgment among other things. The most important ones being :
+
+- Interrupt Priority Register  (PIPR)
+- Interrupt Pending Buffer     (IPB)
+- Current Processor Priority   (CPPR)
+- Notification Source Register (NSR)
+
+TIMA
+~~~~
+
+The Thread Interrupt Management registers are accessible through a
+specific MMIO region, called the Thread Interrupt Management Area
+(TIMA), four aligned pages, each exposing a different view of the
+registers. First page (page address ending in ``0b00``) gives access
+to the entire context and is reserved for the ring 0 view for the
+physical thread context. The second (page address ending in ``0b01``)
+is for the hypervisor, ring 1 view. The third (page address ending in
+``0b10``) is for the operating system, ring 2 view. The fourth (page
+address ending in ``0b11``) is for user level, ring 3 view.
+
+Interrupt flow from an O/S perspective
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After an event data has been enqueued in the O/S Event Queue, the IVPE
+raises the bit corresponding to the priority of the pending interrupt
+in the register IBP (Interrupt Pending Buffer) to indicate that an
+event is pending in one of the 8 priority queues. The Pending
+Interrupt Priority Register (PIPR) is also updated using the IPB. This
+register represent the priority of the most favored pending
+notification.
+
+The PIPR is then compared to the Current Processor Priority
+Register (CPPR). If it is more favored (numerically less than), the
+CPU interrupt line is raised and the EO bit of the Notification Source
+Register (NSR) is updated to notify the presence of an exception for
+the O/S. The O/S acknowledges the interrupt with a special load in the
+Thread Interrupt Management Area.
+
+The O/S handles the interrupt and when done, performs an EOI using a
+MMIO operation on the ESB management page of the associate source.
+
+Overview of the QEMU models for XIVE
+====================================
+
+The XiveSource models the IVSE in general, internal and external. It
+handles the source ESBs and the MMIO interface to control them.
+
+The XiveNotifier is a small helper interface interconnecting the
+XiveSource to the XiveRouter.
+
+The XiveRouter is an abstract model acting as a combined IVRE and
+IVPE. It routes event notifications using the EAS and END tables to
+the IVPE sub-engine which does a CAM scan to find a CPU to deliver the
+exception. Storage should be provided by the inheriting classes.
+
+XiveEnDSource is a special source object. It exposes the END ESB MMIOs
+of the Event Queues which are used for coalescing event notifications
+and for escalation. Not used on the field, only to sync the EQ cache
+in OPAL.
+
+Finally, the XiveTCTX contains the interrupt state context of a thread,
+four sets of registers, one for each exception that can be delivered
+to a CPU. These contexts are scanned by the IVPE to find a matching VP
+when a notification is triggered. It also models the Thread Interrupt
+Management Area (TIMA), which exposes the thread context registers to
+the CPU for interrupt management.
diff --git a/docs/specs/pvpanic.txt b/docs/specs/pvpanic.txt
new file mode 100644
index 000000000..8afcde11c
--- /dev/null
+++ b/docs/specs/pvpanic.txt
@@ -0,0 +1,54 @@
+PVPANIC DEVICE
+==============
+
+pvpanic device is a simulated device, through which a guest panic
+event is sent to qemu, and a QMP event is generated. This allows
+management apps (e.g. libvirt) to be notified and respond to the event.
+
+The management app has the option of waiting for GUEST_PANICKED events,
+and/or polling for guest-panicked RunState, to learn when the pvpanic
+device has fired a panic event.
+
+The pvpanic device can be implemented as an ISA device (using IOPORT) or as a
+PCI device.
+
+ISA Interface
+-------------
+
+pvpanic exposes a single I/O port, by default 0x505. On read, the bits
+recognized by the device are set. Software should ignore bits it doesn't
+recognize. On write, the bits not recognized by the device are ignored.
+Software should set only bits both itself and the device recognize.
+
+Bit Definition
+--------------
+bit 0: a guest panic has happened and should be processed by the host
+bit 1: a guest panic has happened and will be handled by the guest;
+       the host should record it or report it, but should not affect
+       the execution of the guest.
+
+PCI Interface
+-------------
+
+The PCI interface is similar to the ISA interface except that it uses an MMIO
+address space provided by its BAR0, 1 byte long. Any machine with a PCI bus
+can enable a pvpanic device by adding '-device pvpanic-pci' to the command
+line.
+
+ACPI Interface
+--------------
+
+pvpanic device is defined with ACPI ID "QEMU0001". Custom methods:
+
+RDPT:       To determine whether guest panic notification is supported.
+Arguments:  None
+Return:     Returns a byte, with the same semantics as the I/O port
+            interface.
+
+WRPT:       To send a guest panic event
+Arguments:  Arg0 is a byte to be written, with the same semantics as
+            the I/O interface.
+Return:     None
+
+The ACPI device will automatically refer to the right port in case it
+is modified.
diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt
new file mode 100644
index 000000000..1857b3170
--- /dev/null
+++ b/docs/specs/rocker.txt
@@ -0,0 +1,1014 @@
+Rocker Network Switch Register Programming Guide
+Copyright (c) Scott Feldman <sfeldma@gmail.com>
+Copyright (c) Neil Horman <nhorman@tuxdriver.com>
+Version 0.11, 12/29/2014
+
+LICENSE
+=======
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+SECTION 1: Introduction
+=======================
+
+Overview
+--------
+
+This document describes the hardware/software interface for the Rocker switch
+device.  The intended audience is authors of OS drivers and device emulation
+software.
+
+Notations and Conventions
+-------------------------
+
+o In register descriptions, [n:m] indicates a range from bit n to bit m,
+inclusive.
+o Use of leading 0x indicates a hexadecimal number.
+o Use of leading 0b indicates a binary number.
+o The use of RSVD or Reserved indicates that a bit or field is reserved for
+future use.
+o Field width is in bytes, unless otherwise noted.
+o Register are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
+on read
+o TLV values in network-byte-order are designated with (N).
+
+
+SECTION 2: PCI Configuration Registers
+======================================
+
+PCI Configuration Space
+-----------------------
+
+Each switch instance registers as a PCI device with PCI configuration space:
+
+	offset	width	description		value
+	---------------------------------------------
+	0x0	2	Vendor ID		0x1b36
+	0x2	2	Device ID		0x0006
+	0x4	4	Command/Status
+	0x8	1	Revision ID		0x01
+	0x9	3	Class code		0x2800
+	0xC	1	Cache line size
+	0xD	1	Latency timer
+	0xE	1	Header type
+	0xF	1	Built-in self test
+	0x10	4	Base address low
+	0x14	4	Base address high
+	0x18-28		Reserved
+	0x2C	2	Subsystem vendor ID	*
+	0x2E	2	Subsystem ID		*
+	0x30-38		Reserved
+	0x3C	1	Interrupt line
+	0x3D	1	Interrupt pin		0x00
+	0x3E	1	Min grant		0x00
+	0x3D	1	Max latency		0x00
+	0x40	1	TRDY timeout
+	0x41	1	Retry count
+	0x42	2	Reserved
+
+
+* Assigned by sub-system implementation
+
+SECTION 3: Memory-Mapped Register Space
+=======================================
+
+There are two memory-mapped BARs.  BAR0 maps device register space and is
+0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
+size, allowing for 256 MSI-X vectors.
+
+All registers are 4 or 8 bytes long.  It is assumed host software will access 4
+byte registers with one 4-byte access, and 8 byte registers with either two
+4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses,
+access must be lower and then upper 4-bytes, in that order.
+
+BAR0 device register space is organized as follows:
+
+	offset		description
+	------------------------------------------------------
+	0x0000-0x000f	Bogus registers to catch misbehaving
+			drivers.  Writes do nothing.  Reads
+			back as 0xDEADBABE.
+	0x0010-0x00ff	Test registers
+	0x0300-0x03ff	General purpose registers
+	0x1000-0x1fff	Descriptor control
+
+Holes in register space are reserved.  Writes to reserved registers do nothing.
+Reads to reserved registers read back as 0.
+
+No fancy stuff like write-combining is enabled on any of the registers.
+
+BAR1 MSI-X register space is organized as follows:
+
+	offset		description
+	------------------------------------------------------
+	0x0000-0x0fff	MSI-X vector table (256 vectors total)
+	0x1000-0x1fff	MSI-X PBA table
+
+
+SECTION 4: Interrupts, DMA, and Endianness
+==========================================
+
+PCI Interrupts
+--------------
+
+The device supports only MSI-X interrupts.  BAR1 memory-mapped region contains
+the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors.
+
+The vector assignment is:
+
+	vector		description
+	-----------------------------------------------------
+	0		Command descriptor ring completion
+	1		Event descriptor ring completion
+	2		Test operation completion
+	3		RSVD
+	4-255		Tx and Rx descriptor ring completion
+			  Tx vector is even
+			  Rx vector is odd
+
+A MSI-X vector table entry is 16 bytes:
+
+	field		offset	width	description
+	-------------------------------------------------------------
+	lower_addr	0x0	4	[31:2] message address[31:2]
+					[1:0] Rsvd (4 byte alignment
+						    required)
+	upper_addr	0x4	4	[31:19] Rsvd
+					[14:0] message address[46:32]
+	data		0x8	4	message data[31:0]
+	control		0xc	4	[31:1] Rsvd
+					[0] mask (0 = enable,
+						  1 = masked)
+
+Software should install the Interrupt Service Routine (ISR) before any ports
+are enabled or any commands are issued on the command ring.
+
+DMA Operations
+--------------
+
+DMA operations are used for packet DMA to/from the CPU, command and event
+processing.  Command processing includes statistical counters and table dumps,
+table insertion/deletion, and more.  Event processing provides an async
+notification method for device-originating events.  Each DMA operation has a
+set of control registers to manage a descriptor ring.  The descriptor rings are
+allocated from contiguous host DMA-able memory and registers specify the rings
+base address, size and current head and tail indices.  Software always writes
+the head, and hardware always writes the tail.
+
+The higher-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
+of a descriptor.  Software will clear this bit when posting a descriptor to the
+ring, and hardware will set this bit when the descriptor is complete.
+
+Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
+Descriptor rings' base address must be 8-byte aligned.  Descriptors must be
+packed within ring.  Each descriptor in each ring must also be aligned on an 8
+byte boundary.  Each descriptor ring will have these registers:
+
+	DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
+	DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W)
+	DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W)
+	DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R)
+	DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W)
+	DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W)
+	DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W)
+
+Where x is descriptor ring index:
+
+	index		ring
+	--------------------
+	0		CMD
+	1		EVENT
+	2		TX (port 0)
+	3		RX (port 0)
+	4		TX (port 1)
+	5		RX (port 1)
+	.
+	.
+	.
+	124		TX (port 61)
+	125		RX (port 61)
+	126		Resv
+	127		Resv
+
+Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero.  HEAD cannot be
+written past TAIL.  To do so would wrap the ring.  An empty ring is when HEAD
+== TAIL.  A full ring is when HEAD is one position behind TAIL.  Both HEAD and
+TAIL increment and modulo wrap at the ring size.
+
+CTRL register bits:
+
+	bit	name		description
+	------------------------------------------------------------------------
+	[0]	CTRL_RESET	Reset the descriptor ring
+	[1:31]	Reserved
+
+All descriptor types share some common fields:
+
+	field			width	description
+	-------------------------------------------------------------------
+	DMA_DESC_BUF_ADDR	8	Phys addr of desc payload, 8-byte
+					aligned
+	DMA_DESC_COOKIE		8	Desc cookie for completion matching,
+					upper-most bit is reserved
+	DMA_DESC_BUF_SIZE	2	Desc payload size in bytes
+	DMA_DESC_TLV_SIZE	2	Desc payload total size in bytes
+					used for TLVs.  Must be <=
+					DMA_DESC_BUF_SIZE.
+	DMA_DESC_COMP_ERR	2	Completion status of associated
+					desc payload.  High order bit is
+					clear on new descs, toggled by
+					hw for completed items.
+
+To support forward- and backward-compatibility, descriptor and completion
+payloads are specified in TLV format.  Fields are packed with Type=field name,
+Length=field length, and Value=field value.  Software will ignore unknown fields
+filled in by the switch.  Likewise, the switch will ignore unknown fields
+filled in by software.
+
+Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned.  The
+value within a TLV is also 8-byte aligned.  The (packed, 8 byte) TLV header is:
+
+	field	width	description
+	-----------------------------
+	type	4	TLV type
+	len	2	TLV value length
+	pad	2	Reserved
+
+The alignment requirements for descriptors and TLVs are to avoid unaligned
+access exceptions in software.  Note that the payload for each TLV is also
+8 byte aligned.
+
+Figure 1 shows an example descriptor buffer with two TLVs.
+
+                  <------- 8 bytes ------->
+
+  8-byte  +––––+  +–––––––––––+–––––+–––––+                     +–+
+  align           |   type    | len | pad |    TLV#1 hdr          |
+                  +–––––––––––+–––––+–––––+    (len=22)           |
+                  |                       |                       |
+                  |  value                |    TVL#1 value        |
+                  |                       |    (padded to 8-byte  |
+                  |                 +–––––+     alignment)        |
+                  |                 |/////|                       |
+   8-byte +––––+  +–––––––––––+–––––––––––+                       |
+   align          |   type    | len | pad |    TLV#2 hdr    DESC_BUF_SIZE
+                  +–––––+–––––+–––––+–––––+    (len=2)            |
+                  |value|/////////////////|    TLV#2 value        |
+                  +–––––+/////////////////|                       |
+                  |///////////////////////|                       |
+                  |///////////////////////|                       |
+                  |///////////////////////|                       |
+                  |////////unused/////////|                       |
+                  |////////space//////////|                       |
+                  |///////////////////////|                       |
+                  |///////////////////////|                       |
+                  |///////////////////////|                       |
+                  +–––––––––––––––––––––––+                     +–+
+
+				fig. 1
+
+TLVs can be nested within the NEST TLV type.
+
+Interrupt credits
+^^^^^^^^^^^^^^^^^
+
+MSI-X vectors used for descriptor ring completions use a credit mechanism for
+efficient device, PCIe bus, OS and driver operations.  Each descriptor ring has
+a credit count which represents the number of outstanding descriptors to be
+processed by the driver.  As the device marks descriptors complete, the credit
+count is incremented.  As the driver processes those outstanding descriptors,
+it returns credits back to the device.  This way, the device knows the driver's
+progress and can make decisions about when to fire the next interrupt or not.
+When the credit count is zero, and the first descriptors are posted for the
+driver, a single interrupt is fired.  Once the interrupt is fired, the
+interrupt is disabled (auto-masked*).  In response to the interrupt, the driver
+will process descriptors and PIO write a returned credit value for that
+descriptor ring.  If the driver returns all credits (the driver caught up with
+the device and there is no outstanding work), then the interrupt is unmasked,
+but not fired.  If only partial credits are returned, the interrupt remains
+masked but the device generates an interrupt, signaling the driver that more
+outstanding work is available.
+
+(* this masking is unrelated to the MSI-X interrupt mask register)
+
+Endianness
+----------
+
+Device registers are hard-coded to little-endian (LE).  The driver should
+convert to/from host endianness to LE for device register accesses.
+
+Descriptors are LE.  Descriptor buffer TLVs will have LE type and length
+fields, but the value field can either be LE or network-byte-order, depending
+on context.  TLV values containing network packet data will be in network-byte
+order.  A TLV value containing a field or mask used to compare against network
+packet data is network-byte order.  For example, flow match fields (and masks)
+are network-byte-order since they're matched directly, byte-by-byte, against
+network packet data.  All non-network-packet TLV multi-byte values will be LE.
+
+TLV values in network-byte-order are designated with (N).
+
+
+SECTION 5: Test Registers
+=========================
+
+Rocker has several test registers to support troubleshooting register access,
+interrupt generation, and DMA operations:
+
+	TEST_REG, offset 0x0010, 32-bit (R/W)
+	TEST_REG64, offset 0x0018, 64-bit (R/W)
+	TEST_IRQ, offset 0x0020, 32-bit (R/W)
+	TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
+	TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
+	TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
+
+Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last
+value written to the register.  The 32-bit and 64-bit versions are for testing
+32-bit and 64-bit host accesses.
+
+A vector can be written to TEST_IRQ and the device will generate an interrupt
+for that vector.
+
+To test basic DMA operations, allocate a DMA-able host buffer and put the
+buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE.  Then, write to
+TEST_DMA_CTRL to manipulate the buffer contents.  TEST_DMA_CTRL operations are:
+
+	operation		value	description
+	-----------------------------------------------------------
+	TEST_DMA_CTRL_CLEAR	1	clear buffer
+	TEST_DMA_CTRL_FILL	2	fill buffer bytes with 0x96
+	TEST_DMA_CTRL_INVERT	4	invert bytes in buffer
+
+Various buffer address and sizes should be tested to verify no address boundary
+issue exists.  In particular, buffers that start on odd-8-byte boundary and/or
+span multiple PAGE sizes should be tested.
+
+
+SECTION 6: Ports
+================
+
+Physical and Logical Ports
+------------------------------------
+
+The switch supports up to 62 physical (front-panel) ports.  Register
+PORT_PHYS_COUNT returns the actual number of physical ports available:
+
+	PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)
+
+In addition to front-panel ports, the switch supports logical ports for
+tunnels.
+
+Front-panel ports and logical tunnel ports are mapped into a single 32-bit port
+space.  A special CPU port is assigned port 0.  The front-panel ports are
+mapped to ports 1-62.  A special loopback port is assigned port 63.  Logical
+tunnel ports are assigned ports 0x0001000-0x0001ffff.
+To summarize the port assignments:
+
+	port			mapping
+	-------------------------------------------------------
+	0			CPU port (for packets to/from host CPU)
+	1-62			front-panel physical ports
+	63			loopback port
+	64-0x0000ffff		RSVD
+	0x00010000-0x0001ffff	logical tunnel ports
+	0x00020000-0xffffffff	RSVD
+
+Physical Port Mode
+------------------
+
+Switch front-panel ports operate in a mode.  Currently, the only mode is
+OF-DPA.  OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA)
+Abstract Switch Specification, Version 1.0, from Broadcom Corporation.  To
+set/get the mode for front-panel ports, see port settings, below.
+
+Port Settings
+-------------
+
+Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS:
+
+	PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)
+
+	Value is port bitmap.  Bits 0 and 63 always read 0.  Bits 1-62
+	read 1 for link UP and 0 for link DOWN for respective front-panel ports.
+
+Other properties for front-panel ports are available via DMA CMD descriptors:
+
+	Get PORT_SETTINGS descriptor:
+
+		field		width	description
+		----------------------------------------------
+		PORT_SETTINGS	2	CMD_GET
+		PPORT		4	Physical port #
+
+	Get PORT_SETTINGS completion:
+
+		field		width	description
+		----------------------------------------------
+		PPORT		4	Physical port #
+		SPEED		4	Current port interface speed, in Mbps
+		DUPLEX		1	1 = Full, 0 = Half
+		AUTONEG		1	1 = enabled, 0 = disabled
+		MACADDR		6	Port MAC address
+		MODE		1	0 = OF-DPA
+		LEARNING	1	MAC address learning on port
+						1 = enabled
+						0 = disabled
+		PHYS_NAME	<var>	Physical port name (string)
+
+	Set PORT_SETTINGS descriptor:
+
+		field		width	description
+		----------------------------------------------
+		PORT_SETTINGS	2	CMD_SET
+		PPORT		4	Physical port #
+		SPEED		4	Port interface speed, in Mbps
+		DUPLEX		1	1 = Full, 0 = Half
+		AUTONEG		1	1 = enabled, 0 = disabled
+		MACADDR		6	Port MAC address
+		MODE		1	0 = OF-DPA
+
+Port Enable
+-----------
+
+Front-panel ports are initially disabled, which means port ingress and egress
+packets will be dropped.  To enable or disable a port, use PORT_PHYS_ENABLE:
+
+	PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)
+
+	Value is bitmap of first 64 ports.  Bits 0 and 63 are ignored
+	and always read as 0.  Write 1 to enable port; write 0 to disable it.
+	Default is 0.
+
+
+SECTION 7: Switch Control
+=========================
+
+This section covers switch-wide register settings.
+
+Control
+-------
+
+This register is used for low level control of the switch.
+
+	CONTROL: offset 0x0300, 32-bit, (W)
+
+	bit	name		description
+	------------------------------------------------------------------------
+	[0]	CONTROL_RESET	If set, device will perform reset
+	[1:31]	Reserved
+
+Switch ID
+---------
+
+The switch has a SWITCH_ID to be used by software to uniquely identify the
+switch:
+
+	SWITCH_ID: offset 0x0320, 64-bit, (R)
+
+	Value is opaque to switch software and no special encoding is implied.
+
+
+SECTION 8: Events
+=================
+
+Non-I/O asynchronous events from the device are notified to the host using the
+event ring.  The TLV structure for events is:
+
+	field		width	description
+	---------------------------------------------------
+	TYPE		4	Event type, one of:
+					1: LINK_CHANGED
+					2: MAC_VLAN_SEEN
+	INFO		<nest>	Event info (details below)
+
+Link Changed Event
+------------------
+
+When link status changes on a physical port, this event is generated.
+
+	field		width	description
+	---------------------------------------------------
+	INFO		<nest>
+	  PPORT		4	Physical port
+	  LINKUP	1	Link status:
+					0: down
+					1: up
+
+MAC VLAN Seen Event
+-------------------
+
+When a packet ingresses on a port and the source MAC/VLAN isn't known to the
+device, the device will generate this event.  In response to the event, the
+driver should install to the device the MAC/VLAN on the port into the bridge
+table.  Once installed, the MAC/VLAN is known on the port and this event will
+no longer be generated.
+
+	field		width	description
+	---------------------------------------------------
+	INFO		<nest>
+	  PPORT		4	Physical port
+	  MAC		6	MAC address
+	  VLAN		2	VLAN ID
+
+
+SECTION 9: CPU Packet Processing
+================================
+
+Ingress packets directed to the host CPU for further processing are delivered
+in the DMA RX ring.  Likewise, host CPU originating packets destined to egress
+on switch ports are scheduled by software using the DMA TX ring.
+
+Tx Packet Processing
+--------------------
+
+Software schedules packets for egress on switch ports using the DMA TX ring.  A
+TX descriptor buffer describes the packet location and size in host DMA-able
+memory, the destination port, and any hardware-offload functions (such as L3
+payload checksum offload).  Software then bumps the descriptor head to signal
+hardware of new Tx work.  In response, hardware will DMA read Tx descriptors up
+to head, DMA read descriptor buffer and packet data, perform offloading
+functions, and finally frame packet on wire (network).  Once packet processing
+is complete, hardware will writeback status to descriptor(s) to signal to
+software that Tx is complete and software resources (e.g. skb) backing packet
+can be released.
+
+Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A
+TLV is used for each packet fragment.
+
+	                                           pkt frag 1
+	                                           +–––––––+  +–+
+	                                       +–––+       |    |
+	                         desc buf      |   |       |    |
+	                        +––––––––+     |   |       |    |
+	        Tx ring     +–––+        +–––––+   |       |    |
+	      +–––––––––+   |   |  TLVs  |         +–––––––+    |
+	      |         +–––+   +––––––––+         pkt frag 2   |
+	      | desc 0  |       |        +–––––+   +–––––––+    |
+	      +–––––––––+       |  TLVs  |     +–––+       |    |
+	head+–+         |       +––––––––+         |       |    |
+	      | desc 1  |       |        +–––––+   +–––––––+    |pkt
+	      +–––––––––+       |  TLVs  |     |                |
+	      |         |       +––––––––+     |   pkt frag 3   |
+	      |         |                      |   +–––––––+    |
+	      +–––––––––+                      +–––+       |    |
+	      |         |                          |       |    |
+	      |         |                          |       |    |
+	      +–––––––––+                          |       |    |
+	      |         |                          |       |    |
+	      |         |                          |       |    |
+	      +–––––––––+                          |       |    |
+	      |         |                          +–––––––+  +–+
+	      |         |
+	      +–––––––––+
+
+				fig 2.
+
+The TLVs for Tx descriptor buffer are:
+
+	field			width	description
+	---------------------------------------------------------------------
+	PPORT			4	Destination physical port #
+	TX_OFFLOAD		1	Hardware offload modes:
+					  0: no offload
+					  1: insert IP csum (ipv4 only)
+					  2: insert TCP/UDP csum
+					  3: L3 csum calc and insert
+                        	             into csum offset (TX_L3_CSUM_OFF)
+                 	                    16-bit 1's complement csum value.
+                                	     IPv4 pseudo-header and IP
+                        	             already calculated by OS
+                  	                   and inserted.
+					  4: TSO (TCP Segmentation Offload)
+	TX_L3_CSUM_OFF		2	For L3 csum offload mode, the offset,
+					from the beginning of the packet,
+					of the csum field in the L3 header
+	TX_TSO_MSS		2	For TSO offload mode, the
+					Maximum Segment Size in bytes
+        TX_TSO_HDR_LEN		2	For TSO offload mode, the
+					length of ethernet, IP, and
+					TCP/UDP headers, including IP
+					and TCP options.
+	TX_FRAGS		<array>	Packet fragments
+	  TX_FRAG		<nest>	Packet fragment
+	    TX_FRAG_ADDR	8	DMA address of packet fragment
+	    TX_FRAG_LEN		2	Packet fragment length
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	reason
+	--------------------------------------------------------------------
+	0		OK
+	-ROCKER_ENXIO	address or data read err on desc buf or packet
+			fragment
+	-ROCKER_EINVAL	bad pport or TSO or csum offloading error
+	-ROCKER_ENOMEM	no memory for internal staging tx fragment
+
+Rx Packet Processing
+--------------------
+
+For packets ingressing on switch ports that are not forwarded by the switch but
+rather directed to the host CPU for further processing are delivered in the DMA
+RX ring.  Rx descriptor buffers are allocated by software and placed on the
+ring.  Hardware will fill Rx descriptor buffers with packet data, write the
+completion, and signal to software that a new packet is ready.  Since Rx packet
+size is not known a-priori, the Rx descriptor buffer must be allocated for
+worst-case packet size.  A single Rx descriptor will contain the entire Rx
+packet data in one RX_FRAG.  Other Rx TLVs describe and hardware offloads
+performed on the packet, such as checksum validation.
+
+The TLVs for Rx descriptor buffer are:
+
+	field		width	description
+	---------------------------------------------------
+	PPORT		4	Source physical port #
+	RX_FLAGS	2	Packet parsing flags:
+				  (1 << 0): IPv4 packet
+				  (1 << 1): IPv6 packet
+				  (1 << 2): csum calculated
+				  (1 << 3): IPv4 csum good
+				  (1 << 4): IP fragment
+				  (1 << 5): TCP packet
+				  (1 << 6): UDP packet
+				  (1 << 7): TCP/UDP csum good
+				  (1 << 8): Offload forward
+	RX_CSUM		2	IP calculated checksum:
+				  IPv4: IP payload csum
+				  IPv6: header and payload csum
+				(Only valid is RX_FLAGS:csum calc is set)
+	RX_FRAG_ADDR	8	DMA address of packet fragment
+	RX_FRAG_MAX_LEN	2	Packet maximum fragment length
+	RX_FRAG_LEN	2	Actual packet fragment length after receive
+
+Offload forward RX_FLAG indicates the device has already forwarded the packet
+so the host CPU should not also forward the packet.
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	reason
+	--------------------------------------------------------------------
+	0		OK
+	-ROCKER_ENXIO	address or data read err on desc buf
+	-ROCKER_ENOMEM	no memory for internal staging desc buf
+	-ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain
+			packet data TLV and other TLVs.
+
+
+SECTION 10: OF-DPA Mode
+======================
+
+OF-DPA mode allows the switch to offload flow packet processing functions to
+hardware.  An OpenFlow controller would communicate with an OpenFlow agent
+installed on the switch.  The OpenFlow agent would (directly or indirectly)
+communicate with the Rocker switch driver, which in turn would program switch
+hardware with flow functionality, as defined in OF-DPA.  The block diagram is:
+
+		+–––––––––––––––----–––+
+		|        OF            |
+		|  Remote Controller   |
+		+––––––––+––----–––––––+
+		         |
+		         |
+		+––––––––+–––––––––+
+		|       OF         |
+		|   Local Agent    |
+		+––––––––––––––––––+
+		|                  |
+		|   Rocker Driver  |
+		+––––––––––––––––––+
+		    <this spec>
+		+––––––––––––––––––+
+		|                  |
+		|   Rocker Switch  |
+		+––––––––––––––––––+
+
+To participate in flow functions, ports must be configure for OF-DPA mode
+during switch initialization.
+
+OF-DPA Flow Table Interface
+---------------------------
+
+There are commands to add, modify, delete, and get stats of flow table entries.
+The commands are issued using the DMA CMD descriptor ring.  The following
+commands are defined:
+
+	CMD_ADD:		add an entry to flow table
+	CMD_MOD:		modify an entry in flow table
+	CMD_DEL:		delete an entry from flow table
+	CMD_GET_STATS:		get stats for flow entry
+
+TLVs for add and modify commands are:
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_CMD		2	CMD_[ADD|MOD]
+	OF_DPA_TBL		2	Flow table ID
+					  0: ingress port
+					  10: vlan
+					  20: termination mac
+					  30: unicast routing
+					  40: multicast routing
+					  50: bridging
+					  60: ACL policy
+	OF_DPA_PRIORITY		4	Flow priority
+	OF_DPA_HARDTIME		4	Hard timeout for flow
+	OF_DPA_IDLETIME		4	Idle timeout for flow
+	OF_DPA_COOKIE		8	Cookie
+
+Additional TLVs based on flow table ID:
+
+Table ID 0: ingress port
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+
+Table ID 10: vlan
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_NEW_VLAN_ID	2 (N)	new vlan ID
+
+Table ID 20: termination mac
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_IN_PPORT_MASK	4	ingress physical port number mask
+	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
+	OF_DPA_DST_MAC		6 (N)	destination MAC
+	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
+	OF_DPA_GOTO_TBL		2	only acceptable values are
+					unicast or multicast routing
+					table IDs
+	OF_DPA_OUT_PPORT	2	if specified, must be
+					controller, set zero otherwise
+
+Table ID 30: unicast routing
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
+	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
+					Must be unicast address
+	OF_DPA_DST_IP_MASK	4 (N)	IP mask.  Must be prefix mask
+	OF_DPA_DST_IPV6		16 (N)	destination IPv6 address.
+					Must be unicast address
+	OF_DPA_DST_IPV6_MASK	16 (N)	IPv6 mask. Must be prefix mask
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_GROUP_ID		4	data for GROUP action must
+					be an L3 Unicast group entry
+
+Table ID 40: multicast routing
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_SRC_IP		4 (N)	source IPv4. Optional,
+					can contain IPv4 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IP_MASK	4 (N)	IP Mask
+	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
+					Must be multicast address
+	OF_DPA_SRC_IPV6		16 (N)	source IPv6 Address. Optional.
+					Can contain IPv6 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IPV6_MASK	16 (N)	IPv6 mask.
+	OF_DPA_DST_IPV6		16 (N)	destination IPv6 Address. Must
+					be multicast address
+					Must be multicast address
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_GROUP_ID		4	data for GROUP action must
+					be an L3 multicast group entry
+
+Table ID 50: bridging
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_TUNNEL_ID	4	tunnel ID
+	OF_DPA_DST_MAC		6 (N)	destination MAC
+	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
+	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
+	OF_DPA_GROUP_ID		4	data for GROUP action must
+					be a L2 Interface, L2
+					Multicast, L2 Flood,
+					or L2 Overlay group entry
+					as appropriate
+	OF_DPA_TUNNEL_LPORT	4	unicast Tenant Bridging
+					flows specify a tunnel
+					logical port ID
+	OF_DPA_OUT_PPORT	2	data for OUTPUT action,
+					restricted to CONTROLLER,
+					set to 0 otherwise
+
+Table ID 60: acl policy
+
+	field			width	description
+	----------------------------------------------------
+	OF_DPA_IN_PPORT		4	ingress physical port number
+	OF_DPA_IN_PPORT_MASK	4	ingress physical port number mask
+	OF_DPA_ETHERTYPE	2 (N)	ethertype
+	OF_DPA_VLAN_ID		2 (N)	vlan ID
+	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
+	OF_DPA_VLAN_PCP		2 (N)	vlan Priority Code Point
+	OF_DPA_VLAN_PCP_MASK	2 (N)	vlan Priority Code Point mask
+	OF_DPA_SRC_MAC		6 (N)	source MAC
+	OF_DPA_SRC_MAC_MASK	6 (N)	source MAC mask
+	OF_DPA_DST_MAC		6 (N)	destination MAC
+	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
+	OF_DPA_TUNNEL_ID	4	tunnel ID
+	OF_DPA_SRC_IP		4 (N)	source IPv4. Optional,
+					can contain IPv4 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IP_MASK	4 (N)	IP Mask
+	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
+					Must be multicast address
+	OF_DPA_DST_IP_MASK	4 (N)	IP Mask
+	OF_DPA_SRC_IPV6		16 (N)	source IPv6 Address. Optional.
+					Can contain IPv6 address,
+					must be completely masked
+					if not used
+	OF_DPA_SRC_IPV6_MASK	16 (N)	IPv6 mask
+	OF_DPA_DST_IPV6		16 (N)	destination IPv6 Address. Must
+					be multicast address.
+	OF_DPA_DST_IPV6_MASK	16 (N)	IPv6 mask
+	OF_DPA_SRC_ARP_IP	4 (N)	source IPv4 address in the ARP
+					payload.  Only used if ethertype
+					== 0x0806.
+	OF_DPA_SRC_ARP_IP_MASK	4 (N)	IP Mask
+	OF_DPA_IP_PROTO		1	IP protocol
+	OF_DPA_IP_PROTO_MASK	1	IP protocol mask
+	OF_DPA_IP_DSCP		1	DSCP
+	OF_DPA_IP_DSCP_MASK	1	DSCP mask
+	OF_DPA_IP_ECN		1	ECN
+	OF_DPA_IP_ECN_MASK		1	ECN mask
+	OF_DPA_L4_SRC_PORT	2 (N)	L4 source port, only for
+					TCP, UDP, or SCTP
+	OF_DPA_L4_SRC_PORT_MASK	2 (N)	L4 source port mask
+	OF_DPA_L4_DST_PORT	2 (N)	L4 source port, only for
+					TCP, UDP, or SCTP
+	OF_DPA_L4_DST_PORT_MASK	2 (N)	L4 source port mask
+	OF_DPA_ICMP_TYPE	1	ICMP type, only if IP
+					protocol is 1
+	OF_DPA_ICMP_TYPE_MASK	1	ICMP type mask
+	OF_DPA_ICMP_CODE	1	ICMP code
+	OF_DPA_ICMP_CODE_MASK	1	ICMP code mask
+	OF_DPA_IPV6_LABEL	4 (N)	IPv6 flow label
+	OF_DPA_IPV6_LABEL_MASK	4 (N)	IPv6 flow label mask
+	OF_DPA_GROUP_ID		4	data for GROUP action
+	OF_DPA_QUEUE_ID_ACTION	1	write the queue ID
+	OF_DPA_NEW_QUEUE_ID	1	queue ID
+	OF_DPA_VLAN_PCP_ACTION	1	write the VLAN priority
+	OF_DPA_NEW_VLAN_PCP	1	VLAN priority
+	OF_DPA_IP_DSCP_ACTION	1	write the DSCP
+	OF_DPA_NEW_IP_DSCP	1	new DSCP
+	OF_DPA_TUNNEL_LPORT	4	restrct to valid tunnel
+					logical port, set to 0
+					otherwise.
+	OF_DPA_OUT_PPORT	2	data for OUTPUT action,
+					restricted to CONTROLLER,
+					set to 0 otherwise
+	OF_DPA_CLEAR_ACTIONS	4	if 1 packets matching flow are
+					dropped (all other instructions
+					ignored)
+
+TLVs for flow delete and get stats command are:
+
+	field			width	description
+	---------------------------------------------------
+	OF_DPA_CMD		2	CMD_[DEL|GET_STATS]
+	OF_DPA_COOKIE		8	Cookie
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+	field			width	description
+	---------------------------------------------------
+	OF_DPA_STAT_DURATION	4	Flow duration
+	OF_DPA_STAT_RX_PKTS	8	Received packets
+	OF_DPA_STAT_TX_PKTS	8	Transmit packets
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	command			reason
+	--------------------------------------------------------------------
+	0		all			OK
+	-ROCKER_EFAULT	all			head or tail index outside
+						of ring
+	-ROCKER_ENXIO	all			address or data read err on
+						desc buf
+	-ROCKER_EMSGSIZE GET_STATS		cmd descriptor buffer wasn't
+						big enough to contain write-back
+						TLVs
+	-ROCKER_EINVAL	all			invalid parameters passed in
+	-ROCKER_EEXIST	ADD			entry already exists
+	-ROCKER_ENOSPC	ADD			no space left in flow table
+	-ROCKER_ENOENT	MOD|DEL|GET_STATS	cookie invalid
+
+Group Table Interface
+---------------------
+
+There are commands to add, modify, delete, and get stats of group table
+entries.  The commands are issued using the DMA CMD descriptor ring.  The
+following commands are defined:
+
+	CMD_ADD:		add an entry to group table
+	CMD_MOD:		modify an entry in group table
+	CMD_DEL:		delete an entry from group table
+	CMD_GET_STATS:		get stats for group entry
+
+TLVs for add and modify commands are:
+
+	field			width	description
+	-----------------------------------------------------------
+	FLOW_GROUP_CMD		2	CMD_[ADD|MOD]
+	FLOW_GROUP_ID		2	Flow group ID
+	FLOW_GROUP_TYPE		1	Group type:
+					  0: L2 interface
+					  1: L2 rewrite
+					  2: L3 unicast
+					  3: L2 multicast
+					  4: L2 flood
+					  5: L3 interface
+					  6: L3 multicast
+					  7: L3 ECMP
+					  8: L2 overlay
+	FLOW_VLAN_ID		2	Vlan ID (types 0, 3, 4, 6)
+	FLOW_L2_PORT		2	Port (types 0)
+	FLOW_INDEX		4	Index (all types but 0)
+	FLOW_OVERLAY_TYPE	1	Overlay sub-type (type 8):
+					  0: Flood unicast tunnel
+					  1: Flood multicast tunnel
+					  2: Multicast unicast tunnel
+					  3: Multicast multicast tunnel
+	FLOW_GROUP_ACTION		nest
+	  FLOW_GROUP_ID		2	next group ID in chain (all
+					types except 0)
+	  FLOW_OUT_PORT		4	egress port (types 0, 8)
+	  FLOW_POP_VLAN_TAG	1	strip outer VLAN tag (type 1
+					only)
+	  FLOW_VLAN_ID		2	(types 1, 5)
+	  FLOW_SRC_MAC		6	(types 1, 2, 5)
+	  FLOW_DST_MAC		6	(types 1, 2)
+
+TLVs for flow delete and get stats command are:
+
+	field			width	description
+	-----------------------------------------------------------
+	FLOW_GROUP_CMD		2	CMD_[DEL|GET_STATS]
+	FLOW_GROUP_ID		2	Flow group ID
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+	field			width	description
+	---------------------------------------------------
+	FLOW_GROUP_ID		2	Flow group ID
+	FLOW_STAT_DURATION	4	Flow duration
+	FLOW_STAT_REF_COUNT	4	Flow reference count
+	FLOW_STAT_BUCKET_COUNT	4	Flow bucket count
+
+Possible status return codes in descriptor on completion are:
+
+	DESC_COMP_ERR	command			reason
+	--------------------------------------------------------------------
+	0		all			OK
+	-ROCKER_EFAULT	all			head or tail index outside
+						of ring
+	-ROCKER_ENXIO	all			address or data read err on
+						desc buf
+	-ROCKER_ENOSPC	GET_STATS		cmd descriptor buffer wasn't
+						big enough to contain write-back
+						TLVs
+	-ROCKER_EINVAL	ADD|MOD			invalid parameters passed in
+	-ROCKER_EEXIST	ADD			entry already exists
+	-ROCKER_ENOSPC	ADD			no space left in flow table
+	-ROCKER_ENOENT	MOD|DEL|GET_STATS	group ID invalid
+	-ROCKER_EBUSY	DEL			group reference count non-zero
+	-ROCKER_ENODEV	ADD			next group ID doesn't exist
+
+
+
+References
+==========
+
+[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
+Version 1.0, from Broadcom Corporation, February 21, 2014.
diff --git a/docs/specs/standard-vga.txt b/docs/specs/standard-vga.txt
new file mode 100644
index 000000000..18f75f1b3
--- /dev/null
+++ b/docs/specs/standard-vga.txt
@@ -0,0 +1,81 @@
+
+QEMU Standard VGA
+=================
+
+Exists in two variants, for isa and pci.
+
+command line switches:
+    -vga std               [ picks isa for -M isapc, otherwise pci ]
+    -device VGA            [ pci variant ]
+    -device isa-vga        [ isa variant ]
+    -device secondary-vga  [ legacy-free pci variant ]
+
+
+PCI spec
+--------
+
+Applies to the pci variant only for obvious reasons.
+
+PCI ID: 1234:1111
+
+PCI Region 0:
+   Framebuffer memory, 16 MB in size (by default).
+   Size is tunable via vga_mem_mb property.
+
+PCI Region 1:
+   Reserved (so we have the option to make the framebuffer bar 64bit).
+
+PCI Region 2:
+   MMIO bar, 4096 bytes in size (qemu 1.3+)
+
+PCI ROM Region:
+   Holds the vgabios (qemu 0.14+).
+
+
+The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
+instead of PCI_CLASS_DISPLAY_VGA.
+
+
+IO ports used
+-------------
+
+Doesn't apply to the legacy-free pci variant, use the MMIO bar instead.
+
+03c0 - 03df : standard vga ports
+01ce        : bochs vbe interface index port
+01cf        : bochs vbe interface data port (x86 only)
+01d0        : bochs vbe interface data port
+
+
+Memory regions used
+-------------------
+
+0xe0000000 : Framebuffer memory, isa variant only.
+
+The pci variant used to mirror the framebuffer bar here, qemu 0.14+
+stops doing that (except when in -M pc-$old compat mode).
+
+
+MMIO area spec
+--------------
+
+Likewise applies to the pci variant only for obvious reasons.
+
+0000 - 03ff : edid data blob.
+0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
+              word access is supported, bytes are written
+              in little endia order (aka index port first),
+              so indexed registers can be updated with a
+              single mmio write (and thus only one vmexit).
+0500 - 0515 : bochs dispi interface registers, mapped flat
+              without index/data ports.  Use (index << 1)
+              as offset for (16bit) register access.
+
+0600 - 0607 : qemu extended registers.  qemu 2.2+ only.
+              The pci revision is 2 (or greater) when
+              these registers are present.  The registers
+              are 32bit.
+  0600      : qemu extended register region size, in bytes.
+  0604      : framebuffer endianness register.
+              - 0xbebebebe indicates big endian.
+              - 0x1e1e1e1e indicates little endian.
diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst
new file mode 100644
index 000000000..3be190343
--- /dev/null
+++ b/docs/specs/tpm.rst
@@ -0,0 +1,524 @@
+===============
+QEMU TPM Device
+===============
+
+Guest-side hardware interface
+=============================
+
+TIS interface
+-------------
+
+The QEMU TPM emulation implements a TPM TIS hardware interface
+following the Trusted Computing Group's specification "TCG PC Client
+Specific TPM Interface Specification (TIS)", Specification Version
+1.3, 21 March 2013. (see the `TIS specification`_, or a later version
+of it).
+
+The TIS interface makes a memory mapped IO region in the area
+0xfed40000-0xfed44fff available to the guest operating system.
+
+QEMU files related to TPM TIS interface:
+ - ``hw/tpm/tpm_tis_common.c``
+ - ``hw/tpm/tpm_tis_isa.c``
+ - ``hw/tpm/tpm_tis_sysbus.c``
+ - ``hw/tpm/tpm_tis.h``
+
+Both an ISA device and a sysbus device are available. The former is
+used with pc/q35 machine while the latter can be instantiated in the
+Arm virt machine.
+
+CRB interface
+-------------
+
+QEMU also implements a TPM CRB interface following the Trusted
+Computing Group's specification "TCG PC Client Platform TPM Profile
+(PTP) Specification", Family "2.0", Level 00 Revision 01.03 v22, May
+22, 2017. (see the `CRB specification`_, or a later version of it)
+
+The CRB interface makes a memory mapped IO region in the area
+0xfed40000-0xfed40fff (1 locality) available to the guest
+operating system.
+
+QEMU files related to TPM CRB interface:
+ - ``hw/tpm/tpm_crb.c``
+
+SPAPR interface
+---------------
+
+pSeries (ppc64) machines offer a tpm-spapr device model.
+
+QEMU files related to the SPAPR interface:
+ - ``hw/tpm/tpm_spapr.c``
+
+fw_cfg interface
+================
+
+The bios/firmware may read the ``"etc/tpm/config"`` fw_cfg entry for
+configuring the guest appropriately.
+
+The entry of 6 bytes has the following content, in little-endian:
+
+.. code-block:: c
+
+    #define TPM_VERSION_UNSPEC          0
+    #define TPM_VERSION_1_2             1
+    #define TPM_VERSION_2_0             2
+
+    #define TPM_PPI_VERSION_NONE        0
+    #define TPM_PPI_VERSION_1_30        1
+
+    struct FwCfgTPMConfig {
+        uint32_t tpmppi_address;         /* PPI memory location */
+        uint8_t tpm_version;             /* TPM version */
+        uint8_t tpmppi_version;          /* PPI version */
+    };
+
+ACPI interface
+==============
+
+The TPM device is defined with ACPI ID "PNP0C31". QEMU builds a SSDT
+and passes it into the guest through the fw_cfg device. The device
+description contains the base address of the TIS interface 0xfed40000
+and the size of the MMIO area (0x5000). In case a TPM2 is used by
+QEMU, a TPM2 ACPI table is also provided.  The device is described to
+be used in polling mode rather than interrupt mode primarily because
+no unused IRQ could be found.
+
+To support measurement logs to be written by the firmware,
+e.g. SeaBIOS, a TCPA table is implemented. This table provides a 64kb
+buffer where the firmware can write its log into. For TPM 2 only a
+more recent version of the TPM2 table provides support for
+measurements logs and a TCPA table does not need to be created.
+
+The TCPA and TPM2 ACPI tables follow the Trusted Computing Group
+specification "TCG ACPI Specification" Family "1.2" and "2.0", Level
+00 Revision 00.37. (see the `ACPI specification`_, or a later version
+of it)
+
+ACPI PPI Interface
+------------------
+
+QEMU supports the Physical Presence Interface (PPI) for TPM 1.2 and
+TPM 2. This interface requires ACPI and firmware support. (see the
+`PPI specification`_)
+
+PPI enables a system administrator (root) to request a modification to
+the TPM upon reboot. The PPI specification defines the operation
+requests and the actions the firmware has to take. The system
+administrator passes the operation request number to the firmware
+through an ACPI interface which writes this number to a memory
+location that the firmware knows. Upon reboot, the firmware finds the
+number and sends commands to the TPM. The firmware writes the TPM
+result code and the operation request number to a memory location that
+ACPI can read from and pass the result on to the administrator.
+
+The PPI specification defines a set of mandatory and optional
+operations for the firmware to implement. The ACPI interface also
+allows an administrator to list the supported operations. In QEMU the
+ACPI code is generated by QEMU, yet the firmware needs to implement
+support on a per-operations basis, and different firmwares may support
+a different subset. Therefore, QEMU introduces the virtual memory
+device for PPI where the firmware can indicate which operations it
+supports and ACPI can enable the ones that are supported and disable
+all others. This interface lies in main memory and has the following
+layout:
+
+ +-------------+--------+--------+-------------------------------------------+
+ |  Field      | Length | Offset | Description                               |
+ +=============+========+========+===========================================+
+ | ``func``    |  0x100 |  0x000 | Firmware sets values for each supported   |
+ |             |        |        | operation. See defined values below.      |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``ppin``    |   0x1  |  0x100 | SMI interrupt to use. Set by firmware.    |
+ |             |        |        | Not supported.                            |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``ppip``    |   0x4  |  0x101 | ACPI function index to pass to SMM code.  |
+ |             |        |        | Set by ACPI. Not supported.               |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``pprp``    |   0x4  |  0x105 | Result of last executed operation. Set by |
+ |             |        |        | firmware. See function index 5 for values.|
+ +-------------+--------+--------+-------------------------------------------+
+ | ``pprq``    |   0x4  |  0x109 | Operation request number to execute. See  |
+ |             |        |        | 'Physical Presence Interface Operation    |
+ |             |        |        | Summary' tables in specs. Set by ACPI.    |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``pprm``    |   0x4  |  0x10d | Operation request optional parameter.     |
+ |             |        |        | Values depend on operation. Set by ACPI.  |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``lppr``    |   0x4  |  0x111 | Last executed operation request number.   |
+ |             |        |        | Copied from pprq field by firmware.       |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``fret``    |   0x4  |  0x115 | Result code from SMM function.            |
+ |             |        |        | Not supported.                            |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``res1``    |  0x40  |  0x119 | Reserved for future use                   |
+ +-------------+--------+--------+-------------------------------------------+
+ |``next_step``|   0x1  |  0x159 | Operation to execute after reboot by      |
+ |             |        |        | firmware. Used by firmware.               |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``movv``    |   0x1  |  0x15a | Memory overwrite variable                 |
+ +-------------+--------+--------+-------------------------------------------+
+
+The following values are supported for the ``func`` field. They
+correspond to the values used by ACPI function index 8.
+
+ +----------+-------------------------------------------------------------+
+ | Value    | Description                                                 |
+ +==========+=============================================================+
+ | 0        | Operation is not implemented.                               |
+ +----------+-------------------------------------------------------------+
+ | 1        | Operation is only accessible through firmware.              |
+ +----------+-------------------------------------------------------------+
+ | 2        | Operation is blocked for OS by firmware configuration.      |
+ +----------+-------------------------------------------------------------+
+ | 3        | Operation is allowed and physically present user required.  |
+ +----------+-------------------------------------------------------------+
+ | 4        | Operation is allowed and physically present user is not     |
+ |          | required.                                                   |
+ +----------+-------------------------------------------------------------+
+
+The location of the table is given by the fw_cfg ``tpmppi_address``
+field.  The PPI memory region size is 0x400 (``TPM_PPI_ADDR_SIZE``) to
+leave enough room for future updates.
+
+QEMU files related to TPM ACPI tables:
+ - ``hw/i386/acpi-build.c``
+ - ``include/hw/acpi/tpm.h``
+
+TPM backend devices
+===================
+
+The TPM implementation is split into two parts, frontend and
+backend. The frontend part is the hardware interface, such as the TPM
+TIS interface described earlier, and the other part is the TPM backend
+interface. The backend interfaces implement the interaction with a TPM
+device, which may be a physical or an emulated device. The split
+between the front- and backend devices allows a frontend to be
+connected with any available backend. This enables the TIS interface
+to be used with the passthrough backend or the swtpm backend.
+
+QEMU files related to TPM backends:
+ - ``backends/tpm.c``
+ - ``include/sysemu/tpm.h``
+ - ``include/sysemu/tpm_backend.h``
+
+The QEMU TPM passthrough device
+-------------------------------
+
+In case QEMU is run on Linux as the host operating system it is
+possible to make the hardware TPM device available to a single QEMU
+guest. In this case the user must make sure that no other program is
+using the device, e.g., /dev/tpm0, before trying to start QEMU with
+it.
+
+The passthrough driver uses the host's TPM device for sending TPM
+commands and receiving responses from. Besides that it accesses the
+TPM device's sysfs entry for support of command cancellation. Since
+none of the state of a hardware TPM can be migrated between hosts,
+virtual machine migration is disabled when the TPM passthrough driver
+is used.
+
+Since the host's TPM device will already be initialized by the host's
+firmware, certain commands, e.g. ``TPM_Startup()``, sent by the
+virtual firmware for device initialization, will fail. In this case
+the firmware should not use the TPM.
+
+Sharing the device with the host is generally not a recommended usage
+scenario for a TPM device. The primary reason for this is that two
+operating systems can then access the device's single set of
+resources, such as platform configuration registers
+(PCRs). Applications or kernel security subsystems, such as the Linux
+Integrity Measurement Architecture (IMA), are not expecting to share
+PCRs.
+
+QEMU files related to the TPM passthrough device:
+ - ``backends/tpm/tpm_passthrough.c``
+ - ``backends/tpm/tpm_util.c``
+ - ``include/sysemu/tpm_util.h``
+
+
+Command line to start QEMU with the TPM passthrough device using the host's
+hardware TPM ``/dev/tpm0``:
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+  -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+  -tpmdev passthrough,id=tpm0,path=/dev/tpm0 \
+  -device tpm-tis,tpmdev=tpm0 test.img
+
+
+The following commands should result in similar output inside the VM
+with a Linux kernel that either has the TPM TIS driver built-in or
+available as a module:
+
+.. code-block:: console
+
+  # dmesg | grep -i tpm
+  [    0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1)
+
+  # dmesg | grep TCPA
+  [    0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS  \
+      BXPCTCPA 0000001 BXPC 00000001)
+
+  # ls -l /dev/tpm*
+  crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
+
+  # find /sys/devices/ | grep pcrs$ | xargs cat
+  PCR-00: 35 4E 3B CE 23 9F 38 59 ...
+  ...
+  PCR-23: 00 00 00 00 00 00 00 00 ...
+
+The QEMU TPM emulator device
+----------------------------
+
+The TPM emulator device uses an external TPM emulator called 'swtpm'
+for sending TPM commands to and receiving responses from. The swtpm
+program must have been started before trying to access it through the
+TPM emulator with QEMU.
+
+The TPM emulator implements a command channel for transferring TPM
+commands and responses as well as a control channel over which control
+commands can be sent. (see the `SWTPM protocol`_ specification)
+
+The control channel serves the purpose of resetting, initializing, and
+migrating the TPM state, among other things.
+
+The swtpm program behaves like a hardware TPM and therefore needs to
+be initialized by the firmware running inside the QEMU virtual
+machine.  One necessary step for initializing the device is to send
+the TPM_Startup command to it. SeaBIOS, for example, has been
+instrumented to initialize a TPM 1.2 or TPM 2 device using this
+command.
+
+QEMU files related to the TPM emulator device:
+ - ``backends/tpm/tpm_emulator.c``
+ - ``backends/tpm/tpm_util.c``
+ - ``include/sysemu/tpm_util.h``
+
+The following commands start the swtpm with a UnixIO control channel over
+a socket interface. They do not need to be run as root.
+
+.. code-block:: console
+
+  mkdir /tmp/mytpm1
+  swtpm socket --tpmstate dir=/tmp/mytpm1 \
+    --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+    --log level=20
+
+Command line to start QEMU with the TPM emulator device communicating
+with the swtpm (x86):
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+    -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-tis,tpmdev=tpm0 test.img
+
+In case a pSeries machine is emulated, use the following command line:
+
+.. code-block:: console
+
+  qemu-system-ppc64 -display sdl -machine pseries,accel=kvm \
+    -m 1024 -bios slof.bin -boot menu=on \
+    -nodefaults -device VGA -device pci-ohci -device usb-kbd \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-spapr,tpmdev=tpm0 \
+    -device spapr-vscsi,id=scsi0,reg=0x00002000 \
+    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0 \
+    -drive file=test.img,format=raw,if=none,id=drive-virtio-disk0
+
+In case an Arm virt machine is emulated, use the following command line:
+
+.. code-block:: console
+
+  qemu-system-aarch64 -machine virt,gic-version=3,accel=kvm \
+    -cpu host -m 4G \
+    -nographic -no-acpi \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-tis-device,tpmdev=tpm0 \
+    -device virtio-blk-pci,drive=drv0 \
+    -drive format=qcow2,file=hda.qcow2,if=none,id=drv0 \
+    -drive if=pflash,format=raw,file=flash0.img,readonly=on \
+    -drive if=pflash,format=raw,file=flash1.img
+
+In case SeaBIOS is used as firmware, it should show the TPM menu item
+after entering the menu with 'ESC'.
+
+.. code-block:: console
+
+  Select boot device:
+  1. DVD/CD [ata1-0: QEMU DVD-ROM ATAPI-4 DVD/CD]
+  [...]
+  5. Legacy option rom
+
+  t. TPM Configuration
+
+The following commands should result in similar output inside the VM
+with a Linux kernel that either has the TPM TIS driver built-in or
+available as a module:
+
+.. code-block:: console
+
+  # dmesg | grep -i tpm
+  [    0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1)
+
+  # dmesg | grep TCPA
+  [    0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS  \
+      BXPCTCPA 0000001 BXPC 00000001)
+
+  # ls -l /dev/tpm*
+  crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
+
+  # find /sys/devices/ | grep pcrs$ | xargs cat
+  PCR-00: 35 4E 3B CE 23 9F 38 59 ...
+  ...
+  PCR-23: 00 00 00 00 00 00 00 00 ...
+
+Migration with the TPM emulator
+===============================
+
+The TPM emulator supports the following types of virtual machine
+migration:
+
+- VM save / restore (migration into a file)
+- Network migration
+- Snapshotting (migration into storage like QoW2 or QED)
+
+The following command sequences can be used to test VM save / restore.
+
+In a 1st terminal start an instance of a swtpm using the following command:
+
+.. code-block:: console
+
+  mkdir /tmp/mytpm1
+  swtpm socket --tpmstate dir=/tmp/mytpm1 \
+    --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+    --log level=20 --tpm2
+
+In a 2nd terminal start the VM:
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+    -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-tis,tpmdev=tpm0 \
+    -monitor stdio \
+    test.img
+
+Verify that the attached TPM is working as expected using applications
+inside the VM.
+
+To store the state of the VM use the following command in the QEMU
+monitor in the 2nd terminal:
+
+.. code-block:: console
+
+  (qemu) migrate "exec:cat > testvm.bin"
+  (qemu) quit
+
+At this point a file called ``testvm.bin`` should exists and the swtpm
+and QEMU processes should have ended.
+
+To test 'VM restore' you have to start the swtpm with the same
+parameters as before. If previously a TPM 2 [--tpm2] was saved, --tpm2
+must now be passed again on the command line.
+
+In the 1st terminal restart the swtpm with the same command line as
+before:
+
+.. code-block:: console
+
+  swtpm socket --tpmstate dir=/tmp/mytpm1 \
+    --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+    --log level=20 --tpm2
+
+In the 2nd terminal restore the state of the VM using the additional
+'-incoming' option.
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+    -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-tis,tpmdev=tpm0 \
+    -incoming "exec:cat < testvm.bin" \
+    test.img
+
+Troubleshooting migration
+-------------------------
+
+There are several reasons why migration may fail. In case of problems,
+please ensure that the command lines adhere to the following rules
+and, if possible, that identical versions of QEMU and swtpm are used
+at all times.
+
+VM save and restore:
+
+ - QEMU command line parameters should be identical apart from the
+   '-incoming' option on VM restore
+
+ - swtpm command line parameters should be identical
+
+VM migration to 'localhost':
+
+ - QEMU command line parameters should be identical apart from the
+   '-incoming' option on the destination side
+
+ - swtpm command line parameters should point to two different
+   directories on the source and destination swtpm (--tpmstate dir=...)
+   (especially if different versions of libtpms were to be used on the
+   same machine).
+
+VM migration across the network:
+
+ - QEMU command line parameters should be identical apart from the
+   '-incoming' option on the destination side
+
+ - swtpm command line parameters should be identical
+
+VM Snapshotting:
+ - QEMU command line parameters should be identical
+
+ - swtpm command line parameters should be identical
+
+
+Besides that, migration failure reasons on the swtpm level may include
+the following:
+
+ - the versions of the swtpm on the source and destination sides are
+   incompatible
+
+   - downgrading of TPM state may not be supported
+
+   - the source and destination libtpms were compiled with different
+     compile-time options and the destination side refuses to accept the
+     state
+
+ - different migration keys are used on the source and destination side
+   and the destination side cannot decrypt the migrated state
+   (swtpm ... --migration-key ... )
+
+
+.. _TIS specification:
+   https://trustedcomputinggroup.org/pc-client-work-group-pc-client-specific-tpm-interface-specification-tis/
+
+.. _CRB specification:
+   https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
+
+
+.. _ACPI specification:
+   https://trustedcomputinggroup.org/tcg-acpi-specification/
+
+.. _PPI specification:
+   https://trustedcomputinggroup.org/resource/tcg-physical-presence-interface-specification/
+
+.. _SWTPM protocol:
+   https://github.com/stefanberger/swtpm/blob/master/man/man3/swtpm_ioctls.pod
diff --git a/docs/specs/virt-ctlr.txt b/docs/specs/virt-ctlr.txt
new file mode 100644
index 000000000..24d38084f
--- /dev/null
+++ b/docs/specs/virt-ctlr.txt
@@ -0,0 +1,26 @@
+Virtual System Controller
+=========================
+
+This device is a simple interface defined for the pure virtual machine with no
+hardware reference implementation to allow the guest kernel to send command
+to the host hypervisor.
+
+The specification can evolve, the current state is defined as below.
+
+This is a MMIO mapped device using 256 bytes.
+
+Two 32bit registers are defined:
+
+1- the features register (read-only, address 0x00)
+
+   This register allows the device to report features supported by the
+   controller.
+   The only feature supported for the moment is power control (0x01).
+
+2- the command register (write-only, address 0x04)
+
+   This register allows the kernel to send the commands to the hypervisor.
+   The implemented commands are part of the power control feature and
+   are reset (1), halt (2) and panic (3).
+   A basic command, no-op (0), is always present and can be used to test the
+   register access. This command has no effect.
diff --git a/docs/specs/vmcoreinfo.txt b/docs/specs/vmcoreinfo.txt
new file mode 100644
index 000000000..bcbca6fe4
--- /dev/null
+++ b/docs/specs/vmcoreinfo.txt
@@ -0,0 +1,53 @@
+=================
+VMCoreInfo device
+=================
+
+The `-device vmcoreinfo` will create a fw_cfg entry for a guest to
+store dump details.
+
+etc/vmcoreinfo
+**************
+
+A guest may use this fw_cfg entry to add information details to qemu
+dumps.
+
+The entry of 16 bytes has the following layout, in little-endian::
+
+#define VMCOREINFO_FORMAT_NONE 0x0
+#define VMCOREINFO_FORMAT_ELF 0x1
+
+    struct FWCfgVMCoreInfo {
+        uint16_t host_format;  /* formats host supports */
+        uint16_t guest_format; /* format guest supplies */
+        uint32_t size;         /* size of vmcoreinfo region */
+        uint64_t paddr;        /* physical address of vmcoreinfo region */
+    };
+
+Only full write (of 16 bytes) are considered valid for further
+processing of entry values.
+
+A write of 0 in guest_format will disable further processing of
+vmcoreinfo entry values & content.
+
+You may write a guest_format that is not supported by the host, in
+which case the entry data can be ignored by qemu (but you may still
+access it through a debugger, via vmcoreinfo_realize::vmcoreinfo_state).
+
+Format & content
+****************
+
+As of qemu 2.11, only VMCOREINFO_FORMAT_ELF is supported.
+
+The entry gives location and size of an ELF note that is appended in
+qemu dumps.
+
+The note format/class must be of the target bitness and the size must
+be less than 1Mb.
+
+If the ELF note name is "VMCOREINFO", it is expected to be the Linux
+vmcoreinfo note (see Documentation/ABI/testing/sysfs-kernel-vmcoreinfo
+in Linux source). In this case, qemu dump code will read the content
+as a key=value text file, looking for "NUMBER(phys_base)" key
+value. The value is expected to be more accurate than architecture
+guess of the value. This is useful for KASLR-enabled guest with
+ancient tools not handling the VMCOREINFO note.
diff --git a/docs/specs/vmgenid.txt b/docs/specs/vmgenid.txt
new file mode 100644
index 000000000..aa9f51867
--- /dev/null
+++ b/docs/specs/vmgenid.txt
@@ -0,0 +1,245 @@
+VIRTUAL MACHINE GENERATION ID
+=============================
+
+Copyright (C) 2016 Red Hat, Inc.
+Copyright (C) 2017 Skyport Systems, Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+===
+
+The VM generation ID (vmgenid) device is an emulated device which
+exposes a 128-bit, cryptographically random, integer value identifier,
+referred to as a Globally Unique Identifier, or GUID.
+
+This allows management applications (e.g. libvirt) to notify the guest
+operating system when the virtual machine is executed with a different
+configuration (e.g. snapshot execution or creation from a template).  The
+guest operating system notices the change, and is then able to react as
+appropriate by marking its copies of distributed databases as dirty,
+re-initializing its random number generator etc.
+
+
+Requirements
+------------
+
+These requirements are extracted from the "How to implement virtual machine
+generation ID support in a virtualization platform" section of the
+specification, dated August 1, 2012.
+
+
+The document may be found on the web at:
+  http://go.microsoft.com/fwlink/?LinkId=260709
+
+R1a. The generation ID shall live in an 8-byte aligned buffer.
+
+R1b. The buffer holding the generation ID shall be in guest RAM, ROM, or device
+     MMIO range.
+
+R1c. The buffer holding the generation ID shall be kept separate from areas
+     used by the operating system.
+
+R1d. The buffer shall not be covered by an AddressRangeMemory or
+     AddressRangeACPI entry in the E820 or UEFI memory map.
+
+R1e. The generation ID shall not live in a page frame that could be mapped with
+     caching disabled. (In other words, regardless of whether the generation ID
+     lives in RAM, ROM or MMIO, it shall only be mapped as cacheable.)
+
+R2 to R5. [These AML requirements are isolated well enough in the Microsoft
+          specification for us to simply refer to them here.]
+
+R6. The hypervisor shall expose a _HID (hardware identifier) object in the
+    VMGenId device's scope that is unique to the hypervisor vendor.
+
+
+QEMU Implementation
+-------------------
+
+The above-mentioned specification does not dictate which ACPI descriptor table
+will contain the VM Generation ID device.  Other implementations (Hyper-V and
+Xen) put it in the main descriptor table (Differentiated System Description
+Table or DSDT).  For ease of debugging and implementation, we have decided to
+put it in its own Secondary System Description Table, or SSDT.
+
+The following is a dump of the contents from a running system:
+
+# iasl -p ./SSDT -d /sys/firmware/acpi/tables/SSDT
+
+Intel ACPI Component Architecture
+ASL+ Optimizing Compiler version 20150717-64
+Copyright (c) 2000 - 2015 Intel Corporation
+
+Reading ACPI table from file /sys/firmware/acpi/tables/SSDT - Length
+00000198 (0x0000C6)
+ACPI: SSDT 0x0000000000000000 0000C6 (v01 BOCHS  VMGENID  00000001 BXPC
+00000001)
+Acpi table [SSDT] successfully installed and loaded
+Pass 1 parse of [SSDT]
+Pass 2 parse of [SSDT]
+Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions)
+
+Parsing completed
+Disassembly completed
+ASL Output:    ./SSDT.dsl - 1631 bytes
+# cat SSDT.dsl
+/*
+ * Intel ACPI Component Architecture
+ * AML/ASL+ Disassembler version 20150717-64
+ * Copyright (c) 2000 - 2015 Intel Corporation
+ *
+ * Disassembling to symbolic ASL+ operators
+ *
+ * Disassembly of /sys/firmware/acpi/tables/SSDT, Sun Feb  5 00:19:37 2017
+ *
+ * Original Table Header:
+ *     Signature        "SSDT"
+ *     Length           0x000000CA (202)
+ *     Revision         0x01
+ *     Checksum         0x4B
+ *     OEM ID           "BOCHS "
+ *     OEM Table ID     "VMGENID"
+ *     OEM Revision     0x00000001 (1)
+ *     Compiler ID      "BXPC"
+ *     Compiler Version 0x00000001 (1)
+ */
+DefinitionBlock ("/sys/firmware/acpi/tables/SSDT.aml", "SSDT", 1, "BOCHS ",
+"VMGENID", 0x00000001)
+{
+    Name (VGIA, 0x07FFF000)
+    Scope (\_SB)
+    {
+        Device (VGEN)
+        {
+            Name (_HID, "QEMUVGID")  // _HID: Hardware ID
+            Name (_CID, "VM_Gen_Counter")  // _CID: Compatible ID
+            Name (_DDN, "VM_Gen_Counter")  // _DDN: DOS Device Name
+            Method (_STA, 0, NotSerialized)  // _STA: Status
+            {
+                Local0 = 0x0F
+                If ((VGIA == Zero))
+                {
+                    Local0 = Zero
+                }
+
+                Return (Local0)
+            }
+
+            Method (ADDR, 0, NotSerialized)
+            {
+                Local0 = Package (0x02) {}
+                Index (Local0, Zero) = (VGIA + 0x28)
+                Index (Local0, One) = Zero
+                Return (Local0)
+            }
+        }
+    }
+
+    Method (\_GPE._E05, 0, NotSerialized)  // _Exx: Edge-Triggered GPE
+    {
+        Notify (\_SB.VGEN, 0x80) // Status Change
+    }
+}
+
+
+Design Details:
+---------------
+
+Requirements R1a through R1e dictate that the memory holding the
+VM Generation ID must be allocated and owned by the guest firmware,
+in this case BIOS or UEFI.  However, to be useful, QEMU must be able to
+change the contents of the memory at runtime, specifically when starting a
+backed-up or snapshotted image.  In order to do this, QEMU must know the
+address that has been allocated.
+
+The mechanism chosen for this memory sharing is writeable fw_cfg blobs.
+These are data object that are visible to both QEMU and guests, and are
+addressable as sequential files.
+
+More information about fw_cfg can be found in "docs/specs/fw_cfg.txt"
+
+Two fw_cfg blobs are used in this case:
+
+/etc/vmgenid_guid - contains the actual VM Generation ID GUID
+                  - read-only to the guest
+/etc/vmgenid_addr - contains the address of the downloaded vmgenid blob
+                  - writeable by the guest
+
+
+QEMU sends the following commands to the guest at startup:
+
+1. Allocate memory for vmgenid_guid fw_cfg blob.
+2. Write the address of vmgenid_guid into the SSDT (VGIA ACPI variable as
+   shown above in the iasl dump).  Note that this change is not propagated
+   back to QEMU.
+3. Write the address of vmgenid_guid back to QEMU's copy of vmgenid_addr
+   via the fw_cfg DMA interface.
+
+After step 3, QEMU is able to update the contents of vmgenid_guid at will.
+
+Since BIOS or UEFI does not necessarily run when we wish to change the GUID,
+the value of VGIA is persisted via the VMState mechanism.
+
+As spelled out in the specification, any change to the GUID executes an
+ACPI notification.  The exact handler to use is not specified, so the vmgenid
+device uses the first unused one:  \_GPE._E05.
+
+
+Endian-ness Considerations:
+---------------------------
+
+Although not specified in Microsoft's document, it is assumed that the
+device is expected to use little-endian format.
+
+All GUID passed in via command line or monitor are treated as big-endian.
+GUID values displayed via monitor are shown in big-endian format.
+
+
+GUID Storage Format:
+--------------------
+
+In order to implement an OVMF "SDT Header Probe Suppressor", the contents of
+the vmgenid_guid fw_cfg blob are not simply a 128-bit GUID.  There is also
+significant padding in order to align and fill a memory page, as shown in the
+following diagram:
+
++----------------------------------+
+| SSDT with OEM Table ID = VMGENID |
++----------------------------------+
+| ...                              |       TOP OF PAGE
+| VGIA dword object ---------------|-----> +---------------------------+
+| ...                              |       | fw-allocated array for    |
+| _STA method referring to VGIA    |       | "etc/vmgenid_guid"        |
+| ...                              |       +---------------------------+
+| ADDR method referring to VGIA    |       |  0: OVMF SDT Header probe |
+| ...                              |       |     suppressor            |
++----------------------------------+       | 36: padding for 8-byte    |
+                                           |     alignment             |
+                                           | 40: GUID                  |
+                                           | 56: padding to page size  |
+                                           +---------------------------+
+                                           END OF PAGE
+
+
+Device Usage:
+-------------
+
+The device has one property, which may be only be set using the command line:
+
+  guid - sets the value of the GUID.  A special value "auto" instructs
+         QEMU to generate a new random GUID.
+
+For example:
+
+  QEMU  -device vmgenid,guid="324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"
+  QEMU  -device vmgenid,guid=auto
+
+The property may be queried via QMP/HMP:
+
+  (QEMU) query-vm-generation-id
+  {"return": {"guid": "324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"}}
+
+Setting of this parameter is intentionally left out from the QMP/HMP
+interfaces.  There are no known use cases for changing the GUID once QEMU is
+running, and adding this capability would greatly increase the complexity.
diff --git a/docs/specs/vmw_pvscsi-spec.txt b/docs/specs/vmw_pvscsi-spec.txt
new file mode 100644
index 000000000..49affb2a4
--- /dev/null
+++ b/docs/specs/vmw_pvscsi-spec.txt
@@ -0,0 +1,92 @@
+General Description
+===================
+
+This document describes VMWare PVSCSI device interface specification.
+Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
+Based on source code of PVSCSI Linux driver from kernel 3.0.4
+
+PVSCSI Device Interface Overview
+================================
+
+The interface is based on memory area shared between hypervisor and VM.
+Memory area is obtained by driver as device IO memory resource of
+PVSCSI_MEM_SPACE_SIZE length.
+The shared memory consists of registers area and rings area.
+The registers area is used to raise hypervisor interrupts and issue device
+commands. The rings area is used to transfer data descriptors and SCSI
+commands from VM to hypervisor and to transfer messages produced by
+hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
+
+PVSCSI Device Registers
+=======================
+
+The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
+The structure of the registers area is described by the PVSCSIRegOffset enum.
+There are registers to issue device command (with optional short data),
+issue device interrupt, control interrupts masking.
+
+PVSCSI Device Rings
+===================
+
+There are three rings in shared memory:
+
+    1. Request ring (struct PVSCSIRingReqDesc *req_ring)
+        - ring for OS to device requests
+    2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
+        - ring for device request completions
+    3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
+        - ring for messages from device.
+       This ring is optional and the guest might not configure it.
+There is a control area (struct PVSCSIRingsState *rings_state) used to control
+rings operation.
+
+PVSCSI Device to Host Interrupts
+================================
+There are following interrupt types supported by PVSCSI device:
+    1. Completion interrupts (completion ring notifications):
+        PVSCSI_INTR_CMPL_0
+        PVSCSI_INTR_CMPL_1
+    2. Message interrupts (message ring notifications):
+        PVSCSI_INTR_MSG_0
+        PVSCSI_INTR_MSG_1
+
+Interrupts are controlled via PVSCSI_REG_OFFSET_INTR_MASK register
+Bit set means interrupt enabled, bit cleared - disabled
+
+Interrupt modes supported are legacy, MSI and MSI-X
+In case of legacy interrupts, register PVSCSI_REG_OFFSET_INTR_STATUS
+is used to check which interrupt has arrived.  Interrupts are
+acknowledged when the corresponding bit is written to the interrupt
+status register.
+
+PVSCSI Device Operation Sequences
+=================================
+
+1. Startup sequence:
+    a. Issue PVSCSI_CMD_ADAPTER_RESET command;
+    aa. Windows driver reads interrupt status register here;
+    b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
+       check status and disable device messages if error returned;
+       (Omitted if device messages disabled by driver configuration)
+    c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
+       as struct PVSCSICmdDescSetupRings;
+    d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
+       rings configuration as struct PVSCSICmdDescSetupMsgRing;
+    e. Unmask completion and message (if device messages enabled) interrupts.
+
+2. Shutdown sequences
+    a. Mask interrupts;
+    b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
+    c. Issue PVSCSI_CMD_ADAPTER_RESET command.
+
+3. Send request
+    a. Fill next free request ring descriptor;
+    b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
+       or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
+
+4. Abort command
+    a. Issue PVSCSI_CMD_ABORT_CMD command;
+
+5. Request completion processing
+    a. Upon completion interrupt arrival process completion
+       and message (if enabled) rings.