Diffstat (limited to 'docs/interop')
-rw-r--r-- docs/interop/barrier.rst 426
-rw-r--r-- docs/interop/bitmaps.rst 1629
-rw-r--r-- docs/interop/dbus-vmstate.rst 74
-rw-r--r-- docs/interop/dbus.rst 110
-rw-r--r-- docs/interop/firmware.json 593
-rw-r--r-- docs/interop/index.rst 23
-rw-r--r-- docs/interop/live-block-operations.rst 1121
-rw-r--r-- docs/interop/nbd.txt 70
-rw-r--r-- docs/interop/parallels.txt 232
-rw-r--r-- docs/interop/pr-helper.rst 83
-rw-r--r-- docs/interop/prl-xml.txt 158
-rw-r--r-- docs/interop/qcow2.txt 901
-rw-r--r-- docs/interop/qed_spec.txt 138
-rw-r--r-- docs/interop/qemu-ga-ref.rst 7
-rw-r--r-- docs/interop/qemu-ga.rst 134
-rw-r--r-- docs/interop/qemu-qmp-ref.rst 7
-rw-r--r-- docs/interop/qemu-storage-daemon-qmp-ref.rst 7
-rw-r--r-- docs/interop/qmp-intro.txt 88
-rw-r--r-- docs/interop/qmp-spec.txt 406
-rw-r--r-- docs/interop/vhost-user-gpu.rst 243
-rw-r--r-- docs/interop/vhost-user.json 267
-rw-r--r-- docs/interop/vhost-user.rst 1585
-rw-r--r-- docs/interop/vhost-vdpa.rst 17
-rw-r--r-- docs/interop/vnc-ledstate-Pseudo-encoding.txt 50
24 files changed, 8369 insertions, 0 deletions
diff --git a/docs/interop/barrier.rst b/docs/interop/barrier.rst
new file mode 100644
index 000000000..055f2c1ae
--- /dev/null
+++ b/docs/interop/barrier.rst
@@ -0,0 +1,426 @@
+Barrier client protocol
+=======================
+
+QEMU's ``input-barrier`` device implements the client end of
+the KVM (Keyboard-Video-Mouse) software
+`Barrier <https://github.com/debauchee/barrier>`__.
+
+This document briefly describes the protocol as we implement it.
+
+Message format
+--------------
+
+Message format between the server and client is in two parts:
+
+#. the payload length, a 32bit integer in network endianness
+#. the payload
+
+The payload starts with a 4byte string (without NUL) which is the
+command. The first command between the server and the client
+is the only command not encoded on 4 bytes ("Barrier").
+The remaining part of the payload is decoded according to the command.
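+
+The framing described above can be sketched in Python as follows. This is
+only an illustration, not QEMU's implementation: the helper names ``frame``
+and ``unframe`` are hypothetical, and the only facts taken from the text
+are the 32-bit network-order length prefix, the 4-byte command, and the
+longer "Barrier" command used for the initial exchange.
+
+.. code-block:: python
+
+    import struct
+
+    HELLO = b"Barrier"  # the only command that is not 4 bytes long
+
+
+    def frame(payload: bytes) -> bytes:
+        """Prefix a payload with its length as a 32-bit big-endian integer."""
+        return struct.pack("!I", len(payload)) + payload
+
+
+    def unframe(buf: bytes):
+        """Split one message off the front of buf.
+
+        Returns (command, arguments, rest), or None if buf does not yet
+        hold a complete message.
+        """
+        if len(buf) < 4:
+            return None
+        (length,) = struct.unpack_from("!I", buf)
+        if len(buf) < 4 + length:
+            return None
+        payload, rest = buf[4:4 + length], buf[4 + length:]
+        cmd_len = len(HELLO) if payload.startswith(HELLO) else 4
+        return payload[:cmd_len], payload[cmd_len:], rest
+
+Returning the unconsumed remainder lets a caller keep appending received
+bytes to a buffer and call ``unframe`` until it returns ``None``.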
+
+Protocol Description
+--------------------
+
+This comes from ``barrier/src/lib/barrier/protocol_types.h``.
+
+barrierCmdHello "Barrier"
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t major, int16_t minor }``
+Description:
+ Say hello to client
+
+ ``major`` = protocol major version number supported by server
+
+ ``minor`` = protocol minor version number supported by server
+
+barrierCmdHelloBack "Barrier"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ client -> server
+Parameters:
+ ``{ int16_t major, int16_t minor, char *name}``
+Description:
+ Respond to hello from server
+
+ ``major`` = protocol major version number supported by client
+
+ ``minor`` = protocol minor version number supported by client
+
+ ``name`` = client name
+
+barrierCmdDInfo "DINF"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ client -> server
+Parameters:
+ ``{ int16_t x_origin, int16_t y_origin, int16_t width, int16_t height, int16_t x, int16_t y}``
+Description:
+ The client screen must send this message in response to the
+ barrierCmdQInfo message. It must also send this message when the
+ screen's resolution changes. In this case, the client screen should
+ ignore any barrierCmdDMouseMove messages until it receives a
+ barrierCmdCInfoAck in order to prevent attempts to move the mouse off
+ the new screen area.
+
+barrierCmdCNoop "CNOP"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ client -> server
+Parameters:
+ None
+Description:
+ No operation
+
+barrierCmdCClose "CBYE"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Close connection
+
+barrierCmdCEnter "CINN"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t x, int16_t y, int32_t seq, int16_t modifier }``
+Description:
+ Enter screen.
+
+ ``x``, ``y`` = entering screen absolute coordinates
+
+ ``seq`` = sequence number, which is used to order messages between
+ screens. The secondary screen must return this number
+ with some messages
+
+ ``modifier`` = modifier key mask. This will have bits set for each
+ toggle modifier key that is activated on entry to the
+ screen. The secondary screen should adjust its toggle
+ modifiers to reflect that state.
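+
+As an illustration of how a fixed-size parameter list maps onto bytes, the
+sketch below decodes the arguments of a "CINN" message. It assumes the
+fields are packed back to back in network byte order, like the length
+prefix; the text above states the byte order only for the length, so treat
+this as an assumption. The names ``CENTER`` and ``parse_cinn`` are
+hypothetical.
+
+.. code-block:: python
+
+    import struct
+
+    # Layout from the parameter list above:
+    # int16 x, int16 y, int32 seq, int16 modifier (assumed big-endian).
+    CENTER = struct.Struct("!hhih")
+
+
+    def parse_cinn(args: bytes) -> dict:
+        """Decode the 10 argument bytes that follow the "CINN" command."""
+        x, y, seq, modifier = CENTER.unpack(args)
+        return {"x": x, "y": y, "seq": seq, "modifier": modifier}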
+
+barrierCmdCLeave "COUT"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Leaving screen. The secondary screen should send clipboard data in
+ response to this message for those clipboards that it has grabbed
+ (i.e. has sent a barrierCmdCClipboard for and has not received a
+ barrierCmdCClipboard for with a greater sequence number) and that
+ were grabbed or have changed since the last leave.
+
+barrierCmdCClipboard "CCLP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int8_t id, int32_t seq }``
+Description:
+ Grab clipboard. Sent by screen when some other app on that screen
+ grabs a clipboard.
+
+ ``id`` = the clipboard identifier
+
+ ``seq`` = sequence number. Client must use the sequence number passed in
+ the most recent barrierCmdCEnter. The server always sends 0.
+
+barrierCmdCScreenSaver "CSEC"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int8_t started }``
+Description:
+ Screensaver change.
+
+ ``started`` = Screensaver on primary has started (1) or closed (0)
+
+barrierCmdCResetOptions "CROP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Reset options. Client should reset all of its options to their
+ defaults.
+
+barrierCmdCInfoAck "CIAK"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Resolution change acknowledgment. Sent by server in response to a
+ client screen's barrierCmdDInfo. This is sent for every
+ barrierCmdDInfo, whether or not the server had sent a barrierCmdQInfo.
+
+barrierCmdCKeepAlive "CALV"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Keep connection alive. Sent by the server periodically to verify
+ that connections are still up and running. Clients must reply in
+ kind on receipt. If the server gets an error sending the message or
+ does not receive a reply within a reasonable time then the server
+ disconnects the client. If the client doesn't receive these (or any
+ message) periodically then it should disconnect from the server. The
+ appropriate interval is defined by an option.
+
+barrierCmdDKeyDown "DKDN"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t keyid, int16_t modifier [,int16_t button] }``
+Description:
+ Key pressed.
+
+ ``keyid`` = X11 key id
+
+ ``modifier`` = modifier mask
+
+ ``button`` = X11 Xkb keycode (optional)
+
+barrierCmdDKeyRepeat "DKRP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t keyid, int16_t modifier, int16_t repeat [,int16_t button] }``
+Description:
+ Key auto-repeat.
+
+ ``keyid`` = X11 key id
+
+ ``modifier`` = modifier mask
+
+ ``repeat`` = number of repeats
+
+ ``button`` = X11 Xkb keycode (optional)
+
+barrierCmdDKeyUp "DKUP"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t keyid, int16_t modifier [,int16_t button] }``
+Description:
+ Key released.
+
+ ``keyid`` = X11 key id
+
+ ``modifier`` = modifier mask
+
+ ``button`` = X11 Xkb keycode (optional)
+
+barrierCmdDMouseDown "DMDN"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int8_t button }``
+Description:
+ Mouse button pressed.
+
+ ``button`` = button id
+
+barrierCmdDMouseUp "DMUP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int8_t button }``
+Description:
+ Mouse button released.
+
+ ``button`` = button id
+
+barrierCmdDMouseMove "DMMV"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t x, int16_t y }``
+Description:
+ Absolute mouse moved.
+
+ ``x``, ``y`` = absolute screen coordinates
+
+barrierCmdDMouseRelMove "DMRM"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t x, int16_t y }``
+Description:
+ Relative mouse moved.
+
+ ``x``, ``y`` = relative screen coordinates
+
+barrierCmdDMouseWheel "DMWM"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t x , int16_t y }`` or ``{ int16_t y }``
+Description:
+ Mouse scroll. The delta should be +120 for one tick forward (away
+ from the user) or right and -120 for one tick backward (toward the
+ user) or left.
+
+ ``x`` = x delta
+
+ ``y`` = y delta
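+
+A client typically wants wheel ticks rather than raw deltas. The sketch
+below handles both parameter layouts listed above and divides by the
+120-per-tick unit. It assumes big-endian 16-bit fields and treats a
+missing ``x`` as 0; both are assumptions made for illustration, and
+``parse_dmwm`` is a hypothetical name.
+
+.. code-block:: python
+
+    import struct
+
+    WHEEL_TICK = 120  # delta for one tick, per the description above
+
+
+    def parse_dmwm(args: bytes):
+        """Return (x_ticks, y_ticks) from the arguments of a "DMWM" message."""
+        if len(args) == 4:          # { int16_t x, int16_t y }
+            x, y = struct.unpack("!hh", args)
+        elif len(args) == 2:        # { int16_t y }
+            x = 0
+            (y,) = struct.unpack("!h", args)
+        else:
+            raise ValueError("unexpected DMWM length: %d" % len(args))
+        return x / WHEEL_TICK, y / WHEEL_TICK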
+
+barrierCmdDClipboard "DCLP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int8_t id, int32_t seq, int8_t mark, char *data }``
+Description:
+ Clipboard data.
+
+ ``id`` = clipboard id
+
+ ``seq`` = sequence number. The sequence number is 0 when sent by the
+ server. Client screens should use the sequence number from
+ the most recent barrierCmdCEnter.
+
+barrierCmdDSetOptions "DSOP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int32_t nb, { int32_t id, int32_t val }[] }``
+Description:
+ Set options. Client should set the given option/value pairs.
+
+ ``nb`` = number of ``{ id, val }`` entries
+
+ ``id`` = option id
+
+ ``val`` = option new value
+
+barrierCmdDFileTransfer "DFTR"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int8_t mark, char *content }``
+Description:
+ Transfer file data.
+
+ * ``mark`` = 0 means the content that follows is the file size
+ * 1 means the content that follows is the chunk data
+ * 2 means the file transfer is finished
+
+barrierCmdDDragInfo "DDRG"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t nb, char *content }``
+Description:
+ Drag information.
+
+ ``nb`` = number of dragging objects
+
+ ``content`` = object's directory
+
+barrierCmdQInfo "QINF"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Query screen info
+
+ Client should reply with a barrierCmdDInfo
+
+barrierCmdEIncompatible "EICV"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ ``{ int16_t major, int16_t minor }``
+Description:
+ Incompatible version.
+
+ ``major`` = major version
+
+ ``minor`` = minor version
+
+barrierCmdEBusy "EBSY"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Name provided when connecting is already in use.
+
+barrierCmdEUnknown "EUNK"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Unknown client. Name provided when connecting is not in primary's
+ screen configuration map.
+
+barrierCmdEBad "EBAD"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+ server -> client
+Parameters:
+ None
+Description:
+ Protocol violation. Server should disconnect after sending this
+ message.
+
diff --git a/docs/interop/bitmaps.rst b/docs/interop/bitmaps.rst
new file mode 100644
index 000000000..1de46febd
--- /dev/null
+++ b/docs/interop/bitmaps.rst
@@ -0,0 +1,1629 @@
+..
+ Copyright 2019 John Snow <jsnow@redhat.com> and Red Hat, Inc.
+ All rights reserved.
+
+ This file is licensed via The FreeBSD Documentation License, the full
+ text of which is included at the end of this document.
+
+====================================
+Dirty Bitmaps and Incremental Backup
+====================================
+
+Dirty Bitmaps are in-memory objects that track writes to block devices. They
+can be used in conjunction with various block job operations to perform
+incremental or differential backup regimens.
+
+This document explains the conceptual mechanisms and provides up-to-date,
+complete and comprehensive documentation of the API used to manipulate them.
+(Hopefully, the "why", "what", and "how".)
+
+The intended audience for this document is developers who are adding QEMU
+backup features to management applications, or power users who run and
+administer QEMU directly via QMP.
+
+.. contents::
+
+Overview
+--------
+
+Bitmaps are bit vectors where each '1' bit in the vector indicates a modified
+("dirty") segment of the corresponding block device. The size of the segment
+that is tracked is the granularity of the bitmap. If the granularity of a
+bitmap is 64K, each '1' bit means that a 64K region as a whole may have
+changed in some way, possibly by as little as one byte.
+
+Smaller granularities mean more accurate tracking of modified disk data, but
+require more computational overhead and larger bitmap sizes. Larger
+granularities mean smaller bitmap sizes, but less targeted backups.
+
+The size of a bitmap (in bytes) can be computed as such:
+ ``size`` = ceil(ceil(``image_size`` / ``granularity``) / 8)
+
+e.g. the size of a 64KiB granularity bitmap on a 2TiB image is:
+ ``size`` = ((2147483648K / 64K) / 8)
+ = 4194304B = 4MiB.
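+
+The same computation, written as a small Python function for checking
+other image sizes and granularities. ``bitmap_size`` is a hypothetical
+helper name; the formula is the one given above.
+
+.. code-block:: python
+
+    def bitmap_size(image_size: int, granularity: int) -> int:
+        """Size in bytes of a bitmap covering image_size bytes."""
+        # -(-a // b) is integer ceiling division.
+        bits = -(-image_size // granularity)
+        return -(-bits // 8)
+
+
+    KiB, TiB = 1024, 1024 ** 4
+    size = bitmap_size(2 * TiB, 64 * KiB)  # the worked example above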
+
+QEMU uses these bitmaps when making incremental backups to know which sections
+of the file to copy out. They are not enabled by default and must be
+explicitly added in order to begin tracking writes.
+
+Bitmaps can be created at any time and can be attached to any arbitrary block
+node in the storage graph, but are most useful conceptually when attached to
+the root node attached to the guest's storage device model.
+
+That is to say: It's likely most useful to track the guest's writes to disk,
+but you could theoretically track things like qcow2 metadata changes by
+attaching the bitmap elsewhere in the storage graph. This is beyond the scope
+of this document.
+
+QEMU supports persisting these bitmaps to disk via the qcow2 image format.
+Bitmaps which are stored or loaded in this way are called "persistent",
+whereas bitmaps that are not are called "transient".
+
+QEMU also supports the migration of both transient bitmaps (tracking any
+arbitrary image format) and persistent bitmaps (qcow2) via live migration.
+
+Supported Image Formats
+-----------------------
+
+QEMU supports all documented features below on the qcow2 image format.
+
+However, qcow2 is only strictly necessary for the persistence feature, which
+writes bitmap data to disk upon close. If persistence is not required for a
+specific use case, all bitmap features excepting persistence are available for
+any arbitrary image format.
+
+For example, Dirty Bitmaps can be combined with the 'raw' image format, but
+any changes to the bitmap will be discarded upon exit.
+
+.. warning:: Transient bitmaps will not be saved on QEMU exit! Persistent
+ bitmaps are available only on qcow2 images.
+
+Dirty Bitmap Names
+------------------
+
+Bitmap objects need a method to reference them in the API. All API-created and
+managed bitmaps have a human-readable name chosen by the user at creation
+time.
+
+- A bitmap's name is unique to the node, but bitmaps attached to different
+ nodes can share the same name. Therefore, all bitmaps are addressed via
+ their (node, name) pair.
+
+- The name of a user-created bitmap cannot be empty ("").
+
+- Transient bitmaps can have JSON unicode names that are effectively not
+ length limited. (QMP protocol may restrict messages to less than 64MiB.)
+
+- Persistent storage formats may impose their own requirements on bitmap names
+ and namespaces. Presently, only qcow2 supports persistent bitmaps. See
+ docs/interop/qcow2.txt for more details on restrictions. Notably:
+
+ - qcow2 bitmap names are limited to between 1 and 1023 bytes long.
+
+ - No two bitmaps saved to the same qcow2 file may share the same name.
+
+- QEMU occasionally uses bitmaps for internal use which have no name. They are
+ hidden from API query calls, cannot be manipulated by the external API, are
+ never persistent, nor ever migrated.
+
+Bitmap Status
+-------------
+
+Dirty Bitmap objects can be queried with the QMP command `query-block
+<qemu-qmp-ref.html#index-query_002dblock>`_, and are visible via the
+`BlockDirtyInfo <qemu-qmp-ref.html#index-BlockDirtyInfo>`_ QAPI structure.
+
+This struct shows the name, granularity, and dirty byte count for each bitmap.
+Additionally, it shows several boolean status indicators:
+
+- ``recording``: This bitmap is recording writes.
+- ``busy``: This bitmap is in-use by an operation.
+- ``persistent``: This bitmap is a persistent type.
+- ``inconsistent``: This bitmap is corrupted and cannot be used.
+
+The ``+busy`` status prohibits you from deleting, clearing, or otherwise
+modifying a bitmap, and happens when the bitmap is being used for a backup
+operation or is in the process of being loaded from a migration. Many of the
+commands documented below will refuse to work on such bitmaps.
+
+The ``+inconsistent`` status similarly prohibits almost all operations,
+notably allowing only the ``block-dirty-bitmap-remove`` operation.
+
+There is also a deprecated ``status`` field of type `DirtyBitmapStatus
+<qemu-qmp-ref.html#index-DirtyBitmapStatus>`_. A bitmap historically had
+five visible states:
+
+ #. ``Frozen``: This bitmap is currently in-use by an operation and is
+ immutable. It can't be deleted, renamed, reset, etc.
+
+ (This is now ``+busy``.)
+
+ #. ``Disabled``: This bitmap is not recording new writes.
+
+ (This is now ``-recording -busy``.)
+
+ #. ``Active``: This bitmap is recording new writes.
+
+ (This is now ``+recording -busy``.)
+
+ #. ``Locked``: This bitmap is in-use by an operation, and is immutable.
+ The difference from "Frozen" was primarily implementation details.
+
+ (This is now ``+busy``.)
+
+ #. ``Inconsistent``: This persistent bitmap was not saved to disk
+ correctly, and can no longer be used. It remains in memory to serve as
+ an indicator of failure.
+
+ (This is now ``+inconsistent``.)
+
+These states are directly replaced by the status indicators and should not be
+used. The difference between ``Frozen`` and ``Locked`` is an implementation
+detail and should not be relevant to external users.
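+
+For management code that still has to interpret the deprecated ``status``
+values, the correspondence listed above can be written down as a table.
+This is a sketch only: ``LEGACY_STATUS_FLAGS`` and ``matches_legacy`` are
+hypothetical names, and a flag missing from the input is treated as false.
+
+.. code-block:: python
+
+    # Legacy status -> boolean indicators it implies (from the list above).
+    LEGACY_STATUS_FLAGS = {
+        "frozen": {"busy": True},
+        "disabled": {"recording": False, "busy": False},
+        "active": {"recording": True, "busy": False},
+        "locked": {"busy": True},
+        "inconsistent": {"inconsistent": True},
+    }
+
+
+    def matches_legacy(info: dict, status: str) -> bool:
+        """True if the boolean fields in info agree with a legacy status."""
+        wanted = LEGACY_STATUS_FLAGS[status.lower()]
+        return all(info.get(key, False) == value
+                   for key, value in wanted.items())
+
+Note that ``frozen`` and ``locked`` map to the same indicator, so the
+reverse mapping from indicators to a legacy status is not unique.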
+
+Basic QMP Usage
+---------------
+
+The primary interface to manipulating bitmap objects is via the QMP
+interface. If you are not familiar, see docs/interop/qmp-intro.txt for a broad
+overview, and `qemu-qmp-ref <qemu-qmp-ref.html>`_ for a full reference of all
+QMP commands.
+
+Supported Commands
+~~~~~~~~~~~~~~~~~~
+
+There are six primary bitmap-management API commands:
+
+- ``block-dirty-bitmap-add``
+- ``block-dirty-bitmap-remove``
+- ``block-dirty-bitmap-clear``
+- ``block-dirty-bitmap-disable``
+- ``block-dirty-bitmap-enable``
+- ``block-dirty-bitmap-merge``
+
+And one related query command:
+
+- ``query-block``
+
+Creation: block-dirty-bitmap-add
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`block-dirty-bitmap-add
+<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dadd>`_:
+
+Creates a new bitmap that tracks writes to the specified node. Granularity,
+persistence, and recording state can be adjusted at creation time.
+
+.. admonition:: Example
+
+ To create a new, actively recording persistent bitmap:
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-add",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0",
+ "persistent": true
+ }
+ }
+
+ <- { "return": {} }
+
+- This bitmap will have a default granularity that matches the cluster size of
+ its associated drive, if available, clamped to between [4KiB, 64KiB]. The
+ current default for qcow2 is 64KiB.
+
+.. admonition:: Example
+
+ To create a new, disabled (``-recording``), transient bitmap that tracks
+ changes in 32KiB segments:
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-add",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap1",
+ "granularity": 32768,
+ "disabled": true
+ }
+ }
+
+ <- { "return": {} }
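+
+A management application would usually build such requests
+programmatically. The following Python sketch constructs the command
+object and leaves out optional arguments that were not given; it does not
+talk to QEMU, and ``bitmap_add`` is a hypothetical helper, not part of any
+QEMU library. Only the argument names used in the examples above are
+assumed to exist.
+
+.. code-block:: python
+
+    import json
+
+
+    def bitmap_add(node, name, persistent=None, granularity=None,
+                   disabled=None):
+        """Build a block-dirty-bitmap-add command as a Python dict."""
+        args = {"node": node, "name": name}
+        optional = {
+            "persistent": persistent,
+            "granularity": granularity,
+            "disabled": disabled,
+        }
+        # Omit unset options so that QEMU applies its own defaults.
+        args.update({k: v for k, v in optional.items() if v is not None})
+        return {"execute": "block-dirty-bitmap-add", "arguments": args}
+
+
+    # Serialize the second example above for sending over a QMP socket.
+    wire = json.dumps(bitmap_add("drive0", "bitmap1",
+                                 granularity=32768, disabled=True))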
+
+Deletion: block-dirty-bitmap-remove
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`block-dirty-bitmap-remove
+<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dremove>`_:
+
+Deletes a bitmap. Bitmaps that are ``+busy`` cannot be removed.
+
+- Deleting a bitmap does not impact any other bitmaps attached to the same
+ node, nor does it affect any backups already created from this bitmap or
+ node.
+
+- Because bitmaps are only unique to the node to which they are attached, you
+ must specify the node/drive name here, too.
+
+- Deleting a persistent bitmap will remove it from the qcow2 file.
+
+.. admonition:: Example
+
+ Remove a bitmap named ``bitmap0`` from node ``drive0``:
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-remove",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ }
+
+ <- { "return": {} }
+
+Resetting: block-dirty-bitmap-clear
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`block-dirty-bitmap-clear
+<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dclear>`_:
+
+Clears all dirty bits from a bitmap. ``+busy`` bitmaps cannot be cleared.
+
+- An incremental backup created from an empty bitmap will copy no data, as if
+ nothing has changed.
+
+.. admonition:: Example
+
+ Clear all dirty bits from bitmap ``bitmap0`` on node ``drive0``:
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-clear",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ }
+
+ <- { "return": {} }
+
+Enabling: block-dirty-bitmap-enable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`block-dirty-bitmap-enable
+<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002denable>`_:
+
+"Enables" a bitmap, setting the ``recording`` bit to true, causing writes to
+begin being recorded. ``+busy`` bitmaps cannot be enabled.
+
+- Bitmaps default to being enabled when created, unless configured otherwise.
+
+- Persistent enabled bitmaps will remember their ``+recording`` status on
+ load.
+
+.. admonition:: Example
+
+ To set ``+recording`` on bitmap ``bitmap0`` on node ``drive0``:
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-enable",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ }
+
+ <- { "return": {} }
+
+Disabling: block-dirty-bitmap-disable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`block-dirty-bitmap-disable
+<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002ddisable>`_:
+
+"Disables" a bitmap, setting the ``recording`` bit to false, causing further
+writes to begin being ignored. ``+busy`` bitmaps cannot be disabled.
+
+.. warning::
+
+ This is potentially dangerous: QEMU makes no effort to stop any writes if
+ there are disabled bitmaps on a node, and will not mark any disabled bitmaps
+ as ``+inconsistent`` if any such writes do happen. Backups made from such
+ bitmaps will not be able to be used to reconstruct a coherent image.
+
+- Disabling a bitmap may be useful for examining which sectors of a disk
+ changed during a specific time period, or for explicit management of
+ differential backup windows.
+
+- Persistent disabled bitmaps will remember their ``-recording`` status on
+ load.
+
+.. admonition:: Example
+
+ To set ``-recording`` on bitmap ``bitmap0`` on node ``drive0``:
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-disable",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ }
+
+ <- { "return": {} }
+
+Merging, Copying: block-dirty-bitmap-merge
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`block-dirty-bitmap-merge
+<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dmerge>`_:
+
+Merges one or more bitmaps into a target bitmap. For any segment that is dirty
+in any one source bitmap, the target bitmap will mark that segment dirty.
+
+- Merge takes one or more bitmaps as a source and merges them together into a
+ single destination, such that any segment marked as dirty in any source
+ bitmap(s) will be marked dirty in the destination bitmap.
+
+- Merge does not create the destination bitmap if it does not exist. A blank
+ bitmap can be created beforehand to achieve the same effect.
+
+- The destination is not cleared prior to merge, so subsequent merge
+ operations will continue to cumulatively mark more segments as dirty.
+
+- If the merge operation should fail, the destination bitmap is guaranteed to
+ be unmodified. The operation may fail if the source or destination bitmaps
+ are busy, or have different granularities.
+
+- Bitmaps can only be merged on the same node. There is only one "node"
+ argument, so all bitmaps must be attached to that same node.
+
+- Copy can be achieved by merging from a single source to an empty
+ destination.
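+
+The rules above can be modelled in a few lines of Python. This is a toy
+model for reasoning about merge semantics, not QEMU's implementation: a
+bitmap is represented as a set of dirty segment indices, and the ``Bitmap``
+class and ``merge`` function are hypothetical. All checks run before the
+target is touched, which is what leaves it unmodified on failure.
+
+.. code-block:: python
+
+    class Bitmap:
+        def __init__(self, granularity, dirty=(), busy=False):
+            self.granularity = granularity
+            self.dirty = set(dirty)  # indices of dirty segments
+            self.busy = busy
+
+
+    def merge(target, sources):
+        """Mark in target every segment that is dirty in any source."""
+        if target.busy or any(s.busy for s in sources):
+            raise RuntimeError("bitmap is busy")
+        if any(s.granularity != target.granularity for s in sources):
+            raise ValueError("granularity mismatch")
+        # The target is not cleared first, so repeated merges accumulate.
+        for s in sources:
+            target.dirty |= s.dirty
+
+Merging a single source into a freshly created, empty target yields a
+copy, as described in the last point above.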
+
+.. admonition:: Example
+
+ Merge the data from ``bitmap0`` into the bitmap ``new_bitmap`` on node
+ ``drive0``. If ``new_bitmap`` was empty prior to this command, this achieves
+ a copy.
+
+ .. code-block:: QMP
+
+ -> { "execute": "block-dirty-bitmap-merge",
+ "arguments": {
+ "node": "drive0",
+ "target": "new_bitmap",
+ "bitmaps": [ "bitmap0" ]
+ }
+ }
+
+ <- { "return": {} }
+
+Querying: query-block
+~~~~~~~~~~~~~~~~~~~~~
+
+`query-block
+<qemu-qmp-ref.html#index-query_002dblock>`_:
+
+Not strictly a bitmaps command, but will return information about any bitmaps
+attached to nodes serving as the root for guest devices.
+
+- The "inconsistent" bit will not appear when it is false, appearing only when
+ the value is true to indicate there is a problem.
+
+.. admonition:: Example
+
+ Query the block sub-system of QEMU. The following json has trimmed irrelevant
+ keys from the response to highlight only the bitmap-relevant portions of the
+ API. This result highlights a bitmap ``bitmap0`` attached to the root node of
+ device ``drive0``.
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "query-block",
+ "arguments": {}
+ }
+
+ <- {
+ "return": [ {
+ "dirty-bitmaps": [ {
+ "status": "active",
+ "count": 0,
+ "busy": false,
+ "name": "bitmap0",
+ "persistent": false,
+ "recording": true,
+ "granularity": 65536
+ } ],
+ "device": "drive0"
+ } ]
+ }
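+
+A reply shaped like the one above can be walked with a few lines of
+Python. This sketch assumes only the keys visible in the trimmed example
+(``device``, ``dirty-bitmaps``, ``busy``, ``inconsistent``); the function
+names are hypothetical.
+
+.. code-block:: python
+
+    def list_bitmaps(query_block_return):
+        """Yield (device, bitmap) pairs from the "return" list."""
+        for dev in query_block_return:
+            for bitmap in dev.get("dirty-bitmaps", []):
+                yield dev.get("device"), bitmap
+
+
+    def usable(bitmap):
+        """A bitmap can be operated on if it is neither busy nor broken."""
+        # "inconsistent" is absent when false, so default it to False.
+        return (not bitmap.get("busy", False)
+                and not bitmap.get("inconsistent", False))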
+
+Bitmap Persistence
+------------------
+
+As outlined in `Supported Image Formats`_, QEMU can persist bitmaps to qcow2
+files. Demonstrated in `Creation: block-dirty-bitmap-add`_, passing
+``persistent: true`` to ``block-dirty-bitmap-add`` will persist that bitmap to
+disk.
+
+Persistent bitmaps will be automatically loaded into memory upon load, and
+will be written back to disk upon close. Their usage should be mostly
+transparent.
+
+However, if QEMU does not get a chance to close the file cleanly, the bitmap
+will be marked as ``+inconsistent`` at next load and considered unsafe to use
+for any operation. At this point, the only valid operation on such bitmaps is
+``block-dirty-bitmap-remove``.
+
+Losing a bitmap in this way does not invalidate any existing backups that have
+been made from this bitmap, but no further backups will be able to be issued
+for this chain.
+
+Transactions
+------------
+
+Transactions are a QMP feature that allows you to submit multiple QMP commands
+at once, being guaranteed that they will all succeed or fail atomically,
+together. The interaction of bitmaps and transactions is demonstrated below.
+
+See `transaction <qemu-qmp-ref.html#index-transaction>`_ in the QMP reference
+for more details.
+
+Justification
+~~~~~~~~~~~~~
+
+Bitmaps can generally be modified at any time, but certain operations often
+only make sense when paired directly with other commands. When a VM is paused,
+it's easy to ensure that no guest writes occur between individual QMP
+commands. When a VM is running, this is difficult to accomplish with
+individual QMP commands that may allow guest writes to occur between each
+command.
+
+For example, using only individual QMP commands, we could:
+
+#. Boot the VM in a paused state.
+#. Create a full drive backup of drive0.
+#. Create a new bitmap, ``bitmap0``, attached to drive0, confident that
+ nothing has been written to drive0 in the meantime.
+#. Resume execution of the VM.
+#. At a later point, issue incremental backups from ``bitmap0``.
+
+At this point, the bitmap and drive backup would be correctly in sync, and
+incremental backups made from this point forward would be correctly aligned to
+the full drive backup.
+
+This is not particularly useful if we decide we want to start incremental
+backups after the VM has been running for a while, for which we would want to
+perform actions such as the following:
+
+#. Boot the VM and begin execution.
+#. Using a single transaction, perform the following operations:
+
+ - Create ``bitmap0``.
+ - Create a full drive backup of ``drive0``.
+
+#. At a later point, issue incremental backups from ``bitmap0``.
+
+.. note:: As a consideration, if ``bitmap0`` is created prior to the full
+ drive backup, incremental backups can still be authored from this
+ bitmap, but they will copy extra segments reflecting writes that
+ occurred prior to the backup operation. Transactions allow us to
+ narrow critical points in time to reduce waste, or, in the other
+ direction, to ensure that no segments are omitted.
+
+Supported Bitmap Transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- ``block-dirty-bitmap-add``
+- ``block-dirty-bitmap-clear``
+- ``block-dirty-bitmap-enable``
+- ``block-dirty-bitmap-disable``
+- ``block-dirty-bitmap-merge``
+
+The usages for these commands are identical to their respective QMP commands,
+but see the sections below for concrete examples.
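+
+As a sketch of how a client might assemble such a request, the Python
+below wraps (type, data) pairs into a ``transaction`` command and rejects
+bitmap commands that are not in the list above. Nothing here contacts
+QEMU; ``transaction`` and ``bitmap_action`` are hypothetical helper names.
+
+.. code-block:: python
+
+    BITMAP_ACTIONS = {
+        "block-dirty-bitmap-add",
+        "block-dirty-bitmap-clear",
+        "block-dirty-bitmap-enable",
+        "block-dirty-bitmap-disable",
+        "block-dirty-bitmap-merge",
+    }
+
+
+    def bitmap_action(kind, **data):
+        """Return a (type, data) pair for a bitmap command listed above."""
+        if kind not in BITMAP_ACTIONS:
+            raise ValueError("not a supported bitmap transaction: " + kind)
+        return kind, data
+
+
+    def transaction(*actions):
+        """Wrap (type, data) pairs into a single transaction command."""
+        return {
+            "execute": "transaction",
+            "arguments": {
+                "actions": [{"type": t, "data": d} for t, d in actions],
+            },
+        }
+
+Other action types, such as ``blockdev-backup``, can be passed to
+``transaction`` directly as plain ``(type, data)`` tuples.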
+
+Incremental Backups - Push Model
+--------------------------------
+
+Incremental backups are simply partial disk images that can be combined with
+other partial disk images on top of a base image to reconstruct a full backup
+from the point in time at which the incremental backup was issued.
+
+The "Push Model" here references the fact that QEMU is "pushing" the modified
+blocks out to a destination. We will be using the `blockdev-backup
+<qemu-qmp-ref.html#index-blockdev_002dbackup>`_ QMP command to create both
+full and incremental backups.
+
+The command is a background job, which has its own QMP API for querying and
+management documented in `Background jobs
+<qemu-qmp-ref.html#Background-jobs>`_.
+
+Example: New Incremental Backup Anchor Point
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As outlined in the Transactions - `Justification`_ section, perhaps we want to
+create a new incremental backup chain attached to a drive.
+
+This example creates a new, full backup of "drive0" and accompanies it with a
+new, empty bitmap that records writes from this point in time forward.
+
+The target can be created with the help of `blockdev-add
+<qemu-qmp-ref.html#index-blockdev_002dadd>`_ or `blockdev-create
+<qemu-qmp-ref.html#index-blockdev_002dcreate>`_ command.
+
+.. note:: Any new writes that happen after this command is issued, even while
+ the backup job runs, will be written locally and not to the backup
+ destination. These writes will be recorded in the bitmap
+ accordingly.
+
+.. code-block:: QMP
+
+ -> {
+ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {
+ "type": "block-dirty-bitmap-add",
+ "data": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive0",
+ "target": "target0",
+ "sync": "full"
+ }
+ }
+ ]
+ }
+ }
+
+ <- { "return": {} }
+
+ <- {
+ "timestamp": {
+ "seconds": 1555436945,
+ "microseconds": 179620
+ },
+ "data": {
+ "status": "created",
+ "id": "drive0"
+ },
+ "event": "JOB_STATUS_CHANGE"
+ }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "status": "concluded",
+ "id": "drive0"
+ },
+ "event": "JOB_STATUS_CHANGE"
+ }
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "status": "null",
+ "id": "drive0"
+ },
+ "event": "JOB_STATUS_CHANGE"
+ }
+
+A full explanation of the job transition semantics and the JOB_STATUS_CHANGE
+event are beyond the scope of this document and will be omitted in all
+subsequent examples; above, several more events have been omitted for brevity.
+
+.. note:: Subsequent examples will omit all events other than
+ BLOCK_JOB_COMPLETED, except where necessary to illustrate workflow differences.
+
+ Omitted events and json objects will be represented by ellipses:
+ ``...``
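+
+A management application would normally build such a transaction rather than
+type it by hand. The following Python sketch only constructs the JSON for the
+first example; the helper name ``full_backup_transaction`` is hypothetical, and
+sending the command over a QMP socket is deliberately left out.
+

```python
import json


def full_backup_transaction(node, bitmap, target):
    """Build a QMP 'transaction' that adds a bitmap and starts a full backup.

    Hypothetical helper: it only assembles the JSON shown in the example
    above and does not talk to QEMU.
    """
    return {
        "execute": "transaction",
        "arguments": {
            "actions": [
                # New, empty bitmap that records writes from this point on.
                {"type": "block-dirty-bitmap-add",
                 "data": {"node": node, "name": bitmap}},
                # Full backup of the same node, started atomically with it.
                {"type": "blockdev-backup",
                 "data": {"device": node, "target": target, "sync": "full"}},
            ]
        },
    }


cmd = full_backup_transaction("drive0", "bitmap0", "target0")
print(json.dumps(cmd, indent=2))
```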
+
+Example: Resetting an Incremental Backup Anchor Point
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If we want to start a new backup chain with an existing bitmap, we can also
+use a transaction to reset the bitmap while making a new full backup:
+
+.. code-block:: QMP
+
+ -> {
+ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {
+ "type": "block-dirty-bitmap-clear",
+ "data": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive0",
+ "target": "target0",
+ "sync": "full"
+ }
+ }
+ ]
+ }
+ }
+
+ <- { "return": {} }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+The result of this example is identical to the first, but we clear an existing
+bitmap instead of adding a new one.
+
+.. tip:: In both of these examples, "bitmap0" is tied conceptually to the
+ creation of new, full backups. This relationship is not saved or
+ remembered by QEMU; it is up to the operator or management layer to
+ remember which bitmaps are associated with which backups.
+
+Example: First Incremental Backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Create a full backup and sync it to a dirty bitmap using any method:
+
+ - Either of the two live backup methods demonstrated above,
+ - Using QMP commands with the VM paused as in the `Justification`_ section,
+ or
+ - With the VM offline, manually copy the image and start the VM in a paused
+ state, taking care to add a new bitmap before the VM begins execution.
+
+ Whichever method is chosen, let's assume that at the end of this step:
+
+ - The full backup is named ``drive0.full.qcow2``.
+ - The bitmap we created is named ``bitmap0``, attached to ``drive0``.
+
+#. Create a destination image for the incremental backup that utilizes the
+ full backup as a backing image.
+
+ - Let's assume the new incremental image is named ``drive0.inc0.qcow2``:
+
+ .. code:: bash
+
+ $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
+ -b drive0.full.qcow2 -F qcow2
+
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Issue an incremental backup command:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "target0",
+ "sync": "incremental"
+ }
+ }
+
+ <- { "return": {} }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+This copies any blocks modified since the full backup was created into the
+``drive0.inc0.qcow2`` file. During the operation, ``bitmap0`` is marked
+``+busy``. If the operation is successful, ``bitmap0`` will be cleared to
+reflect the "incremental" backup regimen, which only copies out new changes
+from each incremental backup.
+
+.. note:: Any new writes that occur after the backup operation starts do not
+ get copied to the destination. The backup's "point in time" is when
+ the backup starts, not when it ends. These writes are recorded in a
+ special bitmap that gets re-added to bitmap0 when the backup ends so
+ that the next incremental backup can copy them out.
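+
+The bookkeeping described above can be illustrated with a toy model. This is
+a sketch for intuition only, not QEMU's implementation: the class
+``ToyBitmap`` and its methods are hypothetical, and clusters are represented
+by plain integers.
+

```python
class ToyBitmap:
    """Toy model of a dirty bitmap used for incremental backups.

    Hypothetical illustration only; it mirrors the behaviour described in
    the text, not QEMU's internal data structures.
    """

    def __init__(self):
        self.dirty = set()    # clusters written since the last good backup
        self._frozen = None   # snapshot being copied out by a running backup
        self._during = set()  # writes that arrive while the backup runs

    def write(self, cluster):
        # Guest writes always stay local; only the tracking location differs.
        if self._frozen is None:
            self.dirty.add(cluster)
        else:
            self._during.add(cluster)

    def start_backup(self):
        # The backup's "point in time" is fixed here, when the job starts.
        self._frozen = set(self.dirty)
        self._during = set()
        return sorted(self._frozen)  # clusters the job will copy out

    def finish_backup(self, success):
        if success:
            # Cleared, then the writes seen during the job are re-added.
            self.dirty = self._during
        else:
            # Nothing is cleared; new writes are simply added on top.
            self.dirty = self._frozen | self._during
        self._frozen = None
        self._during = set()


bm = ToyBitmap()
bm.write(1)
bm.write(5)
to_copy = bm.start_backup()
bm.write(7)                      # arrives while the backup is running
bm.finish_backup(success=True)
print(to_copy, sorted(bm.dirty))
```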
+
+Example: Second Incremental Backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Create a new destination image for the incremental backup that points to
+ the previous one, e.g.: ``drive0.inc1.qcow2``
+
+ .. code:: bash
+
+ $ qemu-img create -f qcow2 drive0.inc1.qcow2 \
+ -b drive0.inc0.qcow2 -F qcow2
+
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc1.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Issue a new incremental backup command. The only difference here is that we
+ have changed the target image below.
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "target0",
+ "sync": "incremental"
+ }
+ }
+
+ <- { "return": {} }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+Because the first incremental backup from the previous example completed
+successfully, ``bitmap0`` was synchronized with ``drive0.inc0.qcow2``. Here,
+we use ``bitmap0`` again to create a new incremental backup that targets the
+previous one, creating a chain of three images:
+
+.. admonition:: Diagram
+
+ .. code:: text
+
+ +-------------------+ +-------------------+ +-------------------+
+ | drive0.full.qcow2 |<--| drive0.inc0.qcow2 |<--| drive0.inc1.qcow2 |
+ +-------------------+ +-------------------+ +-------------------+
+
+Each new incremental backup re-synchronizes the bitmap to the latest backup
+authored, allowing a user to continue to "consume" it to create new backups on
+top of an existing chain.
+
+In the above diagram, neither drive0.inc1.qcow2 nor drive0.inc0.qcow2 is a
+complete image by itself; each relies on its backing chain to reconstruct
+a full image. The dependency terminates with each full backup.
+
+Each backup in this chain remains independent, and is unchanged by new entries
+made later in the chain. For instance, drive0.inc0.qcow2 remains a perfectly
+valid backup of the disk as it was when that backup was issued.
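+
+Since QEMU does not remember which backups belong together, a management
+layer has to track the chain itself. The sketch below assumes a hypothetical
+``backing_of`` mapping (image name to backing image name) maintained by that
+layer, and walks it to list every image needed to restore a given backup.
+

```python
def backing_chain(image, backing_of):
    """Return the images needed to restore `image`, ordered base first.

    `backing_of` is a hypothetical record kept by the management layer that
    maps each incremental image to the image it was created on top of.
    Full backups have no entry.
    """
    chain = []
    seen = set()
    while image is not None:
        if image in seen:
            raise ValueError("backing chain loops at " + image)
        seen.add(image)
        chain.append(image)
        image = backing_of.get(image)  # None once a full backup is reached
    return list(reversed(chain))


backing_of = {
    "drive0.inc0.qcow2": "drive0.full.qcow2",
    "drive0.inc1.qcow2": "drive0.inc0.qcow2",
}
chain = backing_chain("drive0.inc1.qcow2", backing_of)
print(chain)
```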
+
+Example: Incremental Push Backups without Backing Files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Backup images are best kept off-site, so we often will not have the preceding
+backups in a chain available to link against. This is not a problem at backup
+time; we simply do not set the backing image when creating the destination
+image:
+
+#. Create a new destination image with no backing file set. We will need to
+ specify the size of the base image, because the backing file isn't
+ available for QEMU to use to determine it.
+
+ .. code:: bash
+
+ $ qemu-img create -f qcow2 drive0.inc2.qcow2 64G
+
+ .. note:: Alternatively, you can omit ``mode: "existing"`` from the push
+ backup commands to have QEMU create an image without a backing
+ file for you, but you lose control over format options like
+ compatibility and preallocation presets.
+
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc2.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Issue a new incremental backup command. Apart from the new destination
+ image, there is no difference from the last two examples.
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "target0",
+ "sync": "incremental"
+ }
+ }
+
+ <- { "return": {} }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+The only difference from the perspective of the user is that you will need to
+set the backing image when attempting to restore the backup:
+
+.. code:: bash
+
+ $ qemu-img rebase drive0.inc2.qcow2 \
+ -u -b drive0.inc1.qcow2
+
+This uses the "unsafe" rebase mode to simply set the backing file to a file
+that isn't present.
+
+It is also possible to use ``--image-opts`` to specify the entire backing
+chain by hand as an ephemeral property at runtime, but that is beyond the
+scope of this document.
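+
+When several images in a chain were created without backing files, the
+rebase step has to be repeated for each link. The sketch below only builds
+the argument lists and does not run anything; the helper name
+``rebase_commands`` is hypothetical, and the flags mirror the command shown
+above. Depending on the ``qemu-img`` version, an explicit backing format
+(``-F``) may also be needed, which is an assumption not covered here.
+

```python
def rebase_commands(chain):
    """Build `qemu-img rebase -u -b` argument lists for a restore chain.

    `chain` lists the images base first, e.g. the full backup followed by
    each incremental backup in the order they were taken. Hypothetical
    helper: it returns the commands instead of executing them.
    """
    commands = []
    for backing, image in zip(chain, chain[1:]):
        # "Unsafe" mode only rewrites the backing file name in the header.
        commands.append(["qemu-img", "rebase", "-u", "-b", backing, image])
    return commands


cmds = rebase_commands([
    "drive0.full.qcow2",
    "drive0.inc0.qcow2",
    "drive0.inc1.qcow2",
    "drive0.inc2.qcow2",
])
for argv in cmds:
    print(" ".join(argv))
```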
+
+Example: Multi-drive Incremental Backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Assume we have a VM with two drives, "drive0" and "drive1", and we wish to back
+both of them up such that the two backups represent the same crash-consistent
+point in time.
+
+#. For each drive, create an empty image:
+
+ .. code:: bash
+
+ $ qemu-img create -f qcow2 drive0.full.qcow2 64G
+ $ qemu-img create -f qcow2 drive1.full.qcow2 64G
+
+#. Add target block nodes:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.full.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target1",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive1.full.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Create a full (anchor) backup for each drive, with accompanying bitmaps:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {
+ "type": "block-dirty-bitmap-add",
+ "data": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+ },
+ {
+ "type": "block-dirty-bitmap-add",
+ "data": {
+ "node": "drive1",
+ "name": "bitmap0"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive0",
+ "target": "target0",
+ "sync": "full"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive1",
+ "target": "target1",
+ "sync": "full"
+ }
+ }
+ ]
+ }
+ }
+
+ <- { "return": {} }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive1",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+#. Later, create new destination images for each of the incremental backups
+ that point to their respective full backups:
+
+ .. code:: bash
+
+ $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
+ -b drive0.full.qcow2 -F qcow2
+ $ qemu-img create -f qcow2 drive1.inc0.qcow2 \
+ -b drive1.full.qcow2 -F qcow2
+
+#. Add target block nodes:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target1",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive1.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Issue a multi-drive incremental push backup transaction:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "sync": "incremental",
+ "target": "target0"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive1",
+ "bitmap": "bitmap0",
+ "sync": "incremental",
+ "target": "target1"
+ }
+ }
+ ]
+ }
+ }
+
+ <- { "return": {} }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive1",
+ "type": "backup",
+ "speed": 0,
+ "len": 68719476736,
+ "offset": 68719476736
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+ ...
+
+Push Backup Errors & Recovery
+-----------------------------
+
+In the event of an error that occurs after a push backup job is successfully
+launched, either by an individual QMP command or a QMP transaction, the user
+will receive a ``BLOCK_JOB_COMPLETED`` event with a failure message,
+accompanied by a ``BLOCK_JOB_ERROR`` event.
+
+In the case of a job being cancelled, the user will receive a
+``BLOCK_JOB_CANCELLED`` event instead of a pair of COMPLETED and ERROR
+events.
+
+In either failure case, the bitmap used for the failed operation is not
+cleared. It will contain all of the dirty bits it did at the start of the
+operation, plus any new bits that got marked during the operation.
+
+Effectively, the "point in time" that a bitmap is recording differences
+against is kept at the issuance of the last successful incremental backup,
+instead of being moved forward to the start of this now-failed backup.
+
+Once the underlying problem is addressed (e.g. more storage space is allocated
+on the destination), the incremental backup command can be retried with the
+same bitmap.
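+
+A client that watches the event stream has to tell these outcomes apart. The
+following sketch assumes events have already been parsed into Python
+dictionaries; the function name ``job_outcome`` and its return strings are
+hypothetical. It relies only on what the examples in this document show: a
+failed job's ``BLOCK_JOB_COMPLETED`` event carries an ``error`` member.
+

```python
def job_outcome(event):
    """Classify a parsed QMP event for a backup job.

    Returns "completed", "failed", "cancelled", or None for events that do
    not end a job. Hypothetical helper based on the events shown in this
    document.
    """
    name = event.get("event")
    data = event.get("data", {})
    if name == "BLOCK_JOB_CANCELLED":
        return "cancelled"
    if name == "BLOCK_JOB_COMPLETED":
        # Failure is reported through the "error" member of the same event.
        return "failed" if "error" in data else "completed"
    return None


failed = {
    "event": "BLOCK_JOB_COMPLETED",
    "data": {"device": "drive0", "type": "backup",
             "error": "No space left on device"},
}
outcome = job_outcome(failed)
# After "failed" or "cancelled" the bitmap was not cleared, so the same
# bitmap can be used again once the underlying problem is fixed.
print(outcome)
```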
+
+Example: Individual Failures
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Incremental Push Backup jobs that fail individually behave simply as
+described above. This example demonstrates the single-job failure case:
+
+#. Create a target image:
+
+ .. code:: bash
+
+ $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
+ -b drive0.full.qcow2 -F qcow2
+
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Attempt to create an incremental backup via QMP:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "target0",
+ "sync": "incremental"
+ }
+ }
+
+ <- { "return": {} }
+
+#. Receive a pair of events indicating failure:
+
+ .. code-block:: QMP
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "action": "report",
+ "operation": "write"
+ },
+ "event": "BLOCK_JOB_ERROR"
+ }
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "speed": 0,
+ "offset": 0,
+ "len": 67108864,
+ "error": "No space left on device",
+ "device": "drive0",
+ "type": "backup"
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+#. Remove target node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-del",
+ "arguments": {
+ "node-name": "target0"
+ }
+ }
+
+ <- { "return": {} }
+
+#. Delete the failed image, and re-create it.
+
+ .. code:: bash
+
+ $ rm drive0.inc0.qcow2
+ $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
+ -b drive0.full.qcow2 -F qcow2
+
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+#. Retry the command after fixing the underlying problem, such as
+ freeing up space on the backup volume:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "target0",
+ "sync": "incremental"
+ }
+ }
+
+ <- { "return": {} }
+
+#. Receive confirmation that the job completed successfully:
+
+ .. code-block:: QMP
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 67108864,
+ "offset": 67108864
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+Example: Partial Transactional Failures
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QMP commands like `blockdev-backup
+<qemu-qmp-ref.html#index-blockdev_002dbackup>`_
+conceptually only start a job, and so transactions containing these commands
+may succeed even if the jobs they create later fail. This might have surprising
+interactions with notions of how a "transaction" ought to behave.
+
+This distinction means that on occasion, a transaction containing such job
+launching commands may appear to succeed and return success, but later
+individual jobs associated with the transaction may fail. It is possible that
+a management application may have to deal with a partial backup failure after
+a "successful" transaction.
+
+If multiple backup jobs are specified in a single transaction and one of those
+jobs fails, it will not interact with the other backup jobs in any way by
+default. The job(s) that succeeded will clear the dirty bitmap associated with
+the operation, but the job(s) that failed will not. It is therefore not safe
+to delete any incremental backups that were created successfully in this
+scenario, even though others failed.
+
+This example illustrates a transaction with two backup jobs, where one fails
+and one succeeds:
+
+#. Issue the transaction to start a backup of both drives.
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "sync": "incremental",
+ "target": "target0"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive1",
+ "bitmap": "bitmap0",
+ "sync": "incremental",
+ "target": "target1"
+ }
+ }]
+ }
+ }
+
+#. Receive notice that the Transaction was accepted, and jobs were
+ launched:
+
+ .. code-block:: QMP
+
+ <- { "return": {} }
+
+#. Receive notice that the first job has completed:
+
+ .. code-block:: QMP
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 67108864,
+ "offset": 67108864
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+#. Receive notice that the second job has failed:
+
+ .. code-block:: QMP
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive1",
+ "action": "report",
+ "operation": "read"
+ },
+ "event": "BLOCK_JOB_ERROR"
+ }
+
+ ...
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "speed": 0,
+ "offset": 0,
+ "len": 67108864,
+ "error": "Input/output error",
+ "device": "drive1",
+ "type": "backup"
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+At the conclusion of the above example, ``drive0.inc0.qcow2`` is valid and
+must be kept, but ``drive1.inc0.qcow2`` is incomplete and should be
+deleted. If a VM-wide incremental backup of all drives at a point-in-time is
+to be made, new backups for both drives will need to be made, taking into
+account that a new incremental backup for drive0 needs to be based on top of
+``drive0.inc0.qcow2``.
+
+For this example, an incremental backup for ``drive0`` was created, but not
+for ``drive1``. The last VM-wide crash-consistent backup that is available in
+this case is the full backup:
+
+.. code:: text
+
+ [drive0.full.qcow2] <-- [drive0.inc0.qcow2]
+ [drive1.full.qcow2]
+
+To repair this, issue a new incremental backup across both drives. The result
+will be backup chains that resemble the following:
+
+.. code:: text
+
+ [drive0.full.qcow2] <-- [drive0.inc0.qcow2] <-- [drive0.inc1.qcow2]
+ [drive1.full.qcow2] <-------------------------- [drive1.inc1.qcow2]
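+
+The clean-up decision after such a partial failure can be sketched as
+follows. This assumes the default ``individual`` completion mode and a
+management layer that recorded one outcome per drive; the function
+``triage`` and the outcome strings are hypothetical.
+

```python
def triage(outcomes, images):
    """Decide what to do with the target images after a transaction.

    `outcomes` maps drive name to "completed", "failed" or "cancelled";
    `images` maps drive name to the target image of this backup round.
    Hypothetical helper for the default (individual) completion mode.
    """
    keep, delete = [], []
    for drive in sorted(outcomes):
        if outcomes[drive] == "completed":
            # The bitmap was cleared, so this image is now part of the chain.
            keep.append(images[drive])
        else:
            # The bitmap was left untouched; the partial image is useless.
            delete.append(images[drive])
    # A VM-wide point in time exists only if every drive succeeded.
    consistent = not delete
    return keep, delete, consistent


keep, delete, consistent = triage(
    {"drive0": "completed", "drive1": "failed"},
    {"drive0": "drive0.inc0.qcow2", "drive1": "drive1.inc0.qcow2"},
)
print(keep, delete, consistent)
```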
+
+Example: Grouped Completion Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While jobs launched by transactions normally complete or fail individually,
+it's possible to instruct them to complete or fail together as a group. QMP
+transactions take an optional properties structure that can affect the
+behavior of the transaction.
+
+The ``completion-mode`` transaction property can be either ``individual``
+which is the default legacy behavior described above, or ``grouped``, detailed
+below.
+
+In ``grouped`` completion mode, no jobs will report success until all jobs are
+ready to report success. If any job fails, all other jobs will be cancelled.
+
+Regardless of whether a participating incremental backup job failed or was
+cancelled, the associated bitmaps will all be held at their existing
+points-in-time, as in individual failure cases.
+
+Here's the same multi-drive backup scenario from `Example: Partial
+Transactional Failures`_, but with the ``grouped`` completion-mode property
+applied:
+
+#. Issue the multi-drive incremental backup transaction:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "transaction",
+ "arguments": {
+ "properties": {
+ "completion-mode": "grouped"
+ },
+ "actions": [
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "sync": "incremental",
+ "target": "target0"
+ }
+ },
+ {
+ "type": "blockdev-backup",
+ "data": {
+ "device": "drive1",
+ "bitmap": "bitmap0",
+ "sync": "incremental",
+ "target": "target1"
+ }
+ }]
+ }
+ }
+
+#. Receive notice that the Transaction was accepted, and jobs were launched:
+
+ .. code-block:: QMP
+
+ <- { "return": {} }
+
+#. Receive notification that the backup job for ``drive1`` has failed:
+
+ .. code-block:: QMP
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive1",
+ "action": "report",
+ "operation": "read"
+ },
+ "event": "BLOCK_JOB_ERROR"
+ }
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "speed": 0,
+ "offset": 0,
+ "len": 67108864,
+ "error": "Input/output error",
+ "device": "drive1",
+ "type": "backup"
+ },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+
+#. Receive notification that the job for ``drive0`` has been cancelled:
+
+ .. code-block:: QMP
+
+ <- {
+ "timestamp": {...},
+ "data": {
+ "device": "drive0",
+ "type": "backup",
+ "speed": 0,
+ "len": 67108864,
+ "offset": 16777216
+ },
+ "event": "BLOCK_JOB_CANCELLED"
+ }
+
+At the conclusion of *this* example, both jobs have been aborted due to a
+failure. Both destination images should be deleted and are no longer of use.
+
+The transaction as a whole can simply be re-issued at a later time.
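+
+The individual and grouped transactions in these examples differ only in the
+``properties`` member, so a single builder can produce both. This is a
+minimal sketch; the helper name ``incremental_transaction`` and the shape of
+its ``jobs`` argument are assumptions made for illustration.
+

```python
def incremental_transaction(jobs, grouped=False):
    """Build a multi-drive incremental backup transaction.

    `jobs` is a list of (device, bitmap, target) tuples. With
    grouped=True the "completion-mode" property is set to "grouped";
    otherwise the property is omitted and the default applies.
    Hypothetical helper: it only assembles the JSON structure.
    """
    arguments = {
        "actions": [
            {"type": "blockdev-backup",
             "data": {"device": device, "bitmap": bitmap,
                      "sync": "incremental", "target": target}}
            for device, bitmap, target in jobs
        ]
    }
    if grouped:
        arguments["properties"] = {"completion-mode": "grouped"}
    return {"execute": "transaction", "arguments": arguments}


txn = incremental_transaction(
    [("drive0", "bitmap0", "target0"), ("drive1", "bitmap0", "target1")],
    grouped=True,
)
print(txn["arguments"]["properties"])
```

+Because a failed or cancelled job leaves its bitmap untouched, the same
+structure can be sent again once fresh target images have been prepared.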
+
+.. raw:: html
+
+ <!--
+ The FreeBSD Documentation License
+
+ Redistribution and use in source (ReST) and 'compiled' forms (SGML, HTML,
+ PDF, PostScript, RTF and so forth) with or without modification, are
+ permitted provided that the following conditions are met:
+
+ Redistributions of source code (ReST) must retain the above copyright notice,
+ this list of conditions and the following disclaimer of this file unmodified.
+
+ Redistributions in compiled form (transformed to other DTDs, converted to
+ PDF, PostScript, RTF and other formats) must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ THIS DOCUMENTATION IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
+ IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+ LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENTATION, EVEN IF ADVISED OF
+ THE POSSIBILITY OF SUCH DAMAGE.
+ -->
diff --git a/docs/interop/dbus-vmstate.rst b/docs/interop/dbus-vmstate.rst
new file mode 100644
index 000000000..1d719c1c6
--- /dev/null
+++ b/docs/interop/dbus-vmstate.rst
@@ -0,0 +1,74 @@
+=============
+D-Bus VMState
+=============
+
+Introduction
+============
+
+The QEMU dbus-vmstate object aims to migrate the data of helpers running
+on a QEMU D-Bus bus. (Refer to the :doc:`dbus` document for
+some recommendations on D-Bus usage.)
+
+Upon migration, QEMU will go through the queue of
+``org.qemu.VMState1`` D-Bus name owners and query their ``Id``. It
+must be unique among the helpers.
+
+It will then save arbitrary data of each Id to be transferred in the
+migration stream and restored/loaded at the corresponding destination
+helper.
+
+For now, the data amount to be transferred is arbitrarily limited to
+1 MiB. The state must be saved quickly (a fraction of a second). (D-Bus
+imposes a time limit on reply anyway, and migration would fail if data
+isn't given quickly enough.)
+
+dbus-vmstate object can be configured with the expected list of
+helpers by setting its ``id-list`` property, with a comma-separated
+``Id`` list.
+
+Interface
+=========
+
+On object path ``/org/qemu/VMState1``, the following
+``org.qemu.VMState1`` interface should be implemented:
+
+.. code:: xml
+
+ <interface name="org.qemu.VMState1">
+ <property name="Id" type="s" access="read"/>
+ <method name="Load">
+ <arg type="ay" name="data" direction="in"/>
+ </method>
+ <method name="Save">
+ <arg type="ay" name="data" direction="out"/>
+ </method>
+ </interface>
+
+"Id" property
+-------------
+
+A string that identifies the helper uniquely. (maximum 256 bytes
+including terminating NUL byte)
+
+.. note::
+
+ The helper ID namespace is a separate namespace. In particular, it is not
+ related to QEMU "id" used in -object/-device objects.
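+
+The constraints on ``Id`` (unique among helpers, at most 256 bytes including
+the terminating NUL) can be checked before a migration is attempted. This is
+a sketch under stated assumptions: the function ``check_helper_ids`` is
+hypothetical, the Ids are assumed to have been collected already, and the
+length is measured on the UTF-8 encoding used for D-Bus strings.
+

```python
MAX_ID_BYTES = 256  # limit stated above, including the terminating NUL


def check_helper_ids(ids):
    """Validate helper Ids and return them as a set.

    Raises ValueError if an Id is too long or appears more than once.
    Hypothetical helper; it does not query D-Bus itself.
    """
    seen = set()
    for helper_id in ids:
        # +1 accounts for the terminating NUL byte.
        if len(helper_id.encode("utf-8")) + 1 > MAX_ID_BYTES:
            raise ValueError("Id too long: " + helper_id[:32] + "...")
        if helper_id in seen:
            raise ValueError("duplicate Id: " + helper_id)
        seen.add(helper_id)
    return seen


valid = check_helper_ids(["helper-a", "helper-b"])
print(sorted(valid))
```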
+
+Load(in u8[] bytes) method
+--------------------------
+
+The method called on the destination with the state to restore.
+
+The helper may be initially started in a waiting state (with
+an --incoming argument for example), and it may resume on success.
+
+An error may be returned to the caller.
+
+Save(out u8[] bytes) method
+---------------------------
+
+The method called on the source to get the current state to be
+migrated. The helper should continue to run normally.
+
+An error may be returned to the caller.
diff --git a/docs/interop/dbus.rst b/docs/interop/dbus.rst
new file mode 100644
index 000000000..be596d3f4
--- /dev/null
+++ b/docs/interop/dbus.rst
@@ -0,0 +1,110 @@
+=====
+D-Bus
+=====
+
+Introduction
+============
+
+QEMU may be running with various helper processes involved:
+ - vhost-user* processes (gpu, virtfs, input, etc...)
+ - TPM emulation (or other devices)
+ - user networking (slirp)
+ - network services (DHCP/DNS, samba/ftp etc)
+ - background tasks (compression, streaming etc)
+ - client UI
+ - admin & cli
+
+Having several processes allows stricter security rules, as well as
+greater modularity.
+
+While QEMU itself uses QMP as primary IPC (and Spice/VNC for remote
+display), D-Bus is the de facto IPC of choice on Unix systems. The
+wire format is machine friendly, good bindings exist for various
+languages, and there are various tools available.
+
+Using a bus, helper processes can discover and communicate with each
+other easily, without going through QEMU. The bus topology is also
+easier to apprehend and debug than a mesh. However, it is wise to
+consider the security aspects of it.
+
+Security
+========
+
+A QEMU D-Bus bus should be private to a single VM. Thus, only
+cooperative tasks are running on the same bus to serve the VM.
+
+D-Bus, the protocol and standard, doesn't have mechanisms to enforce
+security between peers once the connection is established. Peers may
+have additional mechanisms to enforce security rules, based for
+example on UNIX credentials.
+
+The daemon can control which peers can send/recv messages using
+various metadata attributes; however, this alone is not generally
+sufficient to make the deployment secure. The semantics of the actual
+methods implemented using D-Bus are just as critical. Peers need to
+carefully validate any information they received from a peer with a
+different trust level.
+
+dbus-daemon policy
+------------------
+
+dbus-daemon can enforce various policies based on the UID/GID of the
+processes that are connected to it. It is thus a good idea to run
+helpers as different UID from QEMU and set appropriate policies.
+
+Depending on the use case, you may choose different scenarios:
+
+ - Everything the same UID
+
+ - Convenient for developers
+ - Improved reliability - crash of one part doesn't take
+ out entire VM
+ - No security benefit over traditional QEMU, unless additional
+ controls such as SELinux or AppArmor are applied
+
+ - Two UIDs, one for QEMU, one for dbus & helpers
+
+ - Moderately improved user based security isolation
+
+ - Many UIDs, one for QEMU, one for dbus and one for each helper
+
+ - Best user based security isolation
+ - Complex to manage: distinct UIDs are needed for each VM
+
+For example, to allow only the ``qemu`` user to talk to the
+``org.qemu.Helper1`` service owned by ``qemu-helper``, a dbus-daemon policy
+may contain:
+
+.. code:: xml
+
+ <policy user="qemu">
+ <allow send_destination="org.qemu.Helper1"/>
+ <allow receive_sender="org.qemu.Helper1"/>
+ </policy>
+
+ <policy user="qemu-helper">
+ <allow own="org.qemu.Helper1"/>
+ </policy>
+
+
+dbus-daemon can also perform SELinux checks based on the security
+context of the source and the target. For example, ``virtiofs_t``
+could be allowed to send a message to ``svirt_t``, but ``virtiofs_t``
+wouldn't be allowed to send a message to ``virtiofs_t``.
+
+See dbus-daemon man page for details.
+
+Guidelines
+==========
+
+When implementing new D-Bus interfaces, it is recommended to follow
+the "D-Bus API Design Guidelines":
+https://dbus.freedesktop.org/doc/dbus-api-design.html
+
+The "org.qemu.*" prefix is reserved for services implemented &
+distributed by the QEMU project.
+
+QEMU Interfaces
+===============
+
+:doc:`dbus-vmstate`
diff --git a/docs/interop/firmware.json b/docs/interop/firmware.json
new file mode 100644
index 000000000..8d8b0be03
--- /dev/null
+++ b/docs/interop/firmware.json
@@ -0,0 +1,593 @@
+# -*- Mode: Python -*-
+# vim: filetype=python
+#
+# Copyright (C) 2018 Red Hat, Inc.
+#
+# Authors:
+# Daniel P. Berrange <berrange@redhat.com>
+# Laszlo Ersek <lersek@redhat.com>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later. See the COPYING file in the top-level directory.
+
+##
+# = Firmware
+##
+
+{ 'include' : 'machine.json' }
+{ 'include' : 'block-core.json' }
+
+##
+# @FirmwareOSInterface:
+#
+# Lists the firmware-OS interface types provided by various firmware
+# that is commonly used with QEMU virtual machines.
+#
+# @bios: Traditional x86 BIOS interface. For example, firmware built
+# from the SeaBIOS project usually provides this interface.
+#
+# @openfirmware: The interface is defined by the (historical) IEEE
+# 1275-1994 standard. Examples for firmware projects that
+# provide this interface are: OpenBIOS and SLOF.
+#
+# @uboot: Firmware interface defined by the U-Boot project.
+#
+# @uefi: Firmware interface defined by the UEFI specification. For
+# example, firmware built from the edk2 (EFI Development Kit II)
+# project usually provides this interface.
+#
+# Since: 3.0
+##
+{ 'enum' : 'FirmwareOSInterface',
+ 'data' : [ 'bios', 'openfirmware', 'uboot', 'uefi' ] }
+
+##
+# @FirmwareDevice:
+#
+# Defines the device types that firmware can be mapped into.
+#
+# @flash: The firmware executable and its accompanying NVRAM file are to
+# be mapped into a pflash chip each.
+#
+# @kernel: The firmware is to be loaded like a Linux kernel. This is
+# similar to @memory but may imply additional processing that
+# is specific to the target architecture and machine type.
+#
+# @memory: The firmware is to be mapped into memory.
+#
+# Since: 3.0
+##
+{ 'enum' : 'FirmwareDevice',
+ 'data' : [ 'flash', 'kernel', 'memory' ] }
+
+##
+# @FirmwareTarget:
+#
+# Defines the machine types that firmware may execute on.
+#
+# @architecture: Determines the emulation target (the QEMU system
+# emulator) that can execute the firmware.
+#
+# @machines: Lists the machine types (known by the emulator that is
+# specified through @architecture) that can execute the
+# firmware. Elements of @machines are supposed to be concrete
+# machine types, not aliases. Glob patterns are understood,
+# which is especially useful for versioned machine types.
+# (For example, the glob pattern "pc-i440fx-*" matches
+# "pc-i440fx-2.12".) On the QEMU command line, "-machine
+# type=..." specifies the requested machine type (but that
+# option does not accept glob patterns).
+#
+# Since: 3.0
+##
+{ 'struct' : 'FirmwareTarget',
+ 'data' : { 'architecture' : 'SysEmuTarget',
+ 'machines' : [ 'str' ] } }
+
+##
+# @FirmwareFeature:
+#
+# Defines the features that firmware may support, and the platform
+# requirements that firmware may present.
+#
+# @acpi-s3: The firmware supports S3 sleep (suspend to RAM), as defined
+# in the ACPI specification. On the "pc-i440fx-*" machine
+# types of the @i386 and @x86_64 emulation targets, S3 can be
+# enabled with "-global PIIX4_PM.disable_s3=0" and disabled
+# with "-global PIIX4_PM.disable_s3=1". On the "pc-q35-*"
+# machine types of the @i386 and @x86_64 emulation targets, S3
+# can be enabled with "-global ICH9-LPC.disable_s3=0" and
+# disabled with "-global ICH9-LPC.disable_s3=1".
+#
+# @acpi-s4: The firmware supports S4 hibernation (suspend to disk), as
+# defined in the ACPI specification. On the "pc-i440fx-*"
+# machine types of the @i386 and @x86_64 emulation targets, S4
+# can be enabled with "-global PIIX4_PM.disable_s4=0" and
+# disabled with "-global PIIX4_PM.disable_s4=1". On the
+# "pc-q35-*" machine types of the @i386 and @x86_64 emulation
+# targets, S4 can be enabled with "-global
+# ICH9-LPC.disable_s4=0" and disabled with "-global
+# ICH9-LPC.disable_s4=1".
+#
+# @amd-sev: The firmware supports running under AMD Secure Encrypted
+# Virtualization, as specified in the AMD64 Architecture
+# Programmer's Manual. QEMU command line options related to
+# this feature are documented in
+# "docs/amd-memory-encryption.txt".
+#
+# @amd-sev-es: The firmware supports running under AMD Secure Encrypted
+# Virtualization - Encrypted State, as specified in the AMD64
+# Architecture Programmer's Manual. QEMU command line options
+# related to this feature are documented in
+# "docs/amd-memory-encryption.txt".
+#
+# @enrolled-keys: The variable store (NVRAM) template associated with
+# the firmware binary has the UEFI Secure Boot
+# operational mode turned on, with certificates
+# enrolled.
+#
+# @requires-smm: The firmware requires the platform to emulate SMM
+# (System Management Mode), as defined in the AMD64
+# Architecture Programmer's Manual, and in the Intel(R)64
+# and IA-32 Architectures Software Developer's Manual. On
+# the "pc-q35-*" machine types of the @i386 and @x86_64
+# emulation targets, SMM emulation can be enabled with
+# "-machine smm=on". (On the "pc-q35-*" machine types of
+# the @i386 emulation target, @requires-smm presents
+# further CPU requirements; one combination known to work
+# is "-cpu coreduo,nx=off".) If the firmware is marked as
+# both @secure-boot and @requires-smm, then write
+# accesses to the pflash chip (NVRAM) that holds the UEFI
+# variable store must be restricted to code that executes
+# in SMM, using the additional option "-global
+# driver=cfi.pflash01,property=secure,value=on".
+# Furthermore, a large guest-physical address space
+# (comprising guest RAM, memory hotplug range, and 64-bit
+# PCI MMIO aperture), and/or a high VCPU count, may
+# present high SMRAM requirements from the firmware. On
+# the "pc-q35-*" machine types of the @i386 and @x86_64
+# emulation targets, the SMRAM size may be increased
+# above the default 16MB with the "-global
+# mch.extended-tseg-mbytes=uint16" option. As a rule of
+# thumb, the default 16MB size suffices for 1TB of
+# guest-phys address space and a few tens of VCPUs; for
+# every further TB of guest-phys address space, add 8MB
+# of SMRAM. 48MB should suffice for 4TB of guest-phys
+# address space and 2-3 hundred VCPUs.
+#
+# @secure-boot: The firmware implements the software interfaces for UEFI
+# Secure Boot, as defined in the UEFI specification. Note
+# that without @requires-smm, guest code running with
+# kernel privileges can undermine the security of Secure
+# Boot.
+#
+# @verbose-dynamic: When firmware log capture is enabled, the firmware
+# logs a large amount of debug messages, which may
+# impact boot performance. With log capture disabled,
+# there is no boot performance impact. On the
+# "pc-i440fx-*" and "pc-q35-*" machine types of the
+# @i386 and @x86_64 emulation targets, firmware log
+# capture can be enabled with the QEMU command line
+# options "-chardev file,id=fwdebug,path=LOGFILEPATH
+# -device isa-debugcon,iobase=0x402,chardev=fwdebug".
+# @verbose-dynamic is mutually exclusive with
+# @verbose-static.
+#
+# @verbose-static: The firmware unconditionally produces a large amount
+# of debug messages, which may impact boot performance.
+# This feature may typically be carried by certain UEFI
+# firmware for the "virt-*" machine types of the @arm
+# and @aarch64 emulation targets, where the debug
+# messages are written to the first (always present)
+# PL011 UART. @verbose-static is mutually exclusive
+# with @verbose-dynamic.
+#
+# Since: 3.0
+##
+{ 'enum' : 'FirmwareFeature',
+ 'data' : [ 'acpi-s3', 'acpi-s4', 'amd-sev', 'amd-sev-es', 'enrolled-keys',
+ 'requires-smm', 'secure-boot', 'verbose-dynamic',
+ 'verbose-static' ] }
+
+##
+# @FirmwareFlashFile:
+#
+# Defines common properties that are necessary for loading a firmware
+# file into a pflash chip. The corresponding QEMU command line option is
+# "-drive file=@filename,format=@format". Note however that the
+# option-argument shown here is incomplete; it is completed under
+# @FirmwareMappingFlash.
+#
+# @filename: Specifies the filename on the host filesystem where the
+# firmware file can be found.
+#
+# @format: Specifies the block format of the file pointed-to by
+# @filename, such as @raw or @qcow2.
+#
+# Since: 3.0
+##
+{ 'struct' : 'FirmwareFlashFile',
+ 'data' : { 'filename' : 'str',
+ 'format' : 'BlockdevDriver' } }
+
+##
+# @FirmwareMappingFlash:
+#
+# Describes loading and mapping properties for the firmware executable
+# and its accompanying NVRAM file, when @FirmwareDevice is @flash.
+#
+# @executable: Identifies the firmware executable. The firmware
+# executable may be shared by multiple virtual machine
+# definitions. The preferred corresponding QEMU command
+# line options are
+# -drive if=none,id=pflash0,readonly=on,file=@executable.@filename,format=@executable.@format
+# -machine pflash0=pflash0
+# or equivalent -blockdev instead of -drive.
+# With QEMU versions older than 4.0, you have to use
+# -drive if=pflash,unit=0,readonly=on,file=@executable.@filename,format=@executable.@format
+#
+# @nvram-template: Identifies the NVRAM template compatible with
+# @executable. Management software instantiates an
+# individual copy -- a specific NVRAM file -- from
+# @nvram-template.@filename for each new virtual
+# machine definition created. @nvram-template.@filename
+# itself is never mapped into virtual machines, only
+# individual copies of it are. An NVRAM file is
+# typically used for persistently storing the
+# non-volatile UEFI variables of a virtual machine
+# definition. The preferred corresponding QEMU
+# command line options are
+# -drive if=none,id=pflash1,readonly=off,file=FILENAME_OF_PRIVATE_NVRAM_FILE,format=@nvram-template.@format
+# -machine pflash1=pflash1
+# or equivalent -blockdev instead of -drive.
+# With QEMU versions older than 4.0, you have to use
+# -drive if=pflash,unit=1,readonly=off,file=FILENAME_OF_PRIVATE_NVRAM_FILE,format=@nvram-template.@format
+#
+# Since: 3.0
+##
+{ 'struct' : 'FirmwareMappingFlash',
+ 'data' : { 'executable' : 'FirmwareFlashFile',
+ 'nvram-template' : 'FirmwareFlashFile' } }
+
+##
+# @FirmwareMappingKernel:
+#
+# Describes loading and mapping properties for the firmware executable,
+# when @FirmwareDevice is @kernel.
+#
+# @filename: Identifies the firmware executable. The firmware executable
+# may be shared by multiple virtual machine definitions. The
+# corresponding QEMU command line option is "-kernel
+# @filename".
+#
+# Since: 3.0
+##
+{ 'struct' : 'FirmwareMappingKernel',
+ 'data' : { 'filename' : 'str' } }
+
+##
+# @FirmwareMappingMemory:
+#
+# Describes loading and mapping properties for the firmware executable,
+# when @FirmwareDevice is @memory.
+#
+# @filename: Identifies the firmware executable. The firmware executable
+# may be shared by multiple virtual machine definitions. The
+# corresponding QEMU command line option is "-bios
+# @filename".
+#
+# Since: 3.0
+##
+{ 'struct' : 'FirmwareMappingMemory',
+ 'data' : { 'filename' : 'str' } }
+
+##
+# @FirmwareMapping:
+#
+# Provides a discriminated structure for firmware to describe its
+# loading / mapping properties.
+#
+# @device: Selects the device type that the firmware must be mapped
+# into.
+#
+# Since: 3.0
+##
+{ 'union' : 'FirmwareMapping',
+ 'base' : { 'device' : 'FirmwareDevice' },
+ 'discriminator' : 'device',
+ 'data' : { 'flash' : 'FirmwareMappingFlash',
+ 'kernel' : 'FirmwareMappingKernel',
+ 'memory' : 'FirmwareMappingMemory' } }
+
+##
+# @Firmware:
+#
+# Describes a firmware (or a firmware use case) to management software.
+#
+# It is possible for multiple @Firmware elements to match the search
+# criteria of management software. Applications thus need rules to pick
+# one of the many matches, and users need the ability to override distro
+# defaults.
+#
+# It is recommended to create firmware JSON files (each containing a
+# single @Firmware root element) with a double-digit prefix, for example
+# "50-ovmf.json", "50-seabios-256k.json", etc, so they can be sorted in
+# predictable order. The firmware JSON files should be searched for in
+# three directories:
+#
+# - /usr/share/qemu/firmware -- populated by distro-provided firmware
+# packages (XDG_DATA_DIRS covers
+# /usr/share by default),
+#
+# - /etc/qemu/firmware -- exclusively for sysadmins' local additions,
+#
+# - $XDG_CONFIG_HOME/qemu/firmware -- exclusively for per-user local
+# additions (XDG_CONFIG_HOME
+# defaults to $HOME/.config).
+#
+# Top-down, the list of directories goes from general to specific.
+#
+# Management software should build a list of files from all three
+# locations, then sort the list by filename (i.e., last pathname
+# component). Management software should choose the first JSON file on
+# the sorted list that matches the search criteria. If a more specific
+# directory has a file with same name as a less specific directory, then
+# the file in the more specific directory takes effect. If the more
+# specific file is zero length, it hides the less specific one.
+#
+# For example, if a distro ships
+#
+# - /usr/share/qemu/firmware/50-ovmf.json
+#
+# - /usr/share/qemu/firmware/50-seabios-256k.json
+#
+# then the sysadmin can prevent the default OVMF being used at all with
+#
+# $ touch /etc/qemu/firmware/50-ovmf.json
+#
+# The sysadmin can replace/alter the distro default OVMF with
+#
+# $ vim /etc/qemu/firmware/50-ovmf.json
+#
+# or they can provide a parallel OVMF with higher priority
+#
+# $ vim /etc/qemu/firmware/10-ovmf.json
+#
+# or they can provide a parallel OVMF with lower priority
+#
+# $ vim /etc/qemu/firmware/99-ovmf.json
+#
+# @description: Provides a human-readable description of the firmware.
+# Management software may or may not display @description.
+#
+# @interface-types: Lists the types of interfaces that the firmware can
+# expose to the guest OS. This is a non-empty, ordered
+# list; entries near the beginning of @interface-types
+# are considered more native to the firmware, and/or
+# to have a higher quality implementation in the
+# firmware, than entries near the end of
+# @interface-types.
+#
+# @mapping: Describes the loading / mapping properties of the firmware.
+#
+# @targets: Collects the target architectures (QEMU system emulators)
+# and their machine types that may execute the firmware.
+#
+# @features: Lists the features that the firmware supports, and the
+# platform requirements it presents.
+#
+# @tags: A list of auxiliary strings associated with the firmware for
+# which @description is not appropriate, due to the latter's
+# possible exposure to the end-user. @tags serves development and
+# debugging purposes only, and management software shall
+# explicitly ignore it.
+#
+# Since: 3.0
+#
+# Examples:
+#
+# {
+# "description": "SeaBIOS",
+# "interface-types": [
+# "bios"
+# ],
+# "mapping": {
+# "device": "memory",
+# "filename": "/usr/share/seabios/bios-256k.bin"
+# },
+# "targets": [
+# {
+# "architecture": "i386",
+# "machines": [
+# "pc-i440fx-*",
+# "pc-q35-*"
+# ]
+# },
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-i440fx-*",
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "acpi-s4"
+# ],
+# "tags": [
+# "CONFIG_BOOTSPLASH=n",
+# "CONFIG_ROM_SIZE=256",
+# "CONFIG_USE_SMM=n"
+# ]
+# }
+#
+# {
+# "description": "OVMF with SB+SMM, empty varstore",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/OVMF/OVMF_VARS.fd",
+# "format": "raw"
+# }
+# },
+# "targets": [
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "amd-sev",
+# "requires-smm",
+# "secure-boot",
+# "verbose-dynamic"
+# ],
+# "tags": [
+# "-a IA32",
+# "-a X64",
+# "-p OvmfPkg/OvmfPkgIa32X64.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D SMM_REQUIRE",
+# "-D SECURE_BOOT_ENABLE",
+# "-D FD_SIZE_4MB"
+# ]
+# }
+#
+# {
+# "description": "OVMF with SB+SMM, SB enabled, MS certs enrolled",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/OVMF/OVMF_VARS.secboot.fd",
+# "format": "raw"
+# }
+# },
+# "targets": [
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "amd-sev",
+# "enrolled-keys",
+# "requires-smm",
+# "secure-boot",
+# "verbose-dynamic"
+# ],
+# "tags": [
+# "-a IA32",
+# "-a X64",
+# "-p OvmfPkg/OvmfPkgIa32X64.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D SMM_REQUIRE",
+# "-D SECURE_BOOT_ENABLE",
+# "-D FD_SIZE_4MB"
+# ]
+# }
+#
+# {
+# "description": "OVMF with SEV-ES support",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/OVMF/OVMF_CODE.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/OVMF/OVMF_VARS.fd",
+# "format": "raw"
+# }
+# },
+# "targets": [
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "amd-sev",
+# "amd-sev-es",
+# "verbose-dynamic"
+# ],
+# "tags": [
+# "-a X64",
+# "-p OvmfPkg/OvmfPkgX64.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D FD_SIZE_4MB"
+# ]
+# }
+#
+# {
+# "description": "UEFI firmware for ARM64 virtual machines",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/AAVMF/AAVMF_CODE.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/AAVMF/AAVMF_VARS.fd",
+# "format": "raw"
+# }
+# },
+# "targets": [
+# {
+# "architecture": "aarch64",
+# "machines": [
+# "virt-*"
+# ]
+# }
+# ],
+# "features": [
+#
+# ],
+# "tags": [
+# "-a AARCH64",
+# "-p ArmVirtPkg/ArmVirtQemu.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D DEBUG_PRINT_ERROR_LEVEL=0x80000000"
+# ]
+# }
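+#
+# As a further, purely illustrative sketch, a descriptor that uses the
+# "kernel" mapping might look like the following. The description,
+# the filename, and the choice of target are hypothetical placeholders
+# picked for this example; they do not refer to files shipped by any
+# particular distro.
+#
+#     {
+#         "description": "Example firmware loaded like a kernel (hypothetical)",
+#         "interface-types": [
+#             "uboot"
+#         ],
+#         "mapping": {
+#             "device": "kernel",
+#             "filename": "/usr/share/example-firmware/firmware.elf"
+#         },
+#         "targets": [
+#             {
+#                 "architecture": "arm",
+#                 "machines": [
+#                     "virt-*"
+#                 ]
+#             }
+#         ],
+#         "features": [
+#
+#         ],
+#         "tags": [
+#
+#         ]
+#     }
+#
+# Per @FirmwareMappingKernel, such a descriptor would correspond to the
+# QEMU command line option "-kernel
+# /usr/share/example-firmware/firmware.elf".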
+##
+{ 'struct' : 'Firmware',
+ 'data' : { 'description' : 'str',
+ 'interface-types' : [ 'FirmwareOSInterface' ],
+ 'mapping' : 'FirmwareMapping',
+ 'targets' : [ 'FirmwareTarget' ],
+ 'features' : [ 'FirmwareFeature' ],
+ 'tags' : [ 'str' ] } }
diff --git a/docs/interop/index.rst b/docs/interop/index.rst
new file mode 100644
index 000000000..47b9ed82b
--- /dev/null
+++ b/docs/interop/index.rst
@@ -0,0 +1,23 @@
+------------------------------------------------
+System Emulation Management and Interoperability
+------------------------------------------------
+
+This section of the manual contains documents and specifications that
+are useful for making QEMU interoperate with other software.
+
+.. toctree::
+ :maxdepth: 2
+
+ barrier
+ bitmaps
+ dbus
+ dbus-vmstate
+ live-block-operations
+ pr-helper
+ qemu-ga
+ qemu-ga-ref
+ qemu-qmp-ref
+ qemu-storage-daemon-qmp-ref
+ vhost-user
+ vhost-user-gpu
+ vhost-vdpa
diff --git a/docs/interop/live-block-operations.rst b/docs/interop/live-block-operations.rst
new file mode 100644
index 000000000..39e62c991
--- /dev/null
+++ b/docs/interop/live-block-operations.rst
@@ -0,0 +1,1121 @@
+..
+ Copyright (C) 2017 Red Hat Inc.
+
+ This work is licensed under the terms of the GNU GPL, version 2 or
+ later. See the COPYING file in the top-level directory.
+
+============================
+Live Block Device Operations
+============================
+
+QEMU Block Layer currently (as of QEMU 2.9) supports four major kinds of
+live block device jobs -- stream, commit, mirror, and backup. These can
+be used to manipulate disk image chains to accomplish certain tasks,
+namely: live copy data from backing files into overlays; shorten long
+disk image chains by merging data from overlays into backing files; live
+synchronize data from a disk image chain (including current active disk)
+to another target image; and point-in-time (and incremental) backups of
+a block device. Below is a description of the said block (QMP)
+primitives, and some (non-exhaustive list of) examples to illustrate
+their use.
+
+.. note::
+ The file ``qapi/block-core.json`` in the QEMU source tree has the
+ canonical QEMU API (QAPI) schema documentation for the QMP
+ primitives discussed here.
+
+.. todo (kashyapc):: Remove the ".. contents::" directive when Sphinx is
+ integrated.
+
+.. contents::
+
+Disk image backing chain notation
+---------------------------------
+
+A simple disk image chain. (This can be created live using QMP
+``blockdev-snapshot-sync``, or offline via ``qemu-img``)::
+
+ (Live QEMU)
+ |
+ .
+ V
+
+ [A] <----- [B]
+
+ (backing file) (overlay)
+
+The arrow can be read as: Image [A] is the backing file of disk image
+[B]. And live QEMU is currently writing to image [B], consequently, it
+is also referred to as the "active layer".
+
+There are two kinds of terminology that are common when referring to
+files in a disk image backing chain:
+
+(1) Directional: 'base' and 'top'. Given the simple disk image chain
+ above, image [A] can be referred to as 'base', and image [B] as
+ 'top'. (This terminology can be seen in the QAPI schema file,
+ block-core.json.)
+
+(2) Relational: 'backing file' and 'overlay'. Again, taking the same
+ simple disk image chain from the above, disk image [A] is referred
+ to as the backing file, and image [B] as overlay.
+
+ Throughout this document, we will use the relational terminology.
+
+.. important::
+ The overlay files can generally be any format that supports a
+ backing file, although QCOW2 is the preferred format and the one
+ used in this document.
+
+
+Brief overview of live block QMP primitives
+-------------------------------------------
+
+The following are the four kinds of live block operations that the
+QEMU block layer supports.
+
+(1) ``block-stream``: Live copy of data from backing files into overlay
+ files.
+
+ .. note:: Once the 'stream' operation has finished, three things to
+ note:
+
+ (a) QEMU rewrites the backing chain to remove
+ reference to the now-streamed and redundant backing
+ file;
+
+ (b) the streamed file *itself* won't be removed by QEMU,
+ and must be explicitly discarded by the user;
+
+ (c) the streamed file remains valid -- i.e. further
+ overlays can be created based on it. Refer to the
+ ``block-stream`` section further below for more
+ details.
+
+(2) ``block-commit``: Live merge of data from overlay files into backing
+ files (with the optional goal of removing the overlay file from the
+ chain). Since QEMU 2.0, this includes "active ``block-commit``"
+ (i.e. merge the current active layer into the base image).
+
+ .. note:: Once the 'commit' operation has finished, there are three
+ things to note here as well:
+
+ (a) QEMU rewrites the backing chain to remove reference
+ to now-redundant overlay images that have been
+ committed into a backing file;
+
+ (b) the committed file *itself* won't be removed by QEMU
+ -- it ought to be manually removed;
+
+ (c) however, unlike in the case of ``block-stream``, the
+ intermediate images will be rendered invalid -- i.e.
+ no further overlays can be created based on
+ them. Refer to the ``block-commit`` section further
+ below for more details.
+
+(3) ``drive-mirror`` (and ``blockdev-mirror``): Synchronize a running
+ disk to another image.
+
+(4) ``blockdev-backup`` (and the deprecated ``drive-backup``):
+ Point-in-time (live) copy of a block device to a destination.
+
+
+.. _`Interacting with a QEMU instance`:
+
+Interacting with a QEMU instance
+--------------------------------
+
+The examples in this document use the following invocation of QEMU,
+with a QMP server listening on a UNIX socket:
+
+.. parsed-literal::
+
+ $ |qemu_system| -display none -no-user-config -nodefaults \\
+ -m 512 -blockdev \\
+ node-name=node-A,driver=qcow2,file.driver=file,file.node-name=file,file.filename=./a.qcow2 \\
+ -device virtio-blk,drive=node-A,id=virtio0 \\
+ -monitor stdio -qmp unix:/tmp/qmp-sock,server=on,wait=off
+
+The ``-blockdev`` command-line option, used above, is available from
+QEMU 2.9 onwards. In the above invocation, notice the ``node-name``
+parameter that is used to refer to the disk image a.qcow2 ('node-A') --
+this is a cleaner way to refer to a disk image (as opposed to referring
+to it by spelling out file paths). So, we will continue to designate a
+``node-name`` to each further disk image created (either via
+``blockdev-snapshot-sync``, or ``blockdev-add``) as part of the disk
+image chain, and continue to refer to the disks using their
+``node-name`` (where possible, because ``block-commit`` does not yet, as
+of QEMU 2.9, accept ``node-name`` parameter) when performing various
+block operations.
+
+To interact with the QEMU instance launched above, we will use the
+``qmp-shell`` utility (located at: ``qemu/scripts/qmp``, as part of the
+QEMU source directory), which takes key-value pairs for QMP commands.
+Invoke it as below (which will also print out the complete raw JSON
+syntax for reference -- examples in the following sections)::
+
+ $ ./qmp-shell -v -p /tmp/qmp-sock
+ (QEMU)
+
+.. note::
+ In the event we have to repeat a certain QMP command, we will: for
+ the first occurrence of it, show the ``qmp-shell`` invocation, *and*
+ the corresponding raw JSON QMP syntax; but for subsequent
+ invocations, present just the ``qmp-shell`` syntax, and omit the
+ equivalent JSON output.
+
+
+Example disk image chain
+------------------------
+
+We will use the disk image chain below (occasionally spelling it out
+where appropriate) when discussing various primitives::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+Where [A] is the original base image; [B] and [C] are intermediate
+overlay images; image [D] is the active layer -- i.e. live QEMU is
+writing to it. (The rule of thumb is: live QEMU will always be pointing
+to the rightmost image in a disk image chain.)
+
+The above image chain can be created by invoking
+``blockdev-snapshot-sync`` commands as follows (the example shows the
+creation of overlay image [B]) using the ``qmp-shell`` (our invocation
+also prints the raw JSON invocation of it)::
+
+ (QEMU) blockdev-snapshot-sync node-name=node-A snapshot-file=b.qcow2 snapshot-node-name=node-B format=qcow2
+ {
+ "execute": "blockdev-snapshot-sync",
+ "arguments": {
+ "node-name": "node-A",
+ "snapshot-file": "b.qcow2",
+ "format": "qcow2",
+ "snapshot-node-name": "node-B"
+ }
+ }
+
+Here, "node-A" is the name QEMU internally uses to refer to the base
+image [A] -- it is the backing file, based on which the overlay image,
+[B], is created.
+
+To create the rest of the overlay images, [C], and [D] (omitting the raw
+JSON output for brevity)::
+
+ (QEMU) blockdev-snapshot-sync node-name=node-B snapshot-file=c.qcow2 snapshot-node-name=node-C format=qcow2
+ (QEMU) blockdev-snapshot-sync node-name=node-C snapshot-file=d.qcow2 snapshot-node-name=node-D format=qcow2
+
+
+A note on points-in-time vs file names
+--------------------------------------
+
+In our disk image chain::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+We have *three* points in time and an active layer:
+
+- Point 1: Guest state when [B] was created is contained in file [A]
+- Point 2: Guest state when [C] was created is contained in [A] + [B]
+- Point 3: Guest state when [D] was created is contained in
+ [A] + [B] + [C]
+- Active layer: Current guest state is contained in [A] + [B] + [C] +
+ [D]
+
+Therefore, be careful with naming choices:
+
+- Naming a file after the time it is created is misleading -- the
+ guest data for that point in time is *not* contained in that file
+ (as explained earlier)
+- Rather, think of files as a *delta* from the backing file
+
+
+Live block streaming --- ``block-stream``
+-----------------------------------------
+
+The ``block-stream`` command lets you live copy data from backing
+files into overlay images.
+
+Given our original example disk image chain from earlier::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+The disk image chain can be shortened in one of the following different
+ways (not an exhaustive list).
+
+.. _`Case-1`:
+
+(1) Merge everything into the active layer: I.e. copy all contents from
+ the base image, [A], and overlay images, [B] and [C], into [D],
+ *while* the guest is running. The resulting chain will be a
+ standalone image, [D] -- with contents from [A], [B] and [C] merged
+ into it (where live QEMU writes go to)::
+
+ [D]
+
+.. _`Case-2`:
+
+(2) Taking the same example disk image chain mentioned earlier, merge
+ only images [B] and [C] into [D], the active layer. As a result,
+ the contents of images [B] and [C] will be copied into [D], and the
+ backing file pointer of image [D] will be adjusted to point to image
+ [A]. The resulting chain will be::
+
+ [A] <-- [D]
+
+.. _`Case-3`:
+
+(3) Intermediate streaming (available since QEMU 2.8): Starting afresh
+ with the original example disk image chain, with a total of four
+ images, it is possible to copy contents from image [B] into image
+ [C]. Once the copy is finished, image [B] can now be (optionally)
+ discarded; and the backing file pointer of image [C] will be
+ adjusted to point to [A]. I.e. after performing "intermediate
+ streaming" of [B] into [C], the resulting image chain will be (where
+ live QEMU is writing to [D])::
+
+ [A] <-- [C] <-- [D]
+
+
+QMP invocation for ``block-stream``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For `Case-1`_, to merge contents of all the backing files into the
+active layer, where 'node-D' is the current active image (by default
+``block-stream`` will flatten the entire chain); ``qmp-shell`` (and its
+corresponding JSON output)::
+
+ (QEMU) block-stream device=node-D job-id=job0
+ {
+ "execute": "block-stream",
+ "arguments": {
+ "device": "node-D",
+ "job-id": "job0"
+ }
+ }
+
+For `Case-2`_, merge contents of the images [B] and [C] into [D], where
+image [D] ends up referring to image [A] as its backing file::
+
+ (QEMU) block-stream device=node-D base-node=node-A job-id=job0
+
+And for `Case-3`_, of "intermediate streaming", merge contents of
+image [B] into [C], where [C] ends up referring to [A] as its backing
+image::
+
+ (QEMU) block-stream device=node-C base-node=node-A job-id=job0
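+
+For reference, a sketch of the raw JSON that would correspond to the
+`Case-2`_ invocation; it is expected to differ from the `Case-1`_ JSON
+only in the additional ``base-node`` argument::
+
+    {
+        "execute": "block-stream",
+        "arguments": {
+            "device": "node-D",
+            "base-node": "node-A",
+            "job-id": "job0"
+        }
+    }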
+
+Progress of a ``block-stream`` operation can be monitored via the QMP
+command::
+
+ (QEMU) query-block-jobs
+ {
+ "execute": "query-block-jobs",
+ "arguments": {}
+ }
+
+
+Once the ``block-stream`` operation has completed, QEMU will emit an
+event, ``BLOCK_JOB_COMPLETED``. The intermediate overlays remain valid,
+and can now be (optionally) discarded, or retained to create further
+overlays based on them. Finally, the ``block-stream`` jobs can be
+restarted at any time.
+
+
+Live block commit --- ``block-commit``
+--------------------------------------
+
+The ``block-commit`` command lets you merge live data from overlay
+images into backing file(s). Since QEMU 2.0, this includes "live active
+commit" (i.e. it is possible to merge the "active layer", the right-most
+image in a disk image chain where live QEMU will be writing to, into the
+base image). This is analogous to ``block-stream``, but in the opposite
+direction.
+
+Again, starting afresh with our example disk image chain, where live
+QEMU is writing to the right-most image in the chain, [D]::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+The disk image chain can be shortened in one of the following ways:
+
+.. _`block-commit_Case-1`:
+
+(1) Commit content from only image [B] into image [A]. The resulting
+ chain is the following, where image [C] is adjusted to point at [A]
+ as its new backing file::
+
+ [A] <-- [C] <-- [D]
+
+(2) Commit content from images [B] and [C] into image [A]. The
+ resulting chain, where image [D] is adjusted to point to image [A]
+ as its new backing file::
+
+ [A] <-- [D]
+
+.. _`block-commit_Case-3`:
+
+(3) Commit content from images [B], [C], and the active layer [D] into
+ image [A]. The resulting chain (in this case, a consolidated single
+ image)::
+
+ [A]
+
+(4) Commit content from only image [C] into image [B]. The
+ resulting chain::
+
+ [A] <-- [B] <-- [D]
+
+(5) Commit content from image [C] and the active layer [D] into image
+ [B]. The resulting chain::
+
+ [A] <-- [B]
+
+
+QMP invocation for ``block-commit``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For :ref:`Case-1 <block-commit_Case-1>`, to merge contents only from
+image [B] into image [A], the invocation is as follows::
+
+ (QEMU) block-commit device=node-D base=a.qcow2 top=b.qcow2 job-id=job0
+ {
+ "execute": "block-commit",
+ "arguments": {
+ "device": "node-D",
+ "job-id": "job0",
+ "top": "b.qcow2",
+ "base": "a.qcow2"
+ }
+ }
+
+Once the above ``block-commit`` operation has completed, a
+``BLOCK_JOB_COMPLETED`` event will be issued, and no further action is
+required. As the end result, the backing file of image [C] is adjusted
+to point to image [A], and the original 4-image chain will end up being
+transformed to::
+
+ [A] <-- [C] <-- [D]
+
+.. note::
+ The intermediate image [B] is now invalid (as in: no further
+ overlays based on it can be created).
+
+ Reasoning: An intermediate image after a 'stream' operation still
+ represents that old point-in-time, and may be valid in that context.
+ However, an intermediate image after a 'commit' operation no longer
+ represents any point-in-time, and is invalid in any context.
+
+
+However, :ref:`Case-3 <block-commit_Case-3>` (also called: "active
+``block-commit``") is a *two-phase* operation: In the first phase, the
+content from the active overlay, along with the intermediate overlays,
+is copied into the backing file (also called the base image). In the
+second phase, the said backing file is made the current active image
+by issuing the command ``block-job-complete``. Optionally,
+the ``block-commit`` operation can be cancelled by issuing the command
+``block-job-cancel``, but be careful when doing this.
+
+Once the ``block-commit`` operation has completed, the event
+``BLOCK_JOB_READY`` will be emitted, signalling that the synchronization
+has finished. Now the job can be gracefully completed by issuing the
+command ``block-job-complete`` -- until such a command is issued, the
+'commit' operation remains active.
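+
+As a minimal sketch, assuming the job was started with ``job-id=job0``
+as in the examples in this document, cancelling the job instead of
+completing it would look like this::
+
+    (QEMU) block-job-cancel device=job0
+    {
+        "execute": "block-job-cancel",
+        "arguments": {
+            "device": "job0"
+        }
+    }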
+
+The following is the flow for :ref:`Case-3 <block-commit_Case-3>` to
+convert a disk image chain such as this::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+Into::
+
+ [A]
+
+Here, content from all the subsequent overlays, [B] and [C], including
+the active layer, [D], is committed back to [A] -- which is where live
+QEMU is performing all its current writes.
+
+Start the "active ``block-commit``" operation::
+
+ (QEMU) block-commit device=node-D base=a.qcow2 top=d.qcow2 job-id=job0
+ {
+ "execute": "block-commit",
+ "arguments": {
+ "device": "node-D",
+ "job-id": "job0",
+ "top": "d.qcow2",
+ "base": "a.qcow2"
+ }
+ }
+
+
+Once the synchronization has completed, the event ``BLOCK_JOB_READY`` will
+be emitted.
+
+Then, optionally query for the status of the active block operations.
+We can see the 'commit' job is now ready to be completed, as indicated
+by the line *"ready": true*::
+
+ (QEMU) query-block-jobs
+ {
+ "execute": "query-block-jobs",
+ "arguments": {}
+ }
+ {
+ "return": [
+ {
+ "busy": false,
+ "type": "commit",
+ "len": 1376256,
+ "paused": false,
+ "ready": true,
+ "io-status": "ok",
+ "offset": 1376256,
+ "device": "job0",
+ "speed": 0
+ }
+ ]
+ }
+
+Gracefully complete the 'commit' block device job::
+
+ (QEMU) block-job-complete device=job0
+ {
+ "execute": "block-job-complete",
+ "arguments": {
+ "device": "job0"
+ }
+ }
+ {
+ "return": {}
+ }
+
+Finally, once the above job is completed, an event
+``BLOCK_JOB_COMPLETED`` will be emitted.
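+
+A client driving this flow needs to decide when it is safe to send
+``block-job-complete``. The following is a minimal sketch in Python, not
+part of QEMU: the helper name ``completion_command`` is hypothetical, and
+it only inspects a ``query-block-jobs`` reply shaped like the one shown
+above.
+

```python
def completion_command(query_reply, job_id):
    """Return a block-job-complete command if the named job is ready.

    `query_reply` is the decoded JSON reply of query-block-jobs.
    Returns None when the job is absent or not yet ready.
    """
    for job in query_reply.get("return", []):
        # In the replies shown in this document, "device" carries the job ID.
        if job.get("device") == job_id and job.get("ready"):
            return {
                "execute": "block-job-complete",
                "arguments": {"device": job_id},
            }
    return None


if __name__ == "__main__":
    reply = {"return": [{"type": "commit", "ready": True, "device": "job0"}]}
    print(completion_command(reply, "job0"))
```

+
+Returning ``None`` instead of raising lets the caller keep polling (or,
+better, wait for the ``BLOCK_JOB_READY`` event) without special-casing
+errors.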
+
+.. note::
+ The invocation for rest of the cases (2, 4, and 5), discussed in the
+ previous section, is omitted for brevity.
+
+
+Live disk synchronization --- ``drive-mirror`` and ``blockdev-mirror``
+----------------------------------------------------------------------
+
+Synchronize a running disk image chain (all or part of it) to a target
+image.
+
+Again, given our familiar disk image chain::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+The ``drive-mirror`` command (and its newer equivalent,
+``blockdev-mirror``) allows you to copy data from the entire chain into
+a single target image, [E] (which can be located on a different host).
+
+.. note::
+
+ When you cancel an in-progress 'mirror' job *before* the source and
+ target are synchronized, ``block-job-cancel`` will emit the event
+ ``BLOCK_JOB_CANCELLED``. However, note that if you cancel a
+ 'mirror' job *after* it has indicated (via the event
+ ``BLOCK_JOB_READY``) that the source and target have reached
+ synchronization, then the event emitted by ``block-job-cancel``
+ changes to ``BLOCK_JOB_COMPLETED``.
+
+ Besides the 'mirror' job, the "active ``block-commit``" is the only
+ other block device job that emits the event ``BLOCK_JOB_READY``.
+ The rest of the block device jobs ('stream', "non-active
+ ``block-commit``", and 'backup') end automatically.
+
+So there are two possible actions to take, after a 'mirror' job has
+emitted the event ``BLOCK_JOB_READY``, indicating that the source and
+target have reached synchronization:
+
+(1) Issuing the command ``block-job-cancel`` (after it emits the event
+ ``BLOCK_JOB_COMPLETED``) will create a point-in-time (which is at
+ the time of *triggering* the cancel command) copy of the entire disk
+ image chain (or only the top-most image, depending on the ``sync``
+ mode), contained in the target image [E]. One use case for this is
+ live VM migration with non-shared storage.
+
+(2) Issuing the command ``block-job-complete`` (after it emits the event
+ ``BLOCK_JOB_COMPLETED``) will adjust the guest device (i.e. live
+ QEMU) to point to the target image, [E], causing all the new writes
+ from this point on to happen there.
+
+About synchronization modes: The synchronization mode determines
+*which* part of the disk image chain will be copied to the target.
+Currently, there are four different kinds:
+
+(1) ``full`` -- Synchronize the content of the entire disk image chain to
+ the target
+
+(2) ``top`` -- Synchronize only the contents of the top-most disk image
+ in the chain to the target
+
+(3) ``none`` -- Synchronize only the new writes from this point on.
+
+ .. note:: In the case of ``blockdev-backup`` (or deprecated
+ ``drive-backup``), the behavior of ``none``
+ synchronization mode is different. Normally, a
+ ``backup`` job consists of two parts: Anything that is
+ overwritten by the guest is first copied out to the
+ backup, and in the background the whole image is copied
+ from start to end. With ``sync=none``, it's only the
+ first part.
+
+(4) ``incremental`` -- Synchronize content that is described by the
+ dirty bitmap
+
+.. note::
+ Refer to the :doc:`bitmaps` document in the QEMU source
+ tree to learn about the detailed workings of the ``incremental``
+ synchronization mode.
+
+
+QMP invocation for ``drive-mirror``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To copy the contents of the entire disk image chain, from [A] all the
+way to [D], to a new target (``drive-mirror`` will create the destination
+file, if it doesn't already exist), call it [E]::
+
+ (QEMU) drive-mirror device=node-D target=e.qcow2 sync=full job-id=job0
+ {
+ "execute": "drive-mirror",
+ "arguments": {
+ "device": "node-D",
+ "job-id": "job0",
+ "target": "e.qcow2",
+ "sync": "full"
+ }
+ }
+
+The ``"sync": "full"``, from the above, means: copy the *entire* chain
+to the destination.
+
+Following the above, querying for active block jobs will show that a
+'mirror' job is "ready" to be completed (and QEMU will also emit an
+event, ``BLOCK_JOB_READY``)::
+
+ (QEMU) query-block-jobs
+ {
+ "execute": "query-block-jobs",
+ "arguments": {}
+ }
+ {
+ "return": [
+ {
+ "busy": false,
+ "type": "mirror",
+ "len": 21757952,
+ "paused": false,
+ "ready": true,
+ "io-status": "ok",
+ "offset": 21757952,
+ "device": "job0",
+ "speed": 0
+ }
+ ]
+ }
+
+And, as noted in the previous section, there are two possible actions
+at this point:
+
+(a) Create a point-in-time snapshot by ending the synchronization. The
+ point-in-time is at the time of *ending* the sync. (The result of
+ the following being: the target image, [E], will be populated with
+ content from the entire chain, [A] to [D])::
+
+ (QEMU) block-job-cancel device=job0
+ {
+ "execute": "block-job-cancel",
+ "arguments": {
+ "device": "job0"
+ }
+ }
+
+(b) Or, complete the operation and pivot the live QEMU to the target
+ copy::
+
+ (QEMU) block-job-complete device=job0
+
+In either of the above cases, if you once again run the
+``query-block-jobs`` command, there should not be any active block
+operation.
+
+Comparing 'commit' and 'mirror': In both cases, the overlay images
+can be discarded. However, with 'commit', the *existing* base image
+will be modified (by updating it with contents from overlays); while in
+the case of 'mirror', a *new* target image is populated with the data
+from the disk image chain.
+
+
+QMP invocation for live storage migration with ``drive-mirror`` + NBD
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Live storage migration (without shared storage setup) is one of the most
+common use-cases that takes advantage of the ``drive-mirror`` primitive
+and QEMU's built-in Network Block Device (NBD) server. Here's a quick
+walk-through of this setup.
+
+Given the disk image chain::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+Instead of copying content from the entire chain, synchronize *only* the
+contents of the *top*-most disk image (i.e. the active layer), [D], to a
+target, say, [TargetDisk].
+
+.. important::
+ The destination host must already have the contents of the backing
+ chain, involving images [A], [B], and [C], visible via other means
+ -- whether by ``cp``, ``rsync``, or by some storage array-specific
+ command.
+
+Sometimes, this is also referred to as "shallow copy" -- because only
+the "active layer", and not the rest of the image chain, is copied to
+the destination.
+
+.. note::
+ In this example, for the sake of simplicity, we'll be using the same
+ ``localhost`` as both source and destination.
+
+As noted earlier, on the destination host the contents of the backing
+chain -- from images [A] to [C] -- are already expected to exist in some
+form (e.g. in a file called, ``Contents-of-A-B-C.qcow2``). Now, on the
+destination host, let's create a target overlay image (with the image
+``Contents-of-A-B-C.qcow2`` as its backing file), to which the contents
+of image [D] (from the source QEMU) will be mirrored::
+
+ $ qemu-img create -f qcow2 -b ./Contents-of-A-B-C.qcow2 \
+ -F qcow2 ./target-disk.qcow2
+
+Then start the destination QEMU instance with the following invocation
+(the source QEMU is already running, as discussed in the section
+`Interacting with a QEMU instance`_). As noted earlier, for simplicity's
+sake the destination QEMU is started on the same host, but it could be
+located elsewhere:
+
+.. parsed-literal::
+
+ $ |qemu_system| -display none -no-user-config -nodefaults \\
+ -m 512 -blockdev \\
+ node-name=node-TargetDisk,driver=qcow2,file.driver=file,file.node-name=file,file.filename=./target-disk.qcow2 \\
+ -device virtio-blk,drive=node-TargetDisk,id=virtio0 \\
+ -S -monitor stdio -qmp unix:./qmp-sock2,server=on,wait=off \\
+ -incoming tcp:localhost:6666
+
+Given the disk image chain on source QEMU::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+On the destination host, the contents of the chain
+``[A] <-- [B] <-- [C]`` are expected to be *already* present, so we copy
+*only* the content of image [D].
+
+(1) [On *destination* QEMU] As part of the first step, start the
+ built-in NBD server on a given host (here ``::``, the IPv6 wildcard
+ address, i.e. all addresses of the local host) and port::
+
+ (QEMU) nbd-server-start addr={"type":"inet","data":{"host":"::","port":"49153"}}
+ {
+ "execute": "nbd-server-start",
+ "arguments": {
+ "addr": {
+ "data": {
+ "host": "::",
+ "port": "49153"
+ },
+ "type": "inet"
+ }
+ }
+ }
+
+(2) [On *destination* QEMU] And export the destination disk image using
+ QEMU's built-in NBD server::
+
+ (QEMU) nbd-server-add device=node-TargetDisk writable=true
+ {
+ "execute": "nbd-server-add",
+ "arguments": {
+ "device": "node-TargetDisk",
+ "writable": true
+ }
+ }
+
+(3) [On *source* QEMU] Then, invoke ``drive-mirror`` with
+ ``mode=existing`` (meaning: synchronize to a pre-created, therefore
+ 'existing', file on the target host) and with the synchronization
+ mode set to 'top' (``"sync": "top"``)::
+
+ (QEMU) drive-mirror device=node-D target=nbd:localhost:49153:exportname=node-TargetDisk sync=top mode=existing job-id=job0
+ {
+ "execute": "drive-mirror",
+ "arguments": {
+ "device": "node-D",
+ "mode": "existing",
+ "job-id": "job0",
+ "target": "nbd:localhost:49153:exportname=node-TargetDisk",
+ "sync": "top"
+ }
+ }
+
+(4) [On *source* QEMU] Once ``drive-mirror`` has copied all the data, and the
+ event ``BLOCK_JOB_READY`` is emitted, issue ``block-job-cancel`` to
+ gracefully end the synchronization, from source QEMU::
+
+ (QEMU) block-job-cancel device=job0
+ {
+ "execute": "block-job-cancel",
+ "arguments": {
+ "device": "job0"
+ }
+ }
+
+(5) [On *destination* QEMU] Then, stop the NBD server::
+
+ (QEMU) nbd-server-stop
+ {
+ "execute": "nbd-server-stop",
+ "arguments": {}
+ }
+
+(6) [On *destination* QEMU] Finally, resume the guest vCPUs by issuing the
+ QMP command ``cont``::
+
+ (QEMU) cont
+ {
+ "execute": "cont",
+ "arguments": {}
+ }
+
+.. note::
+ Higher-level libraries (e.g. libvirt) automate the entire above
+ process (although note that libvirt does not allow same-host
+ migrations to localhost for other reasons).
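+
+For readers scripting these steps by hand, the sequence above can be
+sketched as a list of QMP commands. This is an illustration only, under
+stated assumptions: the helper names ``qmp`` and
+``storage_migration_steps`` are hypothetical, the code only *builds* the
+JSON (it does not talk to a QMP socket), and the caller remains
+responsible for waiting for ``BLOCK_JOB_READY`` between steps (3) and (4).
+

```python
import json


def qmp(command, arguments=None):
    """Build a QMP command dict in the same shape as the JSON shown above."""
    return {"execute": command, "arguments": arguments or {}}


def storage_migration_steps(source_node, export_node, host, port, job_id="job0"):
    """Return (where, command) pairs for steps (1) to (6), in order."""
    target = "nbd:{}:{}:exportname={}".format(host, port, export_node)
    listen = {"type": "inet", "data": {"host": "::", "port": str(port)}}
    mirror = {
        "device": source_node,
        "mode": "existing",   # the target overlay was pre-created
        "job-id": job_id,
        "target": target,
        "sync": "top",        # shallow copy: only the active layer
    }
    return [
        ("destination", qmp("nbd-server-start", {"addr": listen})),
        ("destination", qmp("nbd-server-add",
                            {"device": export_node, "writable": True})),
        ("source", qmp("drive-mirror", mirror)),
        # The caller must wait for BLOCK_JOB_READY before sending this one.
        ("source", qmp("block-job-cancel", {"device": job_id})),
        ("destination", qmp("nbd-server-stop")),
        ("destination", qmp("cont")),
    ]


if __name__ == "__main__":
    for where, command in storage_migration_steps(
            "node-D", "node-TargetDisk", "localhost", 49153):
        print(where, json.dumps(command))
```

+
+Tagging each command with the QEMU instance it belongs to keeps the
+source/destination split of the walk-through explicit.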
+
+
+Notes on ``blockdev-mirror``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``blockdev-mirror`` command is equivalent in core functionality to
+``drive-mirror``, except that it operates at node-level in a BDS graph.
+
+Also: for ``blockdev-mirror``, the 'target' image needs to be explicitly
+created (using ``qemu-img``) and attached to live QEMU via
+``blockdev-add``, which assigns a name to the newly added target node.
+
+E.g. the sequence of actions to create a point-in-time backup of an
+entire disk image chain, to a target, using ``blockdev-mirror`` would be:
+
+(0) Create the QCOW2 overlays, to arrive at a backing chain of desired
+ depth
+
+(1) Create the target image (using ``qemu-img``), say, ``e.qcow2``
+
+(2) Attach the above created file (``e.qcow2``), run-time, using
+ ``blockdev-add`` to QEMU
+
+(3) Perform ``blockdev-mirror`` (use ``"sync": "full"`` to copy the
+ entire chain to the target). And notice the event
+ ``BLOCK_JOB_READY``
+
+(4) Optionally, query for active block jobs; there should be a 'mirror'
+ job ready to be completed
+
+(5) Gracefully complete the 'mirror' block device job, and notice the
+ event ``BLOCK_JOB_COMPLETED``
+
+(6) Shut down the guest by issuing the QMP ``quit`` command so that
+ caches are flushed
+
+(7) Then, finally, compare the contents of the disk image chain, and
+ the target copy with ``qemu-img compare``. You should notice:
+ "Images are identical"
+
+
+QMP invocation for ``blockdev-mirror``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given the disk image chain::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+To copy the contents of the entire disk image chain, from [A] all the
+way to [D], to a new target (call it [E]), follow the flow below.
+
+Create the overlay images, [B], [C], and [D]::
+
+ (QEMU) blockdev-snapshot-sync node-name=node-A snapshot-file=b.qcow2 snapshot-node-name=node-B format=qcow2
+ (QEMU) blockdev-snapshot-sync node-name=node-B snapshot-file=c.qcow2 snapshot-node-name=node-C format=qcow2
+ (QEMU) blockdev-snapshot-sync node-name=node-C snapshot-file=d.qcow2 snapshot-node-name=node-D format=qcow2
+
+Create the target image, [E]::
+
+ $ qemu-img create -f qcow2 e.qcow2 39M
+
+Add the above created target image to QEMU, via ``blockdev-add``::
+
+ (QEMU) blockdev-add driver=qcow2 node-name=node-E file={"driver":"file","filename":"e.qcow2"}
+ {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "node-E",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "e.qcow2"
+ }
+ }
+ }
+
+Perform ``blockdev-mirror``, and notice the event ``BLOCK_JOB_READY``::
+
+ (QEMU) blockdev-mirror device=node-D target=node-E sync=full job-id=job0
+ {
+ "execute": "blockdev-mirror",
+ "arguments": {
+ "device": "node-D",
+ "job-id": "job0",
+ "target": "node-E",
+ "sync": "full"
+ }
+ }
+
+Query for active block jobs; there should be a 'mirror' job ready::
+
+ (QEMU) query-block-jobs
+ {
+ "execute": "query-block-jobs",
+ "arguments": {}
+ }
+ {
+ "return": [
+ {
+ "busy": false,
+ "type": "mirror",
+ "len": 21561344,
+ "paused": false,
+ "ready": true,
+ "io-status": "ok",
+ "offset": 21561344,
+ "device": "job0",
+ "speed": 0
+ }
+ ]
+ }
+
+Gracefully complete the block device job operation, and notice the
+event ``BLOCK_JOB_COMPLETED``::
+
+ (QEMU) block-job-complete device=job0
+ {
+ "execute": "block-job-complete",
+ "arguments": {
+ "device": "job0"
+ }
+ }
+ {
+ "return": {}
+ }
+
+Shut down the guest by issuing the ``quit`` QMP command::
+
+ (QEMU) quit
+ {
+ "execute": "quit",
+ "arguments": {}
+ }
+
+
+Live disk backup --- ``blockdev-backup`` and the deprecated ``drive-backup``
+----------------------------------------------------------------------------
+
+The ``blockdev-backup`` command (and the deprecated ``drive-backup``)
+allows you to create a point-in-time snapshot.
+
+In this case, the point-in-time is when you *start* the
+``blockdev-backup`` (or deprecated ``drive-backup``) command.
+
+
+QMP invocation for ``drive-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Note that the ``drive-backup`` command has been deprecated since QEMU 6.2
+and will be removed in a future release.
+
+Yet again, starting afresh with our example disk image chain::
+
+ [A] <-- [B] <-- [C] <-- [D]
+
+To create a target image [E], with content populated from image [A] to
+[D], from the above chain, the following is the syntax. (If the target
+image does not exist, ``drive-backup`` will create it)::
+
+ (QEMU) drive-backup device=node-D sync=full target=e.qcow2 job-id=job0
+ {
+ "execute": "drive-backup",
+ "arguments": {
+ "device": "node-D",
+ "job-id": "job0",
+ "sync": "full",
+ "target": "e.qcow2"
+ }
+ }
+
+Once the above ``drive-backup`` has completed, a ``BLOCK_JOB_COMPLETED`` event
+will be issued, indicating the live block device job operation has
+completed, and no further action is required.
+
+
+Moving from the deprecated ``drive-backup`` to newer ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``blockdev-backup`` differs from ``drive-backup`` in how you specify
+the backup target. With ``blockdev-backup`` you can't specify a filename
+as the target. Instead you use the ``node-name`` of an existing block
+node, which you may add with the ``blockdev-add`` or ``blockdev-create``
+commands. Correspondingly, ``blockdev-backup`` doesn't have the ``mode``
+and ``format`` arguments, which don't apply to an existing block node.
+See the following sections for details and examples.
+
+
+Notes on ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``blockdev-backup`` command operates at node-level in a Block Driver
+State (BDS) graph.
+
+E.g. the sequence of actions to create a point-in-time backup
+of an entire disk image chain, to a target, using ``blockdev-backup``
+would be:
+
+(0) Create the QCOW2 overlays, to arrive at a backing chain of desired
+ depth
+
+(1) Create the target image (using ``qemu-img``), say, ``e.qcow2``
+
+(2) Attach the above created file (``e.qcow2``), run-time, using
+ ``blockdev-add`` to QEMU
+
+(3) Perform ``blockdev-backup`` (use ``"sync": "full"`` to copy the
+ entire chain to the target). And notice the event
+ ``BLOCK_JOB_COMPLETED``
+
+(4) Shut down the guest, by issuing the QMP ``quit`` command, so that
+ caches are flushed
+
+(5) Then, finally, compare the contents of the disk image chain, and
+ the target copy with ``qemu-img compare``. You should notice:
+ "Images are identical"
+
+The following section shows an example QMP invocation for
+``blockdev-backup``.
+
+QMP invocation for ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given a disk image chain of depth 1 where image [B] is the active
+overlay (live QEMU is writing to it)::
+
+ [A] <-- [B]
+
+The following is the procedure to copy the content from the entire chain
+to a target image (say, [E]), which has the full content from [A] and
+[B].
+
+Create the overlay [B]::
+
+ (QEMU) blockdev-snapshot-sync node-name=node-A snapshot-file=b.qcow2 snapshot-node-name=node-B format=qcow2
+ {
+ "execute": "blockdev-snapshot-sync",
+ "arguments": {
+ "node-name": "node-A",
+ "snapshot-file": "b.qcow2",
+ "format": "qcow2",
+ "snapshot-node-name": "node-B"
+ }
+ }
+
+
+Create a target image that will contain the copy::
+
+ $ qemu-img create -f qcow2 e.qcow2 39M
+
+Then add it to QEMU via ``blockdev-add``::
+
+ (QEMU) blockdev-add driver=qcow2 node-name=node-E file={"driver":"file","filename":"e.qcow2"}
+ {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "node-E",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "e.qcow2"
+ }
+ }
+ }
+
+Then invoke ``blockdev-backup`` to copy the contents from the entire
+image chain, consisting of images [A] and [B] to the target image
+'e.qcow2'::
+
+ (QEMU) blockdev-backup device=node-B target=node-E sync=full job-id=job0
+ {
+ "execute": "blockdev-backup",
+ "arguments": {
+ "device": "node-B",
+ "job-id": "job0",
+ "target": "node-E",
+ "sync": "full"
+ }
+ }
+
+Once the above 'backup' operation has completed, the event
+``BLOCK_JOB_COMPLETED`` will be emitted, signalling successful
+completion.
+
+Next, query for any active block device jobs (there should be none)::
+
+ (QEMU) query-block-jobs
+ {
+ "execute": "query-block-jobs",
+ "arguments": {}
+ }
+
+Shut down the guest::
+
+ (QEMU) quit
+ {
+ "execute": "quit",
+ "arguments": {}
+ }
+ {
+ "return": {}
+ }
+
+.. note::
+ The above step is really important; if forgotten, an error, "Failed
+ to get shared "write" lock on e.qcow2", will be thrown when you do
+ ``qemu-img compare`` to verify the integrity of the disk image
+ with the backup content.
+
+
+The end result will be the image 'e.qcow2' containing a
+point-in-time backup of the disk image chain -- i.e. contents from
+images [A] and [B] at the time the ``blockdev-backup`` command was
+initiated.
+
+One way to confirm that the backup disk image has content identical to
+the disk image chain is to compare the two; you should see "Images are
+identical". (NB: this assumes QEMU was launched with the ``-S`` option,
+which does not start the CPUs at guest boot up)::
+
+ $ qemu-img compare b.qcow2 e.qcow2
+ Warning: Image size mismatch!
+ Images are identical.
+
+NOTE: The "Warning: Image size mismatch!" is expected, as we created the
+target image (e.qcow2) with 39M size.
diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
new file mode 100644
index 000000000..bdb0f2a41
--- /dev/null
+++ b/docs/interop/nbd.txt
@@ -0,0 +1,70 @@
+QEMU supports the NBD protocol, and has an internal NBD client (see
+block/nbd.c), an internal NBD server (see blockdev-nbd.c), and an
+external NBD server tool (see qemu-nbd.c). The common code is placed
+in nbd/*.
+
+The NBD protocol is specified here:
+https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md
+
+The following paragraphs describe some specific properties of the NBD
+protocol implementation in QEMU.
+
+= Metadata namespaces =
+
+QEMU supports the "base:allocation" metadata context as defined in the
+NBD protocol specification, and also defines an additional metadata
+namespace "qemu".
+
+== "qemu" namespace ==
+
+The "qemu" namespace currently contains two available metadata context
+types. The first is related to exposing the contents of a dirty
+bitmap alongside the associated disk contents. That metadata context
+is named with the following form:
+
+ qemu:dirty-bitmap:<dirty-bitmap-export-name>
+
+Each dirty-bitmap metadata context defines only one flag for extents
+in reply for NBD_CMD_BLOCK_STATUS:
+
+ bit 0: NBD_STATE_DIRTY, set when the extent is "dirty"
+
+The second is related to exposing the source of various extents within
+the image, with a single metadata context named:
+
+ qemu:allocation-depth
+
+In the allocation depth context, the entire 32-bit value represents a
+depth of which layer in a thin-provisioned backing chain provided the
+data (0 for unallocated, 1 for the active layer, 2 for the first
+backing layer, and so forth).
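+
+As an illustration only, a client could interpret the per-extent values
+of the two "qemu" contexts as follows. This is a minimal Python sketch;
+the function names are hypothetical and it assumes the 32-bit value has
+already been extracted from an NBD_CMD_BLOCK_STATUS reply.
+

```python
NBD_STATE_DIRTY = 1 << 0  # bit 0 of a "qemu:dirty-bitmap:" extent value


def is_dirty(flags):
    """True when the extent is marked dirty in the exported bitmap."""
    return bool(flags & NBD_STATE_DIRTY)


def describe_allocation_depth(value):
    """Map a "qemu:allocation-depth" extent value to a readable layer name."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("allocation depth must fit in 32 bits")
    if value == 0:
        return "unallocated"
    if value == 1:
        return "active layer"
    # 2 is the first backing layer, 3 the second, and so forth.
    return "backing layer {}".format(value - 1)


if __name__ == "__main__":
    for depth in (0, 1, 2, 3):
        print(depth, describe_allocation_depth(depth))
```
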
+
+For NBD_OPT_LIST_META_CONTEXT the following queries are supported
+in addition to the specific "qemu:allocation-depth" and
+"qemu:dirty-bitmap:<dirty-bitmap-export-name>":
+
+* "qemu:" - returns list of all available metadata contexts in the
+ namespace.
+* "qemu:dirty-bitmap:" - returns list of all available dirty-bitmap
+ metadata contexts.
+
+= Features by version =
+
+The following list documents which qemu version first implemented
+various features (both as a server exposing the feature, and as a
+client taking advantage of the feature when present), to make it
+easier to plan for cross-version interoperability. Note that in
+several cases, the initial release containing a feature may require
+additional patches from the corresponding stable branch to fix bugs in
+the operation of that feature.
+
+* 2.6: NBD_OPT_STARTTLS with TLS X.509 Certificates
+* 2.8: NBD_CMD_WRITE_ZEROES
+* 2.10: NBD_OPT_GO, NBD_INFO_BLOCK
+* 2.11: NBD_OPT_STRUCTURED_REPLY
+* 2.12: NBD_CMD_BLOCK_STATUS for "base:allocation"
+* 3.0: NBD_OPT_STARTTLS with TLS Pre-Shared Keys (PSK),
+NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
+* 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports,
+NBD_CMD_FLAG_FAST_ZERO
+* 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
diff --git a/docs/interop/parallels.txt b/docs/interop/parallels.txt
new file mode 100644
index 000000000..bb3fadf36
--- /dev/null
+++ b/docs/interop/parallels.txt
@@ -0,0 +1,232 @@
+= License =
+
+Copyright (c) 2015 Denis Lunev
+Copyright (c) 2015 Vladimir Sementsov-Ogievskiy
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+= Parallels Expandable Image File Format =
+
+A Parallels expandable image file consists of three consecutive parts:
+ * header
+ * BAT
+ * data area
+
+All numbers in a Parallels expandable image are stored in little-endian byte
+order.
+
+
+== Definitions ==
+
+ Sector A 512-byte data chunk.
+
+ Cluster A data chunk of the size specified in the image header.
+ Currently, the default size is 1MiB (2048 sectors). In previous
+ versions, cluster sizes of 63 sectors, 256 and 252 kilobytes were
+ used.
+
+ BAT Block Allocation Table, an entity that contains information for
+ guest-to-host I/O data address translation.
+
+
+== Header ==
+
+The header is placed at the start of an image and contains the following
+fields:
+
+Bytes:
+ 0 - 15: magic
+ Must contain "WithoutFreeSpace" or "WithouFreSpacExt".
+
+ 16 - 19: version
+ Must be 2.
+
+ 20 - 23: heads
+ Disk geometry parameter for guest.
+
+ 24 - 27: cylinders
+ Disk geometry parameter for guest.
+
+ 28 - 31: tracks
+ Cluster size, in sectors.
+
+ 32 - 35: nb_bat_entries
+ Disk size, in clusters (BAT size).
+
+ 36 - 43: nb_sectors
+ Disk size, in sectors.
+
+ For "WithoutFreeSpace" images:
+ Only the lowest 4 bytes are used. The highest 4 bytes must be
+ cleared in this case.
+
+ For "WithouFreSpacExt" images, there are no such
+ restrictions.
+
+ 44 - 47: in_use
+ Set to 0x746F6E59 when the image is opened by software in R/W
+ mode; set to 0x312e3276 when the image is closed.
+
+ A zero in this field means that the image was opened by an old
+ version of the software that doesn't support Format Extension
+ (see below).
+
+ Other values are not allowed.
+
+ 48 - 51: data_off
+ An offset, in sectors, from the start of the file to the start of
+ the data area.
+
+ For "WithoutFreeSpace" images:
+ - If data_off is zero, the offset is calculated as the end of BAT
+ table plus some padding to ensure sector size alignment.
+ - If data_off is non-zero, the offset should be aligned to sector
+ size. However it is recommended to align it to cluster size for
+ newly created images.
+
+ For "WithouFreSpacExt" images:
+ data_off must be non-zero and aligned to cluster size.
+
+ 52 - 55: flags
+ Miscellaneous flags.
+
+ Bit 0: Empty Image bit. If set, the image should be
+ considered clear.
+
+ Bits 1-31: Unused.
+
+ 56 - 63: ext_off
+ Format Extension offset, an offset, in sectors, from the start of
+ the file to the start of the Format Extension Cluster.
+
+ ext_off must meet the same requirements as cluster offsets
+ defined by BAT entries (see below).
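+
+The header layout above can be sketched with Python's ``struct`` module.
+This is a hedged illustration, not QEMU's implementation: the names
+``ParallelsHeader`` and ``parse_header`` are hypothetical, and only the
+checks stated explicitly in the field descriptions are performed.
+

```python
import struct
from collections import namedtuple

# Little-endian, no padding: 16-byte magic, five 32-bit fields, one 64-bit
# field, three 32-bit fields and a final 64-bit field (64 bytes in total).
HEADER = struct.Struct("<16sIIIIIQIIIQ")

ParallelsHeader = namedtuple(
    "ParallelsHeader",
    "magic version heads cylinders tracks nb_bat_entries nb_sectors "
    "in_use data_off flags ext_off",
)

MAGIC_PLAIN = b"WithoutFreeSpace"
MAGIC_EXT = b"WithouFreSpacExt"


def parse_header(buf):
    """Decode the first 64 bytes of an image and apply the basic checks."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer shorter than the 64-byte header")
    hdr = ParallelsHeader._make(HEADER.unpack_from(buf))
    if hdr.magic not in (MAGIC_PLAIN, MAGIC_EXT):
        raise ValueError("unknown magic")
    if hdr.version != 2:
        raise ValueError("version must be 2")
    # For "WithoutFreeSpace" images the highest 4 bytes must be cleared.
    if hdr.magic == MAGIC_PLAIN and hdr.nb_sectors >> 32:
        raise ValueError("nb_sectors uses the upper 32 bits")
    return hdr
```

+
+Note that, as documented above, the ``tracks`` field holds the cluster
+size in sectors.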
+
+
+== BAT ==
+
+BAT is placed immediately after the image header. In the file, BAT is a
+contiguous array of 32-bit unsigned little-endian integers,
+(nb_bat_entries * 4) bytes in size.
+
+Each BAT entry contains an offset from the start of the file to the
+corresponding cluster. The offset is expressed in clusters for
+"WithouFreSpacExt" images and in sectors for "WithoutFreeSpace" images.
+
+If a BAT entry is zero, the corresponding cluster is not allocated and should
+be considered as filled with zeroes.
+
+Cluster offsets specified by BAT entries must meet the following requirements:
+ - the value must not be lower than data offset (provided by header.data_off
+ or calculated as specified above),
+ - the value must be lower than the desired file size,
+ - the value must be unique among all BAT entries,
+ - the result of (cluster offset - data offset) must be aligned to cluster
+ size.
+
+
+== Data Area ==
+
+The data area is an area from the data offset (provided by header.data_off or
+calculated as specified above) to the end of the file. It represents a
+contiguous array of clusters. Most of them are allocated by the BAT, some may
+be allocated by the ext_off field in the header while others may be allocated by
+extensions. All clusters allocated by ext_off and extensions should meet the
+same requirements as clusters specified by BAT entries.
+
+
+== Format Extension ==
+
+The Format Extension is an area 1 cluster in size that provides additional
+format features. This cluster is addressed by the ext_off field in the header.
+The format of the Format Extension area is the following:
+
+ 0 - 7: magic
+ Must be 0xAB234CEF23DCEA87
+
+ 8 - 23: m_CheckSum
+ The MD5 checksum of the entire Format Extension cluster except
+ the first 24 bytes.
+
+ The above are followed by feature sections or "extensions". The last
+ extension must be "End of features" (see below).
+
+Each feature section has the following format:
+
+ 0 - 7: magic
+ The identifier of the feature:
+ 0x0000000000000000 - End of features
+ 0x20385FAE252CB34A - Dirty bitmap
+
+ 8 - 15: flags
+ External flags for extension:
+
+ Bit 0: NECESSARY
+ If the software cannot load the extension (due to an
+ unknown magic number or error), the file should not be
+ changed. If this flag is unset and there is an error on
+ loading the extension, said extension should be dropped.
+
+ Bit 1: TRANSIT
+ If there is an unknown extension with this flag set,
+ said extension should be left as is.
+
+ If neither NECESSARY nor TRANSIT are set, the extension should be
+ dropped.
+
+ 16 - 19: data_size
+ The size of the following feature data, in bytes.
+
+ 20 - 23: unused32
+ Align header to 8 bytes boundary.
+
+ variable: data (data_size bytes)
+
+ The above is followed by padding to the next 8 bytes boundary, then the
+ next extension starts.
+
+ The last extension must be "End of features" with all the fields set to 0.
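+
+Walking the feature sections can be sketched as follows. This is a
+minimal Python illustration under stated assumptions: the name
+``iter_features`` is hypothetical, the magic is read as a little-endian
+64-bit integer (following the "all numbers are little-endian" rule), the
+padding is computed by rounding ``data_size`` up to a multiple of 8, and
+the MD5 checksum is not verified.
+

```python
import struct

EXT_PREFIX_SIZE = 24                 # 8-byte magic + 16-byte m_CheckSum
SECTION = struct.Struct("<QQII")     # magic, flags, data_size, unused32

FEATURE_END = 0x0000000000000000
FEATURE_DIRTY_BITMAP = 0x20385FAE252CB34A


def iter_features(cluster):
    """Yield (magic, flags, data) for each section before "End of features"."""
    pos = EXT_PREFIX_SIZE
    while True:
        magic, flags, data_size, _unused = SECTION.unpack_from(cluster, pos)
        if magic == FEATURE_END:
            return
        start = pos + SECTION.size
        data = cluster[start:start + data_size]
        if len(data) != data_size:
            raise ValueError("feature data runs past the end of the cluster")
        yield magic, flags, data
        # Skip the data plus padding up to the next 8-byte boundary.
        pos = start + (data_size + 7) // 8 * 8
```

+
+A generator keeps the sketch independent of how a caller wants to react
+to the NECESSARY and TRANSIT flags of unknown extensions.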
+
+
+=== Dirty bitmaps feature ===
+
+This feature provides a way of storing dirty bitmaps in the image. The fields
+of its data area are:
+
+ 0 - 7: size
+ The bitmap size, should be equal to disk size in sectors.
+
+ 8 - 23: id
+ An identifier for backup consistency checking.
+
+ 24 - 27: granularity
+ Bitmap granularity, in sectors. I.e., the number of sectors
+ corresponding to one bit of the bitmap. Granularity must be
+ a power of 2.
+
+ 28 - 31: l1_size
+ The number of entries in the L1 table of the bitmap.
+
+ variable: L1 offset table (l1_table), size: 8 * l1_size bytes
+
+The dirty bitmap described by this feature extension is stored in a set of
+clusters inside the Parallels image file. The offsets of these clusters are
+saved in the L1 offset table specified by the feature extension. Each L1 table
+entry is a 64 bit integer as described below:
+
+Given an offset in bytes into the bitmap data, the corresponding L1 entry is
+
+ l1_table[offset / cluster_size]
+
+If an L1 table entry is 0, all bits in the corresponding cluster of the bitmap
+are assumed to be 0.
+
+If an L1 table entry is 1, all bits in the corresponding cluster of the bitmap
+are assumed to be 1.
+
+If an L1 table entry is not 0 or 1, it contains the corresponding cluster
+offset (in 512-byte sectors). Given an offset in bytes into the bitmap data,
+the offset in bytes into the image file can be obtained as follows:
+
+ offset = l1_table[offset / cluster_size] * 512 + (offset % cluster_size)
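+
+The three cases for an L1 entry can be put together in a few lines of
+Python. This is a sketch for illustration only; the function name
+``locate_bitmap_byte`` and its return convention are hypothetical.
+

```python
SECTOR_SIZE = 512


def locate_bitmap_byte(l1_table, offset, cluster_size):
    """Resolve a byte offset into the bitmap data.

    Returns ("zeros", None) or ("ones", None) for the two special L1
    values, and ("data", file_offset) when the entry points at a cluster.
    """
    entry = l1_table[offset // cluster_size]
    if entry == 0:
        return ("zeros", None)
    if entry == 1:
        return ("ones", None)
    # The entry is a cluster offset in 512-byte sectors.
    return ("data", entry * SECTOR_SIZE + offset % cluster_size)


if __name__ == "__main__":
    table = [0, 1, 4096]
    print(locate_bitmap_byte(table, 2 * 1048576 + 10, 1048576))
```
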
diff --git a/docs/interop/pr-helper.rst b/docs/interop/pr-helper.rst
new file mode 100644
index 000000000..e926f0a6c
--- /dev/null
+++ b/docs/interop/pr-helper.rst
@@ -0,0 +1,83 @@
+..
+
+======================================
+Persistent reservation helper protocol
+======================================
+
+QEMU's SCSI passthrough devices, ``scsi-block`` and ``scsi-generic``,
+can delegate implementation of persistent reservations to an external
+(and typically privileged) program. Persistent Reservations allow
+restricting access to block devices to specific initiators in a shared
+storage setup.
+
+For a more detailed reference please refer to the SCSI Primary
+Commands standard, specifically the section on Reservations and the
+"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands.
+
+This document describes the socket protocol used between QEMU's
+``pr-manager-helper`` object and the external program.
+
+.. contents::
+
+Connection and initialization
+-----------------------------
+
+All data transmitted on the socket is big-endian.
+
+After QEMU connects to the helper program's socket, the helper starts a simple
+feature negotiation process by writing four bytes corresponding to
+the features it exposes (``supported_features``). QEMU reads it,
+then writes four bytes corresponding to the desired features of the
+helper program (``requested_features``).
+
+If a bit is 1 in ``requested_features`` and 0 in ``supported_features``,
+the corresponding feature is not supported by the helper and the connection
+is closed. On the other hand, it is acceptable for a bit to be 0 in
+``requested_features`` and 1 in ``supported_features``; in this case,
+the helper will not enable the feature.
+
+Right now no feature is defined, so the two parties always write four
+zero bytes.
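+
+The negotiation rule can be sketched from the client's point of view as
+follows. This is a minimal Python illustration, not QEMU's code: the
+function name ``choose_features`` is hypothetical, and raising
+``ConnectionError`` stands in for closing the connection.
+

```python
import struct

FEATURES = struct.Struct(">I")  # four bytes, big-endian


def choose_features(supported_bytes, requested_features=0):
    """Return the four bytes to write back, or raise if a bit is unsupported."""
    (supported_features,) = FEATURES.unpack(supported_bytes)
    missing = requested_features & ~supported_features
    if missing:
        # A requested bit that the helper does not support ends the session.
        raise ConnectionError("helper lacks features 0x%08x" % missing)
    # Bits that are supported but not requested are simply left disabled.
    return FEATURES.pack(requested_features)


if __name__ == "__main__":
    # No feature is defined right now, so both sides exchange zero bytes.
    print(choose_features(b"\x00\x00\x00\x00"))
```
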
+
+Command format
+--------------
+
+It is invalid to send multiple commands concurrently on the same
+socket. It is however possible to connect multiple sockets to the
+helper and send multiple commands to the helper for one or more
+file descriptors.
+
+A command consists of a request and a response. A request consists
+of a 16-byte SCSI CDB. A file descriptor must be passed to the helper
+together with the SCSI CDB using ancillary data.
+
+The CDB has the following limitations:
+
+- the command (stored in the first byte) must be one of 0x5E
+ (PERSISTENT RESERVE IN) or 0x5F (PERSISTENT RESERVE OUT).
+
+- the allocation length (stored in bytes 7-8 of the CDB for PERSISTENT
+ RESERVE IN) or parameter list length (stored in bytes 5-8 of the CDB
+ for PERSISTENT RESERVE OUT) is limited to 8 KiB.
+
+For PERSISTENT RESERVE OUT, the parameter list is sent right after the
+CDB. The length of the parameter list is taken from the CDB itself.
+
+The helper's reply has the following structure:
+
+- 4 bytes for the SCSI status
+
+- 4 bytes for the payload size (nonzero only for PERSISTENT RESERVE IN
+ and only if the SCSI status is 0x00, i.e. GOOD)
+
+- 96 bytes for the SCSI sense data
+
+- if the size is nonzero, the payload follows
+
+The sense data is always sent to keep the protocol simple, even though
+it is only valid if the SCSI status is CHECK CONDITION (0x02).
+
+The payload size is always less than or equal to the allocation length
+specified in the CDB for the PERSISTENT RESERVE IN command.
+
+If the protocol is violated, the helper closes the socket.
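+
+The reply layout described above can be decoded with a short sketch like
+the one below. The names ``REPLY_HEADER`` and ``parse_reply`` are
+hypothetical, and the sketch assumes the whole reply is already in memory.
+
```python
import struct

# 4-byte SCSI status, 4-byte payload size, 96 bytes of sense data,
# all big-endian as required for data on the socket.
REPLY_HEADER = struct.Struct(">II96s")


def parse_reply(data: bytes):
    """Split a helper reply into (status, sense, payload)."""
    status, size, sense = REPLY_HEADER.unpack_from(data, 0)
    # The payload, if any, follows the fixed 104-byte header.
    payload = data[REPLY_HEADER.size:REPLY_HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated payload")
    return status, sense, payload
```
+
+The sense bytes are only meaningful when the returned status is 0x02
+(CHECK CONDITION); callers should ignore them otherwise.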
diff --git a/docs/interop/prl-xml.txt b/docs/interop/prl-xml.txt
new file mode 100644
index 000000000..7031f8752
--- /dev/null
+++ b/docs/interop/prl-xml.txt
@@ -0,0 +1,158 @@
+= License =
+
+Copyright (c) 2015-2017, Virtuozzo, Inc.
+Authors:
+ 2015 Denis Lunev <den@openvz.org>
+ 2015 Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
+ 2016-2017 Klim Kireev <klim.kireev@virtuozzo.com>
+ 2016-2017 Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+This specification contains the minimal information about the Parallels Disk
+Format that is needed to work properly with QEMU. Parallels Cloud Server
+and Parallels Desktop may add unspecified nodes to the XML and use them, but
+these are for internal use and don't affect functionality. They also use an
+auxiliary XML file, "Snapshot.xml", which stores optional snapshot
+information but doesn't influence open/read/write functionality. QEMU and
+other software should not use fields that are not covered in this document,
+nor the Snapshot.xml file, and must leave them as is.
+
+= Parallels Disk Format =
+
+A Parallels disk consists of two parts: the set of snapshots and the disk
+descriptor file, which stores information about all files and snapshots.
+
+== Definitions ==
+ Snapshot a record of the contents captured at a particular time,
+ capable of storing current state. A snapshot has UUID and
+ parent UUID.
+
+ Snapshot image an overlay representing the difference between this
+ snapshot and some earlier snapshot.
+
+ Overlay an image storing the different sectors between two captured
+ states.
+
+ Root image snapshot image with no parent, the root of snapshot tree.
+
+ Storage the backing storage for a subset of the virtual disk. When
+ there is more than one storage in a Parallels disk then that
+ is referred to as a split image. In this case every storage
+ covers specific address space area of the disk and has its
+ particular root image. Split images are not considered here
+ and are not supported. Each storage consists of disk
+ parameters and a list of images. The list of images always
+ contains a root image and may also contain overlays. The
+ root image can be an expandable Parallels image file or
+ plain. Overlays must be expandable.
+
+ Description DiskDescriptor.xml stores information about disk parameters,
+ file snapshots, storages.
+
+ Top Snapshot The overlay between the actual state and some previous
+ snapshot. It is not a snapshot in the classical sense because
+ it serves as the active image that the guest writes to.
+
+ Sector a 512-byte data chunk.
+
+== Description file ==
+All information is placed in a single XML element, Parallels_disk_image.
+The element has only one attribute, "Version", which must be 1.0.
+Schema of DiskDescriptor.xml:
+
+<Parallels_disk_image Version="1.0">
+ <Disk_Parameters>
+ ...
+ </Disk_Parameters>
+ <StorageData>
+ ...
+ </StorageData>
+ <Snapshots>
+ ...
+ </Snapshots>
+</Parallels_disk_image>
+
+== Disk_Parameters element ==
+The Disk_Parameters element describes the physical layout of the virtual disk
+and some general settings.
+
+The Disk_Parameters element MUST contain the following child elements:
+ * Disk_size - number of sectors in the disk,
+ desired size of the disk.
+ * Cylinders - number of the disk cylinders.
+ * Heads - number of the disk heads.
+ * Sectors - number of the disk sectors per track
+ (sector size is 512 bytes)
+ Limitation: Product of the Heads, Sectors and Cylinders
+ values MUST be equal to the value of the Disk_size parameter.
+ * Padding - must be 0. Parallels Cloud Server and Parallels Desktop may
+ use padding set to 1, however this case is not covered
+ by this spec, QEMU and other software should not open
+ such disks and should not create them.
+
+== StorageData element ==
+This element of the file describes the root image and all snapshot images.
+
+The StorageData element consists of the Storage child element, as shown below:
+<StorageData>
+ <Storage>
+ ...
+ </Storage>
+</StorageData>
+
+A Storage element has the following child elements:
+ * Start - start sector of the storage; for a non-split storage
+ it equals 0.
+ * End - number of the sector following the last sector; for a
+ non-split storage it equals Disk_size.
+ * Blocksize - storage cluster size, number of sectors per one cluster.
+ Cluster size for each "Compressed" (see below) image in
+ parallels disk must be equal to this field. Note: cluster
+ size for Parallels Expandable Image is in 'tracks' field of
+ its header (see docs/interop/parallels.txt).
+ * Several Image child elements.
+
+Each Image element has the following child elements:
+ * GUID - image identifier, UUID in curly brackets.
+ For instance, {12345678-9abc-def1-2345-6789abcdef12}.
+ The GUID is used by the Snapshots element to reference images
+ (see below)
+ * Type - image type of the element. It can be:
+ "Plain" for raw files.
+ "Compressed" for expanding disks.
+ * File - path to image file. Path can be relative to DiskDescriptor.xml or
+ absolute.
+
+== Snapshots element ==
+The Snapshots element describes the snapshot relations with the snapshot tree.
+
+The element contains the set of Shot child elements, as shown below:
+<Snapshots>
+ <TopGUID> ... </TopGUID> /* Optional child element */
+ <Shot>
+ ...
+ </Shot>
+ <Shot>
+ ...
+ </Shot>
+ ...
+</Snapshots>
+
+Each Shot element contains the following child elements:
+ * GUID - an image GUID.
+ * ParentGUID - GUID of the image of the parent snapshot.
+
+The software may traverse snapshots from child to parent using the <ParentGUID>
+field as a reference. The ParentGUID of the root snapshot is
+{00000000-0000-0000-0000-000000000000}. There should be only one root
+snapshot. The Top snapshot can be identified in two ways: via the TopGUID child
+element of the Snapshots element or via the predefined GUID
+{5fbaabe3-6958-40ff-92a7-860e329aab41}. If TopGUID is defined, the predefined
+GUID is interpreted as an ordinary GUID. All snapshot images (except the Top
+Snapshot) should be opened read-only. There is another predefined GUID,
+BackupID = {704718e1-2314-44c8-9087-d78ed36b0f4e}, which is used by the original
+and some third-party software for backup. QEMU and other software may operate
+on images with GUID = BackupID as usual; however, using this GUID for new disks
+is not recommended. The Top snapshot cannot have this GUID.
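+
+The child-to-parent traversal can be sketched as follows. This is an
+illustrative example only: the function name snapshot_chain is hypothetical,
+GUIDs are compared as exact strings, and the XML is assumed to follow the
+schema shown in this document.
+
```python
import xml.etree.ElementTree as ET

NULL_GUID = "{00000000-0000-0000-0000-000000000000}"
DEFAULT_TOP_GUID = "{5fbaabe3-6958-40ff-92a7-860e329aab41}"


def snapshot_chain(xml_text: str) -> list:
    """Return image GUIDs from the Top snapshot down to the root."""
    snapshots = ET.fromstring(xml_text).find("Snapshots")
    # TopGUID is optional; without it the predefined GUID marks the top.
    top = snapshots.findtext("TopGUID")
    guid = top.strip() if top else DEFAULT_TOP_GUID
    parents = {
        shot.findtext("GUID").strip(): shot.findtext("ParentGUID").strip()
        for shot in snapshots.findall("Shot")
    }
    chain = []
    # The root snapshot is the one whose parent is the all-zero GUID.
    while guid != NULL_GUID:
        if guid in chain:
            raise ValueError("cycle in snapshot tree")
        chain.append(guid)
        guid = parents[guid]
    return chain
```
+
+Every image in the returned list except the first one would then be opened
+read-only.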
diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
new file mode 100644
index 000000000..f7dc304ff
--- /dev/null
+++ b/docs/interop/qcow2.txt
@@ -0,0 +1,901 @@
+== General ==
+
+A qcow2 image file is organized in units of constant size, which are called
+(host) clusters. A cluster is the unit in which all allocations are done,
+both for actual guest data and for image metadata.
+
+Likewise, the virtual disk as seen by the guest is divided into (guest)
+clusters of the same size.
+
+All numbers in qcow2 are stored in Big Endian byte order.
+
+
+== Header ==
+
+The first cluster of a qcow2 image contains the file header:
+
+ Byte 0 - 3: magic
+ QCOW magic string ("QFI\xfb")
+
+ 4 - 7: version
+ Version number (valid values are 2 and 3)
+
+ 8 - 15: backing_file_offset
+ Offset into the image file at which the backing file name
+ is stored (NB: The string is not null terminated). 0 if the
+ image doesn't have a backing file.
+
+ Note: backing files are incompatible with raw external data
+ files (auto-clear feature bit 1).
+
+ 16 - 19: backing_file_size
+ Length of the backing file name in bytes. Must not be
+ longer than 1023 bytes. Undefined if the image doesn't have
+ a backing file.
+
+ 20 - 23: cluster_bits
+ Number of bits that are used for addressing an offset
+ within a cluster (1 << cluster_bits is the cluster size).
+ Must not be less than 9 (i.e. 512 byte clusters).
+
+ Note: qemu as of today has an implementation limit of 2 MB
+ as the maximum cluster size and won't be able to open images
+ with larger cluster sizes.
+
+ Note: if the image has Extended L2 Entries then cluster_bits
+ must be at least 14 (i.e. 16384 byte clusters).
+
+ 24 - 31: size
+ Virtual disk size in bytes.
+
+ Note: qemu has an implementation limit of 32 MB as
+ the maximum L1 table size. With a 2 MB cluster
+ size, it is unable to populate a virtual cluster
+ beyond 2 EB (61 bits); with a 512 byte cluster
+ size, it is unable to populate a virtual size
+ larger than 128 GB (37 bits). Meanwhile, L1/L2
+ table layouts limit an image to no more than 64 PB
+ (56 bits) of populated clusters, and an image may
+ hit other limits first (such as a file system's
+ maximum size).
+
+ 32 - 35: crypt_method
+ 0 for no encryption
+ 1 for AES encryption
+ 2 for LUKS encryption
+
+ 36 - 39: l1_size
+ Number of entries in the active L1 table
+
+ 40 - 47: l1_table_offset
+ Offset into the image file at which the active L1 table
+ starts. Must be aligned to a cluster boundary.
+
+ 48 - 55: refcount_table_offset
+ Offset into the image file at which the refcount table
+ starts. Must be aligned to a cluster boundary.
+
+ 56 - 59: refcount_table_clusters
+ Number of clusters that the refcount table occupies
+
+ 60 - 63: nb_snapshots
+ Number of snapshots contained in the image
+
+ 64 - 71: snapshots_offset
+ Offset into the image file at which the snapshot table
+ starts. Must be aligned to a cluster boundary.
+
+For version 2, the header is exactly 72 bytes in length, and finishes here.
+For version 3 or higher, the header length is at least 104 bytes, including
+the next fields through header_length.
+
+ 72 - 79: incompatible_features
+ Bitmask of incompatible features. An implementation must
+ fail to open an image if an unknown bit is set.
+
+ Bit 0: Dirty bit. If this bit is set then refcounts
+ may be inconsistent, make sure to scan L1/L2
+ tables to repair refcounts before accessing the
+ image.
+
+ Bit 1: Corrupt bit. If this bit is set then any data
+ structure may be corrupt and the image must not
+ be written to (unless for regaining
+ consistency).
+
+ Bit 2: External data file bit. If this bit is set, an
+ external data file is used. Guest clusters are
+ then stored in the external data file. For such
+ images, clusters in the external data file are
+ not refcounted. The offset field in the
+ Standard Cluster Descriptor must match the
+ guest offset and neither compressed clusters
+ nor internal snapshots are supported.
+
+ An External Data File Name header extension may
+ be present if this bit is set.
+
+ Bit 3: Compression type bit. If this bit is set,
+ a non-default compression is used for compressed
+ clusters. The compression_type field must be
+ present and not zero.
+
+ Bit 4: Extended L2 Entries. If this bit is set then
+ L2 table entries use an extended format that
+ allows subcluster-based allocation. See the
+ Extended L2 Entries section for more details.
+
+ Bits 5-63: Reserved (set to 0)
+
+ 80 - 87: compatible_features
+ Bitmask of compatible features. An implementation can
+ safely ignore any unknown bits that are set.
+
+ Bit 0: Lazy refcounts bit. If this bit is set then
+ lazy refcount updates can be used. This means
+ marking the image file dirty and postponing
+ refcount metadata updates.
+
+ Bits 1-63: Reserved (set to 0)
+
+ 88 - 95: autoclear_features
+ Bitmask of auto-clear features. An implementation may only
+ write to an image with unknown auto-clear features if it
+ clears the respective bits from this field first.
+
+ Bit 0: Bitmaps extension bit
+ This bit indicates consistency for the bitmaps
+ extension data.
+
+ It is an error if this bit is set without the
+ bitmaps extension present.
+
+ If the bitmaps extension is present but this
+ bit is unset, the bitmaps extension data must be
+ considered inconsistent.
+
+ Bit 1: Raw external data bit
+ If this bit is set, the external data file can
+ be read as a consistent standalone raw image
+ without looking at the qcow2 metadata.
+
+ Setting this bit has a performance impact for
+ some operations on the image (e.g. writing
+ zeros requires writing to the data file instead
+ of only setting the zero flag in the L2 table
+ entry) and conflicts with backing files.
+
+ This bit may only be set if the External Data
+ File bit (incompatible feature bit 2) is also
+ set.
+
+ Bits 2-63: Reserved (set to 0)
+
+ 96 - 99: refcount_order
+ Describes the width of a reference count block entry (width
+ in bits: refcount_bits = 1 << refcount_order). For version 2
+ images, the order is always assumed to be 4
+ (i.e. refcount_bits = 16).
+ This value may not exceed 6 (i.e. refcount_bits = 64).
+
+ 100 - 103: header_length
+ Length of the header structure in bytes. For version 2
+ images, the length is always assumed to be 72 bytes.
+ For version 3 it's at least 104 bytes and must be a multiple
+ of 8.
+
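+The first 72 bytes of the header, which versions 2 and 3 share, can be
+decoded with a sketch such as the following. The names V2_HEADER,
+FIELD_NAMES and parse_header are hypothetical, and only a few of the
+constraints listed above are checked.
+
```python
import struct

# Field layout of the first 72 bytes, all big-endian with no padding.
V2_HEADER = struct.Struct(">4sIQIIQIIQQIIQ")
FIELD_NAMES = (
    "magic", "version", "backing_file_offset", "backing_file_size",
    "cluster_bits", "size", "crypt_method", "l1_size", "l1_table_offset",
    "refcount_table_offset", "refcount_table_clusters", "nb_snapshots",
    "snapshots_offset",
)


def parse_header(buf: bytes) -> dict:
    """Decode and sanity-check the version 2 part of a qcow2 header."""
    header = dict(zip(FIELD_NAMES, V2_HEADER.unpack_from(buf)))
    if header["magic"] != b"QFI\xfb":
        raise ValueError("not a qcow2 image")
    if header["version"] not in (2, 3):
        raise ValueError("unsupported version")
    if header["cluster_bits"] < 9:
        raise ValueError("cluster_bits must be at least 9")
    # 1 << cluster_bits is the cluster size in bytes.
    header["cluster_size"] = 1 << header["cluster_bits"]
    return header
```
+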
+
+=== Additional fields (version 3 and higher) ===
+
+In general, these fields are optional and may be safely ignored by the software,
+as well as filled by zeros (which is equal to field absence), if software needs
+to set field B, but does not care about field A which precedes B. More
+formally, additional fields have the following compatibility rules:
+
+1. If the value of the additional field must not be ignored for correct
+handling of the file, it will be accompanied by a corresponding incompatible
+feature bit.
+
+2. If there are no unrecognized incompatible feature bits set, an unknown
+additional field may be safely ignored other than preserving its value when
+rewriting the image header.
+
+3. An explicit value of 0 will have the same behavior as when the field is not
+present*, if not altered by a specific incompatible bit.
+
+*. A field is considered not present when header_length is less than or equal
+to the field's offset. Also, all additional fields are not present for
+version 2.
+
+ 104: compression_type
+
+ Defines the compression method used for compressed clusters.
+ All compressed clusters in an image use the same compression
+ type.
+
+ If the incompatible bit "Compression type" is set: the field
+ must be present and non-zero (which means non-zlib
+ compression type). Otherwise, this field must not be present
+ or must be zero (which means zlib).
+
+ Available compression type values:
+ 0: zlib <https://www.zlib.net/>
+ 1: zstd <http://github.com/facebook/zstd>
+
+
+=== Header padding ===
+
+@header_length must be a multiple of 8, which means that if the end of the last
+additional field is not aligned, some padding is needed. This padding must be
+zeroed, so that if some existing (or future) additional field falls into
+the padding, it is interpreted according to rule 3 of the previous
+section, i.e. in the same manner as when this field is not present.
+
+
+=== Header extensions ===
+
+Directly after the image header, optional sections called header extensions can
+be stored. Each extension has a structure like the following:
+
+ Byte 0 - 3: Header extension type:
+ 0x00000000 - End of the header extension area
+ 0xe2792aca - Backing file format name string
+ 0x6803f857 - Feature name table
+ 0x23852875 - Bitmaps extension
+ 0x0537be77 - Full disk encryption header pointer
+ 0x44415441 - External data file name string
+ other - Unknown header extension, can be safely
+ ignored
+
+ 4 - 7: Length of the header extension data
+
+ 8 - n: Header extension data
+
+ n - m: Padding to round up the header extension size to the next
+ multiple of 8.
+
+Unless stated otherwise, each header extension type shall appear at most once
+in the same image.
+
+If the image has a backing file then the backing file name should be stored in
+the remaining space between the end of the header extension area and the end of
+the first cluster. It is not allowed to store other data here, so that an
+implementation can safely modify the header and add extensions without harming
+data of compatible features that it doesn't support. Compatible features that
+need space for additional data can use a header extension.
+
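+Walking the header extension area can be sketched as below. The function
+name iter_header_extensions is hypothetical; the caller is assumed to pass
+the first cluster in buf and the end of the image header in start.
+
```python
import struct


def iter_header_extensions(buf: bytes, start: int):
    """Yield (type, data) for each header extension found in buf."""
    offset = start
    while offset + 8 <= len(buf):
        # Bytes 0-3: extension type, bytes 4-7: length of the data.
        ext_type, length = struct.unpack_from(">II", buf, offset)
        if ext_type == 0x00000000:
            return  # end of the header extension area
        data_start = offset + 8
        if data_start + length > len(buf):
            raise ValueError("header extension exceeds buffer")
        yield ext_type, buf[data_start:data_start + length]
        # The data is padded so the next extension is 8-byte aligned.
        offset = data_start + ((length + 7) & ~7)
```
+
+For a version 3 image start would be header_length, and for a version 2
+image it would be 72. Unknown types are simply yielded to the caller,
+which can ignore them.
+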
+
+== String header extensions ==
+
+Some header extensions (such as the backing file format name and the external
+data file name) are just a single string. In this case, the header extension
+length is the string length and the string is not '\0' terminated. (The header
+extension padding can make it look like a string is '\0' terminated, but
+neither is padding always necessary nor is there a guarantee that zero bytes
+are used for padding.)
+
+
+== Feature name table ==
+
+The feature name table is an optional header extension that contains the name
+for features used by the image. It can be used by applications that don't know
+the respective feature (e.g. because the feature was introduced only later) to
+display a useful error message.
+
+The number of entries in the feature name table is determined by the length of
+the header extension data. Each entry looks like this:
+
+ Byte 0: Type of feature (select feature bitmap)
+ 0: Incompatible feature
+ 1: Compatible feature
+ 2: Autoclear feature
+
+ 1: Bit number within the selected feature bitmap (valid
+ values: 0-63)
+
+ 2 - 47: Feature name (padded with zeros, but not necessarily null
+ terminated if it has full length)
+
+
+== Bitmaps extension ==
+
+The bitmaps extension is an optional header extension. It provides the ability
+to store bitmaps related to a virtual disk. For now, there is only one bitmap
+type: the dirty tracking bitmap, which tracks virtual disk changes from some
+point in time.
+
+The data of the extension should be considered consistent only if the
+corresponding auto-clear feature bit is set, see autoclear_features above.
+
+The fields of the bitmaps extension are:
+
+ Byte 0 - 3: nb_bitmaps
+ The number of bitmaps contained in the image. Must be
+ greater than or equal to 1.
+
+ Note: QEMU currently only supports up to 65535 bitmaps per
+ image.
+
+ 4 - 7: Reserved, must be zero.
+
+ 8 - 15: bitmap_directory_size
+ Size of the bitmap directory in bytes. It is the cumulative
+ size of all (nb_bitmaps) bitmap directory entries.
+
+ 16 - 23: bitmap_directory_offset
+ Offset into the image file at which the bitmap directory
+ starts. Must be aligned to a cluster boundary.
+
+== Full disk encryption header pointer ==
+
+The full disk encryption header must be present if, and only if, the
+'crypt_method' header requires metadata. Currently this is only true
+of the 'LUKS' crypt method. The header extension must be absent for
+other methods.
+
+This header provides the offset at which the crypt method can store
+its additional data, as well as the length of such data.
+
+ Byte 0 - 7: Offset into the image file at which the encryption
+ header starts in bytes. Must be aligned to a cluster
+ boundary.
+ Byte 8 - 15: Length of the written encryption header in bytes.
+ Note actual space allocated in the qcow2 file may
+ be larger than this value, since it will be rounded
+ to the nearest multiple of the cluster size. Any
+ unused bytes in the allocated space will be initialized
+ to 0.
+
+For the LUKS crypt method, the encryption header works as follows.
+
+The first 592 bytes of the header clusters will contain the LUKS
+partition header. This is then followed by the key material data areas.
+The size of the key material data areas is determined by the number of
+stripes in the key slot and key size. Refer to the LUKS format
+specification ('docs/on-disk-format.pdf' in the cryptsetup source
+package) for details of the LUKS partition header format.
+
+In the LUKS partition header, the "payload-offset" field will be
+calculated as normal for the LUKS spec, i.e. the size of the LUKS
+header, plus key material regions, plus padding, relative to the
+start of the LUKS header. This offset value is not required to be
+qcow2 cluster aligned. Its value is currently never used in the
+context of qcow2, since the qcow2 file format itself defines where
+the real payload offset is, but nonetheless a valid payload offset
+should always be present.
+
+In the LUKS key slots header, the "key-material-offset" is relative
+to the start of the LUKS header clusters in the qcow2 container,
+not the start of the qcow2 file.
+
+Logically the layout looks like
+
+ +-----------------------------+
+ | QCow2 header |
+ | QCow2 header extension X |
+ | QCow2 header extension FDE |
+ | QCow2 header extension ... |
+ | QCow2 header extension Z |
+ +-----------------------------+
+ | ....other QCow2 tables.... |
+ . .
+ . .
+ +-----------------------------+
+ | +-------------------------+ |
+ | | LUKS partition header | |
+ | +-------------------------+ |
+ | | LUKS key material 1 | |
+ | +-------------------------+ |
+ | | LUKS key material 2 | |
+ | +-------------------------+ |
+ | | LUKS key material ... | |
+ | +-------------------------+ |
+ | | LUKS key material 8 | |
+ | +-------------------------+ |
+ +-----------------------------+
+ | QCow2 cluster payload |
+ . .
+ . .
+ . .
+ | |
+ +-----------------------------+
+
+== Data encryption ==
+
+When an encryption method is requested in the header, the image payload
+data must be encrypted/decrypted on every write/read. The image headers
+and metadata are never encrypted.
+
+The algorithms used for encryption vary depending on the method:
+
+ - AES:
+
+ The AES cipher, in CBC mode, with 256 bit keys.
+
+ Initialization vectors generated using plain64 method, with
+ the virtual disk sector as the input tweak.
+
+ This format is no longer supported in QEMU system emulators, due
+ to a number of design flaws affecting its security. It is only
+ supported in the command line tools for the sake of back compatibility
+ and data liberation.
+
+ - LUKS:
+
+ The algorithms are specified in the LUKS header.
+
+ Initialization vectors generated using the method specified
+ in the LUKS header, with the physical disk sector as the
+ input tweak.
+
+== Host cluster management ==
+
+qcow2 manages the allocation of host clusters by maintaining a reference count
+for each host cluster. A refcount of 0 means that the cluster is free, 1 means
+that it is used, and >= 2 means that it is used and any write access must
+perform a COW (copy on write) operation.
+
+The refcounts are managed in a two-level table. The first level is called
+refcount table and has a variable size (which is stored in the header). The
+refcount table can cover multiple clusters, however it needs to be contiguous
+in the image file.
+
+It contains pointers to the second level structures which are called refcount
+blocks and are exactly one cluster in size.
+
+Although a large enough refcount table can reserve clusters past 64 PB
+(56 bits) (assuming the underlying protocol can even be sized that
+large), note that some qcow2 metadata such as L1/L2 tables must point
+to clusters prior to that point.
+
+Note: qemu has an implementation limit of 8 MB as the maximum refcount
+table size. With a 2 MB cluster size and a default refcount_order of
+4, it is unable to reference host resources beyond 2 EB (61 bits); in
+the worst case, with a 512 byte cluster size and refcount_order of 6, it is
+unable to access beyond 32 GB (35 bits).
+
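+The two limits in the note above follow from simple arithmetic, which can
+be checked with a sketch like this one. The function name
+max_refcounted_bytes is hypothetical; the 8 MB table size is the
+implementation limit quoted above.
+
```python
def max_refcounted_bytes(table_bytes: int, cluster_bits: int,
                         refcount_order: int) -> int:
    """Bytes of image file that a refcount table of this size can cover."""
    table_entries = table_bytes // 8  # each refcount table entry is 64 bits
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    # Each refcount block is one cluster and holds this many refcounts.
    block_entries = cluster_size * 8 // refcount_bits
    return table_entries * block_entries * cluster_size


EIGHT_MB = 8 << 20
# 2 MB clusters, refcount_order 4: 2^20 * 2^20 * 2^21 = 2^61 bytes (2 EB)
assert max_refcounted_bytes(EIGHT_MB, 21, 4) == 1 << 61
# 512 byte clusters, refcount_order 6: 2^20 * 2^6 * 2^9 = 2^35 bytes (32 GB)
assert max_refcounted_bytes(EIGHT_MB, 9, 6) == 1 << 35
```
+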
+Given an offset into the image file, the refcount of its cluster can be
+obtained as follows:
+
+ refcount_block_entries = (cluster_size * 8 / refcount_bits)
+
+ refcount_block_index = (offset / cluster_size) % refcount_block_entries
+ refcount_table_index = (offset / cluster_size) / refcount_block_entries
+
+ refcount_block = load_cluster(refcount_table[refcount_table_index]);
+ return refcount_block[refcount_block_index];
+
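+The index computation above can be written out as runnable code. This
+sketch only derives the two indices and leaves loading the refcount block
+to the caller; the function name refcount_indices is hypothetical.
+
```python
def refcount_indices(offset: int, cluster_bits: int, refcount_order: int):
    """Return (refcount_table_index, refcount_block_index) for offset."""
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    refcount_block_entries = cluster_size * 8 // refcount_bits
    cluster_index = offset // cluster_size
    return (cluster_index // refcount_block_entries,
            cluster_index % refcount_block_entries)
```
+
+With 64 KB clusters and 16-bit refcounts a refcount block holds 32768
+entries, so host cluster number 40000 is entry 7232 of the block referenced
+by refcount table entry 1.
+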
+Refcount table entry:
+
+ Bit 0 - 8: Reserved (set to 0)
+
+ 9 - 63: Bits 9-63 of the offset into the image file at which the
+ refcount block starts. Must be aligned to a cluster
+ boundary.
+
+ If this is 0, the corresponding refcount block has not yet
+ been allocated. All refcounts managed by this refcount block
+ are 0.
+
+Refcount block entry (x = refcount_bits - 1):
+
+ Bit 0 - x: Reference count of the cluster. If refcount_bits implies a
+ sub-byte width, note that bit 0 means the least significant
+ bit in this context.
+
+
+== Cluster mapping ==
+
+Just as for refcounts, qcow2 uses a two-level structure for the mapping of
+guest clusters to host clusters. They are called the L1 and L2 tables.
+
+The L1 table has a variable size (stored in the header) and may use multiple
+clusters, however it must be contiguous in the image file. L2 tables are
+exactly one cluster in size.
+
+The L1 and L2 tables have implications on the maximum virtual file
+size; for a given L1 table size, a larger cluster size is required for
+the guest to have access to more space. Furthermore, a virtual
+cluster must currently map to a host offset below 64 PB (56 bits)
+(although this limit could be relaxed by putting reserved bits into
+use). Additionally, as cluster size increases, the maximum host
+offset for a compressed cluster is reduced (a 2M cluster size requires
+compressed clusters to reside below 512 TB (49 bits), and this limit
+cannot be relaxed without an incompatible layout change).
+
+Given an offset into the virtual disk, the offset into the image file can be
+obtained as follows:
+
+ l2_entries = (cluster_size / sizeof(uint64_t)) [*]
+
+ l2_index = (offset / cluster_size) % l2_entries
+ l1_index = (offset / cluster_size) / l2_entries
+
+ l2_table = load_cluster(l1_table[l1_index]);
+ cluster_offset = l2_table[l2_index];
+
+ return cluster_offset + (offset % cluster_size)
+
+ [*] this changes if Extended L2 Entries are enabled, see next section
+
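+The index part of the lookup above can be sketched as follows, including
+the change in l2_entries when Extended L2 Entries are enabled. The function
+name l1_l2_indices is hypothetical, and reading the tables themselves is
+left out.
+
```python
def l1_l2_indices(offset: int, cluster_bits: int, extended_l2: bool = False):
    """Return (l1_index, l2_index, offset_in_cluster) for a guest offset."""
    cluster_size = 1 << cluster_bits
    # Standard L2 entries are 64 bits wide, extended ones are 128 bits.
    entry_size = 16 if extended_l2 else 8
    l2_entries = cluster_size // entry_size
    cluster_index = offset // cluster_size
    return (cluster_index // l2_entries,
            cluster_index % l2_entries,
            offset % cluster_size)
```
+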
+L1 table entry:
+
+ Bit 0 - 8: Reserved (set to 0)
+
+ 9 - 55: Bits 9-55 of the offset into the image file at which the L2
+ table starts. Must be aligned to a cluster boundary. If the
+ offset is 0, the L2 table and all clusters described by this
+ L2 table are unallocated.
+
+ 56 - 62: Reserved (set to 0)
+
+ 63: 0 for an L2 table that is unused or requires COW, 1 if its
+ refcount is exactly one. This information is only accurate
+ in the active L1 table.
+
+L2 table entry:
+
+ Bit 0 - 61: Cluster descriptor
+
+ 62: 0 for standard clusters
+ 1 for compressed clusters
+
+ 63: 0 for clusters that are unused, compressed or require COW.
+ 1 for standard clusters whose refcount is exactly one.
+ This information is only accurate in L2 tables
+ that are reachable from the active L1 table.
+
+ With external data files, all guest clusters have an
+ implicit refcount of 1 (because of the fixed host = guest
+ mapping for guest cluster offsets), so this bit should be 1
+ for all allocated clusters.
+
+Standard Cluster Descriptor:
+
+ Bit 0: If set to 1, the cluster reads as all zeros. The host
+ cluster offset can be used to describe a preallocation,
+ but it won't be used for reading data from this cluster,
+ nor is data read from the backing file if the cluster is
+ unallocated.
+
+ With version 2 or with extended L2 entries (see the next
+ section), this is always 0.
+
+ 1 - 8: Reserved (set to 0)
+
+ 9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a
+ cluster boundary. If the offset is 0 and bit 63 is clear,
+ the cluster is unallocated. The offset may only be 0 with
+ bit 63 set (indicating a host cluster offset of 0) when an
+ external data file is used.
+
+ 56 - 61: Reserved (set to 0)
+
+
+Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
+
+ Bit 0 - x-1: Host cluster offset. This is usually _not_ aligned to a
+ cluster or sector boundary! If cluster_bits is
+ small enough that this field includes bits beyond
+ 55, those upper bits must be set to 0.
+
+ x - 61: Number of additional 512-byte sectors used for the
+ compressed data, beyond the sector containing the offset
+ in the previous field. Some of these sectors may reside
+ in the next contiguous host cluster.
+
+ Note that the compressed data does not necessarily occupy
+ all of the bytes in the final sector; rather, decompression
+ stops when it has produced a cluster of data.
+
+ Another compressed cluster may map to the tail of the final
+ sector used by this compressed cluster.
+
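+Splitting a compressed cluster descriptor into its two fields can be
+sketched as follows. The function name decode_compressed is hypothetical,
+and the caller is assumed to have checked that bit 62 of the L2 entry is
+set.
+
```python
def decode_compressed(l2_entry: int, cluster_bits: int):
    """Return (host_offset, additional_sectors) of a compressed cluster."""
    x = 62 - (cluster_bits - 8)
    descriptor = l2_entry & ((1 << 62) - 1)  # bits 0-61 of the L2 entry
    host_offset = descriptor & ((1 << x) - 1)  # bits 0 to x-1
    additional_sectors = descriptor >> x  # bits x to 61
    return host_offset, additional_sectors
```
+
+With 64 KB clusters (cluster_bits = 16), x is 54, which leaves 8 bits for
+the count of additional 512-byte sectors.
+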
+If a cluster is unallocated, read requests shall read the data from the backing
+file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
+no backing file or the backing file is smaller than the image, they shall read
+zeros for all parts that are not covered by the backing file.
+
+== Extended L2 Entries ==
+
+An image uses Extended L2 Entries if bit 4 is set on the incompatible_features
+field of the header.
+
+In these images standard data clusters are divided into 32 subclusters of the
+same size. They are contiguous and start from the beginning of the cluster.
+Subclusters can be allocated independently and the L2 entry contains information
+indicating the status of each one of them. Compressed data clusters don't have
+subclusters so they are treated the same as in images without this feature.
+
+The size of an extended L2 entry is 128 bits so the number of entries per table
+is calculated using this formula:
+
+ l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
+
+The first 64 bits have the same format as the standard L2 table entry described
+in the previous section, with the exception of bit 0 of the standard cluster
+descriptor.
+
+The last 64 bits contain a subcluster allocation bitmap with this format:
+
+Subcluster Allocation Bitmap (for standard clusters):
+
+ Bit 0 - 31: Allocation status (one bit per subcluster)
+
+ 1: the subcluster is allocated. In this case the
+ host cluster offset field must contain a valid
+ offset.
+ 0: the subcluster is not allocated. In this case
+ read requests shall go to the backing file or
+ return zeros if there is no backing file data.
+
+ Bits are assigned starting from the least significant
+ one (i.e. bit x is used for subcluster x).
+
+ 32 - 63 Subcluster reads as zeros (one bit per subcluster)
+
+ 1: the subcluster reads as zeros. In this case the
+ allocation status bit must be unset. The host
+ cluster offset field may or may not be set.
+ 0: no effect.
+
+ Bits are assigned starting from the least significant
+ one (i.e. bit x is used for subcluster x - 32).
+
+Subcluster Allocation Bitmap (for compressed clusters):
+
+ Bit 0 - 63: Reserved (set to 0)
+ Compressed clusters don't have subclusters,
+ so this field is not used.
+
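+The state of a single subcluster can be derived from the bitmap as in the
+sketch below. The function name subcluster_status and the returned strings
+are hypothetical labels chosen for the example.
+
```python
def subcluster_status(bitmap: int, index: int) -> str:
    """Classify subcluster `index` (0-31) of a standard cluster."""
    if not 0 <= index < 32:
        raise ValueError("subcluster index must be in 0..31")
    allocated = (bitmap >> index) & 1       # bits 0-31
    reads_zero = (bitmap >> (32 + index)) & 1  # bits 32-63
    if allocated and reads_zero:
        # The allocation bit must be unset when the zero bit is set.
        return "invalid"
    if reads_zero:
        return "zero"
    return "allocated" if allocated else "unallocated"
```
+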
+== Snapshots ==
+
+qcow2 supports internal snapshots. Their basic principle of operation is to
+switch the active L1 table, so that a different set of host clusters is
+exposed to the guest.
+
+When creating a snapshot, the L1 table should be copied and the refcount of all
+L2 tables and clusters reachable from this L1 table must be increased, so that
+a write causes a COW and isn't visible in other snapshots.
+
+When loading a snapshot, bit 63 of all entries in the new active L1 table and
+all L2 tables referenced by it must be reconstructed from the refcount table
+as it doesn't need to be accurate in inactive L1 tables.
+
+A directory of all snapshots is stored in the snapshot table, a contiguous area
+in the image file, whose starting offset and length are given by the header
+fields snapshots_offset and nb_snapshots. The entries of the snapshot table
+have variable length, depending on the length of ID, name and extra data.
+
+Snapshot table entry:
+
+ Byte 0 - 7: Offset into the image file at which the L1 table for the
+ snapshot starts. Must be aligned to a cluster boundary.
+
+ 8 - 11: Number of entries in the L1 table of the snapshot
+
+ 12 - 13: Length of the unique ID string describing the snapshot
+
+ 14 - 15: Length of the name of the snapshot
+
+ 16 - 19: Time at which the snapshot was taken in seconds since the
+ Epoch
+
+ 20 - 23: Subsecond part of the time at which the snapshot was taken
+ in nanoseconds
+
+ 24 - 31: Time that the guest was running until the snapshot was
+ taken in nanoseconds
+
+ 32 - 35: Size of the VM state in bytes. 0 if no VM state is saved.
+ If there is VM state, it starts at the first cluster
+ described by the first L1 table entry that doesn't describe a
+ regular guest cluster (i.e. VM state is stored like guest
+ disk content, except that it is stored at offsets that are
+ larger than the virtual disk presented to the guest)
+
+ 36 - 39: Size of extra data in the table entry (used for future
+ extensions of the format)
+
+ variable: Extra data for future extensions. Unknown fields must be
+ ignored. Currently defined are (offset relative to snapshot
+ table entry):
+
+ Byte 40 - 47: Size of the VM state in bytes. 0 if no VM
+ state is saved. If this field is present,
+ the 32-bit value in bytes 32-35 is ignored.
+
+ Byte 48 - 55: Virtual disk size of the snapshot in bytes
+
+ Byte 56 - 63: icount value which corresponds to
+ the record/replay instruction count
+ when the snapshot was taken. Set to -1
+ if icount was disabled
+
+ Version 3 images must include extra data at least up to
+ byte 55.
+
+ variable: Unique ID string for the snapshot (not null terminated)
+
+ variable: Name of the snapshot (not null terminated)
+
+ variable: Padding to round up the snapshot table entry size to the
+ next multiple of 8.
+
+
+== Bitmaps ==
+
+As mentioned above, the bitmaps extension provides the ability to store bitmaps
+related to a virtual disk. This section describes how these bitmaps are stored.
+
+All stored bitmaps are related to the virtual disk stored in the same image, so
+each bitmap size is equal to the virtual disk size.
+
+Each bit of the bitmap is responsible for a strictly defined range of the virtual
+disk. For bit number bit_nr the corresponding range (in bytes) will be:
+
+ [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1]
+
+Granularity is a property of the concrete bitmap, see below.
+
+
+=== Bitmap directory ===
+
+Each bitmap saved in the image is described in a bitmap directory entry. The
+bitmap directory is a contiguous area in the image file, whose starting offset
+and length are given by the header extension fields bitmap_directory_offset and
+bitmap_directory_size. The entries of the bitmap directory have variable
+length, depending on the lengths of the bitmap name and extra data.
+
+Structure of a bitmap directory entry:
+
+ Byte 0 - 7: bitmap_table_offset
+ Offset into the image file at which the bitmap table
+ (described below) for the bitmap starts. Must be aligned to
+ a cluster boundary.
+
+ 8 - 11: bitmap_table_size
+ Number of entries in the bitmap table of the bitmap.
+
+ 12 - 15: flags
+ Bit
+ 0: in_use
+ The bitmap was not saved correctly and may be
+ inconsistent. Although the bitmap metadata is still
+ well-formed from a qcow2 perspective, the metadata
+ (such as the auto flag or bitmap size) or data
+ contents may be outdated.
+
+ 1: auto
+ The bitmap must reflect all changes of the virtual
+ disk by any application that would write to this qcow2
+ file (including writes, snapshot switching, etc.). The
+ type of this bitmap must be 'dirty tracking bitmap'.
+
+ 2: extra_data_compatible
+ This flag is meaningful when the extra data is
+ unknown to the software (currently any extra data is
+ unknown to QEMU).
+ If it is set, the bitmap may be used as expected, extra
+ data must be left as is.
+ If it is not set, the bitmap must not be used, but
+ both it and its extra data must be left as is.
+
+ Bits 3 - 31 are reserved and must be 0.
+
+ 16: type
+ This field describes the sort of the bitmap.
+ Values:
+ 1: Dirty tracking bitmap
+
+ Values 0, 2 - 255 are reserved.
+
+ 17: granularity_bits
+ Granularity bits. Valid values: 0 - 63.
+
+ Note: QEMU currently supports only values 9 - 31.
+
+ Granularity is calculated as
+ granularity = 1 << granularity_bits
+
+ A bitmap's granularity is the number of bytes of the image
+ that one bit of the bitmap accounts for.
+
+ 18 - 19: name_size
+ Size of the bitmap name. Must be non-zero.
+
+ Note: QEMU currently doesn't support values greater than
+ 1023.
+
+ 20 - 23: extra_data_size
+ Size of type-specific extra data.
+
+ For now, as no extra data is defined, extra_data_size is
+ reserved and should be zero. If it is non-zero the
+ behavior is defined by the extra_data_compatible flag.
+
+ variable: extra_data
+ Extra data for the bitmap, occupying extra_data_size bytes.
+ Extra data must never contain references to clusters or in
+ some other way allocate additional clusters.
+
+ variable: name
+ The name of the bitmap (not null terminated), occupying
+ name_size bytes. Must be unique among all bitmap names
+ within the bitmaps extension.
+
+ variable: Padding to round up the bitmap directory entry size to the
+ next multiple of 8. All bytes of the padding must be zero.
+
+
+=== Bitmap table ===
+
+Each bitmap is stored using a one-level structure (as opposed to two-level
+structures like for refcounts and guest clusters mapping) for the mapping of
+bitmap data to host clusters. This structure is called the bitmap table.
+
+Each bitmap table has a variable size (stored in the bitmap directory entry)
+and may use multiple clusters, however, it must be contiguous in the image
+file.
+
+Structure of a bitmap table entry:
+
+ Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero.
+ If bits 9 - 55 are zero:
+ 0: Cluster should be read as all zeros.
+ 1: Cluster should be read as all ones.
+
+ 1 - 8: Reserved and must be zero.
+
+ 9 - 55: Bits 9 - 55 of the host cluster offset. Must be aligned to
+ a cluster boundary. If the offset is 0, the cluster is
+ unallocated; in that case, bit 0 determines how this
+ cluster should be treated during reads.
+
+ 56 - 63: Reserved and must be zero.
+
+
+=== Bitmap data ===
+
+As noted above, bitmap data is stored in separate clusters, described by the
+bitmap table. Given an offset (in bytes) into the bitmap data, the offset into
+the image file can be obtained as follows:
+
+ image_offset(bitmap_data_offset) =
+ bitmap_table[bitmap_data_offset / cluster_size] +
+ (bitmap_data_offset % cluster_size)
+
+This offset is not defined if bits 9 - 55 of the bitmap table entry are zero (see
+above).
+
+Given an offset byte_nr into the virtual disk and the bitmap's granularity, the
+bit offset into the image file to the corresponding bit of the bitmap can be
+calculated like this:
+
+ bit_offset(byte_nr) =
+ image_offset(byte_nr / granularity / 8) * 8 +
+ (byte_nr / granularity) % 8
+
+If the size of the bitmap data is not a multiple of the cluster size then the
+last cluster of the bitmap data contains some unused tail bits. These bits must
+be zero.
+
+
+=== Dirty tracking bitmaps ===
+
+Bitmaps with 'type' field equal to one are dirty tracking bitmaps.
+
+When the virtual disk is in use, a dirty tracking bitmap may be 'enabled' or
+'disabled'. While the bitmap is 'enabled', all writes to the virtual disk
+should be reflected in the bitmap. A set bit in the bitmap means that the
+corresponding range of the virtual disk (see above) was written to while the
+bitmap was 'enabled'. An unset bit means that this range was not written to.
+
+The software doesn't have to sync the bitmap in the image file with its
+representation in RAM after each write or metadata change. Flag 'in_use'
+should be set while the bitmap is not synced.
+
+In the image file the 'enabled' state is reflected by the 'auto' flag. If this
+flag is set, the software must consider the bitmap as 'enabled' and start
+tracking virtual disk changes to this bitmap from the first write to the
+virtual disk. If this flag is not set then the bitmap is disabled.
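The bitmap data addressing described in the qcow2.txt hunk above can be sketched in Python as follows. This is only an illustration of the two formulas, not QEMU code: the function names, the `OFFSET_MASK` constant, and the choice to return `None` for an unallocated bitmap cluster are assumptions made for this sketch. The mask extracts bits 9 - 55 of a bitmap table entry, as the "Bitmap table" section specifies.

```python
# Bits 9 - 55 of a bitmap table entry hold the host cluster offset.
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)


def image_offset(bitmap_table, bitmap_data_offset, cluster_size):
    """Map a byte offset into the bitmap data to a byte offset in the image.

    Returns None when the covering bitmap cluster is unallocated; in that
    case bit 0 of the table entry says whether it reads as zeros or ones.
    """
    entry = bitmap_table[bitmap_data_offset // cluster_size]
    host_offset = entry & OFFSET_MASK
    if host_offset == 0:
        return None
    return host_offset + bitmap_data_offset % cluster_size


def bit_offset(bitmap_table, byte_nr, granularity_bits, cluster_size):
    """Map a virtual disk byte offset to the bit offset of its bitmap bit."""
    bit_nr = byte_nr >> granularity_bits  # byte_nr / granularity
    byte_in_image = image_offset(bitmap_table, bit_nr // 8, cluster_size)
    if byte_in_image is None:
        return None
    return byte_in_image * 8 + bit_nr % 8
```

With 64 KiB clusters and a 64 KiB granularity, one bitmap cluster covers 65536 * 8 bits, so the second table entry is only consulted for disk offsets of 32 GiB and above.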
diff --git a/docs/interop/qed_spec.txt b/docs/interop/qed_spec.txt
new file mode 100644
index 000000000..7982e058b
--- /dev/null
+++ b/docs/interop/qed_spec.txt
@@ -0,0 +1,138 @@
+=Specification=
+
+The file format looks like this:
+
+ +----------+----------+----------+-----+
+ | cluster0 | cluster1 | cluster2 | ... |
+ +----------+----------+----------+-----+
+
+The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2 table''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.
+
+Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
+
+All fields are little-endian.
+
+==Header==
+ Header {
+ uint32_t magic; /* QED\0 */
+
+ uint32_t cluster_size; /* in bytes */
+ uint32_t table_size; /* for L1 and L2 tables, in clusters */
+ uint32_t header_size; /* in clusters */
+
+ uint64_t features; /* format feature bits */
+ uint64_t compat_features; /* compat feature bits */
+ uint64_t autoclear_features; /* self-resetting feature bits */
+
+ uint64_t l1_table_offset; /* in bytes */
+ uint64_t image_size; /* total logical image size, in bytes */
+
+ /* if (features & QED_F_BACKING_FILE) */
+ uint32_t backing_filename_offset; /* in bytes from start of header */
+ uint32_t backing_filename_size; /* in bytes */
+ }
+
+Field descriptions:
+* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].
+* ''table_size'' must be a power of 2 in range [1, 16].
+* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
+* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
+** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
+** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
+** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled by a new program later.
+* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
+* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
+* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
+
+Feature bits:
+* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
+* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
+* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
+
+There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
+
+Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
+
+==Tables==
+
+Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
+
+ #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
+
+ Table {
+ uint64_t offsets[TABLE_NOFFSETS];
+ }
+
+The tables are organized as follows:
+
+ +----------+
+ | L1 table |
+ +----------+
+ ,------' | '------.
+ +----------+ | +----------+
+ | L2 table | ... | L2 table |
+ +----------+ +----------+
+ ,------' | '------.
+ +----------+ | +----------+
+ | Data | ... | Data |
+ +----------+ +----------+
+
+A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
+
+The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
+ header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
+
+L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
+
+===L2 table offsets===
+* 0 - unallocated. The L2 table is not yet allocated.
+
+===Data cluster offsets===
+* 0 - unallocated. The data cluster is not yet allocated.
+* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
+
+Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
+
+===Unallocated L2 tables and data clusters===
+Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
+
+Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written.
+
+===Zero data clusters===
+Zero data clusters are a space-efficient way of storing zeroed regions of the image.
+
+Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
+
+Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
+
+===Logical offset translation===
+Logical offsets are translated into cluster offsets as follows:
+
+ table_bits table_bits cluster_bits
+ <--------> <--------> <--------------->
+ +----------+----------+-----------------+
+ | L1 index | L2 index | byte offset |
+ +----------+----------+-----------------+
+
+ Structure of a logical offset
+
+ offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
+
+ def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
+ l2_offset = l1_table[l1_index]
+ l2_table = load_table(l2_offset)
+ cluster_offset = l2_table[l2_index] & offset_mask
+ return cluster_offset + byte_offset
+
+==Consistency checking==
+
+This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
+
+The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
+
+The consistency check verifies the following invariants:
+# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
+# Offsets must be within the image file size and must be ''cluster_size'' aligned.
+# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
+
+The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. After the check completes with no errors other than leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
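The "Logical offset translation" diagram in the qed_spec.txt hunk above takes the L1 index, L2 index and byte offset as given. A minimal Python sketch of how a logical offset could be split into those three parts is shown below. The function names are hypothetical, and the sketch assumes that `cluster_size` and `table_size` are powers of 2, as the header field descriptions require.

```python
def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets in one table (TABLE_NOFFSETS in the spec)."""
    return table_size * cluster_size // 8


def split_logical_offset(pos, cluster_size, table_size):
    """Split a logical offset into (l1_index, l2_index, byte_offset)."""
    noffsets = table_noffsets(cluster_size, table_size)
    cluster_bits = cluster_size.bit_length() - 1  # log2 of a power of 2
    table_bits = noffsets.bit_length() - 1
    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & (noffsets - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset


def max_image_size(cluster_size, table_size):
    """Upper bound on header.image_size for the given geometry."""
    noffsets = table_noffsets(cluster_size, table_size)
    return noffsets * noffsets * cluster_size
```

For the example geometry from the "Tables" section (cluster_size=64 KB, table_size=4), each table holds 32768 offsets, so the logical offset is split into a 15-bit L1 index, a 15-bit L2 index and a 16-bit byte offset.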
diff --git a/docs/interop/qemu-ga-ref.rst b/docs/interop/qemu-ga-ref.rst
new file mode 100644
index 000000000..032d49245
--- /dev/null
+++ b/docs/interop/qemu-ga-ref.rst
@@ -0,0 +1,7 @@
+QEMU Guest Agent Protocol Reference
+===================================
+
+.. contents::
+ :depth: 3
+
+.. qapi-doc:: qga/qapi-schema.json
diff --git a/docs/interop/qemu-ga.rst b/docs/interop/qemu-ga.rst
new file mode 100644
index 000000000..3063357bb
--- /dev/null
+++ b/docs/interop/qemu-ga.rst
@@ -0,0 +1,134 @@
+QEMU Guest Agent
+================
+
+Synopsis
+--------
+
+**qemu-ga** [*OPTIONS*]
+
+Description
+-----------
+
+The QEMU Guest Agent is a daemon intended to be run within virtual
+machines. It allows the hypervisor host to perform various operations
+in the guest, such as:
+
+- get information from the guest
+- set the guest's system time
+- read/write a file
+- sync and freeze the filesystems
+- suspend the guest
+- reconfigure guest local processors
+- set user's password
+- ...
+
+qemu-ga will read a system configuration file on startup (located at
+|CONFDIR|\ ``/qemu-ga.conf`` by default), then parse remaining
+configuration options on the command line. For the same key, the last
+option wins, but the lists accumulate (see below for configuration
+file format).
+
+Options
+-------
+
+.. program:: qemu-ga
+
+.. option:: -m, --method=METHOD
+
+ Transport method: one of ``unix-listen``, ``virtio-serial``,
+ ``isa-serial``, or ``vsock-listen`` (``virtio-serial`` is the default).
+
+.. option:: -p, --path=PATH
+
+ Device/socket path (the default for virtio-serial is
+ ``/dev/virtio-ports/org.qemu.guest_agent.0``,
+ the default for isa-serial is ``/dev/ttyS0``). Socket addresses for
+ vsock-listen are written as ``<cid>:<port>``.
+
+.. option:: -l, --logfile=PATH
+
+ Set log file path (default is stderr).
+
+.. option:: -f, --pidfile=PATH
+
+ Specify pid file (default is ``/var/run/qemu-ga.pid``).
+
+.. option:: -F, --fsfreeze-hook=PATH
+
+ Enable fsfreeze hook. Accepts an optional argument that specifies
+ the script to run on freeze/thaw. The script will be called with
+ 'freeze'/'thaw' arguments accordingly (default is
+ |CONFDIR|\ ``/fsfreeze-hook``). If using -F with an argument, do
+ not follow -F with a space (for example:
+ ``-F/var/run/fsfreezehook.sh``).
+
+.. option:: -t, --statedir=PATH
+
+ Specify the directory to store state information (absolute paths only,
+ default is ``/var/run``).
+
+.. option:: -v, --verbose
+
+ Log extra debugging information.
+
+.. option:: -V, --version
+
+ Print version information and exit.
+
+.. option:: -d, --daemon
+
+ Daemonize after startup (detach from terminal).
+
+.. option:: -b, --blacklist=LIST
+
+ Comma-separated list of RPCs to disable (no spaces, ``?`` to list
+ available RPCs).
+
+.. option:: -D, --dump-conf
+
+ Dump the configuration in a format compatible with ``qemu-ga.conf``
+ and exit.
+
+.. option:: -h, --help
+
+ Display this help and exit.
+
+Files
+-----
+
+
+The syntax of the ``qemu-ga.conf`` configuration file follows the
+Desktop Entry Specification. In short, the file consists of
+groups of key-value pairs, interspersed with comments.
+
+::
+
+ # qemu-ga configuration sample
+ [general]
+ daemon = 0
+ pidfile = /var/run/qemu-ga.pid
+ verbose = 0
+ method = virtio-serial
+ path = /dev/virtio-ports/org.qemu.guest_agent.0
+ statedir = /var/run
+
+The list of keys follows the command line options:
+
+============= ===========
+Key Key type
+============= ===========
+daemon boolean
+method string
+path string
+logfile string
+pidfile string
+fsfreeze-hook string
+statedir string
+verbose boolean
+blacklist string list
+============= ===========
+
+See also
+--------
+
+:manpage:`qemu(1)`
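The precedence rule described in the qemu-ga.rst hunk above (configuration file first, then command line; the last option wins for the same key, but lists accumulate) can be sketched in Python as follows. This is a rough illustration, not how qemu-ga itself is implemented: it uses Python's `configparser` as a stand-in parser for the sample file, and the helper names and the RPC names in the usage note are illustrative assumptions.

```python
import configparser

SAMPLE = """
# qemu-ga configuration sample
[general]
pidfile = /var/run/qemu-ga.pid
verbose = 0
method = virtio-serial
path = /dev/virtio-ports/org.qemu.guest_agent.0
statedir = /var/run
"""

LIST_KEYS = {"blacklist"}  # keys whose values accumulate instead of override


def load_general(text):
    """Return the [general] group of a qemu-ga.conf style file as a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["general"])


def merge_settings(file_settings, cli_settings):
    """Apply command-line settings on top of configuration file settings."""
    merged = dict(file_settings)
    for key, value in cli_settings.items():
        if key in LIST_KEYS:
            merged[key] = list(merged.get(key, [])) + list(value)
        else:
            merged[key] = value
    return merged
```

For example, merging a file that sets `method = virtio-serial` with a command line that passes `--method=unix-listen` yields `unix-listen`, while a blacklist of `guest-shutdown` from the file and `guest-exec` from the command line ends up containing both entries.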
diff --git a/docs/interop/qemu-qmp-ref.rst b/docs/interop/qemu-qmp-ref.rst
new file mode 100644
index 000000000..357effd64
--- /dev/null
+++ b/docs/interop/qemu-qmp-ref.rst
@@ -0,0 +1,7 @@
+QEMU QMP Reference Manual
+=========================
+
+.. contents::
+ :depth: 3
+
+.. qapi-doc:: qapi/qapi-schema.json
diff --git a/docs/interop/qemu-storage-daemon-qmp-ref.rst b/docs/interop/qemu-storage-daemon-qmp-ref.rst
new file mode 100644
index 000000000..9fed68152
--- /dev/null
+++ b/docs/interop/qemu-storage-daemon-qmp-ref.rst
@@ -0,0 +1,7 @@
+QEMU Storage Daemon QMP Reference Manual
+========================================
+
+.. contents::
+ :depth: 3
+
+.. qapi-doc:: storage-daemon/qapi/qapi-schema.json
diff --git a/docs/interop/qmp-intro.txt b/docs/interop/qmp-intro.txt
new file mode 100644
index 000000000..1c745a7af
--- /dev/null
+++ b/docs/interop/qmp-intro.txt
@@ -0,0 +1,88 @@
+ QEMU Machine Protocol
+ =====================
+
+Introduction
+------------
+
+The QEMU Machine Protocol (QMP) allows applications to operate a
+QEMU instance.
+
+QMP is JSON[1] based and features the following:
+
+- Lightweight, text-based, easy to parse data format
+- Asynchronous messages support (i.e. events)
+- Capabilities Negotiation
+
+For detailed information on QMP's usage, please refer to the following files:
+
+o qmp-spec.txt QEMU Machine Protocol current specification
+o qemu-qmp-ref.html QEMU QMP commands and events (auto-generated at build-time)
+
+[1] https://www.json.org
+
+Usage
+-----
+
+You can use the -qmp option to enable QMP. For example, the following
+makes QMP available on localhost port 4444:
+
+$ qemu [...] -qmp tcp:localhost:4444,server=on,wait=off
+
+However, for more flexibility and to make use of more options, the -mon
+command-line option should be used. For instance, the following example
+creates one HMP instance (human monitor) on stdio and one QMP instance
+on localhost port 4444:
+
+$ qemu [...] -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline \
+ -chardev socket,id=mon1,host=localhost,port=4444,server=on,wait=off \
+ -mon chardev=mon1,mode=control,pretty=on
+
+Please refer to QEMU's manpage for more information.
+
+Simple Testing
+--------------
+
+To manually test QMP one can connect with telnet and issue commands by hand:
+
+$ telnet localhost 4444
+Trying 127.0.0.1...
+Connected to localhost.
+Escape character is '^]'.
+{
+ "QMP": {
+ "version": {
+ "qemu": {
+ "micro": 0,
+ "minor": 0,
+ "major": 3
+ },
+ "package": "v3.0.0"
+ },
+ "capabilities": [
+ "oob"
+ ]
+ }
+}
+
+{ "execute": "qmp_capabilities" }
+{
+ "return": {
+ }
+}
+
+{ "execute": "query-status" }
+{
+ "return": {
+ "status": "prelaunch",
+ "singlestep": false,
+ "running": false
+ }
+}
+
+Please refer to docs/interop/qemu-qmp-ref.* for a complete command
+reference, generated from qapi/qapi-schema.json.
+
+QMP wiki page
+-------------
+
+https://wiki.qemu.org/QMP
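The telnet session in the qmp-intro.txt hunk above can also be driven from a script. The following Python sketch reads the greeting, sends `qmp_capabilities`, and then issues commands. It is a minimal illustration under stated assumptions, not an official client: the `QMPClient` class and `connect` helper are hypothetical names, the server is assumed to send one JSON object per line (that is, `pretty=on` is not used), and asynchronous events that arrive while waiting for a response are simply skipped.

```python
import json
import socket


class QMPClient:
    """Tiny synchronous QMP client over an already connected socket."""

    def __init__(self, sock):
        self._sock = sock
        self._reader = sock.makefile("r", encoding="utf-8")
        # The server speaks first: read the greeting, then leave
        # capabilities negotiation mode.
        self.greeting = self._read_message()
        if "QMP" not in self.greeting:
            raise ValueError("unexpected greeting: %r" % (self.greeting,))
        self.execute("qmp_capabilities")

    def _read_message(self):
        line = self._reader.readline()
        if not line:
            raise ConnectionError("server closed the connection")
        return json.loads(line)

    def execute(self, command, arguments=None):
        """Send one command and return the value of its "return" member."""
        request = {"execute": command}
        if arguments is not None:
            request["arguments"] = arguments
        self._sock.sendall(json.dumps(request).encode("utf-8") + b"\r\n")
        while True:
            message = self._read_message()
            if "event" in message:
                continue  # asynchronous event, not our response
            if "error" in message:
                raise RuntimeError(message["error"]["desc"])
            return message["return"]


def connect(host, port):
    """Connect to a QMP server listening on a TCP socket."""
    return QMPClient(socket.create_connection((host, port)))
```

With QEMU started as in the "Usage" section, `connect("localhost", 4444).execute("query-status")` would return the same object that the telnet session shows under "return".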
diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
new file mode 100644
index 000000000..b0e8351d5
--- /dev/null
+++ b/docs/interop/qmp-spec.txt
@@ -0,0 +1,406 @@
+ QEMU Machine Protocol Specification
+
+0. About This Document
+======================
+
+Copyright (C) 2009-2016 Red Hat, Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+1. Introduction
+===============
+
+This document specifies the QEMU Machine Protocol (QMP), a JSON-based
+protocol which is available for applications to operate QEMU at the
+machine-level. It is also in use by the QEMU Guest Agent (QGA), which
+is available for host applications to interact with the guest
+operating system.
+
+2. Protocol Specification
+=========================
+
+This section details the protocol format. For the purpose of this
+document, "Server" is either QEMU or the QEMU Guest Agent, and
+"Client" is any application communicating with it via QMP.
+
+JSON data structures, when mentioned in this document, are always in the
+following format:
+
+ json-DATA-STRUCTURE-NAME
+
+Where DATA-STRUCTURE-NAME is any valid JSON data structure, as defined
+by the JSON standard:
+
+http://www.ietf.org/rfc/rfc8259.txt
+
+The server expects its input to be encoded in UTF-8, and sends its
+output encoded in ASCII.
+
+For convenience, json-object members mentioned in this document will
+be in a certain order. However, in real protocol usage they can be in
+ANY order, thus no particular order should be assumed. On the other
+hand, use of json-array elements presumes that preserving order is
+important unless specifically documented otherwise. Repeating a key
+within a json-object gives unpredictable results.
+
+Also for convenience, the server will accept an extension of
+'single-quoted' strings in place of the usual "double-quoted"
+json-string, and both input forms of strings understand an additional
+escape sequence of "\'" for a single quote. The server will only use
+double quoting on output.
+
+2.1 General Definitions
+-----------------------
+
+2.1.1 All interactions transmitted by the Server are json-objects, always
+ terminating with CRLF
+
+2.1.2 All json-object members are mandatory when not specified otherwise
+
+2.2 Server Greeting
+-------------------
+
+Immediately upon connection, the Server issues a greeting message, which signals
+that the connection has been successfully established and that the Server is
+ready for capabilities negotiation (for more information refer to section
+'4. Capabilities Negotiation').
+
+The greeting message format is:
+
+{ "QMP": { "version": json-object, "capabilities": json-array } }
+
+ Where,
+
+- The "version" member contains the Server's version information (the format
+ is the same as that of the query-version command)
+- The "capabilities" member specifies the availability of features beyond the
+ baseline specification; the order of elements in this array has no
+ particular significance.
+
+2.2.1 Capabilities
+------------------
+
+Currently supported capabilities are:
+
+- "oob": the QMP server supports "out-of-band" (OOB) command
+ execution, as described in section "2.3.1 Out-of-band execution".
+
+2.3 Issuing Commands
+--------------------
+
+The format for command execution is:
+
+{ "execute": json-string, "arguments": json-object, "id": json-value }
+
+or
+
+{ "exec-oob": json-string, "arguments": json-object, "id": json-value }
+
+ Where,
+
+- The "execute" or "exec-oob" member identifies the command to be
+ executed by the server. The latter requests out-of-band execution.
+- The "arguments" member is used to pass any arguments required for the
+ execution of the command; it is optional when no arguments are
+ required. Each command documents what contents will be considered
+ valid when handling the json-argument.
+- The "id" member is a transaction identification associated with the
+ command execution; it is optional and will be part of the response
+ if provided. The "id" member can be any json-value. A json-number
+ incremented for each successive command works fine.
+
+The actual commands are documented in the QEMU QMP reference manual
+docs/interop/qemu-qmp-ref.{7,html,info,pdf,txt}.
+
+2.3.1 Out-of-band execution
+---------------------------
+
+The server normally reads, executes and responds to one command after
+the other. The client therefore receives command responses in issue
+order.
+
+With out-of-band execution enabled via capability negotiation (section
+4.), the server reads and queues commands as they arrive. It executes
+commands from the queue one after the other. Commands executed
+out-of-band jump the queue: the command gets executed right away,
+possibly overtaking prior in-band commands. The client may therefore
+receive such a command's response before responses from prior in-band
+commands.
+
+To be able to match responses back to their commands, the client needs
+to pass "id" with out-of-band commands. Passing it with all commands
+is recommended for clients that accept capability "oob".
+
+If the client sends in-band commands faster than the server can
+execute them, the server will stop reading requests until the request
+queue length is reduced to an acceptable range.
+
+To ensure commands to be executed out-of-band get read and executed,
+the client should have at most eight in-band commands in flight.
+
+Only a few commands support out-of-band execution. The ones that do
+have "allow-oob": true in output of query-qmp-schema.
+
+2.4 Commands Responses
+----------------------
+
+There are two possible responses which the Server will issue as the result
+of a command execution: success or error.
+
+As long as the commands were issued with a proper "id" field, then the
+same "id" field will be attached in the corresponding response message
+so that requests and responses can match. Clients should drop all the
+responses that have an unknown "id" field.
+
+2.4.1 success
+-------------
+
+The format of a success response is:
+
+{ "return": json-value, "id": json-value }
+
+ Where,
+
+- The "return" member contains the data returned by the command, which
+ is defined on a per-command basis (usually a json-object or
+ json-array of json-objects, but sometimes a json-number, json-string,
+ or json-array of json-strings); it is an empty json-object if the
+ command does not return data
+- The "id" member contains the transaction identification associated
+ with the command execution if issued by the Client
+
+2.4.2 error
+-----------
+
+The format of an error response is:
+
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+
+ Where,
+
+- The "class" member contains the error class name (e.g. "GenericError")
+- The "desc" member is a human-readable error message. Clients should
+ not attempt to parse this message.
+- The "id" member contains the transaction identification associated with
+ the command execution if issued by the Client
+
+NOTE: Some errors can occur before the Server is able to read the "id" member,
+in these cases the "id" member will not be part of the error response, even
+if provided by the client.
+
+2.5 Asynchronous events
+-----------------------
+
+As a result of state changes, the Server may send messages unilaterally
+to the Client at any time, when not in the middle of any other
+response. They are called "asynchronous events".
+
+The format of asynchronous events is:
+
+{ "event": json-string, "data": json-object,
+ "timestamp": { "seconds": json-number, "microseconds": json-number } }
+
+ Where,
+
+- The "event" member contains the event's name
+- The "data" member contains event-specific data, which is defined on a
+ per-event basis; it is optional
+- The "timestamp" member contains the exact time of when the event
+ occurred in the Server. It is a fixed json-object with time in
+ seconds and microseconds relative to the Unix Epoch (1 Jan 1970); if
+ there is a failure to retrieve host time, both members of the
+ timestamp will be set to -1.
+
+The actual asynchronous events are documented in the QEMU QMP
+reference manual docs/interop/qemu-qmp-ref.{7,html,info,pdf,txt}.
+
+Some events are rate-limited to at most one per second. If additional
+"similar" events arrive within one second, all but the last one are
+dropped, and the last one is delayed. "Similar" normally means same
+event type.
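Editor's note: a client usually wants the event timestamp as a native time value and has to cope with the -1 failure case described above. A minimal sketch, assuming the event has already been decoded into a Python dict; `event_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def event_time(event):
    """Return the event time as an aware UTC datetime, or None if unknown."""
    ts = event["timestamp"]
    seconds, micros = ts["seconds"], ts["microseconds"]
    # Both members are -1 when the Server could not read the host clock.
    if seconds == -1 and micros == -1:
        return None
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=micros)
```

Because "data" is optional, code that reads it should use something like `event.get("data", {})` rather than indexing directly.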
+
+2.6 Forcing the JSON parser into known-good state
+-------------------------------------------------
+
+Incomplete or invalid input can leave the server's JSON parser in a
+state where it can't parse additional commands. To get it back into
+known-good state, the client should provoke a lexical error.
+
+The cleanest way to do that is sending an ASCII control character
+other than '\t' (horizontal tab), '\r' (carriage return), or '\n' (new
+line).
+
+Sadly, older versions of QEMU can fail to flag this as an error. If a
+client needs to deal with them, it should send a 0xFF byte.
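Editor's note: the resynchronization technique amounts to writing a single byte before the next command. A minimal sketch; the function name and the choice of ESC (0x1B) as the control character are assumptions made for illustration, since the text only requires a control character other than tab, carriage return, or new line.

```python
def resync_bytes(support_old_qemu=False):
    """Return a byte sequence meant to provoke a lexical error."""
    # 0x1B (ESC) is an ASCII control character other than '\t', '\r', '\n'.
    # 0xFF is the fallback the text recommends for older QEMU versions.
    return b"\xff" if support_old_qemu else b"\x1b"
```

A client would write these bytes to the monitor socket and then, presumably, read and discard the resulting error response before sending its next command.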
+
+2.7 QGA Synchronization
+-----------------------
+
+When a client connects to QGA over a transport lacking proper
+connection semantics such as virtio-serial, QGA may have read partial
+input from a previous client. The client needs to force QGA's parser
+into known-good state using the previous section's technique.
+Moreover, the client may receive output a previous client didn't read.
+To help with skipping that output, QGA provides the
+'guest-sync-delimited' command. Refer to its documentation for
+details.
+
+
+3. QMP Examples
+===============
+
+This section provides some examples of real QMP usage. In all of them,
+"C" stands for "Client" and "S" stands for "Server".
+
+3.1 Server greeting
+-------------------
+
+S: { "QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 3},
+ "package": "v3.0.0"}, "capabilities": ["oob"] } }
+
+3.2 Capabilities negotiation
+----------------------------
+
+C: { "execute": "qmp_capabilities", "arguments": { "enable": ["oob"] } }
+S: { "return": {}}
+
+3.3 Simple 'stop' execution
+---------------------------
+
+C: { "execute": "stop" }
+S: { "return": {} }
+
+3.4 KVM information
+-------------------
+
+C: { "execute": "query-kvm", "id": "example" }
+S: { "return": { "enabled": true, "present": true }, "id": "example"}
+
+3.5 Parsing error
+-----------------
+
+C: { "execute": }
+S: { "error": { "class": "GenericError", "desc": "Invalid JSON syntax" } }
+
+3.6 Powerdown event
+-------------------
+
+S: { "timestamp": { "seconds": 1258551470, "microseconds": 802384 },
+ "event": "POWERDOWN" }
+
+3.7 Out-of-band execution
+-------------------------
+
+C: { "exec-oob": "migrate-pause", "id": 42 }
+S: { "id": 42,
+ "error": { "class": "GenericError",
+ "desc": "migrate-pause is currently only supported during postcopy-active state" } }
+
+
+4. Capabilities Negotiation
+===========================
+
+When a Client successfully establishes a connection, the Server is in
+Capabilities Negotiation mode.
+
+In this mode only the qmp_capabilities command is allowed to run; all
+other commands will return the CommandNotFound error. Asynchronous
+messages are not delivered either.
+
+Clients should use the qmp_capabilities command to enable capabilities
+advertised in the Server's greeting (section '2.2 Server Greeting') they
+support.
+
+When the qmp_capabilities command is issued and does not return an
+error, the Server enters Command mode, where capability changes take
+effect, all commands (except qmp_capabilities) are allowed, and asynchronous
+messages are delivered.
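Editor's note: negotiation boils down to intersecting the capabilities advertised in the greeting with those the client implements. A minimal sketch under that reading; `build_capabilities_command` is a hypothetical name, and omitting "arguments" when nothing is enabled is a choice made here, mirroring the plain form of the command.

```python
import json


def build_capabilities_command(greeting_line, supported):
    """Build the qmp_capabilities command for a given Server greeting."""
    greeting = json.loads(greeting_line)
    advertised = greeting["QMP"]["capabilities"]
    # Enable only capabilities that are both advertised and supported,
    # keeping the Server's order.
    enable = [cap for cap in advertised if cap in supported]
    command = {"execute": "qmp_capabilities"}
    if enable:
        command["arguments"] = {"enable": enable}
    return json.dumps(command)
```

With the greeting from example 3.1 and a client that supports "oob", this yields the command shown in example 3.2.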
+
+5. Compatibility Considerations
+===============================
+
+All protocol changes or new features which modify the protocol format in an
+incompatible way are disabled by default and will be advertised by the
+capabilities array (section '2.2 Server Greeting'). Thus, Clients can check
+that array and enable the capabilities they support.
+
+The QMP Server performs a type check on the arguments to a command. It
+generates an error if a value does not have the expected type for its
+key, or if it does not understand a key that the Client included. The
+strictness of the Server catches wrong assumptions of Clients about
+the Server's schema. Clients can assume that, when such validation
+errors occur, they will be reported before the command generated any
+side effect.
+
+However, Clients must not assume any particular:
+
+- Length of json-arrays
+- Size of json-objects; in particular, future versions of QEMU may add
+ new keys and Clients should be able to ignore them.
+- Order of json-object members or json-array elements
+- Number of errors generated by a command, that is, new errors can be added
+ to any existing command in newer versions of the Server
+
+Any command or member name beginning with "x-" is deemed experimental,
+and may be withdrawn or changed in an incompatible manner in a future
+release.
+
+Of course, the Server does guarantee to send valid JSON. But apart from
+this, a Client should be "conservative in what they send, and liberal in
+what they accept".
+
+6. Downstream extension of QMP
+==============================
+
+We recommend that downstream consumers of QEMU do *not* modify QMP.
+Management tools should be able to support both upstream and downstream
+versions of QMP without special logic, and downstream extensions are
+inherently at odds with that.
+
+However, we recognize that it is sometimes impossible for downstreams to
+avoid modifying QMP. Both upstream and downstream need to take care to
+preserve long-term compatibility and interoperability.
+
+To help with that, QMP reserves JSON object member names beginning with
+'__' (double underscore) for downstream use ("downstream names"). This
+means upstream will never use any downstream names for its commands,
+arguments, errors, asynchronous events, and so forth.
+
+Any new names downstream wishes to add must begin with '__'. To
+ensure compatibility with other downstreams, it is strongly
+recommended that you prefix your downstream names with '__RFQDN_' where
+RFQDN is a valid, reverse fully qualified domain name which you
+control. For example, a qemu-kvm specific monitor command would be:
+
+ (qemu) __org.linux-kvm_enable_irqchip
+
+Downstream must not change the server greeting (section 2.2) other than
+to offer additional capabilities. But see below for why even that is
+discouraged.
+
+Section '5 Compatibility Considerations' applies to downstream as well
+as to upstream, obviously. It follows that downstream must behave
+exactly like upstream for any input not containing members with
+downstream names ("downstream members"), except it may add members
+with downstream names to its output.
+
+Thus, a client should not be able to distinguish downstream from
+upstream as long as it doesn't send input with downstream members, and
+properly ignores any downstream members in the output it receives.
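Editor's note: the naming rule and the "ignore downstream members" advice can both be expressed in a few lines. A minimal sketch; all three function names are hypothetical, and `__com.example_extra` in the usage below is an invented member name.

```python
def downstream_name(rfqdn, name):
    """Build a downstream name of the form '__RFQDN_name'."""
    return f"__{rfqdn}_{name}"


def is_downstream_name(name):
    """True if the member name is reserved for downstream use."""
    return name.startswith("__")


def drop_downstream_members(value):
    """Return a copy of a decoded JSON value without downstream members."""
    if isinstance(value, dict):
        return {key: drop_downstream_members(item)
                for key, item in value.items()
                if not is_downstream_name(key)}
    if isinstance(value, list):
        return [drop_downstream_members(item) for item in value]
    return value
```

For instance, `downstream_name("org.linux-kvm", "enable_irqchip")` reproduces the name from the example above, and `drop_downstream_members({"return": {"enabled": True, "__com.example_extra": 1}})` leaves only the upstream member.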
+
+Advice on downstream modifications:
+
+1. Introducing new commands is okay. If you want to extend an existing
+ command, consider introducing a new one with the new behaviour
+ instead.
+
+2. Introducing new asynchronous messages is okay. If you want to extend
+ an existing message, consider adding a new one instead.
+
+3. Introducing new errors for use in new commands is okay. Adding new
+ errors to existing commands counts as extension, so 1. applies.
+
+4. New capabilities are strongly discouraged. Capabilities are for
+ evolving the basic protocol, and multiple diverging basic protocol
+ dialects are most undesirable.
diff --git a/docs/interop/vhost-user-gpu.rst b/docs/interop/vhost-user-gpu.rst
new file mode 100644
index 000000000..71a2c52b3
--- /dev/null
+++ b/docs/interop/vhost-user-gpu.rst
@@ -0,0 +1,243 @@
+=======================
+Vhost-user-gpu Protocol
+=======================
+
+..
+ Licence: This work is licensed under the terms of the GNU GPL,
+ version 2 or later. See the COPYING file in the top-level
+ directory.
+
+.. contents:: Table of Contents
+
+Introduction
+============
+
+The vhost-user-gpu protocol aims to share the rendering result of a
+virtio-gpu, produced by a vhost-user slave process, with a vhost-user
+master process (such as QEMU). It bears a resemblance to a display
+server protocol, if you consider QEMU as the display server and the
+slave as the client, but in a very limited way. Typically, it will
+work by setting a scanout/display configuration, before sending flush
+events for the display updates. It will also update the cursor shape
+and position.
+
+The protocol is sent over a UNIX domain stream socket, since it uses
+socket ancillary data to share opened file descriptors (DMABUF fds or
+shared memory). The socket is usually obtained via
+``VHOST_USER_GPU_SET_SOCKET``.
+
+Requests are sent by the *slave*, and the optional replies by the
+*master*.
+
+Wire format
+===========
+
+Unless specified differently, numbers are in the machine native byte
+order.
+
+A vhost-user-gpu message (request and reply) consists of 3 header
+fields and a payload.
+
++---------+-------+------+---------+
+| request | flags | size | payload |
++---------+-------+------+---------+
+
+Header
+------
+
+:request: ``u32``, type of the request
+
+:flags: ``u32``, 32-bit bit field:
+
+ - Bit 2 is the reply flag - needs to be set on each reply
+
+:size: ``u32``, size of the payload
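Editor's note: the three-field header maps directly onto a fixed 12-byte layout. A minimal sketch using Python's `struct` module; the helper names are hypothetical, and the request number 1 in the usage below is only a sample value.

```python
import struct

# request, flags, size: three u32 fields. "=" selects native byte order
# with standard sizes and no alignment padding.
HEADER = struct.Struct("=III")
REPLY_FLAG = 1 << 2  # bit 2 is the reply flag


def pack_message(request, payload=b"", reply=False):
    """Serialize a header followed by an already-encoded payload."""
    flags = REPLY_FLAG if reply else 0
    return HEADER.pack(request, flags, len(payload)) + payload


def unpack_header(data):
    """Return (request, is_reply, payload_size) from the first 12 bytes."""
    request, flags, size = HEADER.unpack_from(data)
    return request, bool(flags & REPLY_FLAG), size
```

For example, `unpack_header(pack_message(1, struct.pack("=Q", 0), reply=True))` gives `(1, True, 8)`, a reply carrying one `u64`.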
+
+Payload types
+-------------
+
+Depending on the request type, **payload** can be:
+
+VhostUserGpuCursorPos
+^^^^^^^^^^^^^^^^^^^^^
+
++------------+---+---+
+| scanout-id | x | y |
++------------+---+---+
+
+:scanout-id: ``u32``, the scanout where the cursor is located
+
+:x/y: ``u32``, the cursor position
+
+VhostUserGpuCursorUpdate
+^^^^^^^^^^^^^^^^^^^^^^^^
+
++-----+-------+-------+--------+
+| pos | hot_x | hot_y | cursor |
++-----+-------+-------+--------+
+
+:pos: a ``VhostUserGpuCursorPos``, the cursor location
+
+:hot_x/hot_y: ``u32``, the cursor hot location
+
+:cursor: ``[u32; 64 * 64]``, 64x64 RGBA cursor data (PIXMAN_a8r8g8b8 format)
+
+VhostUserGpuScanout
+^^^^^^^^^^^^^^^^^^^
+
++------------+---+---+
+| scanout-id | w | h |
++------------+---+---+
+
+:scanout-id: ``u32``, the scanout configuration to set
+
+:w/h: ``u32``, the scanout width/height size
+
+VhostUserGpuUpdate
+^^^^^^^^^^^^^^^^^^
+
++------------+---+---+---+---+------+
+| scanout-id | x | y | w | h | data |
++------------+---+---+---+---+------+
+
+:scanout-id: ``u32``, the scanout content to update
+
+:x/y/w/h: ``u32``, region of the update
+
+:data: RGB data (PIXMAN_x8r8g8b8 format)
+
+VhostUserGpuDMABUFScanout
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
++------------+---+---+---+---+-----+-----+--------+-------+--------+
+| scanout-id | x | y | w | h | fdw | fdh | stride | flags | fourcc |
++------------+---+---+---+---+-----+-----+--------+-------+--------+
+
+:scanout-id: ``u32``, the scanout configuration to set
+
+:x/y: ``u32``, the location of the scanout within the DMABUF
+
+:w/h: ``u32``, the scanout width/height size
+
+:fdw/fdh/stride/flags: ``u32``, the DMABUF width/height/stride/flags
+
+:fourcc: ``i32``, the DMABUF fourcc
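Editor's note: the fixed-size payloads above can be described with `struct` format strings, which also makes their sizes easy to check. A minimal sketch assuming tightly packed fields, as suggested by the QEMU_PACKED struct shown later in this file; the constant and function names are hypothetical.

```python
import struct

CURSOR_POS = struct.Struct("=3I")              # scanout-id, x, y
CURSOR_UPDATE = struct.Struct("=3I2I4096I")    # pos, hot_x, hot_y, 64*64 pixels
SCANOUT = struct.Struct("=3I")                 # scanout-id, w, h
UPDATE_PREFIX = struct.Struct("=5I")           # scanout-id, x, y, w, h (data follows)
DMABUF_SCANOUT = struct.Struct("=9Ii")         # nine u32 fields, then i32 fourcc


def pack_dmabuf_scanout(scanout_id, x, y, w, h,
                        fd_w, fd_h, stride, flags, fourcc):
    """Encode a VhostUserGpuDMABUFScanout payload in native byte order."""
    return DMABUF_SCANOUT.pack(scanout_id, x, y, w, h,
                               fd_w, fd_h, stride, flags, fourcc)
```

The DMABUF file descriptor itself is not part of these bytes; it travels as socket ancillary data alongside the message.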
+
+
+C structure
+-----------
+
+In QEMU the vhost-user-gpu message is implemented with the following struct:
+
+.. code:: c
+
+  typedef struct VhostUserGpuMsg {
+      uint32_t request; /* VhostUserGpuRequest */
+      uint32_t flags;
+      uint32_t size; /* the following payload size */
+      union {
+          VhostUserGpuCursorPos cursor_pos;
+          VhostUserGpuCursorUpdate cursor_update;
+          VhostUserGpuScanout scanout;
+          VhostUserGpuUpdate update;
+          VhostUserGpuDMABUFScanout dmabuf_scanout;
+          struct virtio_gpu_resp_display_info display_info;
+          uint64_t u64;
+      } payload;
+  } QEMU_PACKED VhostUserGpuMsg;
+
+Protocol features
+-----------------
+
+None yet.
+
+As the protocol may need to evolve, new messages and communication
+changes are negotiated thanks to preliminary
+``VHOST_USER_GPU_GET_PROTOCOL_FEATURES`` and
+``VHOST_USER_GPU_SET_PROTOCOL_FEATURES`` requests.
+
+Communication
+=============
+
+Message types
+-------------
+
+``VHOST_USER_GPU_GET_PROTOCOL_FEATURES``
+ :id: 1
+ :request payload: N/A
+ :reply payload: ``u64``
+
+ Get the supported protocol features bitmask.
+
+``VHOST_USER_GPU_SET_PROTOCOL_FEATURES``
+ :id: 2
+ :request payload: ``u64``
+ :reply payload: N/A
+
+ Enable protocol features using a bitmask.
+
+``VHOST_USER_GPU_GET_DISPLAY_INFO``
+ :id: 3
+ :request payload: N/A
+ :reply payload: ``struct virtio_gpu_resp_display_info`` (from virtio specification)
+
+ Get the preferred display configuration.
+
+``VHOST_USER_GPU_CURSOR_POS``
+ :id: 4
+ :request payload: ``VhostUserGpuCursorPos``
+ :reply payload: N/A
+
+ Set/show the cursor position.
+
+``VHOST_USER_GPU_CURSOR_POS_HIDE``
+ :id: 5
+ :request payload: ``VhostUserGpuCursorPos``
+ :reply payload: N/A
+
+ Set/hide the cursor.
+
+``VHOST_USER_GPU_CURSOR_UPDATE``
+ :id: 6
+ :request payload: ``VhostUserGpuCursorUpdate``
+ :reply payload: N/A
+
+ Update the cursor shape and location.
+
+``VHOST_USER_GPU_SCANOUT``
+ :id: 7
+ :request payload: ``VhostUserGpuScanout``
+ :reply payload: N/A
+
+ Set the scanout resolution. To disable a scanout, the dimensions
+ width/height are set to 0.
+
+``VHOST_USER_GPU_UPDATE``
+ :id: 8
+ :request payload: ``VhostUserGpuUpdate``
+ :reply payload: N/A
+
+ Update the scanout content. The data payload contains the graphical bits.
+ The display should be flushed and presented.
+
+``VHOST_USER_GPU_DMABUF_SCANOUT``
+ :id: 9
+ :request payload: ``VhostUserGpuDMABUFScanout``
+ :reply payload: N/A
+
+ Set the scanout resolution/configuration, and share a DMABUF file
+ descriptor for the scanout content, which is passed as ancillary
+ data. To disable a scanout, the dimensions width/height are set
+ to 0 and no file descriptor is passed.
+
+``VHOST_USER_GPU_DMABUF_UPDATE``
+ :id: 10
+ :request payload: ``VhostUserGpuUpdate``
+ :reply payload: empty payload
+
+ The display should be flushed and presented according to updated
+ region from ``VhostUserGpuUpdate``.
+
+ Note: there is no data payload, since the scanout is shared thanks
+ to DMABUF, that must have been set previously with
+ ``VHOST_USER_GPU_DMABUF_SCANOUT``.
diff --git a/docs/interop/vhost-user.json b/docs/interop/vhost-user.json
new file mode 100644
index 000000000..b6ade9e49
--- /dev/null
+++ b/docs/interop/vhost-user.json
@@ -0,0 +1,267 @@
+# -*- Mode: Python -*-
+# vim: filetype=python
+#
+# Copyright (C) 2018 Red Hat, Inc.
+#
+# Authors:
+# Marc-André Lureau <marcandre.lureau@redhat.com>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later. See the COPYING file in the top-level directory.
+
+##
+# = vhost user backend discovery & capabilities
+##
+
+##
+# @VHostUserBackendType:
+#
+# List the various vhost user backend types.
+#
+# @9p: virtio 9p transport
+# @balloon: virtio balloon
+# @block: virtio block
+# @caif: virtio caif
+# @console: virtio console
+# @crypto: virtio crypto
+# @gpu: virtio gpu
+# @input: virtio input
+# @net: virtio net
+# @rng: virtio rng
+# @rpmsg: virtio remote processor messaging
+# @rproc-serial: virtio remoteproc serial link
+# @scsi: virtio scsi
+# @vsock: virtio vsock transport
+# @fs: virtio fs (since 4.2)
+#
+# Since: 4.0
+##
+{
+ 'enum': 'VHostUserBackendType',
+ 'data': [
+ '9p',
+ 'balloon',
+ 'block',
+ 'caif',
+ 'console',
+ 'crypto',
+ 'gpu',
+ 'input',
+ 'net',
+ 'rng',
+ 'rpmsg',
+ 'rproc-serial',
+ 'scsi',
+ 'vsock',
+ 'fs'
+ ]
+}
+
+##
+# @VHostUserBackendBlockFeature:
+#
+# List of vhost user "block" features.
+#
+# @read-only: The --read-only command line option is supported.
+# @blk-file: The --blk-file command line option is supported.
+#
+# Since: 5.0
+##
+{
+ 'enum': 'VHostUserBackendBlockFeature',
+ 'data': [ 'read-only', 'blk-file' ]
+}
+
+##
+# @VHostUserBackendCapabilitiesBlock:
+#
+# Capabilities reported by vhost user "block" backends
+#
+# @features: list of supported features.
+#
+# Since: 5.0
+##
+{
+ 'struct': 'VHostUserBackendCapabilitiesBlock',
+ 'data': {
+ 'features': [ 'VHostUserBackendBlockFeature' ]
+ }
+}
+
+##
+# @VHostUserBackendInputFeature:
+#
+# List of vhost user "input" features.
+#
+# @evdev-path: The --evdev-path command line option is supported.
+# @no-grab: The --no-grab command line option is supported.
+#
+# Since: 4.0
+##
+{
+ 'enum': 'VHostUserBackendInputFeature',
+ 'data': [ 'evdev-path', 'no-grab' ]
+}
+
+##
+# @VHostUserBackendCapabilitiesInput:
+#
+# Capabilities reported by vhost user "input" backends
+#
+# @features: list of supported features.
+#
+# Since: 4.0
+##
+{
+ 'struct': 'VHostUserBackendCapabilitiesInput',
+ 'data': {
+ 'features': [ 'VHostUserBackendInputFeature' ]
+ }
+}
+
+##
+# @VHostUserBackendGPUFeature:
+#
+# List of vhost user "gpu" features.
+#
+# @render-node: The --render-node command line option is supported.
+# @virgl: The --virgl command line option is supported.
+#
+# Since: 4.0
+##
+{
+ 'enum': 'VHostUserBackendGPUFeature',
+ 'data': [ 'render-node', 'virgl' ]
+}
+
+##
+# @VHostUserBackendCapabilitiesGPU:
+#
+# Capabilities reported by vhost user "gpu" backends.
+#
+# @features: list of supported features.
+#
+# Since: 4.0
+##
+{
+ 'struct': 'VHostUserBackendCapabilitiesGPU',
+ 'data': {
+ 'features': [ 'VHostUserBackendGPUFeature' ]
+ }
+}
+
+##
+# @VHostUserBackendCapabilities:
+#
+# Capabilities reported by vhost user backends.
+#
+# @type: The vhost user backend type.
+#
+# Since: 4.0
+##
+{
+ 'union': 'VHostUserBackendCapabilities',
+ 'base': { 'type': 'VHostUserBackendType' },
+ 'discriminator': 'type',
+ 'data': {
+ 'input': 'VHostUserBackendCapabilitiesInput',
+ 'gpu': 'VHostUserBackendCapabilitiesGPU'
+ }
+}
+
+##
+# @VhostUserBackend:
+#
+# Describes a vhost user backend to management software.
+#
+# It is possible for multiple @VhostUserBackend elements to match the
+# search criteria of management software. Applications thus need rules
+# to pick one of the many matches, and users need the ability to
+# override distro defaults.
+#
+# It is recommended to create vhost user backend JSON files (each
+# containing a single @VhostUserBackend root element) with a
+# double-digit prefix, for example "50-qemu-gpu.json",
+# "50-crosvm-gpu.json", etc, so they can be sorted in predictable
+# order. The backend JSON files should be searched for in three
+# directories:
+#
+# - /usr/share/qemu/vhost-user -- populated by distro-provided
+# packages (XDG_DATA_DIRS covers
+# /usr/share by default),
+#
+# - /etc/qemu/vhost-user -- exclusively for sysadmins' local additions,
+#
+# - $XDG_CONFIG_HOME/qemu/vhost-user -- exclusively for per-user local
+# additions (XDG_CONFIG_HOME
+# defaults to $HOME/.config).
+#
+# Top-down, the list of directories goes from general to specific.
+#
+# Management software should build a list of files from all three
+# locations, then sort the list by filename (i.e., basename
+# component). Management software should choose the first JSON file on
+# the sorted list that matches the search criteria. If a more specific
+# directory has a file with same name as a less specific directory,
+# then the file in the more specific directory takes effect. If the
+# more specific file is zero length, it hides the less specific one.
+#
+# For example, if a distro ships
+#
+# - /usr/share/qemu/vhost-user/50-qemu-gpu.json
+#
+# - /usr/share/qemu/vhost-user/50-crosvm-gpu.json
+#
+# then the sysadmin can prevent the default QEMU GPU being used at all with
+#
+# $ touch /etc/qemu/vhost-user/50-qemu-gpu.json
+#
+# The sysadmin can replace/alter the distro default QEMU GPU with
+#
+# $ vim /etc/qemu/vhost-user/50-qemu-gpu.json
+#
+# or they can provide a parallel QEMU GPU with higher priority
+#
+# $ vim /etc/qemu/vhost-user/10-qemu-gpu.json
+#
+# or they can provide a parallel QEMU GPU with lower priority
+#
+# $ vim /etc/qemu/vhost-user/99-qemu-gpu.json
+#
+# @type: The vhost user backend type.
+#
+# @description: Provides a human-readable description of the backend.
+# Management software may or may not display @description.
+#
+# @binary: Absolute path to the backend binary.
+#
+# @tags: An optional list of auxiliary strings associated with the
+# backend for which @description is not appropriate, due to the
+# latter's possible exposure to the end-user. @tags serves
+# development and debugging purposes only, and management
+# software shall explicitly ignore it.
+#
+# Since: 4.0
+#
+# Example:
+#
+# {
+# "description": "QEMU vhost-user-gpu",
+# "type": "gpu",
+# "binary": "/usr/libexec/qemu/vhost-user-gpu",
+# "tags": [
+# "CONFIG_OPENGL=y",
+# "CONFIG_GBM=y"
+# ]
+# }
+#
+##
+{
+ 'struct' : 'VhostUserBackend',
+ 'data' : {
+ 'description': 'str',
+ 'type': 'VHostUserBackendType',
+ 'binary': 'str',
+ '*tags': [ 'str' ]
+ }
+}
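Editor's note: the file search rules in the @VhostUserBackend comment (merge the directories, let more specific directories override, treat zero-length files as masks, sort by basename) can be sketched as follows. This is an illustration only; `list_backend_files` is a hypothetical name, the caller supplies the directory list ordered from general to specific, and matching each file against the search criteria is left out.

```python
import os


def list_backend_files(directories):
    """Return candidate JSON paths sorted by basename.

    `directories` is ordered from least to most specific.
    """
    chosen = {}
    for directory in directories:
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            if name.endswith(".json"):
                # A later (more specific) directory overrides earlier ones.
                chosen[name] = os.path.join(directory, name)
    # A zero-length file hides the less specific file of the same name.
    return [path for name, path in sorted(chosen.items())
            if os.path.getsize(path) > 0]
```

Management software would then parse the returned files in order and pick the first one whose contents match what it is looking for.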
diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
new file mode 100644
index 000000000..edc3ad84a
--- /dev/null
+++ b/docs/interop/vhost-user.rst
@@ -0,0 +1,1585 @@
+.. _vhost_user_proto:
+
+===================
+Vhost-user Protocol
+===================
+
+..
+ Copyright 2014 Virtual Open Systems Sarl.
+ Copyright 2019 Intel Corporation
+ Licence: This work is licensed under the terms of the GNU GPL,
+ version 2 or later. See the COPYING file in the top-level
+ directory.
+
+.. contents:: Table of Contents
+
+Introduction
+============
+
+This protocol aims to complement the ``ioctl`` interface used to
+control the vhost implementation in the Linux kernel. It implements
+the control plane needed to establish virtqueue sharing with a user
+space process on the same host. It uses communication over a Unix
+domain socket to share file descriptors in the ancillary data of the
+message.
+
+The protocol defines 2 sides of the communication, *master* and
+*slave*. *Master* is the application that shares its virtqueues, in
+our case QEMU. *Slave* is the consumer of the virtqueues.
+
+In the current implementation QEMU is the *master*, and the *slave* is
+the external process consuming the virtio queues, for example a
+software Ethernet switch running in user space, such as Snabbswitch,
+or a block device backend processing read & write to a virtual
+disk. In order to facilitate interoperability between various backend
+implementations, it is recommended to follow the :ref:`Backend program
+conventions <backend_conventions>`.
+
+*Master* and *slave* can be either a client (i.e. connecting) or
+server (listening) in the socket communication.
+
+Message Specification
+=====================
+
+.. Note:: All numbers are in the machine native byte order.
+
+A vhost-user message consists of 3 header fields and a payload.
+
++---------+-------+------+---------+
+| request | flags | size | payload |
++---------+-------+------+---------+
+
+Header
+------
+
+:request: 32-bit type of the request
+
+:flags: 32-bit bit field
+
+- Lower 2 bits are the version (currently 0x01)
+- Bit 2 is the reply flag - needs to be sent on each reply from the slave
+- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
+ details.
+
+:size: 32-bit size of the payload
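Editor's note: the flags field packs a version and two single-bit flags into one 32-bit value. A minimal sketch of encoding and decoding the header with Python's `struct` module; the helper names are hypothetical and the request numbers used in the usage are sample values.

```python
import struct

HEADER = struct.Struct("=III")  # request, flags, size in native byte order
VERSION = 0x1
VERSION_MASK = 0x3              # lower 2 bits hold the version
REPLY = 1 << 2                  # set on each reply from the slave
NEED_REPLY = 1 << 3             # see the REPLY_ACK extension


def pack_header(request, payload_size, reply=False, need_reply=False):
    """Serialize the 12-byte header that precedes the payload."""
    flags = VERSION
    if reply:
        flags |= REPLY
    if need_reply:
        flags |= NEED_REPLY
    return HEADER.pack(request, flags, payload_size)


def unpack_header(data):
    """Decode a header into a dict of its logical fields."""
    request, flags, size = HEADER.unpack_from(data)
    return {
        "request": request,
        "version": flags & VERSION_MASK,
        "reply": bool(flags & REPLY),
        "need_reply": bool(flags & NEED_REPLY),
        "size": size,
    }
```

For example, `pack_header(1, 0, need_reply=True)` carries a flags value of 0x9: version 1 plus the need_reply bit.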
+
+Payload
+-------
+
+Depending on the request type, **payload** can be:
+
+A single 64-bit integer
+^^^^^^^^^^^^^^^^^^^^^^^
+
++-----+
+| u64 |
++-----+
+
+:u64: a 64-bit unsigned integer
+
+A vring state description
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------+-----+
+| index | num |
++-------+-----+
+
+:index: a 32-bit index
+
+:num: a 32-bit number
+
+A vring address description
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------+-------+------+------------+------+-----------+-----+
+| index | flags | size | descriptor | used | available | log |
++-------+-------+------+------------+------+-----------+-----+
+
+:index: a 32-bit vring index
+
+:flags: a 32-bit vring flags
+
+:descriptor: a 64-bit ring address of the vring descriptor table
+
+:used: a 64-bit ring address of the vring used ring
+
+:available: a 64-bit ring address of the vring available ring
+
+:log: a 64-bit guest address for logging
+
+Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
+been negotiated. Otherwise it is a user address.
+
+Memory regions description
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+---------+---------+-----+---------+
+| num regions | padding | region0 | ... | region7 |
++-------------+---------+---------+-----+---------+
+
+:num regions: a 32-bit number of regions
+
+:padding: 32-bit
+
+A region is:
+
++---------------+------+--------------+-------------+
+| guest address | size | user address | mmap offset |
++---------------+------+--------------+-------------+
+
+:guest address: a 64-bit guest address of the region
+
+:size: a 64-bit size
+
+:user address: a 64-bit user address
+
+:mmap offset: 64-bit offset where region starts in the mapped memory
+
+Single memory region description
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++---------+---------------+------+--------------+-------------+
+| padding | guest address | size | user address | mmap offset |
++---------+---------------+------+--------------+-------------+
+
+:padding: 64-bit
+
+:guest address: a 64-bit guest address of the region
+
+:size: a 64-bit size
+
+:user address: a 64-bit user address
+
+:mmap offset: 64-bit offset where region starts in the mapped memory
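Editor's note: both region layouts above consist solely of 64-bit fields, so they are straightforward to encode. A minimal sketch; the names are hypothetical, and the addresses in the usage are made-up sample values.

```python
import struct

# guest address, size, user address, mmap offset
REGION = struct.Struct("=QQQQ")
# 64-bit padding followed by the same four fields
SINGLE_REGION = struct.Struct("=QQQQQ")


def pack_region(guest_addr, size, user_addr, mmap_offset):
    """Encode one region as it appears in a memory regions description."""
    return REGION.pack(guest_addr, size, user_addr, mmap_offset)


def pack_single_region(guest_addr, size, user_addr, mmap_offset):
    """Encode a single memory region description (zeroed padding first)."""
    return SINGLE_REGION.pack(0, guest_addr, size, user_addr, mmap_offset)
```

The single-region form is simply the plain region prefixed by eight zero bytes, so `pack_single_region(...)[8:]` equals `pack_region(...)` for the same arguments.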
+
+Log description
+^^^^^^^^^^^^^^^
+
++----------+------------+
+| log size | log offset |
++----------+------------+
+
+:log size: size of area used for logging
+
+:log offset: offset from start of supplied file descriptor where
+ logging starts (i.e. where guest address 0 would be
+ logged)
+
+An IOTLB message
+^^^^^^^^^^^^^^^^
+
++------+------+--------------+-------------------+------+
+| iova | size | user address | permissions flags | type |
++------+------+--------------+-------------------+------+
+
+:iova: a 64-bit I/O virtual address programmed by the guest
+
+:size: a 64-bit size
+
+:user address: a 64-bit user address
+
+:permissions flags: an 8-bit value:
+ - 0: No access
+ - 1: Read access
+ - 2: Write access
+ - 3: Read/Write access
+
+:type: an 8-bit IOTLB message type:
+ - 1: IOTLB miss
+ - 2: IOTLB update
+ - 3: IOTLB invalidate
+ - 4: IOTLB access fail
+
+Virtio device config space
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++--------+------+-------+---------+
+| offset | size | flags | payload |
++--------+------+-------+---------+
+
+:offset: a 32-bit offset of virtio device's configuration space
+
+:size: a 32-bit configuration space access size in bytes
+
+:flags: a 32-bit value:
+ - 0: Vhost master messages used for writeable fields
+ - 1: Vhost master messages used for live migration
+
+:payload: Size bytes array holding the contents of the virtio
+ device's configuration space
+
+Vring area description
+^^^^^^^^^^^^^^^^^^^^^^
+
++-----+------+--------+
+| u64 | size | offset |
++-----+------+--------+
+
+:u64: a 64-bit integer containing the vring index and flags
+
+:size: a 64-bit size of this area
+
+:offset: a 64-bit offset of this area from the start of the
+ supplied file descriptor
+
+Inflight description
+^^^^^^^^^^^^^^^^^^^^
+
++-----------+-------------+------------+------------+
+| mmap size | mmap offset | num queues | queue size |
++-----------+-------------+------------+------------+
+
+:mmap size: a 64-bit size of area to track inflight I/O
+
+:mmap offset: a 64-bit offset of this area from the start
+ of the supplied file descriptor
+
+:num queues: a 16-bit number of virtqueues
+
+:queue size: a 16-bit size of virtqueues
+
+C structure
+-----------
+
+In QEMU the vhost-user message is implemented with the following struct:
+
+.. code:: c
+
+  typedef struct VhostUserMsg {
+      VhostUserRequest request;
+      uint32_t flags;
+      uint32_t size;
+      union {
+          uint64_t u64;
+          struct vhost_vring_state state;
+          struct vhost_vring_addr addr;
+          VhostUserMemory memory;
+          VhostUserLog log;
+          struct vhost_iotlb_msg iotlb;
+          VhostUserConfig config;
+          VhostUserVringArea area;
+          VhostUserInflight inflight;
+      };
+  } QEMU_PACKED VhostUserMsg;
+
+Communication
+=============
+
+The protocol for vhost-user is based on the existing implementation of
+vhost for the Linux Kernel. Most messages that can be sent via the
+Unix domain socket implementing vhost-user have an equivalent ioctl to
+the kernel implementation.
+
+The communication consists of *master* sending message requests and
+*slave* sending message replies. Most of the requests don't require
+replies. Here is a list of the ones that do:
+
+* ``VHOST_USER_GET_FEATURES``
+* ``VHOST_USER_GET_PROTOCOL_FEATURES``
+* ``VHOST_USER_GET_VRING_BASE``
+* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
+* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
+
+.. seealso::
+
+ :ref:`REPLY_ACK <reply_ack>`
+ The section on ``REPLY_ACK`` protocol extension.
+
+There are several messages that the master sends with file descriptors passed
+in the ancillary data:
+
+* ``VHOST_USER_SET_MEM_TABLE``
+* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
+* ``VHOST_USER_SET_LOG_FD``
+* ``VHOST_USER_SET_VRING_KICK``
+* ``VHOST_USER_SET_VRING_CALL``
+* ``VHOST_USER_SET_VRING_ERR``
+* ``VHOST_USER_SET_SLAVE_REQ_FD``
+* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
+
+If *master* is unable to send the full message or receives a wrong
+reply it will close the connection. An optional reconnection mechanism
+can be implemented.
+
+If *slave* detects some error such as incompatible features, it may also
+close the connection. This should only happen in exceptional circumstances.
+
+Any protocol extensions are gated by protocol feature bits, which
+allows full backwards compatibility on both master and slave. As
+older slaves don't support negotiating protocol features, a feature
+bit was dedicated for this purpose::
+
+ #define VHOST_USER_F_PROTOCOL_FEATURES 30
+
+Starting and stopping rings
+---------------------------
+
+Client must only process each ring when it is started.
+
+Client must only pass data between the ring and the backend when the
+ring is enabled.
+
+If ring is started but disabled, client must process the ring without
+talking to the backend.
+
+For example, for a networking device, in the disabled state client
+must not supply any new RX packets, but must process and discard any
+TX packets.
+
+If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
+ring is initialized in an enabled state.
+
+If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
+initialized in a disabled state. Client must not pass data to/from the
+backend until ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with
+parameter 1, or after it has been disabled by
+``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.
+
+Each ring is initialized in a stopped state. Client must not process
+it until ring is started, or after it has been stopped.
+
+Client must start ring upon receiving a kick (that is, detecting that
+file descriptor is readable) on the descriptor specified by
+``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
+``VHOST_USER_VRING_KICK`` if negotiated, and stop ring upon receiving
+``VHOST_USER_GET_VRING_BASE``.
+
+While processing the rings (whether they are enabled or not), client
+must support changing some configuration aspects on the fly.
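Editor's note: the started/stopped and enabled/disabled rules form a small state machine. A minimal sketch of that model as read from the text above; the class and method names are hypothetical, and only the transitions named in this section are covered.

```python
VHOST_USER_F_PROTOCOL_FEATURES = 30


class Ring:
    """Tracks whether a ring may be processed and may pass data."""

    def __init__(self, negotiated_features):
        protocol_features = bool(
            (negotiated_features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1)
        # Every ring starts stopped; it starts disabled only if
        # VHOST_USER_F_PROTOCOL_FEATURES has been negotiated.
        self.started = False
        self.enabled = not protocol_features

    def on_kick(self):
        """A kick (eventfd readable or in-band message) starts the ring."""
        self.started = True

    def on_get_vring_base(self):
        """VHOST_USER_GET_VRING_BASE stops the ring."""
        self.started = False

    def on_set_vring_enable(self, value):
        """VHOST_USER_SET_VRING_ENABLE with parameter 1 or 0."""
        self.enabled = bool(value)

    def may_process(self):
        return self.started

    def may_pass_data(self):
        return self.started and self.enabled
```

A started but disabled ring returns True from `may_process()` and False from `may_pass_data()`, which corresponds to the "process and discard TX packets" case for a network device.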
+
+Multiple queue support
+----------------------
+
+Many devices have a fixed number of virtqueues. In this case the master
+already knows the number of available virtqueues without communicating with the
+slave.
+
+Some devices do not have a fixed number of virtqueues. Instead the maximum
+number of virtqueues is chosen by the slave. The number can depend on host
+resource availability or slave implementation details. Such devices are called
+multiple queue devices.
+
+Multiple queue support allows the slave to advertise the maximum number of
+queues. This is treated as a protocol extension, hence the slave has to
+implement protocol features first. The multiple queues feature is supported
+only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.
+
+The max number of queues the slave supports can be queried with message
+``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested
+queues is bigger than that.
+
+As all queues share one connection, the master uses a unique index for each
+queue in the sent message to identify a specified queue.
+
+The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
+vhost-user-net has historically automatically enabled the first queue pair.
+
+Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
+feature, even for devices with a fixed number of virtqueues, since it is simple
+to implement and offers a degree of introspection.
+
+Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
+devices with a fixed number of virtqueues. Only true multiqueue devices
+require this protocol feature.
+
+Migration
+---------
+
+During live migration, the master may need to track the modifications
+the slave makes to the memory mapped regions. The client should mark
+the dirty pages in a log. Once it complies with this logging, it may
+declare the ``VHOST_F_LOG_ALL`` vhost feature.
+
+To start/stop logging of data/used ring writes, server may send
+messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
+``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
+flags set to 1/0, respectively.
+
+All the modifications to memory pointed to by vring "descriptor" should
+be marked. Modifications to "used" vring should be marked if
+``VHOST_VRING_F_LOG`` is part of ring's flags.
+
+Dirty pages are of size::
+
+ #define VHOST_LOG_PAGE 0x1000
+
+The log memory fd is provided in the ancillary data of
+``VHOST_USER_SET_LOG_BASE`` message when the slave has
+``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.
+
+The size of the log is supplied as part of ``VhostUserMsg`` which
+should be large enough to cover all known guest addresses. Log starts
+at the supplied offset in the supplied file descriptor. The log
+covers from address 0 to the maximum of guest regions. In pseudo-code,
+to mark page at ``addr`` as dirty::
+
+ page = addr / VHOST_LOG_PAGE
+ log[page / 8] |= 1 << (page % 8)
+
+Where ``addr`` is the guest physical address.
+
+Use atomic operations, as the log may be concurrently manipulated.
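+
+As a minimal sketch of the pseudo-code above, the following C11 helpers
+set the dirty bit with an atomic OR. The names ``log_mark_dirty`` and
+``log_mark_range`` are hypothetical, and treating the mapped log as an
+array of ``_Atomic uint8_t`` is an assumption of this example, not a
+requirement of the protocol.
+

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Mark the page containing guest physical address addr as dirty.
 * dirty_log points at the start of the log (after applying the supplied
 * offset); log_size is the log size in bytes. Both names are illustrative. */
static void log_mark_dirty(_Atomic uint8_t *dirty_log, uint64_t log_size,
                           uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    assert(page / 8 < log_size);
    /* Atomic OR: the master may read and clear bits concurrently. */
    atomic_fetch_or(&dirty_log[page / 8], (uint8_t)(1u << (page % 8)));
}

/* Mark every page touched by a write of len bytes starting at addr.
 * Assumes addr + len does not overflow. */
static void log_mark_range(_Atomic uint8_t *dirty_log, uint64_t log_size,
                           uint64_t addr, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        log_mark_dirty(dirty_log, log_size, page * VHOST_LOG_PAGE);
    }
}
```

+
+A write that straddles a page boundary marks both pages, which is why
+the range helper iterates from the first to the last touched page.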
+
+Note that when logging modifications to the used ring (when
+``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
+be used to calculate the log offset: the write to first byte of the
+used ring is logged at this offset from log start. Also note that this
+value might be outside the legal guest physical address range
+(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
+the bit offset of the last byte of the ring must fall within the size
+supplied by ``VhostUserLog``.
+
+``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
+ancillary data. It may be used to inform the master that the log has
+been modified.
+
+Once the source has finished migration, rings will be stopped by the
+source. No further updates may be made until the rings are restarted.
+
+In postcopy migration the slave is started before all the memory has
+been received from the source host, and care must be taken to avoid
+accessing pages that have yet to be received. The slave opens a
+'userfault'-fd and registers the memory with it; this fd is then
+passed back over to the master. The master services requests on the
+userfaultfd for pages that are accessed and when the page is available
+it performs WAKE ioctls on the userfaultfd to wake the stalled
+slave. The client indicates support for this via the
+``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
+
+Memory access
+-------------
+
+The master sends a list of vhost memory regions to the slave using the
+``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
+addresses: a guest address and a user address.
+
+Messages contain guest addresses and/or user addresses to reference locations
+within the shared memory. The mapping of these addresses works as follows.
+
+User addresses map to the vhost memory region containing that user address.
+
+When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:
+
+* Guest addresses map to the vhost memory region containing that guest
+ address.
+
+When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:
+
+* Guest addresses are also called I/O virtual addresses (IOVAs). They are
+ translated to user addresses via the IOTLB.
+
+* The vhost memory region guest address is not used.
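+
+The non-IOMMU lookup described above can be sketched as follows. The
+``MemRegion`` structure and the ``translate`` helper are hypothetical
+bookkeeping for this example; in particular, ``mmap_base`` is assumed to
+already point at the first byte of the region inside the slave's own
+mapping (any mmap offset has been applied).
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-region state kept by the slave after mapping the fds
 * received with VHOST_USER_SET_MEM_TABLE. */
typedef struct MemRegion {
    uint64_t guest_addr;  /* guest address base of the region */
    uint64_t user_addr;   /* user address base of the region */
    uint64_t size;        /* region size in bytes */
    uint8_t *mmap_base;   /* where the slave mapped the region */
} MemRegion;

/* Return a local pointer for [addr, addr + len), or NULL if the range is
 * not fully contained in one region. by_user_addr selects which of the
 * two base addresses is used for the lookup. */
static void *translate(const MemRegion *regions, size_t n, uint64_t addr,
                       uint64_t len, int by_user_addr)
{
    assert(regions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const MemRegion *r = &regions[i];
        uint64_t base = by_user_addr ? r->user_addr : r->guest_addr;

        /* Written to avoid overflow in addr + len. */
        if (addr >= base && len <= r->size && addr - base <= r->size - len) {
            return r->mmap_base + (addr - base);
        }
    }
    return NULL;
}
```

+
+When ``VIRTIO_F_IOMMU_PLATFORM`` has been negotiated, a guest address
+would first go through the IOTLB to obtain a user address, and only then
+through a lookup like this one with ``by_user_addr`` set.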
+
+IOMMU support
+-------------
+
+When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
+master sends IOTLB entry updates and invalidations by sending
+``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
+vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
+has to be filled with the update message type (2), the I/O virtual
+address, the size, the user virtual address, and the permissions
+flags. Addresses and size must be within vhost memory regions set via
+the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
+``iotlb`` payload has to be filled with the invalidation message type
+(3), the I/O virtual address and the size. On success, the slave is
+expected to reply with a zero payload, non-zero otherwise.
+
+The slave relies on the slave communication channel (see :ref:`Slave
+communication <slave_communication>` section below) to send IOTLB miss
+and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
+requests to the master with a ``struct vhost_iotlb_msg`` as
+payload. For miss events, the iotlb payload has to be filled with the
+miss message type (1), the I/O virtual address and the permissions
+flags. For access failure events, the iotlb payload has to be filled
+with the access failure message type (4), the I/O virtual address and
+the permissions flags. For synchronization purposes, the slave may
+rely on the reply-ack feature, so the master may send a reply when the
+operation is completed if the reply-ack feature is negotiated and the
+slave requests a reply. For miss events, a completed operation means
+that either the master sent an update message with the IOTLB entry
+covering the requested address and permission, or the master sent
+nothing because the IOTLB miss message is invalid (invalid IOVA or
+permission).
+
+The master isn't expected to take the initiative to send IOTLB update
+messages, as the slave sends IOTLB miss messages for the guest virtual
+memory areas it needs to access.
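+
+The following is a rough sketch of how a slave might keep a small device
+IOTLB driven by these messages. ``IotlbMsg`` is a local mirror of the
+fields listed above, not the kernel's ``struct vhost_iotlb_msg`` itself;
+the fixed-size table, the function names, and the treatment of the
+permission flags as a bitmask (read = 1, write = 2) are assumptions of
+this example and should be checked against the real header.
+

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Message types as numbered in the text above. */
enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2, IOTLB_INVALIDATE = 3,
       IOTLB_ACCESS_FAIL = 4 };

/* Local mirror of the payload fields (illustrative layout). */
typedef struct IotlbMsg {
    uint64_t iova, size, uaddr;
    uint8_t perm, type;
} IotlbMsg;

#define IOTLB_MAX 16
typedef struct Iotlb {
    IotlbMsg e[IOTLB_MAX];
    unsigned n;
} Iotlb;

/* Handle an update or invalidation from the master. Returns the u64
 * reply payload: zero on success, non-zero otherwise. */
static uint64_t iotlb_handle(Iotlb *t, const IotlbMsg *m)
{
    switch (m->type) {
    case IOTLB_UPDATE:
        if (t->n == IOTLB_MAX || m->size == 0) {
            return 1;
        }
        t->e[t->n++] = *m;
        return 0;
    case IOTLB_INVALIDATE:
        /* Drop every cached entry overlapping [iova, iova + size). */
        for (unsigned i = 0; i < t->n; ) {
            const IotlbMsg *e = &t->e[i];
            if (e->iova < m->iova + m->size && m->iova < e->iova + e->size) {
                t->e[i] = t->e[--t->n];
            } else {
                i++;
            }
        }
        return 0;
    default:
        return 1;
    }
}

/* Translate iova for an access needing perm. Returns 1 and sets *uaddr
 * on a hit; on a miss returns 0 and fills *miss with the message the
 * slave would send over the slave communication channel. */
static int iotlb_lookup(const Iotlb *t, uint64_t iova, uint8_t perm,
                        uint64_t *uaddr, IotlbMsg *miss)
{
    for (unsigned i = 0; i < t->n; i++) {
        const IotlbMsg *e = &t->e[i];
        if (iova >= e->iova && iova - e->iova < e->size &&
            (e->perm & perm) == perm) {
            *uaddr = e->uaddr + (iova - e->iova);
            return 1;
        }
    }
    memset(miss, 0, sizeof(*miss));
    miss->iova = iova;
    miss->perm = perm;
    miss->type = IOTLB_MISS;
    return 0;
}
```

+
+This matches the flow described above: the slave only asks for the
+translations it needs, and the master answers each miss with an update.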
+
+.. _slave_communication:
+
+Slave communication
+-------------------
+
+An optional communication channel is provided if the slave declares
+``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
+slave to make requests to the master.
+
+The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.
+
+A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
+using this fd communication channel.
+
+If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
+negotiated, slave can send file descriptors (at most 8 descriptors in
+each message) to master via ancillary data using this fd communication
+channel.
+
+Inflight I/O tracking
+---------------------
+
+To support reconnecting after restart or crash, the slave may need to
+resubmit inflight I/Os. If the virtqueue is processed in order, this is
+easily achieved by getting the inflight descriptors from the
+descriptor table (split virtqueue) or descriptor ring (packed
+virtqueue). However, this does not work when descriptors are processed
+out of order, because some entries which store the information of
+inflight descriptors in the available ring (split virtqueue) or
+descriptor ring (packed virtqueue) might be overwritten by new entries.
+To solve this problem, the slave needs to allocate an extra buffer to
+store the information of inflight descriptors and share it with the
+master for persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
+``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
+between master and slave. The format of this buffer is described
+below:
+
++---------------+---------------+-----+---------------+
+| queue0 region | queue1 region | ... | queueN region |
++---------------+---------------+-----+---------------+
+
+N is the number of available virtqueues. The slave can get it from the
+num queues field of ``VhostUserInflight``.
+
+For split virtqueue, queue region can be implemented as:
+
+.. code:: c
+
+ typedef struct DescStateSplit {
+ /* Indicate whether this descriptor is inflight or not.
+ * Only available for head-descriptor. */
+ uint8_t inflight;
+
+ /* Padding */
+ uint8_t padding[5];
+
+ /* Maintain a list for the last batch of used descriptors.
+ * Only available when batching is used for submitting */
+ uint16_t next;
+
+ /* Used to preserve the order of fetching available descriptors.
+ * Only available for head-descriptor. */
+ uint64_t counter;
+ } DescStateSplit;
+
+ typedef struct QueueRegionSplit {
+ /* The feature flags of this region. Now it's initialized to 0. */
+ uint64_t features;
+
+ /* The version of this region. It's 1 currently.
+ * Zero value indicates an uninitialized buffer */
+ uint16_t version;
+
+ /* The size of DescStateSplit array. It's equal to the virtqueue
+ * size. Slave could get it from queue size field of VhostUserInflight. */
+ uint16_t desc_num;
+
+ /* The head of the list that tracks the last batch of used descriptors. */
+ uint16_t last_batch_head;
+
+ /* Store the idx value of used ring */
+ uint16_t used_idx;
+
+ /* Used to track the state of each descriptor in descriptor table */
+ DescStateSplit desc[];
+ } QueueRegionSplit;
+
+To track inflight I/O, the queue region should be processed as follows:
+
+When receiving available buffers from the driver:
+
+#. Get the next available head-descriptor index from available ring, ``i``
+
+#. Set ``desc[i].counter`` to the value of global counter
+
+#. Increase global counter by 1
+
+#. Set ``desc[i].inflight`` to 1
+
+When supplying used buffers to the driver:
+
+1. Get corresponding used head-descriptor index, ``i``
+
+2. Set ``desc[i].next`` to ``last_batch_head``
+
+3. Set ``last_batch_head`` to ``i``
+
+#. Steps 1,2,3 may be performed repeatedly if batching is possible
+
+#. Increase the ``idx`` value of used ring by the size of the batch
+
+#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0
+
+#. Set ``used_idx`` to the ``idx`` value of used ring
+
+When reconnecting:
+
+#. If the value of ``used_idx`` does not match the ``idx`` value of
+ used ring (which means the ``inflight`` field of ``DescStateSplit``
+ entries in the last batch may be incorrect),
+
+ a. Subtract the value of ``used_idx`` from the ``idx`` value of
+ used ring to get last batch size of ``DescStateSplit`` entries
+
+ #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
+ list which starts from ``last_batch_head``
+
+ #. Set ``used_idx`` to the ``idx`` value of used ring
+
+#. Resubmit inflight ``DescStateSplit`` entries in order of their
+ counter value
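+
+The split-virtqueue steps above can be sketched in C as follows. This is
+an illustration under stated assumptions, not a reference
+implementation: the function names and the file-scope ``global_counter``
+are hypothetical, memory barriers between the steps are omitted, and
+the caller is assumed to pass in the current ``idx`` of the used ring.
+

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct DescStateSplit {
    uint8_t inflight;
    uint8_t padding[5];
    uint16_t next;
    uint64_t counter;
} DescStateSplit;

typedef struct QueueRegionSplit {
    uint64_t features;
    uint16_t version;
    uint16_t desc_num;
    uint16_t last_batch_head;
    uint16_t used_idx;
    DescStateSplit desc[];
} QueueRegionSplit;

static uint64_t global_counter; /* hypothetical; one per queue in practice */

/* Receiving: head descriptor i was fetched from the available ring. */
static void inflight_get(QueueRegionSplit *q, uint16_t i)
{
    q->desc[i].counter = global_counter++;
    q->desc[i].inflight = 1;
}

/* Supplying, steps 1-3: link head i into the batch being published. */
static void inflight_pre_put(QueueRegionSplit *q, uint16_t i)
{
    q->desc[i].next = q->last_batch_head;
    q->last_batch_head = i;
}

/* Supplying, last steps: called after the used ring idx was advanced. */
static void inflight_post_put(QueueRegionSplit *q, uint16_t batch,
                              uint16_t ring_used_idx)
{
    uint16_t i = q->last_batch_head;

    while (batch--) {
        q->desc[i].inflight = 0;
        i = q->desc[i].next;
    }
    q->used_idx = ring_used_idx;
}

/* Reconnecting: repair the last batch if needed, then write the inflight
 * head indexes to out[] in counter order. Returns how many were found. */
static uint16_t inflight_recover(QueueRegionSplit *q, uint16_t ring_used_idx,
                                 uint16_t *out)
{
    uint16_t n = 0;

    assert(q->version == 1);
    if (q->used_idx != ring_used_idx) {
        inflight_post_put(q, (uint16_t)(ring_used_idx - q->used_idx),
                          ring_used_idx);
    }
    for (uint16_t i = 0; i < q->desc_num; i++) {
        if (q->desc[i].inflight) {
            out[n++] = i;
        }
    }
    /* Insertion sort by counter to restore the original fetch order. */
    for (uint16_t a = 1; a < n; a++) {
        uint16_t v = out[a], b = a;
        while (b > 0 && q->desc[out[b - 1]].counter > q->desc[v].counter) {
            out[b] = out[b - 1];
            b--;
        }
        out[b] = v;
    }
    return n;
}
```

+
+After recovery a real slave would also need to continue counting from
+above the largest ``counter`` it found, so that newly fetched
+descriptors still sort after the resubmitted ones.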
+
+For packed virtqueue, queue region can be implemented as:
+
+.. code:: c
+
+ typedef struct DescStatePacked {
+ /* Indicate whether this descriptor is inflight or not.
+ * Only available for head-descriptor. */
+ uint8_t inflight;
+
+ /* Padding */
+ uint8_t padding;
+
+ /* Link to the next free entry */
+ uint16_t next;
+
+ /* Link to the last entry of descriptor list.
+ * Only available for head-descriptor. */
+ uint16_t last;
+
+ /* The length of descriptor list.
+ * Only available for head-descriptor. */
+ uint16_t num;
+
+ /* Used to preserve the order of fetching available descriptors.
+ * Only available for head-descriptor. */
+ uint64_t counter;
+
+ /* The buffer id */
+ uint16_t id;
+
+ /* The descriptor flags */
+ uint16_t flags;
+
+ /* The buffer length */
+ uint32_t len;
+
+ /* The buffer address */
+ uint64_t addr;
+ } DescStatePacked;
+
+ typedef struct QueueRegionPacked {
+ /* The feature flags of this region. Now it's initialized to 0. */
+ uint64_t features;
+
+ /* The version of this region. It's 1 currently.
+ * Zero value indicates an uninitialized buffer */
+ uint16_t version;
+
+ /* The size of DescStatePacked array. It's equal to the virtqueue
+ * size. Slave could get it from queue size field of VhostUserInflight. */
+ uint16_t desc_num;
+
+ /* The head of free DescStatePacked entry list */
+ uint16_t free_head;
+
+ /* The old head of free DescStatePacked entry list */
+ uint16_t old_free_head;
+
+ /* The used index of descriptor ring */
+ uint16_t used_idx;
+
+ /* The old used index of descriptor ring */
+ uint16_t old_used_idx;
+
+ /* Device ring wrap counter */
+ uint8_t used_wrap_counter;
+
+ /* The old device ring wrap counter */
+ uint8_t old_used_wrap_counter;
+
+ /* Padding */
+ uint8_t padding[7];
+
+ /* Used to track the state of each descriptor fetched from descriptor ring */
+ DescStatePacked desc[];
+ } QueueRegionPacked;
+
+To track inflight I/O, the queue region should be processed as follows:
+
+When receiving available buffers from the driver:
+
+#. Get the next available descriptor entry from descriptor ring, ``d``
+
+#. If ``d`` is head descriptor,
+
+ a. Set ``desc[old_free_head].num`` to 0
+
+ #. Set ``desc[old_free_head].counter`` to the value of global counter
+
+ #. Increase global counter by 1
+
+ #. Set ``desc[old_free_head].inflight`` to 1
+
+#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
+ ``free_head``
+
+#. Increase ``desc[old_free_head].num`` by 1
+
+#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
+ ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
+ ``d.len``, ``d.flags``, ``d.id``
+
+#. Set ``free_head`` to ``desc[free_head].next``
+
+#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``
+
+When supplying used buffers to the driver:
+
+1. Get corresponding used head-descriptor entry from descriptor ring,
+ ``d``
+
+2. Get corresponding ``DescStatePacked`` entry, ``e``
+
+3. Set ``desc[e.last].next`` to ``free_head``
+
+4. Set ``free_head`` to the index of ``e``
+
+#. Steps 1,2,3,4 may be performed repeatedly if batching is possible
+
+#. Increase ``used_idx`` by the size of the batch and update
+ ``used_wrap_counter`` if needed
+
+#. Update ``d.flags``
+
+#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
+ in the batch to 0
+
+#. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
+ to ``free_head``, ``used_idx``, ``used_wrap_counter``
+
+When reconnecting:
+
+#. If ``used_idx`` does not match ``old_used_idx`` (which means the
+ ``inflight`` field of ``DescStatePacked`` entries in the last batch may
+ be incorrect),
+
+ a. Get the next descriptor ring entry through ``old_used_idx``, ``d``
+
+ #. Use ``old_used_wrap_counter`` to calculate the available flags
+
+ #. If ``d.flags`` is not equal to the calculated flags value (which
+ means the slave submitted the buffer to the guest driver before the
+ crash, so it has to commit the in-progress update), set ``old_free_head``,
+ ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
+ ``used_idx``, ``used_wrap_counter``
+
+#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
+ ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
+ (roll back any in-progress update)
+
+#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
+ free list to 0
+
+#. Resubmit inflight ``DescStatePacked`` entries in order of their
+ counter value
+
+In-band notifications
+---------------------
+
+In some limited situations (e.g. for simulation) it is desirable to
+have the kick, call and error (if used) signals done via in-band
+messages instead of asynchronous eventfd notifications. This can be
+done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
+protocol feature.
+
+Note that because too many messages on the sockets can
+cause the sending application(s) to block, it is not advised to use
+this feature unless absolutely necessary. It is also considered an
+error to negotiate this feature without also negotiating
+``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
+The former is necessary for getting a message channel from the slave
+to the master, while the latter needs to be used with the in-band
+notification messages to block until they are processed, both to avoid
+blocking later and for proper processing (at least in the simulation
+use case). As it has no other way of signalling this error, the slave
+should close the connection as a response to a
+``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
+notifications feature flag without the other two.
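+
+A slave could check this dependency when it receives
+``VHOST_USER_SET_PROTOCOL_FEATURES``, along the lines of the sketch
+below. The helper name ``protocol_features_valid`` is hypothetical; the
+bit numbers are those listed under "Protocol features".
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_REPLY_ACK            3
#define VHOST_USER_PROTOCOL_F_SLAVE_REQ            5
#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

/* Return false when in-band notifications are requested without both
 * SLAVE_REQ and REPLY_ACK; the slave should then close the connection. */
static bool protocol_features_valid(uint64_t features)
{
    const uint64_t inband =
        1ULL << VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS;
    const uint64_t required =
        (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ) |
        (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK);

    if (!(features & inband)) {
        return true; /* nothing to enforce */
    }
    return (features & required) == required;
}
```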
+
+Protocol features
+-----------------
+
+.. code:: c
+
+ #define VHOST_USER_PROTOCOL_F_MQ 0
+ #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
+ #define VHOST_USER_PROTOCOL_F_RARP 2
+ #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
+ #define VHOST_USER_PROTOCOL_F_MTU 4
+ #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5
+ #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6
+ #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7
+ #define VHOST_USER_PROTOCOL_F_PAGEFAULT 8
+ #define VHOST_USER_PROTOCOL_F_CONFIG 9
+ #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10
+ #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11
+ #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
+ #define VHOST_USER_PROTOCOL_F_RESET_DEVICE 13
+ #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
+ #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15
+ #define VHOST_USER_PROTOCOL_F_STATUS 16
+
+Master message types
+--------------------
+
+``VHOST_USER_GET_FEATURES``
+ :id: 1
+ :equivalent ioctl: ``VHOST_GET_FEATURES``
+ :master payload: N/A
+ :slave payload: ``u64``
+
+ Get from the underlying vhost implementation the features bitmask.
+ Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
+ for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
+ ``VHOST_USER_SET_PROTOCOL_FEATURES``.
+
+``VHOST_USER_SET_FEATURES``
+ :id: 2
+ :equivalent ioctl: ``VHOST_SET_FEATURES``
+ :master payload: ``u64``
+
+ Enable features in the underlying vhost implementation using a
+ bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
+ slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
+ ``VHOST_USER_SET_PROTOCOL_FEATURES``.
+
+``VHOST_USER_GET_PROTOCOL_FEATURES``
+ :id: 15
+ :equivalent ioctl: ``VHOST_GET_FEATURES``
+ :master payload: N/A
+ :slave payload: ``u64``
+
+ Get the protocol feature bitmask from the underlying vhost
+ implementation. Only legal if feature bit
+ ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
+ ``VHOST_USER_GET_FEATURES``.
+
+.. Note::
+ Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
+ support this message even before ``VHOST_USER_SET_FEATURES`` was
+ called.
+
+``VHOST_USER_SET_PROTOCOL_FEATURES``
+ :id: 16
+ :equivalent ioctl: ``VHOST_SET_FEATURES``
+ :master payload: ``u64``
+
+ Enable protocol features in the underlying vhost implementation.
+
+ Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
+ ``VHOST_USER_GET_FEATURES``.
+
+.. Note::
+ Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
+ this message even before ``VHOST_USER_SET_FEATURES`` was called.
+
+``VHOST_USER_SET_OWNER``
+ :id: 3
+ :equivalent ioctl: ``VHOST_SET_OWNER``
+ :master payload: N/A
+
+ Issued when a new connection is established. It sets the current
+ *master* as an owner of the session. This can be used on the *slave*
+ as a "session start" flag.
+
+``VHOST_USER_RESET_OWNER``
+ :id: 4
+ :master payload: N/A
+
+.. admonition:: Deprecated
+
+ This is no longer used. Used to be sent to request disabling all
+ rings, but some clients interpreted it to also discard connection
+ state (this interpretation would lead to bugs). It is recommended
+ that clients either ignore this message, or use it to disable all
+ rings.
+
+``VHOST_USER_SET_MEM_TABLE``
+ :id: 5
+ :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
+ :master payload: memory regions description
+ :slave payload: (postcopy only) memory regions description
+
+ Sets the memory map regions on the slave so it can translate the
+ vring addresses. In the ancillary data there is an array of file
+ descriptors for each memory mapped region. The size and ordering of
+ the fds matches the number and ordering of memory regions.
+
+ When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
+ ``SET_MEM_TABLE`` replies with the bases of the memory mapped
+ regions to the master. The slave must have mmap'd the regions but
+ not yet accessed them and should not yet generate a userfault
+ event.
+
+.. Note::
+ ``NEED_REPLY_MASK`` is not set in this case. QEMU will then
+ reply back to the list of mappings with an empty
+ ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
+ reception of this message may the guest start accessing the memory
+ and generating faults.
+
+``VHOST_USER_SET_LOG_BASE``
+ :id: 6
+ :equivalent ioctl: ``VHOST_SET_LOG_BASE``
+ :master payload: u64
+ :slave payload: N/A
+
+ Sets logging shared memory space.
+
+ When slave has ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature,
+ the log memory fd is provided in the ancillary data of
+ ``VHOST_USER_SET_LOG_BASE`` message; the size and offset of the shared
+ memory area are provided in the message.
+
+``VHOST_USER_SET_LOG_FD``
+ :id: 7
+ :equivalent ioctl: ``VHOST_SET_LOG_FD``
+ :master payload: N/A
+
+ Sets the logging file descriptor, which is passed as ancillary data.
+
+``VHOST_USER_SET_VRING_NUM``
+ :id: 8
+ :equivalent ioctl: ``VHOST_SET_VRING_NUM``
+ :master payload: vring state description
+
+ Set the size of the queue.
+
+``VHOST_USER_SET_VRING_ADDR``
+ :id: 9
+ :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
+ :master payload: vring address description
+ :slave payload: N/A
+
+ Sets the addresses of the different aspects of the vring.
+
+``VHOST_USER_SET_VRING_BASE``
+ :id: 10
+ :equivalent ioctl: ``VHOST_SET_VRING_BASE``
+ :master payload: vring state description
+
+ Sets the base offset in the available vring.
+
+``VHOST_USER_GET_VRING_BASE``
+ :id: 11
+ :equivalent ioctl: ``VHOST_GET_VRING_BASE``
+ :master payload: vring state description
+ :slave payload: vring state description
+
+ Get the available vring base offset.
+
+``VHOST_USER_SET_VRING_KICK``
+ :id: 12
+ :equivalent ioctl: ``VHOST_SET_VRING_KICK``
+ :master payload: ``u64``
+
+ Set the event file descriptor for adding buffers to the vring. It is
+ passed in the ancillary data.
+
+ Bits (0-7) of the payload contain the vring index. Bit 8 is the
+ invalid FD flag. This flag is set when there is no file descriptor
+ in the ancillary data. This signals that polling should be used
+ instead of waiting for the kick. Note that if the protocol feature
+ ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated
+ this message isn't necessary as the ring is also started on the
+ ``VHOST_USER_VRING_KICK`` message, it may however still be used to
+ set an event file descriptor (which will be preferred over the
+ message) or to enable polling.
+
+``VHOST_USER_SET_VRING_CALL``
+ :id: 13
+ :equivalent ioctl: ``VHOST_SET_VRING_CALL``
+ :master payload: ``u64``
+
+ Set the event file descriptor to signal when buffers are used. It is
+ passed in the ancillary data.
+
+ Bits (0-7) of the payload contain the vring index. Bit 8 is the
+ invalid FD flag. This flag is set when there is no file descriptor
+ in the ancillary data. This signals that polling will be used
+ instead of waiting for the call. Note that if the protocol features
+ ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
+ ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message
+ isn't necessary as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
+ used, it may however still be used to set an event file descriptor
+ or to enable polling.
+
+``VHOST_USER_SET_VRING_ERR``
+ :id: 14
+ :equivalent ioctl: ``VHOST_SET_VRING_ERR``
+ :master payload: ``u64``
+
+ Set the event file descriptor to signal when an error occurs. It is
+ passed in the ancillary data.
+
+ Bits (0-7) of the payload contain the vring index. Bit 8 is the
+ invalid FD flag. This flag is set when there is no file descriptor
+ in the ancillary data. Note that if the protocol features
+ ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
+ ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message
+ isn't necessary as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
+ used, it may however still be used to set an event file descriptor
+ (which will be preferred over the message).
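+
+The ``u64`` payload shared by ``VHOST_USER_SET_VRING_KICK``,
+``VHOST_USER_SET_VRING_CALL`` and ``VHOST_USER_SET_VRING_ERR`` can be
+decoded as sketched below. The macro, type and function names are local
+to this example and are not taken from any header.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_IDX_MASK  0xffULL      /* bits 0-7: vring index */
#define VRING_NOFD_MASK (1ULL << 8)  /* bit 8: invalid FD flag */

typedef struct VringFdPayload {
    uint8_t index; /* which vring the message refers to */
    bool has_fd;   /* false when no fd is in the ancillary data */
} VringFdPayload;

/* Split the payload into the vring index and the fd-present flag. */
static VringFdPayload vring_fd_payload_decode(uint64_t payload)
{
    VringFdPayload p = {
        .index = (uint8_t)(payload & VRING_IDX_MASK),
        .has_fd = !(payload & VRING_NOFD_MASK),
    };
    return p;
}

/* Build the payload the master would send for a given vring. */
static uint64_t vring_fd_payload_encode(uint8_t index, bool has_fd)
{
    return (uint64_t)index | (has_fd ? 0 : VRING_NOFD_MASK);
}
```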
+
+``VHOST_USER_GET_QUEUE_NUM``
+ :id: 17
+ :equivalent ioctl: N/A
+ :master payload: N/A
+ :slave payload: u64
+
+ Query how many queues the backend supports.
+
+ This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
+ is set in queried protocol features by
+ ``VHOST_USER_GET_PROTOCOL_FEATURES``.
+
+``VHOST_USER_SET_VRING_ENABLE``
+ :id: 18
+ :equivalent ioctl: N/A
+ :master payload: vring state description
+
+ Signal slave to enable or disable corresponding vring.
+
+ This request should be sent only when
+ ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
+
+``VHOST_USER_SEND_RARP``
+ :id: 19
+ :equivalent ioctl: N/A
+ :master payload: ``u64``
+
+ Ask vhost user backend to broadcast a fake RARP to notify that the
+ migration has terminated, for guests that do not support GUEST_ANNOUNCE.
+
+ Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
+ present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
+ ``VHOST_USER_PROTOCOL_F_RARP`` is present in
+ ``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the
+ payload contain the mac address of the guest to allow the vhost user
+ backend to construct and broadcast the fake RARP.
+
+``VHOST_USER_NET_SET_MTU``
+ :id: 20
+ :equivalent ioctl: N/A
+ :master payload: ``u64``
+
+ Set host MTU value exposed to the guest.
+
+ This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
+ has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
+ is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
+ ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
+ ``VHOST_USER_GET_PROTOCOL_FEATURES``.
+
+ If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
+ respond with zero in case the specified MTU is valid, or non-zero
+ otherwise.
+
+``VHOST_USER_SET_SLAVE_REQ_FD``
+ :id: 21
+ :equivalent ioctl: N/A
+ :master payload: N/A
+
+ Set the socket file descriptor for slave initiated requests. It is passed
+ in the ancillary data.
+
+ This request should be sent only when
+ ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
+ feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
+ ``VHOST_USER_GET_PROTOCOL_FEATURES``. If
+ ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
+ respond with zero for success, non-zero otherwise.
+
+``VHOST_USER_IOTLB_MSG``
+ :id: 22
+ :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
+ :master payload: ``struct vhost_iotlb_msg``
+ :slave payload: ``u64``
+
+ Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
+
+ Master sends such requests to update and invalidate entries in the
+ device IOTLB. The slave has to acknowledge the request by sending
+ zero as ``u64`` payload for success, non-zero otherwise.
+
+ This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
+ feature has been successfully negotiated.
+
+``VHOST_USER_SET_VRING_ENDIAN``
+ :id: 23
+ :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
+ :master payload: vring state description
+
+ Set the endianness of a VQ for legacy devices. Little-endian is
+ indicated with state.num set to 0 and big-endian is indicated with
+ state.num set to 1. Other values are invalid.
+
+ This request should be sent only when
+ ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
+ Backends that negotiated this feature should handle both
+ endiannesses and expect this message once (per VQ) during device
+ configuration (i.e. before the master starts the VQ).
+
+``VHOST_USER_GET_CONFIG``
+ :id: 24
+ :equivalent ioctl: N/A
+ :master payload: virtio device config space
+ :slave payload: virtio device config space
+
+ When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
+ submitted by the vhost-user master to fetch the contents of the
+ virtio device configuration space. The vhost-user slave's payload size
+ MUST match the master's request. The vhost-user slave uses a zero-length
+ payload to indicate an error to the vhost-user master. The vhost-user
+ master may cache the contents to avoid repeated
+ ``VHOST_USER_GET_CONFIG`` calls.
+
+``VHOST_USER_SET_CONFIG``
+ :id: 25
+ :equivalent ioctl: N/A
+ :master payload: virtio device config space
+ :slave payload: N/A
+
+ When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
+ submitted by the vhost-user master when the Guest changes the virtio
+ device configuration space and also can be used for live migration
+ on the destination host. The vhost-user slave must check the flags
+ field, and slaves MUST NOT accept SET_CONFIG for read-only
+ configuration space fields unless the live migration bit is set.
+
+``VHOST_USER_CREATE_CRYPTO_SESSION``
+ :id: 26
+ :equivalent ioctl: N/A
+ :master payload: crypto session description
+ :slave payload: crypto session description
+
+ Create a session for crypto operation. The server side must return
+ the session id, 0 or positive for success, negative for failure.
+ This request should be sent only when
+ ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
+ successfully negotiated. It's a required feature for crypto
+ devices.
+
+``VHOST_USER_CLOSE_CRYPTO_SESSION``
+ :id: 27
+ :equivalent ioctl: N/A
+ :master payload: ``u64``
+
+ Close a session for crypto operation which was previously
+ created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
+
+ This request should be sent only when
+ ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
+ successfully negotiated. It's a required feature for crypto
+ devices.
+
+``VHOST_USER_POSTCOPY_ADVISE``
+ :id: 28
+ :master payload: N/A
+ :slave payload: userfault fd
+
+ When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
+ advises slave that a migration with postcopy enabled is underway;
+ the slave must open a userfaultfd for later use. Note that at this
+ stage the migration is still in precopy mode.
+
+``VHOST_USER_POSTCOPY_LISTEN``
+ :id: 29
+ :master payload: N/A
+
+ Master advises slave that a transition to postcopy mode has
+ happened. The slave must ensure that shared memory is registered
+ with userfaultfd to cause faulting of non-present pages.
+
+ This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
+ and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
+
+``VHOST_USER_POSTCOPY_END``
+ :id: 30
+ :slave payload: ``u64``
+
+ Master advises that postcopy migration has now completed. The slave
+ must disable the userfaultfd. The response is an acknowledgement
+ only.
+
+ When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
+ is sent at the end of the migration, after
+ ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
+
+ The value returned is an error indication; 0 is success.
+
+``VHOST_USER_GET_INFLIGHT_FD``
+ :id: 31
+ :equivalent ioctl: N/A
+ :master payload: inflight description
+
+ When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
+ been successfully negotiated, this message is submitted by master to
+ get a shared buffer from slave. The shared buffer will be used to
+ track inflight I/O by slave. QEMU should retrieve a new one when the
+ VM is reset.
+
+``VHOST_USER_SET_INFLIGHT_FD``
+ :id: 32
+ :equivalent ioctl: N/A
+ :master payload: inflight description
+
+ When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
+ been successfully negotiated, this message is submitted by master to
+ send the shared inflight buffer back to slave so that slave can
+ get inflight I/O after a crash or restart.
+
+``VHOST_USER_GPU_SET_SOCKET``
+ :id: 33
+ :equivalent ioctl: N/A
+ :master payload: N/A
+
+ Sets the GPU protocol socket file descriptor, which is passed as
+ ancillary data. The GPU protocol is used to inform the master of
+ rendering state and updates. See vhost-user-gpu.rst for details.
+
+``VHOST_USER_RESET_DEVICE``
+ :id: 34
+ :equivalent ioctl: N/A
+ :master payload: N/A
+ :slave payload: N/A
+
+ Ask the vhost user backend to disable all rings and reset all
+ internal device state to the initial state, ready to be
+ reinitialized. The backend retains ownership of the device
+ throughout the reset operation.
+
+ Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
+ feature is set by the backend.
+
+``VHOST_USER_VRING_KICK``
+ :id: 35
+ :equivalent ioctl: N/A
+ :slave payload: vring state description
+ :master payload: N/A
+
+ When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
+ feature has been successfully negotiated, this message may be
+ submitted by the master to indicate that a buffer was added to
+ the vring instead of signalling it using the vring's kick file
+ descriptor or having the slave rely on polling.
+
+ The state.num field is currently reserved and must be set to 0.
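+
+ As a rough illustration, an in-band kick could be serialised as
+ below. This is a sketch only: the header layout (request, flags and
+ size as three 32-bit fields), the version value in the flags and the
+ vring state layout (index and num as two 32-bit fields) are assumed
+ from the general message format, and ``build_vring_kick`` is a
+ hypothetical helper name.
+

```python
import struct

# Assumed layouts, taken from the general vhost-user message format rather
# than from this message's description:
#   header      = request (u32), flags (u32), size (u32)
#   vring state = index (u32), num (u32)
HEADER = struct.Struct("=III")
VRING_STATE = struct.Struct("=II")

VHOST_USER_VRING_KICK = 35   # message id given above
VHOST_USER_VERSION = 0x1     # assumed: protocol version kept in the flags


def build_vring_kick(queue_index: int) -> bytes:
    """Return the bytes of an in-band kick for one queue (hypothetical helper)."""
    # state.num is reserved for this message and must be 0.
    payload = VRING_STATE.pack(queue_index, 0)
    header = HEADER.pack(VHOST_USER_VRING_KICK, VHOST_USER_VERSION, len(payload))
    return header + payload
```

+
+ The result is a 12-byte header followed by an 8-byte payload, which
+ would be written to the vhost-user socket in place of signalling the
+ kick file descriptor.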
+
+``VHOST_USER_GET_MAX_MEM_SLOTS``
+ :id: 36
+ :equivalent ioctl: N/A
+ :slave payload: ``u64``
+
+ When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+ feature has been successfully negotiated, this message is submitted
+ by master to the slave. The slave should return the message with a
+ u64 payload containing the maximum number of memory slots for
+ QEMU to expose to the guest. The value returned by the backend
+ will be capped at the maximum number of ram slots which can be
+ supported by the target platform.
+
+``VHOST_USER_ADD_MEM_REG``
+ :id: 37
+ :equivalent ioctl: N/A
+ :slave payload: single memory region description
+
+ When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+ feature has been successfully negotiated, this message is submitted
+ by the master to the slave. The message payload contains a memory
+ region descriptor struct, describing a region of guest memory which
+ the slave device must map in. Together with the
+ ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
+ update the memory tables of the slave device.
+
+``VHOST_USER_REM_MEM_REG``
+ :id: 38
+ :equivalent ioctl: N/A
+ :slave payload: single memory region description
+
+ When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+ feature has been successfully negotiated, this message is submitted
+ by the master to the slave. The message payload contains a memory
+ region descriptor struct, describing a region of guest memory which
+ the slave device must unmap. Together with the
+ ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
+ update the memory tables of the slave device.
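+
+ The bookkeeping implied by ``VHOST_USER_GET_MAX_MEM_SLOTS``,
+ ``VHOST_USER_ADD_MEM_REG`` and ``VHOST_USER_REM_MEM_REG`` might be
+ modelled as follows. This is a sketch under stated assumptions, not
+ QEMU's implementation: ``MemRegion`` and ``MemTable`` are
+ hypothetical names, the region fields are assumed from the memory
+ region description, and keying regions by guest address is an
+ illustrative choice.
+

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRegion:
    """Assumed fields of a single memory region description."""
    guest_addr: int
    size: int
    user_addr: int
    mmap_offset: int


class MemTable:
    """Hypothetical model of the memory table that the messages update."""

    def __init__(self, backend_max_slots: int, platform_max_slots: int):
        # The backend's answer is capped at what the target platform supports.
        self.max_slots = min(backend_max_slots, platform_max_slots)
        self.regions = {}

    def add(self, region: MemRegion) -> None:
        """Model of VHOST_USER_ADD_MEM_REG: the region must be mapped in."""
        if len(self.regions) >= self.max_slots:
            raise RuntimeError("no free memory slot")
        if region.guest_addr in self.regions:
            raise ValueError("region already mapped")
        self.regions[region.guest_addr] = region

    def remove(self, region: MemRegion) -> None:
        """Model of VHOST_USER_REM_MEM_REG: the region must be unmapped."""
        if self.regions.get(region.guest_addr) != region:
            raise KeyError("region is not mapped")
        del self.regions[region.guest_addr]
```

+
+ A real backend would additionally ``mmap()`` the file descriptor
+ received with each added region and unmap it on removal.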
+
+``VHOST_USER_SET_STATUS``
+ :id: 39
+ :equivalent ioctl: VHOST_VDPA_SET_STATUS
+ :slave payload: N/A
+ :master payload: ``u64``
+
+ When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
+ successfully negotiated, this message is submitted by the master to
+ notify the backend with updated device status as defined in the Virtio
+ specification.
+
+``VHOST_USER_GET_STATUS``
+ :id: 40
+ :equivalent ioctl: VHOST_VDPA_GET_STATUS
+ :slave payload: ``u64``
+ :master payload: N/A
+
+ When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
+ successfully negotiated, this message is submitted by the master to
+ query the backend for its device status as defined in the Virtio
+ specification.
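+
+ The ``u64`` carried by ``VHOST_USER_SET_STATUS`` and
+ ``VHOST_USER_GET_STATUS`` can be decoded with the device status
+ bits from the Virtio specification. The sketch below is
+ illustrative; ``DeviceStatus`` and ``describe_status`` are
+ hypothetical names, and only the bit values come from the Virtio
+ specification.
+

```python
import enum


class DeviceStatus(enum.IntFlag):
    """Device status bits as defined in the Virtio specification."""
    ACKNOWLEDGE = 1
    DRIVER = 2
    DRIVER_OK = 4
    FEATURES_OK = 8
    DEVICE_NEEDS_RESET = 64
    FAILED = 128


def describe_status(value: int) -> list:
    """Return the names of the status bits set in value (hypothetical helper)."""
    return [flag.name for flag in DeviceStatus if value & flag]
```

+
+ For example, a status of ``0x0f`` decodes to a device that has been
+ acknowledged, has a driver, has accepted the feature set and is
+ running.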
+
+
+Slave message types
+-------------------
+
+``VHOST_USER_SLAVE_IOTLB_MSG``
+ :id: 1
+ :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
+ :slave payload: ``struct vhost_iotlb_msg``
+ :master payload: N/A
+
+ Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
+ Slave sends such requests to notify of an IOTLB miss, or an IOTLB
+ access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
+ negotiated, and slave sets the ``VHOST_USER_NEED_REPLY`` flag, master
+ must respond with zero when operation is successfully completed, or
+ non-zero otherwise. This request should be sent only when
+ ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
+ negotiated.
+
+``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
+ :id: 2
+ :equivalent ioctl: N/A
+ :slave payload: N/A
+ :master payload: N/A
+
+ When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
+ slave sends such messages to notify that the virtio device's
+ configuration space has changed. For host devices that support
+ this feature, the host driver can then send a
+ ``VHOST_USER_GET_CONFIG`` message to the slave to get the latest
+ content. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and
+ the slave sets the ``VHOST_USER_NEED_REPLY`` flag, master must
+ respond with zero when operation is successfully completed, or
+ non-zero otherwise.
+
+``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
+ :id: 3
+ :equivalent ioctl: N/A
+ :slave payload: vring area description
+ :master payload: N/A
+
+ Sets host notifier for a specified queue. The queue index is
+ contained in the ``u64`` field of the vring area description. The
+ host notifier is described by the file descriptor (typically it's a
+ VFIO device fd) which is passed as ancillary data and the size
+ (which is mmap size and should be the same as host page size) and
+ offset (which is mmap offset) carried in the vring area
+ description. QEMU can mmap the file descriptor based on the size and
+ offset to get a memory range. Registering a host notifier means
+ mapping this memory range to the VM as the specified queue's notify
+ MMIO region. Slave sends this request to tell QEMU to de-register
+ the existing notifier if any and register the new notifier if the
+ request is sent with a file descriptor.
+
+ This request should be sent only when
+ ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
+ successfully negotiated.
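+
+ The mapping step on the master side might look like the sketch
+ below. It uses Python's ``mmap`` module purely as a stand-in for the
+ ``mmap()`` call QEMU would make; ``map_host_notifier`` is a
+ hypothetical helper, and exposing the resulting range to the VM as
+ the queue's notify MMIO region is not shown.
+

```python
import mmap
import os
import tempfile


def map_host_notifier(fd: int, size: int, offset: int) -> mmap.mmap:
    """Map the notifier area described by (fd, size, offset).

    Hypothetical helper: fd stands for the descriptor received as
    ancillary data, size and offset for the vring area description fields.
    """
    # The size is expected to match the host page size.
    if size != mmap.PAGESIZE:
        raise ValueError("notifier size should equal the host page size")
    # mmap offsets must be aligned to the allocation granularity.
    if offset % mmap.ALLOCATIONGRANULARITY:
        raise ValueError("notifier offset is not suitably aligned")
    return mmap.mmap(fd, size, offset=offset)
```

+
+ Because the request replaces any existing notifier, a caller would
+ close the previous mapping for the queue, if any, before storing
+ the new one.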
+
+``VHOST_USER_SLAVE_VRING_CALL``
+ :id: 4
+ :equivalent ioctl: N/A
+ :slave payload: vring state description
+ :master payload: N/A
+
+ When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
+ feature has been successfully negotiated, this message may be
+ submitted by the slave to indicate that a buffer was used from
+ the vring instead of signalling this using the vring's call file
+ descriptor or having the master rely on polling.
+
+ The state.num field is currently reserved and must be set to 0.
+
+``VHOST_USER_SLAVE_VRING_ERR``
+ :id: 5
+ :equivalent ioctl: N/A
+ :slave payload: vring state description
+ :master payload: N/A
+
+ When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
+ feature has been successfully negotiated, this message may be
+ submitted by the slave to indicate that an error occurred on the
+ specific vring, instead of signalling the error file descriptor
+ set by the master via ``VHOST_USER_SET_VRING_ERR``.
+
+ The state.num field is currently reserved and must be set to 0.
+
+.. _reply_ack:
+
+VHOST_USER_PROTOCOL_F_REPLY_ACK
+-------------------------------
+
+The original vhost-user specification only demands replies for certain
+commands. This differs from the vhost protocol implementation where
+commands are sent over an ``ioctl()`` call and block until the client
+has completed.
+
+With this protocol extension negotiated, the sender (QEMU) can set the
+``need_reply`` [Bit 3] flag on any command. This indicates that the
+client MUST respond with a Payload ``VhostUserMsg`` indicating success
+or failure. The payload should be set to zero on success or non-zero
+on failure, unless the message already has an explicit reply body.
+
+The response payload gives QEMU a deterministic indication of the result
+of the command. Today, QEMU is expected to terminate the main vhost-user
+loop upon receiving such errors. In future, QEMU could be taught to be more
+resilient for selective requests.
+
+For the message types that already solicit a reply from the client,
+the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
+being set brings no behavioural change. (See the Communication_
+section for details.)
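+
+As a minimal sketch of the exchange, the sender sets bit 3 in the
+header flags and then checks that the ``u64`` in the reply is zero.
+The header layout (three 32-bit fields), the version value and the
+position of the reply bit are assumptions taken from the general
+message format; ``request_header`` and ``ack_succeeded`` are
+hypothetical helper names.
+

```python
import struct

# Assumed header layout: request (u32), flags (u32), size (u32).
HEADER = struct.Struct("=III")
U64 = struct.Struct("=Q")

VERSION = 0x1              # assumed: protocol version kept in the flags
REPLY_FLAG = 1 << 2        # assumed: marks a message as a reply
NEED_REPLY_FLAG = 1 << 3   # "need_reply [Bit 3]" as described above


def request_header(request: int, payload_size: int, need_reply: bool) -> bytes:
    """Build a request header, optionally asking for an acknowledgement."""
    flags = VERSION | (NEED_REPLY_FLAG if need_reply else 0)
    return HEADER.pack(request, flags, payload_size)


def ack_succeeded(reply: bytes) -> bool:
    """Interpret an acknowledgement: a zero payload means success."""
    _request, flags, size = HEADER.unpack_from(reply)
    if not flags & REPLY_FLAG or size != U64.size:
        raise ValueError("not a u64 acknowledgement")
    (status,) = U64.unpack_from(reply, HEADER.size)
    return status == 0
```

+
+A caller that gets ``False`` back would, per the text above, currently
+be expected to stop the main vhost-user loop.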
+
+.. _backend_conventions:
+
+Backend program conventions
+===========================
+
+vhost-user backends can provide various devices & services and may
+need to be configured manually depending on the use case. However, it
+is a good idea to follow the conventions listed here when
+possible. Users, QEMU or libvirt, can then rely on some common
+behaviour to avoid heterogeneous configuration and management of the
+backend programs and facilitate interoperability.
+
+Each backend installed on a host system should come with at least one
+JSON file that conforms to the vhost-user.json schema. Each file
+informs the management applications about the backend type, and binary
+location. In addition, it defines rules for management apps for
+picking the highest priority backend when multiple match the search
+criteria (see ``@VhostUserBackend`` documentation in the schema file).
+
+If the backend is not capable of enabling a requested feature on the
+host (such as 3D acceleration with virgl), or the initialization
+failed, the backend should fail to start early and exit with a status
+!= 0. It may also print a message to stderr for further details.
+
+The backend program must not daemonize itself, but it may be
+daemonized by the management layer. It may also have restricted
+access to the system.
+
+File descriptors 0, 1 and 2 will exist, and have regular
+stdin/stdout/stderr usage (they may have been redirected to /dev/null
+by the management layer, or to a log handler).
+
+The backend program must end (as quickly and cleanly as possible) when
+the SIGTERM signal is received. If it does not, the management layer
+may send it SIGKILL after a few seconds.
+
+The following command line options have an expected behaviour. They
+are mandatory, unless explicitly said differently:
+
+--socket-path=PATH
+
+ This option specifies the location of the vhost-user Unix domain socket.
+ It is incompatible with --fd.
+
+--fd=FDNUM
+
+ When this argument is given, the backend program is started with the
+ vhost-user socket as file descriptor FDNUM. It is incompatible with
+ --socket-path.
+
+--print-capabilities
+
+ Output to stdout the backend capabilities in JSON format, and then
+ exit successfully. Other options and arguments should be ignored, and
+ the backend program should not perform its normal function. The
+ capabilities can be reported dynamically depending on the host
+ capabilities.
+
+The JSON output is described in the ``vhost-user.json`` schema, by
+``@VHostUserBackendCapabilities``. Example:
+
+.. code:: json
+
+ {
+ "type": "foo",
+ "features": [
+ "feature-a",
+ "feature-b"
+ ]
+ }
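+
+A backend's option handling along these conventions could be sketched
+as follows. This is illustrative only: ``vhost-user-foo``,
+``capabilities`` and ``main`` are hypothetical names, the reported
+capabilities are copied from the example above, and the actual
+vhost-user service loop is left out.
+

```python
import argparse
import json
import sys


def capabilities() -> dict:
    # Hypothetical values, copied from the JSON example in the text.
    return {"type": "foo", "features": ["feature-a", "feature-b"]}


def main(argv=None) -> int:
    argv = sys.argv[1:] if argv is None else list(argv)

    # --print-capabilities takes precedence: other options and arguments
    # are ignored and the backend does not perform its normal function.
    if "--print-capabilities" in argv:
        print(json.dumps(capabilities(), indent=2))
        return 0

    parser = argparse.ArgumentParser(prog="vhost-user-foo")
    # Declared only so that it shows up in --help; handled above.
    parser.add_argument("--print-capabilities", action="store_true")
    # --socket-path and --fd are incompatible with each other.
    where = parser.add_mutually_exclusive_group()
    where.add_argument("--socket-path", metavar="PATH")
    where.add_argument("--fd", type=int, metavar="FDNUM")
    args = parser.parse_args(argv)

    if args.socket_path is None and args.fd is None:
        print("one of --socket-path or --fd is required", file=sys.stderr)
        return 1  # fail early with a non-zero status

    # A real backend would now serve vhost-user requests on the socket.
    return 0
```

+
+Calling ``main(["--print-capabilities"])`` prints the JSON document and
+returns 0; passing both ``--fd`` and ``--socket-path`` makes
+``argparse`` exit with an error.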
+
+vhost-user-input
+----------------
+
+Command line options:
+
+--evdev-path=PATH
+
+ Specify the Linux input device.
+
+ (optional)
+
+--no-grab
+
+ Do not request exclusive access to the input device.
+
+ (optional)
+
+vhost-user-gpu
+--------------
+
+Command line options:
+
+--render-node=PATH
+
+ Specify the GPU DRM render node.
+
+ (optional)
+
+--virgl
+
+ Enable virgl rendering support.
+
+ (optional)
+
+vhost-user-blk
+--------------
+
+Command line options:
+
+--blk-file=PATH
+
+ Specify block device or file path.
+
+ (optional)
+
+--read-only
+
+ Enable read-only.
+
+ (optional)
diff --git a/docs/interop/vhost-vdpa.rst b/docs/interop/vhost-vdpa.rst
new file mode 100644
index 000000000..0c70ba01b
--- /dev/null
+++ b/docs/interop/vhost-vdpa.rst
@@ -0,0 +1,17 @@
+=====================
+Vhost-vdpa Protocol
+=====================
+
+Introduction
+=============
+A vDPA (virtual data path acceleration) device is a device whose
+datapath complies with the virtio specifications, combined with a
+vendor-specific control path. vDPA devices can be either physically
+located on the hardware or emulated by software.
+
+This document describes the vDPA support in QEMU.
+
+The corresponding kernel commit is:
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c8cf31885f69e86be0b5b9e6677a26797365e1d
+
+TODO: More information will be added later.
diff --git a/docs/interop/vnc-ledstate-Pseudo-encoding.txt b/docs/interop/vnc-ledstate-Pseudo-encoding.txt
new file mode 100644
index 000000000..0f124f68b
--- /dev/null
+++ b/docs/interop/vnc-ledstate-Pseudo-encoding.txt
@@ -0,0 +1,50 @@
+VNC LED state Pseudo-encoding
+=============================
+
+Introduction
+------------
+
+This document describes the Pseudo-encoding of LED state for RFB, which
+is the protocol used in VNC, as described at the reference link below:
+
+http://tigervnc.svn.sourceforge.net/viewvc/tigervnc/rfbproto/rfbproto.rst?content-type=text/plain
+
+When accessing a guest by console through VNC, there might be a mismatch
+between the lock keys notification LED on the computer running the VNC
+client session and the current status of the lock keys on the guest
+machine.
+
+To solve this problem, an LED state Pseudo-encoding extension is added
+to the VNC protocol to deal with setting the LED state.
+
+Pseudo-encoding
+---------------
+
+This Pseudo-encoding, when requested by the client, declares to the server
+that the client supports LED state extensions to the protocol.
+
+The Pseudo-encoding number for LED state is defined as:
+
+======= ===============================================================
+Number Name
+======= ===============================================================
+-261 'LED state Pseudo-encoding'
+======= ===============================================================
+
+LED state Pseudo-encoding
+--------------------------
+
+The LED state Pseudo-encoding describes the encoding of LED state, which
+consists of 3 bits. From left to right, the bits represent the Caps, Num,
+and Scroll lock keys respectively. '1' indicates that the LED should be
+on and '0' that it should be off.
+
+Some example encodings follow:
+
+======= ===============================================================
+Code Description
+======= ===============================================================
+100 CapsLock is on, NumLock and ScrollLock are off
+010 NumLock is on, CapsLock and ScrollLock are off
+111 CapsLock, NumLock and ScrollLock are on
+======= ===============================================================
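+
+The bit layout in the table can be decoded as in the following sketch.
+``decode_led_state`` is a hypothetical helper name, and the state is
+assumed to have already been extracted as an integer from the message.
+

```python
def decode_led_state(state: int) -> dict:
    """Decode the 3-bit LED state: Caps, Num, Scroll from left to right."""
    return {
        "caps_lock": bool(state & 0b100),    # leftmost bit
        "num_lock": bool(state & 0b010),     # middle bit
        "scroll_lock": bool(state & 0b001),  # rightmost bit
    }
```

+
+For example, ``decode_led_state(0b010)`` reports NumLock on and the other
+two LEDs off, matching the second row of the table.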