Diffstat (limited to 'docs/devel')
-rw-r--r--  docs/devel/atomics.rst  507
-rw-r--r--  docs/devel/bitops.rst  8
-rw-r--r--  docs/devel/blkdebug.txt  162
-rw-r--r--  docs/devel/blkverify.txt  69
-rw-r--r--  docs/devel/block-coroutine-wrapper.rst  54
-rw-r--r--  docs/devel/build-system.rst  486
-rw-r--r--  docs/devel/ci-definitions.rst.inc  121
-rw-r--r--  docs/devel/ci-jobs.rst.inc  58
-rw-r--r--  docs/devel/ci-runners.rst.inc  117
-rw-r--r--  docs/devel/ci.rst  13
-rw-r--r--  docs/devel/clocks.rst  528
-rw-r--r--  docs/devel/code-of-conduct.rst  60
-rw-r--r--  docs/devel/conflict-resolution.rst  80
-rw-r--r--  docs/devel/control-flow-integrity.rst  137
-rw-r--r--  docs/devel/decodetree.rst  237
-rw-r--r--  docs/devel/ebpf_rss.rst  125
-rw-r--r--  docs/devel/fuzzing.rst  322
-rw-r--r--  docs/devel/index.rst  50
-rw-r--r--  docs/devel/kconfig.rst  307
-rw-r--r--  docs/devel/loads-stores.rst  558
-rw-r--r--  docs/devel/lockcnt.txt  277
-rw-r--r--  docs/devel/memory.rst  368
-rw-r--r--  docs/devel/migration.rst  883
-rw-r--r--  docs/devel/modules.rst  5
-rw-r--r--  docs/devel/multi-process.rst  968
-rw-r--r--  docs/devel/multi-thread-tcg.rst  373
-rw-r--r--  docs/devel/multiple-iothreads.txt  138
-rw-r--r--  docs/devel/qapi-code-gen.rst  1932
-rw-r--r--  docs/devel/qgraph.rst  628
-rw-r--r--  docs/devel/qom.rst  389
-rw-r--r--  docs/devel/qtest.rst  92
-rw-r--r--  docs/devel/rcu.txt  406
-rw-r--r--  docs/devel/replay.txt  46
-rw-r--r--  docs/devel/reset.rst  289
-rw-r--r--  docs/devel/s390-dasd-ipl.rst  138
-rw-r--r--  docs/devel/secure-coding-practices.rst  115
-rw-r--r--  docs/devel/stable-process.rst  73
-rw-r--r--  docs/devel/style.rst  703
-rw-r--r--  docs/devel/submitting-a-patch.rst  562
-rw-r--r--  docs/devel/submitting-a-pull-request.rst  77
-rw-r--r--  docs/devel/tcg-icount.rst  94
-rw-r--r--  docs/devel/tcg-plugins.rst  438
-rw-r--r--  docs/devel/tcg.rst  190
-rw-r--r--  docs/devel/testing.rst  1309
-rw-r--r--  docs/devel/tracing.rst  498
-rw-r--r--  docs/devel/trivial-patches.rst  52
-rw-r--r--  docs/devel/ui.rst  8
-rw-r--r--  docs/devel/vfio-migration.rst  150
-rw-r--r--  docs/devel/virtio-migration.txt  108
-rw-r--r--  docs/devel/writing-monitor-commands.rst  751
50 files changed, 16059 insertions, 0 deletions
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
new file mode 100644
index 000000000..52baa0736
--- /dev/null
+++ b/docs/devel/atomics.rst
@@ -0,0 +1,507 @@
+=========================
+Atomic operations in QEMU
+=========================
+
+CPUs perform independent memory operations effectively in random order,
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest). Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking. Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization. Anything else is considerably harder, but it's
+also justified more often than one would like;
+the most performance-critical parts of QEMU in particular require
+a very low level approach to concurrency, involving memory barriers
+and atomic operations. The semantics of concurrent memory accesses are governed
+by the C11 memory model.
+
+QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
+provide better portability and a less verbose syntax. ``qemu/atomic.h``
+provides macros that fall in three camps:
+
+- compiler barriers: ``barrier()``;
+
+- weak atomic access and manual memory barriers: ``qatomic_read()``,
+ ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
+ ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
+
+- sequentially consistent atomic access: everything else.
+
+In general, use of ``qemu/atomic.h`` should be wrapped with more easily
+used data structures (e.g. the lock-free singly-linked list operations
+``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
+primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of
+atomic operations and memory barriers should be limited to inter-thread
+checking of flags and documented thoroughly.
+
+
+
+Compiler memory barrier
+=======================
+
+``barrier()`` prevents the compiler from moving the memory accesses on
+either side of it to the other side. The compiler barrier has no direct
+effect on the CPU, which may then reorder things however it wishes.
+
+``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution. In this
+case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, ``barrier()`` can be useful when writing code that can be interrupted
+by signal handlers.
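+
+For instance, a compiler barrier is enough to order a flag update against
+the data it protects when both are only touched by one thread and its own
+signal handler (a minimal sketch; ``result``, ``done``, ``do_work()`` and
+``consume()`` are hypothetical)::
+
+  /* main code, may be interrupted by the signal handler */
+  result = do_work();
+  barrier();       /* do not let the compiler hoist the flag update */
+  done = 1;
+
+  /* signal handler, runs in the same thread */
+  if (done) {
+      barrier();   /* do not let the compiler read result early */
+      consume(result);
+  }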
+
+
+Sequentially consistent atomic access
+=====================================
+
+Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+``qemu/atomic.h`` provides the following set of atomic read-modify-write
+operations::
+
+ void qatomic_inc(ptr)
+ void qatomic_dec(ptr)
+ void qatomic_add(ptr, val)
+ void qatomic_sub(ptr, val)
+ void qatomic_and(ptr, val)
+ void qatomic_or(ptr, val)
+
+ typeof(*ptr) qatomic_fetch_inc(ptr)
+ typeof(*ptr) qatomic_fetch_dec(ptr)
+ typeof(*ptr) qatomic_fetch_add(ptr, val)
+ typeof(*ptr) qatomic_fetch_sub(ptr, val)
+ typeof(*ptr) qatomic_fetch_and(ptr, val)
+ typeof(*ptr) qatomic_fetch_or(ptr, val)
+ typeof(*ptr) qatomic_fetch_xor(ptr, val)
+ typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
+ typeof(*ptr) qatomic_xchg(ptr, val)
+ typeof(*ptr) qatomic_cmpxchg(ptr, old, new)
+
+all of which, except for the ``void`` ones, return the old value of
+``*ptr``. These operations are polymorphic; they operate on any type
+that is as wide as a pointer or smaller.
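+
+As an illustrative sketch (not taken from the QEMU source; the structure,
+field and ``foo_free()`` function are hypothetical), the fetch variants are
+a natural fit for reference counting::
+
+  typedef struct Foo {
+      int refcount;
+      /* ... payload ... */
+  } Foo;
+
+  static void foo_ref(Foo *foo)
+  {
+      qatomic_inc(&foo->refcount);
+  }
+
+  static void foo_unref(Foo *foo)
+  {
+      /* qatomic_fetch_dec() returns the value *before* the decrement. */
+      if (qatomic_fetch_dec(&foo->refcount) == 1) {
+          foo_free(foo);
+      }
+  }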
+
+Similar operations return the new value of ``*ptr``::
+
+ typeof(*ptr) qatomic_inc_fetch(ptr)
+ typeof(*ptr) qatomic_dec_fetch(ptr)
+ typeof(*ptr) qatomic_add_fetch(ptr, val)
+ typeof(*ptr) qatomic_sub_fetch(ptr, val)
+ typeof(*ptr) qatomic_and_fetch(ptr, val)
+ typeof(*ptr) qatomic_or_fetch(ptr, val)
+ typeof(*ptr) qatomic_xor_fetch(ptr, val)
+
+``qemu/atomic.h`` also provides loads and stores that cannot be reordered
+with each other::
+
+ typeof(*ptr) qatomic_mb_read(ptr)
+ void qatomic_mb_set(ptr, val)
+
+However these do not provide sequential consistency and, in particular,
+they do not participate in the total ordering enforced by
+sequentially-consistent operations. For this reason they are deprecated.
+They should instead be replaced with any of the following (ordered from
+easiest to hardest):
+
+- accesses inside a mutex or spinlock
+
+- lightweight synchronization primitives such as ``QemuEvent``
+
+- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing
+ or accessing a new version of a data structure
+
+- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for
+ loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb``
+ to forbid reordering subsequent loads before a store.
+
+
+Weak atomic access and manual memory barriers
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+The only guarantees that you can rely upon in this case are:
+
+- atomic accesses will not cause data races (and hence undefined behavior);
+ ordinary accesses instead cause data races if they are concurrent with
+ other accesses of which at least one is a write. In order to ensure this,
+ the compiler will not optimize accesses out of existence, create unsolicited
+ accesses, or perform other similar optimizations.
+
+- acquire operations will appear to happen, with respect to the other
+ components of the system, before all the LOAD or STORE operations
+ specified afterwards.
+
+- release operations will appear to happen, with respect to the other
+ components of the system, after all the LOAD or STORE operations
+ specified before.
+
+- release operations will *synchronize with* acquire operations;
+ see :ref:`acqrel` for a detailed explanation.
+
+When using this model, variables are accessed with:
+
+- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
+ optimizing accesses out of existence and creating unsolicited
+ accesses, but do not otherwise impose any ordering on loads and
+ stores: both the compiler and the processor are free to reorder
+ them.
+
+- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
+ happen, with respect to the other components of the system,
+ before all the LOAD or STORE operations specified afterwards.
+ Operations coming before ``qatomic_load_acquire()`` can still be
+ reordered after it.
+
+- ``qatomic_store_release()``, which guarantees the STORE to appear to
+ happen, with respect to the other components of the system,
+ after all the LOAD or STORE operations specified before.
+ Operations coming after ``qatomic_store_release()`` can still be
+ reordered before it.
+
+Restrictions to the ordering of accesses can also be specified
+using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
+``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.
+
+Memory barriers control the order of references to shared memory.
+They come in six kinds:
+
+- ``smp_rmb()`` guarantees that all the LOAD operations specified before
+ the barrier will appear to happen before all the LOAD operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+ In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
+ required to have any effect on stores.
+
+- ``smp_wmb()`` guarantees that all the STORE operations specified before
+ the barrier will appear to happen before all the STORE operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+ In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
+ required to have any effect on loads.
+
+- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
+ the barrier will appear to happen before all the LOAD or STORE operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
+ the barrier will appear to happen after all the LOAD or STORE operations
+ specified *before* the barrier with respect to the other components of
+ the system.
+
+- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
+ before the barrier will appear to happen before all the LOAD and
+ STORE operations specified after the barrier with respect to the other
+ components of the system.
+
+ ``smp_mb()`` puts a partial ordering on both loads and stores. It is
+ stronger than both a read and a write memory barrier; it implies both
+ ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
+ coming before the barrier from overtaking LOADs coming after the
+ barrier and vice versa.
+
+- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On
+ most processors, whenever two loads are performed such that the
+ second depends on the result of the first (e.g., the first load
+ retrieves the address to which the second load will be directed),
+ the processor will guarantee that the first LOAD will appear to happen
+ before the second with respect to the other components of the system.
+ However, this is not always true---for example, it was not true on
+ Alpha processors. Whenever this kind of access happens to shared
+ memory (that is not protected by a lock), a read barrier is needed,
+ and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.
+
+ Note that the first load really has to have a *data* dependency and not
+ a control dependency. If the address for the second load is dependent
+ on the first load, but the dependency is through a conditional rather
+ than actually loading the address itself, then it's a *control*
+ dependency and a full read barrier or better is required.
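+
+A sketch of the difference (hypothetical variables; not from the QEMU
+source)::
+
+  /* Data dependency: the address of the second load comes from the
+   * first load, so smp_read_barrier_depends() is sufficient.
+   */
+  p = qatomic_read(&global_ptr);
+  smp_read_barrier_depends();
+  val = p->field;
+
+  /* Control dependency: the second load is only conditional on the
+   * first, so a full read barrier (smp_rmb() or stronger) is needed.
+   */
+  if (qatomic_read(&flag)) {
+      smp_rmb();
+      val = qatomic_read(&data);
+  }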
+
+
+Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
+mostly used when a data structure has one thread that is always a writer
+and one thread that is always a reader:
+
+ +----------------------------------+----------------------------------+
+ | thread 1 | thread 2 |
+ +==================================+==================================+
+ | :: | :: |
+ | | |
+ | qatomic_store_release(&a, x); | y = qatomic_load_acquire(&b); |
+ | qatomic_store_release(&b, y); | x = qatomic_load_acquire(&a); |
+ +----------------------------------+----------------------------------+
+
+In this case, correctness is easy to check for using the "pairing"
+trick that is explained below.
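+
+A typical concrete use of this pattern is publishing a payload behind a
+"ready" flag (a sketch with hypothetical variables and helpers)::
+
+  /* writer (thread 1) */
+  data = compute();                    /* plain access, no race yet     */
+  qatomic_store_release(&ready, true);
+
+  /* reader (thread 2) */
+  if (qatomic_load_acquire(&ready)) {
+      use(data);                       /* guaranteed to see the payload */
+  }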
+
+Sometimes, a thread is accessing many variables that are otherwise
+unrelated to each other (for example because, apart from the current
+thread, exactly one other thread will read or write each of these
+variables). In this case, it is possible to "hoist" the barriers
+outside a loop. For example:
+
+ +------------------------------------------+----------------------------------+
+ | before | after |
+ +==========================================+==================================+
+ | :: | :: |
+ | | |
+ | n = 0; | n = 0; |
+ | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
+ | n += qatomic_load_acquire(&a[i]); | n += qatomic_read(&a[i]); |
+ | | smp_mb_acquire(); |
+ +------------------------------------------+----------------------------------+
+ | :: | :: |
+ | | |
+ | | smp_mb_release(); |
+ | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
+ | qatomic_store_release(&a[i], false); | qatomic_set(&a[i], false); |
+ +------------------------------------------+----------------------------------+
+
+Splitting a loop can also be useful to reduce the number of barriers:
+
+ +--------------------------------------------+----------------------------------+
+ | before                                      | after                            |
+ +============================================+==================================+
+ | ::                                          | ::                               |
+ |                                             |                                  |
+ |     n = 0;                                  |     smp_mb_release();            |
+ |     for (i = 0; i < 10; i++) {              |     for (i = 0; i < 10; i++)     |
+ |       qatomic_store_release(&a[i], false);  |       qatomic_set(&a[i], false); |
+ |       smp_mb();                             |     smp_mb();                    |
+ |       n += qatomic_read(&b[i]);             |     n = 0;                       |
+ |     }                                       |     for (i = 0; i < 10; i++)     |
+ |                                             |       n += qatomic_read(&b[i]);  |
+ +--------------------------------------------+----------------------------------+
+
+In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
+as well) ``smp_wmb()``:
+
+ +--------------------------------------------+----------------------------------+
+ | before                                      | after                            |
+ +============================================+==================================+
+ | ::                                          | ::                               |
+ |                                             |                                  |
+ |                                             |     smp_mb_release();            |
+ |     for (i = 0; i < 10; i++) {              |     for (i = 0; i < 10; i++)     |
+ |       qatomic_store_release(&a[i], false);  |       qatomic_set(&a[i], false); |
+ |       qatomic_store_release(&b[i], false);  |     smp_wmb();                   |
+ |     }                                       |     for (i = 0; i < 10; i++)     |
+ |                                             |       qatomic_set(&b[i], false); |
+ +--------------------------------------------+----------------------------------+
+
+
+.. _acqrel:
+
+Acquire/release pairing and the *synchronizes-with* relation
+------------------------------------------------------------
+
+Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
+either *acquire* or *release* semantics [#rmw]_. This has two effects:
+
+.. [#rmw] Read-modify-write operations can have both---acquire applies to the
+ read part, and release to the write.
+
+- within a thread, they are ordered either before subsequent operations
+ (for acquire) or after previous operations (for release).
+
+- if a release operation in one thread *synchronizes with* an acquire operation
+ in another thread, the ordering constraints propagate from the first to the
+ second thread. That is, everything before the release operation in the
+ first thread is guaranteed to *happen before* everything after the
+ acquire operation in the second thread.
+
+The concept of acquire and release semantics is not exclusive to atomic
+operations; almost all higher-level synchronization primitives also have
+acquire or release semantics. For example:
+
+- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
+ release semantics and synchronizes with a ``pthread_mutex_lock`` for the
+ same mutex.
+
+- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
+ ``pthread_cond_wait`` has both release semantics (synchronizing with
+ ``pthread_mutex_lock``) and acquire semantics (synchronizing with
+ ``pthread_mutex_unlock`` and signaling of the condition variable).
+
+- ``pthread_create`` has release semantics and synchronizes with the start
+ of the new thread; ``pthread_join`` has acquire semantics and synchronizes
+ with the exiting of the thread.
+
+- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
+ acquire semantics.
+
+For example, in the following code there are no atomic accesses, but thread 2
+still relies on the *synchronizes-with* relation between ``pthread_exit``
+(release) and ``pthread_join`` (acquire):
+
+ +----------------------+-------------------------------+
+ | thread 1 | thread 2 |
+ +======================+===============================+
+ | :: | :: |
+ | | |
+ | *a = 1; | |
+ | pthread_exit(a); | pthread_join(thread1, &a); |
+ | | x = *a; |
+ +----------------------+-------------------------------+
+
+Synchronization between threads basically descends from this pairing of
+a release operation and an acquire operation. Therefore, atomic operations
+other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
+paired with another operation of the opposite kind: an acquire operation
+will pair with a release operation and vice versa. This rule of thumb is
+extremely useful; in the case of QEMU, however, note that the other
+operation may actually be in a driver that runs in the guest!
+
+``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
+``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
+as acquire operations. ``smp_wmb()``, ``smp_mb_release()``,
+``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
+operations. ``smp_mb()`` counts as both acquire and release, therefore
+it can pair with any other atomic operation. Here is an example:
+
+ +----------------------+------------------------------+
+ | thread 1 | thread 2 |
+ +======================+==============================+
+ | :: | :: |
+ | | |
+ | qatomic_set(&a, 1);| |
+ | smp_wmb(); | |
+ | qatomic_set(&b, 2);| x = qatomic_read(&b); |
+ | | smp_rmb(); |
+ | | y = qatomic_read(&a); |
+ +----------------------+------------------------------+
+
+Note that a load-store pair only counts if the two operations access the
+same variable: that is, a store-release on a variable ``x`` *synchronizes
+with* a load-acquire on a variable ``x``, while a release barrier
+synchronizes with any acquire operation. The following example shows
+correct synchronization:
+
+ +--------------------------------+--------------------------------+
+ | thread 1 | thread 2 |
+ +================================+================================+
+ | :: | :: |
+ | | |
+ | qatomic_set(&a, 1); | |
+ | qatomic_store_release(&b, 2);| x = qatomic_load_acquire(&b);|
+ | | y = qatomic_read(&a); |
+ +--------------------------------+--------------------------------+
+
+Acquire and release semantics of higher-level primitives can also be
+relied upon for the purpose of establishing the *synchronizes with*
+relation.
+
+Note that the "writing" thread is accessing the variables in the
+opposite order as the "reading" thread. This is expected: stores
+before a release operation will normally match the loads after
+the acquire operation, and vice versa. In fact, this happened already
+in the ``pthread_exit``/``pthread_join`` example above.
+
+Finally, this more complex example has more than two accesses and data
+dependency barriers. It also does not use atomic accesses whenever there
+cannot be a data race:
+
+ +----------------------+------------------------------+
+ | thread 1 | thread 2 |
+ +======================+==============================+
+ | :: | :: |
+ | | |
+ | b[2] = 1; | |
+ | smp_wmb(); | |
+ | x->i = 2; | |
+ | smp_wmb(); | |
+ | qatomic_set(&a, x);| x = qatomic_read(&a); |
+ | | smp_read_barrier_depends(); |
+ | | y = x->i; |
+ | | smp_read_barrier_depends(); |
+ | | z = b[y]; |
+ +----------------------+------------------------------+
+
+Comparison with Linux kernel primitives
+=======================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are always on a 32-bit int type and
+ use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
+ and use normal C types.
+
+- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
+ at all. Linux 4.1 updated them to implement volatile
+ semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).
+
+ QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
+ semantics if the compiler supports it, and volatile semantics otherwise.
+ Both semantics prevent the compiler from doing certain transformations;
+ the difference is that atomic accesses are guaranteed to be atomic,
+ while volatile accesses aren't. Thus, in the volatile case we just cross
+ our fingers hoping that the compiler will generate atomic accesses,
+ since we assume the variables passed are machine-word sized and
+ properly aligned.
+
+ No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
+ Linux or QEMU.
+
+- atomic read-modify-write operations in Linux are of three kinds:
+
+ ===================== =========================================
+ ``atomic_OP`` returns void
+ ``atomic_OP_return`` returns new value of the variable
+ ``atomic_fetch_OP`` returns the old value of the variable
+ ``atomic_cmpxchg`` returns the old value of the variable
+ ===================== =========================================
+
+ In QEMU, the second kind is named ``qatomic_OP_fetch``.
+
+- different atomic read-modify-write operations in Linux imply
+ a different set of memory barriers; in QEMU, all of them enforce
+ sequential consistency.
+
+- in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
+ the total ordering enforced by sequentially-consistent operations.
+ This is because QEMU uses the C11 memory model. The following example
+ is correct in Linux but not in QEMU:
+
+ +----------------------------------+--------------------------------+
+ | Linux (correct) | QEMU (incorrect) |
+ +==================================+================================+
+ | :: | :: |
+ | | |
+ | a = atomic_fetch_add(&x, 2); | a = qatomic_fetch_add(&x, 2);|
+ | b = READ_ONCE(&y); | b = qatomic_read(&y); |
+ +----------------------------------+--------------------------------+
+
+ because the read of ``y`` can be moved (by either the processor or the
+ compiler) before the write of ``x``.
+
+ Fixing this requires an ``smp_mb()`` memory barrier between the write
+ of ``x`` and the read of ``y``. In the common case where only one thread
+ writes ``x``, it is also possible to write it like this:
+
+ +--------------------------------+
+ | QEMU (correct) |
+ +================================+
+ | :: |
+ | |
+ | a = qatomic_read(&x); |
+ | qatomic_set(&x, a + 2); |
+ | smp_mb(); |
+ | b = qatomic_read(&y); |
+ +--------------------------------+
+
+Sources
+=======
+
+- ``Documentation/memory-barriers.txt`` from the Linux kernel
diff --git a/docs/devel/bitops.rst b/docs/devel/bitops.rst
new file mode 100644
index 000000000..6addaecf8
--- /dev/null
+++ b/docs/devel/bitops.rst
@@ -0,0 +1,8 @@
+==================
+Bitwise operations
+==================
+
+The header ``qemu/bitops.h`` provides utility functions for
+performing bitwise operations.
+
+.. kernel-doc:: include/qemu/bitops.h
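+
+For example, field extraction and insertion can be written with the
+``extract32()`` and ``deposit32()`` helpers (an illustrative sketch;
+``reg`` is a hypothetical 32-bit register value)::
+
+  uint32_t field = extract32(reg, 8, 4);    /* read bits [11:8]     */
+  reg = deposit32(reg, 8, 4, field + 1);    /* write the field back */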
diff --git a/docs/devel/blkdebug.txt b/docs/devel/blkdebug.txt
new file mode 100644
index 000000000..0b0c128d3
--- /dev/null
+++ b/docs/devel/blkdebug.txt
@@ -0,0 +1,162 @@
+Block I/O error injection using blkdebug
+----------------------------------------
+Copyright (C) 2014-2015 Red Hat Inc
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+The blkdebug block driver is a rule-based error injection engine. It can be
+used to exercise error code paths in block drivers including ENOSPC (out of
+space) and EIO.
+
+This document gives an overview of the features available in blkdebug.
+
+Background
+----------
+Block drivers have many error code paths that handle I/O errors. Image formats
+are especially complex since metadata I/O errors during cluster allocation or
+while updating tables happen halfway through request processing and require
+discipline to keep image files consistent.
+
+Error injection allows test cases to trigger I/O errors at specific points.
+This way, all error paths can be tested to make sure they are correct.
+
+Rules
+-----
+The blkdebug block driver takes a list of "rules" that tell the error injection
+engine when to fail an I/O request.
+
+Each I/O request is evaluated against the rules. If a rule matches the request
+then its "action" is executed.
+
+Rules can be placed in a configuration file; the configuration file
+follows the same .ini-like format used by QEMU's -readconfig option, and
+each section of the file represents a rule.
+
+The following configuration file defines a single rule:
+
+ $ cat blkdebug.conf
+ [inject-error]
+ event = "read_aio"
+ errno = "28"
+
+This rule fails all aio read requests with ENOSPC (28). Note that the errno
+value depends on the host. On Linux, see
+/usr/include/asm-generic/errno-base.h for errno values.
+
+Invoke QEMU as follows:
+
+ $ qemu-system-x86_64 \
+ -drive if=none,cache=none,file=blkdebug:blkdebug.conf:test.img,id=drive0 \
+ -device virtio-blk-pci,drive=drive0,id=virtio-blk-pci0
+
+Rules support the following attributes:
+
+ event - which type of operation to match (e.g. read_aio, write_aio,
+ flush_to_os, flush_to_disk). See the "Events" section for
+ information on events.
+
+ state - (optional) the engine must be in this state number in order for this
+ rule to match. See the "State transitions" section for information
+ on states.
+
+ errno - the numeric errno value to return when a request matches this rule.
+ The errno values depend on the host since the numeric values are not
+ standardized in the POSIX specification.
+
+ sector - (optional) a sector number that the request must overlap in order to
+ match this rule
+
+ once - (optional, default "off") only execute this action on the first
+ matching request
+
+ immediately - (optional, default "off") return a NULL BlockAIOCB
+ pointer and fail without an errno instead. This
+ exercises the code path where BlockAIOCB fails and the
+ caller's BlockCompletionFunc is not invoked.
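+
+Attributes can be combined in a single rule. For example, the following
+(illustrative) rule injects EIO exactly once, and only for a read request
+that overlaps sector 202144:
+
+  [inject-error]
+  event = "read_aio"
+  errno = "5"
+  sector = "202144"
+  once = "on"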
+
+Events
+------
+Block drivers provide information about the type of I/O request they are about
+to make so rules can match specific types of requests. For example, the qcow2
+block driver tells blkdebug when it accesses the L1 table so rules can match
+only L1 table accesses and not other metadata or guest data requests.
+
+The core events are:
+
+ read_aio - guest data read
+
+ write_aio - guest data write
+
+ flush_to_os - write out unwritten block driver state (e.g. cached metadata)
+
+ flush_to_disk - flush the host block device's disk cache
+
+See qapi/block-core.json:BlkdebugEvent for the full list of events.
+You may need to grep block driver source code to understand the
+meaning of specific events.
+
+State transitions
+-----------------
+There are cases where more power is needed to match a particular I/O request in
+a longer sequence of requests. For example:
+
+ write_aio
+ flush_to_disk
+ write_aio
+
+How do we match the 2nd write_aio but not the first? This is where state
+transitions come in.
+
+The error injection engine has an integer called the "state" that always starts
+initialized to 1. The state integer is internal to blkdebug and cannot be
+observed from outside but rules can interact with it for powerful matching
+behavior.
+
+Rules can be conditional on the current state and they can transition to a new
+state.
+
+When a rule's "state" attribute is non-zero then the current state must equal
+the attribute in order for the rule to match.
+
+For example, to match the 2nd write_aio:
+
+ [set-state]
+ event = "write_aio"
+ state = "1"
+ new_state = "2"
+
+ [inject-error]
+ event = "write_aio"
+ state = "2"
+ errno = "5"
+
+The first write_aio request matches the set-state rule and transitions from
+state 1 to state 2. Once state 2 has been entered, the set-state rule no
+longer matches since it requires state 1. But the inject-error rule now
+matches the next write_aio request and injects EIO (5).
+
+State transition rules support the following attributes:
+
+ event - which type of operation to match (e.g. read_aio, write_aio,
+ flush_to_os, flush_to_disk). See the "Events" section for
+ information on events.
+
+ state - (optional) the engine must be in this state number in order for this
+ rule to match
+
+ new_state - transition to this state number
+
+Suspend and resume
+------------------
+Exercising code paths in block drivers may require specific ordering amongst
+concurrent requests. The "breakpoint" feature allows requests to be halted on
+a blkdebug event and resumed later. This makes it possible to achieve
+deterministic ordering when multiple requests are in flight.
+
+Breakpoints on blkdebug events are associated with a user-defined "tag" string.
+This tag serves as an identifier by which the request can be resumed at a later
+point.
+
+See the qemu-io(1) break, resume, remove_break, and wait_break commands for
+details.
diff --git a/docs/devel/blkverify.txt b/docs/devel/blkverify.txt
new file mode 100644
index 000000000..aca826c51
--- /dev/null
+++ b/docs/devel/blkverify.txt
@@ -0,0 +1,69 @@
+= Block driver correctness testing with blkverify =
+
+== Introduction ==
+
+This document describes how to use the blkverify protocol to test that a block
+driver is operating correctly.
+
+It is difficult to test and debug block drivers against real guests. Often
+processes inside the guest will crash because corrupt sectors were read as part
+of the executable. Other times obscure errors are raised by a program inside
+the guest. These issues are extremely hard to trace back to bugs in the block
+driver.
+
+Blkverify solves this problem by catching data corruption inside QEMU the first
+time bad data is read and reporting the disk sector that is corrupted.
+
+== How it works ==
+
+The blkverify protocol has two child block devices, the "test" device and the
+"raw" device. Read/write operations are mirrored to both devices so their
+state should always be in sync.
+
+The "raw" device is a raw image, a flat file, that has identical starting
+contents to the "test" image. The idea is that the "raw" device will handle
+read/write operations correctly and not corrupt data. It can be used as a
+reference for comparison against the "test" device.
+
+After a mirrored read operation completes, blkverify will compare the data and
+raise an error if it is not identical. This makes it possible to catch the
+first instance where corrupt data is read.
+
+== Example ==
+
+Imagine raw.img has 0xcd repeated throughout its first sector:
+
+ $ ./qemu-io -c 'read -v 0 512' raw.img
+ 00000000: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ 00000010: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ [...]
+ 000001e0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ 000001f0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ read 512/512 bytes at offset 0
+ 512.000000 bytes, 1 ops; 0.0000 sec (97.656 MiB/sec and 200000.0000 ops/sec)
+
+And test.img is corrupt, its first sector is zeroed when it shouldn't be:
+
+ $ ./qemu-io -c 'read -v 0 512' test.img
+ 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ [...]
+ 000001e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ 000001f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ read 512/512 bytes at offset 0
+ 512.000000 bytes, 1 ops; 0.0000 sec (81.380 MiB/sec and 166666.6667 ops/sec)
+
+This error is caught by blkverify:
+
+ $ ./qemu-io -c 'read 0 512' blkverify:raw.img:test.img
+ blkverify: read sector_num=0 nb_sectors=4 contents mismatch in sector 0
+
+A more realistic scenario is verifying the installation of a guest OS:
+
+ $ ./qemu-img create raw.img 16G
+ $ ./qemu-img create -f qcow2 test.qcow2 16G
+ $ ./qemu-system-x86_64 -cdrom debian.iso \
+ -drive file=blkverify:raw.img:test.qcow2
+
+If the installation is aborted when blkverify detects corruption, use qemu-io
+to explore the contents of the disk image at the sector in question.
diff --git a/docs/devel/block-coroutine-wrapper.rst b/docs/devel/block-coroutine-wrapper.rst
new file mode 100644
index 000000000..412851986
--- /dev/null
+++ b/docs/devel/block-coroutine-wrapper.rst
@@ -0,0 +1,54 @@
+=======================
+block-coroutine-wrapper
+=======================
+
+A lot of functions in the QEMU block layer (see ``block/*``) can only be
+called in coroutine context. Such functions are normally marked by the
+coroutine_fn specifier. Still, sometimes we need to call them from
+non-coroutine context; for this we need to start a coroutine, run the
+needed function from it and wait for the coroutine to finish in a
+BDRV_POLL_WHILE() loop. To run a coroutine we need a function with one
+void* argument. So, for each coroutine_fn function that needs a
+non-coroutine interface, we should define a structure to pack the
+parameters, define a separate function that unpacks the parameters and
+calls the original function, and finally define a new interface function
+with the same list of arguments as the original one, which packs the
+parameters into the struct, creates a coroutine, runs it and waits in a
+BDRV_POLL_WHILE() loop. It's boring to create such wrappers by hand,
+so we have a script to generate them.
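+
+For illustration, a hand-written wrapper for the hypothetical
+``bdrv_co_foo()`` used below would look roughly like this (a simplified
+sketch; the generated code also handles being called from coroutine
+context and other details):
+
+.. code-block:: c
+
+   typedef struct BdrvFooCo {
+       BlockDriverState *bs;
+       int ret;
+       bool done;
+   } BdrvFooCo;
+
+   static void coroutine_fn bdrv_foo_co_entry(void *opaque)
+   {
+       BdrvFooCo *s = opaque;
+
+       s->ret = bdrv_co_foo(s->bs /*, <some args> */);
+       s->done = true;
+   }
+
+   int bdrv_foo(BlockDriverState *bs /*, <same args> */)
+   {
+       BdrvFooCo s = { .bs = bs };
+       Coroutine *co = qemu_coroutine_create(bdrv_foo_co_entry, &s);
+
+       bdrv_coroutine_enter(bs, co);
+       BDRV_POLL_WHILE(bs, !s.done);
+       return s.ret;
+   }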
+
+Usage
+=====
+
+Assume we have defined the ``coroutine_fn`` function
+``bdrv_co_foo(<some args>)`` and need a non-coroutine interface for it,
+called ``bdrv_foo(<same args>)``. In this case the script can help. To
+trigger the generation:
+
+1. You need ``bdrv_foo`` declaration somewhere (for example, in
+ ``block/coroutines.h``) with the ``generated_co_wrapper`` mark,
+ like this:
+
+.. code-block:: c
+
+ int generated_co_wrapper bdrv_foo(<some args>);
+
+2. You need to feed this declaration to block-coroutine-wrapper script.
+ For this, add the .h (or .c) file with the declaration to the
+ ``input: files(...)`` list of ``block_gen_c`` target declaration in
+ ``block/meson.build``
+
+You are done. During the build, coroutine wrappers will be generated in
+``<BUILD_DIR>/block/block-gen.c``.
+
+Links
+=====
+
+1. The script location is ``scripts/block-coroutine-wrapper.py``.
+
+2. Generic place for private ``generated_co_wrapper`` declarations is
+ ``block/coroutines.h``, for public declarations:
+ ``include/block/block.h``
+
+3. The core API of generated coroutine wrappers is placed in
+ (not generated) ``block/block-gen.h``
diff --git a/docs/devel/build-system.rst b/docs/devel/build-system.rst
new file mode 100644
index 000000000..431caba7a
--- /dev/null
+++ b/docs/devel/build-system.rst
@@ -0,0 +1,486 @@
+==================================
+The QEMU build system architecture
+==================================
+
+This document aims to help developers understand the architecture of the
+QEMU build system. As with projects using GNU autotools, the QEMU build
+system has two stages, first the developer runs the "configure" script
+to determine the local build environment characteristics, then they run
+"make" to build the project. There is about where the similarities with
+GNU autotools end, so try to forget what you know about them.
+
+
+Stage 1: configure
+==================
+
+The QEMU configure script is written directly in shell, and should be
+compatible with any POSIX shell, hence it uses #!/bin/sh. An important
+implication of this is that bash-isms must be avoided, even on
+development platforms where bash is the primary host shell.
+
+In contrast to autoconf scripts, QEMU's configure is expected to be
+silent while it is checking for features. It will only display output
+when an error occurs, or to show the final feature enablement summary
+on completion.
+
+Because QEMU uses the Meson build system under the hood, only VPATH
+builds are supported. There are two general ways to invoke configure &
+perform a build:
+
+ - VPATH, build artifacts outside of QEMU source tree entirely::
+
+ cd ../
+ mkdir build
+ cd build
+ ../qemu/configure
+ make
+
+ - VPATH, build artifacts in a subdir of QEMU source tree::
+
+ mkdir build
+ cd build
+ ../configure
+ make
+
+The configure script automatically recognizes
+command line options for which a same-named Meson option exists;
+dashes in the command line are replaced with underscores.
+
+Many checks on the compilation environment are still found in configure
+rather than ``meson.build``, but new checks should be added directly to
+``meson.build``.
+
+Patches are also welcome to move existing checks from the configure
+phase to ``meson.build``. When doing so, ensure that ``meson.build`` no
+longer uses the keys that you have removed from ``config-host.mak``.
+Typically these will be replaced in ``meson.build`` by boolean variables,
+``get_option('optname')`` invocations, or ``dep.found()`` expressions.
+In general, the remaining checks have little or no interdependencies,
+so they can be moved one by one.
+
+Helper functions
+----------------
+
+The configure script provides a variety of helper functions to assist
+developers in checking for system features:
+
+``do_cc $ARGS...``
+ Attempt to run the system C compiler passing it $ARGS...
+
+``do_cxx $ARGS...``
+ Attempt to run the system C++ compiler passing it $ARGS...
+
+``compile_object $CFLAGS``
+ Attempt to compile a test program with the system C compiler using
+ $CFLAGS. The test program must have been previously written to a file
+ called $TMPC. The replacement in Meson is the compiler object ``cc``,
+ which has methods such as ``cc.compiles()``,
+ ``cc.check_header()``, ``cc.has_function()``.
+
+``compile_prog $CFLAGS $LDFLAGS``
+ Attempt to compile a test program with the system C compiler using
+ $CFLAGS and link it with the system linker using $LDFLAGS. The test
+ program must have been previously written to a file called $TMPC.
+ The replacement in Meson is ``cc.find_library()`` and ``cc.links()``.
+
+``has $COMMAND``
+ Determine if $COMMAND exists in the current environment, either as a
+ shell builtin, or executable binary, returning 0 on success. The
+ replacement in Meson is ``find_program()``.
+
+``check_define $NAME``
+ Determine if the macro $NAME is defined by the system C compiler
+
+``check_include $NAME``
+ Determine if the include $NAME file is available to the system C
+ compiler. The replacement in Meson is ``cc.has_header()``.
+
+``write_c_skeleton``
+ Write a minimal C program main() function to the temporary file
+ indicated by $TMPC
+
+``feature_not_found $NAME $REMEDY``
+ Print a message to stderr that the feature $NAME was not available
+ on the system, suggesting the user try $REMEDY to address the
+ problem.
+
+``error_exit $MESSAGE $MORE...``
+ Print $MESSAGE to stderr, followed by $MORE... and then exit from the
+ configure script with non-zero status
+
+``query_pkg_config $ARGS...``
+ Run pkg-config passing it $ARGS. If QEMU is doing a static build,
+ then --static will be automatically added to $ARGS
+
+
+Stage 2: Meson
+==============
+
+The Meson build system is currently used to describe the build
+process for:
+
+1) executables, which include:
+
+ - Tools - ``qemu-img``, ``qemu-nbd``, ``qga`` (guest agent), etc
+
+ - System emulators - ``qemu-system-$ARCH``
+
+ - Userspace emulators - ``qemu-$ARCH``
+
+ - Unit tests
+
+2) documentation
+
+3) ROMs, which can be either installed as binary blobs or compiled
+
+4) other data files, such as icons or desktop files
+
+All executables are built by default, except for some ``contrib/``
+binaries that are known to fail to build on some platforms (for example
+32-bit or big-endian platforms). Tests are also built by default,
+though that might change in the future.
+
+The source code is highly modularized, split across many files to
+facilitate building of all of these components with as little duplicated
+compilation as possible. Using the Meson "sourceset" functionality,
+``meson.build`` files group the source files in rules that are
+enabled according to the available system libraries and to various
+configuration symbols. Sourcesets belong to one of four groups:
+
+Subsystem sourcesets:
+ Various subsystems that are common to both tools and emulators have
+ their own sourceset, for example ``block_ss`` for the block device subsystem,
+ ``chardev_ss`` for the character device subsystem, etc. These sourcesets
+ are then turned into static libraries as follows::
+
+ libchardev = static_library('chardev', chardev_ss.sources(),
+ name_suffix: 'fa',
+ build_by_default: false)
+
+ chardev = declare_dependency(link_whole: libchardev)
+
+ As of Meson 0.55.1, the special ``.fa`` suffix should be used for everything
+ that is used with ``link_whole``, to ensure that the link flags are placed
+ correctly in the command line.
+
+Target-independent emulator sourcesets:
+ Various general purpose helper code is compiled only once and
+ the .o files are linked into all output binaries that need it.
+ This includes error handling infrastructure, standard data structures,
+ platform portability wrapper functions, etc.
+
+ Target-independent code lives in the ``common_ss``, ``softmmu_ss`` and
+ ``user_ss`` sourcesets. ``common_ss`` is linked into all emulators,
+ ``softmmu_ss`` only in system emulators, ``user_ss`` only in user-mode
+ emulators.
+
+ Target-independent sourcesets must exercise particular care when using
+ ``if_false`` rules. The ``if_false`` rule will be used correctly when linking
+ emulator binaries; however, when *compiling* target-independent files
+ into .o files, Meson may need to pick *both* the ``if_true`` and
+ ``if_false`` sides to cater for targets that want either side. To
+ achieve that, you can add a special rule using the ``CONFIG_ALL``
+ symbol::
+
+ # Some targets have CONFIG_ACPI, some don't, so this is not enough
+ softmmu_ss.add(when: 'CONFIG_ACPI', if_true: files('acpi.c'),
+ if_false: files('acpi-stub.c'))
+
+ # This is required as well:
+ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('acpi-stub.c'))
+
+Target-dependent emulator sourcesets:
+ In the target-dependent set lives CPU emulation, some device emulation and
+ much glue code. This sometimes also has to be compiled multiple times,
+ once for each target being built. Target-dependent files are included
+ in the ``specific_ss`` sourceset.
+
+ Each emulator also includes sources for files in the ``hw/`` and ``target/``
+ subdirectories. The subdirectory used for each emulator comes
+ from the target's definition of ``TARGET_BASE_ARCH`` or (if missing)
+ ``TARGET_ARCH``, as found in ``default-configs/targets/*.mak``.
+
+ Each subdirectory in ``hw/`` adds one sourceset to the ``hw_arch`` dictionary,
+ for example::
+
+ arm_ss = ss.source_set()
+ arm_ss.add(files('boot.c'), fdt)
+ ...
+ hw_arch += {'arm': arm_ss}
+
+ The sourceset is only used for system emulators.
+
+ Each subdirectory in ``target/`` instead should add one sourceset to each
+ of the ``target_arch`` and ``target_softmmu_arch``, which are used respectively
+ for all emulators and for system emulators only. For example::
+
+ arm_ss = ss.source_set()
+ arm_softmmu_ss = ss.source_set()
+ ...
+ target_arch += {'arm': arm_ss}
+ target_softmmu_arch += {'arm': arm_softmmu_ss}
+
+Module sourcesets:
+ There are two dictionaries for modules: ``modules`` is used for
+ target-independent modules and ``target_modules`` is used for
+ target-dependent modules. When modules are disabled the ``module``
+ source sets are added to ``softmmu_ss`` and the ``target_modules``
+ source sets are added to ``specific_ss``.
+
+ Both dictionaries are nested. One dictionary is created per
+ subdirectory, and these per-subdirectory dictionaries are added to
+ the toplevel dictionaries. For example::
+
+ hw_display_modules = {}
+ qxl_ss = ss.source_set()
+ ...
+ hw_display_modules += { 'qxl': qxl_ss }
+ modules += { 'hw-display': hw_display_modules }
+
+Utility sourcesets:
+ All binaries link with a static library ``libqemuutil.a``. This library
+ is built from several sourcesets; most of them however host generated
+ code, and the only two of general interest are ``util_ss`` and ``stub_ss``.
+
+ The separation between these two is purely for documentation purposes.
+ ``util_ss`` contains generic utility files. Even though this code is only
+ linked in some binaries, sometimes it requires hooks only in some of
+ these and depends on other functions that are not fully implemented by
+ all QEMU binaries. ``stub_ss`` links dummy stubs that will only be linked
+ into the binary if the real implementation is not present. In a way,
+ the stubs can be thought of as a portable implementation of the weak
+ symbols concept.
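+
+ For instance (an illustrative sketch with hypothetical names), ``util_ss``
+ could contain the real helper and ``stub_ss`` its no-op fallback::
+
+   /* util/qemu-frob.c (util_ss): real implementation, needs extra deps */
+   void qemu_frob_notify(void)
+   {
+       frobnicate_all();
+   }
+
+   /* stubs/qemu-frob.c (stub_ss): linked only when the real one is absent */
+   void qemu_frob_notify(void)
+   {
+   }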
+
+
+The following files together determine which files are linked
+into each emulator:
+
+``default-configs/devices/*.mak``
+ The files under ``default-configs/devices/`` control the boards and devices
+ that are built into each QEMU system emulation target. They merely contain
+ a list of config variable definitions such as::
+
+ include arm-softmmu.mak
+ CONFIG_XLNX_ZYNQMP_ARM=y
+ CONFIG_XLNX_VERSAL=y
+
+``*/Kconfig``
+ These files are processed together with ``default-configs/devices/*.mak`` and
+ describe the dependencies between various features, subsystems and
+ device models. They are described in :ref:`kconfig`
+
+``default-configs/targets/*.mak``
+ These files mostly define symbols that appear in the ``*-config-target.h``
+ file for each emulator [#cfgtarget]_. However, the ``TARGET_ARCH``
+ and ``TARGET_BASE_ARCH`` will also be used to select the ``hw/`` and
+ ``target/`` subdirectories that are compiled into each target.
+
+.. [#cfgtarget] This header is included by ``qemu/osdep.h`` when
+ compiling files from the target-specific sourcesets.
+
+These files rarely need changing unless you are adding a completely
+new target, or enabling new devices or hardware for a particular
+system/userspace emulation target.
+
+
+Adding checks
+-------------
+
+New checks should be added to Meson. Compiler checks can be as simple as
+the following::
+
+ config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
+
+A more complex task such as adding a new dependency usually
+comprises the following tasks:
+
+ - Add a Meson build option to meson_options.txt.
+
+ - Add code to perform the actual feature check.
+
+ - Add code to include the feature status in ``config-host.h``
+
+ - Add code to print out the feature status in the configure summary
+ upon completion.
+
+Taking the probe for SDL2_Image as an example, we have the following
+in ``meson_options.txt``::
+
+ option('sdl_image', type : 'feature', value : 'auto',
+ description: 'SDL Image support for icons')
+
+Unless the option was given a non-``auto`` value (on the configure
+command line), the detection code must be performed only if the
+dependency will be used::
+
+ sdl_image = not_found
+ if not get_option('sdl_image').auto() or have_system
+ sdl_image = dependency('SDL2_image', required: get_option('sdl_image'),
+ method: 'pkg-config',
+ static: enable_static)
+ endif
+
+This avoids warnings on static builds of user-mode emulators, for example.
+Most of the libraries used by system-mode emulators are not available for
+static linking.
+
+The other supporting code is generally simple::
+
+ # Create config-host.h (if applicable)
+ config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
+
+ # Summary
+ summary_info += {'SDL image support': sdl_image.found()}
+
+For the configure script to parse the new option, the
+``scripts/meson-buildoptions.sh`` file must be up-to-date; ``make
+update-buildoptions`` (or just ``make``) will take care of updating it.
+
+
+Support scripts
+---------------
+
+Meson has a special convention for invoking Python scripts: if their
+first line is ``#! /usr/bin/env python3`` and the file is *not* executable,
+find_program() arranges to invoke the script under the same Python
+interpreter that was used to invoke Meson. This is the most common
+and preferred way to invoke support scripts from Meson build files,
+because it automatically uses the value of configure's --python= option.
+
+In case the script is not written in Python, use a ``#! /usr/bin/env ...``
+line and make the script executable.
+
+Scripts written in Python, where it is desirable to make the script
+executable (for example for test scripts that developers may want to
+invoke from the command line, such as tests/qapi-schema/test-qapi.py),
+should be invoked through the ``python`` variable in meson.build. For
+example::
+
+ test('QAPI schema regression tests', python,
+ args: files('test-qapi.py'),
+ env: test_env, suite: ['qapi-schema', 'qapi-frontend'])
+
+This is needed to obey the --python= option passed to the configure
+script, which may point to something other than the first python3
+binary on the path.
+
+
+Stage 3: makefiles
+==================
+
+The use of GNU make is required with the QEMU build system.
+
+The output of Meson is a build.ninja file, which is used with the Ninja
+build system. QEMU uses a different approach, where Makefile rules are
+synthesized from the build.ninja file. The main Makefile includes these
+rules and wraps them so that e.g. submodules are built before QEMU.
+The resulting build system is largely non-recursive in nature, in
+contrast to common practices seen with automake.
+
+Tests are also run by the Makefile with the traditional ``make check``
+phony target, while benchmarks are run with ``make bench``. Meson test
+suites such as ``unit`` can be run with ``make check-unit`` too. It is also
+possible to run tests defined in meson.build with ``meson test``.
+
+Useful make targets
+-------------------
+
+``help``
+ Print a help message for the most common build targets.
+
+``print-VAR``
+ Print the value of the variable VAR. Useful for debugging the build
+ system.
+
+Important files for the build system
+====================================
+
+Statically defined files
+------------------------
+
+The following key files are statically defined in the source tree, with
+the rules needed to build QEMU. Their behaviour is influenced by a
+number of dynamically created files listed later.
+
+``Makefile``
+ The main entry point used when invoking make to build all the components
+ of QEMU. The default 'all' target will naturally result in the build of
+ every component. Makefile takes care of recursively building submodules
+ directly via a non-recursive set of rules.
+
+``*/meson.build``
+ The meson.build file in the root directory is the main entry point for the
+ Meson build system, and it coordinates the configuration and build of all
+ executables. Build rules for various subdirectories are included in
+ other meson.build files spread throughout the QEMU source tree.
+
+``tests/Makefile.include``
+ Rules for external test harnesses. These include the TCG tests,
+ ``qemu-iotests`` and the Avocado-based integration tests.
+
+``tests/docker/Makefile.include``
+ Rules for Docker tests. Like ``tests/Makefile.include``, this file is included
+ directly by the top level Makefile; anything defined in this file will
+ influence the entire build system.
+
+``tests/vm/Makefile.include``
+ Rules for VM-based tests. Like ``tests/Makefile.include``, this file is included
+ directly by the top level Makefile; anything defined in this file will
+ influence the entire build system.
+
+Dynamically created files
+-------------------------
+
+The following files are generated dynamically by configure in order to
+control the behaviour of the statically defined makefiles. This avoids
+the need for QEMU makefiles to go through any pre-processing as seen
+with autotools, where Makefile.am generates Makefile.in which generates
+Makefile.
+
+Built by configure:
+
+``config-host.mak``
+ When configure has determined the characteristics of the build host it
+ will write a long list of variables to the config-host.mak file. This
+ provides the various install directories, compiler / linker flags and a
+ variety of ``CONFIG_*`` variables related to optionally enabled features.
+ This is imported by the top level Makefile and meson.build in order to
+ tailor the build output.
+
+ config-host.mak is also used as a dependency checking mechanism. If make
+ sees that the modification timestamp on configure is newer than that on
+ config-host.mak, then configure will be re-run.
+
+ The variables defined here are those which are applicable to all QEMU
+ build outputs. Variables which are potentially different for each
+ emulator target are defined by the next file...
+
+
+Built by Meson:
+
+``${TARGET-NAME}-config-devices.mak``
+ TARGET-NAME is again the name of a system or userspace emulator. The
+ config-devices.mak file is automatically generated by make using the
+ scripts/make_device_config.sh program, feeding it the
+ default-configs/$TARGET-NAME file as input.
+
+``config-host.h``, ``$TARGET_NAME-config-target.h``, ``$TARGET_NAME-config-devices.h``
+ These files are used by source code to determine what features are
+ enabled. They are generated from the contents of the corresponding
+ ``*.mak`` files using Meson's ``configure_file()`` function.
+
+``build.ninja``
+ The build rules.
+
+
+Built by Makefile:
+
+``Makefile.ninja``
+ A Makefile include that bridges to ninja for the actual build. The
+ Makefile is mostly a list of targets that Meson included in build.ninja.
+
+``Makefile.mtest``
+ The Makefile definitions that let "make check" run tests defined in
+ meson.build. The rules are produced from Meson's JSON description of
+ tests (obtained with "meson introspect --tests") through the script
+ scripts/mtest2make.py.
diff --git a/docs/devel/ci-definitions.rst.inc b/docs/devel/ci-definitions.rst.inc
new file mode 100644
index 000000000..6d5c6fd9f
--- /dev/null
+++ b/docs/devel/ci-definitions.rst.inc
@@ -0,0 +1,121 @@
+Definition of terms
+===================
+
+This section defines the terms used in this document and correlates them with
+what is currently used on QEMU.
+
+Automated tests
+---------------
+
+An automated test is written on a test framework using its generic test
+functions/classes. The test framework can run the tests and report their
+success or failure [1]_.
+
+An automated test has essentially three parts:
+
+1. The initialization of the parameters, where the inputs and the expected
+ results are set up;
+2. The call to the code that should be tested;
+3. An assertion, comparing the result from the previous call with the expected
+ result set during the initialization of the parameters. If the result
+ matches the expected result, the test has been successful; otherwise, it has
+ failed.
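+
+For example, a minimal test written against the GLib testing framework
+(which QEMU's unit tests use) shows the three parts; ``my_add()`` stands in
+for a hypothetical function under test::
+
+  static void test_add(void)
+  {
+      int expected = 4;                        /* 1. initialization  */
+      int actual = my_add(2, 2);               /* 2. call under test */
+      g_assert_cmpint(actual, ==, expected);   /* 3. assertion       */
+  }
+
+  int main(int argc, char **argv)
+  {
+      g_test_init(&argc, &argv, NULL);
+      g_test_add_func("/example/add", test_add);
+      return g_test_run();
+  }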
+
+Unit testing
+------------
+
+A unit test is responsible for exercising individual software components as a
+unit, like interfaces, data structures, and functionality, uncovering errors
+within the boundaries of a component. The verification effort is in the
+smallest software unit and focuses on the internal processing logic and data
+structures. A test case of unit tests should be designed to uncover errors due
+to erroneous computations, incorrect comparisons, or improper control flow [2]_.
+
+On QEMU, unit testing is represented by the 'check-unit' target from 'make'.
+
+Functional testing
+------------------
+
+A functional test focuses on the functional requirement of the software.
+Deriving sets of input conditions, the functional tests should fully exercise
+all the functional requirements for a program. Functional testing is
+complementary to other testing techniques, attempting to find errors like
+incorrect or missing functions, interface errors, behavior errors, and
+initialization and termination errors [3]_.
+
+On QEMU, functional testing is represented by the 'check-qtest' target from
+'make'.
+
+System testing
+--------------
+
+System tests ensure all application elements mesh properly while the overall
+functionality and performance are achieved [4]_. Some or all system components
+are integrated to create a complete system to be tested as a whole. System
+testing ensures that components are compatible, interact correctly, and
+transfer the right data at the right time across their interfaces. As system
+testing focuses on interactions, use case-based testing is a practical approach
+to system testing [5]_. Note that, in some cases, system testing may require
+interaction with third-party software, like operating system images, databases,
+networks, and so on.
+
+On QEMU, system testing is represented by the 'check-avocado' target from
+'make'.
+
+Flaky tests
+-----------
+
+A flaky test is defined as a test that exhibits both a passing and a failing
+result with the same code on different runs. Some usual reasons for an
+intermittent/flaky test are async wait, concurrency, and test order dependency
+[6]_.
+
+Gating
+------
+
+A gate restricts the movement of code from one stage of a test/deployment
+pipeline to the next. The move to the next step is granted by an approval,
+which can be a manual intervention or a set of tests succeeding [7]_.
+
+On QEMU, the gating process happens during the pull request. The approval is
+done by the project leader running their own set of tests. The pull request
+gets merged when the tests succeed.
+
+Continuous Integration (CI)
+---------------------------
+
+Continuous integration (CI) requires building the entire application and
+executing a comprehensive set of automated tests every time a set of changes
+needs to be committed [8]_. The automated tests can be composed of unit,
+functional, system, and other tests.
+
+Key points about continuous integration (CI) [9]_:
+
+1. System tests may depend on external software (operating system images,
+ firmware, database, network).
+2. It may take a long time to build and test. It may be impractical to build
+ the system being developed several times per day.
+3. If the development platform is different from the target platform, it may
+ not be possible to run system tests in the developer’s private workspace.
+ There may be differences in hardware, operating system, or installed
+ software. Therefore, more time is required for testing the system.
+
+References
+----------
+
+.. [1] Sommerville, Ian (2016). Software Engineering. p. 233.
+.. [2] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering,
+ A Practitioner’s Approach. p. 48, 376, 378, 381.
+.. [3] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering,
+ A Practitioner’s Approach. p. 388.
+.. [4] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering,
+ A Practitioner’s Approach. Software Engineering, p. 377.
+.. [5] Sommerville, Ian (2016). Software Engineering. p. 59, 232, 240.
+.. [6] Luo, Qingzhou, et al. An empirical analysis of flaky tests.
+ Proceedings of the 22nd ACM SIGSOFT International Symposium on
+ Foundations of Software Engineering. 2014.
+.. [7] Humble, Jez & Farley, David (2010). Continuous Delivery:
+ Reliable Software Releases Through Build, Test, and Deployment, p. 122.
+.. [8] Humble, Jez & Farley, David (2010). Continuous Delivery:
+ Reliable Software Releases Through Build, Test, and Deployment, p. 55.
+.. [9] Sommerville, Ian (2016). Software Engineering. p. 743.
diff --git a/docs/devel/ci-jobs.rst.inc b/docs/devel/ci-jobs.rst.inc
new file mode 100644
index 000000000..db3f571d5
--- /dev/null
+++ b/docs/devel/ci-jobs.rst.inc
@@ -0,0 +1,58 @@
+Custom CI/CD variables
+======================
+
+QEMU CI pipelines can be tuned by setting some CI environment variables.
+
+Set variable globally in the user's CI namespace
+------------------------------------------------
+
+Variables can be set globally in the user's CI namespace settings.
+
+For further information about how to set these variables, please refer to::
+
+ https://docs.gitlab.com/ee/ci/variables/#add-a-cicd-variable-to-a-project
+
+Set variable manually when pushing a branch or tag to the user's repository
+---------------------------------------------------------------------------
+
+Variables can be set manually when pushing a branch or tag, using
+git-push command line arguments.
+
+Example setting the QEMU_CI_EXAMPLE_VAR variable:
+
+.. code::
+
+ git push -o ci.variable="QEMU_CI_EXAMPLE_VAR=value" myrepo mybranch
+
+For further information about how to set these variables, please refer to::
+
+ https://docs.gitlab.com/ee/user/project/push_options.html#push-options-for-gitlab-cicd
+
+Here is a list of the most used variables:
+
+QEMU_CI_AVOCADO_TESTING
+~~~~~~~~~~~~~~~~~~~~~~~
+By default, tests using the Avocado framework are not run automatically in
+the pipelines (because multiple artifacts have to be downloaded, and if
+these artifacts are not already cached, downloading them makes the jobs
+reach the timeout limit). Set this variable to have the tests using the
+Avocado framework run automatically.
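+
+For example, to enable it for a single push, the push-option syntax shown
+above can be reused (the value ``1`` is only an example; what matters is
+that the variable is set):
+
+.. code::
+
+   git push -o ci.variable="QEMU_CI_AVOCADO_TESTING=1" myrepo mybranch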
+
+AARCH64_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an aarch64 host that can be used as a gitlab-CI
+runner, you can set this variable to enable the tests that require this
+kind of host. The runner should be tagged with "aarch64".
+
+S390X_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an IBM Z host that can be used as a gitlab-CI
+runner, you can set this variable to enable the tests that require this
+kind of host. The runner should be tagged with "s390x".
+
+CENTOS_STREAM_8_x86_64_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to a CentOS Stream 8 x86_64 host that can be
+used as a gitlab-CI runner, you can set this variable to enable the
+tests that require this kind of host. The runner should be tagged with
+both "centos_stream_8" and "x86_64".
diff --git a/docs/devel/ci-runners.rst.inc b/docs/devel/ci-runners.rst.inc
new file mode 100644
index 000000000..7817001fb
--- /dev/null
+++ b/docs/devel/ci-runners.rst.inc
@@ -0,0 +1,117 @@
+Jobs on Custom Runners
+======================
+
+Besides the jobs run under the various CI systems listed before, there
+are a number of additional jobs that will run before an actual merge.
+These use the same GitLab CI service/framework already used for all
+other GitLab based CI jobs, but rely on additional systems, not the
+ones provided by GitLab as "shared runners".
+
+The architecture of GitLab's CI service allows different machines to
+be set up with GitLab's "agent", called gitlab-runner, which will take
+care of running jobs created by events such as a push to a branch.
+Here, a machine properly configured with GitLab's gitlab-runner is
+called a "custom runner".
+
+The GitLab CI job definitions for the custom runners are located under::
+
+ .gitlab-ci.d/custom-runners.yml
+
+Custom runners entail custom machines. To see a list of the machines
+currently deployed in the QEMU GitLab CI and their maintainers, please
+refer to the QEMU `wiki <https://wiki.qemu.org/AdminContacts>`__.
+
+Machine Setup Howto
+-------------------
+
+For all Linux based systems, the setup can be mostly automated by the
+execution of two Ansible playbooks. Create an ``inventory`` file
+under ``scripts/ci/setup``, such as this::
+
+ fully.qualified.domain
+ other.machine.hostname
+
+You may need to set some variables in the inventory file itself. One
+very common need is to tell Ansible to use a Python 3 interpreter on
+those hosts. This would look like::
+
+ fully.qualified.domain ansible_python_interpreter=/usr/bin/python3
+ other.machine.hostname ansible_python_interpreter=/usr/bin/python3
+
+Build environment
+~~~~~~~~~~~~~~~~~
+
+The ``scripts/ci/setup/build-environment.yml`` Ansible playbook will
+set up machines with the environment needed to perform builds and run
+QEMU tests. This playbook consists of the installation of various
+required packages (and a general package update while at it). It
+currently covers a number of different Linux distributions, but it can
+be expanded to cover other systems.
+
+The minimum required version of Ansible successfully tested in this
+playbook is 2.8.0 (a version check is embedded within the playbook
+itself). To run the playbook, execute::
+
+ cd scripts/ci/setup
+ ansible-playbook -i inventory build-environment.yml
+
+Please note that most of the tasks in the playbook require superuser
+privileges, such as those from the ``root`` account or those obtained
+by ``sudo``. If necessary, please refer to ``ansible-playbook``
+options such as ``--become``, ``--become-method``, ``--become-user``
+and ``--ask-become-pass``.
+
+gitlab-runner setup and registration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The gitlab-runner agent needs to be installed on each machine that
+will run jobs. The association between a machine and a GitLab project
+happens with a registration token. To find the registration token for
+your repository/project, navigate on GitLab's web UI to:
+
+ * Settings (the gears-like icon at the bottom of the left hand side
+ vertical toolbar), then
+ * CI/CD, then
+ * Runners, and click on the "Expand" button, then
+ * Under "Set up a specific Runner manually", look for the value under
+ "And this registration token:"
+
+Copy the ``scripts/ci/setup/vars.yml.template`` file to
+``scripts/ci/setup/vars.yml``. Then, set the
+``gitlab_runner_registration_token`` variable to the value obtained
+earlier.
+
+To run the playbook, execute::
+
+ cd scripts/ci/setup
+ ansible-playbook -i inventory gitlab-runner.yml
+
+Following the registration, it's necessary to configure the runner tags,
+and optionally other settings, on the GitLab UI. Navigate to:
+
+ * Settings (the gears like icon), then
+ * CI/CD, then
+ * Runners, and click on the "Expand" button, then
+ * "Runners activated for this project", then
+ * Click on the "Edit" icon (next to the "Lock" Icon)
+
+Tags are very important as they are used to route specific jobs to
+specific types of runners, so it's a good idea to double check that
+the automatically created tags are consistent with the OS and
+architecture. For instance, an Ubuntu 20.04 aarch64 system should
+have tags set as::
+
+ ubuntu_20.04,aarch64
+
+Because the job definition at ``.gitlab-ci.d/custom-runners.yml``
+would contain::
+
+ ubuntu-20.04-aarch64-all:
+ tags:
+ - ubuntu_20.04
+ - aarch64
+
+It's also recommended to:
+
+ * increase the "Maximum job timeout" to something like ``2h``
+ * give it a better Description
diff --git a/docs/devel/ci.rst b/docs/devel/ci.rst
new file mode 100644
index 000000000..d10661009
--- /dev/null
+++ b/docs/devel/ci.rst
@@ -0,0 +1,13 @@
+==
+CI
+==
+
+QEMU has configurations enabled for a number of different CI services.
+The most up to date information about them and their status can be
+found at::
+
+ https://wiki.qemu.org/Testing/CI
+
+.. include:: ci-definitions.rst.inc
+.. include:: ci-jobs.rst.inc
+.. include:: ci-runners.rst.inc
diff --git a/docs/devel/clocks.rst b/docs/devel/clocks.rst
new file mode 100644
index 000000000..675fbeb6a
--- /dev/null
+++ b/docs/devel/clocks.rst
@@ -0,0 +1,528 @@
+Modelling a clock tree in QEMU
+==============================
+
+What are clocks?
+----------------
+
+Clocks are QOM objects developed for the purpose of modelling the
+distribution of clocks in QEMU.
+
+They allow us to model the clock distribution of a platform and detect
+configuration errors in the clock tree, such as a badly configured PLL, a
+wrong clock source selection, or a disabled clock.
+
+The object is *Clock* and its QOM name is ``clock`` (in C code, the macro
+``TYPE_CLOCK``).
+
+Clocks are typically used with devices, where they model clock inputs
+and outputs. They are created in a similar way to GPIOs. Inputs and outputs
+of different devices can be connected together.
+
+In these cases a Clock object is a child of a Device object, but this
+is not a requirement. Clocks can be independent of devices. For
+example it is possible to create a clock outside of any device to
+model the main clock source of a machine.
+
+Here is an example of clocks::
+
+ +---------+ +----------------------+ +--------------+
+ | Clock 1 | | Device B | | Device C |
+ | | | +-------+ +-------+ | | +-------+ |
+ | |>>-+-->>|Clock 2| |Clock 3|>>--->>|Clock 6| |
+ +---------+ | | | (in) | | (out) | | | | (in) | |
+ | | +-------+ +-------+ | | +-------+ |
+ | | +-------+ | +--------------+
+ | | |Clock 4|>>
+ | | | (out) | | +--------------+
+ | | +-------+ | | Device D |
+ | | +-------+ | | +-------+ |
+ | | |Clock 5|>>--->>|Clock 7| |
+ | | | (out) | | | | (in) | |
+ | | +-------+ | | +-------+ |
+ | +----------------------+ | |
+ | | +-------+ |
+ +----------------------------->>|Clock 8| |
+ | | (in) | |
+ | +-------+ |
+ +--------------+
+
+Clocks are defined in the ``include/hw/clock.h`` header and device
+related functions are defined in the ``include/hw/qdev-clock.h``
+header.
+
+The clock state
+---------------
+
+The state of a clock is its period; it is stored as an integer
+representing it in units of 2 :sup:`-32` ns. The special value of 0 is used to
+represent the clock being inactive or gated. The clocks do not model
+the signal itself (pin toggling) or other properties such as the duty
+cycle.
+
+All clocks contain this state: outputs as well as inputs. This allows
+the current period of a clock to be fetched at any time. When a clock
+is updated, the value is immediately propagated to all connected
+clocks in the tree.
+
+To ease interaction with clocks, helpers with a unit suffix are defined for
+every clock state setter or getter. The suffixes are:
+
+- ``_ns`` for handling periods in nanoseconds
+- ``_hz`` for handling frequencies in hertz
+
+The 0 period value is converted to 0 in hertz and vice versa. 0 always means
+that the clock is disabled.
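+
+For example, the following two calls (on some clock ``clk``) describe the
+same clock state, once as a period and once as a frequency:
+
+.. code-block:: c
+
+    clock_set_ns(clk, 10);                /* a period of 10 ns ...        */
+    clock_set_hz(clk, 100 * 1000 * 1000); /* ... i.e. a 100 MHz frequency */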
+
+Adding a new clock
+------------------
+
+Adding clocks to a device must be done during the init method of the Device
+instance.
+
+To add an input clock to a device, the function ``qdev_init_clock_in()``
+must be used. It takes the name, a callback, an opaque parameter
+for the callback and a mask of events when the callback should be
+called (this will be explained in a following section).
+Output is simpler; only the name is required. Typically::
+
+ qdev_init_clock_in(DEVICE(dev), "clk_in", clk_in_callback, dev, ClockUpdate);
+ qdev_init_clock_out(DEVICE(dev), "clk_out");
+
+Both functions return the created Clock pointer, which should be saved in the
+device's state structure for further use.
+
+These objects will be automatically deleted by the QOM reference mechanism.
+
+Note that it is possible to create a static array describing clock inputs and
+outputs. The function ``qdev_init_clocks()`` must be called with the array as
+parameter to initialize the clocks: it has the same behaviour as calling
+``qdev_init_clock_in/out()`` for each clock in the array. To ease the array
+construction, some macros are defined in ``include/hw/qdev-clock.h``.
+As an example, the following adds 2 clocks to a device: one input and one
+output.
+
+.. code-block:: c
+
+ /* device structure containing pointers to the clock objects */
+ typedef struct MyDeviceState {
+ DeviceState parent_obj;
+ Clock *clk_in;
+ Clock *clk_out;
+ } MyDeviceState;
+
+ /*
+ * callback for the input clock (see "Callback on input clock
+ * change" section below for more information).
+ */
+ static void clk_in_callback(void *opaque, ClockEvent event);
+
+ /*
+ * static array describing clocks:
+ * + a clock input named "clk_in", whose pointer is stored in
+ * the clk_in field of a MyDeviceState structure with callback
+ * clk_in_callback.
+ * + a clock output named "clk_out" whose pointer is stored in
+ * the clk_out field of a MyDeviceState structure.
+ */
+ static const ClockPortInitArray mydev_clocks = {
+ QDEV_CLOCK_IN(MyDeviceState, clk_in, clk_in_callback, ClockUpdate),
+ QDEV_CLOCK_OUT(MyDeviceState, clk_out),
+ QDEV_CLOCK_END
+ };
+
+ /* device initialization function */
+ static void mydev_init(Object *obj)
+ {
+ /* cast to MyDeviceState */
+ MyDeviceState *mydev = MYDEVICE(obj);
+ /* create and fill the pointer fields in the MyDeviceState */
+ qdev_init_clocks(mydev, mydev_clocks);
+ [...]
+ }
+
+An alternative way to create a clock is to simply call
+``object_new(TYPE_CLOCK)``. In that case the clock will neither be an
+input nor an output of a device. After the whole QOM hierarchy of the
+clock has been set ``clock_setup_canonical_path()`` should be called.
+
+At creation, the period of the clock is 0: the clock is disabled. You can
+change it using ``clock_set_ns()`` or ``clock_set_hz()``.
+
+Note that if you are creating a clock with a fixed period which will never
+change (for example the main clock source of a board), then you'll have
+nothing else to do. This value will be propagated to other clocks when
+connecting the clocks together and devices will fetch the right value during
+the first reset.
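+
+As a sketch, a machine might create its main oscillator like this (the
+``machine`` object, the "osc" name and the 24 MHz value are only
+illustrative):
+
+.. code-block:: c
+
+    /* a free-standing clock, not tied to any device */
+    Clock *osc = CLOCK(object_new(TYPE_CLOCK));
+
+    /* attach it to the QOM tree, then fix its canonical path and period */
+    object_property_add_child(OBJECT(machine), "osc", OBJECT(osc));
+    clock_setup_canonical_path(osc);
+    clock_set_hz(osc, 24 * 1000 * 1000);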
+
+Clock callbacks
+---------------
+
+You can give a clock a callback function in several ways:
+
+ * by passing it as an argument to ``qdev_init_clock_in()``
+ * as an argument to the ``QDEV_CLOCK_IN()`` macro initializing an
+ array to be passed to ``qdev_init_clocks()``
+ * by directly calling the ``clock_set_callback()`` function
+
+The callback function must be of this type:
+
+.. code-block:: c
+
+ typedef void ClockCallback(void *opaque, ClockEvent event);
+
+The ``opaque`` argument is the pointer passed to ``qdev_init_clock_in()``
+or ``clock_set_callback()``; for ``qdev_init_clocks()`` it is the
+``dev`` device pointer.
+
+The ``event`` argument specifies why the callback has been called.
+When you register the callback you specify a mask of ClockEvent values
+that you are interested in. The callback will only be called for those
+events.
+
+The events currently supported are:
+
+ * ``ClockPreUpdate`` : called when the input clock's period is about to
+ update. This is useful if the device needs to do some action for
+ which it needs to know the old value of the clock period. During
+ this callback, Clock API functions like ``clock_get()`` or
+ ``clock_ticks_to_ns()`` will use the old period.
+ * ``ClockUpdate`` : called after the input clock's period has changed.
+ During this callback, Clock API functions like ``clock_ticks_to_ns()``
+ will use the new period.
+
+Note that a clock only has one callback: it is not possible to register
+different functions for different events. You must register a single
+callback which listens for all of the events you are interested in,
+and use the ``event`` argument to identify which event has happened.
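+
+For example, a callback registered with the mask
+``ClockPreUpdate | ClockUpdate`` might be structured like this (the device
+state and field names are illustrative):
+
+.. code-block:: c
+
+    static void my_clk_callback(void *opaque, ClockEvent event)
+    {
+        MyDeviceState *s = opaque;
+
+        if (event == ClockPreUpdate) {
+            /* clock_get(s->clk_in) still returns the old period here */
+        } else if (event == ClockUpdate) {
+            /* clock_get(s->clk_in) now returns the new period */
+        }
+    }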
+
+Retrieving clocks from a device
+-------------------------------
+
+``qdev_get_clock_in()`` and ``qdev_get_clock_out()`` are available to
+get the clock inputs or outputs of a device. For example:
+
+.. code-block:: c
+
+ Clock *clk = qdev_get_clock_in(DEVICE(mydev), "clk_in");
+
+or:
+
+.. code-block:: c
+
+ Clock *clk = qdev_get_clock_out(DEVICE(mydev), "clk_out");
+
+Connecting two clocks together
+------------------------------
+
+To connect two clocks together, use the ``clock_set_source()`` function.
+Given two clocks ``clk1``, and ``clk2``, ``clock_set_source(clk2, clk1);``
+configures ``clk2`` to follow the ``clk1`` period changes. Every time ``clk1``
+is updated, ``clk2`` will be updated too.
+
+When connecting clocks between devices, prefer using the
+``qdev_connect_clock_in()`` function to set the source of an input
+device clock. For example, to connect the input clock ``clk2`` of
+``devB`` to the output clock ``clk1`` of ``devA``, do:
+
+.. code-block:: c
+
+ qdev_connect_clock_in(devB, "clk2", qdev_get_clock_out(devA, "clk1"))
+
+We used ``qdev_get_clock_out()`` above, but any clock can drive an
+input clock, even another input clock. The following diagram shows
+some examples of connections. Note also that a clock can drive several
+other clocks.
+
+::
+
+ +------------+ +--------------------------------------------------+
+ | Device A | | Device B |
+ | | | +---------------------+ |
+ | | | | Device C | |
+ | +-------+ | | +-------+ | +-------+ +-------+ | +-------+ |
+ | |Clock 1|>>-->>|Clock 2|>>+-->>|Clock 3| |Clock 5|>>>>|Clock 6|>>
+ | | (out) | | | | (in) | | | | (in) | | (out) | | | (out) | |
+ | +-------+ | | +-------+ | | +-------+ +-------+ | +-------+ |
+ +------------+ | | +---------------------+ |
+ | | |
+ | | +--------------+ |
+ | | | Device D | |
+ | | | +-------+ | |
+ | +-->>|Clock 4| | |
+ | | | (in) | | |
+ | | +-------+ | |
+ | +--------------+ |
+ +--------------------------------------------------+
+
+In the above example, when *Clock 1* is updated by *Device A*, three
+clocks get the new clock period value: *Clock 2*, *Clock 3* and *Clock 4*.
+
+It is not possible to disconnect a clock or to change the clock connection
+after it is connected.
+
+Clock multiplier and divider settings
+-------------------------------------
+
+By default, when clocks are connected together, the child
+clocks run with the same period as their source (parent) clock.
+The Clock API supports a built-in period multiplier/divider
+mechanism so you can configure a clock to make its children
+run at a different period from its own. If you call the
+``clock_set_mul_div()`` function you can specify the clock's
+multiplier and divider values. The children of that clock
+will all run with a period of ``parent_period * multiplier / divider``.
+For instance, if the clock has a frequency of 8MHz and you set its
+multiplier to 2 and its divider to 3, the child clocks will run
+at 12MHz.
+
+You can change the multiplier and divider of a clock at runtime,
+so you can use this to model clock controller devices which
+have guest-programmable frequency multipliers or dividers.
+
+Note that ``clock_set_mul_div()`` does not automatically call
+``clock_propagate()``. If you make a runtime change to the
+multiplier or divider you must call ``clock_propagate()`` yourself.
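+
+For example, a clock controller model might do the following when the
+guest reprograms a divider register (the ``s->clk_out`` field and the
+2/3 ratio are only illustrative):
+
+.. code-block:: c
+
+    /* children of clk_out now run at its period * 2 / 3, i.e. 1.5x its frequency */
+    clock_set_mul_div(s->clk_out, 2, 3);
+    clock_propagate(s->clk_out);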
+
+Unconnected input clocks
+------------------------
+
+A newly created input clock is disabled (period of 0). This means the
+clock will be considered as disabled until the period is updated. If
+the clock remains unconnected it will always keep its initial value
+of 0. If this is not the desired behaviour, ``clock_set()``,
+``clock_set_ns()`` or ``clock_set_hz()`` should be called on the Clock
+object during device instance init. For example:
+
+.. code-block:: c
+
+ clk = qdev_init_clock_in(DEVICE(dev), "clk-in", clk_in_callback,
+ dev, ClockUpdate);
+ /* set initial value to 10ns / 100MHz */
+ clock_set_ns(clk, 10);
+
+To enforce that the clock is wired up by the board code, you can
+call ``clock_has_source()`` in your device's realize method:
+
+.. code-block:: c
+
+ if (!clock_has_source(s->clk)) {
+ error_setg(errp, "MyDevice: clk input must be connected");
+ return;
+ }
+
+Note that this only checks that the clock has been wired up; it is
+still possible that the output clock connected to it is disabled
+or has not yet been configured, in which case the period will be
+zero. You should use the clock callback to find out when the clock
+period changes.
+
+Fetching clock frequency/period
+-------------------------------
+
+To get the current state of a clock, use the functions ``clock_get()``
+or ``clock_get_hz()``.
+
+``clock_get()`` returns the period of the clock in its fully precise
+internal representation, as an unsigned 64-bit integer in units of
+2^-32 nanoseconds. (For many purposes ``clock_ticks_to_ns()`` will
+be more convenient; see the section below on expiry deadlines.)
+
+``clock_get_hz()`` returns the frequency of the clock, rounded to the
+next lowest integer. This implies some inaccuracy due to the rounding,
+so be cautious about using it in calculations.
+
+It is also possible to register a callback on clock frequency changes.
+Here is an example, which assumes that ``clock_callback`` has been
+specified as the callback for the ``ClockUpdate`` event:
+
+.. code-block:: c
+
+   void clock_callback(void *opaque, ClockEvent event) {
+       MyDeviceState *s = (MyDeviceState *) opaque;
+       /*
+        * 'opaque' is the argument passed to qdev_init_clock_in();
+        * usually this will be the device state pointer.
+        */
+
+       /* do something with the new period */
+       fprintf(stdout, "device new period is %" PRIu64 " * 2^-32 ns\n",
+               clock_get(s->my_clk_input));
+   }
+
+If you are only interested in the frequency for displaying it to
+humans (for instance in debugging), use ``clock_display_freq()``,
+which returns a prettified string-representation, e.g. "33.3 MHz".
+The caller must free the string with g_free() after use.
+
+Calculating expiry deadlines
+----------------------------
+
+A commonly required operation for a clock is to calculate how long
+it will take for the clock to tick N times; this can then be used
+to set a timer expiry deadline. Use the function ``clock_ticks_to_ns()``,
+which takes an unsigned 64-bit count of ticks and returns the length
+of time in nanoseconds required for the clock to tick that many times.
+
+It is important not to try to calculate expiry deadlines using a
+shortcut like multiplying a "period of clock in nanoseconds" value
+by the tick count, because clocks can have periods which are not a
+whole number of nanoseconds, and the accumulated error in the
+multiplication can be significant.
+
+For a clock with a very long period and a large number of ticks,
+the result of this function could in theory be too large to fit in
+a 64-bit value. To avoid overflow in this case, ``clock_ticks_to_ns()``
+saturates the result to INT64_MAX (because this is the largest valid
+input to the QEMUTimer APIs). Since INT64_MAX nanoseconds is almost
+300 years, anything with an expiry later than that is in the "will
+never happen" category. Callers of ``clock_ticks_to_ns()`` should
+therefore generally not special-case the possibility of a saturated
+result but just allow the timer to be set to that far-future value.
+(If you are performing further calculations on the returned value
+rather than simply passing it to a QEMUTimer function like
+``timer_mod_ns()`` then you should be careful to avoid overflow
+in those calculations, of course.)
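+
+A typical use, assuming a device state ``s`` with a QEMUTimer, an input
+clock and a desired tick count ``count`` (all names illustrative), and
+ignoring the far-future saturation case discussed above, might be:
+
+.. code-block:: c
+
+    /* arm the timer to fire when the input clock has ticked 'count' times */
+    int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+    timer_mod_ns(&s->timer, now + clock_ticks_to_ns(s->clk_in, count));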
+
+Obtaining tick counts
+---------------------
+
+For calculations where you need to know the number of ticks in
+a given duration, use ``clock_ns_to_ticks()``. This function handles
+possible non-whole-number-of-nanoseconds periods and avoids
+potential rounding errors. It will return '0' if the clock is stopped
+(i.e. it has period zero). If the inputs imply a tick count that
+overflows a 64-bit value (a very long duration for a clock with a
+very short period) the output value is truncated, so effectively
+the 64-bit output wraps around.
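+
+For example, to find how many ticks of an input clock fit into a 1 ms
+window (``s->clk_in`` being an illustrative input clock):
+
+.. code-block:: c
+
+    uint64_t ticks = clock_ns_to_ticks(s->clk_in, 1000 * 1000); /* 1 ms */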
+
+Changing a clock period
+-----------------------
+
+A device can change its outputs using the ``clock_update()``,
+``clock_update_ns()`` or ``clock_update_hz()`` function. It will trigger
+updates on every connected input.
+
+For example, let's say that we have an output clock *clkout* and we
+have a pointer to it in the device state because we did the following
+in init phase:
+
+.. code-block:: c
+
+ dev->clkout = qdev_init_clock_out(DEVICE(dev), "clkout");
+
+Then at any time (apart from the cases listed below), it is possible to
+change the clock value by doing:
+
+.. code-block:: c
+
+ clock_update_hz(dev->clkout, 1000 * 1000 * 1000); /* 1GHz */
+
+Because updating a clock may trigger any side effects through
+connected clocks and their callbacks, this operation must be done
+while holding the qemu io lock.
+
+For the same reason, clocks can only be updated when side effects on other
+objects are allowed. Consequently, it is forbidden:
+
+* during migration,
+* and in the enter phase of reset.
+
+Note that calling ``clock_update[_ns|_hz]()`` is equivalent to calling
+``clock_set[_ns|_hz]()`` (with the same arguments) then
+``clock_propagate()`` on the clock. Thus, setting the clock value can
+be separated from triggering the side-effects. This is often needed to
+factor out the code that handles reset and migration in devices.
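+
+For example, a device model might record a new period while side effects
+are not yet allowed and propagate it later (the ``dev->clkout`` field and
+the 48 MHz value are illustrative):
+
+.. code-block:: c
+
+    /* e.g. in the reset enter phase: record the new period only */
+    clock_set_hz(dev->clkout, 48 * 1000 * 1000);
+
+    /* later, once side effects are allowed: trigger them */
+    clock_propagate(dev->clkout);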
+
+Aliasing clocks
+---------------
+
+Sometimes, one needs to forward, or inherit, a clock from another
+device. Typically, when doing device composition, a device might
+expose a sub-device's clock without interfering with it. The function
+``qdev_alias_clock()`` can be used to achieve this behaviour. Note
+that it is possible to expose the clock under a different name.
+``qdev_alias_clock()`` works for both input and output clocks.
+
+For example, if device B is a child of device A,
+``device_a_instance_init()`` may do something like this:
+
+.. code-block:: c
+
+ void device_a_instance_init(Object *obj)
+ {
+ AState *A = DEVICE_A(obj);
+ BState *B;
+ /* create object B as child of A */
+ [...]
+ qdev_alias_clock(B, "clk", A, "b_clk");
+ /*
+ * Now A has a clock "b_clk" which is an alias to
+ * the clock "clk" of its child B.
+ */
+ }
+
+This function does not return any clock object. The new clock has the
+same direction (input or output) as the original one. This function
+only adds a link to the existing clock. In the above example, object B
+remains the only object allowed to use the clock and device A must not
+try to change the clock period or set a callback to the clock. This
+diagram describes the example with an input clock::
+
+ +--------------------------+
+ | Device A |
+ | +--------------+ |
+ | | Device B | |
+ | | +-------+ | |
+ >>"b_clk">>>| "clk" | | |
+ | (in) | | (in) | | |
+ | | +-------+ | |
+ | +--------------+ |
+ +--------------------------+
+
+Migration
+---------
+
+Clock state is not migrated automatically. Every device must handle its
+clock migration. Alias clocks must not be migrated.
+
+To ensure clock states are restored correctly during migration, there
+are two solutions.
+
+Clock states can be migrated by adding an entry into the device
+vmstate description. You should use the ``VMSTATE_CLOCK`` macro for this.
+This is typically used to migrate an input clock state. For example:
+
+.. code-block:: c
+
+ MyDeviceState {
+ DeviceState parent_obj;
+ [...] /* some fields */
+ Clock *clk;
+ };
+
+ VMStateDescription my_device_vmstate = {
+ .name = "my_device",
+ .fields = (VMStateField[]) {
+ [...], /* other migrated fields */
+ VMSTATE_CLOCK(clk, MyDeviceState),
+ VMSTATE_END_OF_LIST()
+ }
+ };
+
+The second solution is to restore the clock state using information already
+at our disposal. This can be used to restore output clock states using the
+device state. The functions ``clock_set[_ns|_hz]()`` can be used during the
+``post_load()`` migration callback.
+
+When adding clock support to an existing device, if you care about
+migration compatibility you will need to be careful, as simply adding
+a ``VMSTATE_CLOCK()`` line will break compatibility. Instead, you can
+put the ``VMSTATE_CLOCK()`` line into a vmstate subsection with a
+suitable ``needed`` function, and use ``clock_set()`` in a
+``pre_load()`` function to set the default value that will be used if
+the source virtual machine in the migration does not send the clock
+state.
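+
+A sketch of that approach (device, field and function names as well as the
+default frequency are illustrative) could look like:
+
+.. code-block:: c
+
+    static bool my_device_clk_needed(void *opaque)
+    {
+        return true; /* or a condition tied to a machine-type compat flag */
+    }
+
+    static const VMStateDescription vmstate_my_device_clk = {
+        .name = "my_device/clk",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .needed = my_device_clk_needed,
+        .fields = (VMStateField[]) {
+            VMSTATE_CLOCK(clk, MyDeviceState),
+            VMSTATE_END_OF_LIST()
+        }
+    };
+
+    static int my_device_pre_load(void *opaque)
+    {
+        MyDeviceState *s = opaque;
+
+        /* default used if the source does not send the clock subsection */
+        clock_set_hz(s->clk, 100 * 1000 * 1000);
+        return 0;
+    }
+
+    VMStateDescription my_device_vmstate = {
+        .name = "my_device",
+        .pre_load = my_device_pre_load,
+        .fields = (VMStateField[]) {
+            [...], /* other migrated fields */
+            VMSTATE_END_OF_LIST()
+        },
+        .subsections = (const VMStateDescription * []) {
+            &vmstate_my_device_clk,
+            NULL
+        }
+    };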
+
+Care should be taken not to use ``clock_update[_ns|_hz]()`` or
+``clock_propagate()`` during the whole migration procedure because they
+will trigger side effects on other devices in an unknown state.
diff --git a/docs/devel/code-of-conduct.rst b/docs/devel/code-of-conduct.rst
new file mode 100644
index 000000000..195444d1b
--- /dev/null
+++ b/docs/devel/code-of-conduct.rst
@@ -0,0 +1,60 @@
+Code of Conduct
+===============
+
+The QEMU community is made up of a mixture of professionals and
+volunteers from all over the world. Diversity is one of our strengths,
+but it can also lead to communication issues and unhappiness.
+To that end, we have a few ground rules that we ask people to adhere to.
+
+* Be welcoming. We are committed to making participation in this project
+ a harassment-free experience for everyone, regardless of level of
+ experience, gender, gender identity and expression, sexual orientation,
+ disability, personal appearance, body size, race, ethnicity, age, religion,
+ or nationality.
+
+* Be respectful. Not all of us will agree all the time. Disagreements, both
+ social and technical, happen all the time and the QEMU community is no
+ exception. When we disagree, we try to understand why. It is important that
+ we resolve disagreements and differing views constructively. Members of the
+ QEMU community should be respectful when dealing with other contributors as
+ well as with people outside the QEMU community and with users of QEMU.
+
+Harassment and other exclusionary behavior are not acceptable. A community
+where people feel uncomfortable or threatened is neither welcoming nor
+respectful. Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery
+
+* Personal attacks
+
+* Trolling or insulting/derogatory comments
+
+* Public or private harassment
+
+* Publishing others' private information, such as physical or electronic
+ addresses, without explicit permission
+
+This isn't an exhaustive list of things that you can't do. Rather, take
+it in the spirit in which it's intended: a guide to make it easier to
+be excellent to each other.
+
+This code of conduct applies to all spaces managed by the QEMU project.
+This includes IRC, the mailing lists, the issue tracker, community
+events, and any other forums created by the project team which the
+community uses for communication. This code of conduct also applies
+outside these spaces, when an individual acts as a representative or a
+member of the project or its community.
+
+By adopting this code of conduct, project maintainers commit themselves
+to fairly and consistently applying these principles to every aspect of
+managing this project. If you believe someone is violating the code of
+conduct, please read the :ref:`conflict-resolution` document for
+information about how to proceed.
+
+Sources
+-------
+
+This document is based on the `Fedora Code of Conduct
+<http://web.archive.org/web/20210429132536/https://docs.fedoraproject.org/en-US/project/code-of-conduct/>`__
+(as of April 2021) and the `Contributor Covenant version 1.3.0
+<https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/>`__.
diff --git a/docs/devel/conflict-resolution.rst b/docs/devel/conflict-resolution.rst
new file mode 100644
index 000000000..bb25f6186
--- /dev/null
+++ b/docs/devel/conflict-resolution.rst
@@ -0,0 +1,80 @@
+.. _conflict-resolution:
+
+Conflict Resolution Policy
+==========================
+
+Conflicts in the community can take many forms, from someone having a
+bad day and using harsh and hurtful language on the mailing list to more
+serious code of conduct violations (including sexist/racist statements
+or threats of violence), and everything in between.
+
+For the vast majority of issues, we aim to empower individuals to first
+resolve conflicts themselves, asking for help when needed, and only
+after that fails to escalate further. This approach gives people more
+control over the outcome of their dispute.
+
+How we resolve conflicts
+------------------------
+
+If you are experiencing conflict, please consider first addressing the
+perceived conflict directly with other involved parties, preferably through
+a real-time medium such as IRC. You could also try to get a third-party (e.g.
+a mutual friend, and/or someone with background on the issue, but not
+involved in the conflict) to intercede or mediate.
+
+If this fails or if you do not feel comfortable proceeding this way, or
+if the problem requires immediate escalation, report the issue to the QEMU
+leadership committee by sending an email to qemu@sfconservancy.org, providing
+references to the misconduct.
+For very urgent topics, you can also inform one or more members through IRC.
+The up-to-date list of members is `available on the QEMU wiki
+<https://wiki.qemu.org/Conservancy>`__.
+
+Your report will be treated confidentially by the leadership committee and
+not be published without your agreement. The QEMU leadership committee will
+then do its best to review the incident in a timely manner, and will either
+seek further information, or will make a determination on next steps.
+
+Remedies
+--------
+
+Escalating an issue to the QEMU leadership committee may result in actions
+impacting one or more involved parties. In the event the leadership
+committee has to intervene, here are some of the ways they might respond:
+
+1. Take no action. For example, if the leadership committee determines
+ the complaint has not been substantiated or is being made in bad faith,
+ or if it is deemed to be outside its purview.
+
+2. A private reprimand, explaining the consequences of continued behavior,
+ to one or more involved individuals.
+
+3. A private reprimand and request for a private or public apology.
+
+4. A public reprimand and request for a public apology.
+
+5. A public reprimand plus a mandatory cooling off period. The cooling
+ off period may require, for example, one or more of the following:
+ abstaining from maintainer duties; not interacting with people involved,
+ including unsolicited interaction with those enforcing the guidelines
+ and interaction on social media; being denied participation to in-person
+ events. The cooling off period is voluntary but may escalate to a
+ temporary ban in order to enforce it.
+
+6. A temporary or permanent ban from some or all current and future QEMU
+ spaces (mailing lists, IRC, wiki, etc.), possibly including in-person
+ events.
+
+In the event of severe harassment, the leadership committee may advise that
+the matter be escalated to the relevant local law enforcement agency. It
+is however not the role of the leadership committee to initiate contact
+with law enforcement on behalf of any of the community members involved
+in an incident.
+
+Sources
+-------
+
+This document was developed based on the `Drupal Conflict Resolution
+Policy and Process <https://www.drupal.org/conflict-resolution>`__
+and the `Mozilla Consequence Ladder
+<https://github.com/mozilla/diversity/blob/master/code-of-conduct-enforcement/consequence-ladder.md>`__
diff --git a/docs/devel/control-flow-integrity.rst b/docs/devel/control-flow-integrity.rst
new file mode 100644
index 000000000..e6b73a4fe
--- /dev/null
+++ b/docs/devel/control-flow-integrity.rst
@@ -0,0 +1,137 @@
+============================
+Control-Flow Integrity (CFI)
+============================
+
+This document describes the current control-flow integrity (CFI) mechanism in
+QEMU: how it can be enabled, its benefits and deficiencies, and how it affects
+new and existing code in QEMU.
+
+Basics
+------
+
+CFI is a hardening technique that focuses on guaranteeing that indirect
+function calls have not been altered by an attacker.
+The type used in QEMU is a forward-edge control-flow integrity that ensures
+that function calls performed through function pointers always call a
+"compatible" function. A compatible function is a function with the same
+signature as the function pointer declared in the source code.
+
+This type of CFI is entirely compiler-based and relies on the compiler knowing
+the signature of every function and every function pointer used in the code.
+As of now, the only compiler that provides support for CFI is Clang.
+
+CFI is best used on production binaries, to protect against unknown attack
+vectors.
+
+In case of a CFI violation (i.e. call to a non-compatible function) QEMU will
+terminate abruptly, to stop the possible attack.
+
+Building with CFI
+-----------------
+
+NOTE: CFI requires the use of link-time optimization. Therefore, when CFI is
+selected, LTO will be automatically enabled.
+
+To build with CFI, the minimum requirement is Clang 6+. If you
+are planning to also enable fuzzing, then Clang 11+ is needed (more on this
+later).
+
+Given the use of LTO, a version of AR that supports LLVM IR is required.
+The easiest way of doing this is by selecting the AR provided by LLVM::
+
+ AR=llvm-ar-9 CC=clang-9 CXX=clang++-9 /path/to/configure --enable-cfi
+
+CFI is enabled on every binary produced.
+
+If desired, an additional flag to increase the verbosity of the output in case
+of a CFI violation is offered (``--enable-debug-cfi``).
+
+Using QEMU built with CFI
+-------------------------
+
+A binary with CFI will work exactly like a standard binary. In case of a CFI
+violation, the binary will terminate with an illegal instruction signal.
+
+Incompatible code with CFI
+--------------------------
+
+As mentioned above, CFI is entirely compiler-based and therefore relies on
+compile-time knowledge of the code. This means that, while generally supported
+for most code, some specific usage patterns can break CFI compatibility and
+create false positives. The two main patterns that can cause issues are:
+
+* Just-in-time compiled code: since such code is created at runtime, the jump
+ to the buffer containing JIT code will fail.
+
+* Libraries loaded dynamically, e.g. with dlopen/dlsym, since the library was
+ not known at compile time.
+
+Current areas of QEMU that are not entirely compatible with CFI are:
+
+1. TCG, since the idea of TCG is to pre-compile groups of instructions at
+ runtime to speed-up interpretation, quite similarly to a JIT compiler
+
+2. TCI, where the interpreter has to interpret the generic *call* operation
+
+3. Plugins, since a plugin is implemented as an external library
+
+4. Modules, since they are implemented as an external library
+
+5. Directly calling signal handlers from the QEMU source code, since the
+ signal handler may have been provided by an external library or even plugged
+ at runtime.
+
+Disabling CFI for a specific function
+-------------------------------------
+
+If you are working on a function that performs a call in one of the
+incompatible ways described before, you can selectively disable CFI checks
+for that function by using the ``QEMU_DISABLE_CFI`` decorator at the function
+definition, and adding an explanation of why the function is not compatible
+with CFI. An example of the use of ``QEMU_DISABLE_CFI`` is provided here::
+
+ /*
+ * Disable CFI checks.
+ * TCG creates binary blobs at runtime, with the transformed code.
+ * A TB is a blob of binary code, created at runtime and called with an
+ * indirect function call. Since such function did not exist at compile time,
+ * the CFI runtime has no way to verify its signature and would fail.
+ * TCG is not considered a security-sensitive part of QEMU so this does not
+ * affect the impact of CFI in environment with high security requirements
+ */
+ QEMU_DISABLE_CFI
+ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
+
+NOTE: CFI needs to be disabled at the **caller** function (i.e. a
+CFI-compatible function that calls a non-compatible one), since the check is
+performed when the function call is made.
+
+CFI and fuzzing
+---------------
+
+There is generally no advantage in using CFI and fuzzing together, because
+they target different environments (production for CFI, debug for fuzzing).
+
+CFI could be used in conjunction with fuzzing to identify a broader set of
+bugs that do not immediately end in a segmentation fault or trigger an
+assertion. However, other sanitizers such as the address and undefined
+behavior sanitizers can identify such bugs in a more precise way than CFI.
+
+There is, however, an interesting use case in using CFI in conjunction with
+fuzzing, that is to make sure that CFI is not triggering any false positive
+in remote-but-possible parts of the code.
+
+CFI can be enabled with fuzzing, but with some caveats:
+
+1. Fuzzing relies on the linker performing function wrapping at link-time.
+   The standard BFD linker does not support function wrapping when LTO is
+   also enabled. The workaround is to use LLVM's lld linker.
+2. Fuzzing also relies on a custom linker script, which is only supported by
+   lld with version 11+.
+
+In other words, to compile with fuzzing and CFI, clang 11+ is required, and
+lld needs to be used as a linker::
+
+ AR=llvm-ar-11 CC=clang-11 CXX=clang++-11 /path/to/configure --enable-cfi \
+    --enable-fuzzing --extra-ldflags="-fuse-ld=lld"
+
+and then, compile the fuzzers as usual.
diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst
new file mode 100644
index 000000000..49ea50c2a
--- /dev/null
+++ b/docs/devel/decodetree.rst
@@ -0,0 +1,237 @@
+========================
+Decodetree Specification
+========================
+
+A *decodetree* is built from instruction *patterns*. A pattern may
+represent a single architectural instruction or a group of same, depending
+on what is convenient for further processing.
+
+Each pattern has both *fixedbits* and *fixedmask*, the combination of which
+describes the condition under which the pattern is matched::
+
+ (insn & fixedmask) == fixedbits
+
+Each pattern may have *fields*, which are extracted from the insn and
+passed along to the translator. Examples of such are registers,
+immediates, and sub-opcodes.
+
+In support of patterns, one may declare *fields*, *argument sets*, and
+*formats*, each of which may be re-used to simplify further definitions.
+
+Fields
+======
+
+Syntax::
+
+ field_def := '%' identifier ( unnamed_field )* ( !function=identifier )?
+ unnamed_field := number ':' ( 's' ) number
+
+For *unnamed_field*, the first number is the least-significant bit position
+of the field and the second number is the length of the field. If the 's' is
+present, the field is considered signed. If multiple ``unnamed_fields`` are
+present, they are concatenated. In this way one can define disjoint fields.
+
+If ``!function`` is specified, the concatenated result is passed through the
+named function, taking and returning an integral value.
+
+One may use ``!function`` with zero ``unnamed_fields``. This case is called
+a *parameter*, and the named function is only passed the ``DisasContext``
+and returns an integral value extracted from there.
+
+A field with no ``unnamed_fields`` and no ``!function`` is in error.
+
+Field examples:
+
++---------------------------+---------------------------------------------+
+| Input | Generated code |
++===========================+=============================================+
+| %disp 0:s16 | sextract(i, 0, 16) |
++---------------------------+---------------------------------------------+
+| %imm9 16:6 10:3 | extract(i, 16, 6) << 3 | extract(i, 10, 3) |
++---------------------------+---------------------------------------------+
+| %disp12 0:s1 1:1 2:10 | sextract(i, 0, 1) << 11 | |
+| | extract(i, 1, 1) << 10 | |
+| | extract(i, 2, 10) |
++---------------------------+---------------------------------------------+
+| %shimm8 5:s8 13:1 | expand_shimm8(sextract(i, 5, 8) << 1 | |
+| !function=expand_shimm8 | extract(i, 13, 1)) |
++---------------------------+---------------------------------------------+
+
+Argument Sets
+=============
+
+Syntax::
+
+ args_def := '&' identifier ( args_elt )+ ( !extern )?
+ args_elt := identifier (':' identifier)?
+
+Each *args_elt* defines an argument within the argument set.
+If the form of the *args_elt* contains a colon, the first
+identifier is the argument name and the second identifier is
+the argument type. If the colon is missing, the argument
+type will be ``int``.
+
+Each argument set will be rendered as a C structure "arg_$name"
+with each of the fields being one of the member arguments.
+
+If ``!extern`` is specified, the backing structure is assumed
+to have been already declared, typically via a second decoder.
+
+Argument sets are useful when one wants to define helper functions
+for the translator functions that can perform operations on a common
+set of arguments. This can ensure, for instance, that the ``AND``
+pattern and the ``OR`` pattern put their operands into the same named
+structure, so that a common ``gen_logic_insn`` may be able to handle
+the operations common between the two.
+
+Argument set examples::
+
+ &reg3 ra rb rc
+ &loadstore reg base offset
+ &longldst reg base offset:int64_t
+
+
+Formats
+=======
+
+Syntax::
+
+ fmt_def := '@' identifier ( fmt_elt )+
+ fmt_elt := fixedbit_elt | field_elt | field_ref | args_ref
+ fixedbit_elt := [01.-]+
+ field_elt := identifier ':' 's'? number
+ field_ref := '%' identifier | identifier '=' '%' identifier
+ args_ref := '&' identifier
+
+Defining a format is a handy way to avoid replicating groups of fields
+across many instruction patterns.
+
+A *fixedbit_elt* describes a contiguous sequence of bits that must
+be 1, 0, or don't care. The difference between '.' and '-'
+is that '.' means that the bit will be covered with a field or a
+final 0 or 1 from the pattern, and '-' means that the bit is really
+ignored by the cpu and will not be specified.
+
+A *field_elt* describes a simple field only given a width; the position of
+the field is implied by its position with respect to other *fixedbit_elt*
+and *field_elt*.
+
+If any *fixedbit_elt* or *field_elt* appear, then all bits must be defined.
+Padding with a *fixedbit_elt* of all '.' is an easy way to accomplish that.
+
+A *field_ref* incorporates a field by reference. This is the only way to
+add a complex field to a format. A field may be renamed in the process
+via assignment to another identifier. This is intended to allow the
+same argument set to be used with disjoint named fields.
+
+A single *args_ref* may specify an argument set to use for the format.
+The set of fields in the format must be a subset of the arguments in
+the argument set. If an argument set is not specified, one will be
+inferred from the set of fields.
+
+It is recommended, but not required, that all *field_ref* and *args_ref*
+appear at the end of the line, not interleaving with *fixedbit_elt* or
+*field_elt*.
+
+Format examples::
+
+ @opr ...... ra:5 rb:5 ... 0 ....... rc:5
+ @opi ...... ra:5 lit:8 1 ....... rc:5
+
+Patterns
+========
+
+Syntax::
+
+ pat_def := identifier ( pat_elt )+
+ pat_elt := fixedbit_elt | field_elt | field_ref | args_ref | fmt_ref | const_elt
+ fmt_ref := '@' identifier
+ const_elt := identifier '=' number
+
+The *fixedbit_elt* and *field_elt* specifiers are unchanged from formats.
+A pattern that does not specify a named format will have one inferred
+from a referenced argument set (if present) and the set of fields.
+
+A *const_elt* allows an argument to be set to a constant value. This may
+come in handy when fields overlap between patterns and one has to
+include the values in the *fixedbit_elt* instead.
+
+The decoder will call a translator function for each pattern matched.
+
+Pattern examples::
+
+ addl_r 010000 ..... ..... .... 0000000 ..... @opr
+ addl_i 010000 ..... ..... .... 0000000 ..... @opi
+
+which will, in part, invoke::
+
+ trans_addl_r(ctx, &arg_opr, insn)
+
+and::
+
+ trans_addl_i(ctx, &arg_opi, insn)
+
+Pattern Groups
+==============
+
+Syntax::
+
+ group := overlap_group | no_overlap_group
+ overlap_group := '{' ( pat_def | group )+ '}'
+ no_overlap_group := '[' ( pat_def | group )+ ']'
+
+A *group* begins with a lone open-brace or open-bracket, with all
+subsequent lines indented two spaces, and ending with a lone
+close-brace or close-bracket. Groups may be nested, increasing the
+required indentation of the lines within the nested group to two
+spaces per nesting level.
+
+Patterns within overlap groups are allowed to overlap. Conflicts are
+resolved by selecting the patterns in order. If all of the fixedbits
+for a pattern match, its translate function will be called. If the
+translate function returns false, then subsequent patterns within the
+group will be matched.
+
+Patterns within no-overlap groups are not allowed to overlap, just
+the same as ungrouped patterns. Thus no-overlap groups are intended
+to be nested inside overlap groups.
+
+The following example from PA-RISC shows specialization of the *or*
+instruction::
+
+ {
+ {
+ nop 000010 ----- ----- 0000 001001 0 00000
+ copy 000010 00000 r1:5 0000 001001 0 rt:5
+ }
+ or 000010 rt2:5 r1:5 cf:4 001001 0 rt:5
+ }
+
+When the *cf* field is zero, the instruction has no side effects,
+and may be specialized. When the *rt* field is zero, the output
+is discarded and so the instruction has no effect. When the *rt2*
+field is zero, the operation is ``reg[r1] | 0`` and so encodes
+the canonical register copy operation.
+
+The output from the generator might look like::
+
+ switch (insn & 0xfc000fe0) {
+ case 0x08000240:
+ /* 000010.. ........ ....0010 010..... */
+ if ((insn & 0x0000f000) == 0x00000000) {
+ /* 000010.. ........ 00000010 010..... */
+ if ((insn & 0x0000001f) == 0x00000000) {
+ /* 000010.. ........ 00000010 01000000 */
+ extract_decode_Fmt_0(&u.f_decode0, insn);
+ if (trans_nop(ctx, &u.f_decode0)) return true;
+ }
+ if ((insn & 0x03e00000) == 0x00000000) {
+ /* 00001000 000..... 00000010 010..... */
+ extract_decode_Fmt_1(&u.f_decode1, insn);
+ if (trans_copy(ctx, &u.f_decode1)) return true;
+ }
+ }
+ extract_decode_Fmt_2(&u.f_decode2, insn);
+ if (trans_or(ctx, &u.f_decode2)) return true;
+ return false;
+ }
diff --git a/docs/devel/ebpf_rss.rst b/docs/devel/ebpf_rss.rst
new file mode 100644
index 000000000..4a68682b3
--- /dev/null
+++ b/docs/devel/ebpf_rss.rst
@@ -0,0 +1,125 @@
+===========================
+eBPF RSS virtio-net support
+===========================
+
+RSS (Receive Side Scaling) is used to distribute network packets to guest virtqueues
+by calculating a packet hash. Each queue is then usually processed by a specific guest CPU core.
+
+For now there are 2 RSS implementations in qemu:
+
+- 'in-qemu' RSS (functions if qemu receives network packets, i.e. vhost=off)
+- eBPF RSS (can also function with vhost=on)
+
+eBPF support (CONFIG_EBPF) is enabled by the 'configure' script.
+To enable eBPF RSS support, use './configure --enable-bpf'.
+
+If a steering BPF program is not set for the kernel's TUN module, TUN automatically
+selects the rx virtqueue based on a lookup table built from the calculated symmetric
+hash of transmitted packets.
+If a steering BPF program is set for TUN, the BPF code calculates the hash of the
+packet header and returns the virtqueue number in which to place the packet.
+
+Simplified decision formula:
+
+.. code:: C
+
+ queue_index = indirection_table[hash(<packet data>)%<indirection_table size>]
+
+
+The hash cannot (or should not) be calculated for all packets.
+
+Note: currently, eBPF RSS does not support hash reporting.
+
+eBPF RSS is turned on by different combinations of vhost-net, virtio-net and tap configurations:
+
+- eBPF is used:
+
+ tap,vhost=off & virtio-net-pci,rss=on,hash=off
+
+- eBPF is used:
+
+ tap,vhost=on & virtio-net-pci,rss=on,hash=off
+
+- 'in-qemu' RSS is used:
+
+ tap,vhost=off & virtio-net-pci,rss=on,hash=on
+
+- eBPF is used, hash population feature is not reported to the guest:
+
+ tap,vhost=on & virtio-net-pci,rss=on,hash=on
+
+If CONFIG_EBPF is not set then only 'in-qemu' RSS is supported.
+'in-qemu' RSS is also used as a fallback if the eBPF program fails to load or cannot be set for TUN.
+
+RSS eBPF program
+----------------
+
+The RSS program is located in ebpf/rss.bpf.skeleton.h, which is generated by bpftool,
+so the program is part of the qemu binary.
+Initially, the eBPF program was compiled by clang; its source code is located at tools/ebpf/rss.bpf.c.
+Prerequisites to recompile the eBPF program (i.e. regenerate ebpf/rss.bpf.skeleton.h):
+
+ llvm, clang, kernel source tree, bpftool
+ Adjust Makefile.ebpf to reflect the location of the kernel source tree
+
+ $ cd tools/ebpf
+ $ make -f Makefile.ebpf
+
+The current eBPF RSS implementation uses 'bounded loops' with 'backward jump instructions', which are only present in recent kernels.
+Overall, eBPF RSS works on kernels 5.8+.
+
+eBPF RSS implementation
+-----------------------
+
+The eBPF RSS loading functionality is located in ebpf/ebpf_rss.c and ebpf/ebpf_rss.h.
+
+The ``struct EBPFRSSContext`` structure holds the libbpf context and 4 file descriptors:
+
+- ctx - pointer to the libbpf context.
+- program_fd - file descriptor of the eBPF RSS program.
+- map_configuration - file descriptor of the 'configuration' map. This map contains one element of ``struct EBPFRSSConfig``. This configuration determines the eBPF program behavior.
+- map_toeplitz_key - file descriptor of the 'Toeplitz key' map. It contains one element: the 40-byte key prepared for the hashing algorithm.
+- map_indirections_table - file descriptor of the indirections table map, which holds 128 queue-index elements.
+
+``struct EBPFRSSConfig`` fields:
+
+- redirect - "boolean" value: whether the hash should be calculated. When false, ``default_queue`` is used as the final decision.
+- populate_hash - not used for now; eBPF RSS doesn't support hash reporting.
+- hash_types - binary mask of the hash types to calculate; see the ``VIRTIO_NET_RSS_HASH_TYPE_*`` defines. If the hash should not be calculated for a packet, ``default_queue`` is used.
+- indirections_len - length of the indirections table, maximum 128.
+- default_queue - the queue index used for packets that are not hashed. For some packets the hash can't be calculated (e.g. ARP).
+
+Functions:
+
+- ``ebpf_rss_init()`` - sets ctx to NULL, which indicates that EBPFRSSContext is not loaded.
+- ``ebpf_rss_load()`` - creates 3 maps and loads eBPF program from the rss.bpf.skeleton.h. Returns 'true' on success. After that, program_fd can be used to set steering for TAP.
+- ``ebpf_rss_set_all()`` - sets values for the eBPF maps. The ``indirections_table`` length is given in EBPFRSSConfig. ``toeplitz_key`` is a VIRTIO_NET_RSS_MAX_KEY_SIZE (40 byte) array.
+- ``ebpf_rss_unload()`` - closes all file descriptors and sets ctx to NULL.
+
+Simplified eBPF RSS workflow:
+
+.. code:: C
+
+ struct EBPFRSSConfig config;
+ config.redirect = 1;
+ config.hash_types = VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | VIRTIO_NET_RSS_HASH_TYPE_TCPv4;
+ config.indirections_len = VIRTIO_NET_RSS_MAX_TABLE_LEN;
+ config.default_queue = 0;
+
+ uint16_t table[VIRTIO_NET_RSS_MAX_TABLE_LEN] = {...};
+ uint8_t key[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {...};
+
+ struct EBPFRSSContext ctx;
+ ebpf_rss_init(&ctx);
+ ebpf_rss_load(&ctx);
+ ebpf_rss_set_all(&ctx, &config, table, key);
+ if (net_client->info->set_steering_ebpf != NULL) {
+     net_client->info->set_steering_ebpf(net_client, ctx.program_fd);
+ }
+ ...
+ ebpf_rss_unload(&ctx);
+
+
+NetClientState SetSteeringEBPF()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For now, the ``set_steering_ebpf()`` method is only supported by the Linux TAP NetClientState. The method requires an eBPF program file descriptor as an argument.
diff --git a/docs/devel/fuzzing.rst b/docs/devel/fuzzing.rst
new file mode 100644
index 000000000..784ecb99e
--- /dev/null
+++ b/docs/devel/fuzzing.rst
@@ -0,0 +1,322 @@
+========
+Fuzzing
+========
+
+This document describes the virtual-device fuzzing infrastructure in QEMU and
+how to use it to implement additional fuzzers.
+
+Basics
+------
+
+Fuzzing operates by passing inputs to an entry point/target function. The
+fuzzer tracks the code coverage triggered by the input. Based on these
+findings, the fuzzer mutates the input and repeats the fuzzing.
+
+To fuzz QEMU, we rely on libfuzzer. Unlike other fuzzers such as AFL, libfuzzer
+is an *in-process* fuzzer. For the developer, this means that it is their
+responsibility to ensure that state is reset between fuzzing-runs.
+
+Building the fuzzers
+--------------------
+
+*NOTE*: If possible, build a 32-bit binary. When forking, the 32-bit fuzzer is
+much faster, since the page-map has a smaller size. This is due to the fact that
+AddressSanitizer maps ~20TB of memory, as part of its detection. This results
+in a large page-map, and a much slower ``fork()``.
+
+To build the fuzzers, install a recent version of clang and configure as shown
+below (substitute the clang binaries with the version you installed).
+Here, ``--enable-sanitizers`` is optional, but it allows us to reliably detect bugs
+such as out-of-bounds accesses, use-after-frees, double-frees etc.::
+
+ CC=clang-8 CXX=clang++-8 /path/to/configure --enable-fuzzing \
+ --enable-sanitizers
+
+Fuzz targets are built similarly to system targets::
+
+ make qemu-fuzz-i386
+
+This builds ``./qemu-fuzz-i386``.
+
+The first option to this command is ``--fuzz-target=FUZZ_NAME``.
+To list all of the available fuzzers, run ``qemu-fuzz-i386`` with no arguments.
+
+For example::
+
+ ./qemu-fuzz-i386 --fuzz-target=virtio-scsi-fuzz
+
+Internally, libfuzzer parses all arguments that do not begin with ``"--"``.
+Information about these is available by passing ``-help=1``.
+
+Now the only thing left to do is wait for the fuzzer to trigger potential
+crashes.
+
+Useful libFuzzer flags
+----------------------
+
+As mentioned above, libFuzzer accepts some arguments. Passing ``-help=1`` will
+list the available arguments. In particular, these arguments might be helpful:
+
+* ``CORPUS_DIR/`` : Specify a directory as the last argument to libFuzzer.
+ libFuzzer stores each "interesting" input in this corpus directory. The next
+ time you run libFuzzer, it will read all of the inputs from the corpus, and
+ continue fuzzing from there. You can also specify multiple directories.
+ libFuzzer loads existing inputs from all specified directories, but will only
+ write new ones to the first one specified.
+
+* ``-max_len=4096`` : specify the maximum byte-length of the inputs libFuzzer
+ will generate.
+
+* ``-close_fd_mask={1,2,3}`` : close stdout (1), stderr (2), or both (3). Useful for targets that
+ trigger many debug/error messages, or create output on the serial console.
+
+* ``-jobs=4 -workers=4`` : These arguments configure libFuzzer to run 4 fuzzers in
+ parallel (4 fuzzing jobs in 4 worker processes). Alternatively, with only
+ ``-jobs=N``, libFuzzer automatically spawns a number of workers less than or equal
+ to half the available CPU cores. Replace 4 with a number appropriate for your
+ machine. Make sure to specify a ``CORPUS_DIR``, which will allow the parallel
+ fuzzers to share information about the interesting inputs they find.
+
+* ``-use_value_profile=1`` : For each comparison operation, libFuzzer computes
+ ``(caller_pc&4095) | (popcnt(Arg1 ^ Arg2) << 12)`` and places this in the
+ coverage table. Useful for targets with "magic" constants. If Arg1 came from
+ the fuzzer's input and Arg2 is a magic constant, then each time the Hamming
+ distance between Arg1 and Arg2 decreases, libFuzzer adds the input to the
+ corpus.
+
+* ``-shrink=1`` : Tries to make elements of the corpus "smaller". Might lead to
+ better coverage performance, depending on the target.
+
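+For example, a parallel fuzzing run that combines several of these flags and
+keeps its corpus in ``./corpus`` (an existing directory) could look like::
+
+    ./qemu-fuzz-i386 --fuzz-target=virtio-scsi-fuzz \
+        -max_len=4096 -jobs=4 -workers=4 -use_value_profile=1 ./corpus
+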
+Note that libFuzzer's exact behavior will depend on the version of
+clang and libFuzzer used to build the device fuzzers.
+
+Generating Coverage Reports
+---------------------------
+
+Code coverage is a crucial metric for evaluating a fuzzer's performance.
+libFuzzer's output includes a "cov: " column that gives the total number of
+unique blocks/edges covered. To examine coverage on a line-by-line basis we
+can use Clang coverage:
+
+ 1. Configure libFuzzer to store a corpus of all interesting inputs (see
+ CORPUS_DIR above)
+ 2. ``./configure`` the QEMU build with ::
+
+ --enable-fuzzing \
+ --extra-cflags="-fprofile-instr-generate -fcoverage-mapping"
+
+ 3. Re-run the fuzzer. Specify $CORPUS_DIR/* as an argument, telling libfuzzer
+ to execute all of the inputs in $CORPUS_DIR and exit. Once the process
+ exits, you should find a file, "default.profraw" in the working directory.
+ 4. Execute these commands to generate a detailed HTML coverage-report::
+
+ llvm-profdata merge -output=default.profdata default.profraw
+ llvm-cov show ./path/to/qemu-fuzz-i386 -instr-profile=default.profdata \
+ --format html -output-dir=/path/to/output/report
+
+Adding a new fuzzer
+-------------------
+
+Coverage over virtual devices can be improved by adding additional fuzzers.
+Fuzzers are kept in ``tests/qtest/fuzz/`` and should be added to
+``tests/qtest/fuzz/meson.build``
+
+Fuzzers can rely on both qtest and libqos to communicate with virtual devices.
+
+1. Create a new source file. For example ``tests/qtest/fuzz/foo-device-fuzz.c``.
+
+2. Write the fuzzing code using the libqtest/libqos API. See existing fuzzers
+ for reference.
+
+3. Add the fuzzer to ``tests/qtest/fuzz/meson.build``.
+
+Fuzzers can be more-or-less thought of as special qtest programs which can
+modify the qtest commands and/or qtest command arguments based on inputs
+provided by libfuzzer. Libfuzzer passes a byte array and length. Commonly the
+fuzzer loops over the byte-array interpreting it as a list of qtest commands,
+addresses, or values.
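+
+As a rough sketch only (the ``FuzzTarget`` fields, the registration helpers and
+the include paths shown here are assumptions based on the existing fuzzers;
+check ``tests/qtest/fuzz/fuzz.h`` for the real interface), a minimal fuzzer for
+a hypothetical "foo" device might look like::
+
+    #include "qemu/osdep.h"
+    #include "tests/qtest/libqtest.h"
+    #include "fuzz.h"
+
+    /* Build the QEMU command line that attaches the (hypothetical) device. */
+    static GString *foo_device_cmdline(FuzzTarget *t)
+    {
+        return g_string_new("-machine none -device foo");
+    }
+
+    /*
+     * Interpret the libfuzzer input as (offset, value) byte pairs and turn
+     * each pair into a single-byte write into an assumed MMIO window.
+     */
+    static void foo_device_fuzz(QTestState *s,
+                                const unsigned char *data, size_t size)
+    {
+        size_t i;
+
+        for (i = 0; i + 1 < size; i += 2) {
+            qtest_writeb(s, 0xe0000000 + data[i], data[i + 1]);
+        }
+    }
+
+    static void register_foo_device_fuzz_targets(void)
+    {
+        fuzz_add_target(&(FuzzTarget){
+            .name = "foo-device-fuzz",
+            .description = "Fuzz the MMIO registers of the foo device",
+            .get_init_cmdline = foo_device_cmdline,
+            .fuzz = foo_device_fuzz,
+        });
+    }
+
+    fuzz_target_init(register_foo_device_fuzz_targets);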
+
+The Generic Fuzzer
+------------------
+
+Writing a fuzz target can be a lot of effort (especially if a device driver has
+not been built out within libqos). Many devices can be fuzzed to some degree,
+without any device-specific code, using the generic-fuzz target.
+
+The generic-fuzz target is capable of fuzzing devices over their PIO, MMIO,
+and DMA input-spaces. To apply the generic-fuzz to a device, we need to define
+two env-variables, at minimum:
+
+* ``QEMU_FUZZ_ARGS=`` is the set of QEMU arguments used to configure a machine, with
+ the device attached. For example, if we want to fuzz the virtio-net device
+ attached to a pc-i440fx machine, we can specify::
+
+ QEMU_FUZZ_ARGS="-M pc -nodefaults -netdev user,id=user0 \
+ -device virtio-net,netdev=user0"
+
+* ``QEMU_FUZZ_OBJECTS=`` is a set of space-delimited strings used to identify
+ the MemoryRegions that will be fuzzed. These strings are compared against
+ MemoryRegion names and MemoryRegion owner names, to decide whether each
+ MemoryRegion should be fuzzed. These strings support globbing. For the
+ virtio-net example, we could use one of ::
+
+ QEMU_FUZZ_OBJECTS='virtio-net'
+ QEMU_FUZZ_OBJECTS='virtio*'
+ QEMU_FUZZ_OBJECTS='virtio* pcspk' # Fuzz the virtio devices and the speaker
+ QEMU_FUZZ_OBJECTS='*' # Fuzz the whole machine
+
+The ``"info mtree"`` and ``"info qom-tree"`` monitor commands can be especially
+useful for identifying the ``MemoryRegion`` and ``Object`` names used for
+matching.
+
+As a generic rule-of-thumb, the more ``MemoryRegions``/Devices we match, the
+greater the input-space, and the smaller the probability of finding crashing
+inputs for individual devices. As such, it is usually a good idea to limit the
+fuzzer to only a few ``MemoryRegions``.
+
+To ensure that these env variables have been configured correctly, we can use::
+
+ ./qemu-fuzz-i386 --fuzz-target=generic-fuzz -runs=0
+
+The output should contain a complete list of matched MemoryRegions.
+
+OSS-Fuzz
+--------
+QEMU is continuously fuzzed on `OSS-Fuzz
+<https://github.com/google/oss-fuzz>`_. By default, the OSS-Fuzz build
+will try to fuzz every fuzz-target. Since the generic-fuzz target
+requires additional information provided in environment variables, we
+pre-define some generic-fuzz configs in
+``tests/qtest/fuzz/generic_fuzz_configs.h``. Each config must specify:
+
+- ``.name``: To identify the fuzzer config
+
+- ``.args`` OR ``.argfunc``: A string or pointer to a function returning a
+ string. These strings are used to specify the ``QEMU_FUZZ_ARGS``
+ environment variable. ``argfunc`` is useful when the config relies on e.g.
+ a dynamically created temp directory, or a free tcp/udp port.
+
+- ``.objects``: A string that specifies the ``QEMU_FUZZ_OBJECTS`` environment
+ variable.
+
+To fuzz additional devices/device configuration on OSS-Fuzz, send patches for
+either a new device-specific fuzzer or a new generic-fuzz config.
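+
+For reference, a new ``generic_fuzz_configs.h`` entry might look roughly like
+the following (the machine and device here are only placeholders, not a tested
+configuration)::
+
+    {
+        .name = "foo-device",
+        .args = "-machine q35 -nodefaults -device foo",
+        .objects = "foo*",
+    },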
+
+Build details:
+
+- The Dockerfile that sets up the environment for building QEMU's
+  fuzzers on OSS-Fuzz can be found in the OSS-Fuzz repository
+  (https://github.com/google/oss-fuzz/blob/master/projects/qemu/Dockerfile)
+
+- The script responsible for building the fuzzers can be found in the
+ QEMU source tree at ``scripts/oss-fuzz/build.sh``
+
+Building Crash Reproducers
+-----------------------------------------
+When we find a crash, we should try to create an independent reproducer that
+can be used on a non-fuzzer build of QEMU. This filters out any potential
+false-positives, and improves the debugging experience for developers.
+Here are the steps for building a reproducer for a crash found by the
+generic-fuzz target.
+
+- Ensure the crash reproduces::
+
+ qemu-fuzz-i386 --fuzz-target... ./crash-...
+
+- Gather the QTest output for the crash::
+
+ QEMU_FUZZ_TIMEOUT=0 QTEST_LOG=1 FUZZ_SERIALIZE_QTEST=1 \
+ qemu-fuzz-i386 --fuzz-target... ./crash-... &> /tmp/trace
+
+- Reorder and clean-up the resulting trace::
+
+ scripts/oss-fuzz/reorder_fuzzer_qtest_trace.py /tmp/trace > /tmp/reproducer
+
+- Get the arguments needed to start qemu, and provide a path to qemu::
+
+ less /tmp/trace # The args should be logged at the top of this file
+ export QEMU_ARGS="-machine ..."
+ export QEMU_PATH="path/to/qemu-system"
+
+- Ensure the crash reproduces in qemu-system::
+
+ $QEMU_PATH $QEMU_ARGS -qtest stdio < /tmp/reproducer
+
+- From the crash output, obtain some string that identifies the crash. This
+ can be a line in the stack-trace, for example::
+
+ export CRASH_TOKEN="hw/usb/hcd-xhci.c:1865"
+
+- Minimize the reproducer::
+
+ scripts/oss-fuzz/minimize_qtest_trace.py -M1 -M2 \
+ /tmp/reproducer /tmp/reproducer-minimized
+
+- Confirm that the minimized reproducer still crashes::
+
+ $QEMU_PATH $QEMU_ARGS -qtest stdio < /tmp/reproducer-minimized
+
+- Create a one-liner reproducer that can be sent over email::
+
+ ./scripts/oss-fuzz/output_reproducer.py -bash /tmp/reproducer-minimized
+
+- Output the C source code for a test case that will reproduce the bug::
+
+ ./scripts/oss-fuzz/output_reproducer.py -owner "John Smith <john@smith.com>"\
+ -name "test_function_name" /tmp/reproducer-minimized
+
+- Report the bug and send a patch with the C reproducer upstream
+
+Implementation Details / Fuzzer Lifecycle
+-----------------------------------------
+
+The fuzzer has two entrypoints that libfuzzer calls. libfuzzer provides its
+own ``main()``, which performs some setup, and calls the entrypoints:
+
+``LLVMFuzzerInitialize``: called prior to fuzzing. Used to initialize all of the
+necessary state
+
+``LLVMFuzzerTestOneInput``: called for each fuzzing run. Processes the input and
+resets the state at the end of each run.
+
+In more detail:
+
+``LLVMFuzzerInitialize`` parses the arguments to the fuzzer (must start with two
+dashes, so they are ignored by libfuzzer ``main()``). Currently, the arguments
+select the fuzz target. Then, the qtest client is initialized. If the target
+requires qos, qgraph is set up and the QOM/LIBQOS modules are initialized.
+Then the QGraph is walked and the QEMU cmd_line is determined and saved.
+
+After this, ``vl.c:qemu_main`` is called to set up the guest. There are
+target-specific hooks that can be called before and after qemu_main, for
+additional setup (e.g. PCI setup, or VM snapshotting).
+
+``LLVMFuzzerTestOneInput``: Uses qtest/qos functions to act based on the fuzz
+input. It is also responsible for manually calling ``main_loop_wait`` to ensure
+that bottom halves are executed, and for any cleanup required before the next input.
+
+Since the same process is reused for many fuzzing runs, QEMU state needs to
+be reset at the end of each run. There are currently two implemented
+options for resetting state:
+
+- Reboot the guest between runs.
+ - *Pros*: Straightforward and fast for simple fuzz targets.
+
+ - *Cons*: Depending on the device, does not reset all device state. If the
+ device requires some initialization prior to being ready for fuzzing (common
+ for QOS-based targets), this initialization needs to be done after each
+ reboot.
+
+ - *Example target*: ``i440fx-qtest-reboot-fuzz``
+
+- Run each test case in a separate forked process and copy the coverage
+ information back to the parent. This is fairly similar to AFL's "deferred"
+ fork-server mode [3]
+
+ - *Pros*: Relatively fast. Devices only need to be initialized once. No need to
+ do slow reboots or vmloads.
+
+ - *Cons*: Not officially supported by libfuzzer. Does not work well for
+ devices that rely on dedicated threads.
+
+ - *Example target*: ``virtio-net-fork-fuzz``
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
new file mode 100644
index 000000000..afd937535
--- /dev/null
+++ b/docs/devel/index.rst
@@ -0,0 +1,50 @@
+---------------------
+Developer Information
+---------------------
+
+This section of the manual documents various parts of the internals of QEMU.
+You only need to read it if you are interested in reading or
+modifying QEMU's source code.
+
+.. toctree::
+ :maxdepth: 2
+ :includehidden:
+
+ code-of-conduct
+ conflict-resolution
+ build-system
+ style
+ kconfig
+ testing
+ fuzzing
+ control-flow-integrity
+ loads-stores
+ memory
+ migration
+ atomics
+ stable-process
+ ci
+ qtest
+ decodetree
+ secure-coding-practices
+ tcg
+ tcg-icount
+ tracing
+ multi-thread-tcg
+ tcg-plugins
+ bitops
+ ui
+ reset
+ s390-dasd-ipl
+ clocks
+ qom
+ modules
+ block-coroutine-wrapper
+ multi-process
+ ebpf_rss
+ vfio-migration
+ qapi-code-gen
+ writing-monitor-commands
+ trivial-patches
+ submitting-a-patch
+ submitting-a-pull-request
diff --git a/docs/devel/kconfig.rst b/docs/devel/kconfig.rst
new file mode 100644
index 000000000..a1cdbec75
--- /dev/null
+++ b/docs/devel/kconfig.rst
@@ -0,0 +1,307 @@
+.. _kconfig:
+
+================
+QEMU and Kconfig
+================
+
+QEMU is a very versatile emulator; it can be built for a variety of
+targets, where each target can emulate various boards and at the same
+time different targets can share large amounts of code. For example,
+a POWER and an x86 board can run the same code to emulate a PCI network
+card, even though the boards use different PCI host bridges, and they
+can run the same code to emulate a SCSI disk while using different
+SCSI adapters. Arm, s390 and x86 boards can all present a virtio-blk
+disk to their guests, but with three different virtio guest interfaces.
+
+Each QEMU target enables a subset of the boards, devices and buses that
+are included in QEMU's source code. As a result, each QEMU executable
+only links a small subset of the files that form QEMU's source code;
+anything that is not needed to support a particular target is culled.
+
+QEMU uses a simple domain-specific language to describe the dependencies
+between components. This is useful for two reasons:
+
+* new targets and boards can be added without knowing in detail the
+ architecture of the hardware emulation subsystems. Boards only have
+ to list the components they need, and the compiled executable will
+ include all the required dependencies and all the devices that the
+ user can add to that board;
+
+* users can easily build reduced versions of QEMU that support only a subset
+ of boards or devices. For example, by default most targets will include
+ all emulated PCI devices that QEMU supports, but the build process is
+ configurable and it is easy to drop unnecessary (or otherwise unwanted)
+ code to make a leaner binary.
+
+This domain-specific language is based on the Kconfig language that
+originated in the Linux kernel, though it was heavily simplified and
+the handling of dependencies is stricter in QEMU.
+
+Unlike Linux, there is no user interface to edit the configuration, which
+is instead specified in per-target files under the ``default-configs/``
+directory of the QEMU source tree. This is because, unlike Linux,
+configuration and dependencies can be treated as a black box when building
+QEMU; the default configuration that QEMU ships with should be okay in
+almost all cases.
+
+The Kconfig language
+--------------------
+
+Kconfig defines configurable components in files named ``hw/*/Kconfig``.
+Note that configurable components are *not* visible in C code as preprocessor
+symbols; they are only visible in the Makefile. Each configurable component
+defines a Makefile variable whose name starts with ``CONFIG_``.
+
+All elements have boolean (true/false) type; truth is written as ``y``, while
+falsehood is written ``n``. They are defined in a Kconfig
+stanza like the following::
+
+ config ARM_VIRT
+ bool
+ imply PCI_DEVICES
+ imply VFIO_AMD_XGBE
+ imply VFIO_XGMAC
+ select A15MPCORE
+ select ACPI
+ select ARM_SMMUV3
+
+The ``config`` keyword introduces a new configuration element. In the example
+above, Makefiles will have access to a variable named ``CONFIG_ARM_VIRT``,
+with value ``y`` or ``n`` (respectively for boolean true and false).
+
+Boolean expressions can be used within the language, whenever ``<expr>``
+is written in the remainder of this section. The ``&&``, ``||`` and
+``!`` operators respectively denote conjunction (AND), disjunction (OR)
+and negation (NOT).
+
+The ``bool`` data type declaration is optional, but it is suggested to
+include it for clarity and future-proofing. After ``bool`` the following
+directives can be included:
+
+**dependencies**: ``depends on <expr>``
+
+ This defines a dependency for this configurable element. Dependencies
+ evaluate an expression and force the value of the variable to false
+ if the expression is false.
+
+**reverse dependencies**: ``select <symbol> [if <expr>]``
+
+ While ``depends on`` can force a symbol to false, reverse dependencies can
+ be used to force another symbol to true. In the following example,
+ ``CONFIG_BAZ`` will be true whenever ``CONFIG_FOO`` is true::
+
+ config FOO
+ select BAZ
+
+ The optional expression will prevent ``select`` from having any effect
+ unless it is true.
+
+ Note that unlike Linux's Kconfig implementation, QEMU will detect
+ contradictions between ``depends on`` and ``select`` statements and prevent
+ you from building such a configuration.
+
+**default value**: ``default <value> [if <expr>]``
+
+ Default values are assigned to the config symbol if no other value was
+ set by the user via ``default-configs/*.mak`` files, and only if
+ ``select`` or ``depends on`` directives do not force the value to true
+ or false respectively. ``<value>`` can be ``y`` or ``n``; it cannot
+ be an arbitrary Boolean expression. However, a condition for applying
+ the default value can be added with ``if``.
+
+ A configuration element can have any number of default values (usually,
+ if more than one default is present, they will have different
+ conditions). If multiple default values satisfy their condition,
+ only the first defined one is active.
+
+**reverse default** (weak reverse dependency): ``imply <symbol> [if <expr>]``
+
+ This is similar to ``select`` as it applies a lower limit of ``y``
+ to another symbol. However, the lower limit is only a default
+ and the "implied" symbol's value may still be set to ``n`` from a
+ ``default-configs/*.mak`` files. The following two examples are
+ equivalent::
+
+ config FOO
+ bool
+ imply BAZ
+
+ config BAZ
+ bool
+ default y if FOO
+
+ The next section explains where to use ``imply`` or ``default y``.
+
+Guidelines for writing Kconfig files
+------------------------------------
+
+Configurable elements in QEMU fall under five broad groups. Each group
+declares its dependencies in different ways:
+
+**subsystems**, of which **buses** are a special case
+
+ Example::
+
+ config SCSI
+ bool
+
+ Subsystems always default to false (they have no ``default`` directive)
+ and are never visible in ``default-configs/*.mak`` files. It's
+ up to other symbols to ``select`` whatever subsystems they require.
+
+ They sometimes have ``select`` directives to bring in other required
+ subsystems or buses. For example, ``AUX`` (the DisplayPort auxiliary
+ channel "bus") selects ``I2C`` because it can act as an I2C master too.
+
+**devices**
+
+ Example::
+
+ config MEGASAS_SCSI_PCI
+ bool
+ default y if PCI_DEVICES
+ depends on PCI
+ select SCSI
+
+ Devices are the most complex of the five. They can have a variety
+ of directives that cooperate so that a default configuration includes
+ all the devices that can be accessed from QEMU.
+
+ Devices *depend on* the bus that they lie on, for example a PCI
+ device would specify ``depends on PCI``. An MMIO device will likely
+ have no ``depends on`` directive. Devices also *select* the buses
+ that the device provides, for example a SCSI adapter would specify
+ ``select SCSI``. Finally, devices are usually ``default y`` if and
+ only if they have at least one ``depends on``; the default could be
+ conditional on a device group.
+
+ Devices also select any optional subsystem that they use; for example
+ a video card might specify ``select EDID`` if it needs to build EDID
+ information and publish it to the guest.
+
+**device groups**
+
+ Example::
+
+ config PCI_DEVICES
+ bool
+
+ Device groups provide a convenient mechanism to enable/disable many
+ devices in one go. This is useful when a set of devices is likely to
+ be enabled/disabled by several targets. Device groups usually need
+ no directive and are not used in the Makefile either; they only appear
+ as conditions for ``default y`` directives.
+
+ QEMU currently has two device groups, ``PCI_DEVICES`` and
+ ``TEST_DEVICES``. PCI devices usually have a ``default y if
+ PCI_DEVICES`` directive rather than just ``default y``. This lets
+ some boards (notably s390) easily support a subset of PCI devices,
+ for example only VFIO (passthrough) and virtio-pci devices.
+ ``TEST_DEVICES`` instead is used for devices that are rarely used on
+ production virtual machines, but provide useful hooks to test QEMU
+ or KVM.
+
+**boards**
+
+ Example::
+
+ config SUN4M
+ bool
+ imply TCX
+ imply CG3
+ select CS4231
+ select ECCMEMCTL
+ select EMPTY_SLOT
+ select ESCC
+ select ESP
+ select FDC
+ select SLAVIO
+ select LANCE
+ select M48T59
+ select STP2000
+
+ Boards specify their constituent devices using ``imply`` and ``select``
+ directives. A device should be listed under ``select`` if the board
+ cannot be started at all without it. It should be listed under
+ ``imply`` if (depending on the QEMU command line) the board may or
+ may not be started without it. Boards also default to false; they are
+ enabled by the ``default-configs/*.mak`` for the target they apply to.
+
+**internal elements**
+
+ Example::
+
+ config ECCMEMCTL
+ bool
+ select ECC
+
+ Internal elements group code that is useful in several boards or
+ devices. They are usually enabled with ``select`` and in turn select
+ other elements; they are never visible in ``default-configs/*.mak``
+ files, and often not even in the Makefile.
+
+Writing and modifying default configurations
+--------------------------------------------
+
+In addition to the Kconfig files under hw/, each target also includes
+a file called ``default-configs/TARGETNAME-softmmu.mak``. These files
+initialize some Kconfig variables to non-default values and provide the
+starting point to turn on devices and subsystems.
+
+A file in ``default-configs/`` looks like the following example::
+
+ # Default configuration for alpha-softmmu
+
+ # Uncomment the following lines to disable these optional devices:
+ #
+ #CONFIG_PCI_DEVICES=n
+ #CONFIG_TEST_DEVICES=n
+
+ # Boards:
+ #
+ CONFIG_DP264=y
+
+The first part, consisting of commented-out ``=n`` assignments, tells
+the user which devices or device groups are implied by the boards.
+The second part, consisting of ``=y`` assignments, tells the user which
+boards are supported by the target. The user will typically modify
+the default configuration by uncommenting lines in the first group,
+or commenting out lines in the second group.
+
+It is also possible to run QEMU's configure script with the
+``--without-default-devices`` option. When this is done, everything defaults
+to ``n`` unless it is ``select``ed or explicitly switched on in the
+``.mak`` files. In other words, ``default`` and ``imply`` directives
+are disabled. When QEMU is built with this option, the user will probably
+want to change some lines in the first group, for example like this::
+
+ CONFIG_PCI_DEVICES=y
+ #CONFIG_TEST_DEVICES=n
+
+and/or pick a subset of the devices in those device groups. Right now
+there is no single place that lists all the optional devices for
+``CONFIG_PCI_DEVICES`` and ``CONFIG_TEST_DEVICES``. In the future,
+we expect that ``.mak`` files will be automatically generated, so that
+they will include all these symbols and some help text on what they do.
+
+``Kconfig.host``
+----------------
+
+In some special cases, a configurable element depends on host features
+that are detected by QEMU's configure or ``meson.build`` scripts; for
+example some devices depend on the availability of KVM or on the presence
+of a library on the host.
+
+These symbols should be listed in ``Kconfig.host`` like this::
+
+ config TPM
+ bool
+
+and also listed as follows in the top-level meson.build's host_kconfig
+variable::
+
+ host_kconfig = \
+ ('CONFIG_TPM' in config_host ? ['CONFIG_TPM=y'] : []) + \
+ ('CONFIG_SPICE' in config_host ? ['CONFIG_SPICE=y'] : []) + \
+ (have_ivshmem ? ['CONFIG_IVSHMEM=y'] : []) + \
+ ...
diff --git a/docs/devel/loads-stores.rst b/docs/devel/loads-stores.rst
new file mode 100644
index 000000000..8f0035c82
--- /dev/null
+++ b/docs/devel/loads-stores.rst
@@ -0,0 +1,558 @@
+..
+ Copyright (c) 2017 Linaro Limited
+ Written by Peter Maydell
+
+===================
+Load and Store APIs
+===================
+
+QEMU internally has multiple families of functions for performing
+loads and stores. This document attempts to enumerate them all
+and indicate when to use them. It does not provide detailed
+documentation of each API -- for that you should look at the
+documentation comments in the relevant header files.
+
+
+``ld*_p and st*_p``
+~~~~~~~~~~~~~~~~~~~
+
+These functions operate on a host pointer, and should be used
+when you already have a pointer into host memory (corresponding
+to guest ram or a local buffer). They deal with doing accesses
+with the desired endianness and with correctly handling
+potentially unaligned pointer values.
+
+Function names follow the pattern:
+
+load: ``ld{sign}{size}_{endian}_p(ptr)``
+
+store: ``st{size}_{endian}_p(ptr, val)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``he`` : host endian
+ - ``be`` : big endian
+ - ``le`` : little endian
+
+The ``_{endian}`` infix is omitted for target-endian accesses.
+
+The target endian accessors are only available to source
+files which are built per-target.
+
+There are also functions which take the size as an argument:
+
+load: ``ldn{endian}_p(ptr, sz)``
+
+which performs an unsigned load of ``sz`` bytes from ``ptr``
+as an ``{endian}`` order value and returns it in a uint64_t.
+
+store: ``stn{endian}_p(ptr, sz, val)``
+
+which stores ``val`` to ``ptr`` as an ``{endian}`` order value
+of size ``sz`` bytes.
+
+
+Regexes for git grep
+ - ``\<ld[us]\?[bwlq]\(_[hbl]e\)\?_p\>``
+ - ``\<st[bwlq]\(_[hbl]e\)\?_p\>``
+ - ``\<ldn_\([hbl]e\)\?_p\>``
+ - ``\<stn_\([hbl]e\)\?_p\>``
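+
+For example, a device model that already has a ``uint8_t *buf`` host pointer
+into guest RAM or a local buffer might do::
+
+    uint32_t count = ldl_le_p(buf);    /* 32-bit little-endian load */
+    stw_be_p(buf + 4, 0x1234);         /* 16-bit big-endian store */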
+
+``cpu_{ld,st}*_mmu``
+~~~~~~~~~~~~~~~~~~~~
+
+These functions operate on a guest virtual address, plus a context
+known as a "mmu index" which controls how that virtual address is
+translated, plus a ``MemOp`` which contains alignment requirements
+among other things. The ``MemOp`` and mmu index are combined into
+a single argument of type ``MemOpIdx``.
+
+The meaning of the indexes is target specific, but specifying a
+particular index might be necessary if, for instance, the helper
+requires an "always as non-privileged" access rather than the
+default access for the current state of the guest CPU.
+
+These functions may cause a guest CPU exception to be taken
+(e.g. for an alignment fault or MMU fault) which will result in
+guest CPU state being updated and control longjmp'ing out of the
+function call. They should therefore only be used in code that is
+implementing emulation of the guest CPU.
+
+The ``retaddr`` parameter is used to control unwinding of the
+guest CPU state in case of a guest CPU exception. This is passed
+to ``cpu_restore_state()``. Therefore the value should either be 0,
+to indicate that the guest CPU state is already synchronized, or
+the result of ``GETPC()`` from the top level ``HELPER(foo)``
+function, which is a return address into the generated code [#gpc]_.
+
+.. [#gpc] Note that ``GETPC()`` should be used with great care: calling
+ it in other functions that are *not* the top level
+ ``HELPER(foo)`` will cause unexpected behavior. Instead, the
+ value of ``GETPC()`` should be read from the helper and passed
+ if needed to the functions that the helper calls.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{size}{end}_mmu(env, ptr, oi, retaddr)``
+
+store: ``cpu_st{size}{end}_mmu(env, ptr, val, oi, retaddr)``
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``end``
+ - (empty) : for target endian, or 8 bit sizes
+ - ``_be`` : big endian
+ - ``_le`` : little endian
+
+Regexes for git grep:
+ - ``\<cpu_ld[bwlq](_[bl]e)\?_mmu\>``
+ - ``\<cpu_st[bwlq](_[bl]e)\?_mmu\>``
+
+
+``cpu_{ld,st}*_mmuidx_ra``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions work like the ``cpu_{ld,st}_mmu`` functions except
+that the ``mmuidx`` parameter is not combined with a ``MemOp``,
+and therefore there is no required alignment supplied or enforced.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{sign}{size}{end}_mmuidx_ra(env, ptr, mmuidx, retaddr)``
+
+store: ``cpu_st{size}{end}_mmuidx_ra(env, ptr, val, mmuidx, retaddr)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``end``
+ - (empty) : for target endian, or 8 bit sizes
+ - ``_be`` : big endian
+ - ``_le`` : little endian
+
+Regexes for git grep:
+ - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_mmuidx_ra\>``
+ - ``\<cpu_st[bwlq](_[bl]e)\?_mmuidx_ra\>``
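+
+For example, a helper that must perform a 32-bit store using an explicitly
+chosen mmu index (``unpriv_idx`` here is a hypothetical, target-specific
+value) might do::
+
+    cpu_stl_mmuidx_ra(env, addr, val, unpriv_idx, GETPC());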
+
+``cpu_{ld,st}*_data_ra``
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions work like the ``cpu_{ld,st}_mmuidx_ra`` functions
+except that the ``mmuidx`` parameter is taken from the current mode
+of the guest CPU, as determined by ``cpu_mmu_index(env, false)``.
+
+These are generally the preferred way to do accesses by guest
+virtual address from helper functions, unless the access should
+be performed with a context other than the default, or alignment
+should be enforced for the access.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{sign}{size}{end}_data_ra(env, ptr, ra)``
+
+store: ``cpu_st{size}{end}_data_ra(env, ptr, val, ra)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``end``
+ - (empty) : for target endian, or 8 bit sizes
+ - ``_be`` : big endian
+ - ``_le`` : little endian
+
+Regexes for git grep:
+ - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data_ra\>``
+ - ``\<cpu_st[bwlq](_[bl]e)\?_data_ra\>``
+
+``cpu_{ld,st}*_data``
+~~~~~~~~~~~~~~~~~~~~~
+
+These functions work like the ``cpu_{ld,st}_data_ra`` functions
+except that the ``retaddr`` parameter is 0, and thus does not
+unwind guest CPU state.
+
+This means they must only be used from helper functions where the
+translator has saved all necessary CPU state. These functions are
+the right choice for calls made from hooks like the CPU ``do_interrupt``
+hook or when you know for certain that the translator had to save all
+the CPU state anyway.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{sign}{size}{end}_data(env, ptr)``
+
+store: ``cpu_st{size}{end}_data(env, ptr, val)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``end``
+ - (empty) : for target endian, or 8 bit sizes
+ - ``_be`` : big endian
+ - ``_le`` : little endian
+
+Regexes for git grep
+ - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data\>``
+ - ``\<cpu_st[bwlq](_[bl]e)\?_data\+\>``
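+
+For example, a ``do_interrupt`` hook that needs to read a 32-bit descriptor
+word from guest virtual memory could use::
+
+    uint32_t desc = cpu_ldl_data(env, desc_addr);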
+
+``cpu_ld*_code``
+~~~~~~~~~~~~~~~~
+
+These functions perform a read for instruction execution. The ``mmuidx``
+parameter is taken from the current mode of the guest CPU, as determined
+by ``cpu_mmu_index(env, true)``. The ``retaddr`` parameter is 0, and
+thus does not unwind guest CPU state, because CPU state is always
+synchronized while translating instructions. Any guest CPU exception
+that is raised will indicate an instruction execution fault rather than
+a data read fault.
+
+In general these functions should not be used directly during translation.
+There are wrapper functions that are to be used which also take care of
+plugins for tracing.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{sign}{size}_code(env, ptr)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+Regexes for git grep:
+ - ``\<cpu_ld[us]\?[bwlq]_code\>``
+
+``translator_ld*``
+~~~~~~~~~~~~~~~~~~
+
+These functions are a wrapper for ``cpu_ld*_code`` which also perform
+any actions required by any tracing plugins. They are only to be
+called during the translator callback ``translate_insn``.
+
+There is a set of functions ending in ``_swap`` which, if the parameter
+is true, returns the value in the endianness that is the reverse of
+the guest native endianness, as determined by ``TARGET_WORDS_BIGENDIAN``.
+
+Function names follow the pattern:
+
+load: ``translator_ld{sign}{size}(env, ptr)``
+
+swap: ``translator_ld{sign}{size}_swap(env, ptr, swap)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+Regexes for git grep
+ - ``\<translator_ld[us]\?[bwlq]\(_swap\)\?\>``
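+
+For example, a target with fixed-width 32-bit instructions might fetch the
+next instruction in its ``translate_insn`` hook like this (``dc`` being the
+target's DisasContext)::
+
+    uint32_t insn = translator_ldl(env, dc->base.pc_next);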
+
+``helper_*_{ld,st}*_mmu``
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions are intended primarily to be called by the code
+generated by the TCG backend. They may also be called by target
+CPU helper function code. Like the ``cpu_{ld,st}_mmuidx_ra`` functions
+they perform accesses by guest virtual address, with a given ``mmuidx``.
+
+These functions specify an ``opindex`` parameter which encodes
+(among other things) the mmu index to use for the access. This parameter
+should be created by calling ``make_memop_idx()``.
+
+The ``retaddr`` parameter should be the result of GETPC() called directly
+from the top level HELPER(foo) function (or 0 if no guest CPU state
+unwinding is required).
+
+**TODO** The names of these functions are a bit odd for historical
+reasons because they were originally expected to be called only from
+within generated code. We should rename them to bring them more in
+line with the other memory access functions. The explicit endianness
+is the only feature they have beyond ``*_mmuidx_ra``.
+
+load: ``helper_{endian}_ld{sign}{size}_mmu(env, addr, opindex, retaddr)``
+
+store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+ - ``ret`` : target endianness
+
+Regexes for git grep
+ - ``\<helper_\(le\|be\|ret\)_ld[us]\?[bwlq]_mmu\>``
+ - ``\<helper_\(le\|be\|ret\)_st[bwlq]_mmu\>``
+
+``address_space_*``
+~~~~~~~~~~~~~~~~~~~
+
+These functions are the primary ones to use when emulating CPU
+or device memory accesses. They take an AddressSpace, which is the
+way QEMU defines the view of memory that a device or CPU has.
+(They generally correspond to being the "master" end of a hardware bus
+or bus fabric.)
+
+Each CPU has an AddressSpace. Some kinds of CPU have more than
+one AddressSpace (for instance Arm guest CPUs have an AddressSpace
+for the Secure world and one for NonSecure if they implement TrustZone).
+Devices which can do DMA-type operations should generally have an
+AddressSpace. There is also a "system address space" which typically
+has all the devices and memory that all CPUs can see. (Some older
+device models use the "system address space" rather than properly
+modelling that they have an AddressSpace of their own.)
+
+Functions are provided for doing byte-buffer reads and writes,
+and also for doing one-data-item loads and stores.
+
+In all cases the caller provides a MemTxAttrs to specify bus
+transaction attributes, and can check whether the memory transaction
+succeeded using a MemTxResult return code.
+
+``address_space_read(address_space, addr, attrs, buf, len)``
+
+``address_space_write(address_space, addr, attrs, buf, len)``
+
+``address_space_rw(address_space, addr, attrs, buf, len, is_write)``
+
+``address_space_ld{sign}{size}_{endian}(address_space, addr, attrs, txresult)``
+
+``address_space_st{size}_{endian}(address_space, addr, val, attrs, txresult)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+
+(No signed load operations are provided.)
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+
+The ``_{endian}`` suffix is omitted for byte accesses.
+
+Regexes for git grep
+ - ``\<address_space_\(read\|write\|rw\)\>``
+ - ``\<address_space_ldu\?[bwql]\(_[lb]e\)\?\>``
+ - ``\<address_space_st[bwql]\(_[lb]e\)\?\>``
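+
+For example, a DMA-capable device model might write a status word to guest
+memory and check the result (``s->dma_as`` and ``desc_addr`` are hypothetical
+device-model fields)::
+
+    MemTxResult res;
+    uint32_t status = cpu_to_le32(1);
+
+    res = address_space_write(&s->dma_as, desc_addr, MEMTXATTRS_UNSPECIFIED,
+                              &status, sizeof(status));
+    if (res != MEMTX_OK) {
+        /* the bus transaction failed; report it in a device-specific way */
+    }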
+
+``address_space_write_rom``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This function performs a write by physical address like
+``address_space_write``, except that if the write is to a ROM then
+the ROM contents will be modified, even though a write by the guest
+CPU to the ROM would be ignored. This is used for non-guest writes
+like writes from the gdb debug stub or initial loading of ROM contents.
+
+Note that portions of the write which attempt to write data to a
+device will be silently ignored -- only real RAM and ROM will
+be written to.
+
+Regexes for git grep
+ - ``address_space_write_rom``
+
+``{ld,st}*_phys``
+~~~~~~~~~~~~~~~~~
+
+These are functions which are identical to
+``address_space_{ld,st}*``, except that they always pass
+``MEMTXATTRS_UNSPECIFIED`` for the transaction attributes, and ignore
+whether the transaction succeeded or failed.
+
+The fact that they ignore whether the transaction succeeded means
+they should not be used in new code, unless you know for certain
+that your code will only be used in a context where the CPU or
+device doing the access has no way to report such an error.
+
+``load: ld{sign}{size}_{endian}_phys``
+
+``store: st{size}_{endian}_phys``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+
+(No signed load operations are provided.)
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+
+The ``_{endian}_`` infix is omitted for byte accesses.
+
+Regexes for git grep
+ - ``\<ldu\?[bwlq]\(_[bl]e\)\?_phys\>``
+ - ``\<st[bwlq]\(_[bl]e\)\?_phys\>``
+
+``cpu_physical_memory_*``
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These are convenience functions which are identical to
+``address_space_*`` but operate specifically on the system address space,
+always pass a ``MEMTXATTRS_UNSPECIFIED`` set of memory attributes and
+ignore whether the memory transaction succeeded or failed.
+For new code they are better avoided:
+
+* there is likely to be behaviour you need to model correctly for a
+ failed read or write operation
+* a device should usually perform operations on its own AddressSpace
+ rather than using the system address space
+
+``cpu_physical_memory_read``
+
+``cpu_physical_memory_write``
+
+``cpu_physical_memory_rw``
+
+Regexes for git grep
+ - ``\<cpu_physical_memory_\(read\|write\|rw\)\>``
+
+``cpu_memory_rw_debug``
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Access CPU memory by virtual address for debug purposes.
+
+This function is intended for use by the GDB stub and similar code.
+It takes a virtual address, converts it to a physical address via
+an MMU lookup using the current settings of the specified CPU,
+and then performs the access (using ``address_space_rw`` for
+reads or ``cpu_physical_memory_write_rom`` for writes).
+This means that if the access is a write to a ROM then this
+function will modify the contents (whereas a normal guest CPU access
+would ignore the write attempt).
+
+``cpu_memory_rw_debug``
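+
+For example, reading a buffer on behalf of a debugger (``cpu`` being the
+CPUState and ``vaddr`` the guest virtual address) might look like::
+
+    uint8_t buf[64];
+
+    if (cpu_memory_rw_debug(cpu, vaddr, buf, sizeof(buf), false) < 0) {
+        /* the address could not be translated or accessed */
+    }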
+
+``dma_memory_*``
+~~~~~~~~~~~~~~~~
+
+These behave like ``address_space_*``, except that they perform a DMA
+barrier operation first.
+
+**TODO**: We should provide guidance on when you need the DMA
+barrier operation and when it's OK to use ``address_space_*``, and
+make sure our existing code is doing things correctly.
+
+``dma_memory_read``
+
+``dma_memory_write``
+
+``dma_memory_rw``
+
+Regexes for git grep
+ - ``\<dma_memory_\(read\|write\|rw\)\>``
+ - ``\<ldu\?[bwlq]\(_[bl]e\)\?_dma\>``
+ - ``\<st[bwlq]\(_[bl]e\)\?_dma\>``
+
+``pci_dma_*`` and ``{ld,st}*_pci_dma``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions are specifically for PCI device models which need to
+perform accesses where the PCI device is a bus master. You pass them a
+``PCIDevice *`` and they will do ``dma_memory_*`` operations on the
+correct address space for that device.
+
+``pci_dma_read``
+
+``pci_dma_write``
+
+``pci_dma_rw``
+
+``load: ld{sign}{size}_{endian}_pci_dma``
+
+``store: st{size}_{endian}_pci_dma``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+
+(No signed load operations are provided.)
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+
+The ``_{endian}_`` infix is omitted for byte accesses.
+
+Regexes for git grep
+ - ``\<pci_dma_\(read\|write\|rw\)\>``
+ - ``\<ldu\?[bwlq]\(_[bl]e\)\?_pci_dma\>``
+ - ``\<st[bwlq]\(_[bl]e\)\?_pci_dma\>``
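+
+For example, a PCI device model acting as a bus master might fetch a
+(hypothetical) 16-byte descriptor and write back a status word like this,
+where ``d`` is the model's ``PCIDevice``::
+
+    uint8_t desc[16];
+    uint32_t status = 1;
+
+    pci_dma_read(d, desc_addr, desc, sizeof(desc));
+    /* ... parse the descriptor ... */
+    pci_dma_write(d, desc_addr + sizeof(desc), &status, sizeof(status));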
diff --git a/docs/devel/lockcnt.txt b/docs/devel/lockcnt.txt
new file mode 100644
index 000000000..a3fb3bc5d
--- /dev/null
+++ b/docs/devel/lockcnt.txt
@@ -0,0 +1,277 @@
+DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt)
+===================================================
+
+QEMU often uses reference counts to track data structures that are being
+accessed and should not be freed. For example, a loop that invokes
+callbacks like this is not safe:
+
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+
+QLIST_FOREACH_SAFE protects against deletion of the current node (ioh)
+by stashing away its "next" pointer. However, ioh->fd_write could
+actually delete the next node from the list. The simplest way to
+avoid this is to mark the node as deleted, and remove it from the
+list in the above loop:
+
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ } else {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+ }
+
+If however this loop must also be reentrant, i.e. it is possible that
+ioh->fd_write invokes the loop again, some kind of counting is needed:
+
+ walking_handlers++;
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ if (walking_handlers == 1) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ }
+ } else {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+ }
+ walking_handlers--;
+
+One may think of using the RCU primitives, rcu_read_lock() and
+rcu_read_unlock(); effectively, the RCU nesting count would take
+the place of the walking_handlers global variable. Indeed,
+reference counting and RCU have similar purposes, but their usage in
+general is complementary:
+
+- reference counting is fine-grained and limited to a single data
+ structure; RCU delays reclamation of *all* RCU-protected data
+ structures;
+
+- reference counting works even in the presence of code that keeps
+ a reference for a long time; RCU critical sections in principle
+ should be kept short;
+
+- reference counting is often applied to code that is not thread-safe
+ but is reentrant; in fact, usage of reference counting in QEMU predates
+ the introduction of threads by many years. RCU is generally used to
+ protect readers from other threads freeing memory after concurrent
+ modifications to a data structure.
+
+- reclaiming data can be done by a separate thread in the case of RCU;
+ this can improve performance, but also delay reclamation undesirably.
+ With reference counting, reclamation is deterministic.
+
+This file documents QemuLockCnt, an abstraction for using reference
+counting in code that has to be both thread-safe and reentrant.
+
+
+QemuLockCnt concepts
+--------------------
+
+A QemuLockCnt comprises both a counter and a mutex; it has primitives
+to increment and decrement the counter, and to take and release the
+mutex. The counter notes how many visits to the data structures are
+taking place (the visits could be from different threads, or there could
+be multiple reentrant visits from the same thread). The basic rules
+governing the counter/mutex pair then are the following:
+
+- Data protected by the QemuLockCnt must not be freed unless the
+ counter is zero and the mutex is taken.
+
+- A new visit cannot be started while the counter is zero and the
+ mutex is taken.
+
+Most of the time, the mutex protects all writes to the data structure,
+not just frees, though there could be cases where this is not necessary.
+
+Reads, instead, can be done without taking the mutex, as long as the
+readers and writers use the same macros that are used for RCU, for
+example qatomic_rcu_read, qatomic_rcu_set, QLIST_FOREACH_RCU, etc. This is
+because the reads are done outside a lock and a set or QLIST_INSERT_HEAD
+can happen concurrently with the read. The RCU API ensures that the
+processor and the compiler see all required memory barriers.
+
+This could be implemented simply by protecting the counter with the
+mutex, for example:
+
+ // (1)
+ qemu_mutex_lock(&walking_handlers_mutex);
+ walking_handlers++;
+ qemu_mutex_unlock(&walking_handlers_mutex);
+
+ ...
+
+ // (2)
+ qemu_mutex_lock(&walking_handlers_mutex);
+ if (--walking_handlers == 0) {
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ }
+ }
+ }
+ qemu_mutex_unlock(&walking_handlers_mutex);
+
+Here, no frees can happen in the code represented by the ellipsis.
+If another thread is executing critical section (2), that part of
+the code cannot be entered, because the thread will not be able
+to increment the walking_handlers variable. And of course
+during the visit any other thread will see a nonzero value for
+walking_handlers, as in the single-threaded code.
+
+Note that it is possible for multiple concurrent accesses to delay
+the cleanup arbitrarily; in other words, for the walking_handlers
+counter to never become zero. For this reason, this technique is
+more easily applicable if concurrent access to the structure is rare.
+
+However, critical sections are easy to forget since you have to do
+them for each modification of the counter. QemuLockCnt ensures that
+all modifications of the counter take the lock appropriately, and it
+can also be more efficient in two ways:
+
+- it avoids taking the lock for many operations (for example
+ incrementing the counter while it is non-zero);
+
+- on some platforms, one can implement QemuLockCnt to hold the counter
+ and the mutex in a single word, making the fast path no more expensive
+ than simply managing a counter using atomic operations (see
+ docs/devel/atomics.rst). This can be very helpful if concurrent access to
+ the data structure is expected to be rare.
+
+
+Using the same mutex for frees and writes can still incur some small
+inefficiencies; for example, a visit can never start if the counter is
+zero and the mutex is taken---even if the mutex is taken by a write,
+which in principle need not block a visit of the data structure.
+However, these are usually not a problem if any of the following
+assumptions are valid:
+
+- concurrent access is possible but rare
+
+- writes are rare
+
+- writes are frequent, but this kind of write (e.g. appending to a
+ list) has a very small critical section.
+
+For example, QEMU uses QemuLockCnt to manage an AioContext's list of
+bottom halves and file descriptor handlers. Modifications to the list
+of file descriptor handlers are rare. Creation of a new bottom half is
+frequent and can happen on a fast path; however: 1) it is almost never
+concurrent with a visit to the list of bottom halves; 2) it only has
+three instructions in the critical path, two assignments and a smp_wmb().
+
+
+QemuLockCnt API
+---------------
+
+The QemuLockCnt API is described in include/qemu/thread.h.
+
+
+QemuLockCnt usage
+-----------------
+
+This section explains the typical usage patterns for QemuLockCnt functions.
+
+Setting a variable to a non-NULL value can be done between
+qemu_lockcnt_lock and qemu_lockcnt_unlock:
+
+ qemu_lockcnt_lock(&xyz_lockcnt);
+ if (!xyz) {
+ new_xyz = g_new(XYZ, 1);
+ ...
+ qatomic_rcu_set(&xyz, new_xyz);
+ }
+ qemu_lockcnt_unlock(&xyz_lockcnt);
+
+Accessing the value can be done between qemu_lockcnt_inc and
+qemu_lockcnt_dec:
+
+ qemu_lockcnt_inc(&xyz_lockcnt);
+ if (xyz) {
+ XYZ *p = qatomic_rcu_read(&xyz);
+ ...
+ /* Accesses can now be done through "p". */
+ }
+ qemu_lockcnt_dec(&xyz_lockcnt);
+
+Freeing the object can similarly use qemu_lockcnt_lock and
+qemu_lockcnt_unlock, but you also need to ensure that the count
+is zero (i.e. there is no concurrent visit). Because qemu_lockcnt_inc
+takes the QemuLockCnt's lock, the count cannot become non-zero while
+the object is being freed. Freeing an object looks like this:
+
+ qemu_lockcnt_lock(&xyz_lockcnt);
+ if (!qemu_lockcnt_count(&xyz_lockcnt)) {
+ g_free(xyz);
+ xyz = NULL;
+ }
+ qemu_lockcnt_unlock(&xyz_lockcnt);
+
+If an object has to be freed right after a visit, you can combine
+the decrement, the locking and the check on count as follows:
+
+ qemu_lockcnt_inc(&xyz_lockcnt);
+ if (xyz) {
+ XYZ *p = qatomic_rcu_read(&xyz);
+ ...
+ /* Accesses can now be done through "p". */
+ }
+ if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) {
+ g_free(xyz);
+ xyz = NULL;
+ qemu_lockcnt_unlock(&xyz_lockcnt);
+ }
+
+QemuLockCnt can also be used to access a list as follows:
+
+ qemu_lockcnt_inc(&io_handlers_lockcnt);
+ QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+
+ if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) {
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ }
+ }
+ qemu_lockcnt_unlock(&io_handlers_lockcnt);
+ }
+
+Again, the RCU primitives are used because new items can be added to the
+list during the walk. QLIST_FOREACH_RCU ensures that the processor and
+the compiler see the appropriate memory barriers.
+
+An alternative pattern uses qemu_lockcnt_dec_if_lock:
+
+ qemu_lockcnt_inc(&io_handlers_lockcnt);
+ QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt);
+ }
+ } else {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+ }
+ qemu_lockcnt_dec(&io_handlers_lockcnt);
+
+Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock,
+because there is no special task to do if the count goes from 1 to 0.
diff --git a/docs/devel/memory.rst b/docs/devel/memory.rst
new file mode 100644
index 000000000..5dc8a1268
--- /dev/null
+++ b/docs/devel/memory.rst
@@ -0,0 +1,368 @@
+==============
+The memory API
+==============
+
+The memory API models the memory and I/O buses and controllers of a QEMU
+machine. It attempts to allow modelling of:
+
+- ordinary RAM
+- memory-mapped I/O (MMIO)
+- memory controllers that can dynamically reroute physical memory regions
+ to different destinations
+
+The memory model provides support for
+
+- tracking RAM changes by the guest
+- setting up coalesced memory for kvm
+- setting up ioeventfd regions for kvm
+
+Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks
+(leaves) are RAM and MMIO regions, while other nodes represent
+buses, memory controllers, and memory regions that have been rerouted.
+
+In addition to MemoryRegion objects, the memory API provides AddressSpace
+objects for every root and possibly for intermediate MemoryRegions too.
+These represent memory as seen from the CPU or a device's viewpoint.
+
+Types of regions
+----------------
+
+There are multiple types of memory regions (all represented by a single C type
+MemoryRegion):
+
+- RAM: a RAM region is simply a range of host memory that can be made available
+ to the guest.
+ You typically initialize these with memory_region_init_ram(). Some special
+ purposes require the variants memory_region_init_resizeable_ram(),
+ memory_region_init_ram_from_file(), or memory_region_init_ram_ptr().
+
+- MMIO: a range of guest memory that is implemented by host callbacks;
+ each read or write causes a callback to be called on the host.
+ You initialize these with memory_region_init_io(), passing it a
+ MemoryRegionOps structure describing the callbacks.
+
+- ROM: a ROM memory region works like RAM for reads (directly accessing
+ a region of host memory), and forbids writes. You initialize these with
+ memory_region_init_rom().
+
+- ROM device: a ROM device memory region works like RAM for reads
+ (directly accessing a region of host memory), but like MMIO for
+ writes (invoking a callback). You initialize these with
+ memory_region_init_rom_device().
+
+- IOMMU region: an IOMMU region translates addresses of accesses made to it
+ and forwards them to some other target memory region. As the name suggests,
+ these are only needed for modelling an IOMMU, not for simple devices.
+ You initialize these with memory_region_init_iommu().
+
+- container: a container simply includes other memory regions, each at
+ a different offset. Containers are useful for grouping several regions
+ into one unit. For example, a PCI BAR may be composed of a RAM region
+ and an MMIO region.
+
+ A container's subregions are usually non-overlapping. In some cases it is
+ useful to have overlapping regions; for example a memory controller that
+ can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
+  that does not prevent cards from claiming overlapping BARs.
+
+ You initialize a pure container with memory_region_init().
+
+- alias: a subsection of another region. Aliases allow a region to be
+ split apart into discontiguous regions. Examples of uses are memory banks
+ used when the guest address space is smaller than the amount of RAM
+ addressed, or a memory controller that splits main memory to expose a "PCI
+ hole". Aliases may point to any type of region, including other aliases,
+ but an alias may not point back to itself, directly or indirectly.
+ You initialize these with memory_region_init_alias().
+
+- reservation region: a reservation region is primarily for debugging.
+ It claims I/O space that is not supposed to be handled by QEMU itself.
+ The typical use is to track parts of the address space which will be
+ handled by the host kernel when KVM is enabled. You initialize these
+ by passing a NULL callback parameter to memory_region_init_io().
+
+It is valid to add subregions to a region which is not a pure container
+(that is, to an MMIO, RAM or ROM region). This means that the region
+will act like a container, except that any addresses within the container's
+region which are not claimed by any subregion are handled by the
+container itself (ie by its MMIO callbacks or RAM backing). However
+it is generally possible to achieve the same effect with a pure container
+one of whose subregions is a low priority "background" region covering
+the whole address range; this is often clearer and is preferred.
+Subregions cannot be added to an alias region.
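+
+As an illustration, here is a minimal sketch of a container holding a RAM
+subregion and an MMIO subregion. The device, the ``FooState`` structure, the
+sizes and the callbacks are all hypothetical; only the memory API calls are
+real:
+
+.. code:: c
+
+    static uint64_t foo_mmio_read(void *opaque, hwaddr addr, unsigned size)
+    {
+        return 0; /* hypothetical register read */
+    }
+
+    static void foo_mmio_write(void *opaque, hwaddr addr, uint64_t val,
+                               unsigned size)
+    {
+        /* hypothetical register write */
+    }
+
+    static const MemoryRegionOps foo_mmio_ops = {
+        .read = foo_mmio_read,
+        .write = foo_mmio_write,
+        .endianness = DEVICE_NATIVE_ENDIAN,
+    };
+
+    /* In the device's realize/instance_init; "s" is the device state */
+    memory_region_init(&s->container, OBJECT(s), "foo", 0x2000);
+    memory_region_init_ram(&s->ram, OBJECT(s), "foo.ram", 0x1000,
+                           &error_fatal);
+    memory_region_init_io(&s->mmio, OBJECT(s), &foo_mmio_ops, s,
+                          "foo.mmio", 0x1000);
+    memory_region_add_subregion(&s->container, 0x0000, &s->ram);
+    memory_region_add_subregion(&s->container, 0x1000, &s->mmio);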
+
+Migration
+---------
+
+Where the memory region is backed by host memory (RAM, ROM and
+ROM device memory region types), this host memory needs to be
+copied to the destination on migration. The following APIs, which allocate
+the host memory for you, will also register that memory so it is
+migrated:
+
+- memory_region_init_ram()
+- memory_region_init_rom()
+- memory_region_init_rom_device()
+
+For most devices and boards this is the correct thing. If you
+have a special case where you need to manage the migration of
+the backing memory yourself, you can call the functions:
+
+- memory_region_init_ram_nomigrate()
+- memory_region_init_rom_nomigrate()
+- memory_region_init_rom_device_nomigrate()
+
+which only initialize the MemoryRegion and leave handling
+migration to the caller.
+
+The functions:
+
+- memory_region_init_resizeable_ram()
+- memory_region_init_ram_from_file()
+- memory_region_init_ram_from_fd()
+- memory_region_init_ram_ptr()
+- memory_region_init_ram_device_ptr()
+
+are for special cases only, and so they do not automatically
+register the backing memory for migration; the caller must
+manage migration if necessary.
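+
+If you do need to manage migration of the backing memory yourself, one
+common approach (a sketch; the device and names are hypothetical) is to
+allocate with a ``_nomigrate`` variant and then register the RAM
+explicitly:
+
+.. code:: c
+
+    memory_region_init_ram_nomigrate(&s->buf, OBJECT(s), "foo.buf",
+                                     0x10000, &error_fatal);
+    /* the caller is now responsible for migrating the contents */
+    vmstate_register_ram(&s->buf, DEVICE(s));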
+
+Region names
+------------
+
+Regions are assigned names by the constructor. For most regions these are
+only used for debugging purposes, but RAM regions also use the name to identify
+live migration sections. This means that RAM region names need to have ABI
+stability.
+
+Region lifecycle
+----------------
+
+A region is created by one of the memory_region_init*() functions and
+attached to an object, which acts as its owner or parent. QEMU ensures
+that the owner object remains alive as long as the region is visible to
+the guest, or as long as the region is in use by a virtual CPU or another
+device. For example, the owner object will not die between an
+address_space_map operation and the corresponding address_space_unmap.
+
+After creation, a region can be added to an address space or a
+container with memory_region_add_subregion(), and removed using
+memory_region_del_subregion().
+
+Various region attributes (read-only, dirty logging, coalesced mmio,
+ioeventfd) can be changed during the region lifecycle. They take effect
+as soon as the region is made visible. This can be immediately, later,
+or never.
+
+Destruction of a memory region happens automatically when the owner
+object dies.
+
+If however the memory region is part of a dynamically allocated data
+structure, you should call object_unparent() to destroy the memory region
+before the data structure is freed. For an example see VFIOMSIXInfo
+and VFIOQuirk in hw/vfio/pci.c.
+
+You must not destroy a memory region as long as it may be in use by a
+device or CPU. In order to do this, as a general rule do not create or
+destroy memory regions dynamically during a device's lifetime, and only
+call object_unparent() in the memory region owner's instance_finalize
+callback. The dynamically allocated data structure that contains the
+memory region then should obviously be freed in the instance_finalize
+callback as well.
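+
+A sketch of what this can look like, for a hypothetical device whose
+dynamically allocated ``FooQuirk`` structure embeds a MemoryRegion
+(``FOO()`` is the usual QOM cast macro for the device; all names here are
+made up):
+
+.. code:: c
+
+    static void foo_instance_finalize(Object *obj)
+    {
+        FooState *s = FOO(obj);
+
+        /* The region can no longer be in use by a CPU or device here */
+        object_unparent(OBJECT(&s->quirk->mr));
+        g_free(s->quirk);
+    }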
+
+If you break this rule, the following situation can happen:
+
+- the memory region's owner had a reference taken via memory_region_ref
+ (for example by address_space_map)
+
+- the region is unparented, and has no owner anymore
+
+- when address_space_unmap is called, the reference to the memory region's
+ owner is leaked.
+
+
+There is an exception to the above rule: it is okay to call
+object_unparent at any time for an alias or a container region. It is
+therefore also okay to create or destroy alias and container regions
+dynamically during a device's lifetime.
+
+This exceptional usage is valid because aliases and containers only help
+QEMU building the guest's memory map; they are never accessed directly.
+memory_region_ref and memory_region_unref are never called on aliases
+or containers, and the above situation then cannot happen. Exploiting
+this exception is rarely necessary, and therefore it is discouraged,
+but nevertheless it is used in a few places.
+
+For regions that "have no owner" (NULL is passed at creation time), the
+machine object is actually used as the owner. Since instance_finalize is
+never called for the machine object, you must never call object_unparent
+on regions that have no owner, unless they are aliases or containers.
+
+
+Overlapping regions and priority
+--------------------------------
+Usually, regions may not overlap each other; a memory address decodes into
+exactly one target. In some cases it is useful to allow regions to overlap,
+and sometimes to control which of the overlapping regions is visible to the
+guest. This is done with memory_region_add_subregion_overlap(), which
+allows the region to overlap any other region in the same container, and
+specifies a priority that allows the core to decide which of two regions at
+the same address are visible (highest wins).
+Priority values are signed, and the default value is zero. This means that
+you can use memory_region_add_subregion_overlap() both to specify a region
+that must sit 'above' any others (with a positive priority) and also a
+background region that sits 'below' others (with a negative priority).
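+
+For example (a sketch; the container, regions and offsets are hypothetical
+fields of a device state), a device could map a background RAM region
+underneath everything else and overlay an MMIO window on part of it:
+
+.. code:: c
+
+    /* background region, sits below any other subregion */
+    memory_region_add_subregion_overlap(&s->container, 0x0, &s->ram, -1);
+    /* window that must win over the RAM where the two overlap */
+    memory_region_add_subregion_overlap(&s->container, 0x4000, &s->mmio, 1);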
+
+If the higher priority region in an overlap is a container or alias, then
+the lower priority region will appear in any "holes" that the higher priority
+region has left by not mapping subregions to that area of its address range.
+(This applies recursively -- if the subregions are themselves containers or
+aliases that leave holes then the lower priority region will appear in these
+holes too.)
+
+For example, suppose we have a container A of size 0x8000 with two subregions
+B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
+an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
+of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
+offset 0x2000. As a diagram::
+
+ 0 1000 2000 3000 4000 5000 6000 7000 8000
+ |------|------|------|------|------|------|------|------|
+ A: [ ]
+ C: [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
+ B: [ ]
+ D: [DDDDD]
+ E: [EEEEE]
+
+The regions that will be seen within this address range then are::
+
+ [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]
+
+Since B has higher priority than C, its subregions appear in the flat map
+even where they overlap with C. In ranges where B has not mapped anything
+C's region appears.
+
+If B had provided its own MMIO operations (ie it was not a pure container)
+then these would be used for any addresses in its range not handled by
+D or E, and the result would be::
+
+ [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]
+
+Priority values are local to a container, because the priorities of two
+regions are only compared when they are both children of the same container.
+This means that the device in charge of the container (typically modelling
+a bus or a memory controller) can use them to manage the interaction of
+its child regions without any side effects on other parts of the system.
+In the example above, the priorities of D and E are unimportant because
+they do not overlap each other. It is the relative priority of B and C
+that causes D and E to appear on top of C: D and E's priorities are never
+compared against the priority of C.
+
+Visibility
+----------
+The memory core uses the following rules to select a memory region when the
+guest accesses an address:
+
+- all direct subregions of the root region are matched against the address, in
+ descending priority order
+
+ - if the address lies outside the region offset/size, the subregion is
+ discarded
+ - if the subregion is a leaf (RAM or MMIO), the search terminates, returning
+ this leaf region
+ - if the subregion is a container, the same algorithm is used within the
+ subregion (after the address is adjusted by the subregion offset)
+ - if the subregion is an alias, the search is continued at the alias target
+ (after the address is adjusted by the subregion offset and alias offset)
+ - if a recursive search within a container or alias subregion does not
+ find a match (because of a "hole" in the container's coverage of its
+ address range), then if this is a container with its own MMIO or RAM
+ backing the search terminates, returning the container itself. Otherwise
+ we continue with the next subregion in priority order
+
+- if none of the subregions match the address then the search terminates
+ with no match found
+
+Example memory map
+------------------
+
+::
+
+ system_memory: container@0-2^48-1
+ |
+ +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
+ |
+ +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
+ |
+ +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
+ | (prio 1)
+ |
+ +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)
+
+ pci (0-2^32-1)
+ |
+ +--- vga-area: container@0xa0000-0xbffff
+ | |
+ | +--- alias@0x00000-0x7fff ---> #vram (0x010000-0x017fff)
+ | |
+ | +--- alias@0x08000-0xffff ---> #vram (0x020000-0x027fff)
+ |
+ +---- vram: ram@0xe1000000-0xe1ffffff
+ |
+ +---- vga-mmio: mmio@0xe2000000-0xe200ffff
+
+ ram: ram@0x00000000-0xffffffff
+
+This is a (simplified) PC memory map. The 4GB RAM block is mapped into the
+system address space via two aliases: "lomem" is a 1:1 mapping of the first
+3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the
+so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with
+4GB of memory.
+
+The memory controller diverts addresses in the range 640K-768K to the PCI
+address space. This is modelled using the "vga-window" alias, mapped at a
+higher priority so it obscures the RAM at the same addresses. The vga window
+can be removed by programming the memory controller; this is modelled by
+removing the alias and exposing the RAM underneath.
+
+The pci address space is not a direct child of the system address space, since
+we only want parts of it to be visible (we accomplish this using aliases).
+It has two subregions: vga-area models the legacy vga window and is occupied
+by two 32K memory banks pointing at two sections of the framebuffer.
+In addition the vram is mapped as a BAR at address e1000000, and an additional
+BAR containing MMIO registers is mapped after it.
+
+Note that if the guest maps a BAR outside the PCI hole, it would not be
+visible as the pci-hole alias clips it to a 0.5GB range.
+
+MMIO Operations
+---------------
+
+MMIO regions are provided with ->read() and ->write() callbacks,
+which are sufficient for most devices. Some devices change behaviour
+based on the attributes used for the memory transaction, or need
+to be able to respond that the access should provoke a bus error
+rather than completing successfully; those devices can use the
+->read_with_attrs() and ->write_with_attrs() callbacks instead.
+
+In addition various constraints can be supplied to control how these
+callbacks are called:
+
+- .valid.min_access_size, .valid.max_access_size define the access sizes
+ (in bytes) which the device accepts; accesses outside this range will
+ have device and bus specific behaviour (ignored, or machine check)
+- .valid.unaligned specifies that the *device being modelled* supports
+ unaligned accesses; if false, unaligned accesses will invoke the
+ appropriate bus or CPU specific behaviour.
+- .impl.min_access_size, .impl.max_access_size define the access sizes
+ (in bytes) supported by the *implementation*; other access sizes will be
+ emulated using the ones available. For example a 4-byte write will be
+ emulated using four 1-byte writes, if .impl.max_access_size = 1.
+- .impl.unaligned specifies that the *implementation* supports unaligned
+ accesses; if false, unaligned accesses will be emulated by two aligned
+ accesses.
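+
+Putting this together, a hypothetical device that accepts 1 to 4 byte
+accesses from the guest but whose implementation only handles aligned
+4-byte accesses might declare its operations as follows (``foo_read`` and
+``foo_write`` are made-up callbacks):
+
+.. code:: c
+
+    static const MemoryRegionOps foo_ops = {
+        .read = foo_read,
+        .write = foo_write,
+        .endianness = DEVICE_LITTLE_ENDIAN,
+        .valid = {
+            .min_access_size = 1,
+            .max_access_size = 4,
+            .unaligned = false,
+        },
+        .impl = {
+            .min_access_size = 4,
+            .max_access_size = 4,
+        },
+    };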
+
+API Reference
+-------------
+
+.. kernel-doc:: include/exec/memory.h
diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
new file mode 100644
index 000000000..240125348
--- /dev/null
+++ b/docs/devel/migration.rst
@@ -0,0 +1,883 @@
+=========
+Migration
+=========
+
+QEMU has code to load/save the state of the guest that it is running.
+These are two complementary operations. Saving the state just does
+that: it saves the state of each device that the guest is running.
+Restoring a guest is just the opposite operation: we need to load the
+state of each device.
+
+For this to work, QEMU has to be launched with the same arguments both
+times. I.e. it can only restore the state into a guest that has the
+same devices as the one whose state was saved (this last requirement
+can be relaxed a bit, but for now we can consider that the
+configuration has to be exactly the same).
+
+Once we are able to save/restore a guest, a new piece of functionality
+is requested: migration. This means that QEMU is able to start on one
+machine and be "migrated" to another machine, i.e. moved to another
+machine.
+
+Next came the "live migration" functionality. This is important
+because some guests run with a lot of state (especially RAM), and it
+can take a while to move all of that state from one machine to another.
+Live migration allows the guest to continue running while the state is
+transferred; the guest only has to be stopped while the last part of
+the state is transferred. Typically the time that the guest is
+unresponsive during live migration is in the low hundreds of
+milliseconds (although this depends on many factors).
+
+Transports
+==========
+
+The migration stream is normally just a byte stream that can be passed
+over any transport.
+
+- tcp migration: do the migration using tcp sockets
+- unix migration: do the migration using unix sockets
+- exec migration: do the migration using the stdin/stdout of a process.
+- fd migration: do the migration using a file descriptor that is
+ passed to QEMU. QEMU doesn't care how this file descriptor is opened.
+
+In addition, support is included for migration using RDMA, which
+transports the page data using ``RDMA``, where the hardware takes care of
+transporting the pages, and the load on the CPU is much lower. While the
+internals of RDMA migration are a bit different, this isn't really visible
+outside the RAM migration code.
+
+All these migration protocols use the same infrastructure to
+save/restore state devices. This infrastructure is shared with the
+savevm/loadvm functionality.
+
+Debugging
+=========
+
+The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.
+
+Example usage:
+
+.. code-block:: shell
+
+ $ qemu-system-x86_64 -display none -monitor stdio
+ (qemu) migrate "exec:cat > mig"
+ (qemu) q
+ $ ./scripts/analyze-migration.py -f mig
+ {
+ "ram (3)": {
+ "section sizes": {
+ "pc.ram": "0x0000000008000000",
+ ...
+
+See also ``analyze-migration.py -h`` help for more options.
+
+Common infrastructure
+=====================
+
+The files, sockets or fd's that carry the migration stream are abstracted by
+the ``QEMUFile`` type (see ``migration/qemu-file.h``). In most cases this
+is connected to a subtype of ``QIOChannel`` (see ``io/``).
+
+
+Saving the state of one device
+==============================
+
+For most devices, the state is saved in a single call to the migration
+infrastructure; these are *non-iterative* devices. The data for these
+devices is sent at the end of precopy migration, when the CPUs are paused.
+There are also *iterative* devices, which contain a very large amount of
+data (e.g. RAM or large tables). See the iterative device section below.
+
+General advice for device developers
+------------------------------------
+
+- The migration state saved should reflect the device being modelled rather
+ than the way your implementation works. That way if you change the implementation
+ later the migration stream will stay compatible. That model may include
+ internal state that's not directly visible in a register.
+
+- When saving a migration stream the device code may walk and check
+ the state of the device. These checks might fail in various ways (e.g.
+ discovering internal state is corrupt or that the guest has done something bad).
+ Consider carefully before asserting/aborting at this point, since the
+ normal response from users is that *migration broke their VM* since it had
+ apparently been running fine until then. In these error cases, the device
+ should log a message indicating the cause of error, and should consider
+ putting the device into an error state, allowing the rest of the VM to
+ continue execution.
+
+- The migration might happen at an inconvenient point,
+ e.g. right in the middle of the guest reprogramming the device, during
+ guest reboot or shutdown or while the device is waiting for external IO.
+ It's strongly preferred that migrations do not fail in this situation,
+ since in the cloud environment migrations might happen automatically to
+ VMs that the administrator doesn't directly control.
+
+- If you do need to fail a migration, ensure that sufficient information
+ is logged to identify what went wrong.
+
+- The destination should treat an incoming migration stream as hostile
+ (which we do to varying degrees in the existing code). Check that offsets
+ into buffers and the like can't cause overruns. Fail the incoming migration
+ in the case of a corrupted stream like this.
+
+- Take care with internal device state or behaviour that might become
+ migration version dependent. For example, the order of PCI capabilities
+ is required to stay constant across migration. Another example would
+ be that a special case handled by subsections (see below) might become
+ much more common if a default behaviour is changed.
+
+- The state of the source should not be changed or destroyed by the
+ outgoing migration. Migrations timing out or being failed by
+ higher levels of management, or failures of the destination host are
+ not unusual, and in that case the VM is restarted on the source.
+ Note that the management layer can validly revert the migration
+ even though the QEMU level of migration has succeeded as long as it
+ does it before starting execution on the destination.
+
+- Buses and devices should be able to explicitly specify addresses when
+ instantiated, and management tools should use those. For example,
+ when hot adding USB devices it's important to specify the ports
+ and addresses, since implicit ordering based on the command line order
+ may be different on the destination. This can result in the
+ device state being loaded into the wrong device.
+
+VMState
+-------
+
+Most device data can be described using the ``VMSTATE`` macros (mostly defined
+in ``include/migration/vmstate.h``).
+
+An example (from hw/input/pckbd.c)
+
+.. code:: c
+
+ static const VMStateDescription vmstate_kbd = {
+ .name = "pckbd",
+ .version_id = 3,
+ .minimum_version_id = 3,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT8(write_cmd, KBDState),
+ VMSTATE_UINT8(status, KBDState),
+ VMSTATE_UINT8(mode, KBDState),
+ VMSTATE_UINT8(pending, KBDState),
+ VMSTATE_END_OF_LIST()
+ }
+ };
+
+We are declaring the state with name "pckbd".
+The ``version_id`` is 3, and the fields are 4 uint8_t in a KBDState structure.
+We registered this with:
+
+.. code:: c
+
+ vmstate_register(NULL, 0, &vmstate_kbd, s);
+
+For devices that are ``qdev`` based, we can register the device in the class
+init function:
+
+.. code:: c
+
+ dc->vmsd = &vmstate_kbd_isa;
+
+The VMState macros take care of ensuring that the device data section
+is formatted portably (normally big endian) and make some compile time checks
+against the types of the fields in the structures.
+
+VMState macros can include other VMStateDescriptions to store substructures
+(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
+arrays (``VMSTATE_VARRAY_``). Various other macros exist for special
+cases.
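+
+For example (a sketch; ``FooState`` and ``FooChannel`` are hypothetical), a
+fixed-size array and an embedded substructure can be described as:
+
+.. code:: c
+
+    static const VMStateDescription vmstate_foo_channel = {
+        .name = "foo/channel",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .fields = (VMStateField[]) {
+            VMSTATE_UINT32(ctrl, FooChannel),
+            VMSTATE_UINT32(status, FooChannel),
+            VMSTATE_END_OF_LIST()
+        }
+    };
+
+    static const VMStateDescription vmstate_foo = {
+        .name = "foo",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .fields = (VMStateField[]) {
+            VMSTATE_UINT32_ARRAY(regs, FooState, 8),
+            VMSTATE_STRUCT(chan, FooState, 1, vmstate_foo_channel,
+                           FooChannel),
+            VMSTATE_END_OF_LIST()
+        }
+    };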
+
+Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
+ends up with a 4 byte bigendian representation on the wire; in the future
+it might be possible to use a more structured format.
+
+Legacy way
+----------
+
+This way is going to disappear as soon as all current users are ported to VMSTATE;
+although converting existing code can be tricky, and thus 'soon' is relative.
+
+Each device has to register two functions, one to save the state and
+another to load the state back.
+
+.. code:: c
+
+ int register_savevm_live(const char *idstr,
+ int instance_id,
+ int version_id,
+ SaveVMHandlers *ops,
+ void *opaque);
+
+Two functions in the ``ops`` structure are the ``save_state``
+and ``load_state`` functions. Notice that ``load_state`` receives a version_id
+parameter to know what state format it is receiving. ``save_state`` doesn't
+have a version_id parameter because it always uses the latest version.
+
+Note that because the VMState macros still save the data in a raw
+format, in many cases it's possible to replace legacy code
+with a carefully constructed VMState description that matches the
+byte layout of the existing code.
+
+Changing migration data structures
+----------------------------------
+
+When we migrate a device, we save/load the state as a series
+of fields. Sometimes, due to bugs or new functionality, we need to
+change the state to store more/different information. Changing the migration
+state saved for a device can break migration compatibility unless
+care is taken to use the appropriate techniques. In general QEMU tries
+to maintain forward migration compatibility (i.e. migrating from
+QEMU n->n+1) and there are users who benefit from backward compatibility
+as well.
+
+Subsections
+-----------
+
+The most common structure change is adding new data, e.g. when adding
+a newer form of device, or adding state that you previously
+forgot to migrate. This is best solved using a subsection.
+
+A subsection is "like" a device vmstate, but with a particularity: it
+has a Boolean function that tells whether its values need to be sent
+or not. If this function returns false, the subsection is not sent.
+Subsections have a unique name, which is looked up on the receiving
+side.
+
+On the receiving side, if we find a subsection for a device that we
+don't understand, we just fail the migration. If we understand all
+the subsections, then we load the state successfully. There's no check
+that a subsection is loaded, so a newer QEMU that knows about a subsection
+can (with care) load a stream from an older QEMU that didn't send
+the subsection.
+
+If the new data is only needed in a rare case, then the subsection
+can be made conditional on that case and the migration will still
+succeed to older QEMUs in most cases. This is OK for data that's
+critical, but in some use cases it's preferred that the migration
+should succeed even with the data missing. To support this the
+subsection can be connected to a device property and from there
+to a versioned machine type.
+
+The 'pre_load' and 'post_load' functions on subsections are only
+called if the subsection is loaded.
+
+One important note is that the outer post_load() function is called "after"
+loading all subsections, because a newer subsection could change the same
+value that it uses. A flag, and the combination of outer pre_load and
+post_load can be used to detect whether a subsection was loaded, and to
+fall back on default behaviour when the subsection isn't present.
+
+Example:
+
+.. code:: c
+
+ static bool ide_drive_pio_state_needed(void *opaque)
+ {
+ IDEState *s = opaque;
+
+ return ((s->status & DRQ_STAT) != 0)
+ || (s->bus->error_status & BM_STATUS_PIO_RETRY);
+ }
+
+ const VMStateDescription vmstate_ide_drive_pio_state = {
+ .name = "ide_drive/pio_state",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .pre_save = ide_drive_pio_pre_save,
+ .post_load = ide_drive_pio_post_load,
+ .needed = ide_drive_pio_state_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_INT32(req_nb_sectors, IDEState),
+ VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
+ vmstate_info_uint8, uint8_t),
+ VMSTATE_INT32(cur_io_buffer_offset, IDEState),
+ VMSTATE_INT32(cur_io_buffer_len, IDEState),
+ VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
+ VMSTATE_INT32(elementary_transfer_size, IDEState),
+ VMSTATE_INT32(packet_transfer_size, IDEState),
+ VMSTATE_END_OF_LIST()
+ }
+ };
+
+ const VMStateDescription vmstate_ide_drive = {
+ .name = "ide_drive",
+ .version_id = 3,
+ .minimum_version_id = 0,
+ .post_load = ide_drive_post_load,
+ .fields = (VMStateField[]) {
+ .... several fields ....
+ VMSTATE_END_OF_LIST()
+ },
+ .subsections = (const VMStateDescription*[]) {
+ &vmstate_ide_drive_pio_state,
+ NULL
+ }
+ };
+
+Here we have a subsection for the pio state. We only need to
+save/send this state when we are in the middle of a pio operation
+(that is what ``ide_drive_pio_state_needed()`` checks). If DRQ_STAT is
+not enabled, the values in those fields are garbage and don't need to
+be sent.
+
+Connecting subsections to properties
+------------------------------------
+
+Using a condition function that checks a 'property' to determine whether
+to send a subsection allows backward migration compatibility when
+new subsections are added, especially when combined with versioned
+machine types.
+
+For example:
+
+ a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
+ default it to true.
+ b) Add an entry to the ``hw_compat_`` for the previous version that sets
+ the property to false.
+ c) Add a static bool support_foo function that tests the property.
+ d) Add a subsection with a .needed set to the support_foo function
+ e) (potentially) Add an outer pre_load that sets up a default value
+ for 'foo' to be used if the subsection isn't loaded.
+
+Now that subsection will not be generated when using an older
+machine type and the migration stream will be accepted by older
+QEMU versions.
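+
+A sketch of steps (a) to (d) for a hypothetical ``foo-device`` with a
+``FooState`` structure (the property, field and function names are all
+made up; the snippets live in different places in the device and machine
+code):
+
+.. code:: c
+
+    /* (a) the property, defaulting to true */
+    DEFINE_PROP_BOOL("support-foo", FooState, support_foo, true),
+
+    /* (b) entry added to the hw_compat_ array of the previous version */
+    { "foo-device", "support-foo", "false" },
+
+    /* (c) test function used by the subsection */
+    static bool foo_subsection_needed(void *opaque)
+    {
+        FooState *s = opaque;
+
+        return s->support_foo;
+    }
+
+    /* (d) the subsection itself */
+    static const VMStateDescription vmstate_foo_subsection = {
+        .name = "foo-device/foo",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .needed = foo_subsection_needed,
+        .fields = (VMStateField[]) {
+            VMSTATE_UINT32(foo, FooState),
+            VMSTATE_END_OF_LIST()
+        }
+    };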
+
+Not sending existing elements
+-----------------------------
+
+Sometimes members of the VMState are no longer needed:
+
+ - removing them will break migration compatibility
+
+ - making them version dependent and bumping the version will break backward migration
+ compatibility.
+
+Adding a dummy field into the migration stream is normally the best way to preserve
+compatibility.
+
+If the field really does need to be removed then:
+
+ a) Add a new property/compatibility/function in the same way for subsections above.
+ b) replace the VMSTATE macro with the _TEST version of the macro, e.g.:
+
+ ``VMSTATE_UINT32(foo, barstruct)``
+
+ becomes
+
+ ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``
+
+ Sometime in the future when we no longer care about the ancient versions these can be killed off.
+ Note that for backward compatibility it's important to fill in the structure with
+ data that the destination will understand.
+
+Any difference in the predicates on the source and destination will end up
+with different fields being enabled and data being loaded into the wrong
+fields; for this reason conditional fields like this are very fragile.
+
+Versions
+--------
+
+Version numbers are intended for major incompatible changes to the
+migration of a device, and using them breaks backward-migration
+compatibility; in general most changes can be made by adding Subsections
+(see above) or _TEST macros (see above) which won't break compatibility.
+
+Each version is associated with a series of fields saved. The ``save_state`` always saves
+the state as the newest version. But ``load_state`` is sometimes able to
+load state from an older version.
+
+You can see that there are several version fields:
+
+- ``version_id``: the maximum version_id supported by VMState for that device.
+- ``minimum_version_id``: the minimum version_id that VMState is able to understand
+ for that device.
+- ``minimum_version_id_old``: For devices that were not able to port to vmstate, we can
+ assign a function that knows how to read this old state. This field is
+ ignored if there is no ``load_state_old`` handler.
+
+VMState is able to read versions from minimum_version_id to
+version_id. And the function ``load_state_old()`` (if present) is able to
+load state from minimum_version_id_old to minimum_version_id. This
+function is deprecated and will be removed when no more users are left.
+
+There are *_V* forms of many ``VMSTATE_`` macros to load fields for version dependent fields,
+e.g.
+
+.. code:: c
+
+ VMSTATE_UINT16_V(ip_id, Slirp, 2),
+
+only loads that field for versions 2 and newer.
+
+Saving state will always create a section with the 'version_id' value
+and thus can't be loaded by any older QEMU.
+
+Massaging functions
+-------------------
+
+Sometimes it is not enough to be able to save the state directly
+from one structure; we need to fill in the correct values first. One
+example is when we are using kvm. Before saving the cpu state, we
+need to ask kvm to copy to QEMU the state that it is using. And the
+opposite when we are loading the state: we need a way to tell kvm to
+load the state for the cpu that we have just loaded from the QEMUFile.
+
+The functions to do that are inside a vmstate definition, and are called:
+
+- ``int (*pre_load)(void *opaque);``
+
+ This function is called before we load the state of one device.
+
+- ``int (*post_load)(void *opaque, int version_id);``
+
+ This function is called after we load the state of one device.
+
+- ``int (*pre_save)(void *opaque);``
+
+ This function is called before we save the state of one device.
+
+- ``int (*post_save)(void *opaque);``
+
+ This function is called after we save the state of one device
+ (even upon failure, unless the call to pre_save returned an error).
+
+Example: You can look at hpet.c, that uses the first three functions
+to massage the state that is transferred.
+
+The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
+data doesn't match the stored device data well; it allows an
+intermediate temporary structure to be populated with migration
+data and then transferred to the main structure.
+
+If you use memory API functions that update memory layout outside
+initialization (i.e., in response to a guest action), this is a strong
+indication that you need to call these functions in a ``post_load`` callback.
+Examples of such memory API functions are:
+
+ - memory_region_add_subregion()
+ - memory_region_del_subregion()
+ - memory_region_set_readonly()
+ - memory_region_set_nonvolatile()
+ - memory_region_set_enabled()
+ - memory_region_set_address()
+ - memory_region_set_alias_offset()
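+
+For example (a sketch; the device, its fields and the ``FOO_WINDOW_EN``
+flag are hypothetical), a ``post_load`` callback that re-applies a mapping
+the guest programmed before migration might look like:
+
+.. code:: c
+
+    static int foo_post_load(void *opaque, int version_id)
+    {
+        FooState *s = opaque;
+
+        /* re-apply the window the guest configured before migration */
+        memory_region_set_address(&s->window, s->window_base);
+        memory_region_set_enabled(&s->window, s->ctrl & FOO_WINDOW_EN);
+        return 0;
+    }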
+
+Iterative device migration
+--------------------------
+
+Some devices, such as RAM, Block storage or certain platform devices,
+have large amounts of data that would mean that the CPUs would be
+paused for too long if they were sent in one section. For these
+devices an *iterative* approach is taken.
+
+The iterative devices generally don't use VMState macros
+(although it may be possible in some cases) and instead use
+qemu_put_*/qemu_get_* macros to read/write data to the stream. Specialist
+versions exist for high bandwidth IO.
+
+
+An iterative device must provide:
+
+ - A ``save_setup`` function that initialises the data structures and
+ transmits a first section containing information on the device. In the
+ case of RAM this transmits a list of RAMBlocks and sizes.
+
+ - A ``load_setup`` function that initialises the data structures on the
+ destination.
+
+ - A ``save_live_pending`` function that is called repeatedly and must
+   indicate how much more data the iterative device must save. The core
+ migration code will use this to determine when to pause the CPUs
+ and complete the migration.
+
+ - A ``save_live_iterate`` function (called after ``save_live_pending``
+ when there is significant data still to be sent). It should send
+ a chunk of data until the point that stream bandwidth limits tell it
+ to stop. Each call generates one section.
+
+ - A ``save_live_complete_precopy`` function that must transmit the
+ last section for the device containing any remaining data.
+
+ - A ``load_state`` function used to load sections generated by
+ any of the save functions that generate sections.
+
+ - ``cleanup`` functions for both save and load that are called
+ at the end of migration.
+
+Note that the contents of the sections for iterative migration tend
+to be open-coded by the devices; care should be taken in parsing
+the results and structuring the stream to make them easy to validate.
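+
+A sketch of how an iterative device might wire these up (the ``foo_*``
+handler implementations and the device name are hypothetical):
+
+.. code:: c
+
+    static SaveVMHandlers savevm_foo_handlers = {
+        .save_setup = foo_save_setup,
+        .load_setup = foo_load_setup,
+        .save_live_pending = foo_save_pending,
+        .save_live_iterate = foo_save_iterate,
+        .save_live_complete_precopy = foo_save_complete,
+        .load_state = foo_load_state,
+        .save_cleanup = foo_save_cleanup,
+        .load_cleanup = foo_load_cleanup,
+    };
+
+    register_savevm_live("foo", 0, 1, &savevm_foo_handlers, s);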
+
+Device ordering
+---------------
+
+There are cases in which the ordering of device loading matters; for
+example in some systems where a device may assert an interrupt during loading,
+if the interrupt controller is loaded later then it might lose the state.
+
+Some ordering is implicitly provided by the order in which the machine
+definition creates devices, however this is somewhat fragile.
+
+The ``MigrationPriority`` enum provides a means of explicitly enforcing
+ordering. Numerically higher priorities are loaded earlier.
+The priority is set by setting the ``priority`` field of the top level
+``VMStateDescription`` for the device.
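+
+For example, an IOMMU model can ask to be loaded before the devices that
+depend on it (a sketch; the device name and fields are hypothetical):
+
+.. code:: c
+
+    static const VMStateDescription vmstate_foo_iommu = {
+        .name = "foo-iommu",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .priority = MIG_PRI_IOMMU, /* numerically higher, so loaded earlier */
+        .fields = (VMStateField[]) {
+            /* ... device fields ... */
+            VMSTATE_END_OF_LIST()
+        }
+    };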
+
+Stream structure
+================
+
+The stream tries to be word and endian agnostic, allowing migration between hosts
+of different characteristics running the same VM.
+
+ - Header
+
+ - Magic
+ - Version
+ - VM configuration section
+
+ - Machine type
+ - Target page bits
+ - List of sections
+ Each section contains a device, or one iteration of a device save.
+
+ - section type
+ - section id
+ - ID string (First section of each device)
+ - instance id (First section of each device)
+ - version id (First section of each device)
+ - <device data>
+ - Footer mark
+ - EOF mark
+ - VM Description structure
+ Consisting of a JSON description of the contents for analysis only
+
+The ``device data`` in each section consists of the data produced
+by the code described above. For non-iterative devices they have a single
+section; iterative devices have an initial and last section and a set
+of parts in between.
+Note that there is very little checking by the common code of the integrity
+of the ``device data`` contents, that's up to the devices themselves.
+The ``footer mark`` provides a little bit of protection for the case where
+the receiving side reads more or less data than expected.
+
+The ``ID string`` is normally unique, having been formed from a bus name
+and device address; PCI devices and storage devices hung off PCI controllers
+fit this pattern well. Some devices are fixed single instances (e.g. "pc-ram").
+Others (especially either older devices or system devices which for
+some reason don't have a bus concept) make use of the ``instance id``
+for otherwise identically named devices.
+
+Return path
+-----------
+
+Only a unidirectional stream is required for normal migration, however a
+``return path`` can be created when bidirectional communication is desired.
+This is primarily used by postcopy, but is also used to return a success
+flag to the source at the end of migration.
+
+``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
+path.
+
+ Source side
+
+ Forward path - written by migration thread
+ Return path - opened by main thread, read by return-path thread
+
+ Destination side
+
+ Forward path - read by main thread
+ Return path - opened by main thread, written by main thread AND postcopy
+ thread (protected by rp_mutex)
+
+Postcopy
+========
+
+'Postcopy' migration is a way to deal with migrations that refuse to converge
+(or take too long to converge). Its plus side is that there is an upper bound on
+the amount of migration traffic and time it takes; the down side is that during
+the postcopy phase, a failure of *either* side or the network connection causes
+the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+Enabling postcopy
+-----------------
+
+To enable postcopy, issue this command on the monitor (both source and
+destination) prior to the start of migration:
+
+``migrate_set_capability postcopy-ram on``
+
+The normal commands are then used to start a migration, which is still
+started in precopy mode. Issuing:
+
+``migrate_start_postcopy``
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on. Issuing it after the end of a migration is harmless.
+
+Blocktime is a postcopy live migration metric, intended to show how
+long a vCPU was in a state of interruptible sleep due to a pagefault.
+That metric is calculated both for all vCPUs as an overlapping value, and
+separately for each vCPU. These values are calculated on the destination
+side. To enable postcopy blocktime calculation, enter the following
+command on the destination monitor:
+
+``migrate_set_capability postcopy-blocktime on``
+
+Postcopy blocktime can be retrieved by the ``query-migrate`` QMP command.
+The ``postcopy-blocktime`` value shows the overlapping blocking
+time for all vCPUs, while ``postcopy-vcpu-blocktime`` shows a list of
+blocking times per vCPU.
+
+.. note::
+ During the postcopy phase, the bandwidth limits set using
+  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
+ the destination is waiting for).
+
+Postcopy device transfer
+------------------------
+
+Loading of device data may cause the device emulation to access guest RAM,
+which may trigger faults that have to be resolved by the source; as such,
+the migration stream has to be able to respond with page data *during* the
+device load, and hence the device data has to be read from the stream completely
+before the device load begins to free the stream up. This is achieved by
+'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+----------------
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+ - Command: 'postcopy listen'
+ - The device state
+
+   A series of sections, identical to the precopy stream's device state stream,
+   containing everything except postcopiable devices (i.e. RAM)
+ - Command: 'postcopy run'
+
+The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
+contents are formatted in the same way as the main migration stream.
+
+During postcopy the source scans the list of dirty pages and sends them
+to the destination without being requested (in much the same way as precopy),
+however when a page request is received from the destination, the dirty page
+scanning restarts from the requested location. This causes requested pages
+to be sent quickly, and also causes pages directly after the requested page
+to be sent quickly in the hope that those pages are likely to be used
+by the destination soon.
+
+Destination behaviour
+---------------------
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+::
+
+ ------------------------------------------------------------------------------
+ 1 2 3 4 5 6 7
+ main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
+ thread | |
+ | (page request)
+ | \___
+ v \
+ listen thread: --- page -- page -- page -- page -- page --
+
+ a b c
+ ------------------------------------------------------------------------------
+
+- On receipt of ``CMD_PACKAGED`` (1)
+
+ All the data associated with the package - the ( ... ) section in the diagram -
+ is read into memory, and the main thread recurses into qemu_loadvm_state_main
+ to process the contents of the package (2) which contains commands (3,6) and
+ devices (4...)
+
+- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
+
+ a new thread (a) is started that takes over servicing the migration stream,
+ while the main thread carries on loading the package. It loads normal
+ background page data (b) but if during a device load a fault happens (5)
+ the returned page (c) is loaded by the listen thread allowing the main
+ threads device load to carry on.
+
+- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
+
+ letting the destination CPUs start running. At the end of the
+ ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
+ is no longer used by migration, while the listen thread carries on servicing
+ page data until the end of migration.
+
+Postcopy states
+---------------
+
+Postcopy moves through a series of states (see postcopy_state) from
+ADVISE->DISCARD->LISTEN->RUNNING->END
+
+ - Advise
+
+ Set at the start of migration if postcopy is enabled, even
+ if it hasn't had the start command; here the destination
+ checks that its OS has the support needed for postcopy, and performs
+ setup to ensure the RAM mappings are suitable for later postcopy.
+ The destination will fail early in migration at this point if the
+ required OS support is not present.
+ (Triggered by reception of POSTCOPY_ADVISE command)
+
+ - Discard
+
+ Entered on receipt of the first 'discard' command; prior to
+ the first Discard being performed, hugepages are switched off
+ (using madvise) to ensure that no new huge pages are created
+ during the postcopy phase, and to cause any huge pages that
+ have discards on them to be broken.
+
+ - Listen
+
+ The first command in the package, POSTCOPY_LISTEN, switches
+ the destination state to Listen, and starts a new thread
+ (the 'listen thread') which takes over the job of receiving
+ pages off the migration stream, while the main thread carries
+ on processing the blob. With this thread able to process page
+ reception, the destination now 'sensitises' the RAM to detect
+ any access to missing pages (on Linux using the 'userfault'
+ system).
+
+ - Running
+
+ POSTCOPY_RUN causes the destination to synchronise all
+ state and start the CPUs and IO devices running. The main
+ thread now finishes processing the migration package and
+ now carries on as it would for normal precopy migration
+ (although it can't do the cleanup it would do as it
+ finishes a normal migration).
+
+ - End
+
+   The listen thread can now quit and perform the cleanup of migration
+   state; the migration is now complete.
+
+Source side page maps
+---------------------
+
+The source side keeps two bitmaps during postcopy; 'the migration bitmap'
+and 'unsent map'. The 'migration bitmap' is basically the same as in
+the precopy case, and holds a bit to indicate that a page is 'dirty' -
+i.e. needs sending. During the precopy phase this is updated as the CPU
+dirties pages, however during postcopy the CPUs are stopped and nothing
+should dirty anything any more.
+
+The 'unsent map' is used for the transition to postcopy. It is a bitmap that
+has a bit cleared whenever a page is sent to the destination, however during
+the transition to postcopy mode it is combined with the migration bitmap
+to form a set of pages that:
+
+ a) Have been sent but then redirtied (which must be discarded)
+ b) Have not yet been sent - which also must be discarded to cause any
+ transparent huge pages built during precopy to be broken.
+
+Note that the contents of the unsentmap are sacrificed during the calculation
+of the discard set and thus aren't valid once in postcopy. The dirtymap
+is still valid and is used to ensure that no page is sent more than once. Any
+request for a page that has already been sent is ignored. Duplicate requests
+such as this can happen as a page is sent at about the same time the
+destination accesses it.
+
+Postcopy with hugepages
+-----------------------
+
+Postcopy now works with hugetlbfs backed memory:
+
+ a) The linux kernel on the destination must support userfault on hugepages.
+ b) The huge-page configuration on the source and destination VMs must be
+ identical; i.e. RAMBlocks on both sides must use the same page size.
+ c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
+ RAM if it doesn't have enough hugepages, triggering (b) to fail.
+ Using ``-mem-prealloc`` enforces the allocation using hugepages.
+ d) Care should be taken with the size of hugepage used; postcopy with 2MB
+ hugepages works well, however 1GB hugepages are likely to be problematic
+ since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
+ and until the full page is transferred the destination thread is blocked.
+
+Postcopy with shared memory
+---------------------------
+
+Postcopy migration with shared memory needs explicit support from the other
+processes that share memory and from QEMU. There are restrictions on the type of
+shared memory that userfault can support.
+
+The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
+(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
+for hugetlbfs which may be a problem in some configurations).
+
+The vhost-user code in QEMU supports clients that have Postcopy support,
+and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
+to support postcopy.
+
+The client needs to open a userfaultfd and register the areas
+of memory that it maps with userfault. The client must then pass the
+userfaultfd back to QEMU together with a mapping table that allows
+fault addresses in the client's address space to be converted back to
+RAMBlock/offsets. The client's userfaultfd is added to the postcopy
+fault-thread and page requests are made on behalf of the client by QEMU.
+QEMU performs 'wake' operations on the client's userfaultfd to allow it
+to continue after a page has arrived.
+
+.. note::
+ There are two future improvements that would be nice:
+    a) Some way to make QEMU ignorant of the addresses in the client's
+ address space
+ b) Avoiding the need for QEMU to perform ufd-wake calls after the
+ pages have arrived
+
+Retro-fitting postcopy to existing clients is possible:
+ a) A mechanism is needed for the registration with userfault as above,
+ and the registration needs to be coordinated with the phases of
+ postcopy. In vhost-user extra messages are added to the existing
+ control channel.
+ b) Any thread that can block due to guest memory accesses must be
+ identified and the implication understood; for example if the
+ guest memory access is made while holding a lock then all other
+ threads waiting for that lock will also be blocked.
+
+Firmware
+========
+
+Migration migrates the copies of RAM and ROM, and thus when running
+on the destination it includes the firmware from the source. Even after
+resetting a VM, the old firmware is used. Only once QEMU has been restarted
+is the new firmware in use.
+
+- Changes in firmware size can cause changes in the required RAMBlock size
+ to hold the firmware and thus migration can fail. In practice it's best
+ to pad firmware images to convenient powers of 2 with plenty of space
+ for growth.
+
+- Care should be taken with device emulation code so that newer
+ emulation code can work with older firmware to allow forward migration.
+
+- Care should be taken with newer firmware so that backward migration
+ to older systems with older device emulation code will work.
+
+In some cases it may be best to tie specific firmware versions to specific
+versioned machine types to cut down on the combinations that will need
+support. This is also useful when newer versions of firmware outgrow
+the padding.
+
diff --git a/docs/devel/modules.rst b/docs/devel/modules.rst
new file mode 100644
index 000000000..8e999c4fa
--- /dev/null
+++ b/docs/devel/modules.rst
@@ -0,0 +1,5 @@
+============
+QEMU modules
+============
+
+.. kernel-doc:: include/qemu/module.h
diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst
new file mode 100644
index 000000000..e4801751f
--- /dev/null
+++ b/docs/devel/multi-process.rst
@@ -0,0 +1,968 @@
+Multi-process QEMU
+===================
+
+.. note::
+
+ This is the design document for multi-process QEMU. It does not
+ necessarily reflect the status of the current implementation, which
+ may lack features or be considerably different from what is described
+ in this document. This document is still useful as a description of
+ the goals and general direction of this feature.
+
+ Please refer to the following wiki for latest details:
+ https://wiki.qemu.org/Features/MultiProcessQEMU
+
+QEMU is often used as the hypervisor for virtual machines running in the
+Oracle cloud. Since one of the advantages of cloud computing is the
+ability to run many VMs from different tenants in the same cloud
+infrastructure, a guest that compromised its hypervisor could
+potentially use the hypervisor's access privileges to access data it is
+not authorized for.
+
+QEMU can be susceptible to security attacks because it is a large,
+monolithic program that provides many features to the VMs it services.
+Many of these features can be configured out of QEMU, but even a reduced
+configuration QEMU has a large amount of code a guest can potentially
+attack. Separating QEMU into multiple processes reduces the attack surface
+by helping to limit each component in the system to accessing only the
+resources that it needs to perform its job.
+
+QEMU services
+-------------
+
+QEMU can be broadly described as providing three main services. One is a
+VM control point, where VMs can be created, migrated, re-configured, and
+destroyed. A second is to emulate the CPU instructions within the VM,
+often accelerated by HW virtualization features such as Intel's VT
+extensions. Finally, it provides IO services to the VM by emulating HW
+IO devices, such as disk and network devices.
+
+A multi-process QEMU
+~~~~~~~~~~~~~~~~~~~~
+
+A multi-process QEMU involves separating QEMU services into separate
+host processes. Each of these processes can be given only the privileges
+it needs to provide its service, e.g., a disk service could be given
+access only to the disk images it provides, and not be allowed to
+access other files, or any network devices. An attacker who compromised
+this service would not be able to use this exploit to access files or
+devices beyond what the disk service was given access to.
+
+A QEMU control process would remain, but in multi-process mode, will
+have no direct interfaces to the VM. During VM execution, it would still
+provide the user interface to hot-plug devices or live migrate the VM.
+
+A first step in creating a multi-process QEMU is to separate IO services
+from the main QEMU program, which would continue to provide CPU
+emulation. i.e., the control process would also be the CPU emulation
+process. In a later phase, CPU emulation could be separated from the
+control process.
+
+Separating IO services
+----------------------
+
+Separating IO services into individual host processes is a good place to
+begin for a couple of reasons. One is that the sheer number of IO devices QEMU
+can emulate provides a large surface of interfaces which could potentially
+be exploited, and, indeed, has been a source of exploits in the past.
+Another is that the modular nature of QEMU device emulation code provides
+interface points where the QEMU functions that perform device emulation
+can be separated from the QEMU functions that manage the emulation of
+guest CPU instructions. The devices emulated in the separate process are
+referred to as remote devices.
+
+QEMU device emulation
+~~~~~~~~~~~~~~~~~~~~~
+
+QEMU uses an object oriented SW architecture for device emulation code.
+Configured objects are all compiled into the QEMU binary, then objects
+are instantiated by name when used by the guest VM. For example, the
+code to emulate a device named "foo" is always present in QEMU, but its
+instantiation code is only run when the device is included in the target
+VM. (e.g., via the QEMU command line as *-device foo*)
+
+The object model is hierarchical, so device emulation code names its
+parent object (such as "pci-device" for a PCI device) and QEMU will
+instantiate a parent object before calling the device's instantiation
+code.
+
+Current separation models
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to separate the device emulation code from the CPU emulation
+code, the device object code must run in a different process. There are
+a couple of existing QEMU features that can run emulation code
+separately from the main QEMU process. These are examined below.
+
+vhost user model
+^^^^^^^^^^^^^^^^
+
+Virtio guest device drivers can be connected to vhost user applications
+in order to perform their IO operations. This model uses special virtio
+device drivers in the guest and vhost user device objects in QEMU, but
+once the QEMU vhost user code has configured the vhost user application,
+mission-mode IO is performed by the application. The vhost user
+application is a daemon process that can be contacted via a known UNIX
+domain socket.
+
+vhost socket
+''''''''''''
+
+As mentioned above, one of the tasks of the vhost device object within
+QEMU is to contact the vhost application and send it configuration
+information about this device instance. As part of the configuration
+process, the application can also be sent other file descriptors over
+the socket, which then can be used by the vhost user application in
+various ways, some of which are described below.
+
+vhost MMIO store acceleration
+'''''''''''''''''''''''''''''
+
+VMs are often run using HW virtualization features via the KVM kernel
+driver. This driver allows QEMU to accelerate the emulation of guest CPU
+instructions by running the guest in a virtual HW mode. When the guest
+executes instructions that cannot be executed by virtual HW mode,
+execution returns to the KVM driver so it can inform QEMU to emulate the
+instructions in SW.
+
+One of the events that can cause a return to QEMU is when a guest device
+driver accesses an IO location. QEMU then dispatches the memory
+operation to the corresponding QEMU device object. In the case of a
+vhost user device, the memory operation would need to be sent over a
+socket to the vhost application. This path is accelerated by the QEMU
+virtio code by setting up an eventfd file descriptor through which the
+vhost application can receive MMIO store notifications directly from
+the KVM driver, instead of needing them to be sent to the QEMU process
+first.
+
+vhost interrupt acceleration
+''''''''''''''''''''''''''''
+
+Another optimization used by the vhost application is the ability to
+directly inject interrupts into the VM via the KVM driver, again,
+bypassing the need to send the interrupt back to the QEMU process first.
+The QEMU virtio setup code configures the KVM driver with an eventfd
+that triggers the device interrupt in the guest when the eventfd is
+written. This irqfd file descriptor is then passed to the vhost user
+application program.
+
+vhost access to guest memory
+''''''''''''''''''''''''''''
+
+The vhost application is also allowed to directly access guest memory,
+instead of needing to send the data as messages to QEMU. This is also
+done with file descriptors sent to the vhost user application by QEMU.
+These descriptors can be passed to ``mmap()`` by the vhost application
+to map the guest address space into the vhost application.
+
+IOMMUs introduce another level of complexity, since the address given to
+the guest virtio device to DMA to or from is not a guest physical
+address. This case is handled by having vhost code within QEMU register
+as a listener for IOMMU mapping changes. The vhost application maintains
+a cache of IOMMU translations: sending translation requests back to
+QEMU on cache misses, and in turn receiving flush requests from QEMU
+when mappings are purged.
+
+applicability to device separation
+''''''''''''''''''''''''''''''''''
+
+Much of the vhost model can be re-used by separated device emulation. In
+particular, the ideas of using a socket between QEMU and the device
+emulation application, using a file descriptor to inject interrupts into
+the VM via KVM, and allowing the application to ``mmap()`` the guest's
+memory should be re-used.
+
+There are, however, some notable differences between how a vhost
+application works and the needs of separated device emulation. The most
+basic is that vhost uses custom virtio device drivers which always
+trigger IO with MMIO stores. A separated device emulation model must
+work with existing IO device models and guest device drivers. MMIO loads
+break vhost store acceleration since they are synchronous - guest
+progress cannot continue until the load has been emulated. By contrast,
+stores are asynchronous; the guest can continue after the store event
+has been sent to the vhost application.
+
+Another difference is that in the vhost user model, a single daemon can
+support multiple QEMU instances. This is contrary to the security regime
+desired, in which the emulation application should only be allowed to
+access the files or devices the VM it's running on behalf of can access.
+
+qemu-io model
+^^^^^^^^^^^^^
+
+``qemu-io`` is a test harness used to test changes to the QEMU block backend
+object code (e.g., the code that implements disk images for disk driver
+emulation). ``qemu-io`` is not a device emulation application per se, but it
+does compile the QEMU block objects into a separate binary from the main
+QEMU one. This could be useful for disk device emulation, since its
+emulation applications will need to include the QEMU block objects.
+
+New separation model based on proxy objects
+-------------------------------------------
+
+A different model based on proxy objects in the QEMU program
+communicating with remote emulation programs could provide separation
+while minimizing the changes needed to the device emulation code. The
+rest of this section is a discussion of how a proxy object model would
+work.
+
+Remote emulation processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The remote emulation process will run the QEMU object hierarchy without
+modification. The device emulation objects will also be based on the
+QEMU code, because for anything but the simplest device, it would not
+be tractable to re-implement both the object model and the many device
+backends that QEMU has.
+
+The processes will communicate with the QEMU process over UNIX domain
+sockets. The processes can be executed either as standalone processes,
+or be executed by QEMU. In both cases, the host backends the emulation
+processes will provide are specified on their command lines, as they
+would be for QEMU. For example:
+
+::
+
+ disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
+ -blockdev driver=qcow2,node-name=drive0,file=file0
+
+would indicate that process *disk-proc* uses a qcow2 emulated disk named
+*drive0* (backed by the file *disk-file0*) as its backend.
+
+Emulation processes may emulate more than one guest controller. A common
+configuration might be to put all controllers of the same device class
+(e.g., disk, network, etc.) in a single process, so that all backends of
+the same type can be managed by a single QMP monitor.
+
+communication with QEMU
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The first argument to the remote emulation process will be a Unix domain
+socket that connects with the Proxy object. This is a required argument.
+
+::
+
+ disk-proc <socket number> <backend list>
+
+remote process QMP monitor
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Remote emulation processes can be monitored via QMP, similar to QEMU
+itself. The QMP monitor socket is specified in the same way as for a QEMU
+process:
+
+::
+
+ disk-proc -qmp unix:/tmp/disk-mon,server
+
+can be monitored over the UNIX socket path */tmp/disk-mon*.
+
+QEMU command line
+~~~~~~~~~~~~~~~~~
+
+Each remote device emulated in a remote process on the host is
+represented as a *-device* of type *pci-proxy-dev*. A socket
+sub-option to this option specifies the Unix socket that connects
+to the remote process. An *id* sub-option is required, and it should
+be the same id as used in the remote process.
+
+::
+
+ qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3
+
+can be used to add a device emulated in a remote process.
+
+
+QEMU management of remote processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU is not aware of the type of the remote PCI device. It is
+a pass-through device as far as QEMU is concerned.
+
+communication with emulation process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+primary channel
+'''''''''''''''
+
+The primary channel (referred to as com in the code) is used to bootstrap
+the remote process. It is also used to pass on device-agnostic commands
+like reset.
+
+per-device channels
+'''''''''''''''''''
+
+Each remote device communicates with QEMU using a dedicated communication
+channel. The proxy object sets up this channel using the primary
+channel during its initialization.
+
+QEMU device proxy objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU has an object model based on sub-classes inherited from the
+"object" super-class. The sub-classes that are of interest here are the
+"device" and "bus" sub-classes whose child sub-classes make up the
+device tree of a QEMU emulated system.
+
+The proxy object model will use device proxy objects to replace the
+device emulation code within the QEMU process. These objects will live
+in the same place in the object and bus hierarchies as the objects they
+replace. i.e., the proxy object for an LSI SCSI controller will be a
+sub-class of the "pci-device" class, and will have the same PCI bus
+parent and the same SCSI bus child objects as the LSI controller object
+it replaces.
+
+It is worth noting that the same proxy object is used to mediate with
+all types of remote PCI devices.
+
+object initialization
+^^^^^^^^^^^^^^^^^^^^^
+
+The Proxy device objects are initialized in the same manner as any
+other QEMU device would be.
+
+In addition, the Proxy objects perform the following two tasks:
+
+- Parse the "socket" sub-option and connect to the remote process
+  using this channel
+- Use the "id" sub-option to connect to the emulated device on the
+  separate process
+
+class\_init
+'''''''''''
+
+The ``class_init()`` method of a proxy object will, in general, behave
+similarly to the object it replaces, including setting any static
+properties and methods needed by the proxy.
+
+instance\_init / realize
+''''''''''''''''''''''''
+
+The ``instance_init()`` and ``realize()`` functions would only need to
+perform tasks related to being a proxy, such as registering its own
+MMIO handlers, or creating a child bus that other proxy devices can be
+attached to later.
+
+Other tasks will be device-specific. For example, PCI device objects
+will initialize the PCI config space in order to make a valid PCI device
+tree within the QEMU process.
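+
+As a rough sketch only (the ``PCIProxyDev`` state, the ``PCI_PROXY_DEV``
+cast macro and the ``proxy_connect()`` helper below are hypothetical,
+not the actual implementation), the QOM boilerplate for such a proxy
+might look like:
+
+::
+
+   #include "qemu/osdep.h"
+   #include "hw/pci/pci.h"
+   #include "qapi/error.h"
+
+   typedef struct PCIProxyDev {
+       PCIDevice parent_dev;
+       char *socket;        /* "socket" sub-option: channel to the remote */
+       char *rid;           /* "id" sub-option: device id in the remote   */
+   } PCIProxyDev;
+
+   static void pci_proxy_dev_realize(PCIDevice *pci_dev, Error **errp)
+   {
+       PCIProxyDev *dev = PCI_PROXY_DEV(pci_dev);
+
+       /* Connect to the remote process over the "socket" channel and
+        * bind to the emulated device named by "id" (placeholder). */
+       if (!proxy_connect(dev, errp)) {
+           return;
+       }
+
+       /* Proxy-specific setup: register mirrored MMIO regions, build a
+        * valid PCI config space, create child buses, etc. */
+   }
+
+   static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
+   {
+       PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+       k->realize = pci_proxy_dev_realize;
+   }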
+
+address space registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most devices are driven by guest device driver accesses to IO addresses
+or ports. The QEMU device emulation code uses QEMU's memory region
+function calls (such as ``memory_region_init_io()``) to add callback
+functions that QEMU will invoke when the guest accesses the device's
+areas of the IO address space. When a guest driver does access the
+device, the VM will exit HW virtualization mode and return to QEMU,
+which will then lookup and execute the corresponding callback function.
+
+A proxy object would need to mirror the memory region calls the actual
+device emulator would perform in its initialization code, but with its
+own callbacks. When invoked by QEMU as a result of a guest IO operation,
+they will forward the operation to the device emulation process.
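+
+For example (a sketch only: ``proxy_send_mmio()`` and its
+``MMIO_LOAD``/``MMIO_STORE`` constants are placeholders for the message
+exchange with the remote process, and the proxy state is assumed to
+carry a ``MemoryRegion bar0`` field):
+
+::
+
+   static uint64_t proxy_bar_read(void *opaque, hwaddr addr, unsigned size)
+   {
+       PCIProxyDev *dev = opaque;
+
+       /* Forward the load and wait for the remote process's reply. */
+       return proxy_send_mmio(dev, MMIO_LOAD, addr, 0, size);
+   }
+
+   static void proxy_bar_write(void *opaque, hwaddr addr, uint64_t val,
+                               unsigned size)
+   {
+       PCIProxyDev *dev = opaque;
+
+       /* Stores can be posted; no reply data is needed. */
+       proxy_send_mmio(dev, MMIO_STORE, addr, val, size);
+   }
+
+   static const MemoryRegionOps proxy_bar_ops = {
+       .read = proxy_bar_read,
+       .write = proxy_bar_write,
+       .endianness = DEVICE_NATIVE_ENDIAN,
+   };
+
+   static void proxy_init_bar0(PCIProxyDev *dev, uint64_t bar0_size)
+   {
+       /* Mirror the real device's BAR 0 with proxy callbacks. */
+       memory_region_init_io(&dev->bar0, OBJECT(dev), &proxy_bar_ops, dev,
+                             "proxy-bar0", bar0_size);
+       pci_register_bar(&dev->parent_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                        &dev->bar0);
+   }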
+
+PCI config space
+^^^^^^^^^^^^^^^^
+
+PCI devices also have a configuration space that can be accessed by the
+guest driver. Guest accesses to this space are not handled by the device
+emulation object, but by its PCI parent object. Much of this space is
+read-only, but certain registers (especially BAR and MSI-related ones)
+need to be propagated to the emulation process.
+
+PCI parent proxy
+''''''''''''''''
+
+One way to propagate guest PCI config accesses is to create a
+"pci-device-proxy" class that can serve as the parent of a PCI device
+proxy object. This class's parent would be "pci-device" and it would
+override the PCI parent's ``config_read()`` and ``config_write()``
+methods with ones that forward these operations to the emulation
+program.
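+
+A minimal sketch of such a class (the forwarding helpers are
+hypothetical placeholders for the messages to the emulation program):
+
+::
+
+   static uint32_t proxy_config_read(PCIDevice *d, uint32_t addr, int len)
+   {
+       /* Ask the emulation process for the current value (placeholder). */
+       return proxy_forward_config_read(d, addr, len);
+   }
+
+   static void proxy_config_write(PCIDevice *d, uint32_t addr,
+                                  uint32_t val, int len)
+   {
+       /* Keep QEMU's local view (BARs, MSI state, ...) up to date, then
+        * forward the write to the emulation process (placeholder). */
+       pci_default_write_config(d, addr, val, len);
+       proxy_forward_config_write(d, addr, val, len);
+   }
+
+   static void pci_device_proxy_class_init(ObjectClass *klass, void *data)
+   {
+       PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+       k->config_read = proxy_config_read;
+       k->config_write = proxy_config_write;
+   }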
+
+interrupt receipt
+^^^^^^^^^^^^^^^^^
+
+A proxy for a device that generates interrupts will need to create a
+socket to receive interrupt indications from the emulation process. An
+incoming interrupt indication would then be sent up to its bus parent to
+be injected into the guest. For example, a PCI device object may use
+``pci_set_irq()``.
+
+live migration
+^^^^^^^^^^^^^^
+
+The proxy will register to save and restore any *vmstate* it needs over
+a live migration event. The device proxy does not need to manage the
+remote device's *vmstate*; that will be handled by the remote process
+proxy (see below).
+
+QEMU remote device operation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generic device operations, such as DMA, will be performed by the remote
+process proxy by sending messages to the remote process.
+
+DMA operations
+^^^^^^^^^^^^^^
+
+DMA operations would be handled much like vhost applications do. One of
+the initial messages sent to the emulation process is a guest memory
+table. Each entry in this table consists of a file descriptor and size
+that the emulation process can ``mmap()`` to directly access guest
+memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
+must be backed by file descriptors, such as when QEMU is given the
+*-mem-path* command line option.
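+
+One possible shape for such a table entry is sketched below; the
+structure layout is hypothetical (the file descriptors themselves
+would travel as ancillary data on the socket), and only the ``mmap()``
+call is a real interface:
+
+::
+
+   #include <stdint.h>
+   #include <sys/mman.h>
+
+   /* Hypothetical guest memory table entry. */
+   typedef struct GuestMemEntry {
+       uint64_t gpa;      /* guest physical base address of the region */
+       uint64_t size;     /* length of the region                      */
+       uint64_t offset;   /* offset of the region within the fd        */
+   } GuestMemEntry;
+
+   /* In the emulation process, map a region for direct access. */
+   static void *map_guest_region(const GuestMemEntry *ent, int ram_fd)
+   {
+       void *hva = mmap(NULL, ent->size, PROT_READ | PROT_WRITE,
+                        MAP_SHARED, ram_fd, ent->offset);
+
+       return hva == MAP_FAILED ? NULL : hva;
+   }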
+
+IOMMU operations
+^^^^^^^^^^^^^^^^
+
+When the emulated system includes an IOMMU, the remote process proxy in
+QEMU will need to create a socket for IOMMU requests from the emulation
+process. It will handle those requests with an
+``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
+unmaps, the remote process proxy will also register as a listener on the
+device's DMA address space. When an IOMMU memory region is created
+within the DMA address space, an IOMMU notifier for unmaps will be added
+to the memory region that will forward unmaps to the emulation process
+over the IOMMU socket.
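+
+A rough sketch of how the remote process proxy might service one such
+translation request (the ``IOMMUReqMsg``/``IOMMUReplyMsg`` message
+types and the send helper are hypothetical):
+
+::
+
+   static void proxy_handle_iommu_req(PCIProxyDev *dev, IOMMUReqMsg *req)
+   {
+       AddressSpace *as = pci_device_iommu_address_space(&dev->parent_dev);
+       IOMMUTLBEntry entry = address_space_get_iotlb_entry(
+           as, req->iova, req->is_write, MEMTXATTRS_UNSPECIFIED);
+
+       IOMMUReplyMsg reply = {
+           .iova       = entry.iova,
+           .translated = entry.translated_addr,
+           .addr_mask  = entry.addr_mask,
+           .perm       = entry.perm,
+       };
+
+       /* Send the reply over the IOMMU socket (placeholder). */
+       proxy_send_iommu_reply(dev, &reply);
+   }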
+
+device hot-plug via QMP
+^^^^^^^^^^^^^^^^^^^^^^^
+
+A QMP "device\_add" command can add a device emulated by a remote
+process. It will also have a "rid" option, just as the
+*-device* command line option does. The remote process may either be one
+started at QEMU startup, or be one added by the "add-process" QMP
+command described above. In either case, the remote process proxy will
+forward the new device's JSON description to the corresponding emulation
+process.
+
+live migration
+^^^^^^^^^^^^^^
+
+The remote process proxy will also register for live migration
+notifications with ``vmstate_register()``. When called to save state,
+the proxy will send the remote process a secondary socket file
+descriptor to save the remote process's device *vmstate* over. The
+incoming byte stream length and data will be saved as the proxy's
+*vmstate*. When the proxy is resumed on its new host, this *vmstate*
+will be extracted, and a secondary socket file descriptor will be sent
+to the new remote process through which it receives the *vmstate* in
+order to restore the devices there.
+
+device emulation in remote process
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The parts of QEMU that the emulation program will need include the
+object model; the memory emulation objects; the device emulation objects
+of the targeted device, and any dependent devices; and, the device's
+backends. It will also need code to set up the machine environment,
+handle requests from the QEMU process, and route machine-level requests
+(such as interrupts or IOMMU mappings) back to the QEMU process.
+
+initialization
+^^^^^^^^^^^^^^
+
+The process initialization will follow the same sequence as QEMU's. It
+will first initialize the backend objects, then the
+device emulation objects. The JSON descriptions sent by the QEMU process
+will drive which objects need to be created.
+
+- address spaces
+
+Before the device objects are created, the initial address spaces and
+memory regions must be configured with ``memory_map_init()``. This
+creates a RAM memory region object (*system\_memory*) and an IO memory
+region object (*system\_io*).
+
+- RAM
+
+RAM memory region creation will follow how ``pc_memory_init()`` creates
+them, but must use ``memory_region_init_ram_from_fd()`` instead of
+``memory_region_allocate_system_memory()``. The file descriptors needed
+will be supplied by the guest memory table from above. Those RAM regions
+would then be added to the *system\_memory* memory region with
+``memory_region_add_subregion()``.
+
+- PCI
+
+IO initialization will be driven by the JSON descriptions sent from the
+QEMU process. For a PCI device, a PCI bus will need to be created with
+``pci_root_bus_new()``, and a PCI memory region will need to be created
+and added to the *system\_memory* memory region with
+``memory_region_add_subregion_overlap()``. The overlap version is
+required for architectures where PCI memory overlaps with RAM memory.
+
+MMIO handling
+^^^^^^^^^^^^^
+
+The device emulation objects will use ``memory_region_init_io()`` to
+install their MMIO handlers, and ``pci_register_bar()`` to associate
+those handlers with a PCI BAR, as they do within QEMU currently.
+
+In order to use ``address_space_rw()`` in the emulation process to
+handle MMIO requests from QEMU, the PCI physical addresses must be the
+same in the QEMU process and the device emulation process. In order to
+accomplish that, guest BAR programming must also be forwarded from QEMU
+to the emulation process.
+
+interrupt injection
+^^^^^^^^^^^^^^^^^^^
+
+When device emulation wants to inject an interrupt into the VM, the
+request climbs the device's bus object hierarchy until the point where a
+bus object knows how to signal the interrupt to the guest. The details
+depend on the type of interrupt being raised.
+
+- PCI pin interrupts
+
+On x86 systems, there is an emulated IOAPIC object attached to the root
+PCI bus object, and the root PCI object forwards interrupt requests to
+it. The IOAPIC object, in turn, calls the KVM driver to inject the
+corresponding interrupt into the VM. The simplest way to handle this in
+an emulation process would be to set up the root PCI bus driver (via
+``pci_bus_irqs()``) to send an interrupt request back to the QEMU
+process, and have the device proxy object reflect it up the PCI tree
+there.
+
+- PCI MSI/X interrupts
+
+PCI MSI/X interrupts are implemented in HW as DMA writes to a
+CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
+these DMA writes, then calls into the KVM driver to inject the interrupt
+into the VM. A simple emulation process implementation would be to send
+the MSI DMA address from QEMU as a message at initialization, then
+install an address space handler at that address which forwards the MSI
+message back to QEMU.
+
+DMA operations
+^^^^^^^^^^^^^^
+
+When an emulation object wants to DMA into or out of guest memory, it
+first must use ``dma_memory_map()`` to convert the DMA address to a local
+virtual address. The emulation process memory region objects setup above
+will be used to translate the DMA address to a local virtual address the
+device emulation code can access.
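+
+As a sketch of the usual map/access/unmap pattern (the exact argument
+lists of the DMA helpers differ slightly between QEMU versions, so
+this is illustrative rather than definitive):
+
+::
+
+   static void emulated_dev_dma_write(AddressSpace *as, dma_addr_t dma_addr,
+                                      const void *data, dma_addr_t xfer_len)
+   {
+       dma_addr_t len = xfer_len;
+       void *buf = dma_memory_map(as, dma_addr, &len,
+                                  DMA_DIRECTION_FROM_DEVICE);
+
+       if (!buf || len < xfer_len) {
+           /* mapping failed or was truncated: bounce-buffer or abort */
+           return;
+       }
+       memcpy(buf, data, len);          /* the device writes guest RAM */
+       dma_memory_unmap(as, buf, len, DMA_DIRECTION_FROM_DEVICE, len);
+   }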
+
+IOMMU
+^^^^^
+
+When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
+regions to translate the DMA address to a guest physical address before
+that physical address can be translated to a local virtual address. The
+emulation process will need similar functionality.
+
+- IOTLB cache
+
+The emulation process will maintain a cache of recent IOMMU translations
+(the IOTLB). When the translate() callback of an IOMMU memory region is
+invoked, the IOTLB cache will be searched for an entry that will map the
+DMA address to a guest PA. On a cache miss, a message will be sent back
+to QEMU requesting the corresponding translation entry, which will both
+be used to return a guest address and be added to the cache.
+
+- IOTLB purge
+
+The IOMMU emulation will also need to act on unmap requests from QEMU.
+These happen when the guest IOMMU driver purges an entry from the
+guest's translation table.
+
+live migration
+^^^^^^^^^^^^^^
+
+When a remote process receives a live migration indication from QEMU, it
+will set up a channel using the received file descriptor with
+``qio_channel_socket_new_fd()``. This channel will be used to create a
+*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
+the process's device state back to QEMU. This method will be reversed on
+restore - the channel will be passed to ``qemu_loadvm_state()`` to
+restore the device state.
+
+Accelerating device emulation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The messages that are required to be sent between QEMU and the emulation
+process can add considerable latency to IO operations. The optimizations
+described below attempt to ameliorate this effect by allowing the
+emulation process to communicate directly with the kernel KVM driver.
+The KVM file descriptors created would be passed to the emulation process
+via initialization messages, much like the guest memory table is done.
+
+MMIO acceleration
+^^^^^^^^^^^^^^^^^
+
+Vhost user applications can receive guest virtio driver stores directly
+from KVM. The issue with the eventfd mechanism used by vhost user is
+that it does not pass any data with the event indication, so it cannot
+handle guest loads or guest stores that carry store data. This concept
+could, however, be expanded to cover more cases.
+
+The expanded idea would require a new type of KVM device:
+*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
+descriptor that QEMU can use for configuration, and a slave descriptor
+that the emulation process can use to receive MMIO notifications. QEMU
+would create both descriptors using the KVM driver, and pass the slave
+descriptor to the emulation process via an initialization message.
+
+data structures
+^^^^^^^^^^^^^^^
+
+- guest physical range
+
+The guest physical range structure describes the address range that a
+device will respond to. It includes the base and length of the range, as
+well as which bus the range resides on (e.g., on an x86 machine, it can
+specify whether the range refers to memory or IO addresses).
+
+A device can have multiple physical address ranges it responds to (e.g.,
+a PCI device can have multiple BARs), so the structure will also include
+an enumerated identifier to specify which of the device's ranges is
+being referred to.
+
++--------+----------------------------+
+| Name | Description |
++========+============================+
+| addr | range base address |
++--------+----------------------------+
+| len | range length |
++--------+----------------------------+
+| bus | addr type (memory or IO) |
++--------+----------------------------+
+| id | range ID (e.g., PCI BAR) |
++--------+----------------------------+
+
+- MMIO request structure
+
+This structure describes an MMIO operation. It includes which guest
+physical range the MMIO was within, the offset within that range, the
+MMIO type (e.g., load or store), and its length and data. It also
+includes a sequence number that can be used to reply to the MMIO, and
+the CPU that issued the MMIO. (A hypothetical C rendering of this
+structure and of the guest physical range structure above is sketched
+at the end of this section.)
+
++----------+------------------------+
+| Name | Description |
++==========+========================+
+| rid | range MMIO is within |
++----------+------------------------+
+| offset | offset within *rid* |
++----------+------------------------+
+| type | e.g., load or store |
++----------+------------------------+
+| len | MMIO length |
++----------+------------------------+
+| data | store data |
++----------+------------------------+
+| seq | sequence ID |
++----------+------------------------+
+
+- MMIO request queues
+
+MMIO request queues are FIFO arrays of MMIO request structures. There
+are two queues: the pending queue is for MMIOs that haven't been read by the
+emulation program, and the sent queue is for MMIOs that haven't been
+acknowledged. The main use of the second queue is to validate MMIO
+replies from the emulation program.
+
+- scoreboard
+
+Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
+MMIOs may be waiting to be consumed by an emulation program and multiple
+threads may be waiting for MMIO replies. The scoreboard would contain a
+wait queue and sequence number for the per-CPU threads, allowing them to
+be individually woken when the MMIO reply is received from the emulation
+program. It also tracks the number of posted MMIO stores to the device
+that haven't been replied to, in order to satisfy the PCI constraint
+that a load to a device will not complete until all previous stores to
+that device have been completed.
+
+- device shadow memory
+
+Some MMIO loads do not have device side-effects. These MMIOs can be
+completed without sending an MMIO request to the emulation program if the
+emulation program shares a shadow image of the device's memory image
+with the KVM driver.
+
+The emulation program will ask the KVM driver to allocate memory for the
+shadow image, and will then use ``mmap()`` to directly access it. The
+emulation program can control KVM access to the shadow image by sending
+KVM an access map telling it which areas of the image have no
+side-effects (and can be completed immediately), and which require an
+MMIO request to the emulation program. The access map can also inform
+the KVM driver which access sizes are allowed to the image.
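+
+A hypothetical C rendering of the guest physical range and MMIO
+request structures described above (the actual layout would be defined
+by the proposed KVM user-device ABI, so the field names and types here
+are only illustrative):
+
+::
+
+   #include <linux/types.h>
+
+   struct kvm_user_pa_range {
+       __u64 addr;     /* range base address              */
+       __u64 len;      /* range length                    */
+       __u32 bus;      /* address type (memory or IO)     */
+       __u32 id;       /* range ID (e.g., PCI BAR number) */
+   };
+
+   struct kvm_user_mmio_req {
+       __u32 rid;      /* which registered range          */
+       __u32 type;     /* load or store                   */
+       __u64 offset;   /* offset within the range         */
+       __u32 len;      /* access length                   */
+       __u32 cpu;      /* issuing vCPU                    */
+       __u64 data;     /* store data, or load reply data  */
+       __u64 seq;      /* sequence ID echoed in the reply */
+   };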
+
+master descriptor
+^^^^^^^^^^^^^^^^^
+
+The master descriptor is used by QEMU to configure the new KVM device.
+The descriptor would be returned by the KVM driver when QEMU issues a
+*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.
+
+KVM\_DEV\_TYPE\_USER device ops
+''''''''''''''''''''''''''''''''
+
+The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
+``kvm_register_device_ops()`` call when the KVM system is initialized by
+``kvm_init()``. These device ops are called by the KVM driver when QEMU
+executes certain ``ioctl()`` operations on its KVM file descriptor. They
+include:
+
+- create
+
+This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
+``ioctl()`` on its per-VM file descriptor. It will allocate and
+initialize a KVM user device specific data structure, and assign the
+*kvm\_device* private field to it.
+
+- ioctl
+
+This routine is invoked when QEMU issues an ``ioctl()`` on the master
+descriptor. The ``ioctl()`` commands supported are defined by the KVM
+device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:
+
+*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
+be passed to the device emulation program. Only one slave can be created
+by each master descriptor. The file operations performed by this
+descriptor are described below.
+
+The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
+address range that the slave descriptor will receive MMIO notifications
+for. The range is specified by a guest physical range structure
+argument. For buses that assign addresses to devices dynamically, this
+command can be executed while the guest is running, such as the case
+when a guest changes a device's PCI BAR registers.
+
+*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
+register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
+performs a MMIO operation within the range. When a range is changed,
+``kvm_io_bus_unregister_dev()`` is used to remove the previous
+instantiation.
+
+*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
+how long KVM will wait for the emulation process to respond to an MMIO
+indication.
+
+- destroy
+
+This routine is called when the VM instance is destroyed. It will need
+to destroy the slave descriptor and free any memory allocated by the
+driver, as well as the *kvm\_device* structure itself.
+
+slave descriptor
+^^^^^^^^^^^^^^^^
+
+The slave descriptor will have its own file operations vector, which
+responds to system calls on the descriptor performed by the device
+emulation program.
+
+- read
+
+A read returns any pending MMIO requests from the KVM driver as MMIO
+request structures. Multiple structures can be returned if there are
+multiple MMIO operations pending. The MMIO requests are moved from the
+pending queue to the sent queue, and if there are threads waiting for
+space in the pending queue to add new MMIO operations, they will be woken
+here.
+
+- write
+
+A write also consists of a set of MMIO requests. They are compared to
+the MMIO requests in the sent queue. Matches are removed from the sent
+queue, and any threads waiting for the reply are woken. If a store is
+removed, then the number of posted stores in the per-CPU scoreboard is
+decremented. When the number is zero, and a non side-effect load was
+waiting for posted stores to complete, the load is continued.
+
+- ioctl
+
+There are several ioctl()s that can be performed on the slave
+descriptor.
+
+A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
+allocate memory for the shadow image. This memory can later be
+``mmap()``\ ed by the emulation process to share the emulation's view of
+device memory with the KVM driver.
+
+A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
+shadow image. It will send the KVM driver a shadow control map, which
+specifies which areas of the image can complete guest loads without
+sending the load request to the emulation program. It will also specify
+the size of load operations that are allowed.
+
+- poll
+
+An emulation program will use the ``poll()`` call with a *POLLIN* flag
+to determine if there are MMIO requests waiting to be read. It will
+return if the pending MMIO request queue is not empty.
+
+- mmap
+
+This call allows the emulation program to directly access the shadow
+image allocated by the KVM driver. As device emulation updates device
+memory, changes with no side-effects will be reflected in the shadow,
+and the KVM driver can satisfy guest loads from the shadow image without
+needing to wait for the emulation program.
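+
+Putting the slave descriptor operations together, the emulation
+program's MMIO service loop might look roughly like this (a sketch
+reusing the hypothetical ``kvm_user_mmio_req`` structure from earlier;
+``handle_mmio()`` stands in for the device emulation itself):
+
+::
+
+   #include <poll.h>
+   #include <unistd.h>
+
+   static void mmio_loop(int slave_fd)
+   {
+       struct kvm_user_mmio_req req;
+       struct pollfd pfd = { .fd = slave_fd, .events = POLLIN };
+
+       for (;;) {
+           if (poll(&pfd, 1, -1) <= 0) {
+               continue;
+           }
+           while (read(slave_fd, &req, sizeof(req)) == sizeof(req)) {
+               handle_mmio(&req);    /* run the device emulation */
+
+               /* For loads, req.data now carries the reply; the sequence
+                * ID lets the KVM driver match it against its sent queue. */
+               write(slave_fd, &req, sizeof(req));
+           }
+       }
+   }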
+
+kvm\_io\_device ops
+^^^^^^^^^^^^^^^^^^^
+
+Each KVM per-CPU thread can handle MMIO operations on behalf of the guest
+VM. KVM will use the MMIO's guest physical address to search for a
+matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
+driver instead of exiting back to QEMU. If a match is found, the
+corresponding callback will be invoked.
+
+- read
+
+This callback is invoked when the guest performs a load to the device.
+Loads with side-effects must be handled synchronously, with the KVM
+driver putting the QEMU thread to sleep waiting for the emulation
+process reply before re-starting the guest. Loads that do not have
+side-effects may be optimized by satisfying them from the shadow image,
+if there are no outstanding stores to the device by this CPU. PCI memory
+ordering demands that a load cannot complete before all older stores to
+the same device have been completed.
+
+- write
+
+Stores can be handled asynchronously unless the pending MMIO request
+queue is full. In this case, the QEMU thread must sleep waiting for
+space in the queue. Stores will increment the number of posted stores in
+the per-CPU scoreboard, in order to implement the PCI ordering
+constraint above.
+
+interrupt acceleration
+^^^^^^^^^^^^^^^^^^^^^^
+
+This performance optimization would work much like a vhost user
+application does, where the QEMU process sets up *eventfds* that cause
+the device's corresponding interrupt to be triggered by the KVM driver.
+These irq file descriptors are sent to the emulation process at
+initialization, and are used when the emulation code raises a device
+interrupt.
+
+intx acceleration
+'''''''''''''''''
+
+Traditional PCI pin interrupts are level based, so, in addition to an
+irq file descriptor, a re-sampling file descriptor needs to be sent to
+the emulation program. This second file descriptor allows multiple
+devices sharing an irq to be notified when the interrupt has been
+acknowledged by the guest, so they can re-trigger the interrupt if their
+device has not de-asserted its interrupt.
+
+intx irq descriptor
+
+
+The irq descriptors are created by the proxy object using
+``event_notifier_init()`` to create the irq and re-sampling
+*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
+The interrupt route can be found with
+``pci_device_route_intx_to_irq()``.
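+
+A sketch of that setup, loosely following ``vfio_intx_enable()`` (error
+handling is omitted, and the ``intx_irq``/``intx_resample`` notifier
+fields in the proxy state are assumptions):
+
+::
+
+   static void proxy_setup_intx(PCIProxyDev *dev, int pin)
+   {
+       struct kvm_irqfd irqfd = { .flags = KVM_IRQFD_FLAG_RESAMPLE };
+       PCIINTxRoute route = pci_device_route_intx_to_irq(&dev->parent_dev,
+                                                         pin);
+
+       event_notifier_init(&dev->intx_irq, 0);
+       event_notifier_init(&dev->intx_resample, 0);
+
+       irqfd.fd = event_notifier_get_fd(&dev->intx_irq);
+       irqfd.resamplefd = event_notifier_get_fd(&dev->intx_resample);
+       irqfd.gsi = route.irq;
+       kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+
+       /* Both eventfds are then passed to the emulation process over
+        * the device's communication channel. */
+   }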
+
+intx routing changes
+
+
+Intx routing can be changed when the guest programs the APIC the device
+pin is connected to. The proxy object in QEMU will use
+``pci_device_set_intx_routing_notifier()`` to be informed of any guest
+changes to the route. This handler will broadly follow the VFIO
+interrupt logic to change the route: de-assigning the existing irq
+descriptor from its route, then assigning it the new route. (see
+``vfio_intx_update()``)
+
+MSI/X acceleration
+''''''''''''''''''
+
+MSI/X interrupts are sent as DMA transactions to the host. The interrupt
+data contains a vector that is programmed by the guest. A device may have
+multiple MSI interrupts associated with it, so multiple irq descriptors
+may need to be sent to the emulation program.
+
+MSI/X irq descriptor
+
+
+This case will also follow the VFIO example. For each MSI/X interrupt,
+an *eventfd* is created, a virtual interrupt is allocated by
+``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
+the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
+
+MSI/X config space changes
+
+
+The guest may dynamically update several MSI-related tables in the
+device's PCI config space. These include per-MSI interrupt enables and
+vector data. Additionally, MSIX tables exist in device memory space, not
+config space. Much like the BAR case above, the proxy object must look
+at guest config space programming to keep the MSI interrupt state
+consistent between QEMU and the emulation program.
+
+--------------
+
+Disaggregated CPU emulation
+---------------------------
+
+After IO services have been disaggregated, a second phase would be to
+separate a process to handle CPU instruction emulation from the main
+QEMU control function. There are no object separation points for this
+code, so the first task would be to create one.
+
+Host access controls
+--------------------
+
+Separating QEMU relies on the host OS's access restriction mechanisms to
+enforce that the differing processes can only access the objects they
+are entitled to. There are a couple types of mechanisms usually provided
+by general purpose OSs.
+
+Discretionary access control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Discretionary access control allows each user to control who can access
+their files. In Linux, this type of control is usually too coarse for
+QEMU separation, since it only provides three separate access controls:
+one for the same user ID, the second for user IDs with the same group
+ID, and the third for all other user IDs. Each device instance would
+need a separate user ID to provide access control, which is likely to be
+unwieldy for dynamically created VMs.
+
+Mandatory access control
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mandatory access control allows the OS to add an additional set of
+controls on top of discretionary access control. It also
+adds other attributes to processes and files such as types, roles, and
+categories, and can establish rules for how processes and files can
+interact.
+
+Type enforcement
+^^^^^^^^^^^^^^^^
+
+Type enforcement assigns a *type* attribute to processes and files, and
+allows rules to be written on what operations a process with a given
+type can perform on a file with a given type. QEMU separation could take
+advantage of type enforcement by running the emulation processes with
+different types, both from the main QEMU process, and from the emulation
+processes of different classes of devices.
+
+For example, guest disk images and disk emulation processes could have
+types separate from the main QEMU process and non-disk emulation
+processes, and the type rules could prevent processes other than disk
+emulation ones from accessing guest disk images. Similarly, network
+emulation processes can have a type separate from the main QEMU process
+and non-network emulation processes, and only that type can access the
+host tun/tap device used to provide guest networking.
+
+Category enforcement
+^^^^^^^^^^^^^^^^^^^^
+
+Category enforcement assigns a set of numbers within a given range to
+the process or file. The process is granted access to the file if the
+process's set is a superset of the file's set. This enforcement can be
+used to separate multiple instances of devices in the same class.
+
+For example, if there are multiple disk devices provided to a guest,
+each device emulation process could be provisioned with a separate
+category. The different device emulation processes would not be able to
+access each other's backing disk images.
+
+Alternatively, categories could be used in lieu of the type enforcement
+scheme described above. In this scenario, different categories would be
+used to prevent device emulation processes in different classes from
+accessing resources assigned to other classes.
diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
new file mode 100644
index 000000000..c9541a7b2
--- /dev/null
+++ b/docs/devel/multi-thread-tcg.rst
@@ -0,0 +1,373 @@
+..
+ Copyright (c) 2015-2020 Linaro Ltd.
+
+ This work is licensed under the terms of the GNU GPL, version 2 or
+ later. See the COPYING file in the top-level directory.
+
+==================
+Multi-threaded TCG
+==================
+
+This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
+system-mode emulation. User-mode emulation has always mirrored the
+thread structure of the translated executable, although some of the
+changes done for MTTCG system emulation have improved the stability of
+linux-user emulation.
+
+The original system-mode TCG implementation was single threaded and
+dealt with multiple CPUs with simple round-robin scheduling. This
+simplified a lot of things but became increasingly limited as systems
+being emulated gained additional cores and per-core performance gains
+for host systems started to level off.
+
+vCPU Scheduling
+===============
+
+We introduce a new running mode where each vCPU will run on its own
+user-space thread. This is enabled by default for all FE/BE
+combinations where the host memory model is able to accommodate the
+guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
+guest has had the required work done to support this safely
+(TARGET_SUPPORTS_MTTCG).
+
+System emulation will fall back to the original round robin approach
+if:
+
+* forced by --accel tcg,thread=single
+* enabling --icount mode
+* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
+
+In the general case of running translated code there should be no
+inter-vCPU dependencies and all vCPUs should be able to run at full
+speed. Synchronisation will only be required while accessing internal
+shared data structures or when the emulated architecture requires a
+coherent representation of the emulated machine state.
+
+Shared Data Structures
+======================
+
+Main Run Loop
+-------------
+
+Even when there is no code being generated there are a number of
+structures associated with the hot-path through the main run-loop.
+These are associated with looking up the next translation block to
+execute. These include:
+
+ tb_jmp_cache (per-vCPU, cache of recent jumps)
+ tb_ctx.htable (global hash table, phys address->tb lookup)
+
+As TB linking only occurs when blocks are in the same page this code
+is critical to performance as looking up the next TB to execute is the
+most common reason to exit the generated code.
+
+DESIGN REQUIREMENT: Make access to lookup structures safe with
+multiple reader/writer threads. Minimise any lock contention to do it.
+
+The hot-path avoids using locks where possible. The tb_jmp_cache is
+updated with atomic accesses to ensure consistent results. The fall
+back QHT based hash table is also designed for lockless lookups. Locks
+are only taken when code generation is required or TranslationBlocks
+have their block-to-block jumps patched.
+
+Global TCG State
+----------------
+
+User-mode emulation
+~~~~~~~~~~~~~~~~~~~
+
+We need to protect the entire code generation cycle including any post
+generation patching of the translated code. This also implies a shared
+translation buffer which contains code running on all cores. Any
+execution path that comes to the main run loop will need to hold a
+mutex for code generation. This also includes times when we need flush
+code or entries from any shared lookups/caches. Structures held on a
+per-vCPU basis won't need locking unless other vCPUs will need to
+modify them.
+
+DESIGN REQUIREMENT: Add locking around all code generation and TB
+patching.
+
+(Current solution)
+
+Code generation is serialised with mmap_lock().
+
+!User-mode emulation
+~~~~~~~~~~~~~~~~~~~~
+
+Each vCPU has its own TCG context and associated TCG region, thereby
+requiring no locking during translation.
+
+Translation Blocks
+------------------
+
+Currently the whole system shares a single code generation buffer
+which when full will force a flush of all translations and start from
+scratch again. Some operations also force a full flush of translations
+including:
+
+ - debugging operations (breakpoint insertion/removal)
+ - some CPU helper functions
+ - linux-user spawning its first thread
+
+This is done with the async_safe_run_on_cpu() mechanism to ensure all
+vCPUs are quiescent when changes are being made to shared global
+structures.
+
+More granular translation invalidation events are typically due
+to a change of the state of a physical page:
+
+ - code modification (self modify code, patching code)
+ - page changes (new page mapping in linux-user mode)
+
+While setting the invalid flag in a TranslationBlock will stop it
+being used when looked up in the hot-path there are a number of other
+book-keeping structures that need to be safely cleared.
+
+Any TranslationBlocks which have been patched to jump directly to the
+now invalid blocks need the jump patches reversing so they will return
+to the C code.
+
+There are a number of look-up caches that need to be properly updated
+including the:
+
+ - jump lookup cache
+ - the physical-to-tb lookup hash table
+ - the global page table
+
+The global page table (l1_map) provides a multi-level look-up for
+PageDesc structures, which contain pointers to the start of a
+linked list of all Translation Blocks in that page (see page_next).
+
+Both the jump patching and the page cache involve linked lists that
+the invalidated TranslationBlock needs to be removed from.
+
+DESIGN REQUIREMENT: Safely handle invalidation of TBs
+ - safely patch/revert direct jumps
+ - remove central PageDesc lookup entries
+ - ensure lookup caches/hashes are safely updated
+
+(Current solution)
+
+The direct jumps themselves are updated atomically by the TCG
+tb_set_jmp_target() code. Modification to the linked lists that allow
+searching for linked pages are done under the protection of tb->jmp_lock,
+where tb is the destination block of a jump. Each origin block keeps a
+pointer to its destinations so that the appropriate lock can be acquired before
+iterating over a jump list.
+
+The global page table is a lockless radix tree; cmpxchg is used
+to atomically insert new elements.
+
+The lookup caches are updated atomically and the lookup hash uses QHT
+which is designed for concurrent safe lookup.
+
+Parallel code generation is supported. QHT is used at insertion time
+as the synchronization point across threads, thereby ensuring that we only
+keep track of a single TranslationBlock for each guest code block.
+
+Memory maps and TLBs
+--------------------
+
+The memory handling code is fairly critical to the speed of memory
+access in the emulated system. The SoftMMU code is designed so the
+hot-path can be handled entirely within translated code. This is
+handled with a per-vCPU TLB structure which once populated will allow
+a series of accesses to the page to occur without exiting the
+translated code. It is possible to set flags in the TLB address which
+will ensure the slow-path is taken for each access. This can be done
+to support:
+
+ - Memory regions (dividing up access to PIO, MMIO and RAM)
+ - Dirty page tracking (for code gen, SMC detection, migration and display)
+ - Virtual TLB (for translating guest address->real address)
+
+When the TLB tables are updated by a vCPU thread other than their own
+we need to ensure it is done in a safe way so no inconsistent state is
+seen by the vCPU thread.
+
+Some operations require updating a number of vCPUs TLBs at the same
+time in a synchronised manner.
+
+DESIGN REQUIREMENTS:
+
+ - TLB Flush All/Page
+ - can be across-vCPUs
+ - cross vCPU TLB flush may need other vCPU brought to halt
+ - change may need to be visible to the calling vCPU immediately
+ - TLB Flag Update
+ - usually cross-vCPU
+ - want change to be visible as soon as possible
+ - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+ - This is a per-vCPU table - by definition can't race
+ - updated by its own thread when the slow-path is forced
+
+(Current solution)
+
+We have updated cputlb.c to defer cross-vCPU operations with
+async_run_on_cpu(), which ensures each vCPU sees a coherent state when
+it next runs its work (in a few instructions' time).
+
+A new set of operations (tlb_flush_*_all_cpus) takes an additional flag
+which, when set, will force synchronisation by setting the source vCPU's
+work as "safe work" and exiting the cpu run loop. This ensures that by
+the time execution restarts all flush operations have completed.
+
+TLB flag updates are all done atomically and are also protected by the
+corresponding page lock.
+
+(Known limitation)
+
+Not really a limitation but the wait mechanism is overly strict for
+some architectures which only need flushes completed by a barrier
+instruction. This could be a future optimisation.
+
+Emulated hardware state
+-----------------------
+
+Currently thanks to KVM work any access to IO memory is automatically
+protected by the global iothread mutex, also known as the BQL (Big
+QEMU Lock). Any IO region that doesn't use global mutex is expected to
+do its own locking.
+
+However IO memory isn't the only way emulated hardware state can be
+modified. Some architectures have model specific registers that
+trigger hardware emulation features. Generally any translation helper
+that needs to update more than a single vCPU's state should take the
+BQL.
+
+As the BQL, or global iothread mutex is shared across the system we
+push the use of the lock as far down into the TCG code as possible to
+minimise contention.
+
+(Current solution)
+
+MMIO access automatically serialises hardware emulation by way of the
+BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
+and also defer the reset/startup of vCPUs to the vCPU context by way
+of async_run_on_cpu().
+
+Updates to interrupt state are also protected by the BQL as they can
+often be cross vCPU.
+
+Memory Consistency
+==================
+
+Between emulated guests and host systems there are a range of memory
+consistency models. Even emulating weakly ordered systems on strongly
+ordered hosts needs to ensure things like store-after-load re-ordering
+can be prevented when the guest wants to.
+
+Memory Barriers
+---------------
+
+Barriers (sometimes known as fences) provide a mechanism for software
+to enforce a particular ordering of memory operations from the point
+of view of external observers (e.g. another processor core). They can
+apply to any memory operations as well as just loads or stores.
+
+The Linux kernel has an excellent `write-up
+<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
+on the various forms of memory barrier and the guarantees they can
+provide.
+
+Barriers are often wrapped around synchronisation primitives to
+provide explicit memory ordering semantics. However they can be used
+by themselves to provide safe lockless access by ensuring for example
+a change to a signal flag will only be visible once the changes to
+payload are.
+
+DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
+
+This would enforce a strong load/store ordering so all loads/stores
+complete at the memory barrier. On single-core non-SMP strongly
+ordered backends this could become a NOP.
+
+Aside from explicit standalone memory barrier instructions there are
+also implicit memory ordering semantics which come with each guest
+memory access instruction. For example all x86 load/stores come with
+fairly strong guarantees of sequential consistency whereas Arm has
+special variants of load/store instructions that imply acquire/release
+semantics.
+
+In the case of a strongly ordered guest architecture being emulated on
+a weakly ordered host the scope for a heavy performance impact is
+quite high.
+
+DESIGN REQUIREMENTS: Be efficient with use of memory barriers
+ - host systems with stronger implied guarantees can skip some barriers
+ - merge consecutive barriers to the strongest one
+
+(Current solution)
+
+The system currently has a tcg_gen_mb() which will add memory barrier
+operations if code generation is being done in a parallel context. The
+tcg_optimize() function attempts to merge barriers up to their
+strongest form before any load/store operations. The solution was
+originally developed and tested for linux-user based systems. All
+backends have been converted to emit fences when required. So far the
+following front-ends have been updated to emit fences when required:
+
+ - target-i386
+ - target-arm
+ - target-aarch64
+ - target-alpha
+ - target-mips
+
+Memory Control and Maintenance
+------------------------------
+
+This includes a class of instructions for controlling system cache
+behaviour. While QEMU doesn't model cache behaviour these instructions
+are often seen when code modification has taken place to ensure the
+changes take effect.
+
+Synchronisation Primitives
+--------------------------
+
+There are two broad types of synchronisation primitives found in
+modern ISAs: atomic instructions and exclusive regions.
+
+The first type offers a simple atomic instruction which will guarantee
+some sort of test and conditional store will be truly atomic w.r.t.
+other cores sharing access to the memory. The classic example is the
+x86 cmpxchg instruction.
+
+The second type offers a pair of load/store instructions which offer a
+guarantee that a region of memory has not been touched between the
+load and store instructions. An example of this is Arm's ldrex/strex
+pair where the strex instruction will return a flag indicating a
+successful store only if no other CPU has accessed the memory region
+since the ldrex.
+
+Traditionally TCG has generated a series of operations that work
+because they are within the context of a single translation block so
+will have completed before another CPU is scheduled. However with
+the ability to have multiple threads running to emulate multiple CPUs
+we will need to explicitly expose these semantics.
+
+DESIGN REQUIREMENTS:
+ - Support classic atomic instructions
+ - Support load/store exclusive (or load link/store conditional) pairs
+ - Generic enough infrastructure to support all guest architectures
+
+CURRENT OPEN QUESTIONS:
+ - How problematic is the ABA problem in general?
+
+(Current solution)
+
+The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
+can be used directly or combined to emulate other instructions like
+Arm's ldrex/strex instructions. While they are susceptible to the ABA
+problem so far common guests have not implemented patterns where
+this may be a problem - typically presenting a locking ABI which
+assumes cmpxchg like semantics.
+
+The code also includes a fall-back for cases where multi-threaded TCG
+ops can't work (e.g. guest atomic width > host atomic width). In this
+case an EXCP_ATOMIC exit occurs and the instruction is emulated with
+an exclusive lock which ensures all emulation is serialised.
+
+While the atomic helpers look good enough for now there may be a need
+to look at solutions that can more closely model the guest
+architecture's semantics.
diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt
new file mode 100644
index 000000000..aeb997bed
--- /dev/null
+++ b/docs/devel/multiple-iothreads.txt
@@ -0,0 +1,138 @@
+Copyright (c) 2014-2017 Red Hat Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+
+This document explains the IOThread feature and how to write code that runs
+outside the QEMU global mutex.
+
+The main loop and IOThreads
+---------------------------
+QEMU is an event-driven program that can do several things at once using an
+event loop. The VNC server and the QMP monitor are both processed from the
+same event loop, which monitors their file descriptors until they become
+readable and then invokes a callback.
+
+The default event loop is called the main loop (see main-loop.c). It is
+possible to create additional event loop threads using -object
+iothread,id=my-iothread.
+
+Side note: The main loop and IOThread are both event loops but their code is
+not shared completely. Sometimes it is useful to remember that although they
+are conceptually similar they are currently not interchangeable.
+
+Why IOThreads are useful
+------------------------
+IOThreads allow the user to control the placement of work. The main loop is a
+scalability bottleneck on hosts with many CPUs. Work can be spread across
+several IOThreads instead of just one main loop. When set up correctly this
+can improve I/O latency and reduce jitter seen by the guest.
+
+The main loop is also deeply associated with the QEMU global mutex, which is a
+scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
+global mutex to serialize execution of QEMU code. This mutex is necessary
+because a lot of QEMU's code historically was not thread-safe.
+
+The fact that all I/O processing is done in a single main loop and that the
+QEMU global mutex is contended by all vCPU threads and the main loop explain
+why it is desirable to place work into IOThreads.
+
+The experimental virtio-blk data-plane implementation has been benchmarked and
+shows these effects:
+ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf
+
+How to program for IOThreads
+----------------------------
+The main difference between legacy code and new code that can run in an
+IOThread is dealing explicitly with the event loop object, AioContext
+(see include/block/aio.h). Code that only works in the main loop
+implicitly uses the main loop's AioContext. Code that supports running
+in IOThreads must be aware of its AioContext.
+
+AioContext supports the following services:
+ * File descriptor monitoring (read/write/error on POSIX hosts)
+ * Event notifiers (inter-thread signalling)
+ * Timers
+ * Bottom Halves (BH) deferred callbacks
+
+There are several old APIs that use the main loop AioContext:
+ * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
+ * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
+ * LEGACY timer_new_ms() - create a timer
+ * LEGACY qemu_bh_new() - create a BH
+ * LEGACY qemu_aio_wait() - run an event loop iteration
+
+Since they implicitly work on the main loop they cannot be used in code that
+runs in an IOThread. They might cause a crash or deadlock if called from an
+IOThread since the QEMU global mutex is not held.
+
+Instead, use the AioContext functions directly (see include/block/aio.h):
+ * aio_set_fd_handler() - monitor a file descriptor
+ * aio_set_event_notifier() - monitor an event notifier
+ * aio_timer_new() - create a timer
+ * aio_bh_new() - create a BH
+ * aio_poll() - run an event loop iteration
+
+The AioContext can be obtained from the IOThread using
+iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
+Code that takes an AioContext argument works both in IOThreads or the main
+loop, depending on which AioContext instance the caller passes in.
+
+How to synchronize with an IOThread
+-----------------------------------
+AioContext is not thread-safe so some rules must be followed when using file
+descriptors, event notifiers, timers, or BHs across threads:
+
+1. AioContext functions can always be called safely. They handle their
+own locking internally.
+
+2. Other threads wishing to access the AioContext must use
+aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
+context is acquired no other thread can access it or run event loop iterations
+in this AioContext.
+
+Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls.
+Do not use nesting anymore, it is incompatible with the BDRV_POLL_WHILE() macro
+used in the block layer and can lead to hangs.
+
+There is currently no lock ordering rule if a thread needs to acquire multiple
+AioContexts simultaneously. Therefore, it is only safe for code holding the
+QEMU global mutex to acquire other AioContexts.
+
+Side note: the best way to schedule a function call across threads is to call
+aio_bh_schedule_oneshot(). No acquire/release or locking is needed.
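+
+For example, a minimal sketch (my_cb() and kick_iothread() are made-up
+names; the AioContext and IOThread functions are the ones listed above):
+
+    /* Runs in the IOThread's event loop, without the QEMU global mutex */
+    static void my_cb(void *opaque)
+    {
+        /* ...work that only touches state owned by this AioContext... */
+    }
+
+    static void kick_iothread(IOThread *iothread)
+    {
+        AioContext *ctx = iothread_get_aio_context(iothread);
+
+        aio_bh_schedule_oneshot(ctx, my_cb, NULL);
+    }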
+
+AioContext and the block layer
+------------------------------
+The AioContext originates from the QEMU block layer, even though nowadays
+AioContext is a generic event loop that can be used by any QEMU subsystem.
+
+The block layer has support for AioContext integrated. Each BlockDriverState
+is associated with an AioContext using bdrv_try_set_aio_context() and
+bdrv_get_aio_context(). This allows block layer code to process I/O inside the
+right AioContext. Other subsystems may wish to follow a similar approach.
+
+Block layer code must therefore expect to run in an IOThread and avoid using
+old APIs that implicitly use the main loop. See the "How to program for
+IOThreads" above for information on how to do that.
+
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically needs to ensure that past
+requests from the guest are completed. When a block device is running
+in an IOThread, the IOThread can also process requests from the guest
+(via ioeventfd). To achieve both objectives, wrap the code between
+bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
+section". The functions must be called between aio_context_acquire()
+and aio_context_release(). You can freely release and re-acquire the
+AioContext within a drained section.
+
+Long-running jobs (usually in the form of coroutines) are best scheduled in
+the BlockDriverState's AioContext to avoid the need to acquire/release around
+each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier,
+or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
+can be used to get a notification whenever bdrv_try_set_aio_context() moves a
+BlockDriverState to a different AioContext.
diff --git a/docs/devel/qapi-code-gen.rst b/docs/devel/qapi-code-gen.rst
new file mode 100644
index 000000000..a3b547308
--- /dev/null
+++ b/docs/devel/qapi-code-gen.rst
@@ -0,0 +1,1932 @@
+==================================
+How to use the QAPI code generator
+==================================
+
+..
+ Copyright IBM Corp. 2011
+ Copyright (C) 2012-2016 Red Hat, Inc.
+
+ This work is licensed under the terms of the GNU GPL, version 2 or
+ later. See the COPYING file in the top-level directory.
+
+
+Introduction
+============
+
+QAPI is a native C API within QEMU which provides management-level
+functionality to internal and external users. For external
+users/processes, this interface is made available by a JSON-based wire
+format for the QEMU Monitor Protocol (QMP) for controlling qemu, as
+well as the QEMU Guest Agent (QGA) for communicating with the guest.
+The remainder of this document uses "Client JSON Protocol" when
+referring to the wire contents of a QMP or QGA connection.
+
+To map between Client JSON Protocol interfaces and the native C API,
+we generate C code from a QAPI schema. This document describes the
+QAPI schema language, and how it gets mapped to the Client JSON
+Protocol and to C. It additionally provides guidance on maintaining
+Client JSON Protocol compatibility.
+
+
+The QAPI schema language
+========================
+
+The QAPI schema defines the Client JSON Protocol's commands and
+events, as well as types used by them. Forward references are
+allowed.
+
+It is permissible for the schema to contain additional types not used
+by any commands or events, for the side effect of generated C code
+used internally.
+
+There are several kinds of types: simple types (a number of built-in
+types, such as ``int`` and ``str``; as well as enumerations), arrays,
+complex types (structs and two flavors of unions), and alternate types
+(a choice between other types).
+
+
+Schema syntax
+-------------
+
+Syntax is loosely based on `JSON <http://www.ietf.org/rfc/rfc8259.txt>`_.
+Differences:
+
+* Comments: start with a hash character (``#``) that is not part of a
+ string, and extend to the end of the line.
+
+* Strings are enclosed in ``'single quotes'``, not ``"double quotes"``.
+
+* Strings are restricted to printable ASCII, and escape sequences to
+ just ``\\``.
+
+* Numbers and ``null`` are not supported.
+
+A second layer of syntax defines the sequences of JSON texts that are
+a correctly structured QAPI schema. We provide a grammar for this
+syntax in an EBNF-like notation:
+
+* Production rules look like ``non-terminal = expression``
+* Concatenation: expression ``A B`` matches expression ``A``, then ``B``
+* Alternation: expression ``A | B`` matches expression ``A`` or ``B``
+* Repetition: expression ``A...`` matches zero or more occurrences of
+ expression ``A``
+* Repetition: expression ``A, ...`` matches zero or more occurrences of
+ expression ``A`` separated by ``,``
+* Grouping: expression ``( A )`` matches expression ``A``
+* JSON's structural characters are terminals: ``{ } [ ] : ,``
+* JSON's literal names are terminals: ``false true``
+* String literals enclosed in ``'single quotes'`` are terminals, and match
+  this JSON string, with a leading ``*`` stripped off
+* When a JSON object member's name starts with ``*``, the member is
+  optional.
+* The symbol ``STRING`` is a terminal, and matches any JSON string
+* The symbol ``BOOL`` is a terminal, and matches JSON ``false`` or ``true``
+* ALL-CAPS words other than ``STRING`` are non-terminals
+
+The order of members within JSON objects does not matter unless
+explicitly noted.
+
+A QAPI schema consists of a series of top-level expressions::
+
+ SCHEMA = TOP-LEVEL-EXPR...
+
+The top-level expressions are all JSON objects. Code and
+documentation are generated in schema definition order. Code order
+should not matter.
+
+A top-level expression is either a directive or a definition::
+
+ TOP-LEVEL-EXPR = DIRECTIVE | DEFINITION
+
+There are two kinds of directives and six kinds of definitions::
+
+ DIRECTIVE = INCLUDE | PRAGMA
+ DEFINITION = ENUM | STRUCT | UNION | ALTERNATE | COMMAND | EVENT
+
+These are discussed in detail below.
+
+
+Built-in Types
+--------------
+
+The following types are predefined, and map to C as follows:
+
+ ============= ============== ============================================
+ Schema C JSON
+ ============= ============== ============================================
+ ``str`` ``char *`` any JSON string, UTF-8
+ ``number`` ``double`` any JSON number
+ ``int`` ``int64_t`` a JSON number without fractional part
+ that fits into the C integer type
+ ``int8`` ``int8_t`` likewise
+ ``int16`` ``int16_t`` likewise
+ ``int32`` ``int32_t`` likewise
+ ``int64`` ``int64_t`` likewise
+ ``uint8`` ``uint8_t`` likewise
+ ``uint16`` ``uint16_t`` likewise
+ ``uint32`` ``uint32_t`` likewise
+ ``uint64`` ``uint64_t`` likewise
+ ``size`` ``uint64_t`` like ``uint64_t``, except
+ ``StringInputVisitor`` accepts size suffixes
+ ``bool`` ``bool`` JSON ``true`` or ``false``
+ ``null`` ``QNull *`` JSON ``null``
+ ``any`` ``QObject *`` any JSON value
+ ``QType`` ``QType`` JSON string matching enum ``QType`` values
+ ============= ============== ============================================
+
+
+Include directives
+------------------
+
+Syntax::
+
+ INCLUDE = { 'include': STRING }
+
+The QAPI schema definitions can be modularized using the 'include' directive::
+
+ { 'include': 'path/to/file.json' }
+
+The directive is evaluated recursively, and include paths are relative
+to the file using the directive. Multiple includes of the same file
+are idempotent.
+
+As a matter of style, it is a good idea to have all files be
+self-contained, but at the moment, nothing prevents an included file
+from making a forward reference to a type that is only introduced by
+an outer file. The parser may be made stricter in the future to
+prevent incomplete include files.
+
+.. _pragma:
+
+Pragma directives
+-----------------
+
+Syntax::
+
+ PRAGMA = { 'pragma': {
+ '*doc-required': BOOL,
+ '*command-name-exceptions': [ STRING, ... ],
+ '*command-returns-exceptions': [ STRING, ... ],
+ '*member-name-exceptions': [ STRING, ... ] } }
+
+The pragma directive lets you control optional generator behavior.
+
+Pragma's scope is currently the complete schema. Setting the same
+pragma to different values in parts of the schema doesn't work.
+
+Pragma 'doc-required' takes a boolean value. If true, documentation
+is required. Default is false.
+
+Pragma 'command-name-exceptions' takes a list of commands whose names
+may contain ``"_"`` instead of ``"-"``. Default is none.
+
+Pragma 'command-returns-exceptions' takes a list of commands that may
+violate the rules on permitted return types. Default is none.
+
+Pragma 'member-name-exceptions' takes a list of types whose member
+names may contain uppercase letters, and ``"_"`` instead of ``"-"``.
+Default is none.
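+
+Example: a pragma requiring documentation for every definition
+(illustrative only; a real schema may combine several pragma settings) ::
+
+ { 'pragma': { 'doc-required': true } }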
+
+.. _ENUM-VALUE:
+
+Enumeration types
+-----------------
+
+Syntax::
+
+ ENUM = { 'enum': STRING,
+ 'data': [ ENUM-VALUE, ... ],
+ '*prefix': STRING,
+ '*if': COND,
+ '*features': FEATURES }
+ ENUM-VALUE = STRING
+ | { 'name': STRING,
+ '*if': COND,
+ '*features': FEATURES }
+
+Member 'enum' names the enum type.
+
+Each member of the 'data' array defines a value of the enumeration
+type. The form STRING is shorthand for :code:`{ 'name': STRING }`. The
+'name' values must be distinct.
+
+Example::
+
+ { 'enum': 'MyEnum', 'data': [ 'value1', 'value2', 'value3' ] }
+
+Nothing prevents an empty enumeration, although it is probably not
+useful.
+
+On the wire, an enumeration type's value is represented by its
+(string) name. In C, it's represented by an enumeration constant.
+These are of the form PREFIX_NAME, where PREFIX is derived from the
+enumeration type's name, and NAME from the value's name. For the
+example above, the generator maps 'MyEnum' to MY_ENUM and 'value1' to
+VALUE1, resulting in the enumeration constant MY_ENUM_VALUE1. The
+optional 'prefix' member overrides PREFIX.
+
+The generated C enumeration constants have values 0, 1, ..., N-1 (in
+QAPI schema order), where N is the number of values. There is an
+additional enumeration constant PREFIX__MAX with value N.
+
+Do not use string or an integer type when an enumeration type can do
+the job satisfactorily.
+
+The optional 'if' member specifies a conditional. See `Configuring the
+schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+.. _TYPE-REF:
+
+Type references and array types
+-------------------------------
+
+Syntax::
+
+ TYPE-REF = STRING | ARRAY-TYPE
+ ARRAY-TYPE = [ STRING ]
+
+A string denotes the type named by the string.
+
+A one-element array containing a string denotes an array of the type
+named by the string. Example: ``['int']`` denotes an array of ``int``.
+
+
+Struct types
+------------
+
+Syntax::
+
+ STRUCT = { 'struct': STRING,
+ 'data': MEMBERS,
+ '*base': STRING,
+ '*if': COND,
+ '*features': FEATURES }
+ MEMBERS = { MEMBER, ... }
+ MEMBER = STRING : TYPE-REF
+ | STRING : { 'type': TYPE-REF,
+ '*if': COND,
+ '*features': FEATURES }
+
+Member 'struct' names the struct type.
+
+Each MEMBER of the 'data' object defines a member of the struct type.
+
+.. _MEMBERS:
+
+The MEMBER's STRING name consists of an optional ``*`` prefix and the
+struct member name. If ``*`` is present, the member is optional.
+
+The MEMBER's value defines its properties, in particular its type.
+The form TYPE-REF_ is shorthand for :code:`{ 'type': TYPE-REF }`.
+
+Example::
+
+ { 'struct': 'MyType',
+ 'data': { 'member1': 'str', 'member2': ['int'], '*member3': 'str' } }
+
+A struct type corresponds to a struct in C, and an object in JSON.
+The C struct's members are generated in QAPI schema order.
+
+The optional 'base' member names a struct type whose members are to be
+included in this type. They go first in the C struct.
+
+Example::
+
+ { 'struct': 'BlockdevOptionsGenericFormat',
+ 'data': { 'file': 'str' } }
+ { 'struct': 'BlockdevOptionsGenericCOWFormat',
+ 'base': 'BlockdevOptionsGenericFormat',
+ 'data': { '*backing': 'str' } }
+
+An example BlockdevOptionsGenericCOWFormat object on the wire could use
+both members like this::
+
+ { "file": "/some/place/my-image",
+ "backing": "/some/place/my-backing-file" }
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Union types
+-----------
+
+Syntax::
+
+ UNION = { 'union': STRING,
+ 'base': ( MEMBERS | STRING ),
+ 'discriminator': STRING,
+ 'data': BRANCHES,
+ '*if': COND,
+ '*features': FEATURES }
+ BRANCHES = { BRANCH, ... }
+ BRANCH = STRING : TYPE-REF
+ | STRING : { 'type': TYPE-REF, '*if': COND }
+
+Member 'union' names the union type.
+
+The 'base' member defines the common members. If it is a MEMBERS_
+object, it defines common members just like a struct type's 'data'
+member defines struct type members. If it is a STRING, it names a
+struct type whose members are the common members.
+
+Member 'discriminator' must name a non-optional enum-typed member of
+the base struct. That member's value selects a branch by its name.
+If no such branch exists, an empty branch is assumed.
+
+Each BRANCH of the 'data' object defines a branch of the union. A
+union must have at least one branch.
+
+The BRANCH's STRING name is the branch name. It must be a value of
+the discriminator enum type.
+
+The BRANCH's value defines the branch's properties, in particular its
+type. The type must be a struct type. The form TYPE-REF_ is shorthand
+for :code:`{ 'type': TYPE-REF }`.
+
+In the Client JSON Protocol, a union is represented by an object with
+the common members (from the base type) and the selected branch's
+members. The two sets of member names must be disjoint.
+
+Example::
+
+ { 'enum': 'BlockdevDriver', 'data': [ 'file', 'qcow2' ] }
+ { 'union': 'BlockdevOptions',
+ 'base': { 'driver': 'BlockdevDriver', '*read-only': 'bool' },
+ 'discriminator': 'driver',
+ 'data': { 'file': 'BlockdevOptionsFile',
+ 'qcow2': 'BlockdevOptionsQcow2' } }
+
+Resulting in these JSON objects::
+
+ { "driver": "file", "read-only": true,
+ "filename": "/some/place/my-image" }
+ { "driver": "qcow2", "read-only": false,
+ "backing": "/some/place/my-image", "lazy-refcounts": true }
+
+The order of branches need not match the order of the enum values.
+The branches need not cover all possible enum values. In the
+resulting generated C data types, a union is represented as a struct
+with the base members in QAPI schema order, and then a union of
+structures for each branch of the struct.
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Alternate types
+---------------
+
+Syntax::
+
+ ALTERNATE = { 'alternate': STRING,
+ 'data': ALTERNATIVES,
+ '*if': COND,
+ '*features': FEATURES }
+ ALTERNATIVES = { ALTERNATIVE, ... }
+ ALTERNATIVE = STRING : STRING
+ | STRING : { 'type': STRING, '*if': COND }
+
+Member 'alternate' names the alternate type.
+
+Each ALTERNATIVE of the 'data' object defines a branch of the
+alternate. An alternate must have at least one branch.
+
+The ALTERNATIVE's STRING name is the branch name.
+
+The ALTERNATIVE's value defines the branch's properties, in particular
+its type. The form STRING is shorthand for :code:`{ 'type': STRING }`.
+
+Example::
+
+ { 'alternate': 'BlockdevRef',
+ 'data': { 'definition': 'BlockdevOptions',
+ 'reference': 'str' } }
+
+An alternate type is like a union type, except there is no
+discriminator on the wire. Instead, the branch to use is inferred
+from the value. An alternate can only express a choice between types
+represented differently on the wire.
+
+If a branch is typed as the 'bool' built-in, the alternate accepts
+true and false; if it is typed as any of the various numeric
+built-ins, it accepts a JSON number; if it is typed as a 'str'
+built-in or named enum type, it accepts a JSON string; if it is typed
+as the 'null' built-in, it accepts JSON null; and if it is typed as a
+complex type (struct or union), it accepts a JSON object.
+
+The example alternate declaration above allows using both of the
+following example objects::
+
+ { "file": "my_existing_block_device_id" }
+ { "file": { "driver": "file",
+ "read-only": false,
+ "filename": "/tmp/mydisk.qcow2" } }
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Commands
+--------
+
+Syntax::
+
+ COMMAND = { 'command': STRING,
+ (
+ '*data': ( MEMBERS | STRING ),
+ |
+ 'data': STRING,
+ 'boxed': true,
+ )
+ '*returns': TYPE-REF,
+ '*success-response': false,
+ '*gen': false,
+ '*allow-oob': true,
+ '*allow-preconfig': true,
+ '*coroutine': true,
+ '*if': COND,
+ '*features': FEATURES }
+
+Member 'command' names the command.
+
+Member 'data' defines the arguments. It defaults to an empty MEMBERS_
+object.
+
+If 'data' is a MEMBERS_ object, then MEMBERS defines arguments just
+like a struct type's 'data' defines struct type members.
+
+If 'data' is a STRING, then STRING names a complex type whose members
+are the arguments. A union type requires ``'boxed': true``.
+
+Member 'returns' defines the command's return type. It defaults to an
+empty struct type. It must normally be a complex type or an array of
+a complex type. To return anything else, the command must be listed
+in pragma 'command-returns-exceptions'. If you do this, extending
+the command to return additional information will be harder. Use of
+the pragma for new commands is strongly discouraged.
+
+A command's error responses are not specified in the QAPI schema.
+Error conditions should be documented in comments.
+
+In the Client JSON Protocol, the value of the "execute" or "exec-oob"
+member is the command name. The value of the "arguments" member then
+has to conform to the arguments, and the value of the success
+response's "return" member will conform to the return type.
+
+Some example commands::
+
+ { 'command': 'my-first-command',
+ 'data': { 'arg1': 'str', '*arg2': 'str' } }
+ { 'struct': 'MyType', 'data': { '*value': 'str' } }
+ { 'command': 'my-second-command',
+ 'returns': [ 'MyType' ] }
+
+which would validate this Client JSON Protocol transaction::
+
+ => { "execute": "my-first-command",
+ "arguments": { "arg1": "hello" } }
+ <= { "return": { } }
+ => { "execute": "my-second-command" }
+ <= { "return": [ { "value": "one" }, { } ] }
+
+The generator emits a prototype for the C function implementing the
+command. The function itself needs to be written by hand. See
+section `Code generated for commands`_ for examples.
+
+The function returns the return type. When member 'boxed' is absent,
+it takes the command arguments as arguments one by one, in QAPI schema
+order. Else it takes them wrapped in the C struct generated for the
+complex argument type. It takes an additional ``Error **`` argument in
+either case.
+
+The generator also emits a marshalling function that extracts
+arguments for the user's function out of an input QDict, calls the
+user's function, and if it succeeded, builds an output QObject from
+its return value. This is for use by the QMP monitor core.
+
+In rare cases, QAPI cannot express a type-safe representation of a
+corresponding Client JSON Protocol command. You then have to suppress
+generation of a marshalling function by including a member 'gen' with
+boolean value false, and instead write your own function. For
+example::
+
+ { 'command': 'netdev_add',
+ 'data': {'type': 'str', 'id': 'str'},
+ 'gen': false }
+
+Please try to avoid adding new commands that rely on this, and instead
+use type-safe unions.
+
+Normally, the QAPI schema is used to describe synchronous exchanges,
+where a response is expected. But in some cases, the action of a
+command is expected to change state in a way that a successful
+response is not possible (although the command will still return an
+error object on failure). When a successful reply is not possible,
+the command definition includes the optional member 'success-response'
+with boolean value false. So far, only QGA makes use of this member.
+
+Member 'allow-oob' declares whether the command supports out-of-band
+(OOB) execution. It defaults to false. For example::
+
+ { 'command': 'migrate_recover',
+ 'data': { 'uri': 'str' }, 'allow-oob': true }
+
+See qmp-spec.txt for out-of-band execution syntax and semantics.
+
+Commands supporting out-of-band execution can still be executed
+in-band.
+
+When a command is executed in-band, its handler runs in the main
+thread with the BQL held.
+
+When a command is executed out-of-band, its handler runs in a
+dedicated monitor I/O thread with the BQL *not* held.
+
+An OOB-capable command handler must satisfy the following conditions:
+
+- It terminates quickly.
+- It does not invoke system calls that may block.
+- It does not access guest RAM that may block when userfaultfd is
+ enabled for postcopy live migration.
+- It takes only "fast" locks, i.e. all critical sections protected by
+ any lock it takes also satisfy the conditions for OOB command
+ handler code.
+
+The restrictions on locking limit access to shared state. Such access
+requires synchronization, but OOB commands can't take the BQL or any
+other "slow" lock.
+
+When in doubt, do not implement OOB execution support.
+
+Member 'allow-preconfig' declares whether the command is available
+before the machine is built. It defaults to false. For example::
+
+ { 'enum': 'QMPCapability',
+ 'data': [ 'oob' ] }
+ { 'command': 'qmp_capabilities',
+ 'data': { '*enable': [ 'QMPCapability' ] },
+ 'allow-preconfig': true }
+
+QMP is available before the machine is built only when QEMU was
+started with --preconfig.
+
+Member 'coroutine' tells the QMP dispatcher whether the command handler
+is safe to be run in a coroutine. It defaults to false. If it is true,
+the command handler is called from coroutine context and may yield while
+waiting for an external event (such as I/O completion) in order to avoid
+blocking the guest and other background operations.
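+
+Example: a hypothetical coroutine-safe command (the command name and
+argument are made up only to illustrate the syntax) ::
+
+ { 'command': 'my-background-query',
+   'data': { 'path': 'str' },
+   'coroutine': true }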
+
+Coroutine safety can be hard to prove, similar to thread safety. Common
+pitfalls are:
+
+- The global mutex isn't held across ``qemu_coroutine_yield()``, so
+ operations that used to assume that they execute atomically may have
+ to be more careful to protect against changes in the global state.
+
+- Nested event loops (``AIO_WAIT_WHILE()`` etc.) are problematic in
+ coroutine context and can easily lead to deadlocks. They should be
+ replaced by yielding and reentering the coroutine when the condition
+ becomes false.
+
+Since the command handler may assume coroutine context, any callers
+other than the QMP dispatcher must also call it in coroutine context.
+In particular, HMP commands calling such a QMP command handler must be
+marked ``.coroutine = true`` in hmp-commands.hx.
+
+It is an error to specify both ``'coroutine': true`` and ``'allow-oob': true``
+for a command. We don't currently have a use case for both together and
+without a use case, it's not entirely clear what the semantics should
+be.
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Events
+------
+
+Syntax::
+
+ EVENT = { 'event': STRING,
+ (
+ '*data': ( MEMBERS | STRING ),
+ |
+ 'data': STRING,
+ 'boxed': true,
+ )
+ '*if': COND,
+ '*features': FEATURES }
+
+Member 'event' names the event. This is the event name used in the
+Client JSON Protocol.
+
+Member 'data' defines the event-specific data. It defaults to an
+empty MEMBERS object.
+
+If 'data' is a MEMBERS object, then MEMBERS defines event-specific
+data just like a struct type's 'data' defines struct type members.
+
+If 'data' is a STRING, then STRING names a complex type whose members
+are the event-specific data. A union type requires ``'boxed': true``.
+
+An example event is::
+
+ { 'event': 'EVENT_C',
+ 'data': { '*a': 'int', 'b': 'str' } }
+
+Resulting in this JSON object::
+
+ { "event": "EVENT_C",
+ "data": { "b": "test string" },
+ "timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
+
+The generator emits a function to send the event. When member 'boxed'
+is absent, it takes event-specific data one by one, in QAPI schema
+order. Else it takes them wrapped in the C struct generated for the
+complex type. See section `Code generated for events`_ for examples.
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+.. _FEATURE:
+
+Features
+--------
+
+Syntax::
+
+ FEATURES = [ FEATURE, ... ]
+ FEATURE = STRING
+ | { 'name': STRING, '*if': COND }
+
+Sometimes, the behaviour of QEMU changes compatibly, but without a
+change in the QMP syntax (usually by allowing values or operations
+that previously resulted in an error). QMP clients may still need to
+know whether the extension is available.
+
+For this purpose, a list of features can be specified for a command or
+struct type. Each list member can either be ``{ 'name': STRING, '*if':
+COND }``, or STRING, which is shorthand for ``{ 'name': STRING }``.
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+Example::
+
+ { 'struct': 'TestType',
+ 'data': { 'number': 'int' },
+ 'features': [ 'allow-negative-numbers' ] }
+
+The feature strings are exposed to clients in introspection, as
+explained in section `Client JSON Protocol introspection`_.
+
+Intended use is to have each feature string signal that this build of
+QEMU shows a certain behaviour.
+
+
+Special features
+~~~~~~~~~~~~~~~~
+
+Feature "deprecated" marks a command, event, enum value, or struct
+member as deprecated. It is not supported elsewhere so far.
+Interfaces so marked may be withdrawn in future releases in accordance
+with QEMU's deprecation policy.
+
+Feature "unstable" marks a command, event, enum value, or struct
+member as unstable. It is not supported elsewhere so far. Interfaces
+so marked may be withdrawn or changed incompatibly in future releases.
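+
+Example: a hypothetical command carrying the special feature "deprecated"
+(illustrative only) ::
+
+ { 'command': 'my-legacy-command',
+   'data': { '*arg': 'str' },
+   'features': [ 'deprecated' ] }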
+
+
+Naming rules and reserved names
+-------------------------------
+
+All names must begin with a letter, and contain only ASCII letters,
+digits, hyphen, and underscore. There are two exceptions: enum values
+may start with a digit, and names that are downstream extensions (see
+section `Downstream extensions`_) start with underscore.
+
+Names beginning with ``q_`` are reserved for the generator, which uses
+them for munging QMP names that resemble C keywords or other
+problematic strings. For example, a member named ``default`` in qapi
+becomes ``q_default`` in the generated C code.
+
+Types, commands, and events share a common namespace. Therefore,
+generally speaking, type definitions should always use CamelCase for
+user-defined type names, while built-in types are lowercase.
+
+Type names ending with ``Kind`` or ``List`` are reserved for the
+generator, which uses them for implicit union enums and array types,
+respectively.
+
+Command names, and member names within a type, should be all lower
+case with words separated by a hyphen. However, some existing older
+commands and complex types use underscore; when extending them,
+consistency is preferred over blindly avoiding underscore.
+
+Event names should be ALL_CAPS with words separated by underscore.
+
+Member name ``u`` and names starting with ``has-`` or ``has_`` are reserved
+for the generator, which uses them for unions and for tracking
+optional members.
+
+Names beginning with ``x-`` used to signify "experimental". This
+convention has been replaced by special feature "unstable".
+
+Pragmas ``command-name-exceptions`` and ``member-name-exceptions`` let
+you violate naming rules. Use for new code is strongly discouraged. See
+`Pragma directives`_ for details.
+
+
+Downstream extensions
+---------------------
+
+QAPI schema names that are externally visible, say in the Client JSON
+Protocol, need to be managed with care. Names starting with a
+downstream prefix of the form __RFQDN_ are reserved for the downstream
+who controls the valid, reverse fully qualified domain name RFQDN.
+RFQDN may only contain ASCII letters, digits, hyphen and period.
+
+Example: Red Hat, Inc. controls redhat.com, and may therefore add a
+downstream command ``__com.redhat_drive-mirror``.
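+
+A definition of such a command could look like this (the arguments are
+made up purely for illustration) ::
+
+ { 'command': '__com.redhat_drive-mirror',
+   'data': { 'device': 'str', 'target': 'str' } }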
+
+
+Configuring the schema
+----------------------
+
+Syntax::
+
+ COND = STRING
+      | { 'all': [ COND, ... ] }
+      | { 'any': [ COND, ... ] }
+ | { 'not': COND }
+
+All definitions take an optional 'if' member. Its value must be a
+string, or an object with a single member 'all', 'any' or 'not'.
+
+The C code generated for the definition will then be guarded by an #if
+preprocessing directive with an operand generated from that condition:
+
+ * STRING will generate defined(STRING)
+ * { 'all': [COND, ...] } will generate (COND && ...)
+ * { 'any': [COND, ...] } will generate (COND || ...)
+ * { 'not': COND } will generate !COND
+
+Example: a conditional struct ::
+
+ { 'struct': 'IfStruct', 'data': { 'foo': 'int' },
+ 'if': { 'all': [ 'CONFIG_FOO', 'HAVE_BAR' ] } }
+
+gets its generated code guarded like this::
+
+ #if defined(CONFIG_FOO) && defined(HAVE_BAR)
+ ... generated code ...
+ #endif /* defined(HAVE_BAR) && defined(CONFIG_FOO) */
+
+Individual members of complex types, command arguments, and
+event-specific data can also be made conditional. This requires the
+longhand form of MEMBER.
+
+Example: a struct type with unconditional member 'foo' and conditional
+member 'bar' ::
+
+ { 'struct': 'IfStruct',
+ 'data': { 'foo': 'int',
+ 'bar': { 'type': 'int', 'if': 'IFCOND'} } }
+
+A union's discriminator may not be conditional.
+
+Likewise, individual enumeration values can be conditional. This requires
+the longhand form of ENUM-VALUE_.
+
+Example: an enum type with unconditional value 'foo' and conditional
+value 'bar' ::
+
+ { 'enum': 'IfEnum',
+ 'data': [ 'foo',
+ { 'name' : 'bar', 'if': 'IFCOND' } ] }
+
+Likewise, features can be conditional. This requires the longhand
+form of FEATURE_.
+
+Example: a struct with conditional feature 'allow-negative-numbers' ::
+
+ { 'struct': 'TestType',
+ 'data': { 'number': 'int' },
+ 'features': [ { 'name': 'allow-negative-numbers',
+ 'if': 'IFCOND' } ] }
+
+Please note that you are responsible to ensure that the C code will
+compile with an arbitrary combination of conditions, since the
+generator is unable to check it at this point.
+
+The conditions apply to introspection as well, i.e. introspection
+shows a conditional entity only when the condition is satisfied in
+this particular build.
+
+
+Documentation comments
+----------------------
+
+A multi-line comment that starts and ends with a ``##`` line is a
+documentation comment.
+
+If the documentation comment starts like ::
+
+ ##
+ # @SYMBOL:
+
+it documents the definition of SYMBOL, else it's free-form
+documentation.
+
+See below for more on `Definition documentation`_.
+
+Free-form documentation may be used to provide additional text and
+structuring content.
+
+
+Headings and subheadings
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A free-form documentation comment containing a line which starts with
+some ``=`` symbols and then a space defines a section heading::
+
+ ##
+ # = This is a top level heading
+ #
+ # This is a free-form comment which will go under the
+ # top level heading.
+ ##
+
+ ##
+ # == This is a second level heading
+ ##
+
+A heading line must be the first line of the documentation
+comment block.
+
+Section headings must always be correctly nested, so you can only
+define a third-level heading inside a second-level heading, and so on.
+
+
+Documentation markup
+~~~~~~~~~~~~~~~~~~~~
+
+Documentation comments can use most rST markup. In particular,
+a ``::`` literal block can be used for examples::
+
+ # ::
+ #
+ # Text of the example, may span
+ # multiple lines
+
+``*`` starts an itemized list::
+
+ # * First item, may span
+ # multiple lines
+ # * Second item
+
+You can also use ``-`` instead of ``*``.
+
+A decimal number followed by ``.`` starts a numbered list::
+
+ # 1. First item, may span
+ # multiple lines
+ # 2. Second item
+
+The actual number doesn't matter.
+
+Lists of either kind must be preceded and followed by a blank line.
+If a list item's text spans multiple lines, then the second and
+subsequent lines must be correctly indented to line up with the
+first character of the first line.
+
+The usual ****strong****, *\*emphasized\** and ````literal```` markup
+should be used. If you need a single literal ``*``, you will need to
+backslash-escape it. As an extension beyond the usual rST syntax, you
+can also use ``@foo`` to reference a name in the schema; this is rendered
+the same way as ````foo````.
+
+Example::
+
+ ##
+ # Some text foo with **bold** and *emphasis*
+ # 1. with a list
+ # 2. like that
+ #
+ # And some code:
+ #
+ # ::
+ #
+ # $ echo foo
+ # -> do this
+ # <- get that
+ ##
+
+
+Definition documentation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Definition documentation, if present, must immediately precede the
+definition it documents.
+
+When documentation is required (see pragma_ 'doc-required'), every
+definition must have documentation.
+
+Definition documentation starts with a line naming the definition,
+followed by an optional overview, a description of each argument (for
+commands and events), member (for structs and unions), branch (for
+alternates), or value (for enums), a description of each feature (if
+any), and finally optional tagged sections.
+
+The description of an argument or feature 'name' starts with
+'\@name:'. The description text can start on the line following the
+'\@name:', in which case it must not be indented at all. It can also
+start on the same line as the '\@name:'. In this case if it spans
+multiple lines then second and subsequent lines must be indented to
+line up with the first character of the first line of the
+description::
+
+ # @argone:
+ # This is a two line description
+ # in the first style.
+ #
+ # @argtwo: This is a two line description
+ # in the second style.
+
+The number of spaces between the ':' and the text is not significant.
+
+.. admonition:: FIXME
+
+ The parser accepts these things in almost any order.
+
+.. admonition:: FIXME
+
+ union branches should be described, too.
+
+Extensions added after the definition was first released carry a
+'(since x.y.z)' comment.
+
+The feature descriptions must be preceded by a line "Features:", like
+this::
+
+ # Features:
+ # @feature: Description text
+
+A tagged section starts with one of the following words:
+"Note:"/"Notes:", "Since:", "Example"/"Examples", "Returns:", "TODO:".
+The section ends with the start of a new section.
+
+The text of a section can start on a new line, in
+which case it must not be indented at all. It can also start
+on the same line as the 'Note:', 'Returns:', etc tag. In this
+case if it spans multiple lines then second and subsequent
+lines must be indented to match the first, in the same way as
+multiline argument descriptions.
+
+A 'Since: x.y.z' tagged section lists the release that introduced the
+definition.
+
+An 'Example' or 'Examples' section is automatically rendered
+entirely as literal fixed-width text. In other sections,
+the text is formatted, and rST markup can be used.
+
+For example::
+
+ ##
+ # @BlockStats:
+ #
+ # Statistics of a virtual block device or a block backing device.
+ #
+ # @device: If the stats are for a virtual block device, the name
+ # corresponding to the virtual block device.
+ #
+ # @node-name: The node name of the device. (since 2.3)
+ #
+ # ... more members ...
+ #
+ # Since: 0.14.0
+ ##
+ { 'struct': 'BlockStats',
+ 'data': {'*device': 'str', '*node-name': 'str',
+ ... more members ... } }
+
+ ##
+ # @query-blockstats:
+ #
+ # Query the @BlockStats for all virtual block devices.
+ #
+ # @query-nodes: If true, the command will query all the
+ # block nodes ... explain, explain ... (since 2.3)
+ #
+ # Returns: A list of @BlockStats for each virtual block devices.
+ #
+ # Since: 0.14.0
+ #
+ # Example:
+ #
+ # -> { "execute": "query-blockstats" }
+ # <- {
+ # ... lots of output ...
+ # }
+ #
+ ##
+ { 'command': 'query-blockstats',
+ 'data': { '*query-nodes': 'bool' },
+ 'returns': ['BlockStats'] }
+
+
+Client JSON Protocol introspection
+==================================
+
+Clients of a Client JSON Protocol commonly need to figure out what
+exactly the server (QEMU) supports.
+
+For this purpose, QMP provides introspection via command
+query-qmp-schema. QGA currently doesn't support introspection.
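+
+A client typically runs the command once per connection and caches the
+result; schematically (the response is abbreviated here) ::
+
+ => { "execute": "query-qmp-schema" }
+ <= { "return": [ ...an array of SchemaInfo objects... ] }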
+
+While Client JSON Protocol wire compatibility should be maintained
+between qemu versions, we cannot make the same guarantees for
+introspection stability. For example, one version of qemu may provide
+a non-variant optional member of a struct, and a later version rework
+the member to instead be non-optional and associated with a variant.
+Likewise, one version of qemu may list a member with open-ended type
+'str', and a later version could convert it to a finite set of strings
+via an enum type; or a member may be converted from a specific type to
+an alternate that represents a choice between the original type and
+something else.
+
+query-qmp-schema returns a JSON array of SchemaInfo objects. These
+objects together describe the wire ABI, as defined in the QAPI schema.
+There is no specified order to the SchemaInfo objects returned; a
+client must search for a particular name throughout the entire array
+to learn more about that name, but is at least guaranteed that there
+will be no collisions between type, command, and event names.
+
+However, the SchemaInfo can't reflect all the rules and restrictions
+that apply to QMP. It's interface introspection (figuring out what's
+there), not interface specification. The specification is in the QAPI
+schema. To understand how QMP is to be used, you need to study the
+QAPI schema.
+
+Like any other command, query-qmp-schema is itself defined in the QAPI
+schema, along with the SchemaInfo type. This text attempts to give an
+overview how things work. For details you need to consult the QAPI
+schema.
+
+SchemaInfo objects have common members "name", "meta-type",
+"features", and additional variant members depending on the value of
+meta-type.
+
+Each SchemaInfo object describes a wire ABI entity of a certain
+meta-type: a command, event or one of several kinds of type.
+
+SchemaInfo for commands and events have the same name as in the QAPI
+schema.
+
+Command and event names are part of the wire ABI, but type names are
+not. Therefore, the SchemaInfo for types have auto-generated
+meaningless names. For readability, the examples in this section use
+meaningful type names instead.
+
+Optional member "features" exposes the entity's feature strings as a
+JSON array of strings.
+
+To examine a type, start with a command or event using it, then follow
+references by name.
+
+QAPI schema definitions not reachable that way are omitted.
+
+The SchemaInfo for a command has meta-type "command", and variant
+members "arg-type", "ret-type" and "allow-oob". On the wire, the
+"arguments" member of a client's "execute" command must conform to the
+object type named by "arg-type". The "return" member that the server
+passes in a success response conforms to the type named by "ret-type".
+When "allow-oob" is true, it means the command supports out-of-band
+execution. It defaults to false.
+
+If the command takes no arguments, "arg-type" names an object type
+without members. Likewise, if the command returns nothing, "ret-type"
+names an object type without members.
+
+Example: the SchemaInfo for command query-qmp-schema ::
+
+ { "name": "query-qmp-schema", "meta-type": "command",
+ "arg-type": "q_empty", "ret-type": "SchemaInfoList" }
+
+ Type "q_empty" is an automatic object type without members, and type
+ "SchemaInfoList" is the array of SchemaInfo type.
+
+The SchemaInfo for an event has meta-type "event", and variant member
+"arg-type". On the wire, a "data" member that the server passes in an
+event conforms to the object type named by "arg-type".
+
+If the event carries no additional information, "arg-type" names an
+object type without members. The event may not have a data member on
+the wire then.
+
+Each command or event defined with 'data' as MEMBERS object in the
+QAPI schema implicitly defines an object type.
+
+Example: the SchemaInfo for EVENT_C from section Events_ ::
+
+ { "name": "EVENT_C", "meta-type": "event",
+ "arg-type": "q_obj-EVENT_C-arg" }
+
+ Type "q_obj-EVENT_C-arg" is an implicitly defined object type with
+ the two members from the event's definition.
+
+The SchemaInfo for struct and union types has meta-type "object".
+
+The SchemaInfo for a struct type has variant member "members".
+
+The SchemaInfo for a union type additionally has variant members "tag"
+and "variants".
+
+"members" is a JSON array describing the object's common members, if
+any. Each element is a JSON object with members "name" (the member's
+name), "type" (the name of its type), "features" (a JSON array of
+feature strings), and "default". The latter two are optional. The
+member is optional if "default" is present. Currently, "default" can
+only have value null. Other values are reserved for future
+extensions. The "members" array is in no particular order; clients
+must search the entire object when learning whether a particular
+member is supported.
+
+Example: the SchemaInfo for MyType from section `Struct types`_ ::
+
+ { "name": "MyType", "meta-type": "object",
+ "members": [
+ { "name": "member1", "type": "str" },
+ { "name": "member2", "type": "int" },
+ { "name": "member3", "type": "str", "default": null } ] }
+
+"features" exposes the command's feature strings as a JSON array of
+strings.
+
+Example: the SchemaInfo for TestType from section Features_::
+
+ { "name": "TestType", "meta-type": "object",
+ "members": [
+ { "name": "number", "type": "int" } ],
+ "features": ["allow-negative-numbers"] }
+
+"tag" is the name of the common member serving as type tag.
+"variants" is a JSON array describing the object's variant members.
+Each element is a JSON object with members "case" (the value of type
+tag this element applies to) and "type" (the name of an object type
+that provides the variant members for this type tag value). The
+"variants" array is in no particular order, and is not guaranteed to
+list cases in the same order as the corresponding "tag" enum type.
+
+Example: the SchemaInfo for union BlockdevOptions from section
+`Union types`_ ::
+
+ { "name": "BlockdevOptions", "meta-type": "object",
+ "members": [
+ { "name": "driver", "type": "BlockdevDriver" },
+ { "name": "read-only", "type": "bool", "default": null } ],
+ "tag": "driver",
+ "variants": [
+ { "case": "file", "type": "BlockdevOptionsFile" },
+ { "case": "qcow2", "type": "BlockdevOptionsQcow2" } ] }
+
+Note that base types are "flattened": their members are included in the
+"members" array.
+
+The SchemaInfo for an alternate type has meta-type "alternate", and
+variant member "members". "members" is a JSON array. Each element is
+a JSON object with member "type", which names a type. Values of the
+alternate type conform to exactly one of its member types. There is
+no guarantee on the order in which "members" will be listed.
+
+Example: the SchemaInfo for BlockdevRef from section `Alternate types`_ ::
+
+ { "name": "BlockdevRef", "meta-type": "alternate",
+ "members": [
+ { "type": "BlockdevOptions" },
+ { "type": "str" } ] }
+
+The SchemaInfo for an array type has meta-type "array", and variant
+member "element-type", which names the array's element type. Array
+types are implicitly defined. For convenience, the array's name may
+resemble the element type; however, clients should examine member
+"element-type" instead of making assumptions based on parsing member
+"name".
+
+Example: the SchemaInfo for ['str'] ::
+
+ { "name": "[str]", "meta-type": "array",
+ "element-type": "str" }
+
+The SchemaInfo for an enumeration type has meta-type "enum" and
+variant member "members".
+
+"members" is a JSON array describing the enumeration values. Each
+element is a JSON object with member "name" (the member's name), and
+optionally "features" (a JSON array of feature strings). The
+"members" array is in no particular order; clients must search the
+entire array when learning whether a particular value is supported.
+
+Example: the SchemaInfo for MyEnum from section `Enumeration types`_ ::
+
+ { "name": "MyEnum", "meta-type": "enum",
+ "members": [
+ { "name": "value1" },
+ { "name": "value2" },
+ { "name": "value3" }
+ ] }
+
+The SchemaInfo for a built-in type has the same name as the type in
+the QAPI schema (see section `Built-in Types`_), with one exception
+detailed below. It has variant member "json-type" that shows how
+values of this type are encoded on the wire.
+
+Example: the SchemaInfo for str ::
+
+ { "name": "str", "meta-type": "builtin", "json-type": "string" }
+
+The QAPI schema supports a number of integer types that only differ in
+how they map to C. They are identical as far as SchemaInfo is
+concerned. Therefore, they get all mapped to a single type "int" in
+SchemaInfo.
+
+As explained above, type names are not part of the wire ABI. Not even
+the names of built-in types. Clients should examine member
+"json-type" instead of hard-coding names of built-in types.
+
+
+Compatibility considerations
+============================
+
+Maintaining backward compatibility at the Client JSON Protocol level
+while evolving the schema requires some care. This section is about
+syntactic compatibility, which is necessary, but not sufficient, for
+actual compatibility.
+
+Clients send commands with argument data, and receive command
+responses with return data and events with event data.
+
+Adding opt-in functionality to the send direction is backwards
+compatible: adding commands, optional arguments, enumeration values,
+union and alternate branches; turning an argument type into an
+alternate of that type; making mandatory arguments optional. Clients
+oblivious of the new functionality continue to work.
+
+Incompatible changes include removing commands, command arguments,
+enumeration values, union and alternate branches, adding mandatory
+command arguments, and making optional arguments mandatory.
+
+The specified behavior of an absent optional argument should remain
+the same. With proper documentation, this policy still allows some
+flexibility; for example, when an optional 'buffer-size' argument is
+specified to default to a sensible buffer size, the actual default
+value can still be changed. The specified default behavior is not the
+exact size of the buffer, only that the default size is sensible.
+
+Adding functionality to the receive direction is generally backwards
+compatible: adding events, adding return and event data members.
+Clients are expected to ignore the ones they don't know.
+
+Removing "unreachable" stuff like events that can't be triggered
+anymore, optional return or event data members that can't be sent
+anymore, and return or event data member (enumeration) values that
+can't be sent anymore makes no difference to clients, except for
+introspection. The latter can conceivably confuse clients, so tread
+carefully.
+
+Incompatible changes include removing return and event data members.
+
+Any change to a command definition's 'data' or one of the types used
+there (recursively) needs to consider send direction compatibility.
+
+Any change to a command definition's 'return', an event definition's
+'data', or one of the types used there (recursively) needs to consider
+receive direction compatibility.
+
+Any change to types used in both contexts needs to consider both.
+
+Enumeration type values and complex and alternate type members may be
+reordered freely. For enumerations and alternate types, this doesn't
+affect the wire encoding. For complex types, this might make the
+implementation emit JSON object members in a different order, which
+the Client JSON Protocol permits.
+
+Since type names are not visible in the Client JSON Protocol, types
+may be freely renamed. Even certain refactorings are invisible, such
+as splitting members from one type into a common base type.
+
+
+Code generation
+===============
+
+The QAPI code generator qapi-gen.py generates code and documentation
+from the schema. Together with the core QAPI libraries, this code
+provides everything required to take JSON commands read in by a Client
+JSON Protocol server, unmarshal the arguments into the underlying C
+types, call into the corresponding C function, map the response back
+to a Client JSON Protocol response to be returned to the user, and
+introspect the commands.
+
+As an example, we'll use the following schema, which describes a
+single complex user-defined type, along with a command which takes a
+list of that type as a parameter, and returns a single element of that
+type. The user is responsible for writing the implementation of
+qmp_my_command(); everything else is produced by the generator. ::
+
+ $ cat example-schema.json
+ { 'struct': 'UserDefOne',
+ 'data': { 'integer': 'int', '*string': 'str' } }
+
+ { 'command': 'my-command',
+ 'data': { 'arg1': ['UserDefOne'] },
+ 'returns': 'UserDefOne' }
+
+ { 'event': 'MY_EVENT' }
+
+We run qapi-gen.py like this::
+
+ $ python scripts/qapi-gen.py --output-dir="qapi-generated" \
+ --prefix="example-" example-schema.json
+
+For a more thorough look at generated code, the testsuite includes
+tests/qapi-schema/qapi-schema-test.json that covers more examples of
+what the generator will accept, and compiles the resulting C code as
+part of 'make check-unit'.
+
+
+Code generated for QAPI types
+-----------------------------
+
+The following files are created:
+
+ ``$(prefix)qapi-types.h``
+ C types corresponding to types defined in the schema
+
+ ``$(prefix)qapi-types.c``
+ Cleanup functions for the above C types
+
+The $(prefix) is an optional parameter used as a namespace to keep the
+generated code from one schema/code-generation separated from others so code
+can be generated/used from multiple schemas without clobbering previously
+created code.
+
+Example::
+
+ $ cat qapi-generated/example-qapi-types.h
+ [Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_TYPES_H
+ #define EXAMPLE_QAPI_TYPES_H
+
+ #include "qapi/qapi-builtin-types.h"
+
+ typedef struct UserDefOne UserDefOne;
+
+ typedef struct UserDefOneList UserDefOneList;
+
+ typedef struct q_obj_my_command_arg q_obj_my_command_arg;
+
+ struct UserDefOne {
+ int64_t integer;
+ bool has_string;
+ char *string;
+ };
+
+ void qapi_free_UserDefOne(UserDefOne *obj);
+ G_DEFINE_AUTOPTR_CLEANUP_FUNC(UserDefOne, qapi_free_UserDefOne)
+
+ struct UserDefOneList {
+ UserDefOneList *next;
+ UserDefOne *value;
+ };
+
+ void qapi_free_UserDefOneList(UserDefOneList *obj);
+ G_DEFINE_AUTOPTR_CLEANUP_FUNC(UserDefOneList, qapi_free_UserDefOneList)
+
+ struct q_obj_my_command_arg {
+ UserDefOneList *arg1;
+ };
+
+ #endif /* EXAMPLE_QAPI_TYPES_H */
+ $ cat qapi-generated/example-qapi-types.c
+ [Uninteresting stuff omitted...]
+
+ void qapi_free_UserDefOne(UserDefOne *obj)
+ {
+ Visitor *v;
+
+ if (!obj) {
+ return;
+ }
+
+ v = qapi_dealloc_visitor_new();
+ visit_type_UserDefOne(v, NULL, &obj, NULL);
+ visit_free(v);
+ }
+
+ void qapi_free_UserDefOneList(UserDefOneList *obj)
+ {
+ Visitor *v;
+
+ if (!obj) {
+ return;
+ }
+
+ v = qapi_dealloc_visitor_new();
+ visit_type_UserDefOneList(v, NULL, &obj, NULL);
+ visit_free(v);
+ }
+
+ [Uninteresting stuff omitted...]
+
+For a modular QAPI schema (see section `Include directives`_), code for
+each sub-module SUBDIR/SUBMODULE.json is actually generated into ::
+
+ SUBDIR/$(prefix)qapi-types-SUBMODULE.h
+ SUBDIR/$(prefix)qapi-types-SUBMODULE.c
+
+If qapi-gen.py is run with option --builtins, additional files are
+created:
+
+ ``qapi-builtin-types.h``
+ C types corresponding to built-in types
+
+ ``qapi-builtin-types.c``
+ Cleanup functions for the above C types
+
+
+Code generated for visiting QAPI types
+--------------------------------------
+
+These are the visitor functions used to walk through and convert
+between a native QAPI C data structure and some other format (such as
+QObject); the generated functions are named visit_type_FOO() and
+visit_type_FOO_members().
+
+The following files are generated:
+
+ ``$(prefix)qapi-visit.c``
+ Visitor function for a particular C type, used to automagically
+ convert QObjects into the corresponding C type and vice-versa, as
+ well as for deallocating memory for an existing C type
+
+ ``$(prefix)qapi-visit.h``
+ Declarations for previously mentioned visitor functions
+
+Example::
+
+ $ cat qapi-generated/example-qapi-visit.h
+ [Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_VISIT_H
+ #define EXAMPLE_QAPI_VISIT_H
+
+ #include "qapi/qapi-builtin-visit.h"
+ #include "example-qapi-types.h"
+
+
+ bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp);
+
+ bool visit_type_UserDefOne(Visitor *v, const char *name,
+ UserDefOne **obj, Error **errp);
+
+ bool visit_type_UserDefOneList(Visitor *v, const char *name,
+ UserDefOneList **obj, Error **errp);
+
+ bool visit_type_q_obj_my_command_arg_members(Visitor *v, q_obj_my_command_arg *obj, Error **errp);
+
+ #endif /* EXAMPLE_QAPI_VISIT_H */
+ $ cat qapi-generated/example-qapi-visit.c
+ [Uninteresting stuff omitted...]
+
+ bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp)
+ {
+ if (!visit_type_int(v, "integer", &obj->integer, errp)) {
+ return false;
+ }
+ if (visit_optional(v, "string", &obj->has_string)) {
+ if (!visit_type_str(v, "string", &obj->string, errp)) {
+ return false;
+ }
+ }
+ return true;
+ }
+
+ bool visit_type_UserDefOne(Visitor *v, const char *name,
+ UserDefOne **obj, Error **errp)
+ {
+ bool ok = false;
+
+ if (!visit_start_struct(v, name, (void **)obj, sizeof(UserDefOne), errp)) {
+ return false;
+ }
+ if (!*obj) {
+ /* incomplete */
+ assert(visit_is_dealloc(v));
+ ok = true;
+ goto out_obj;
+ }
+ if (!visit_type_UserDefOne_members(v, *obj, errp)) {
+ goto out_obj;
+ }
+ ok = visit_check_struct(v, errp);
+ out_obj:
+ visit_end_struct(v, (void **)obj);
+ if (!ok && visit_is_input(v)) {
+ qapi_free_UserDefOne(*obj);
+ *obj = NULL;
+ }
+ return ok;
+ }
+
+ bool visit_type_UserDefOneList(Visitor *v, const char *name,
+ UserDefOneList **obj, Error **errp)
+ {
+ bool ok = false;
+ UserDefOneList *tail;
+ size_t size = sizeof(**obj);
+
+ if (!visit_start_list(v, name, (GenericList **)obj, size, errp)) {
+ return false;
+ }
+
+ for (tail = *obj; tail;
+ tail = (UserDefOneList *)visit_next_list(v, (GenericList *)tail, size)) {
+ if (!visit_type_UserDefOne(v, NULL, &tail->value, errp)) {
+ goto out_obj;
+ }
+ }
+
+ ok = visit_check_list(v, errp);
+ out_obj:
+ visit_end_list(v, (void **)obj);
+ if (!ok && visit_is_input(v)) {
+ qapi_free_UserDefOneList(*obj);
+ *obj = NULL;
+ }
+ return ok;
+ }
+
+ bool visit_type_q_obj_my_command_arg_members(Visitor *v, q_obj_my_command_arg *obj, Error **errp)
+ {
+ if (!visit_type_UserDefOneList(v, "arg1", &obj->arg1, errp)) {
+ return false;
+ }
+ return true;
+ }
+
+ [Uninteresting stuff omitted...]
+
+For a modular QAPI schema (see section `Include directives`_), code for
+each sub-module SUBDIR/SUBMODULE.json is actually generated into ::
+
+ SUBDIR/$(prefix)qapi-visit-SUBMODULE.h
+ SUBDIR/$(prefix)qapi-visit-SUBMODULE.c
+
+If qapi-gen.py is run with option --builtins, additional files are
+created:
+
+ ``qapi-builtin-visit.h``
+     Declarations for these visitor functions
+
+ ``qapi-builtin-visit.c``
+     Visitor functions for built-in types
+
+
+Code generated for commands
+---------------------------
+
+These are the marshaling/dispatch functions for the commands defined
+in the schema. The generated code provides qmp_marshal_COMMAND(), and
+declares qmp_COMMAND() that the user must implement.
+
+The following files are generated:
+
+ ``$(prefix)qapi-commands.c``
+ Command marshal/dispatch functions for each QMP command defined in
+ the schema
+
+ ``$(prefix)qapi-commands.h``
+ Function prototypes for the QMP commands specified in the schema
+
+ ``$(prefix)qapi-init-commands.h``
+ Command initialization prototype
+
+ ``$(prefix)qapi-init-commands.c``
+ Command initialization code
+
+Example::
+
+ $ cat qapi-generated/example-qapi-commands.h
+ [Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_COMMANDS_H
+ #define EXAMPLE_QAPI_COMMANDS_H
+
+ #include "example-qapi-types.h"
+
+ UserDefOne *qmp_my_command(UserDefOneList *arg1, Error **errp);
+ void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp);
+
+ #endif /* EXAMPLE_QAPI_COMMANDS_H */
+ $ cat qapi-generated/example-qapi-commands.c
+ [Uninteresting stuff omitted...]
+
+
+ static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in,
+ QObject **ret_out, Error **errp)
+ {
+ Visitor *v;
+
+ v = qobject_output_visitor_new_qmp(ret_out);
+ if (visit_type_UserDefOne(v, "unused", &ret_in, errp)) {
+ visit_complete(v, ret_out);
+ }
+ visit_free(v);
+ v = qapi_dealloc_visitor_new();
+ visit_type_UserDefOne(v, "unused", &ret_in, NULL);
+ visit_free(v);
+ }
+
+ void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp)
+ {
+ Error *err = NULL;
+ bool ok = false;
+ Visitor *v;
+ UserDefOne *retval;
+ q_obj_my_command_arg arg = {0};
+
+ v = qobject_input_visitor_new_qmp(QOBJECT(args));
+ if (!visit_start_struct(v, NULL, NULL, 0, errp)) {
+ goto out;
+ }
+ if (visit_type_q_obj_my_command_arg_members(v, &arg, errp)) {
+ ok = visit_check_struct(v, errp);
+ }
+ visit_end_struct(v, NULL);
+ if (!ok) {
+ goto out;
+ }
+
+ retval = qmp_my_command(arg.arg1, &err);
+ error_propagate(errp, err);
+ if (err) {
+ goto out;
+ }
+
+ qmp_marshal_output_UserDefOne(retval, ret, errp);
+
+ out:
+ visit_free(v);
+ v = qapi_dealloc_visitor_new();
+ visit_start_struct(v, NULL, NULL, 0, NULL);
+ visit_type_q_obj_my_command_arg_members(v, &arg, NULL);
+ visit_end_struct(v, NULL);
+ visit_free(v);
+ }
+
+ [Uninteresting stuff omitted...]
+ $ cat qapi-generated/example-qapi-init-commands.h
+ [Uninteresting stuff omitted...]
+ #ifndef EXAMPLE_QAPI_INIT_COMMANDS_H
+ #define EXAMPLE_QAPI_INIT_COMMANDS_H
+
+ #include "qapi/qmp/dispatch.h"
+
+ void example_qmp_init_marshal(QmpCommandList *cmds);
+
+ #endif /* EXAMPLE_QAPI_INIT_COMMANDS_H */
+ $ cat qapi-generated/example-qapi-init-commands.c
+ [Uninteresting stuff omitted...]
+ void example_qmp_init_marshal(QmpCommandList *cmds)
+ {
+ QTAILQ_INIT(cmds);
+
+ qmp_register_command(cmds, "my-command",
+ qmp_marshal_my_command, QCO_NO_OPTIONS);
+ }
+ [Uninteresting stuff omitted...]
+
+For a modular QAPI schema (see section `Include directives`_), code for
+each sub-module SUBDIR/SUBMODULE.json is actually generated into::
+
+ SUBDIR/$(prefix)qapi-commands-SUBMODULE.h
+ SUBDIR/$(prefix)qapi-commands-SUBMODULE.c
+
+
+Code generated for events
+-------------------------
+
+This is the code related to events defined in the schema, providing
+qapi_event_send_EVENT().
+
+The following files are created:
+
+ ``$(prefix)qapi-events.h``
+ Function prototypes for each event type
+
+ ``$(prefix)qapi-events.c``
+ Implementation of functions to send an event
+
+ ``$(prefix)qapi-emit-events.h``
+ Enumeration of all event names, and common event code declarations
+
+ ``$(prefix)qapi-emit-events.c``
+ Common event code definitions
+
+Example::
+
+ $ cat qapi-generated/example-qapi-events.h
+ [Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_EVENTS_H
+ #define EXAMPLE_QAPI_EVENTS_H
+
+ #include "qapi/util.h"
+ #include "example-qapi-types.h"
+
+ void qapi_event_send_my_event(void);
+
+ #endif /* EXAMPLE_QAPI_EVENTS_H */
+ $ cat qapi-generated/example-qapi-events.c
+ [Uninteresting stuff omitted...]
+
+ void qapi_event_send_my_event(void)
+ {
+ QDict *qmp;
+
+ qmp = qmp_event_build_dict("MY_EVENT");
+
+ example_qapi_event_emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp);
+
+ qobject_unref(qmp);
+ }
+
+ [Uninteresting stuff omitted...]
+ $ cat qapi-generated/example-qapi-emit-events.h
+ [Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_EMIT_EVENTS_H
+ #define EXAMPLE_QAPI_EMIT_EVENTS_H
+
+ #include "qapi/util.h"
+
+ typedef enum example_QAPIEvent {
+ EXAMPLE_QAPI_EVENT_MY_EVENT,
+ EXAMPLE_QAPI_EVENT__MAX,
+ } example_QAPIEvent;
+
+ #define example_QAPIEvent_str(val) \
+ qapi_enum_lookup(&example_QAPIEvent_lookup, (val))
+
+ extern const QEnumLookup example_QAPIEvent_lookup;
+
+ void example_qapi_event_emit(example_QAPIEvent event, QDict *qdict);
+
+ #endif /* EXAMPLE_QAPI_EMIT_EVENTS_H */
+ $ cat qapi-generated/example-qapi-emit-events.c
+ [Uninteresting stuff omitted...]
+
+ const QEnumLookup example_QAPIEvent_lookup = {
+ .array = (const char *const[]) {
+ [EXAMPLE_QAPI_EVENT_MY_EVENT] = "MY_EVENT",
+ },
+ .size = EXAMPLE_QAPI_EVENT__MAX
+ };
+
+ [Uninteresting stuff omitted...]
+
+For a modular QAPI schema (see section `Include directives`_), code for
+each sub-module SUBDIR/SUBMODULE.json is actually generated into ::
+
+ SUBDIR/$(prefix)qapi-events-SUBMODULE.h
+ SUBDIR/$(prefix)qapi-events-SUBMODULE.c
+
+
+Code generated for introspection
+--------------------------------
+
+The following files are created:
+
+ ``$(prefix)qapi-introspect.c``
+ Defines a string holding a JSON description of the schema
+
+ ``$(prefix)qapi-introspect.h``
+ Declares the above string
+
+Example::
+
+ $ cat qapi-generated/example-qapi-introspect.h
+ [Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_INTROSPECT_H
+ #define EXAMPLE_QAPI_INTROSPECT_H
+
+ #include "qapi/qmp/qlit.h"
+
+ extern const QLitObject example_qmp_schema_qlit;
+
+ #endif /* EXAMPLE_QAPI_INTROSPECT_H */
+ $ cat qapi-generated/example-qapi-introspect.c
+ [Uninteresting stuff omitted...]
+
+ const QLitObject example_qmp_schema_qlit = QLIT_QLIST(((QLitObject[]) {
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "arg-type", QLIT_QSTR("0"), },
+ { "meta-type", QLIT_QSTR("command"), },
+ { "name", QLIT_QSTR("my-command"), },
+ { "ret-type", QLIT_QSTR("1"), },
+ {}
+ })),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "arg-type", QLIT_QSTR("2"), },
+ { "meta-type", QLIT_QSTR("event"), },
+ { "name", QLIT_QSTR("MY_EVENT"), },
+ {}
+ })),
+ /* "0" = q_obj_my-command-arg */
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "members", QLIT_QLIST(((QLitObject[]) {
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "name", QLIT_QSTR("arg1"), },
+ { "type", QLIT_QSTR("[1]"), },
+ {}
+ })),
+ {}
+ })), },
+ { "meta-type", QLIT_QSTR("object"), },
+ { "name", QLIT_QSTR("0"), },
+ {}
+ })),
+ /* "1" = UserDefOne */
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "members", QLIT_QLIST(((QLitObject[]) {
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "name", QLIT_QSTR("integer"), },
+ { "type", QLIT_QSTR("int"), },
+ {}
+ })),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "default", QLIT_QNULL, },
+ { "name", QLIT_QSTR("string"), },
+ { "type", QLIT_QSTR("str"), },
+ {}
+ })),
+ {}
+ })), },
+ { "meta-type", QLIT_QSTR("object"), },
+ { "name", QLIT_QSTR("1"), },
+ {}
+ })),
+ /* "2" = q_empty */
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "members", QLIT_QLIST(((QLitObject[]) {
+ {}
+ })), },
+ { "meta-type", QLIT_QSTR("object"), },
+ { "name", QLIT_QSTR("2"), },
+ {}
+ })),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "element-type", QLIT_QSTR("1"), },
+ { "meta-type", QLIT_QSTR("array"), },
+ { "name", QLIT_QSTR("[1]"), },
+ {}
+ })),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "json-type", QLIT_QSTR("int"), },
+ { "meta-type", QLIT_QSTR("builtin"), },
+ { "name", QLIT_QSTR("int"), },
+ {}
+ })),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "json-type", QLIT_QSTR("string"), },
+ { "meta-type", QLIT_QSTR("builtin"), },
+ { "name", QLIT_QSTR("str"), },
+ {}
+ })),
+ {}
+ }));
+
+ [Uninteresting stuff omitted...]
diff --git a/docs/devel/qgraph.rst b/docs/devel/qgraph.rst
new file mode 100644
index 000000000..43342d9d6
--- /dev/null
+++ b/docs/devel/qgraph.rst
@@ -0,0 +1,628 @@
+.. _qgraph:
+
+Qtest Driver Framework
+======================
+
+In order to test a specific driver, plain libqos tests need to
+take care of booting QEMU with the right machine and devices.
+This makes each test "hardcoded" for a specific configuration, reducing
+the possible coverage that it can reach.
+
+For example, the sdhci device is supported on both x86_64 and ARM boards,
+therefore a generic sdhci test should test all machines and drivers that
+support that device.
+Using only libqos APIs, the test has to manually take care of
+covering all the setups, and build the correct command line.
+
+This also introduces backward compatibility issues: if a device/driver command
+line name is changed, all tests that use it will no longer work
+properly and need to be adjusted.
+
+The aim of qgraph is to create a graph of drivers, machines and tests such that
+a test aimed at a certain driver does not have to care about booting the right
+QEMU machine, picking the right device, building the command line and so on.
+Instead, it only defines what type of device it is testing
+(interface in qgraph terms) and the framework takes care of
+covering all supported types of devices and machine architectures.
+
+Following the above example, an interface would be ``sdhci``,
+so the sdhci test only needs to link its qgraph node with
+that interface. In this way, if the command line of an sdhci driver
+is changed, only the respective qgraph driver node has to be adjusted.
+
+QGraph concepts
+---------------
+
+The graph is composed of nodes that represent machines, drivers and tests,
+and edges that define the relationships between them (``CONSUMES``, ``PRODUCES``, and
+``CONTAINS``).
+
+Nodes
+~~~~~
+
+A node can be of four types:
+
+- **QNODE_MACHINE**: for example ``arm/raspi2b``
+- **QNODE_DRIVER**: for example ``generic-sdhci``
+- **QNODE_INTERFACE**: for example ``sdhci`` (interface for all ``-sdhci``
+ drivers).
+  An interface is not explicitly created; it is automatically
+  instantiated when a node consumes or produces it.
+  An interface is simply a struct that abstracts the various drivers
+  for the same type of device, and offers an API to the nodes that
+  use it ("consume" relation in qgraph terms), implemented/backed up by
+  the drivers that produce it ("produce" relation in qgraph terms).
+- **QNODE_TEST**: for example ``sdhci-test``. A test consumes an interface
+ and tests the functions provided by it.
+
+Notes for the nodes:
+
+- QNODE_MACHINE: each machine struct must have a ``QGuestAllocator`` and
+ implement ``get_driver()`` to return the allocator mapped to the interface
+ "memory". The function can also return ``NULL`` if the allocator
+ is not set.
+- QNODE_DRIVER: driver names must be unique, and machines and nodes
+  planned to be "consumed" by other nodes must match QEMU
+  driver names, otherwise they won't be discovered.
+
+Edges
+~~~~~
+
+An edge relation between two nodes (drivers or machines) ``X`` and ``Y`` can be:
+
+- ``X CONSUMES Y``: ``Y`` can be plugged into ``X``
+- ``X PRODUCES Y``: ``X`` provides the interface ``Y``
+- ``X CONTAINS Y``: ``Y`` is part of the ``X`` component
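+
+In code, these edge relations are registered with the libqos helpers shown in
+full context in the examples later in this document; the node names below are
+just placeholders::
+
+    /* X CONSUMES Y */
+    qos_node_consumes("X", "Y", NULL);
+    /* X PRODUCES Y */
+    qos_node_produces("X", "Y");
+    /* X CONTAINS Y */
+    qos_node_contains("X", "Y", NULL);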
+
+Execution steps
+~~~~~~~~~~~~~~~
+
+The basic framework steps are the following:
+
+- All nodes and edges are created in their respective
+ machine/driver/test files
+- The framework starts QEMU and asks for a list of available devices
+ and machines (note that only machines and "consumed" nodes are mapped
+ 1:1 with QEMU devices)
+- The framework walks the graph starting from the available machines and
+ performs a Depth First Search for tests
+- Once a test is found, the path is walked again and all drivers are
+ allocated accordingly and the final interface is passed to the test
+- The test is executed
+- Unused objects are cleaned and the path discovery is continued
+
+Depending on the QEMU binary used, only some drivers/machines will be
+available, and only tests that are reached by them will be executed.
+
+Command line
+~~~~~~~~~~~~
+
+The command line is built using node names and optional arguments
+passed by the user when building the edges.
+
+There are three types of command line arguments:
+
+- ``in node`` : created from the node name. For example, machines will
+  add ``-M <machine>`` to the command line, while devices add
+  ``-device <device>``. This is done automatically by the framework.
+- ``after node`` : added as an additional argument after the node name.
+  This argument is added optionally when creating edges,
+  by setting the parameters ``after_cmd_line`` and
+  ``extra_device_opts`` in ``QOSGraphEdgeOptions``.
+  The framework automatically adds a comma before ``extra_device_opts``,
+  because it is going to add attributes after the destination node
+  pointed to by the edge containing these options, and automatically
+  adds a space before ``after_cmd_line``, because it adds an
+  additional device, not an attribute.
+- ``before node`` : added as an additional argument before the node name.
+  This argument is added optionally when creating edges,
+  by setting the parameter ``before_cmd_line`` in
+  ``QOSGraphEdgeOptions``. This attribute
+  is going to add attributes before the destination node
+  pointed to by the edge containing these options. It is
+  helpful for commands that are not node-representable,
+  such as ``-fsdev`` or ``-netdev``.
+
+While the command line arguments attached to edges are always used, not all
+node names are used in every path walk: the contained or produced nodes
+are already added by QEMU, so only the nodes that "consume" an interface will
+be used to build the command line. Also, nodes that have ``{ "abstract" : true }``
+as a QMP attribute will lose their command line, since they are not proper
+devices to be added in QEMU.
+
+Example::
+
+    QOSGraphEdgeOptions opts = {
+        .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
+                           "file.read-zeroes=on,format=raw",
+        .after_cmd_line = "-device scsi-hd,bus=vs0.0,drive=drv0",
+        .extra_device_opts = "id=vs0",
+    };
+
+ qos_node_create_driver("virtio-scsi-device",
+ virtio_scsi_device_create);
+ qos_node_consumes("virtio-scsi-device", "virtio-bus", &opts);
+
+This produces the following command line:
+``-drive id=drv0,if=none,file=null-co://,file.read-zeroes=on,format=raw -device virtio-scsi-device,id=vs0 -device scsi-hd,bus=vs0.0,drive=drv0``
+
+Troubleshooting unavailable tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If there is no path from an available machine to a test then that test will be
+unavailable and won't execute. This can happen if a test or driver did not set
+up its qgraph node correctly. It can also happen if the necessary machine type
+or device is missing from the QEMU binary because it was compiled out or
+otherwise.
+
+It is possible to troubleshoot unavailable tests by running::
+
+ $ QTEST_QEMU_BINARY=build/qemu-system-x86_64 build/tests/qtest/qos-test --verbose
+ # ALL QGRAPH EDGES: {
+ # src='virtio-net'
+ # |-> dest='virtio-net-tests/vhost-user/multiqueue' type=2 (node=0x559142109e30)
+ # |-> dest='virtio-net-tests/vhost-user/migrate' type=2 (node=0x559142109d00)
+ # src='virtio-net-pci'
+ # |-> dest='virtio-net' type=1 (node=0x55914210d740)
+ # src='pci-bus'
+ # |-> dest='virtio-net-pci' type=2 (node=0x55914210d880)
+ # src='pci-bus-pc'
+ # |-> dest='pci-bus' type=1 (node=0x559142103f40)
+ # src='i440FX-pcihost'
+ # |-> dest='pci-bus-pc' type=0 (node=0x55914210ac70)
+ # src='x86_64/pc'
+ # |-> dest='i440FX-pcihost' type=0 (node=0x5591421117f0)
+ # src=''
+ # |-> dest='x86_64/pc' type=0 (node=0x559142111600)
+ # |-> dest='arm/raspi2b' type=0 (node=0x559142110740)
+ ...
+ # }
+ # ALL QGRAPH NODES: {
+ # name='virtio-net-tests/announce-self' type=3 cmd_line='(null)' [available]
+ # name='arm/raspi2b' type=0 cmd_line='-M raspi2b ' [UNAVAILABLE]
+ ...
+ # }
+
+The ``virtio-net-tests/announce-self`` test is listed as "available" in the
+"ALL QGRAPH NODES" output. This means the test will execute. We can follow the
+qgraph path in the "ALL QGRAPH EDGES" output as follows: '' -> 'x86_64/pc' ->
+'i440FX-pcihost' -> 'pci-bus-pc' -> 'pci-bus' -> 'virtio-net-pci' ->
+'virtio-net'. The root of the qgraph is '' and the depth first search begins
+there.
+
+The ``arm/raspi2b`` machine node is listed as "UNAVAILABLE". Although it is
+reachable from the root via '' -> 'arm/raspi2b' the node is unavailable because
+the QEMU binary did not list it when queried by the framework. This is expected
+because we used the ``qemu-system-x86_64`` binary which does not support ARM
+machine types.
+
+If a test is unexpectedly listed as "UNAVAILABLE", first check that the "ALL
+QGRAPH EDGES" output reports edge connectivity from the root ('') to the test.
+If there is no connectivity then the qgraph nodes were not set up correctly and
+the driver or test code is incorrect. If there is connectivity, check the
+availability of each node in the path in the "ALL QGRAPH NODES" output. The
+first unavailable node in the path is the reason why the test is unavailable.
+Typically this is because the QEMU binary lacks support for the necessary
+machine type or device.
+
+Creating a new driver and its interface
+---------------------------------------
+
+Here we continue the ``sdhci`` use case, with the following scenario:
+
+- ``sdhci-test`` aims to test the ``read[q,w], writeq`` functions
+ offered by the ``sdhci`` drivers.
+- The current ``sdhci`` device is supported by both ``x86_64/pc`` and ``ARM``
+  (in this example we focus on the ``arm/raspi2b``) machines.
+- QEMU offers 2 types of drivers: ``QSDHCI_MemoryMapped`` for ``ARM`` and
+ ``QSDHCI_PCI`` for ``x86_64/pc``. Both implement the
+ ``read[q,w], writeq`` functions.
+
+In order to implement such scenario in qgraph, the test developer needs to:
+
+- Create the ``x86_64/pc`` machine node. This machine uses the
+ ``pci-bus`` architecture so it ``contains`` a PCI driver,
+ ``pci-bus-pc``. The actual path is
+
+ ``x86_64/pc --contains--> 1440FX-pcihost --contains-->
+ pci-bus-pc --produces--> pci-bus``.
+
+ For the sake of this example,
+ we do not focus on the PCI interface implementation.
+- Create the ``sdhci-pci`` driver node, representing ``QSDHCI_PCI``.
+ The driver uses the PCI bus (and its API),
+ so it must ``consume`` the ``pci-bus`` generic interface (which abstracts
+ all the pci drivers available)
+
+ ``sdhci-pci --consumes--> pci-bus``
+- Create an ``arm/raspi2b`` machine node. This machine ``contains``
+ a ``generic-sdhci`` memory mapped ``sdhci`` driver node, representing
+ ``QSDHCI_MemoryMapped``.
+
+ ``arm/raspi2b --contains--> generic-sdhci``
+- Create the ``sdhci`` interface node. This interface offers the
+ functions that are shared by all ``sdhci`` devices.
+ The interface is produced by ``sdhci-pci`` and ``generic-sdhci``,
+ the available architecture-specific drivers.
+
+ ``sdhci-pci --produces--> sdhci``
+
+ ``generic-sdhci --produces--> sdhci``
+- Create the ``sdhci-test`` test node. The test ``consumes`` the
+ ``sdhci`` interface, using its API. It doesn't need to look at
+ the supported machines or drivers.
+
+ ``sdhci-test --consumes--> sdhci``
+
+``arm/raspi2b`` machine, simplified from
+``tests/qtest/libqos/arm-raspi2-machine.c``::
+
+ #include "qgraph.h"
+
+ struct QRaspi2Machine {
+ QOSGraphObject obj;
+ QGuestAllocator alloc;
+ QSDHCI_MemoryMapped sdhci;
+ };
+
+ static void *raspi2_get_driver(void *object, const char *interface)
+ {
+ QRaspi2Machine *machine = object;
+ if (!g_strcmp0(interface, "memory")) {
+ return &machine->alloc;
+ }
+
+ fprintf(stderr, "%s not present in arm/raspi2b\n", interface);
+ g_assert_not_reached();
+ }
+
+ static QOSGraphObject *raspi2_get_device(void *obj,
+ const char *device)
+ {
+ QRaspi2Machine *machine = obj;
+ if (!g_strcmp0(device, "generic-sdhci")) {
+ return &machine->sdhci.obj;
+ }
+
+ fprintf(stderr, "%s not present in arm/raspi2b\n", device);
+ g_assert_not_reached();
+ }
+
+ static void *qos_create_machine_arm_raspi2(QTestState *qts)
+ {
+ QRaspi2Machine *machine = g_new0(QRaspi2Machine, 1);
+
+ alloc_init(&machine->alloc, ...);
+
+ /* Get node(s) contained inside (CONTAINS) */
+ machine->obj.get_device = raspi2_get_device;
+
+ /* Get node(s) produced (PRODUCES) */
+ machine->obj.get_driver = raspi2_get_driver;
+
+ /* free the object */
+ machine->obj.destructor = raspi2_destructor;
+ qos_init_sdhci_mm(&machine->sdhci, ...);
+ return &machine->obj;
+ }
+
+ static void raspi2_register_nodes(void)
+ {
+ /* arm/raspi2b --contains--> generic-sdhci */
+ qos_node_create_machine("arm/raspi2b",
+ qos_create_machine_arm_raspi2);
+ qos_node_contains("arm/raspi2b", "generic-sdhci", NULL);
+ }
+
+ libqos_init(raspi2_register_nodes);
+
+``x86_64/pc`` machine, simplified from
+``tests/qtest/libqos/x86_64_pc-machine.c``::
+
+ #include "qgraph.h"
+
+ struct i440FX_pcihost {
+ QOSGraphObject obj;
+ QPCIBusPC pci;
+ };
+
+ struct QX86PCMachine {
+ QOSGraphObject obj;
+ QGuestAllocator alloc;
+ i440FX_pcihost bridge;
+ };
+
+ /* i440FX_pcihost */
+
+ static QOSGraphObject *i440FX_host_get_device(void *obj,
+ const char *device)
+ {
+ i440FX_pcihost *host = obj;
+ if (!g_strcmp0(device, "pci-bus-pc")) {
+ return &host->pci.obj;
+ }
+ fprintf(stderr, "%s not present in i440FX-pcihost\n", device);
+ g_assert_not_reached();
+ }
+
+ /* x86_64/pc machine */
+
+ static void *pc_get_driver(void *object, const char *interface)
+ {
+ QX86PCMachine *machine = object;
+ if (!g_strcmp0(interface, "memory")) {
+ return &machine->alloc;
+ }
+
+ fprintf(stderr, "%s not present in x86_64/pc\n", interface);
+ g_assert_not_reached();
+ }
+
+ static QOSGraphObject *pc_get_device(void *obj, const char *device)
+ {
+ QX86PCMachine *machine = obj;
+ if (!g_strcmp0(device, "i440FX-pcihost")) {
+ return &machine->bridge.obj;
+ }
+
+ fprintf(stderr, "%s not present in x86_64/pc\n", device);
+ g_assert_not_reached();
+ }
+
+ static void *qos_create_machine_pc(QTestState *qts)
+ {
+ QX86PCMachine *machine = g_new0(QX86PCMachine, 1);
+
+ /* Get node(s) contained inside (CONTAINS) */
+ machine->obj.get_device = pc_get_device;
+
+ /* Get node(s) produced (PRODUCES) */
+ machine->obj.get_driver = pc_get_driver;
+
+ /* free the object */
+ machine->obj.destructor = pc_destructor;
+ pc_alloc_init(&machine->alloc, qts, ALLOC_NO_FLAGS);
+
+ /* Get node(s) contained inside (CONTAINS) */
+ machine->bridge.obj.get_device = i440FX_host_get_device;
+
+ return &machine->obj;
+ }
+
+ static void pc_machine_register_nodes(void)
+ {
+        /* x86_64/pc --contains--> i440FX-pcihost --contains-->
+         * pci-bus-pc [--produces--> pci-bus (in pci.h)] */
+ qos_node_create_machine("x86_64/pc", qos_create_machine_pc);
+ qos_node_contains("x86_64/pc", "i440FX-pcihost", NULL);
+
+ /* contained drivers don't need a constructor,
+ * they will be init by the parent */
+ qos_node_create_driver("i440FX-pcihost", NULL);
+ qos_node_contains("i440FX-pcihost", "pci-bus-pc", NULL);
+ }
+
+ libqos_init(pc_machine_register_nodes);
+
+``sdhci`` taken from ``tests/qtest/libqos/sdhci.c``::
+
+ /* Interface node, offers the sdhci API */
+ struct QSDHCI {
+ uint16_t (*readw)(QSDHCI *s, uint32_t reg);
+ uint64_t (*readq)(QSDHCI *s, uint32_t reg);
+ void (*writeq)(QSDHCI *s, uint32_t reg, uint64_t val);
+ /* other fields */
+ };
+
+ /* Memory Mapped implementation of QSDHCI */
+ struct QSDHCI_MemoryMapped {
+ QOSGraphObject obj;
+ QSDHCI sdhci;
+ /* other driver-specific fields */
+ };
+
+ /* PCI implementation of QSDHCI */
+ struct QSDHCI_PCI {
+ QOSGraphObject obj;
+ QSDHCI sdhci;
+ /* other driver-specific fields */
+ };
+
+ /* Memory mapped implementation of QSDHCI */
+
+ static void *sdhci_mm_get_driver(void *obj, const char *interface)
+ {
+ QSDHCI_MemoryMapped *smm = obj;
+ if (!g_strcmp0(interface, "sdhci")) {
+ return &smm->sdhci;
+ }
+ fprintf(stderr, "%s not present in generic-sdhci\n", interface);
+ g_assert_not_reached();
+ }
+
+ void qos_init_sdhci_mm(QSDHCI_MemoryMapped *sdhci, QTestState *qts,
+ uint32_t addr, QSDHCIProperties *common)
+ {
+ /* Get node contained inside (CONTAINS) */
+ sdhci->obj.get_driver = sdhci_mm_get_driver;
+
+ /* SDHCI interface API */
+ sdhci->sdhci.readw = sdhci_mm_readw;
+ sdhci->sdhci.readq = sdhci_mm_readq;
+ sdhci->sdhci.writeq = sdhci_mm_writeq;
+ sdhci->qts = qts;
+ }
+
+ /* PCI implementation of QSDHCI */
+
+ static void *sdhci_pci_get_driver(void *object,
+ const char *interface)
+ {
+ QSDHCI_PCI *spci = object;
+ if (!g_strcmp0(interface, "sdhci")) {
+ return &spci->sdhci;
+ }
+
+ fprintf(stderr, "%s not present in sdhci-pci\n", interface);
+ g_assert_not_reached();
+ }
+
+ static void *sdhci_pci_create(void *pci_bus,
+ QGuestAllocator *alloc,
+ void *addr)
+ {
+ QSDHCI_PCI *spci = g_new0(QSDHCI_PCI, 1);
+ QPCIBus *bus = pci_bus;
+ uint64_t barsize;
+
+ qpci_device_init(&spci->dev, bus, addr);
+
+ /* SDHCI interface API */
+ spci->sdhci.readw = sdhci_pci_readw;
+ spci->sdhci.readq = sdhci_pci_readq;
+ spci->sdhci.writeq = sdhci_pci_writeq;
+
+ /* Get node(s) produced (PRODUCES) */
+ spci->obj.get_driver = sdhci_pci_get_driver;
+
+ spci->obj.start_hw = sdhci_pci_start_hw;
+ spci->obj.destructor = sdhci_destructor;
+ return &spci->obj;
+ }
+
+ static void qsdhci_register_nodes(void)
+ {
+ QOSGraphEdgeOptions opts = {
+ .extra_device_opts = "addr=04.0",
+ };
+
+ /* generic-sdhci */
+ /* generic-sdhci --produces--> sdhci */
+ qos_node_create_driver("generic-sdhci", NULL);
+ qos_node_produces("generic-sdhci", "sdhci");
+
+ /* sdhci-pci */
+ /* sdhci-pci --produces--> sdhci
+ * sdhci-pci --consumes--> pci-bus */
+ qos_node_create_driver("sdhci-pci", sdhci_pci_create);
+ qos_node_produces("sdhci-pci", "sdhci");
+ qos_node_consumes("sdhci-pci", "pci-bus", &opts);
+ }
+
+ libqos_init(qsdhci_register_nodes);
+
+In the above example, all possible types of relations are created::
+
+ x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc
+ |
+ sdhci-pci --consumes--> pci-bus <--produces--+
+ |
+ +--produces--+
+ |
+ v
+ sdhci
+ ^
+ |
+ +--produces-- +
+ |
+ arm/raspi2b --contains--> generic-sdhci
+
+or inverting the consumes edge in consumed_by::
+
+ x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc
+ |
+ sdhci-pci <--consumed by-- pci-bus <--produces--+
+ |
+ +--produces--+
+ |
+ v
+ sdhci
+ ^
+ |
+ +--produces-- +
+ |
+ arm/raspi2b --contains--> generic-sdhci
+
+Adding a new test
+-----------------
+
+Given the above setup, adding a new test is very simple.
+``sdhci-test``, taken from ``tests/qtest/sdhci-test.c``::
+
+ static void check_capab_sdma(QSDHCI *s, bool supported)
+ {
+ uint64_t capab, capab_sdma;
+
+ capab = s->readq(s, SDHC_CAPAB);
+ capab_sdma = FIELD_EX64(capab, SDHC_CAPAB, SDMA);
+ g_assert_cmpuint(capab_sdma, ==, supported);
+ }
+
+ static void test_registers(void *obj, void *data,
+ QGuestAllocator *alloc)
+ {
+ QSDHCI *s = obj;
+
+ /* example test */
+ check_capab_sdma(s, s->props.capab.sdma);
+ }
+
+ static void register_sdhci_test(void)
+ {
+ /* sdhci-test --consumes--> sdhci */
+ qos_add_test("registers", "sdhci", test_registers, NULL);
+ }
+
+ libqos_init(register_sdhci_test);
+
+Here a new test is created, consuming the ``sdhci`` interface node
+and creating a valid path from both machines to a test.
+The final graph will look like this::
+
+ x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc
+ |
+ sdhci-pci --consumes--> pci-bus <--produces--+
+ |
+ +--produces--+
+ |
+ v
+ sdhci <--consumes-- sdhci-test
+ ^
+ |
+ +--produces-- +
+ |
+ arm/raspi2b --contains--> generic-sdhci
+
+or inverting the consumes edge in consumed_by::
+
+ x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc
+ |
+ sdhci-pci <--consumed by-- pci-bus <--produces--+
+ |
+ +--produces--+
+ |
+ v
+ sdhci --consumed by--> sdhci-test
+ ^
+ |
+ +--produces-- +
+ |
+ arm/raspi2b --contains--> generic-sdhci
+
+Assuming that the binary is
+``QTEST_QEMU_BINARY=./qemu-system-x86_64``,
+a valid test path will be:
+``/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/sdhci-pci/sdhci/sdhci-test``
+
+and for the binary ``QTEST_QEMU_BINARY=./qemu-system-arm``:
+
+``/arm/raspi2b/generic-sdhci/sdhci/sdhci-test``
+
+Additional examples can be found in ``test-qgraph.c``.
+
+Qgraph API reference
+--------------------
+
+.. kernel-doc:: tests/qtest/libqos/qgraph.h
diff --git a/docs/devel/qom.rst b/docs/devel/qom.rst
new file mode 100644
index 000000000..e5fe3597c
--- /dev/null
+++ b/docs/devel/qom.rst
@@ -0,0 +1,389 @@
+===========================
+The QEMU Object Model (QOM)
+===========================
+
+.. highlight:: c
+
+The QEMU Object Model provides a framework for registering user creatable
+types and instantiating objects from those types. QOM provides the following
+features:
+
+- System for dynamically registering types
+- Support for single-inheritance of types
+- Multiple inheritance of stateless interfaces
+
+.. code-block:: c
+ :caption: Creating a minimal type
+
+ #include "qdev.h"
+
+ #define TYPE_MY_DEVICE "my-device"
+
+ // No new virtual functions: we can reuse the typedef for the
+ // superclass.
+ typedef DeviceClass MyDeviceClass;
+ typedef struct MyDevice
+ {
+ DeviceState parent;
+
+ int reg0, reg1, reg2;
+ } MyDevice;
+
+ static const TypeInfo my_device_info = {
+ .name = TYPE_MY_DEVICE,
+ .parent = TYPE_DEVICE,
+ .instance_size = sizeof(MyDevice),
+ };
+
+ static void my_device_register_types(void)
+ {
+ type_register_static(&my_device_info);
+ }
+
+ type_init(my_device_register_types)
+
+In the above example, we create a simple type that is described by #TypeInfo.
+#TypeInfo describes information about the type including what it inherits
+from, the instance and class size, and constructor/destructor hooks.
+
+Alternatively several static types can be registered using the helper macro
+DEFINE_TYPES():
+
+.. code-block:: c
+
+ static const TypeInfo device_types_info[] = {
+ {
+ .name = TYPE_MY_DEVICE_A,
+ .parent = TYPE_DEVICE,
+ .instance_size = sizeof(MyDeviceA),
+ },
+ {
+ .name = TYPE_MY_DEVICE_B,
+ .parent = TYPE_DEVICE,
+ .instance_size = sizeof(MyDeviceB),
+ },
+ };
+
+ DEFINE_TYPES(device_types_info)
+
+Every type has an #ObjectClass associated with it. #ObjectClass derivatives
+are instantiated dynamically but there is only ever one instance for any
+given type. The #ObjectClass typically holds a table of function pointers
+for the virtual methods implemented by this type.
+
+Using object_new(), a new #Object derivative will be instantiated. You can
+cast an #Object to a subclass (or base-class) type using
+object_dynamic_cast(). You typically want to define macro wrappers around
+OBJECT_CHECK() and OBJECT_CLASS_CHECK() to make it easier to convert to a
+specific type:
+
+.. code-block:: c
+ :caption: Typecasting macros
+
+ #define MY_DEVICE_GET_CLASS(obj) \
+ OBJECT_GET_CLASS(MyDeviceClass, obj, TYPE_MY_DEVICE)
+ #define MY_DEVICE_CLASS(klass) \
+ OBJECT_CLASS_CHECK(MyDeviceClass, klass, TYPE_MY_DEVICE)
+ #define MY_DEVICE(obj) \
+ OBJECT_CHECK(MyDevice, obj, TYPE_MY_DEVICE)
+
+In case the ObjectClass implementation can be built as a module, a
+module_obj() line must be added to make sure QEMU loads the module
+when the object is needed.
+
+.. code-block:: c
+
+ module_obj(TYPE_MY_DEVICE);
+
+Class Initialization
+====================
+
+Before an object is initialized, the class for the object must be
+initialized. There is only one class object for all instance objects
+of a given type, and it is created lazily.
+
+Classes are initialized by first initializing any parent classes (if
+necessary). After the parent class object has been initialized, it is
+copied into the current class object and any additional storage in the
+class object is zero filled.
+
+The effect of this is that classes automatically inherit any virtual
+function pointers that the parent class has already initialized. All
+other fields will be zero filled.
+
+Once all of the parent classes have been initialized, #TypeInfo::class_init
+is called to let the class being instantiated provide default initializers for
+its virtual functions. Here is how the above example might be modified
+to introduce an overridden virtual function:
+
+.. code-block:: c
+ :caption: Overriding a virtual function
+
+ #include "qdev.h"
+
+ void my_device_class_init(ObjectClass *klass, void *class_data)
+ {
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ dc->reset = my_device_reset;
+ }
+
+ static const TypeInfo my_device_info = {
+ .name = TYPE_MY_DEVICE,
+ .parent = TYPE_DEVICE,
+ .instance_size = sizeof(MyDevice),
+ .class_init = my_device_class_init,
+ };
+
+Introducing new virtual methods requires a class to define its own
+struct and to add a .class_size member to the #TypeInfo. Each method
+will also have a wrapper function to call it easily:
+
+.. code-block:: c
+ :caption: Defining an abstract class
+
+ #include "qdev.h"
+
+ typedef struct MyDeviceClass
+ {
+ DeviceClass parent;
+
+ void (*frobnicate) (MyDevice *obj);
+ } MyDeviceClass;
+
+ static const TypeInfo my_device_info = {
+ .name = TYPE_MY_DEVICE,
+ .parent = TYPE_DEVICE,
+ .instance_size = sizeof(MyDevice),
+ .abstract = true, // or set a default in my_device_class_init
+ .class_size = sizeof(MyDeviceClass),
+ };
+
+ void my_device_frobnicate(MyDevice *obj)
+ {
+ MyDeviceClass *klass = MY_DEVICE_GET_CLASS(obj);
+
+ klass->frobnicate(obj);
+ }
+
+Interfaces
+==========
+
+Interfaces allow a limited form of multiple inheritance. Instances are
+similar to normal types except for the fact that they are only defined by
+their classes and never carry any state. As a consequence, a pointer to
+an interface instance should always be of incomplete type in order to be
+sure it cannot be dereferenced. That is, you should define the
+'typedef struct SomethingIf SomethingIf' so that you can pass around
+``SomethingIf *si`` arguments, but not define a ``struct SomethingIf { ... }``.
+The only things you can validly do with a ``SomethingIf *`` are to pass it as
+an argument to a method on its corresponding SomethingIfClass, or to
+dynamically cast it to an object that implements the interface.
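+
+As a sketch (not taken from the QEMU source tree), an interface is typically
+declared like this; implementing types then list it in the interfaces array of
+their #TypeInfo, as shown later for OBJECT_DEFINE_TYPE_WITH_INTERFACES():
+
+.. code-block:: c
+   :caption: Declaring an interface (sketch)
+
+   #define TYPE_SOMETHING_IF "something-if"
+
+   /* never defined: interface instances carry no state */
+   typedef struct SomethingIf SomethingIf;
+
+   typedef struct SomethingIfClass {
+       InterfaceClass parent_class;
+
+       void (*do_something)(SomethingIf *obj);
+   } SomethingIfClass;
+
+   #define SOMETHING_IF_GET_CLASS(obj) \
+       OBJECT_GET_CLASS(SomethingIfClass, obj, TYPE_SOMETHING_IF)
+   #define SOMETHING_IF(obj) \
+       INTERFACE_CHECK(SomethingIf, obj, TYPE_SOMETHING_IF)
+
+   static const TypeInfo something_if_info = {
+       .name = TYPE_SOMETHING_IF,
+       .parent = TYPE_INTERFACE,
+       .class_size = sizeof(SomethingIfClass),
+   };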
+
+Methods
+=======
+
+A *method* is a function within the namespace scope of
+a class. It usually operates on the object instance by passing it as a
+strongly-typed first argument.
+If it does not operate on an object instance, it is dubbed a
+*class method*.
+
+Methods cannot be overloaded. That is, the #ObjectClass and method name
+uniquely identify the function to be called; the signature does not vary
+except for trailing varargs.
+
+Methods are always *virtual*. Overriding a method in
+#TypeInfo.class_init of a subclass leads to any user of the class obtained
+via OBJECT_GET_CLASS() accessing the overridden function.
+The original function is not automatically invoked. It is the responsibility
+of the overriding class to determine whether and when to invoke the method
+being overridden.
+
+To invoke the method being overridden, the preferred solution is to store
+the original value in the overriding class before overriding the method.
+This corresponds to ``{super,base}.method(...)`` in Java and C#
+respectively; this frees the overriding class from hardcoding its parent
+class, which someone might choose to change at some point.
+
+.. code-block:: c
+ :caption: Overriding a virtual method
+
+ typedef struct MyState MyState;
+
+ typedef void (*MyDoSomething)(MyState *obj);
+
+ typedef struct MyClass {
+ ObjectClass parent_class;
+
+ MyDoSomething do_something;
+ } MyClass;
+
+ static void my_do_something(MyState *obj)
+ {
+ // do something
+ }
+
+ static void my_class_init(ObjectClass *oc, void *data)
+ {
+ MyClass *mc = MY_CLASS(oc);
+
+ mc->do_something = my_do_something;
+ }
+
+ static const TypeInfo my_type_info = {
+ .name = TYPE_MY,
+ .parent = TYPE_OBJECT,
+ .instance_size = sizeof(MyState),
+ .class_size = sizeof(MyClass),
+ .class_init = my_class_init,
+ };
+
+ typedef struct DerivedClass {
+ MyClass parent_class;
+
+ MyDoSomething parent_do_something;
+ } DerivedClass;
+
+ static void derived_do_something(MyState *obj)
+ {
+ DerivedClass *dc = DERIVED_GET_CLASS(obj);
+
+ // do something here
+ dc->parent_do_something(obj);
+ // do something else here
+ }
+
+ static void derived_class_init(ObjectClass *oc, void *data)
+ {
+ MyClass *mc = MY_CLASS(oc);
+ DerivedClass *dc = DERIVED_CLASS(oc);
+
+ dc->parent_do_something = mc->do_something;
+ mc->do_something = derived_do_something;
+ }
+
+ static const TypeInfo derived_type_info = {
+ .name = TYPE_DERIVED,
+ .parent = TYPE_MY,
+ .class_size = sizeof(DerivedClass),
+ .class_init = derived_class_init,
+ };
+
+Alternatively, object_class_by_name() can be used to obtain the class and
+its non-overridden methods for a specific type. This would correspond to
+``MyClass::method(...)`` in C++.
+
+The first example of such a QOM method was #CPUClass.reset,
+another example is #DeviceClass.realize.
+
+Standard type declaration and definition macros
+===============================================
+
+A lot of the code outlined above follows a standard pattern and naming
+convention. To reduce the amount of boilerplate code that needs to be
+written for a new type there are two sets of macros to generate the
+common parts in a standard format.
+
+A type is declared using the OBJECT_DECLARE macro family. In types
+which do not require any virtual functions in the class, the
+OBJECT_DECLARE_SIMPLE_TYPE macro is suitable, and is commonly placed
+in the header file:
+
+.. code-block:: c
+ :caption: Declaring a simple type
+
+ OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, my_device,
+ MY_DEVICE, DEVICE)
+
+This is equivalent to the following:
+
+.. code-block:: c
+ :caption: Expansion from declaring a simple type
+
+ typedef struct MyDevice MyDevice;
+ typedef struct MyDeviceClass MyDeviceClass;
+
+   G_DEFINE_AUTOPTR_CLEANUP_FUNC(MyDevice, object_unref)
+
+   #define MY_DEVICE_GET_CLASS(obj) \
+       OBJECT_GET_CLASS(MyDeviceClass, obj, TYPE_MY_DEVICE)
+   #define MY_DEVICE_CLASS(klass) \
+       OBJECT_CLASS_CHECK(MyDeviceClass, klass, TYPE_MY_DEVICE)
+   #define MY_DEVICE(obj) \
+       OBJECT_CHECK(MyDevice, obj, TYPE_MY_DEVICE)
+
+ struct MyDeviceClass {
+ DeviceClass parent_class;
+ };
+
+The 'struct MyDevice' needs to be declared separately.
+If the type requires virtual functions to be declared in the class
+struct, then the alternative OBJECT_DECLARE_TYPE() macro can be
+used. This does the same as OBJECT_DECLARE_SIMPLE_TYPE(), but without
+the 'struct MyDeviceClass' definition.
+
+To implement the type, the OBJECT_DEFINE macro family is available.
+In the simple case the OBJECT_DEFINE_TYPE macro is suitable:
+
+.. code-block:: c
+ :caption: Defining a simple type
+
+ OBJECT_DEFINE_TYPE(MyDevice, my_device, MY_DEVICE, DEVICE)
+
+This is equivalent to the following:
+
+.. code-block:: c
+ :caption: Expansion from defining a simple type
+
+ static void my_device_finalize(Object *obj);
+ static void my_device_class_init(ObjectClass *oc, void *data);
+ static void my_device_init(Object *obj);
+
+ static const TypeInfo my_device_info = {
+ .parent = TYPE_DEVICE,
+ .name = TYPE_MY_DEVICE,
+ .instance_size = sizeof(MyDevice),
+ .instance_init = my_device_init,
+ .instance_finalize = my_device_finalize,
+ .class_size = sizeof(MyDeviceClass),
+ .class_init = my_device_class_init,
+ };
+
+ static void
+ my_device_register_types(void)
+ {
+ type_register_static(&my_device_info);
+ }
+ type_init(my_device_register_types);
+
+This is sufficient to get the type registered with the type
+system, and the three standard methods now need to be implemented
+along with any other logic required for the type.
+
+If the type needs to implement one or more interfaces, then the
+OBJECT_DEFINE_TYPE_WITH_INTERFACES() macro can be used instead.
+This accepts an array of interface type names.
+
+.. code-block:: c
+ :caption: Defining a simple type implementing interfaces
+
+ OBJECT_DEFINE_TYPE_WITH_INTERFACES(MyDevice, my_device,
+ MY_DEVICE, DEVICE,
+ { TYPE_USER_CREATABLE },
+ { NULL })
+
+If the type is not intended to be instantiated, then
+the OBJECT_DEFINE_ABSTRACT_TYPE() macro can be used instead:
+
+.. code-block:: c
+ :caption: Defining a simple abstract type
+
+ OBJECT_DEFINE_ABSTRACT_TYPE(MyDevice, my_device,
+ MY_DEVICE, DEVICE)
+
+
+
+API Reference
+-------------
+
+.. kernel-doc:: include/qom/object.h
diff --git a/docs/devel/qtest.rst b/docs/devel/qtest.rst
new file mode 100644
index 000000000..c3dceb6c8
--- /dev/null
+++ b/docs/devel/qtest.rst
@@ -0,0 +1,92 @@
+========================================
+QTest Device Emulation Testing Framework
+========================================
+
+.. toctree::
+ :hidden:
+
+ qgraph
+
+QTest is a device emulation testing framework. It can be very useful to test
+device models; it can also control certain aspects of QEMU (such as virtual
+clock stepping), via a special-purpose "qtest" protocol. Refer to
+:ref:`qtest-protocol` for more details of the protocol.
+
+QTest cases can be executed with
+
+.. code::
+
+ make check-qtest
+
+The QTest library is implemented by ``tests/qtest/libqtest.c`` and the API is
+defined in ``tests/qtest/libqtest.h``.
+
+Consider adding a new QTest case when you are introducing a new piece of
+virtual hardware, or extending an existing one when you are adding
+functionality to an existing virtual device.
+
+On top of libqtest, a higher level library, ``libqos``, was created to
+encapsulate common tasks of device drivers, such as memory management and
+communicating with system buses or devices. Many virtual device tests use
+libqos instead of directly calling into libqtest.
+Libqos also offers the Qgraph API to increase the coverage of each test and
+to automate QEMU command line arguments and device setup.
+Refer to :ref:`qgraph` for an explanation of Qgraph and its API.
+
+Steps to add a new QTest case are:
+
+1. Create a new source file for the test. (More than one file can be added as
+ necessary.) For example, ``tests/qtest/foo-test.c``.
+
+2. Write the test code with the glib and libqtest/libqos API. See also existing
+   tests and the library headers for reference; a minimal sketch is also shown
+   after this list.
+
+3. Register the new test in ``tests/qtest/meson.build``. Add the test
+ executable name to an appropriate ``qtests_*`` variable. There is
+ one variable per architecture, plus ``qtests_generic`` for tests
+ that can be run for all architectures. For example::
+
+ qtests_generic = [
+ ...
+ 'foo-test',
+ ...
+ ]
+
+4. If the test has more than one source file or needs to be linked with any
+ dependency other than ``qemuutil`` and ``qos``, list them in the ``qtests``
+ dictionary. For example a test that needs to use the ``QIO`` library
+ will have an entry like::
+
+ {
+ ...
+ 'foo-test': [io],
+ ...
+ }
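+
+As a minimal sketch of step 2, a hypothetical ``foo-test.c`` might look like
+the following. The ``foo`` device, its register address and the machine
+options are made up for the example; only the libqtest calls are real API::
+
+    #include "qemu/osdep.h"
+    #include "libqtest.h"
+
+    /* assumed: a "foo" device exposing a 32-bit scratch register here */
+    #define FOO_SCRATCH_ADDR 0x10000000
+
+    static void test_foo_scratch(void)
+    {
+        QTestState *qts = qtest_init("-machine none -device foo");
+
+        qtest_writel(qts, FOO_SCRATCH_ADDR, 0xcafe0123);
+        g_assert_cmphex(qtest_readl(qts, FOO_SCRATCH_ADDR), ==, 0xcafe0123);
+
+        qtest_quit(qts);
+    }
+
+    int main(int argc, char **argv)
+    {
+        g_test_init(&argc, &argv, NULL);
+        qtest_add_func("/foo/scratch", test_foo_scratch);
+        return g_test_run();
+    }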
+
+Debugging a QTest failure is slightly harder than debugging a unit test because
+the tests look up QEMU program names in environment variables, such as
+``QTEST_QEMU_BINARY`` and ``QTEST_QEMU_IMG``, and also because it is not easy
+to attach gdb to the QEMU process spawned from the test. But manually invoking
+the test and using gdb on it is still simple to do: find out the actual command
+from the output of
+
+.. code::
+
+ make check-qtest V=1
+
+which you can run manually.
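+
+For example, an individual test can then be invoked (and debugged) directly;
+the paths and the test name below are only illustrative::
+
+    $ QTEST_QEMU_BINARY=build/qemu-system-x86_64 \
+        gdb --args build/tests/qtest/foo-test -p /x86_64/foo/scratch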
+
+
+.. _qtest-protocol:
+
+QTest Protocol
+--------------
+
+.. kernel-doc:: softmmu/qtest.c
+ :doc: QTest Protocol
+
+
+libqtest API reference
+----------------------
+
+.. kernel-doc:: tests/qtest/libqos/libqtest.h
diff --git a/docs/devel/rcu.txt b/docs/devel/rcu.txt
new file mode 100644
index 000000000..2e6cc607a
--- /dev/null
+++ b/docs/devel/rcu.txt
@@ -0,0 +1,406 @@
+Using RCU (Read-Copy-Update) for synchronization
+================================================
+
+Read-copy update (RCU) is a synchronization mechanism that is used to
+protect read-mostly data structures. RCU is very efficient and scalable
+on the read side (it is wait-free), and thus can make the read paths
+extremely fast.
+
+RCU supports concurrency between a single writer and multiple readers,
+thus it is not used alone. Typically, the write-side will use a lock to
+serialize multiple updates, but other approaches are possible (e.g.,
+restricting updates to a single task). In QEMU, when a lock is used,
+this will often be the "iothread mutex", also known as the "big QEMU
+lock" (BQL). Also, restricting updates to a single task is done in
+QEMU using the "bottom half" API.
+
+RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
+sections of code with "critical sections", and the update side will wait
+for the execution of all *currently running* critical sections before
+proceeding, or before asynchronously executing a callback.
+
+The key point here is that only the currently running critical sections
+are waited for; critical sections that are started _after_ the beginning
+of the wait do not extend the wait, despite running concurrently with
+the updater. This is the reason why RCU is more scalable than,
+for example, reader-writer locks. It is so much more scalable that
+the system will have a single instance of the RCU mechanism; a single
+mechanism can be used for an arbitrary number of "things", without
+having to worry about things such as contention or deadlocks.
+
+How is this possible? The basic idea is to split updates in two phases,
+"removal" and "reclamation". During removal, we ensure that subsequent
+readers will not be able to get a reference to the old data. After
+removal has completed, a critical section will not be able to access
+the old data. Therefore, critical sections that begin after removal
+do not matter; as soon as all previous critical sections have finished,
+there cannot be any readers who hold references to the data structure,
+and these can now be safely reclaimed (e.g., freed or unref'ed).
+
+Here is a picture:
+
+ thread 1 thread 2 thread 3
+ ------------------- ------------------------ -------------------
+ enter RCU crit.sec.
+ | finish removal phase
+ | begin wait
+ | | enter RCU crit.sec.
+ exit RCU crit.sec | |
+ complete wait |
+ begin reclamation phase |
+ exit RCU crit.sec.
+
+
+Note how thread 3 is still executing its critical section when thread 2
+starts reclaiming data. This is possible, because the old version of the
+data structure was not accessible at the time thread 3 began executing
+that critical section.
+
+
+RCU API
+=======
+
+The core RCU API is small:
+
+ void rcu_read_lock(void);
+
+ Used by a reader to inform the reclaimer that the reader is
+ entering an RCU read-side critical section.
+
+ void rcu_read_unlock(void);
+
+ Used by a reader to inform the reclaimer that the reader is
+ exiting an RCU read-side critical section. Note that RCU
+ read-side critical sections may be nested and/or overlapping.
+
+ void synchronize_rcu(void);
+
+ Blocks until all pre-existing RCU read-side critical sections
+ on all threads have completed. This marks the end of the removal
+ phase and the beginning of reclamation phase.
+
+ Note that it would be valid for another update to come while
+ synchronize_rcu is running. Because of this, it is better that
+ the updater releases any locks it may hold before calling
+ synchronize_rcu. If this is not possible (for example, because
+ the updater is protected by the BQL), you can use call_rcu.
+
+ void call_rcu1(struct rcu_head * head,
+ void (*func)(struct rcu_head *head));
+
+ This function invokes func(head) after all pre-existing RCU
+ read-side critical sections on all threads have completed. This
+ marks the end of the removal phase, with func taking care
+ asynchronously of the reclamation phase.
+
+ The foo struct needs to have an rcu_head structure added,
+ perhaps as follows:
+
+ struct foo {
+ struct rcu_head rcu;
+ int a;
+ char b;
+ long c;
+ };
+
+ so that the reclaimer function can fetch the struct foo address
+ and free it:
+
+ call_rcu1(&foo.rcu, foo_reclaim);
+
+ void foo_reclaim(struct rcu_head *rp)
+ {
+ struct foo *fp = container_of(rp, struct foo, rcu);
+ g_free(fp);
+ }
+
+ For the common case where the rcu_head member is the first of the
+        struct, you can use the following macros.
+
+ void call_rcu(T *p,
+ void (*func)(T *p),
+ field-name);
+ void g_free_rcu(T *p,
+ field-name);
+
+        call_rcu1 is typically used through these macros, in the common case
+ where the "struct rcu_head" is the first field in the struct. If
+ the callback function is g_free, in particular, g_free_rcu can be
+ used. In the above case, one could have written simply:
+
+ g_free_rcu(&foo, rcu);
+
+ typeof(*p) qatomic_rcu_read(p);
+
+ qatomic_rcu_read() is similar to qatomic_load_acquire(), but it makes
+ some assumptions on the code that calls it. This allows a more
+ optimized implementation.
+
+ qatomic_rcu_read assumes that whenever a single RCU critical
+ section reads multiple shared data, these reads are either
+ data-dependent or need no ordering. This is almost always the
+ case when using RCU, because read-side critical sections typically
+ navigate one or more pointers (the pointers that are changed on
+ every update) until reaching a data structure of interest,
+ and then read from there.
+
+ RCU read-side critical sections must use qatomic_rcu_read() to
+ read data, unless concurrent writes are prevented by another
+ synchronization mechanism.
+
+ Furthermore, RCU read-side critical sections should traverse the
+ data structure in a single direction, opposite to the direction
+ in which the updater initializes it.
+
+ void qatomic_rcu_set(p, typeof(*p) v);
+
+ qatomic_rcu_set() is similar to qatomic_store_release(), though it also
+ makes assumptions on the code that calls it in order to allow a more
+ optimized implementation.
+
+ In particular, qatomic_rcu_set() suffices for synchronization
+ with readers, if the updater never mutates a field within a
+ data item that is already accessible to readers. This is the
+ case when initializing a new copy of the RCU-protected data
+ structure; just ensure that initialization of *p is carried out
+ before qatomic_rcu_set() makes the data item visible to readers.
+ If this rule is observed, writes will happen in the opposite
+ order as reads in the RCU read-side critical sections (or if
+        there is just one update), and there will be no need for any other
+        synchronization mechanism to coordinate the accesses.
+
+The following APIs must be used before RCU is used in a thread:
+
+ void rcu_register_thread(void);
+
+ Mark a thread as taking part in the RCU mechanism. Such a thread
+ will have to report quiescent points regularly, either manually
+ or through the QemuCond/QemuSemaphore/QemuEvent APIs.
+
+ void rcu_unregister_thread(void);
+
+ Mark a thread as not taking part anymore in the RCU mechanism.
+ It is not a problem if such a thread reports quiescent points,
+ either manually or by using the QemuCond/QemuSemaphore/QemuEvent
+ APIs.
+
+Note that these APIs are relatively heavyweight, and should _not_ be
+nested.
+
+Convenience macros
+==================
+
+Two macros are provided that automatically release the read lock at the
+end of the scope.
+
+ RCU_READ_LOCK_GUARD()
+
+ Takes the lock and will release it at the end of the block it's
+ used in.
+
+ WITH_RCU_READ_LOCK_GUARD() { code }
+
+ Is used at the head of a block to protect the code within the block.
+
+Note that 'goto'ing out of the guarded block will also drop the lock.
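+
+For example, a reader could be written as follows (a sketch that reuses the
+"struct foo" example from above and assumes a global "struct foo *foo"
+pointer):
+
+    static int get_foo_a(void)
+    {
+        RCU_READ_LOCK_GUARD();
+        return qatomic_rcu_read(&foo)->a;
+    }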
+
+DIFFERENCES WITH LINUX
+======================
+
+- Waiting on a mutex is possible, though discouraged, within an RCU critical
+ section. This is because spinlocks are rarely (if ever) used in userspace
+ programming; not allowing this would prevent upgrading an RCU read-side
+ critical section to become an updater.
+
+- qatomic_rcu_read and qatomic_rcu_set replace rcu_dereference and
+ rcu_assign_pointer. They take a _pointer_ to the variable being accessed.
+
+- call_rcu is a macro that has an extra argument (the name of the first
+ field in the struct, which must be a struct rcu_head), and expects the
+ type of the callback's argument to be the type of the first argument.
+ call_rcu1 is the same as Linux's call_rcu.
+
+
+RCU PATTERNS
+============
+
+Many patterns using reader-writer locks translate directly to RCU, with
+the advantages of higher scalability and deadlock immunity.
+
+In general, RCU can be used whenever it is possible to create a new
+"version" of a data structure every time the updater runs. This may
+sound like a very strict restriction, however:
+
+- the updater does not mean "everything that writes to a data structure",
+ but rather "everything that involves a reclamation step". See the
+  array example below.
+
+- in some cases, creating a new version of a data structure may actually
+ be very cheap. For example, modifying the "next" pointer of a singly
+ linked list is effectively creating a new version of the list.
+
+Here are some frequently-used RCU idioms that are worth noting.
+
+
+RCU list processing
+-------------------
+
+TBD (not yet used in QEMU)
+
+
+RCU reference counting
+----------------------
+
+Because grace periods are not allowed to complete while there is an RCU
+read-side critical section in progress, the RCU read-side primitives
+may be used as a restricted reference-counting mechanism. For example,
+consider the following code fragment:
+
+ rcu_read_lock();
+ p = qatomic_rcu_read(&foo);
+ /* do something with p. */
+ rcu_read_unlock();
+
+The RCU read-side critical section ensures that the value of "p" remains
+valid until after the rcu_read_unlock(). In some sense, it is acquiring
+a reference to p that is later released when the critical section ends.
+The write side looks simply like this (with appropriate locking):
+
+ qemu_mutex_lock(&foo_mutex);
+ old = foo;
+ qatomic_rcu_set(&foo, new);
+ qemu_mutex_unlock(&foo_mutex);
+ synchronize_rcu();
+ free(old);
+
+If the processing cannot be done purely within the critical section, it
+is possible to combine this idiom with a "real" reference count:
+
+ rcu_read_lock();
+ p = qatomic_rcu_read(&foo);
+ foo_ref(p);
+ rcu_read_unlock();
+ /* do something with p. */
+ foo_unref(p);
+
+The write side can be like this:
+
+ qemu_mutex_lock(&foo_mutex);
+ old = foo;
+ qatomic_rcu_set(&foo, new);
+ qemu_mutex_unlock(&foo_mutex);
+ synchronize_rcu();
+ foo_unref(old);
+
+or with call_rcu:
+
+ qemu_mutex_lock(&foo_mutex);
+ old = foo;
+ qatomic_rcu_set(&foo, new);
+ qemu_mutex_unlock(&foo_mutex);
+    call_rcu(old, foo_unref, rcu);
+
+In both cases, the write side only performs removal. Reclamation
+happens when the last reference to a "foo" object is dropped.
+Using synchronize_rcu() is undesirably expensive, because the
+last reference may be dropped on the read side. Hence you can
+use call_rcu() instead:
+
+    void foo_unref(struct foo *p) {
+        if (qatomic_fetch_dec(&p->refcount) == 1) {
+            call_rcu(p, foo_destroy, rcu);
+        }
+    }
+
+
+Note that the same idioms would be possible with reader/writer
+locks:
+
+ read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock);
+ p = foo; p = foo;
+ /* do something with p. */ foo = new;
+ read_unlock(&foo_rwlock); free(p);
+ write_mutex_unlock(&foo_rwlock);
+ free(p);
+
+ ------------------------------------------------------------------
+
+ read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock);
+ p = foo; old = foo;
+ foo_ref(p); foo = new;
+ read_unlock(&foo_rwlock); foo_unref(old);
+ /* do something with p. */ write_mutex_unlock(&foo_rwlock);
+ read_lock(&foo_rwlock);
+ foo_unref(p);
+ read_unlock(&foo_rwlock);
+
+foo_unref could use a mechanism such as bottom halves to move deallocation
+out of the write-side critical section.
+
+
+RCU resizable arrays
+--------------------
+
+Resizable arrays can be used with RCU. The expensive RCU synchronization
+(or call_rcu) only needs to take place when the array is resized.
+The two items to take care of are:
+
+- ensuring that the old version of the array is available between removal
+ and reclamation;
+
+- avoiding mismatches in the read side between the array data and the
+ array size.
+
+The first problem is avoided simply by not using realloc. Instead,
+each resize will allocate a new array and copy the old data into it.
+The second problem would arise if the size and the data pointers were
+two members of a larger struct:
+
+ struct mystuff {
+ ...
+ int data_size;
+ int data_alloc;
+ T *data;
+ ...
+ };
+
+Instead, we store the size of the array with the array itself:
+
+ struct arr {
+ int size;
+ int alloc;
+ T data[];
+ };
+ struct arr *global_array;
+
+ read side:
+ rcu_read_lock();
+ struct arr *array = qatomic_rcu_read(&global_array);
+ x = i < array->size ? array->data[i] : -1;
+ rcu_read_unlock();
+ return x;
+
+ write side (running under a lock):
+ if (global_array->size == global_array->alloc) {
+ /* Creating a new version. */
+ new_array = g_malloc(sizeof(struct arr) +
+ global_array->alloc * 2 * sizeof(T));
+ new_array->size = global_array->size;
+ new_array->alloc = global_array->alloc * 2;
+ memcpy(new_array->data, global_array->data,
+ global_array->alloc * sizeof(T));
+
+ /* Removal phase. */
+ old_array = global_array;
+ qatomic_rcu_set(&global_array, new_array);
+ synchronize_rcu();
+
+ /* Reclamation phase. */
+ free(old_array);
+ }
+
+
+SOURCES
+=======
+
+* Documentation/RCU/ from the Linux kernel
diff --git a/docs/devel/replay.txt b/docs/devel/replay.txt
new file mode 100644
index 000000000..e641c35ad
--- /dev/null
+++ b/docs/devel/replay.txt
@@ -0,0 +1,46 @@
+The record/replay mechanism, which can be enabled through icount mode, expects
+the virtual devices to satisfy the following requirements.
+
+The main idea behind this document is that everything that affects
+the guest state during execution in icount mode should be deterministic.
+
+Timers
+======
+
+All virtual devices should use the virtual clock for timers that change the
+guest state. The virtual clock is deterministic, therefore such timers are
+deterministic too.
+
+Virtual devices can also use the realtime clock for events that do not change
+the guest state directly. When the clock ticking should depend on VM execution
+speed, use the virtual clock with the EXTERNAL attribute. It is not
+deterministic, but its speed depends on the guest execution. This clock is
+used by virtual devices (e.g., the slirp routing device) that lie outside the
+replayed guest.
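+
+For example (a sketch; "cb" and "opaque" stand for the device's timer callback
+and state):
+
+    /* deterministic: participates in record/replay */
+    timer_new_ns(QEMU_CLOCK_VIRTUAL, cb, opaque);
+
+    /* scales with guest execution speed but is not recorded
+       (virtual clock with the EXTERNAL attribute) */
+    timer_new_full(NULL, QEMU_CLOCK_VIRTUAL, SCALE_NS,
+                   QEMU_TIMER_ATTR_EXTERNAL, cb, opaque);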
+
+Bottom halves
+=============
+
+Bottom half callbacks that affect the guest state should be invoked through
+the replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
+Their invocations are saved in record mode and synchronized with the existing
+log in replay mode.
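+
+For example (a sketch; "ctx", "cb" and "opaque" are the device's AioContext,
+callback and state):
+
+    /* instead of aio_bh_schedule_oneshot(ctx, cb, opaque): */
+    replay_bh_schedule_oneshot_event(ctx, cb, opaque);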
+
+Saving/restoring the VM state
+=============================
+
+All fields in the device state structure (including virtual timers)
+should be restored by loadvm to the same values they had before savevm.
+
+Avoid accessing other devices' state, because the order of saving/restoring
+is not defined. This means that you should not call functions like
+'update_irq' in a post_load callback. Save everything explicitly to avoid
+dependencies that may make restoring the VM state non-deterministic.
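+
+For example, virtual timers are migrated alongside the rest of the device
+state (a minimal sketch; the device name and fields are hypothetical, the
+VMSTATE_* macros are the real ones):
+
+    static const VMStateDescription vmstate_mydev = {
+        .name = "mydev",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .fields = (VMStateField[]) {
+            VMSTATE_UINT32(level, MyDevState),
+            VMSTATE_TIMER_PTR(tick_timer, MyDevState),
+            VMSTATE_END_OF_LIST()
+        }
+    };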
+
+Stopping the VM
+===============
+
+Stopping the guest should not interfere with its state (with the exception
+of network connections, which can be broken by remote timeouts).
+The user can stop the VM at any moment of replay. Restarting the VM after
+such a stop should not break the replay through unneeded guest state changes.
diff --git a/docs/devel/reset.rst b/docs/devel/reset.rst
new file mode 100644
index 000000000..abea1102d
--- /dev/null
+++ b/docs/devel/reset.rst
@@ -0,0 +1,289 @@
+
+=======================================
+Reset in QEMU: the Resettable interface
+=======================================
+
+The reset of qemu objects is handled using the resettable interface declared
+in ``include/hw/resettable.h``.
+
+This interface allows objects to be grouped (on a tree basis) so that the
+whole group can be reset consistently. Each individual member object does not
+have to care about the others; in particular, problems of ordering (which
+object is reset first) are addressed.
+
+As of now DeviceClass and BusClass implement this interface.
+
+
+Triggering reset
+----------------
+
+This section documents the APIs which "users" of a resettable object should use
+to control it. All resettable control functions must be called while holding
+the iothread lock.
+
+You can apply a reset to an object using ``resettable_assert_reset()``. You need
+to call ``resettable_release_reset()`` to release the object from reset. To
+instantly reset an object, without keeping it in reset state, just call
+``resettable_reset()``. These functions take two parameters: a pointer to the
+object to reset and a reset type.
+
+Several types of reset will be supported. For now only cold reset is defined;
+others may be added later. The Resettable interface handles reset types with an
+enum:
+
+``RESET_TYPE_COLD``
+ Cold reset is supported by every resettable object. In QEMU, it means we reset
+ to the initial state corresponding to the start of QEMU; this might differ
+ from what is a real hardware cold reset. It differs from other resets (like
+ warm or bus resets) which may keep certain parts untouched.
+
+Calling ``resettable_reset()`` is equivalent to calling
+``resettable_assert_reset()`` then ``resettable_release_reset()``. It is
+possible to interleave multiple calls to these three functions. There may
+be several reset sources/controllers of a given object. The interface handles
+everything and the different reset controllers do not need to know anything
+about each other. The object leaves the reset state only when every
+controller has ended its reset operation. This is handled internally by
+maintaining a count of in-progress resets; it is crucial to call
+``resettable_release_reset()`` exactly once per
+``resettable_assert_reset()`` call.
+
+For now migration of a device or bus in reset is not supported. Care must be
+taken not to delay ``resettable_release_reset()`` after its
+``resettable_assert_reset()`` counterpart.
+
+Note that, since resettable is an interface, the API takes a simple Object as
+parameter. Still, it is a programming error to call a resettable function on a
+non-resettable object, and doing so will trigger a run-time assertion failure.
+Since most calls to the resettable interface are made through base class
+functions, such an error is not likely to happen.
+
+For Devices and Buses, the following helper functions exist:
+
+- ``device_cold_reset()``
+- ``bus_cold_reset()``
+
+These are simple wrappers around the ``resettable_reset()`` function; they only
+cast the Device or Bus into an Object and pass the cold reset type. When
+possible, prefer these functions to ``resettable_reset()``.
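+
+For instance, a controller model might pulse reset on one of its child devices
+like this (a minimal sketch; ``s->child`` is a hypothetical pointer to a child
+device)::
+
+    /* instant reset: assert then release in a single call */
+    device_cold_reset(DEVICE(s->child));
+
+    /* or keep the child in reset across a longer operation */
+    resettable_assert_reset(OBJECT(s->child), RESET_TYPE_COLD);
+    /* ... the child stays in reset until the matching release ... */
+    resettable_release_reset(OBJECT(s->child), RESET_TYPE_COLD);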
+
+Device and bus functions co-exist because there can be semantic differences
+between resetting a bus and resetting the controller bridge which owns it.
+For example, consider a SCSI controller. Resetting the controller puts all
+its registers back to their reset state and also resets everything on the
+SCSI bus, whereas resetting just the SCSI bus resets everything that's on
+it but not the controller.
+
+
+Multi-phase mechanism
+---------------------
+
+This section documents the internals of the resettable interface.
+
+The resettable interface uses a multi-phase system to relieve objects and
+machines from reset ordering problems. To address this, the reset operation
+of an object is split into three well defined phases.
+
+When resetting several objects (for example the whole machine at simulation
+startup), all first phases of all objects are executed, then all second phases
+and then all third phases.
+
+The three phases are:
+
+1. The **enter** phase is executed when the object enters reset. It resets only
+ local state of the object; it must not do anything that has a side-effect
+ on other objects, such as raising or lowering a qemu_irq line or reading or
+ writing guest memory.
+
+2. The **hold** phase is executed for entry into reset, once every object in the
+ group which is being reset has had its *enter* phase executed. At this point
+ devices can do actions that affect other objects.
+
+3. The **exit** phase is executed when the object leaves the reset state.
+ Actions affecting other objects are permitted.
+
+As said in the previous section, the interface maintains a count of resets.
+This count is used to ensure phases are executed only when required. The
+*enter* and *hold* phases are executed only when asserting reset for the first
+time (if an object is already in the reset state when
+``resettable_assert_reset()`` or ``resettable_reset()`` is called, they are not
+executed).
+The *exit* phase is executed only when the last reset operation ends. Therefore
+the object does not need to care how many reset controllers it has or how
+many of them have started a reset.
+
+
+Handling reset in a resettable object
+-------------------------------------
+
+This section documents the APIs that an implementation of a resettable object
+must provide and what functions it has access to. It is intended for people
+who want to implement or convert a class which has the resettable interface;
+for example when specializing an existing device or bus.
+
+Methods to implement
+....................
+
+Three methods should be defined or left empty. Each method corresponds to a
+phase of the reset; they are named ``phases.enter()``, ``phases.hold()`` and
+``phases.exit()``. They all take the object as a parameter. The *enter* method
+also takes the reset type as a second parameter.
+
+When extending an existing class, these methods may need to be extended too.
+The ``resettable_class_set_parent_phases()`` class function may be used to
+back up the parent class methods.
+
+Here follows an example implementing reset for a Device which asserts an IO
+line while in reset.
+
+::
+
+ static void mydev_reset_enter(Object *obj, ResetType type)
+ {
+ MyDevClass *myclass = MYDEV_GET_CLASS(obj);
+ MyDevState *mydev = MYDEV(obj);
+ /* call parent class enter phase */
+ if (myclass->parent_phases.enter) {
+ myclass->parent_phases.enter(obj, type);
+ }
+ /* initialize local state only */
+ mydev->var = 0;
+ }
+
+ static void mydev_reset_hold(Object *obj)
+ {
+ MyDevClass *myclass = MYDEV_GET_CLASS(obj);
+ MyDevState *mydev = MYDEV(obj);
+ /* call parent class hold phase */
+ if (myclass->parent_phases.hold) {
+ myclass->parent_phases.hold(obj);
+ }
+ /* set an IO */
+ qemu_set_irq(mydev->irq, 1);
+ }
+
+ static void mydev_reset_exit(Object *obj)
+ {
+ MyDevClass *myclass = MYDEV_GET_CLASS(obj);
+ MyDevState *mydev = MYDEV(obj);
+ /* call parent class exit phase */
+ if (myclass->parent_phases.exit) {
+ myclass->parent_phases.exit(obj);
+ }
+ /* clear an IO */
+ qemu_set_irq(mydev->irq, 0);
+ }
+
+ typedef struct MyDevClass {
+ MyParentClass parent_class;
+        /* to store the parent class reset phase methods */
+ ResettablePhases parent_phases;
+ } MyDevClass;
+
+ static void mydev_class_init(ObjectClass *class, void *data)
+ {
+ MyDevClass *myclass = MYDEV_CLASS(class);
+ ResettableClass *rc = RESETTABLE_CLASS(class);
+        resettable_class_set_parent_phases(rc,
+                                           mydev_reset_enter,
+                                           mydev_reset_hold,
+                                           mydev_reset_exit,
+                                           &myclass->parent_phases);
+ }
+
+In the above example, we override all three phases. It is possible to override
+only some of them by passing NULL instead of a function pointer to
+``resettable_class_set_parent_phases()``. For example, the following will
+only override the *enter* phase and leave *hold* and *exit* untouched::
+
+    resettable_class_set_parent_phases(rc, mydev_reset_enter,
+                                       NULL, NULL,
+                                       &myclass->parent_phases);
+
+This is equivalent to providing a trivial implementation of the hold and exit
+phases which does nothing but call the parent class's implementation of the
+phase.
+
+Polling the reset state
+.......................
+
+The Resettable interface provides the ``resettable_is_in_reset()`` function.
+This function returns true if the object parameter is currently under reset.
+
+An object is under reset from the beginning of the *enter* phase to the end of
+the *exit* phase. During all three phases, the function will report that the
+object is in reset.
+
+This function may be used if the object's behavior has to be adapted
+while in the reset state. For example, if a device has an irq input,
+it will probably need to ignore it while in reset; it can then
+check the reset state at the beginning of the irq callback.
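+
+A minimal sketch of such a callback (the device type and handler name are
+hypothetical)::
+
+    static void mydev_set_irq(void *opaque, int n, int level)
+    {
+        MyDevState *s = MYDEV(opaque);
+
+        if (device_is_in_reset(DEVICE(s))) {
+            /* inputs are ignored while the device is in reset */
+            return;
+        }
+        /* normal irq handling goes here */
+    }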
+
+Note that until migration of the reset state is supported, an object
+should not be left in reset. So, apart from the object currently executing
+one of its reset phases, the only case when this function returns
+true is if an external interaction (like changing an IO line) is made during
+the *hold* or *exit* phase of another object in the same reset group.
+
+Helpers ``device_is_in_reset()`` and ``bus_is_in_reset()`` are also provided
+for devices and buses and should be preferred.
+
+
+Base class handling of reset
+----------------------------
+
+This section documents parts of the reset mechanism that you only need to know
+about if you are extending it to work with a new base class other than
+DeviceClass or BusClass, or maintaining the existing code in those classes. Most
+people can ignore it.
+
+Methods to implement
+....................
+
+There are two other methods that need to exist in a class implementing the
+interface: ``get_state()`` and ``child_foreach()``.
+
+``get_state()`` is simple. *resettable* is an interface and, as a consequence,
+does not have any state structure of its own. But in order to share common
+code, we need one. This method must return a pointer to the ``ResettableState``
+structure. The structure must be allocated by the base class; preferably it
+should be located inside the object instance structure.
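+
+A minimal sketch of such a method (the base class name and the location of the
+embedded state are hypothetical)::
+
+    static ResettableState *mybase_get_reset_state(Object *obj)
+    {
+        /* the ResettableState is embedded in the instance structure */
+        return &MYBASE(obj)->reset;
+    }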
+
+``child_foreach()`` is more complex. It should execute the given callback on
+every reset child of the given resettable object. All children must be
+resettable too. Additional parameters (a reset type and an opaque pointer) must
+be passed to the callback too.
+
+In ``DeviceClass`` and ``BusClass`` the ``ResettableState`` is located in the
+``DeviceState`` and ``BusState`` structures. ``child_foreach()`` is implemented
+to follow the bus hierarchy; for a bus, it calls the function on every child
+device; for a device, it calls the function on every child bus. When we reset
+the main system bus, we reset the whole machine bus tree.
+
+Changing a resettable parent
+............................
+
+One thing which should be taken care of by the base class is handling reset
+hierarchy changes.
+
+The reset hierarchy is supposed to be static and built during machine creation.
+But there are actually some exceptions. To cope with this, the resettable API
+provides ``resettable_change_parent()``. This function allows you to set,
+update or remove the parent of a resettable object after machine creation is
+done. As parameters, it takes the object being moved, the old parent (if any)
+and the new parent (if any).
+
+This function can be used at any time when not in a reset operation. During
+a reset operation it must be used only in the *hold* phase; using it in the
+*enter* or *exit* phase is an error.
+Also, it need not be used during machine creation, although it is harmless to
+do so: the function is a no-op as long as the old and new parents are NULL or
+not in reset.
+
+There are currently two cases where this function is used:
+
+1. *device hotplug*: a new device is introduced on a live bus.
+
+2. *hot bus change*: an existing live device is added, moved or
+   removed in the bus hierarchy. At the moment, this occurs only in the raspi
+   machines, for changing the sdbus used by the SD card.
diff --git a/docs/devel/s390-dasd-ipl.rst b/docs/devel/s390-dasd-ipl.rst
new file mode 100644
index 000000000..2529eb5f5
--- /dev/null
+++ b/docs/devel/s390-dasd-ipl.rst
@@ -0,0 +1,138 @@
+Booting from real channel-attached devices on s390x
+===================================================
+
+s390 hardware IPL
+-----------------
+
+The s390 hardware IPL process consists of the following steps.
+
+1. A READ IPL ccw is constructed in memory location ``0x0``.
+ This ccw, by definition, reads the IPL1 record which is located on the disk
+ at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
+ so when it is complete another ccw will be fetched and executed from memory
+ location ``0x08``.
+
+2. Execute the Read IPL ccw at ``0x00``, thereby reading IPL1 data into ``0x00``.
+ IPL1 data is 24 bytes in length and consists of the following pieces of
+ information: ``[psw][read ccw][tic ccw]``. When the machine executes the Read
+   IPL ccw it reads the 24 bytes of IPL1 into memory starting at
+   location ``0x0``. Then the ccw program at ``0x08``, which consists of a read
+   ccw and a tic ccw, is automatically executed because of the chain flag from
+ the original READ IPL ccw. The read ccw will read the IPL2 data into memory
+ and the TIC (Transfer In Channel) will transfer control to the channel
+ program contained in the IPL2 data. The TIC channel command is the
+ equivalent of a branch/jump/goto instruction for channel programs.
+
+ NOTE: The ccws in IPL1 are defined by the architecture to be format 0.
+
+3. Execute IPL2.
+ The TIC ccw instruction at the end of the IPL1 channel program will begin
+ the execution of the IPL2 channel program. IPL2 is stage-2 of the boot
+ process and will contain a larger channel program than IPL1. The point of
+ IPL2 is to find and load either the operating system or a small program that
+ loads the operating system from disk. At the end of this step all or some of
+ the real operating system is loaded into memory and we are ready to hand
+ control over to the guest operating system. At this point the guest
+ operating system is entirely responsible for loading any more data it might
+ need to function.
+
+ NOTE: The IPL2 channel program might read data into memory
+ location ``0x0`` thereby overwriting the IPL1 psw and channel program. This is ok
+ as long as the data placed in location ``0x0`` contains a psw whose instruction
+ address points to the guest operating system code to execute at the end of
+ the IPL/boot process.
+
+ NOTE: The ccws in IPL2 are defined by the architecture to be format 0.
+
+4. Start executing the guest operating system.
+ The psw that was loaded into memory location ``0x0`` as part of the ipl process
+ should contain the needed flags for the operating system we have loaded. The
+ psw's instruction address will point to the location in memory where we want
+ to start executing the operating system. This psw is loaded (via LPSW
+ instruction) causing control to be passed to the operating system code.
+
+In a non-virtualized environment this process, handled entirely by the hardware,
+is kicked off by the user initiating a "Load" procedure from the hardware
+management console. This "Load" procedure crafts a special "Read IPL" ccw in
+memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking
+off the reading of IPL1 data. Since the channel program from IPL1 will be
+written immediately after the special "Read IPL" ccw, the IPL1 channel program
+will be executed immediately (the special read ccw has the chaining bit turned
+on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel
+program to be executed automatically. After this sequence completes the "Load"
+procedure then loads the psw from ``0x0``.
+
+How this all pertains to QEMU (and the kernel)
+----------------------------------------------
+
+In theory we should merely have to do the following to IPL/boot a guest
+operating system from a DASD device:
+
+1. Place a "Read IPL" ccw into memory location ``0x0`` with chaining bit on.
+2. Execute channel program at ``0x0``.
+3. LPSW ``0x0``.
+
+However, our emulation of the machine's channel program logic within the kernel
+is missing one key feature that is required for this process to work:
+non-prefetch of ccw data.
+
+When we start a channel program we pass the channel subsystem parameters via an
+ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
+bit is on then the vfio-ccw kernel driver is allowed to read the entire channel
+program from guest memory before it starts executing it. This means that any
+channel commands that read additional channel commands will not work as expected
+because the newly read commands will only exist in guest memory and NOT within
+the kernel's channel subsystem memory. The kernel vfio-ccw driver currently
+requires this bit to be on for all channel programs. This is a problem because
+the IPL process consists of transferring control from the "Read IPL" ccw
+immediately to the IPL1 channel program that was read by "Read IPL".
+
+Not being able to turn off prefetch will also prevent the TIC at the end of the
+IPL1 channel program from transferring control to the IPL2 channel program.
+
+Lastly, in some cases (the zipl bootloader for example) the IPL2 program also
+transfers control to another channel program segment immediately after reading
+it from the disk. So we need to be able to handle this case.
+
+What QEMU does
+--------------
+
+Since we are forced to live with prefetch we cannot use the very simple IPL
+procedure we defined in the preceding section. So we compensate by doing the
+following.
+
+1. Place "Read IPL" ccw into memory location ``0x0``, but turn off chaining bit.
+2. Execute "Read IPL" at ``0x0``.
+
+ So now IPL1's psw is at ``0x0`` and IPL1's channel program is at ``0x08``.
+
+3. Write a custom channel program that will seek to the IPL2 record and then
+ execute the READ and TIC ccws from IPL1. Normally the seek is not required
+ because after reading the IPL1 record the disk is automatically positioned
+ to read the very next record which will be IPL2. But since we are not reading
+ both IPL1 and IPL2 as part of the same channel program we must manually set
+ the position.
+
+4. Grab the target address of the TIC instruction from the IPL1 channel program.
+ This address is where the IPL2 channel program starts.
+
+ Now IPL2 is loaded into memory somewhere, and we know the address.
+
+5. Execute the IPL2 channel program at the address obtained in step #4.
+
+ Because this channel program can be dynamic, we must use a special algorithm
+ that detects a READ immediately followed by a TIC and breaks the ccw chain
+ by turning off the chain bit in the READ ccw. When control is returned from
+ the kernel/hardware to the QEMU bios code we immediately issue another start
+ subchannel to execute the remaining TIC instruction. This causes the entire
+ channel program (starting from the TIC) and all needed data to be refetched
+ thereby stepping around the limitation that would otherwise prevent this
+ channel program from executing properly.
+
+ Now the operating system code is loaded somewhere in guest memory and the psw
+ in memory location ``0x0`` will point to entry code for the guest operating
+ system.
+
+6. LPSW ``0x0``
+
+ LPSW transfers control to the guest operating system and we're done.
diff --git a/docs/devel/secure-coding-practices.rst b/docs/devel/secure-coding-practices.rst
new file mode 100644
index 000000000..0454cc527
--- /dev/null
+++ b/docs/devel/secure-coding-practices.rst
@@ -0,0 +1,115 @@
+=======================
+Secure Coding Practices
+=======================
+This document covers topics that both developers and security researchers must
+be aware of so that they can develop safe code and audit existing code
+properly.
+
+Reporting Security Bugs
+-----------------------
+For details on how to report security bugs or ask questions about potential
+security bugs, see the `Security Process wiki page
+<https://wiki.qemu.org/SecurityProcess>`_.
+
+General Secure C Coding Practices
+---------------------------------
+Most CVEs (security bugs) reported against QEMU are not specific to
+virtualization or emulation. They are simply C programming bugs. Therefore
+it's critical to be aware of common classes of security bugs.
+
+There is a wide selection of resources available covering secure C coding. For
+example, the `CERT C Coding Standard
+<https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard>`_
+covers the most important classes of security bugs.
+
+Instead of describing them in detail here, only the names of the most important
+classes of security bugs are mentioned:
+
+* Buffer overflows
+* Use-after-free and double-free
+* Integer overflows
+* Format string vulnerabilities
+
+Some of these classes of bugs can be detected by analyzers. Static analysis is
+performed regularly by Coverity and the most obvious of these bugs are even
+reported by compilers. Dynamic analysis is possible with valgrind, tsan, and
+asan.
+
+Input Validation
+----------------
+Inputs from the guest or external sources (e.g. network, files) cannot be
+trusted and may be invalid. Inputs must be checked before using them in a way
+that could crash the program, expose host memory to the guest, or otherwise be
+exploitable by an attacker.
+
+The most sensitive attack surface is device emulation. All hardware register
+accesses and data read from guest memory must be validated. A typical example
+is a device that contains multiple units that are selectable by the guest via
+an index register::
+
+ typedef struct {
+ ProcessingUnit unit[2];
+ ...
+ } MyDeviceState;
+
+ static void mydev_writel(void *opaque, uint32_t addr, uint32_t val)
+ {
+ MyDeviceState *mydev = opaque;
+ ProcessingUnit *unit;
+
+ switch (addr) {
+ case MYDEV_SELECT_UNIT:
+ unit = &mydev->unit[val]; <-- this input wasn't validated!
+ ...
+ }
+ }
+
+If ``val`` is not in range [0, 1] then an out-of-bounds memory access will take
+place when ``unit`` is dereferenced. The code must check that ``val`` is 0 or
+1 and handle the case where it is invalid.
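+
+One possible fix is sketched below (``ARRAY_SIZE`` and ``qemu_log_mask`` are
+existing QEMU helpers; the error handling policy shown here is only an
+example)::
+
+    case MYDEV_SELECT_UNIT:
+        if (val >= ARRAY_SIZE(mydev->unit)) {
+            /* reject the invalid selector instead of using it */
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "mydev: invalid unit selector 0x%" PRIx32 "\n", val);
+            return;
+        }
+        unit = &mydev->unit[val];
+        ...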
+
+Unexpected Device Accesses
+--------------------------
+The guest may access device registers in unusual orders or at unexpected
+moments. Device emulation code must not assume that the guest follows the
+typical "theory of operation" presented in driver writer manuals. The guest
+may make nonsense accesses to device registers such as starting operations
+before the device has been fully initialized.
+
+A related issue is that device emulation code must be prepared for unexpected
+device register accesses while asynchronous operations are in progress. A
+well-behaved guest might wait for a completion interrupt before accessing
+certain device registers. Device emulation code must handle the case where the
+guest overwrites registers or submits further requests before an ongoing
+request completes. Unexpected accesses must not cause memory corruption or
+leaks in QEMU.
+
+Invalid device register accesses can be reported with
+``qemu_log_mask(LOG_GUEST_ERROR, ...)``. The ``-d guest_errors`` command-line
+option enables these log messages.
+
+Live Migration
+--------------
+Device state can be saved to disk image files and shared with other users.
+Live migration code must validate inputs when loading device state so an
+attacker cannot gain control by crafting invalid device states. Device state
+is therefore considered untrusted even though it is typically generated by QEMU
+itself.
+
+Guest Memory Access Races
+-------------------------
+Guests with multiple vCPUs may modify guest RAM while device emulation code is
+running. Device emulation code must copy in descriptors and other guest RAM
+structures and only process the local copy. This prevents
+time-of-check-to-time-of-use (TOCTOU) race conditions that could cause QEMU to
+crash when a vCPU thread modifies guest RAM while device emulation is
+processing it.
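+
+A minimal sketch of the copy-then-validate pattern (the descriptor type,
+guest address and size limit are hypothetical; ``cpu_physical_memory_read()``
+is an existing API)::
+
+    MyDescriptor desc;
+
+    /* copy the descriptor out of guest RAM exactly once */
+    cpu_physical_memory_read(desc_gpa, &desc, sizeof(desc));
+
+    /* validate and process only the local copy from here on */
+    if (desc.len > MYDEV_MAX_LEN) {
+        return;
+    }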
+
+Use of null-co block drivers
+----------------------------
+
+The ``null-co`` block driver is designed for performance: its read accesses are
+not initialized by default. If this driver has to be used for security
+research, it must be used with the ``read-zeroes=on`` option, which fills read
+buffers with zeroes. Security issues reported with the default setting
+(``read-zeroes=off``) will be discarded.
diff --git a/docs/devel/stable-process.rst b/docs/devel/stable-process.rst
new file mode 100644
index 000000000..c21fb8664
--- /dev/null
+++ b/docs/devel/stable-process.rst
@@ -0,0 +1,73 @@
+.. _stable-process:
+
+QEMU and the stable process
+===========================
+
+QEMU stable releases
+--------------------
+
+QEMU stable releases are based upon the last released QEMU version
+and marked by an additional version number, e.g. 2.10.1. Occasionally,
+a four-number version is released, if a single urgent fix needs to go
+on top.
+
+Usually, stable releases are only provided for the last major QEMU
+release. For example, when QEMU 2.11.0 is released, 2.11.x or 2.11.x.y
+stable releases are produced only until QEMU 2.12.0 is released, at
+which point the stable process moves to producing 2.12.x/2.12.x.y releases.
+
+What should go into a stable release?
+-------------------------------------
+
+Generally, the following patches are considered stable material:
+
+* Patches that fix severe issues, like fixes for CVEs
+
+* Patches that fix regressions
+
+If you think the patch would be important for users of the current release
+(or for a distribution picking fixes), it is usually a good candidate
+for stable.
+
+
+How to get a patch into QEMU stable
+-----------------------------------
+
+There are various ways to get a patch into stable:
+
+* Preferred: Make sure that the stable maintainers are on copy when you send
+ the patch by adding
+
+ .. code::
+
+ Cc: qemu-stable@nongnu.org
+
+ to the patch description. By default, this will send a copy of the patch
+ to ``qemu-stable@nongnu.org`` if you use git send-email, which is where
+ patches that are stable candidates are tracked by the maintainers.
+
+* You can also reply to a patch and put ``qemu-stable@nongnu.org`` on copy
+ directly in your mail client if you think a previously submitted patch
+ should be considered for a stable release.
+
+* If a maintainer judges the patch appropriate for stable later on (or you
+ notify them), they will add the same line to the patch, meaning that
+ the stable maintainers will be on copy on the maintainer's pull request.
+
+* If you judge an already merged patch suitable for stable, send a mail
+ (preferably as a reply to the most recent patch submission) to
+ ``qemu-stable@nongnu.org`` along with ``qemu-devel@nongnu.org`` and
+ appropriate other people (like the patch author or the relevant maintainer)
+ on copy.
+
+Stable release process
+----------------------
+
+When the stable maintainers prepare a new stable release, they will prepare
+a git branch with a release candidate and send the patches out to
+``qemu-devel@nongnu.org`` for review. If any of your patches are included,
+please verify that they look fine, especially if the maintainer had to tweak
+the patch as part of back-porting things across branches. You may also
+nominate other patches that you think are suitable for inclusion. After
+review is complete (may involve more release candidates), a new stable release
+is made available.
diff --git a/docs/devel/style.rst b/docs/devel/style.rst
new file mode 100644
index 000000000..9c5c0fffd
--- /dev/null
+++ b/docs/devel/style.rst
@@ -0,0 +1,703 @@
+.. _coding-style:
+
+=================
+QEMU Coding Style
+=================
+
+.. contents:: Table of Contents
+
+Please use the script checkpatch.pl in the scripts directory to check
+patches before submitting.
+
+Formatting and style
+********************
+
+Whitespace
+==========
+
+Of course, the most important aspect in any coding style is whitespace.
+Crusty old coders who have trouble spotting the glasses on their noses
+can tell the difference between a tab and eight spaces from a distance
+of approximately fifteen parsecs. Many a flamewar has been fought and
+lost on this issue.
+
+QEMU indents are four spaces. Tabs are never used, except in Makefiles
+where they have been irreversibly coded into the syntax.
+Spaces of course are superior to tabs because:
+
+* You have just one way to specify whitespace, not two. Ambiguity breeds
+ mistakes.
+* The confusion surrounding 'use tabs to indent, spaces to justify' is gone.
+* Tab indents push your code to the right, making your screen seriously
+ unbalanced.
+* Tabs will be rendered incorrectly by editors that are misconfigured not
+  to use tab stops of eight positions.
+* Tabs are rendered badly in patches, causing off-by-one errors in almost
+ every line.
+* It is the QEMU coding style.
+
+Do not leave whitespace dangling off the ends of lines.
+
+Multiline Indent
+----------------
+
+There are several places where indent is necessary:
+
+* if/else
+* while/for
+* function definition & call
+
+When breaking up a long line to fit within line width, we need a proper indent
+for the following lines.
+
+In case of if/else, while/for, align the secondary lines just after the
+opening parenthesis of the first.
+
+For example:
+
+.. code-block:: c
+
+ if (a == 1 &&
+ b == 2) {
+
+ while (a == 1 &&
+ b == 2) {
+
+In case of function, there are several variants:
+
+* 4 spaces indent from the beginning
+* align the secondary lines just after the opening parenthesis of the first
+
+For example:
+
+.. code-block:: c
+
+ do_something(x, y,
+ z);
+
+ do_something(x, y,
+ z);
+
+ do_something(x, do_another(y,
+ z));
+
+Line width
+==========
+
+Lines should be 80 characters; try not to make them longer.
+
+Sometimes it is hard to do, especially when dealing with QEMU subsystems
+that use long function or symbol names. If wrapping the line at 80 columns
+is obviously less readable and more awkward, prefer not to wrap it; better
+to have an 85 character line than one which is awkwardly wrapped.
+
+Even in that case, try not to make lines much longer than 80 characters.
+(The checkpatch script will warn at 100 characters, but this is intended
+as a guard against obviously-overlength lines, not a target.)
+
+Rationale:
+
+* Some people like to tile their 24" screens with a 6x4 matrix of 80x24
+ xterms and use vi in all of them. The best way to punish them is to
+ let them keep doing it.
+* Code, and especially patches, is much more readable if limited to a sane
+  line length. Eighty is traditional.
+* The four-space indentation makes the most common excuse ("But look
+ at all that white space on the left!") moot.
+* It is the QEMU coding style.
+
+Naming
+======
+
+Variables are lower_case_with_underscores; easy to type and read. Structured
+type names are in CamelCase; harder to type but standing out. Enum type
+names and function type names should also be in CamelCase. Scalar type
+names are lower_case_with_underscores_ending_with_a_t, like the POSIX
+uint64_t and family. Note that this last convention contradicts POSIX
+and is therefore likely to be changed.
+
+Variable Naming Conventions
+---------------------------
+
+A number of short naming conventions exist for variables that use
+common QEMU types. For example, the architecture independent CPUState
+is often held as a ``cs`` pointer variable, whereas the concrete
+CPUArchState is usually held in a pointer called ``env``.
+
+Likewise, in device emulation code the common DeviceState is usually
+called ``dev``.
+
+Function Naming Conventions
+---------------------------
+
+Wrapped version of standard library or GLib functions use a ``qemu_``
+prefix to alert readers that they are seeing a wrapped version, for
+example ``qemu_strtol`` or ``qemu_mutex_lock``. Other utility functions
+that are widely called from across the codebase should not have any
+prefix, for example ``pstrcpy`` or bit manipulation functions such as
+``find_first_bit``.
+
+The ``qemu_`` prefix is also used for functions that modify global
+emulator state, for example ``qemu_add_vm_change_state_handler``.
+However, if there is an obvious subsystem-specific prefix it should be
+used instead.
+
+Public functions from a file or subsystem (declared in headers) tend
+to have a consistent prefix to show where they came from. For example,
+``tlb_`` for functions from ``cputlb.c`` or ``cpu_`` for functions
+from cpus.c.
+
+If there are two versions of a function to be called with or without a
+lock held, the function that expects the lock to be already held
+usually uses the suffix ``_locked``.
+
+
+Block structure
+===============
+
+Every indented statement is braced; even if the block contains just one
+statement. The opening brace is on the line that contains the control
+flow statement that introduces the new block; the closing brace is on the
+same line as the else keyword, or on a line by itself if there is no else
+keyword. Example:
+
+.. code-block:: c
+
+ if (a == 5) {
+ printf("a was 5.\n");
+ } else if (a == 6) {
+ printf("a was 6.\n");
+ } else {
+ printf("a was something else entirely.\n");
+ }
+
+Note that 'else if' is considered a single statement; otherwise a long if/
+else if/else if/.../else sequence would need an indent for every else
+statement.
+
+An exception is the opening brace for a function; for reasons of tradition
+and clarity it comes on a line by itself:
+
+.. code-block:: c
+
+ void a_function(void)
+ {
+ do_something();
+ }
+
+Rationale: a consistent (except for functions...) bracing style reduces
+ambiguity and avoids needless churn when lines are added or removed.
+Furthermore, it is the QEMU coding style.
+
+Declarations
+============
+
+Mixed declarations (interleaving statements and declarations within
+blocks) are generally not allowed; declarations should be at the beginning
+of blocks.
+
+Every now and then, an exception is made for declarations inside a
+#ifdef or #ifndef block: if the code looks nicer, such declarations can
+be placed at the top of the block even if there are statements above.
+On the other hand, however, it's often best to move that #ifdef/#ifndef
+block to a separate function altogether.
+
+Conditional statements
+======================
+
+When comparing a variable for (in)equality with a constant, list the
+constant on the right, as in:
+
+.. code-block:: c
+
+ if (a == 1) {
+ /* Reads like: "If a equals 1" */
+ do_something();
+ }
+
+Rationale: Yoda conditions (as in 'if (1 == a)') are awkward to read.
+Besides, good compilers already warn users when '==' is mis-typed as '=',
+even when the constant is on the right.
+
+Comment style
+=============
+
+We use traditional C-style /``*`` ``*``/ comments and avoid // comments.
+
+Rationale: The // form is valid in C99, so this is purely a matter of
+consistency of style. The checkpatch script will warn you about this.
+
+Multiline comment blocks should have a row of stars on the left,
+and the initial /``*`` and terminating ``*``/ both on their own lines:
+
+.. code-block:: c
+
+ /*
+ * like
+ * this
+ */
+
+This is the same format required by the Linux kernel coding style.
+
+(Some of the existing comments in the codebase use the GNU Coding
+Standards form which does not have stars on the left, or other
+variations; avoid these when writing new comments, but don't worry
+about converting to the preferred form unless you're editing that
+comment anyway.)
+
+Rationale: Consistency, and ease of visually picking out a multiline
+comment from the surrounding code.
+
+Language usage
+**************
+
+Preprocessor
+============
+
+Variadic macros
+---------------
+
+For variadic macros, stick with this C99-like syntax:
+
+.. code-block:: c
+
+ #define DPRINTF(fmt, ...) \
+ do { printf("IRQ: " fmt, ## __VA_ARGS__); } while (0)
+
+Include directives
+------------------
+
+Order include directives as follows:
+
+.. code-block:: c
+
+ #include "qemu/osdep.h" /* Always first... */
+ #include <...> /* then system headers... */
+ #include "..." /* and finally QEMU headers. */
+
+The "qemu/osdep.h" header contains preprocessor macros that affect the behavior
+of core system headers like <stdint.h>. It must be the first include so that
+core system headers included by external libraries get the preprocessor macros
+that QEMU depends on.
+
+Do not include "qemu/osdep.h" from header files since the .c file will have
+already included it.
+
+C types
+=======
+
+It should be common sense to use the right type, but we have collected
+a few useful guidelines here.
+
+Scalars
+-------
+
+If you're using "int" or "long", odds are good that there's a better type.
+If a variable is counting something, it should be declared with an
+unsigned type.
+
+If it's host memory-size related, size_t should be a good choice (use
+ssize_t only if required). Guest RAM memory offsets must use ram_addr_t,
+but only for RAM; it may not cover the whole guest address space.
+
+If it's file-size related, use off_t.
+If it's file-offset related (i.e., signed), use off_t.
+If it's just counting small numbers use "unsigned int"
+(on all but oddball embedded systems, you can assume that that
+type is at least four bytes wide).
+
+In the event that you require a specific width, use a standard type
+like int32_t, uint32_t, uint64_t, etc. The specific types are
+mandatory for VMState fields.
+
+Don't use Linux kernel internal types like u32, __u32 or __le32.
+
+Use hwaddr for guest physical addresses except pcibus_t
+for PCI addresses. In addition, ram_addr_t is a QEMU internal address
+space that maps guest RAM physical addresses into an intermediate
+address space that can map to host virtual address spaces. Generally
+speaking, the size of guest memory can always fit into ram_addr_t but
+it would not be correct to store an actual guest physical address in a
+ram_addr_t.
+
+For CPU virtual addresses there are several possible types.
+vaddr is the best type to use to hold a CPU virtual address in
+target-independent code. It is guaranteed to be large enough to hold a
+virtual address for any target, and it does not change size from target
+to target. It is always unsigned.
+target_ulong is a type the size of a virtual address on the CPU; this means
+it may be 32 or 64 bits depending on which target is being built. It should
+therefore be used only in target-specific code, and in some
+performance-critical built-per-target core code such as the TLB code.
+There is also a signed version, target_long.
+abi_ulong is for the ``*``-user targets, and represents a type the size of
+'void ``*``' in that target's ABI. (This may not be the same as the size of a
+full CPU virtual address in the case of target ABIs which use 32 bit pointers
+on 64 bit CPUs, like sparc32plus.) Definitions of structures that must match
+the target's ABI must use this type for anything that on the target is defined
+to be an 'unsigned long' or a pointer type.
+There is also a signed version, abi_long.
+
+Of course, take all of the above with a grain of salt. If you're about
+to use some system interface that requires a type like size_t, pid_t or
+off_t, use matching types for any corresponding variables.
+
+Also, if you try to use e.g., "unsigned int" as a type, and that
+conflicts with the signedness of a related variable, sometimes
+it's best just to use the *wrong* type, if "pulling the thread"
+and fixing all related variables would be too invasive.
+
+Finally, while using descriptive types is important, be careful not to
+go overboard. If whatever you're doing causes warnings, or requires
+casts, then reconsider or ask for help.
+
+Pointers
+--------
+
+Ensure that all of your pointers are "const-correct".
+Unless a pointer is used to modify the pointed-to storage,
+give it the "const" attribute. That way, the reader knows
+up-front that this is a read-only pointer. Perhaps more
+importantly, if we're diligent about this, when you see a non-const
+pointer, you're guaranteed that it is used to modify the storage
+it points to, or it is aliased to another pointer that is.
+
+Typedefs
+--------
+
+Typedefs are used to eliminate the redundant 'struct' keyword, since type
+names have a different style than other identifiers ("CamelCase" versus
+"snake_case"). Each named struct type should have a CamelCase name and a
+corresponding typedef.
+
+Since certain C compilers choke on duplicated typedefs, you should avoid
+them and declare a typedef only in one header file. For common types,
+you can use "include/qemu/typedefs.h" for example. However, as a matter
+of convenience it is also perfectly fine to use forward struct
+definitions instead of typedefs in headers and function prototypes; this
+avoids problems with duplicated typedefs and reduces the need to include
+headers from other headers.
+
+Reserved namespaces in C and POSIX
+----------------------------------
+
+Underscore capital, double underscore, and underscore 't' suffixes should be
+avoided.
+
+Low level memory management
+===========================
+
+Use of the ``malloc/free/realloc/calloc/valloc/memalign/posix_memalign``
+APIs is not allowed in the QEMU codebase. Instead of these routines,
+use the GLib memory allocation routines
+``g_malloc/g_malloc0/g_new/g_new0/g_realloc/g_free``
+or QEMU's ``qemu_memalign/qemu_blockalign/qemu_vfree`` APIs.
+
+Please note that ``g_malloc`` will exit on allocation failure, so
+there is no need to test for failure (as you would have to with
+``malloc``). Generally using ``g_malloc`` on start-up is fine as the
+result of a failure to allocate memory is going to be a fatal exit
+anyway. There may be some start-up cases where failing is unreasonable
+(for example speculatively loading a large debug symbol table).
+
+Care should be taken to avoid introducing places where the guest could
+trigger an exit by causing a large allocation. For small allocations,
+of the order of 4k, a failure to allocate is likely indicative of an
+overloaded host and allowing ``g_malloc`` to ``exit`` is a reasonable
+approach. However, for larger allocations where we could realistically
+fall back to a smaller one if need be, we should use functions like
+``g_try_new`` and check the result. For example, this is a valid approach
+for a time/space trade-off like ``tlb_mmu_resize_locked`` in the
+SoftMMU TLB code.
+
+If the lifetime of the allocation is within the function and there are
+multiple exit paths you can also improve the readability of the code
+by using ``g_autofree`` and related annotations. See :ref:`autofree-ref`
+for more details.
+
+Calling ``g_malloc`` with a zero size is valid and will return NULL.
+
+Prefer ``g_new(T, n)`` instead of ``g_malloc(sizeof(T) * n)`` for the following
+reasons:
+
+* It catches multiplication overflowing size_t;
+* It returns T ``*`` instead of void ``*``, letting compiler catch more type errors.
+
+Declarations like
+
+.. code-block:: c
+
+ T *v = g_malloc(sizeof(*v))
+
+are acceptable, though.
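+
+For example, the preferred ``g_new`` form might be used like this (a minimal
+sketch; the type and count are hypothetical):
+
+.. code-block:: c
+
+    MyDevState *states = g_new0(MyDevState, n_states);
+    ...
+    g_free(states);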
+
+Memory allocated by ``qemu_memalign`` or ``qemu_blockalign`` must be freed with
+``qemu_vfree``, since breaking this will cause problems on Win32.
+
+String manipulation
+===================
+
+Do not use the strncpy function. As mentioned in the man page, it does *not*
+guarantee a NULL-terminated buffer, which makes it extremely dangerous to use.
+It also zeros trailing destination bytes out to the specified length. Instead,
+use this similar function when possible, but note its different signature:
+
+.. code-block:: c
+
+ void pstrcpy(char *dest, int dest_buf_size, const char *src)
+
+Don't use strcat because it can't check for buffer overflows, but:
+
+.. code-block:: c
+
+ char *pstrcat(char *buf, int buf_size, const char *s)
+
+The same limitation exists with sprintf and vsprintf, so use snprintf and
+vsnprintf.
+
+QEMU provides other useful string functions:
+
+.. code-block:: c
+
+ int strstart(const char *str, const char *val, const char **ptr)
+ int stristart(const char *str, const char *val, const char **ptr)
+ int qemu_strnlen(const char *s, int max_len)
+
+There are also replacement character processing macros for isxyz and toxyz,
+so instead of e.g. isalnum you should use qemu_isalnum.
+
+Because of the memory management rules, you must use g_strdup/g_strndup
+instead of plain strdup/strndup.
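+
+For example, copying a possibly-long source string into a fixed-size buffer
+and duplicating another one on the heap might look like this (a minimal
+sketch; the variable names are hypothetical):
+
+.. code-block:: c
+
+    char name[32];
+    char *copy;
+
+    pstrcpy(name, sizeof(name), src);   /* always NUL-terminated */
+    copy = g_strdup(src);               /* must be freed with g_free() */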
+
+Printf-style functions
+======================
+
+Whenever you add a new printf-style function, i.e., one with a format
+string argument and following "..." in its prototype, be sure to use
+gcc's printf attribute directive in the prototype.
+
+This makes it so gcc's -Wformat and -Wformat-security options can do
+their jobs and cross-check format strings with the number and types
+of arguments.
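+
+For example, a declaration might look like this (a minimal sketch using the
+raw GCC attribute; QEMU's headers provide wrapper macros for it, e.g.
+``GCC_FMT_ATTR``):
+
+.. code-block:: c
+
+    void mydev_log(MyDevState *s, const char *fmt, ...)
+        __attribute__((format(printf, 2, 3)));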
+
+C standard, implementation defined and undefined behaviors
+==========================================================
+
+C code in QEMU should be written to the C99 language specification. A copy
+of the final version of the C99 standard with corrigenda TC1, TC2, and TC3
+included, formatted as a draft, can be downloaded from:
+
+ `<http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf>`_
+
+The C language specification defines regions of undefined behavior and
+implementation defined behavior (to give compiler authors enough leeway to
+produce better code). In general, code in QEMU should follow the language
+specification and avoid both undefined and implementation defined
+constructs. ("It works fine on the gcc I tested it with" is not a valid
+argument...) However there are a few areas where we allow ourselves to
+assume certain behaviors because in practice all the platforms we care about
+behave in the same way and writing strictly conformant code would be
+painful. These are:
+
+* you may assume that integers are 2s complement representation
+* you may assume that right shift of a signed integer duplicates
+ the sign bit (ie it is an arithmetic shift, not a logical shift)
+
+In addition, QEMU assumes that the compiler does not use the latitude
+given in C99 and C11 to treat aspects of signed '<<' as undefined, as
+documented in the GNU Compiler Collection manual starting at version 4.0.
+
+.. _autofree-ref:
+
+Automatic memory deallocation
+=============================
+
+QEMU has a mandatory dependency on either the GCC or Clang compiler. As
+such it has the freedom to make use of a C language extension for
+automatically running a cleanup function when a stack variable goes
+out of scope. This can be used to simplify function cleanup paths,
+often allowing many goto jumps to be eliminated, through automatic
+free'ing of memory.
+
+The GLib2 library provides a number of functions/macros for enabling
+automatic cleanup:
+
+ `<https://developer.gnome.org/glib/stable/glib-Miscellaneous-Macros.html>`_
+
+Most notably:
+
+* g_autofree - will invoke g_free() on the variable going out of scope
+
+* g_autoptr - for structs / objects, will invoke the cleanup func created
+ by a previous use of G_DEFINE_AUTOPTR_CLEANUP_FUNC. This is
+ supported for most GLib data types and GObjects
+
+For example, instead of
+
+.. code-block:: c
+
+ int somefunc(void) {
+ int ret = -1;
+        char *foo = g_strdup_printf("foo%s", "wibble");
+ GList *bar = .....
+
+ if (eek) {
+ goto cleanup;
+ }
+
+ ret = 0;
+
+ cleanup:
+ g_free(foo);
+ g_list_free(bar);
+ return ret;
+ }
+
+Using g_autofree/g_autoptr enables the code to be written as:
+
+.. code-block:: c
+
+ int somefunc(void) {
+        g_autofree char *foo = g_strdup_printf("foo%s", "wibble");
+ g_autoptr (GList) bar = .....
+
+ if (eek) {
+ return -1;
+ }
+
+ return 0;
+ }
+
+While this generally results in simpler, less leak-prone code, there
+are still some caveats to beware of:
+
+* Variables declared with g_auto* MUST always be initialized,
+ otherwise the cleanup function will use uninitialized stack memory
+
+* If a variable declared with g_auto* holds a value which must
+ live beyond the life of the function, that value must be saved
+ and the original variable NULL'd out. This can be simpler using
+ g_steal_pointer
+
+
+.. code-block:: c
+
+ char *somefunc(void) {
+        g_autofree char *foo = g_strdup_printf("foo%s", "wibble");
+ g_autoptr (GList) bar = .....
+
+ if (eek) {
+ return NULL;
+ }
+
+ return g_steal_pointer(&foo);
+ }
+
+
+QEMU Specific Idioms
+********************
+
+Error handling and reporting
+============================
+
+Reporting errors to the human user
+----------------------------------
+
+Do not use printf(), fprintf() or monitor_printf(). Instead, use
+error_report() or error_vreport() from error-report.h. This ensures the
+error is reported in the right place (current monitor or stderr), and in
+a uniform format.
+
+Use error_printf() & friends to print additional information.
+
+error_report() prints the current location. In certain common cases
+like command line parsing, the current location is tracked
+automatically. To manipulate it manually, use the loc_``*``() functions from
+error-report.h.
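+
+For example, reporting a failure with ``error_report()`` might look like this
+(a minimal sketch; the message and variables are hypothetical):
+
+.. code-block:: c
+
+    if (fd < 0) {
+        error_report("could not open %s: %s", filename, strerror(errno));
+        return -1;
+    }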
+
+Propagating errors
+------------------
+
+An error can't always be reported to the user right where it's detected,
+but often needs to be propagated up the call chain to a place that can
+handle it. This can be done in various ways.
+
+The most flexible one is Error objects. See error.h for usage
+information.
+
+Use the simplest suitable method to communicate success / failure to
+callers. Stick to common methods: non-negative on success / -1 on
+error, non-negative / -errno, non-null / null, or Error objects.
+
+Example: when a function returns a non-null pointer on success, and it
+can fail only in one way (as far as the caller is concerned), returning
+null on failure is just fine, and certainly simpler and a lot easier on
+the eyes than propagating an Error object through an Error ``*````*`` parameter.
+
+Example: when a function's callers need to report details on failure
+only the function really knows, use Error ``*````*``, and set suitable errors.
+
+Do not report an error to the user when you're also returning an error
+for somebody else to handle. Leave the reporting to the place that
+consumes the error returned.
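+
+A minimal sketch of the Error object style (the function, limit and message
+are hypothetical; see error.h for the real API and conventions):
+
+.. code-block:: c
+
+    bool mydev_set_speed(MyDevState *s, int speed, Error **errp)
+    {
+        if (speed > MYDEV_MAX_SPEED) {
+            error_setg(errp, "speed %d exceeds maximum %d",
+                       speed, MYDEV_MAX_SPEED);
+            return false;
+        }
+        s->speed = speed;
+        return true;
+    }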
+
+Handling errors
+---------------
+
+Calling exit() is fine when handling configuration errors during
+startup. It's problematic during normal operation. In particular,
+monitor commands should never exit().
+
+Do not call exit() or abort() to handle an error that can be triggered
+by the guest (e.g., some unimplemented corner case in guest code
+translation or device emulation). Guests should not be able to
+terminate QEMU.
+
+Note that &error_fatal is just another way to exit(1), and &error_abort
+is just another way to abort().
+
+
+trace-events style
+==================
+
+0x prefix
+---------
+
+In trace-events files, use a '0x' prefix to specify hex numbers, as in:
+
+.. code-block:: c
+
+ some_trace(unsigned x, uint64_t y) "x 0x%x y 0x" PRIx64
+
+An exception is made for groups of numbers that are hexadecimal by
+convention and separated by the symbols '.', '/', ':', or ' ' (such as
+PCI bus id):
+
+.. code-block:: c
+
+ another_trace(int cssid, int ssid, int dev_num) "bus id: %x.%x.%04x"
+
+However, you can use '0x' for such groups if you want. In any case, be sure
+that it is obvious that the numbers are in hex, e.g.:
+
+.. code-block:: c
+
+ data_dump(uint8_t c1, uint8_t c2, uint8_t c3) "bytes (in hex): %02x %02x %02x"
+
+Rationale: hex numbers are hard to read in logs when there is no 0x prefix,
+especially when (occasionally) the representation doesn't contain any letters
+and especially in one line with other decimal numbers. Number groups are allowed
+to not use '0x' because for some things notations like %x.%x.%x are used not
+only in QEMU. Also dumping raw data bytes with '0x' is less readable.
+
+'#' printf flag
+---------------
+
+Do not use printf flag '#', like '%#x'.
+
+Rationale: there are two ways to add a '0x' prefix to a printed number: '0x%...'
+and '%#...'. For consistency only one way should be used. Arguments for
+'0x%' are:
+
+* it is more popular
+* '%#' omits the 0x for the value 0 which makes output inconsistent
diff --git a/docs/devel/submitting-a-patch.rst b/docs/devel/submitting-a-patch.rst
new file mode 100644
index 000000000..e51259eb9
--- /dev/null
+++ b/docs/devel/submitting-a-patch.rst
@@ -0,0 +1,562 @@
+.. _submitting-a-patch:
+
+Submitting a Patch
+==================
+
+QEMU welcomes contributions of code (either fixing bugs or adding new
+functionality). However, we get a lot of patches, and so we have some
+guidelines about submitting patches. If you follow these, you'll help
+make our task of code review easier and your patch is likely to be
+committed faster.
+
+This page seems very long, so if you are only trying to post a quick
+one-shot fix, the bare minimum we ask is that:
+
+- You **must** provide a Signed-off-by: line (this is a hard
+ requirement because it's how you say "I'm legally okay to contribute
+ this and happy for it to go into QEMU", modeled after the `Linux kernel
+ <http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__
+ policy.) ``git commit -s`` or ``git format-patch -s`` will add one.
+- All contributions to QEMU must be **sent as patches** to the
+ qemu-devel `mailing list <MailingLists>`__. Patch contributions
+ should not be posted on the bug tracker, posted on forums, or
+ externally hosted and linked to. (We have other mailing lists too,
+ but all patches must go to qemu-devel, possibly with a Cc: to another
+ list.) ``git send-email`` (`step-by-step setup
+ guide <https://git-send-email.io/>`__ and `hints and
+ tips <https://elixir.bootlin.com/linux/latest/source/Documentation/process/email-clients.rst>`__)
+ works best for delivering the patch without mangling it, but
+ attachments can be used as a last resort on a first-time submission.
+- You must read replies to your message, and be willing to act on them.
+ Note, however, that maintainers are often willing to manually fix up
+ first-time contributions, since there is a learning curve involved in
+ making an ideal patch submission.
+
+You do not have to subscribe to post (list policy is to reply-to-all to
+preserve CCs and keep non-subscribers in the loop on the threads they
+start), although you may find it easier as a subscriber to pick up good
+ideas from other posts. If you do subscribe, be prepared for a high
+volume of email, often over one thousand messages in a week. The list is
+moderated; first-time posts from an email address (whether or not you
+subscribed) may be subject to some delay while waiting for a moderator
+to whitelist your address.
+
+The larger your contribution is, or if you plan on becoming a long-term
+contributor, then the more important the rest of this page becomes.
+Reading the table of contents below should already give you an idea of
+the basic requirements. Use the table of contents as a reference, and
+read the parts that you have doubts about.
+
+.. contents:: Table of Contents
+
+.. _writing_your_patches:
+
+Writing your Patches
+--------------------
+
+.. _use_the_qemu_coding_style:
+
+Use the QEMU coding style
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can run *scripts/checkpatch.pl <patchfile>* before submitting to
+check that you are in compliance with our coding standards. Be aware
+that ``checkpatch.pl`` is not infallible, though, especially where C
+preprocessor macros are involved; use some common sense too. See also:
+
+- :ref:`coding-style`
+- `Automate a checkpatch run on
+ commit <https://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html>`__
+
+.. _base_patches_against_current_git_master:
+
+Base patches against current git master
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There's no point submitting a patch which is based on a released version
+of QEMU because development will have moved on from then and it probably
+won't even apply to master. We only apply selected bugfixes to release
+branches and then only as backports once the code has gone into master.
+
+It is also okay to base patches on top of other on-going work that is
+not yet part of the git master branch. To aid continuous integration
+tools, such as `patchew <http://patchew.org/QEMU/>`__, you should `add a
+tag <https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg01288.html>`__
+line ``Based-on: $MESSAGE_ID`` to your cover letter to make the series
+dependency obvious.
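+
+For example, the cover letter of the dependent series might end with a
+line such as this (the message ID shown is purely a placeholder)::
+
+  Based-on: <20220101120000.123456-1-someone@example.com>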
+
+.. _split_up_long_patches:
+
+Split up long patches
+~~~~~~~~~~~~~~~~~~~~~
+
+Split up longer patches into a patch series of logical code changes.
+Each change should compile and execute successfully. For instance, don't
+add a file to the makefile in patch one and then add the file itself in
+patch two. (This rule is here so that people can later use tools like
+`git bisect <http://git-scm.com/docs/git-bisect>`__ without hitting
+points in the commit history where QEMU doesn't work for reasons
+unrelated to the bug they're chasing.) Put documentation first, not
+last, so that someone reading the series can do a clean-room evaluation
+of the documentation, then validate that the code matches the
+documentation. A commit message that mentions "Also, ..." is often a
+good candidate for splitting into multiple patches. For more thoughts on
+properly splitting patches and writing good commit messages, see `this
+advice from
+OpenStack <https://wiki.openstack.org/wiki/GitCommitMessages>`__.
+
+.. _make_code_motion_patches_easy_to_review:
+
+Make code motion patches easy to review
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a series requires large blocks of code motion, there are tricks for
+making the refactoring easier to review. Split up the series so that
+semantic changes (or even function renames) are done in a separate patch
+from the raw code motion. Use a one-time setup of ``git config
+diff.renames true;`` ``git config diff.algorithm patience`` (refer to
+`git-config <http://git-scm.com/docs/git-config>`__). The 'diff.renames'
+property ensures file rename patches will be given in a more compact
+representation that focuses only on the differences across the file
+rename, instead of showing the entire old file as a deletion and the new
+file as an insertion. Meanwhile, the 'diff.algorithm' property reduces
+churn when you extract a non-contiguous subset of one file into a new
+file with all the extracted parts in the same order both before and
+after the patch, by avoiding diffs that treat unrelated ``}`` lines in
+the original file as separators between hunks of changes.
+
+Ideally, a code motion patch can be reviewed by doing::
+
+ git format-patch --stdout -1 > patch;
+ diff -u <(sed -n 's/^-//p' patch) <(sed -n 's/^\+//p' patch)
+
+to focus on the few changes that weren't wholesale code motion.
+
+.. _dont_include_irrelevant_changes:
+
+Don't include irrelevant changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In particular, don't include formatting, coding style or whitespace
+changes to bits of code that would otherwise not be touched by the
+patch. (It's OK to fix coding style issues in the immediate area (few
+lines) of the lines you're changing.) If you think a section of code
+really does need a reindent or other large-scale style fix, submit this
+as a separate patch which makes no semantic changes; don't put it in the
+same patch as your bug fix.
+
+For smaller patches in less frequently changed areas of QEMU, consider
+using the :ref:`trivial-patches` process.
+
+.. _write_a_meaningful_commit_message:
+
+Write a meaningful commit message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Commit messages should be meaningful and should stand on their own as a
+historical record of why the changes you applied were necessary or
+useful.
+
+QEMU follows the usual standard for git commit messages: the first line
+(which becomes the email subject line) is "subsystem: single line
+summary of change". Whether the "single line summary of change" starts
+with a capital is a matter of taste, but we prefer that the summary does
+not end in a dot. Look at ``git shortlog -30`` for an idea of sample
+subject lines. Then there is a blank line and a more detailed
+description of the patch, another blank and your Signed-off-by: line.
+Please do not use lines that are longer than 76 characters in your
+commit message (so that the text still shows up nicely with "git show"
+in an 80-column terminal window).
+
+The body of the commit message is a good place to document why your
+change is important. Don't include comments like "This is a suggestion
+for fixing this bug" (they can go below the ``---`` line in the email so
+they don't go into the final commit message). Make sure the body of the
+commit message can be read in isolation even if the reader's mailer
+displays the subject line some distance apart (that is, a body that
+starts with "... so that" as a continuation of the subject line is
+harder to follow).
+
+If your patch fixes a commit that is already in the repository, please
+add an additional line with "Fixes: <at-least-12-digits-of-SHA-commit-id>
+("Fixed commit subject")" below the patch description / before your
+"Signed-off-by:" line in the commit message.
+
+If your patch fixes a bug in the GitLab bug tracker, please add a line
+with "Resolves: <URL-of-the-bug>" to the commit message, too. GitLab can
+close bugs automatically once commits with the "Resolves:" keyword get
+merged into the master branch of the project. And if your patch addresses
+a bug in another public bug tracker, you can also use a line with
+"Buglink: <URL-of-the-bug>" for reference here, too.
+
+Example::
+
+ Fixes: 14055ce53c2d ("s390x/tcg: avoid overflows in time2tod/tod2time")
+ Resolves: https://gitlab.com/qemu-project/qemu/-/issues/42
+ Buglink: https://bugs.launchpad.net/qemu/+bug/1804323
+
+Some other tags that are used in commit messages include "Message-Id:",
+"Tested-by:", "Acked-by:", "Reported-by:" and "Suggested-by:". Search
+``git log`` for these keywords to see examples of how they are used.
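+
+Putting these conventions together, a complete commit message might look
+like the following; the subject, hash, issue number and names are purely
+illustrative.
+
+Example::
+
+  subsystem: fix frobnication of the widget device
+
+  The widget device model frobnicated twice on reset, which confused
+  guests that expect a single frobnication. Only frobnicate once.
+
+  Fixes: 0123456789ab ("subsystem: add the widget device")
+  Resolves: https://gitlab.com/qemu-project/qemu/-/issues/NNN
+  Signed-off-by: Your Name <your.name@example.com>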
+
+.. _test_your_patches:
+
+Test your patches
+~~~~~~~~~~~~~~~~~
+
+Although QEMU has continuous integration services that attempt to test
+patches submitted to the list, it still saves everyone time if you have
+already tested that your patch compiles and works. Because QEMU is such
+a large project, it's okay to use configure arguments to limit what is
+built for faster turnaround during your development time; but it is
+still wise to also check that your patches work with a full build before
+submitting a series, especially if your changes might have an unintended
+effect on other areas of the code you don't normally experiment with.
+See the testing documentation for more details on what tests are available.
+Also, it is wise to include a testsuite addition as part of your
+patches - either to ensure that future changes won't regress your new
+feature, or to add a test which exposes the bug that the rest of your
+series fixes. Keeping separate commits for the test and the fix allows
+reviewers to rebase the test to occur first to prove it catches the
+problem, then again to place it last in the series so that bisection
+doesn't land on a known-broken state.
+
+.. _submitting_your_patches:
+
+Submitting your Patches
+-----------------------
+
+.. _if_you_cannot_send_patch_emails:
+
+If you cannot send patch emails
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In rare cases it may not be possible to send properly formatted patch
+emails. You can use `sourcehut <https://sourcehut.org/>`__ to send your
+patches to the QEMU mailing list by following these steps:
+
+#. Register or sign in to your account
+#. Add your SSH public key in `meta \|
+ keys <https://meta.sr.ht/keys>`__.
+#. Publish your git branch using **git push git@git.sr.ht:~USERNAME/qemu
+ HEAD**
+#. Send your patches to the QEMU mailing list using the web-based
+ ``git-send-email`` UI at https://git.sr.ht/~USERNAME/qemu/send-email
+
+`This video
+<https://spacepub.space/videos/watch/ad258d23-0ac6-488c-83fc-2bacf578de3a>`__
+shows the web-based ``git-send-email`` workflow. Documentation is
+available `here
+<https://man.sr.ht/git.sr.ht/#sending-patches-upstream>`__.
+
+.. _cc_the_relevant_maintainer:
+
+CC the relevant maintainer
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send patches both to the mailing list and CC the maintainer(s) of the
+files you are modifying. Look in the MAINTAINERS file to find out who
+that is. Also try using scripts/get_maintainer.pl from the repository
+to learn who the most common committers are for the files you touched.
+
+Example::
+
+ ~/src/qemu/scripts/get_maintainer.pl -f hw/ide/core.c
+
+In fact, you can automate this, via a one-time setup of ``git config
+sendemail.cccmd 'scripts/get_maintainer.pl --nogit-fallback'`` (Refer to
+`git-config <http://git-scm.com/docs/git-config>`__.)
+
+.. _do_not_send_as_an_attachment:
+
+Do not send as an attachment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send patches inline so they are easy to reply to with review comments.
+Do not put patches in attachments.
+
+.. _use_git_format_patch:
+
+Use ``git format-patch``
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the right diff format.
+`git format-patch <http://git-scm.com/docs/git-format-patch>`__ will
+produce patch emails in the right format (check the documentation to
+find out how to drive it). You can then edit the cover letter before
+using ``git send-email`` to mail the files to the mailing list. (We
+recommend `git send-email <http://git-scm.com/docs/git-send-email>`__
+because mail clients often mangle patches by wrapping long lines or
+messing up whitespace. Some distributions do not include send-email in a
+default install of git; you may need to download additional packages,
+such as 'git-email' on Fedora-based systems.) Patch series need a cover
+letter, with shallow threading (all patches in the series are
+in-reply-to the cover letter, but not to each other); single unrelated
+patches do not need a cover letter (but if you do send a cover letter,
+use ``--numbered`` so the cover and the patch have distinct subject lines).
+Patches are easier to find if they start a new top-level thread, rather
+than being buried in-reply-to another existing thread.
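+
+For example, a typical invocation for a series looks something like this
+(adjust the revision range to match your own branch)::
+
+  git format-patch --cover-letter --numbered -o outgoing/ origin/master..
+  # edit outgoing/0000-cover-letter.patch to describe the series
+  git send-email --to=qemu-devel@nongnu.org outgoing/*.patch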
+
+.. _avoid_posting_large_binary_blob:
+
+Avoid posting large binary blobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you added binaries to the repository, consider producing the patch
+emails using ``git format-patch --no-binary`` and including a link to a
+git repository to fetch the original commit.
+
+.. _patch_emails_must_include_a_signed_off_by_line:
+
+Patch emails must include a ``Signed-off-by:`` line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For more information see `SubmittingPatches 1.12
+<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__.
+This is vital or we will not be able to apply your patch! Please use
+your real name to sign a patch (not an alias or acronym).
+
+If you wrote the patch, make sure your "From:" and "Signed-off-by:"
+lines use the same spelling. It's okay if you subscribe or contribute to
+the list via more than one address, but using multiple addresses in one
+commit just confuses things. If someone else wrote the patch, git will
+include a "From:" line in the body of the email (different from your
+envelope From:) that will give credit to the correct author; but again,
+that author's Signed-off-by: line is mandatory, with the same spelling.
+
+.. _include_a_meaningful_cover_letter:
+
+Include a meaningful cover letter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is a requirement for any series with multiple patches (as it aids
+continuous integration), but optional for an isolated patch. The cover
+letter explains the overall goal of such a series, and also provides a
+convenient 0/N email for others to reply to the series as a whole. A
+one-time setup of ``git config format.coverletter auto`` (refer to
+`git-config <http://git-scm.com/docs/git-config>`__) will generate the
+cover letter as needed.
+
+When reviewers don't know your goal at the start of their review, they
+may object to early changes that don't make sense until the end of the
+series, because they do not have enough context yet at that point of
+their review. A series where the goal is unclear also risks a higher
+number of review-fix cycles because the reviewers haven't bought into
+the idea yet. If the cover letter can explain these points to the
+reviewer, the process will be smoother and patches will get merged faster.
+Make sure your cover letter includes a diffstat of changes made over the
+entire series; potential reviewers know what files they are interested
+in, and they need an easy way to determine if your series touches them.
+
+.. _use_the_rfc_tag_if_needed:
+
+Use the RFC tag if needed
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For example, "[PATCH RFC v2]". ``git format-patch --subject-prefix=RFC``
+can help.
+
+"RFC" means "Request For Comments" and is a statement that you don't
+intend for your patchset to be applied to master, but would like some
+review on it anyway. Reasons for doing this include:
+
+- the patch depends on some pending kernel changes which haven't yet
+ been accepted, so the QEMU patch series is blocked until that
+ dependency has been dealt with, but is worth reviewing anyway
+- the patch set is not finished yet (perhaps it doesn't cover all use
+ cases or work with all targets) but you want early review of a major
+ API change or design structure before continuing
+
+In general, since it's asking other people to do review work on a
+patchset that the submitter themselves is saying shouldn't be applied,
+it's best to:
+
+- use it sparingly
+- in the cover letter, be clear about why a patch is an RFC, what areas
+ of the patchset you're looking for review on, and why reviewers
+ should care
+
+.. _consider_whether_your_patch_is_applicable_for_stable:
+
+Consider whether your patch is applicable for stable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If your patch fixes a severe issue or a regression, it may be applicable
+for stable. In that case, consider adding ``Cc: qemu-stable@nongnu.org``
+to your patch to notify the stable maintainers.
+
+For more details on how QEMU's stable process works, refer to the
+:ref:`stable-process` page.
+
+.. _participating_in_code_review:
+
+Participating in Code Review
+----------------------------
+
+All patches submitted to the QEMU project go through a code review
+process before they are accepted. Some areas of code that are well
+maintained may review patches quickly; lesser-loved areas of code may
+have a longer delay.
+
+.. _stay_around_to_fix_problems_raised_in_code_review:
+
+Stay around to fix problems raised in code review
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not many patches get into QEMU straight away -- it is quite common that
+developers will identify bugs, or suggest a cleaner approach, or even
+just point out code style issues or commit message typos. You'll need to
+respond to these, and then send a second version of your patches with
+the issues fixed. This takes a little time and effort on your part, but
+if you don't do it then your changes will never get into QEMU. It's also
+just polite -- it is quite disheartening for a developer to spend time
+reviewing your code and suggesting improvements, only to find that
+you're not going to do anything further and it was all wasted effort.
+
+When replying to comments on your patches **reply to all and not just
+the sender** -- keeping discussion on the mailing list means everybody
+can follow it.
+
+.. _pay_attention_to_review_comments:
+
+Pay attention to review comments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Someone took their time to review your work, and it pays to respect that
+effort; repeatedly submitting a series without addressing all comments
+from the previous round tends to alienate reviewers and stall your
+patch. Reviewers aren't always perfect, so it is okay if you want to
+argue that your code was correct in the first place instead of blindly
+doing everything the reviewer asked. On the other hand, if someone
+pointed out a potential issue during review, then even if your code
+turns out to be correct, it's probably a sign that you should improve
+your commit message and/or comments in the code explaining why the code
+is correct.
+
+If you fix issues that are raised during review **resend the entire
+patch series** not just the one patch that was changed. This allows
+maintainers to easily apply the fixed series without having to manually
+identify which patches are relevant. Send the new version as a complete
+fresh email or series of emails -- don't try to make it a followup to
+version 1. (This helps automatic patch email handling tools distinguish
+between v1 and v2 emails.)
+
+.. _when_resending_patches_add_a_version_tag:
+
+When resending patches add a version tag
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All patches beyond the first version should include a version tag -- for
+example, "[PATCH v2]". This means people can easily identify whether
+they're looking at the most recent version. (The first version of a
+patch need not say "v1", just [PATCH] is sufficient.) For patch series,
+the version applies to the whole series -- even if you only change one
+patch, you resend the entire series and mark it as "v2". Don't try to
+track versions of different patches in the series separately. `git
+format-patch <http://git-scm.com/docs/git-format-patch>`__ and `git
+send-email <http://git-scm.com/docs/git-send-email>`__ both understand
+the ``-v2`` option to make this easier. Send each new revision as a new
+top-level thread, rather than burying it in-reply-to an earlier
+revision, as many reviewers are not looking inside deep threads for new
+patches.
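+
+For example, the second revision of a series can be prepared with::
+
+  git format-patch -v2 --cover-letter -o v2/ origin/master..
+
+This will produce emails whose subject lines start with "[PATCH v2 ...]".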
+
+.. _include_version_history_in_patchset_revisions:
+
+Include version history in patchset revisions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For later versions of patches, include a summary of changes from
+previous versions, but not in the commit message itself. In an email
+formatted as a git patch, the commit message is the part above the ``---``
+line, and this will go into the git changelog when the patch is
+committed. This part should be a self-contained description of what this
+version of the patch does, written to make sense to anybody who comes
+back to look at this commit in git in six months' time. The part below
+the ``---`` line and above the patch proper (git format-patch puts the
+diffstat here) is a good place to put remarks for people reading the
+patch email, and this is where the "changes since previous version"
+summary belongs. The `git-publish
+<https://github.com/stefanha/git-publish>`__ script can help with
+tracking a good summary across versions. Also, the `git-backport-diff
+<https://github.com/codyprime/git-scripts>`__ script can help focus
+reviewers on what changed between revisions.
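+
+The overall layout of a later revision of a patch email is therefore
+roughly as follows (contents abbreviated and purely illustrative)::
+
+  Subject: [PATCH v2] subsystem: single line summary of change
+
+  Self-contained commit message explaining what the patch does and why.
+
+  Signed-off-by: Your Name <your.name@example.com>
+  ---
+  v2: fixed the error path and reworded the comment as suggested in
+      review of v1
+
+   file.c | 4 ++--
+   1 file changed, 2 insertions(+), 2 deletions(-)
+
+  diff --git a/file.c b/file.c
+  ...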
+
+.. _tips_and_tricks:
+
+Tips and Tricks
+---------------
+
+.. _proper_use_of_reviewed_by_tags_can_aid_review:
+
+Proper use of Reviewed-by: tags can aid review
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When reviewing a large series, a reviewer can reply to some of the
+patches with a Reviewed-by tag, stating that they are happy with that
+patch in isolation (sometimes conditional on minor cleanup, like fixing
+whitespace, that doesn't affect code content). You should then update
+those commit messages by hand to include the Reviewed-by tag, so that in
+the next revision, reviewers can spot which patches were already clean
+from the previous round. Conversely, if you significantly modify a patch
+that was previously reviewed, remove the Reviewed-by tag from the
+commit message and list the changes from the previous version, to
+make it easier to focus a reviewer's attention on your changes.
+
+.. _if_your_patch_seems_to_have_been_ignored:
+
+If your patch seems to have been ignored
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If your patchset has received no replies you should "ping" it after a
+week or two, by sending an email as a reply-to-all to the patch mail,
+including the word "ping" and ideally also a link to the page for the
+patch on `patchew <https://patchew.org/QEMU/>`__ or
+`lore.kernel.org <https://lore.kernel.org/qemu-devel/>`__. It's worth
+double-checking for reasons why your patch might have been ignored
+(forgot to CC the maintainer? annoyed people by failing to respond to
+review comments on an earlier version?), but often, for less-maintained
+areas of QEMU, patches do just slip through the cracks. If your ping is
+also ignored, ping again after another week or so. As the submitter, you
+are the person with the most motivation to get your patch applied, so
+you have to be persistent.
+
+.. _is_my_patch_in:
+
+Is my patch in?
+~~~~~~~~~~~~~~~
+
+QEMU has some Continuous Integration machines that try to catch patch
+submission problems as soon as possible. `patchew
+<http://patchew.org/QEMU/>`__ includes a web interface for tracking the
+status of various threads that have been posted to the list, and may
+send you an automated mail if it detected a problem with your patch.
+
+Once your patch has had enough review on list, the maintainer for that
+area of code will send notification to the list that they are including
+your patch in a particular staging branch. Periodically, the maintainer
+then takes care of :ref:`submitting-a-pull-request`
+for aggregating topic branches into mainline QEMU. Generally, you do not
+need to send a pull request unless you have contributed enough patches
+to become a maintainer over a particular section of code. Maintainers
+may further modify your commit, by resolving simple merge conflicts or
+fixing minor typos pointed out during review, but will always add a
+Signed-off-by line in addition to yours, indicating that it went through
+their tree. Occasionally, the maintainer's pull request may hit more
+difficult merge conflicts, where you may be requested to help rebase and
+resolve the problems. It may take a couple of weeks between when your
+patch first had a positive review to when it finally lands in qemu.git;
+release cycle freezes may extend that time even longer.
+
+.. _return_the_favor:
+
+Return the favor
+~~~~~~~~~~~~~~~~
+
+Peer review only works if everyone chips in a bit of review time. If
+everyone submitted more patches than they reviewed, we would have a
+patch backlog. A good goal is to try to review at least as many patches
+from others as what you submit. Don't worry if you don't know the code
+base as well as a maintainer; it's perfectly fine to admit when your
+review is weak because you are unfamiliar with the code.
diff --git a/docs/devel/submitting-a-pull-request.rst b/docs/devel/submitting-a-pull-request.rst
new file mode 100644
index 000000000..c9d1e8afd
--- /dev/null
+++ b/docs/devel/submitting-a-pull-request.rst
@@ -0,0 +1,77 @@
+.. _submitting-a-pull-request:
+
+Submitting a Pull Request
+=========================
+
+QEMU welcomes contributions of code, but we generally expect these to be
+sent as simple patch emails to the mailing list (see our page on
+:ref:`submitting-a-patch`
+for more details). Generally only existing submaintainers of a tree
+will need to submit pull requests, although occasionally for a large
+patch series we might ask a submitter to send a pull request. This page
+documents our recommendations on pull requests for those people.
+
+A good rule of thumb is not to send a pull request unless somebody asks
+you to.
+
+**Resend the patches with the pull request** as emails which are
+threaded as follow-ups to the pull request itself. The simplest way to
+do this is to use ``git format-patch --cover-letter`` to create the
+emails, and then edit the cover letter to include the pull request
+details that ``git request-pull`` outputs.
+
+**Use PULL as the subject line tag** in both the cover letter and the
+retransmitted patch mails (for example, by using
+``--subject-prefix=PULL`` in your ``git format-patch`` command). This
+helps people to filter in or out the resulting emails (especially useful
+if they are only CC'd on one email out of the set).
+
+**Each patch must have your own Signed-off-by: line** as well as that of
+the original author if the patch was not written by you. This is because
+with a pull request you're now indicating that the patch has passed via
+you rather than directly from the original author.
+
+**Don't forget to add Reviewed-by: and Acked-by: lines**. When other
+people have reviewed the patches you're putting in the pull request,
+make sure you've copied their signoffs across. (If you use the `patches
+tool <https://github.com/stefanha/patches>`__ to add patches from email
+directly to your git repo it will include the tags automatically; if
+you're updating patches manually or in some other way you'll need to
+edit the commit messages by hand.)
+
+**Don't send pull requests for code that hasn't passed review**. A pull
+request says these patches are ready to go into QEMU now, so they must
+have passed the standard code review processes. In particular if you've
+corrected issues in one round of code review, you need to send your
+fixed patch series as normal to the list; you can't put it in a pull
+request until it's gone through. (Extremely trivial fixes may be OK to
+just fix in passing, but if in doubt err on the side of not.)
+
+**Test before sending**. This is an obvious thing to say, but make sure
+everything builds (including that it compiles at each step of the patch
+series) and that "make check" passes before sending out the pull
+request. As a submaintainer you're one of QEMU's lines of defense
+against bad code, so double check the details.
+
+**All pull requests must be signed**. If your key is not already signed
+by members of the QEMU community, you should make arrangements to attend
+a `KeySigningParty <https://wiki.qemu.org/KeySigningParty>`__ (for
+example at KVM Forum) or make alternative arrangements to have your key
+signed by an attendee. Key signing requires meeting another community
+member *in person*, so please make appropriate arrangements. By
+"signed" here we mean that the pullreq email should quote a tag which is
+a GPG-signed tag (as created with 'git tag -s ...').
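+
+An illustrative sequence for creating and publishing such a signed tag
+(the tag name, remote and URL below are only examples) is::
+
+  git tag -s pull-subsystem-20220101 -m "subsystem patches"
+  git push publish-remote pull-subsystem-20220101
+  git request-pull master https://example.org/user/qemu.git pull-subsystem-20220101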
+
+**Pull requests not for master should say "not for master" and have
+"PULL SUBSYSTEM whatever" in the subject tag**. If your pull request is
+targeting a stable branch or some submaintainer tree, please include the
+string "not for master" in the cover letter email, and make sure the
+subject tag is "PULL SUBSYSTEM s390/block/whatever" rather than just
+"PULL". This allows it to be automatically filtered out of the set of
+pull requests that should be applied to master.
+
+You might be interested in the `make-pullreq
+<https://git.linaro.org/people/peter.maydell/misc-scripts.git/tree/make-pullreq>`__
+script which automates some of this process for you and includes a few
+sanity checks. Note that you must edit it to configure it suitably for
+your local situation!
diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
new file mode 100644
index 000000000..50c8e8dab
--- /dev/null
+++ b/docs/devel/tcg-icount.rst
@@ -0,0 +1,94 @@
+..
+ Copyright (c) 2020, Linaro Limited
+ Written by Alex Bennée
+
+
+========================
+TCG Instruction Counting
+========================
+
+TCG has long supported a feature known as icount which allows for
+instruction counting during execution. This should not be confused
+with cycle accurate emulation - QEMU does not attempt to emulate how
+long an instruction would take on real hardware. That is a job for
+other more detailed (and slower) tools that simulate the rest of a
+micro-architecture.
+
+This feature is only available for system emulation and is
+incompatible with multi-threaded TCG. It can be used to better align
+execution time with wall-clock time so a "slow" device doesn't run too
+fast on modern hardware. It also provides a degree of
+deterministic execution and is an essential part of the record/replay
+support in QEMU.
+
+Core Concepts
+=============
+
+At its heart icount is simply a count of executed instructions which
+is stored in the TimersState of QEMU's timer sub-system. The number of
+executed instructions can then be used to calculate QEMU_CLOCK_VIRTUAL
+which represents the amount of elapsed time in the system since
+execution started. Depending on the icount mode this may either be a
+fixed number of ns per instruction or adjusted as execution continues
+to keep wall clock time and virtual time in sync.
+
+To be able to calculate the number of executed instructions the
+translator starts by allocating a budget of instructions to be
+executed. The budget of instructions is limited by how long it will be
+until the next timer expires. We store this budget as part of the
+vCPU's icount_decr field, which is shared with the machinery for handling
+cpu_exit(). The whole field is checked at the start of every
+translated block and will cause a return to the outer loop to deal
+with whatever caused the exit.
+
+In the case of icount, before the flag is checked we subtract the
+number of instructions the translation block would execute. If this
+would cause the instruction budget to go negative, we exit the main
+loop and regenerate a new translation block with exactly the right
+number of instructions to take the budget to 0, meaning whatever timer
+was due to expire will expire exactly when we exit the main run loop.
+
+Dealing with MMIO
+-----------------
+
+While we can adjust the instruction budget for known events like timer
+expiry we cannot do the same for MMIO. Every load/store we execute
+might potentially trigger an I/O event, at which point we will need an
+up to date and accurate reading of the icount number.
+
+To deal with this case, when an I/O access is made we:
+
+ - restore un-executed instructions to the icount budget
+ - re-compile a single [1]_ instruction block for the current PC
+ - exit the cpu loop and execute the re-compiled block
+
+The new block is created with the CF_LAST_IO compile flag which
+ensures the final instruction translation starts with a call to
+gen_io_start() so we don't enter a perpetual loop constantly
+recompiling a single instruction block. For translators using the
+common translator_loop this is done automatically.
+
+.. [1] sometimes two instructions if dealing with delay slots
+
+Other I/O operations
+--------------------
+
+MMIO isn't the only type of operation for which we might need a
+correct and accurate clock. IO port instructions and accesses to
+system registers are the common examples here. These instructions have
+to be handled by the individual translators which have the knowledge
+of which operations are I/O operations.
+
+When the translator is handling an instruction of this kind:
+
+* it must call gen_io_start() if icount is enabled, at some
+ point before the generation of the code which actually does
+ the I/O, using a code fragment similar to:
+
+.. code:: c
+
+ if (tb_cflags(s->base.tb) & CF_USE_ICOUNT) {
+ gen_io_start();
+ }
+
+* it must end the TB immediately after this instruction
diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst
new file mode 100644
index 000000000..f93ef4fe5
--- /dev/null
+++ b/docs/devel/tcg-plugins.rst
@@ -0,0 +1,438 @@
+..
+ Copyright (C) 2017, Emilio G. Cota <cota@braap.org>
+ Copyright (c) 2019, Linaro Limited
+ Written by Emilio Cota and Alex Bennée
+
+QEMU TCG Plugins
+================
+
+QEMU TCG plugins provide a way for users to run experiments taking
+advantage of the total system control that emulation has over a guest.
+The plugin API provides a mechanism for plugins to subscribe to events
+during translation and execution and optionally to call back into the
+plugin during these events. TCG plugins cannot change the system
+state; they can only monitor it passively. However, they can do this
+down to individual instruction granularity, including potentially
+subscribing to all load and store operations.
+
+Usage
+-----
+
+Any QEMU binary with TCG support has plugins enabled by default.
+In earlier releases plugin support needed to be enabled explicitly with::
+
+ configure --enable-plugins
+
+Once built, a program can be run with multiple plugins loaded, each with
+their own arguments::
+
+ $QEMU $OTHER_QEMU_ARGS \
+ -plugin tests/plugin/libhowvec.so,inline=on,count=hint \
+ -plugin tests/plugin/libhotblocks.so
+
+Arguments are plugin specific and can be used to modify their
+behaviour. In this case the howvec plugin is being asked to use inline
+ops to count and break down the hint instructions by type.
+
+Writing plugins
+---------------
+
+API versioning
+~~~~~~~~~~~~~~
+
+This is a new feature for QEMU and it does allow people to develop
+out-of-tree plugins that can be dynamically linked into a running QEMU
+process. However the project reserves the right to change or break the
+API should it need to do so. The best way to avoid this is to submit
+your plugin upstream so it can be updated if/when the API changes.
+
+All plugins need to declare a symbol which exports the plugin API
+version they were built against. This can be done simply by::
+
+ QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+
+The core code will refuse to load a plugin that doesn't export a
+``qemu_plugin_version`` symbol or whose version is outside of QEMU's
+supported range of API versions.
+
+Additionally the ``qemu_info_t`` structure which is passed to the
+``qemu_plugin_install`` method of a plugin will detail the minimum and
+current API versions supported by QEMU. The API version will be
+incremented if new APIs are added. The minimum API version will be
+incremented if existing APIs are changed or removed.
+
+Lifetime of the query handle
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each callback provides an opaque anonymous information handle which
+can usually be further queried to find out information about a
+translation, instruction or operation. The handles themselves are only
+valid during the lifetime of the callback so it is important that any
+information that is needed is extracted during the callback and saved
+by the plugin.
+
+Plugin life cycle
+~~~~~~~~~~~~~~~~~
+
+First the plugin is loaded and the public qemu_plugin_install function
+is called. The plugin will then register callbacks for various plugin
+events. Generally plugins will register a handler for the *atexit* event
+if they want to dump a summary of collected information once the
+program/system has finished running.
+
+When a registered event occurs the plugin callback is invoked. The
+callbacks may provide additional information. In the case of a
+translation event the plugin has an option to enumerate the
+instructions in a block of instructions and optionally register
+callbacks to some or all instructions when they are executed.
+
+There is also a facility to add an inline event where code to
+increment a counter can be directly inlined with the translation.
+Currently only a simple increment is supported. This is not atomic so
+can miss counts. If you want absolute precision you should use a
+callback which can then ensure atomicity itself.
+
+Finally when QEMU exits all the registered *atexit* callbacks are
+invoked.
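+
+As an illustration of this life cycle, below is a minimal sketch of a
+plugin that counts executed translation blocks. It is modelled loosely
+on the in-tree example plugins; consult ``include/qemu/qemu-plugin.h``
+and the plugins under ``tests/plugin`` for the authoritative API.
+
+.. code:: c
+
+   #include <inttypes.h>
+   #include <glib.h>
+   #include <qemu-plugin.h>
+
+   QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+
+   static uint64_t bb_count;
+
+   /* Called every time a translated block is executed. */
+   static void vcpu_tb_exec(unsigned int cpu_index, void *udata)
+   {
+       bb_count++; /* simple increment: not atomic, so counts may be missed */
+   }
+
+   /* Called at translation time: subscribe to execution of this block. */
+   static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
+   {
+       qemu_plugin_register_vcpu_tb_exec_cb(tb, vcpu_tb_exec,
+                                            QEMU_PLUGIN_CB_NO_REGS, NULL);
+   }
+
+   /* atexit callback: dump a summary of the collected information. */
+   static void plugin_exit(qemu_plugin_id_t id, void *p)
+   {
+       g_autofree gchar *report =
+           g_strdup_printf("blocks executed: %" PRIu64 "\n", bb_count);
+       qemu_plugin_outs(report);
+   }
+
+   QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
+                                              const qemu_info_t *info,
+                                              int argc, char **argv)
+   {
+       qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
+       qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
+       return 0;
+   }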
+
+Exposure of QEMU internals
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The plugin architecture actively avoids leaking implementation details
+about how QEMU's translation works to the plugins. While there are
+concepts such as translation time and translation blocks, the
+details are opaque to plugins. The plugin is able to query select
+details of instructions and system configuration only through the
+exported *qemu_plugin* functions.
+
+API
+~~~
+
+.. kernel-doc:: include/qemu/qemu-plugin.h
+
+Internals
+---------
+
+Locking
+~~~~~~~
+
+We have to ensure we cannot deadlock, particularly under MTTCG. For
+this we acquire a lock when called from plugin code. We also keep the
+list of callbacks under RCU so that we do not have to hold the lock
+when calling the callbacks. This is also for performance, since some
+callbacks (e.g. memory access callbacks) might be called very
+frequently.
+
+ * A consequence of this is that we keep our own list of CPUs, so that
+ we do not have to worry about locking order wrt cpu_list_lock.
+ * Use a recursive lock, since we can get registration calls from
+ callbacks.
+
+As a result registering/unregistering callbacks is "slow", since it
+takes a lock. But this is very infrequent; we want performance when
+calling (or not calling) callbacks, not when registering them. Using
+RCU is great for this.
+
+We support the uninstallation of a plugin at any time (e.g. from
+plugin callbacks). This allows plugins to remove themselves if they no
+longer want to instrument the code. This operation is asynchronous
+which means callbacks may still occur after the uninstall operation is
+requested. The plugin isn't completely uninstalled until the safe work
+has executed while all vCPUs are quiescent.
+
+Example Plugins
+---------------
+
+There are a number of plugins included with QEMU and you are
+encouraged to contribute your own plugins upstream. There is a
+``contrib/plugins`` directory where they can go.
+
+- tests/plugins
+
+These are some basic plugins that are used to test and exercise the
+API during the ``make check-tcg`` target.
+
+- contrib/plugins/hotblocks.c
+
+The hotblocks plugin allows you to examine where the hot paths of
+execution are in your program. Once the program has finished you will
+get a sorted list of blocks reporting the starting PC, translation
+count, number of instructions and execution count. This will work best
+with linux-user execution as system emulation tends to generate
+re-translations as blocks from different programs get swapped in and
+out of system memory.
+
+If your program is single-threaded you can use the ``inline`` option for
+slightly faster (but not thread safe) counters.
+
+Example::
+
+ ./aarch64-linux-user/qemu-aarch64 \
+ -plugin contrib/plugins/libhotblocks.so -d plugin \
+ ./tests/tcg/aarch64-linux-user/sha1
+ SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+ collected 903 entries in the hash table
+ pc, tcount, icount, ecount
+ 0x0000000041ed10, 1, 5, 66087
+ 0x000000004002b0, 1, 4, 66087
+ ...
+
+- contrib/plugins/hotpages.c
+
+Similar to hotblocks but this time tracks memory accesses::
+
+ ./aarch64-linux-user/qemu-aarch64 \
+ -plugin contrib/plugins/libhotpages.so -d plugin \
+ ./tests/tcg/aarch64-linux-user/sha1
+ SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+ Addr, RCPUs, Reads, WCPUs, Writes
+ 0x000055007fe000, 0x0001, 31747952, 0x0001, 8835161
+ 0x000055007ff000, 0x0001, 29001054, 0x0001, 8780625
+ 0x00005500800000, 0x0001, 687465, 0x0001, 335857
+ 0x0000000048b000, 0x0001, 130594, 0x0001, 355
+ 0x0000000048a000, 0x0001, 1826, 0x0001, 11
+
+The hotpages plugin can be configured using the following arguments:
+
+ * sortby=reads|writes|address
+
+ Log the data sorted by either the number of reads, the number of writes, or
+ memory address. (Default: entries are sorted by the sum of reads and writes)
+
+ * io=on
+
+ Track IO addresses. Only relevant to full system emulation. (Default: off)
+
+ * pagesize=N
+
+ The page size used. (Default: N = 4096)
+
+- contrib/plugins/howvec.c
+
+This is an instruction classifier, so it can be used to count different
+types of instructions. It has a number of options to refine which get
+counted. You can give a value to the ``count`` argument for a class of
+instructions to break it down fully, so for example to see all the system
+registers accesses::
+
+ ./aarch64-softmmu/qemu-system-aarch64 $(QEMU_ARGS) \
+ -append "root=/dev/sda2 systemd.unit=benchmark.service" \
+ -smp 4 -plugin ./contrib/plugins/libhowvec.so,count=sreg -d plugin
+
+which will lead to a sorted list after the class breakdown::
+
+ Instruction Classes:
+ Class: UDEF not counted
+ Class: SVE (68 hits)
+ Class: PCrel addr (47789483 hits)
+ Class: Add/Sub (imm) (192817388 hits)
+ Class: Logical (imm) (93852565 hits)
+ Class: Move Wide (imm) (76398116 hits)
+ Class: Bitfield (44706084 hits)
+ Class: Extract (5499257 hits)
+ Class: Cond Branch (imm) (147202932 hits)
+ Class: Exception Gen (193581 hits)
+ Class: NOP not counted
+ Class: Hints (6652291 hits)
+ Class: Barriers (8001661 hits)
+ Class: PSTATE (1801695 hits)
+ Class: System Insn (6385349 hits)
+ Class: System Reg counted individually
+ Class: Branch (reg) (69497127 hits)
+ Class: Branch (imm) (84393665 hits)
+ Class: Cmp & Branch (110929659 hits)
+ Class: Tst & Branch (44681442 hits)
+ Class: AdvSimd ldstmult (736 hits)
+ Class: ldst excl (9098783 hits)
+ Class: Load Reg (lit) (87189424 hits)
+ Class: ldst noalloc pair (3264433 hits)
+ Class: ldst pair (412526434 hits)
+ Class: ldst reg (imm) (314734576 hits)
+ Class: Loads & Stores (2117774 hits)
+ Class: Data Proc Reg (223519077 hits)
+ Class: Scalar FP (31657954 hits)
+ Individual Instructions:
+ Instr: mrs x0, sp_el0 (2682661 hits) (op=0xd5384100/ System Reg)
+ Instr: mrs x1, tpidr_el2 (1789339 hits) (op=0xd53cd041/ System Reg)
+ Instr: mrs x2, tpidr_el2 (1513494 hits) (op=0xd53cd042/ System Reg)
+ Instr: mrs x0, tpidr_el2 (1490823 hits) (op=0xd53cd040/ System Reg)
+ Instr: mrs x1, sp_el0 (933793 hits) (op=0xd5384101/ System Reg)
+ Instr: mrs x2, sp_el0 (699516 hits) (op=0xd5384102/ System Reg)
+ Instr: mrs x4, tpidr_el2 (528437 hits) (op=0xd53cd044/ System Reg)
+ Instr: mrs x30, ttbr1_el1 (480776 hits) (op=0xd538203e/ System Reg)
+ Instr: msr ttbr1_el1, x30 (480713 hits) (op=0xd518203e/ System Reg)
+ Instr: msr vbar_el1, x30 (480671 hits) (op=0xd518c01e/ System Reg)
+ ...
+
+To find the argument shorthand for the class you need to examine the
+source code of the plugin at the moment, specifically the ``*opt``
+argument in the InsnClassExecCount tables.
+
+- contrib/plugins/lockstep.c
+
+This is a debugging tool for developers who want to find out when and
+where execution diverges after a subtle change to TCG code generation.
+It is not an exact science and results are likely to be mixed once
+asynchronous events are introduced. While the use of -icount can
+introduce determinism to the execution flow, it doesn't always follow
+that the translation sequence will be exactly the same.
+caused by a timer firing to service the GUI causing a block to end
+early. However in some cases it has proved to be useful in pointing
+people at roughly where execution diverges. The only argument you need
+for the plugin is a path for the socket the two instances will
+communicate over::
+
+
+ ./sparc-softmmu/qemu-system-sparc -monitor none -parallel none \
+ -net none -M SS-20 -m 256 -kernel day11/zImage.elf \
+ -plugin ./contrib/plugins/liblockstep.so,sockpath=lockstep-sparc.sock \
+ -d plugin,nochain
+
+which will eventually report::
+
+ qemu-system-sparc: warning: nic lance.0 has no peer
+ @ 0x000000ffd06678 vs 0x000000ffd001e0 (2/1 since last)
+ @ 0x000000ffd07d9c vs 0x000000ffd06678 (3/1 since last)
+ Δ insn_count @ 0x000000ffd07d9c (809900609) vs 0x000000ffd06678 (809900612)
+ previously @ 0x000000ffd06678/10 (809900609 insns)
+ previously @ 0x000000ffd001e0/4 (809900599 insns)
+ previously @ 0x000000ffd080ac/2 (809900595 insns)
+ previously @ 0x000000ffd08098/5 (809900593 insns)
+ previously @ 0x000000ffd080c0/1 (809900588 insns)
+
+- contrib/plugins/hwprofile.c
+
+The hwprofile tool can only be used with system emulation and allows
+the user to see which hardware is accessed and how often. It has a number of options:
+
+ * track=read or track=write
+
+ By default the plugin tracks both reads and writes. You can use one
+ of these options to limit the tracking to just one class of accesses.
+
+ * source
+
+ Will include a detailed breakdown of the guest PC that made each
+ access. Not compatible with the pattern option. Example output::
+
+ cirrus-low-memory @ 0xfffffd00000a0000
+ pc:fffffc0000005cdc, 1, 256
+ pc:fffffc0000005ce8, 1, 256
+ pc:fffffc0000005cec, 1, 256
+
+ * pattern
+
+ Instead break down the accesses based on the offset into the HW
+ region. This can be useful for seeing the most used registers of a
+ device. Example output::
+
+ pci0-conf @ 0xfffffd01fe000000
+ off:00000004, 1, 1
+ off:00000010, 1, 3
+ off:00000014, 1, 3
+ off:00000018, 1, 2
+ off:0000001c, 1, 2
+ off:00000020, 1, 2
+ ...
+
+- contrib/plugins/execlog.c
+
+The execlog tool traces executed instructions together with their memory
+accesses. It can be used for debugging and security analysis purposes.
+Please be aware that this will generate a lot of output.
+
+The plugin takes no argument::
+
+ qemu-system-arm $(QEMU_ARGS) \
+ -plugin ./contrib/plugins/libexeclog.so -d plugin
+
+which will output an execution trace following this structure::
+
+ # vCPU, vAddr, opcode, disassembly[, load/store, memory addr, device]...
+ 0, 0xa12, 0xf8012400, "movs r4, #0"
+ 0, 0xa14, 0xf87f42b4, "cmp r4, r6"
+ 0, 0xa16, 0xd206, "bhs #0xa26"
+ 0, 0xa18, 0xfff94803, "ldr r0, [pc, #0xc]", load, 0x00010a28, RAM
+ 0, 0xa1a, 0xf989f000, "bl #0xd30"
+ 0, 0xd30, 0xfff9b510, "push {r4, lr}", store, 0x20003ee0, RAM, store, 0x20003ee4, RAM
+ 0, 0xd32, 0xf9893014, "adds r0, #0x14"
+ 0, 0xd34, 0xf9c8f000, "bl #0x10c8"
+ 0, 0x10c8, 0xfff96c43, "ldr r3, [r0, #0x44]", load, 0x200000e4, RAM
+
+- contrib/plugins/cache.c
+
+Cache modelling plugin that measures the performance of a given L1 cache
+configuration, and optionally a unified L2 per-core cache when a given working
+set is run::
+
+ qemu-x86_64 -plugin ./contrib/plugins/libcache.so \
+ -d plugin -D cache.log ./tests/tcg/x86_64-linux-user/float_convs
+
+will report the following::
+
+ core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate
+ 0 996695 508 0.0510% 2642799 18617 0.7044%
+
+ address, data misses, instruction
+ 0x424f1e (_int_malloc), 109, movq %rax, 8(%rcx)
+ 0x41f395 (_IO_default_xsputn), 49, movb %dl, (%rdi, %rax)
+ 0x42584d (ptmalloc_init.part.0), 33, movaps %xmm0, (%rax)
+ 0x454d48 (__tunables_init), 20, cmpb $0, (%r8)
+ ...
+
+ address, fetch misses, instruction
+ 0x4160a0 (__vfprintf_internal), 744, movl $1, %ebx
+ 0x41f0a0 (_IO_setb), 744, endbr64
+ 0x415882 (__vfprintf_internal), 744, movq %r12, %rdi
+ 0x4268a0 (__malloc), 696, andq $0xfffffffffffffff0, %rax
+ ...
+
+The plugin has a number of arguments, all of them are optional:
+
+ * limit=N
+
+ Print top N icache and dcache thrashing instructions along with their
+ address, number of misses, and its disassembly. (default: 32)
+
+ * icachesize=N
+ * iblksize=B
+ * iassoc=A
+
+ Instruction cache configuration arguments. They specify the cache size, block
+ size, and associativity of the instruction cache, respectively.
+ (default: N = 16384, B = 64, A = 8)
+
+ * dcachesize=N
+ * dblksize=B
+ * dassoc=A
+
+ Data cache configuration arguments. They specify the cache size, block size,
+ and associativity of the data cache, respectively.
+ (default: N = 16384, B = 64, A = 8)
+
+ * evict=POLICY
+
+ Sets the eviction policy to POLICY. Available policies are: :code:`lru`,
+ :code:`fifo`, and :code:`rand`. The plugin will use the specified policy for
+ both instruction and data caches. (default: POLICY = :code:`lru`)
+
+ * cores=N
+
+ Sets the number of cores for which we maintain separate icache and dcache.
+ (default: for linux-user, N = 1, for full system emulation: N = cores
+ available to guest)
+
+ * l2=on
+
+ Simulates a unified L2 cache (stores blocks for both instructions and data)
+ using the default L2 configuration (cache size = 2MB, associativity = 16-way,
+ block size = 64B).
+
+ * l2cachesize=N
+ * l2blksize=B
+ * l2assoc=A
+
+ L2 cache configuration arguments. They specify the cache size, block size, and
+ associativity of the L2 cache, respectively. Setting any of the L2
+ configuration arguments implies ``l2=on``.
+ (default: N = 2097152 (2MB), B = 64, A = 16)
diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst
new file mode 100644
index 000000000..a65fb7b1c
--- /dev/null
+++ b/docs/devel/tcg.rst
@@ -0,0 +1,190 @@
+====================
+Translator Internals
+====================
+
+QEMU is a dynamic translator. When it first encounters a piece of code,
+it converts it to the host instruction set. Usually dynamic translators
+are very complicated and highly CPU dependent. QEMU uses some tricks
+which make it relatively easy to port and keep it simple while achieving
+good performance.
+
+QEMU's dynamic translation backend is called TCG, for "Tiny Code
+Generator". For more information, please take a look at ``tcg/README``.
+
+The following sections outline some notable features and implementation
+details of QEMU's dynamic translator.
+
+CPU state optimisations
+-----------------------
+
+The target CPUs have many internal states which change the way they
+evaluate instructions. In order to achieve a good speed, the
+translation phase considers that some state information of the virtual
+CPU cannot change in it. The state is recorded in the Translation
+Block (TB). If the state changes (e.g. privilege level), a new TB will
+be generated and the previous TB won't be used anymore until the state
+matches the state recorded in the previous TB. The same idea can be applied
+to other aspects of the CPU state. For example, on x86, if the SS,
+DS and ES segments have a zero base, then the translator does not even
+generate an addition for the segment base.
+
+Direct block chaining
+---------------------
+
+After each translated basic block is executed, QEMU uses the simulated
+Program Counter (PC) and other CPU state information (such as the CS
+segment base value) to find the next basic block.
+
+In its simplest, least optimized form, this is done by exiting from the
+current TB, going through the TB epilogue, and then back to the
+main loop. That’s where QEMU looks for the next TB to execute,
+translating it from the guest architecture if it isn’t already available
+in memory. Then QEMU proceeds to execute this next TB, starting at the
+prologue and then moving on to the translated instructions.
+
+Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
+callback to be re-evaluated before executing additional instructions.
+It is mandatory to exit this way after any CPU state changes that may
+unmask interrupts.
+
+In order to accelerate the cases where the TB for the new
+simulated PC is already available, QEMU has mechanisms that allow
+multiple TBs to be chained directly, without having to go back to the
+main loop as described above. These mechanisms are:
+
+``lookup_and_goto_ptr``
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Calling ``tcg_gen_lookup_and_goto_ptr()`` will emit a call to
+``helper_lookup_tb_ptr``. This helper will look for an existing TB that
+matches the current CPU state. If the destination TB is available its
+code address is returned, otherwise the address of the JIT epilogue is
+returned. The call to the helper is always followed by the tcg ``goto_ptr``
+opcode, which branches to the returned address. In this way, we either
+branch to the next TB or return to the main loop.
+
+``goto_tb + exit_tb``
+^^^^^^^^^^^^^^^^^^^^^
+
+The translation code usually implements branching by performing the
+following steps:
+
+1. Call ``tcg_gen_goto_tb()`` passing a jump slot index (either 0 or 1)
+ as a parameter.
+
+2. Emit TCG instructions to update the CPU state with any information
+ that has been assumed constant and is required by the main loop to
+ correctly locate and execute the next TB. For most guests, this is
+ just the PC of the branch destination, but others may store additional
+ data. The information updated in this step must be inferable from both
+ ``cpu_get_tb_cpu_state()`` and ``cpu_restore_state()``.
+
+3. Call ``tcg_gen_exit_tb()`` passing the address of the current TB and
+ the jump slot index again.
+
+Step 1, ``tcg_gen_goto_tb()``, will emit a ``goto_tb`` TCG
+instruction that later on gets translated to a jump to an address
+associated with the specified jump slot. Initially, this is the address
+of step 2's instructions, which update the CPU state information. Step 3,
+``tcg_gen_exit_tb()``, exits from the current TB returning a tagged
+pointer composed of the last executed TB’s address and the jump slot
+index.
+
+The first time this whole sequence is executed, step 1 simply jumps
+to step 2. Then the CPU state information gets updated and we exit from
+the current TB. As a result, the behavior is very similar to the less
+optimized form described earlier in this section.
+
+Next, the main loop looks for the next TB to execute using the
+current CPU state information (creating the TB if it wasn’t already
+available) and, before starting to execute the new TB’s instructions,
+patches the previously executed TB by associating one of its jump
+slots (the one specified in the call to ``tcg_gen_exit_tb()``) with the
+address of the new TB.
+
+The next time this previous TB is executed and we get to that same
+``goto_tb`` step, it will already be patched (assuming the destination TB
+is still in memory) and will jump directly to the first instruction of
+the destination TB, without going back to the main loop.
+
+For the ``goto_tb + exit_tb`` mechanism to be used, the following
+conditions need to be satisfied:
+
+* The change in CPU state must be constant, e.g., a direct branch and
+ not an indirect branch.
+
+* The direct branch cannot cross a page boundary. Memory mappings
+ may change, causing the code at the destination address to change.
+
+Note that, on step 3 (``tcg_gen_exit_tb()``), in addition to the
+jump slot index, the address of the TB just executed is also returned.
+This address corresponds to the TB that will be patched; it may be
+different than the one that was directly executed from the main loop
+if the latter had already been chained to other TBs.
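+
+The pattern is easiest to see as code. The fragment below is an
+illustrative sketch rather than code from any particular target:
+``use_goto_tb()`` and ``gen_update_pc()`` stand in for the
+target-specific page-crossing check and CPU state update, while the
+``tcg_gen_*`` calls are the operations described above.
+
+.. code:: c
+
+   static void gen_jump(DisasContext *s, int slot, target_ulong dest)
+   {
+       if (use_goto_tb(s, dest)) {       /* e.g. same page, direct branch */
+           tcg_gen_goto_tb(slot);                  /* step 1 */
+           gen_update_pc(s, dest);                 /* step 2 */
+           tcg_gen_exit_tb(s->base.tb, slot);      /* step 3 */
+       } else {
+           gen_update_pc(s, dest);
+           tcg_gen_lookup_and_goto_ptr();
+       }
+   }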
+
+Self-modifying code and translated code invalidation
+----------------------------------------------------
+
+Self-modifying code is a special challenge in x86 emulation because no
+instruction cache invalidation is signaled by the application when code
+is modified.
+
+User-mode emulation marks a host page as write-protected (if it is
+not already read-only) every time translated code is generated for a
+basic block. Then, if a write access is done to the page, Linux raises
+a SEGV signal. QEMU then invalidates all the translated code in the page
+and enables write accesses to the page. For system emulation, write
+protection is achieved through the software MMU.
+
+Correct translated code invalidation is done efficiently by maintaining
+a linked list of every translated block contained in a given page. Other
+linked lists are also maintained to undo direct block chaining.
+
+On RISC targets, correctly written software uses memory barriers and
+cache flushes, so some of the protection above would not be
+necessary. However, QEMU still requires that the generated code always
+matches the target instructions in memory in order to handle
+exceptions correctly.
+
+Exception support
+-----------------
+
+longjmp() is used when an exception such as division by zero is
+encountered.
+
+The host SIGSEGV and SIGBUS signal handlers are used to catch invalid
+memory accesses. QEMU keeps a map from host program counter to
+target program counter, and looks up where the exception happened
+based on the host program counter at the exception point.
+
+On some targets, some bits of the virtual CPU's state are not flushed to
+memory until the end of the translation block. This is done for internal
+emulation state that is rarely accessed directly by the program and/or changes
+very often throughout the execution of a translation block---this includes
+condition codes on x86, delay slots on SPARC, conditional execution on
+Arm, and so on. This state is stored for each target instruction, and
+looked up on exceptions.
+
+MMU emulation
+-------------
+
+For system emulation QEMU uses a software MMU. In that mode, the MMU
+virtual to physical address translation is done at every memory
+access.
+
+QEMU uses an address translation cache (TLB) to speed up the translation.
+In order to avoid flushing the translated code each time the MMU
+mappings change, all caches in QEMU are physically indexed. This
+means that each basic block is indexed with its physical address.
+
+In order to avoid invalidating the basic block chain when MMU mappings
+change, chaining is only performed when the destination of the jump
+shares a page with the basic block that is performing the jump.
+
+The MMU can also distinguish RAM and ROM memory areas from MMIO memory
+areas. Access is faster for RAM and ROM because the translation cache also
+hosts the offset between guest address and host memory. Accessing MMIO
+memory areas instead calls out to C code for device emulation.
+Finally, the MMU helps track dirty pages and pages pointed to by
+translation blocks.
+
diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
new file mode 100644
index 000000000..755343c7d
--- /dev/null
+++ b/docs/devel/testing.rst
@@ -0,0 +1,1309 @@
+Testing in QEMU
+===============
+
+This document describes the testing infrastructure in QEMU.
+
+Testing with "make check"
+-------------------------
+
+The "make check" testing family includes most of the C based tests in QEMU. For
+a quick help, run ``make check-help`` from the source tree.
+
+The usual way to run these tests is:
+
+.. code::
+
+ make check
+
+which includes QAPI schema tests, unit tests, QTests and some iotests.
+Different sub-types of "make check" tests will be explained below.
+
+Before running tests, it is best to build QEMU programs first. Some tests
+expect the executables to exist and will fail with obscure messages if they
+cannot find them.
+
+Unit tests
+~~~~~~~~~~
+
+Unit tests, which can be invoked with ``make check-unit``, are simple C tests
+that typically link to individual QEMU object files and exercise them by
+calling exported functions.
+
+If you are writing new code in QEMU, consider adding a unit test, especially
+for utility modules that are relatively stateless or have few dependencies. To
+add a new unit test:
+
+1. Create a new source file. For example, ``tests/unit/foo-test.c``.
+
+2. Write the test. Normally you would include the header file which exports
+ the module API, then verify the interface behaves as expected from your
+ test. The test code should be organized with the glib testing framework.
+ Copying and modifying an existing test is usually a good idea.
+
+3. Add the test to ``tests/unit/meson.build``. The unit tests are listed in a
+ dictionary called ``tests``. The values are any additional sources and
+ dependencies to be linked with the test. For a simple test whose source
+ is in ``tests/unit/foo-test.c``, it is enough to add an entry like::
+
+ {
+ ...
+ 'foo-test': [],
+ ...
+ }
+
+Since unit tests don't require environment variables, the simplest way to debug
+a unit test failure is often directly invoking it or even running it under
+``gdb``. However, there can still be differences in behavior between ``make``
+invocations and your manual run, due to the ``$MALLOC_PERTURB_`` environment
+variable (which affects memory reclamation and catches invalid pointers better)
+and gtester options. If necessary, you can run
+
+.. code::
+
+ make check-unit V=1
+
+and copy the actual command line which executes the unit test, then run
+it from the command line.
+
+QTest
+~~~~~
+
+QTest is a device emulation testing framework. It can be very useful to test
+device models; it could also control certain aspects of QEMU (such as virtual
+clock stepping), using a special-purpose "qtest" protocol. Refer to
+:doc:`qtest` for more details.
+
+QTest cases can be executed with
+
+.. code::
+
+ make check-qtest
+
+QAPI schema tests
+~~~~~~~~~~~~~~~~~
+
+The QAPI schema tests validate the QAPI parser used by QMP, by feeding
+predefined input to the parser and comparing the result with the reference
+output.
+
+The input/output data is managed under the ``tests/qapi-schema`` directory.
+Each test case includes four files that have a common base name:
+
+ * ``${casename}.json`` - the file contains the JSON input for feeding the
+ parser
+ * ``${casename}.out`` - the file contains the expected stdout from the parser
+ * ``${casename}.err`` - the file contains the expected stderr from the parser
+ * ``${casename}.exit`` - the expected error code
+
+Consider adding a new QAPI schema test when you are making a change to the QAPI
+parser (either fixing a bug or extending/modifying the syntax). To do this:
+
+1. Add four files for the new case as explained above. For example:
+
+ ``$EDITOR tests/qapi-schema/foo.{json,out,err,exit}``.
+
+2. Add the new test in ``tests/Makefile.include``. For example:
+
+ ``qapi-schema += foo.json``
+
+check-block
+~~~~~~~~~~~
+
+``make check-block`` runs a subset of the block layer iotests (the tests that
+are in the "auto" group).
+See the "QEMU iotests" section below for more information.
+
+QEMU iotests
+------------
+
+QEMU iotests, under the directory ``tests/qemu-iotests``, is the testing
+framework widely used to test block layer related features. It is higher level
+than "make check" tests and 99% of the code is written in bash or Python
+scripts. The success criterion is comparison against golden output, and the
+test files are named with numbers.
+
+To run iotests, make sure QEMU is built successfully, then switch to the
+``tests/qemu-iotests`` directory under the build directory, and run ``./check``
+with desired arguments from there.
+
+By default, "raw" format and "file" protocol is used; all tests will be
+executed, except the unsupported ones. You can override the format and protocol
+with arguments:
+
+.. code::
+
+ # test with qcow2 format
+ ./check -qcow2
+ # or test a different protocol
+ ./check -nbd
+
+It's also possible to list test numbers explicitly:
+
+.. code::
+
+ # run selected cases with qcow2 format
+ ./check -qcow2 001 030 153
+
+The cache mode can be selected with the "-c" option, which may help reveal bugs
+that are specific to certain cache modes.
+
+More options are supported by the ``./check`` script, run ``./check -h`` for
+help.
+
+Writing a new test case
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Consider writing a test case when you are making any changes to the block
+layer. An iotest case is usually the choice for that. There are already many
+test cases, so it is possible that extending one of them may achieve the goal
+and save the boilerplate of creating a new one. (Unfortunately, there isn't a
+100% reliable way to find a related one out of hundreds of tests. One approach
+is using ``git grep``.)
+
+Usually an iotest case consists of two files. One is an executable that
+produces output to stdout and stderr, the other is the expected reference
+output. They are given the same number in their file names, e.g. test script
+``055`` and reference output ``055.out``.
+
+In rare cases, when outputs differ between cache mode ``none`` and others, a
+``.out.nocache`` file is added. In other cases, when outputs differ between
+image formats, more than one ``.out`` file is created, ending with the
+respective format names, e.g. ``178.out.qcow2`` and ``178.out.raw``.
+
+There isn't a hard rule about how to write a test script, but a new test is
+usually a (copy and) modification of an existing case. There are a few
+commonly used ways to create a test:
+
+* A Bash script. It will make use of several environment variables related
+  to the testing procedure, and could source a group of ``common.*`` libraries
+  for some common helper routines.
+
+* A Python unittest script. Import ``iotests`` and create a subclass of
+  ``iotests.QMPTestCase``, then call the ``iotests.main`` method. The downside
+  of this approach is that the output is scarce, which makes the script harder
+  to debug.
+
+* A simple Python script without using the unittest module. This could also
+  import ``iotests`` for launching QEMU, utilities etc., but it doesn't inherit
+  from ``iotests.QMPTestCase`` and therefore doesn't use the Python unittest
+  execution. This is a combination of the first two approaches.
+
+Pick the language per your preference since both Bash and Python have
+comparable library support for invoking and interacting with QEMU programs. If
+you opt for Python, it is strongly recommended to write Python 3 compatible
+code.
+
+Both the Python and Bash frameworks in iotests provide helpers to manage test
+images. They can be used to create and clean up images under the test
+directory. If no I/O or protocol-specific feature is needed, it is often
+more convenient to use the pseudo block driver, ``null-co://``, as the test
+image, which doesn't require image creation or clean-up. Avoid system-wide
+devices or files whenever possible, such as ``/dev/null`` or ``/dev/zero``.
+Otherwise, image locking implications have to be considered. For example,
+another application on the host may have locked the file, possibly leading to a
+test failure. If using such devices is explicitly desired, consider adding the
+``locking=off`` option to disable image locking.
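+
+As an illustration of the Python unittest style, here is a minimal sketch. It
+is only illustrative: it assumes helpers such as ``iotests.VM``, ``qemu_img``
+and ``assert_qmp`` from ``tests/qemu-iotests/iotests.py``, and a real test
+should be modelled on an existing case rather than copied from here verbatim:
+
+.. code::
+
+  #!/usr/bin/env python3
+  # group: quick
+  #
+  # Illustrative sketch of a Python unittest style iotest.
+  import os
+  import iotests
+  from iotests import qemu_img
+
+  test_img = os.path.join(iotests.test_dir, 'test.img')
+
+  class TestExample(iotests.QMPTestCase):
+      def setUp(self):
+          # Create a small test image and launch QEMU with it attached
+          qemu_img('create', '-f', iotests.imgfmt, test_img, '1M')
+          self.vm = iotests.VM().add_drive(test_img)
+          self.vm.launch()
+
+      def tearDown(self):
+          self.vm.shutdown()
+          os.remove(test_img)
+
+      def test_query_block(self):
+          # The attached drive should be visible via QMP
+          result = self.vm.qmp('query-block')
+          self.assert_qmp(result, 'return[0]/inserted/image/filename',
+                          test_img)
+
+  if __name__ == '__main__':
+      iotests.main(supported_fmts=['raw', 'qcow2'])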
+
+Debugging a test case
+~~~~~~~~~~~~~~~~~~~~~
+
+The following options to the ``check`` script can be useful when debugging
+a failing test:
+
+* ``-gdb`` wraps every QEMU invocation in a ``gdbserver``, which waits for a
+ connection from a gdb client. The options given to ``gdbserver`` (e.g. the
+ address on which to listen for connections) are taken from the ``$GDB_OPTIONS``
+ environment variable. By default (if ``$GDB_OPTIONS`` is empty), it listens on
+ ``localhost:12345``.
+ It is possible to connect to it for example with
+ ``gdb -iex "target remote $addr"``, where ``$addr`` is the address
+ ``gdbserver`` listens on.
+ If the ``-gdb`` option is not used, ``$GDB_OPTIONS`` is ignored,
+ regardless of whether it is set or not.
+
+* ``-valgrind`` attaches a valgrind instance to QEMU. If it detects
+ warnings, it will print and save the log in
+ ``$TEST_DIR/<valgrind_pid>.valgrind``.
+ The final command line will be ``valgrind --log-file=$TEST_DIR/
+ <valgrind_pid>.valgrind --error-exitcode=99 $QEMU ...``
+
+* ``-d`` (debug) just increases the logging verbosity, showing
+ for example the QMP commands and answers.
+
+* ``-p`` (print) redirects QEMU’s stdout and stderr to the test output,
+ instead of saving it into a log file in
+ ``$TEST_DIR/qemu-machine-<random_string>``.
+
+Test case groups
+~~~~~~~~~~~~~~~~
+
+"Tests may belong to one or more test groups, which are defined in the form
+of a comment in the test source file. By convention, test groups are listed
+in the second line of the test file, after the "#!/..." line, like this:
+
+.. code::
+
+ #!/usr/bin/env python3
+ # group: auto quick
+ #
+ ...
+
+Another way of defining groups is to create the
+``tests/qemu-iotests/group.local`` file. This should only be used downstream
+(the file should never appear upstream). It may be used to define downstream
+test groups or to temporarily disable tests, like this:
+
+.. code::
+
+ # groups for some company downstream process
+ #
+ # ci - tests to run on build
+ # down - our downstream tests, not for upstream
+ #
+ # Format of each line is:
+ # TEST_NAME TEST_GROUP [TEST_GROUP ]...
+
+ 013 ci
+ 210 disabled
+ 215 disabled
+ our-ugly-workaround-test down ci
+
+Note that the following group names have a special meaning:
+
+- quick: Tests in this group should finish within a few seconds.
+
+- auto: Tests in this group are used during "make check" and should be
+ runnable in any case. That means they should run with every QEMU binary
+ (also non-x86), with every QEMU configuration (i.e. must not fail if
+ an optional feature is not compiled in - but reporting a "skip" is ok),
+  work at least with the qcow2 file format, work with all kinds of host
+ filesystems and users (e.g. "nobody" or "root") and must not take too
+ much memory and disk space (since CI pipelines tend to fail otherwise).
+
+- disabled: Tests in this group are disabled and ignored by check.
+
+.. _container-ref:
+
+Container based tests
+---------------------
+
+Introduction
+~~~~~~~~~~~~
+
+The container testing framework in QEMU utilizes public images to
+build and test QEMU in predefined and widely accessible Linux
+environments. This makes it possible to expand the test coverage
+across distros, toolchain flavors and library versions. The support
+was originally written for Docker although we also support Podman as
+an alternative container runtime. Although many of the target
+names and scripts are prefixed with "docker", the system will
+automatically run on whichever is configured.
+
+The container images are also used to augment the generation of tests
+for testing TCG. See :ref:`checktcg-ref` for more details.
+
+Docker Prerequisites
+~~~~~~~~~~~~~~~~~~~~
+
+Install "docker" with the system package manager and start the Docker service
+on your development machine, then make sure you have the privilege to run
+Docker commands. Typically this means setting up a passwordless ``sudo docker``
+command or logging in as root. For example:
+
+.. code::
+
+ $ sudo yum install docker
+ $ # or `apt-get install docker` for Ubuntu, etc.
+ $ sudo systemctl start docker
+ $ sudo docker ps
+
+The last command should print an empty table, to verify the system is ready.
+
+An alternative method to set up permissions is by adding the current user to
+"docker" group and making the docker daemon socket file (by default
+``/var/run/docker.sock``) accessible to the group:
+
+.. code::
+
+ $ sudo groupadd docker
+ $ sudo usermod $USER -a -G docker
+ $ sudo chown :docker /var/run/docker.sock
+
+Note that any one of the above configurations makes it possible for the user to
+exploit the whole host with Docker bind mounting or other privileged
+operations. So only do it on development machines.
+
+Podman Prerequisites
+~~~~~~~~~~~~~~~~~~~~
+
+Install "podman" with the system package manager.
+
+.. code::
+
+ $ sudo dnf install podman
+ $ podman ps
+
+The last command should print an empty table, to verify the system is ready.
+
+Quickstart
+~~~~~~~~~~
+
+From the source tree, type ``make docker-help`` to see the help. Testing
+can be started without configuring or building QEMU (``configure`` and
+``make`` are done in the container, with parameters defined by the
+make target):
+
+.. code::
+
+ make docker-test-build@centos8
+
+This will create a container instance using the ``centos8`` image (the image
+is downloaded and initialized automatically), in which the ``test-build`` job
+is executed.
+
+Registry
+~~~~~~~~
+
+The QEMU project has a container registry hosted by GitLab at
+``registry.gitlab.com/qemu-project/qemu`` which will automatically be
+used to pull in pre-built layers. This avoids unnecessary strain on
+the distro archives caused by multiple developers running the same
+container build steps over and over again. This can be overridden
+locally by using the ``NOCACHE`` build option:
+
+.. code::
+
+ make docker-image-debian10 NOCACHE=1
+
+Images
+~~~~~~
+
+Along with many other images, the ``centos8`` image is defined in a Dockerfile
+in ``tests/docker/dockerfiles/``, called ``centos8.docker``. The ``make
+docker-help`` command will list all the available images.
+
+To add a new image, simply create a new ``.docker`` file under the
+``tests/docker/dockerfiles/`` directory.
+
+A ``.pre`` script can be added beside the ``.docker`` file, which will be
+executed before building the image under the build context directory. This is
+mainly used to do necessary host side setup. One example is setting up
+``binfmt_misc`` so that qemu-user powered cross-build containers work.
+
+Tests
+~~~~~
+
+Different tests are added to cover various configurations to build and test
+QEMU. Docker tests are the executables under ``tests/docker`` named
+``test-*``. They are typically shell scripts and are built on top of a shell
+library, ``tests/docker/common.rc``, which provides helpers to find the QEMU
+source and build it.
+
+The full list of tests is printed in the ``make docker-help`` help.
+
+Debugging a Docker test failure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When CI tasks, maintainers, or you yourself report a Docker test failure,
+follow the steps below to debug it:
+
+1. Locally reproduce the failure with the reported command line. E.g. run
+ ``make docker-test-mingw@fedora J=8``.
+2. Add "V=1" to the command line, try again, to see the verbose output.
+3. Further add "DEBUG=1" to the command line. This will pause in a shell prompt
+ in the container right before testing starts. You could either manually
+ build QEMU and run tests from there, or press Ctrl-D to let the Docker
+ testing continue.
+4. If you press Ctrl-D, the same building and testing procedure will begin, and
+ will hopefully run into the error again. After that, you will be dropped to
+ the prompt for debug.
+
+Options
+~~~~~~~
+
+Various options can be used to affect how Docker tests are done. The full
+list is in the ``make docker`` help text. The frequently used ones are:
+
+* ``V=1``: the same as in top level ``make``. It will be propagated to the
+ container and enable verbose output.
+* ``J=$N``: the number of parallel tasks in make commands in the container,
+ similar to the ``-j $N`` option in top level ``make``. (The ``-j`` option in
+ top level ``make`` will not be propagated into the container.)
+* ``DEBUG=1``: enables debug. See the previous "Debugging a Docker test
+ failure" section.
+
+Thread Sanitizer
+----------------
+
+Thread Sanitizer (TSan) is a tool which can detect data races. QEMU supports
+building and testing with this tool.
+
+For more information on TSan:
+
+https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual
+
+Thread Sanitizer in Docker
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+TSan is currently supported in the ubuntu2004 docker image.
+
+The ``test-tsan`` test will build QEMU using TSan and then run ``make check``.
+
+.. code::
+
+ make docker-test-tsan@ubuntu2004
+
+TSan warnings under docker are placed in files located at ``build/tsan/``.
+
+We recommend using ``DEBUG=1`` to allow launching the test from inside the
+container, and to allow review of the warnings generated by TSan.
+
+Building and Testing with TSan
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to build and test with TSan, with a few additional steps.
+These steps are normally done automatically in the docker.
+
+At this time, a one-time patch is needed for clang-9 or clang-10:
+
+.. code::
+
+ sed -i 's/^const/static const/g' \
+ /usr/lib/llvm-10/lib/clang/10.0.0/include/sanitizer/tsan_interface.h
+
+To configure the build for TSan:
+
+.. code::
+
+ ../configure --enable-tsan --cc=clang-10 --cxx=clang++-10 \
+ --disable-werror --extra-cflags="-O0"
+
+The runtime behavior of TSAN is controlled by the TSAN_OPTIONS environment
+variable.
+
+More information on the TSAN_OPTIONS can be found here:
+
+https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags
+
+For example:
+
+.. code::
+
+ export TSAN_OPTIONS=suppressions=<path to qemu>/tests/tsan/suppressions.tsan \
+ detect_deadlocks=false history_size=7 exitcode=0 \
+ log_path=<build path>/tsan/tsan_warning
+
+The above exitcode=0 has TSan continue without error if any warnings are found.
+This allows for running the test and then checking the warnings afterwards.
+If you want TSan to stop and exit with error on warnings, use exitcode=66.
+
+TSan Suppressions
+~~~~~~~~~~~~~~~~~
+Keep in mind that a data race reported by TSan does not necessarily indicate an
+actual bug. TSan provides several different mechanisms for suppressing
+warnings. In general it is recommended to fix the code, if possible, to
+eliminate the data race rather than suppress the warning.
+
+A few important files for suppressing warnings are:
+
+``tests/tsan/suppressions.tsan`` - Has TSan warnings we wish to suppress at
+runtime. The comment on each suppression will typically indicate why we are
+suppressing it. More information on the file format can be found here:
+
+https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions
+
+``tests/tsan/blacklist.tsan`` - Has TSan warnings we wish to disable
+at compile time for test or debug.
+Add the following flag to configure to enable it::
+
+  --extra-cflags=-fsanitize-blacklist=<src path>/tests/tsan/blacklist.tsan
+
+More information on the file format can be found here under "Blacklist Format":
+
+https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags
+
+TSan Annotations
+~~~~~~~~~~~~~~~~
+``include/qemu/tsan.h`` defines annotations. See this file for more descriptions
+of the annotations themselves. Annotations can be used to suppress
+TSan warnings or give TSan more information so that it can detect proper
+relationships between accesses of data.
+
+Annotation examples can be found here:
+
+https://github.com/llvm/llvm-project/tree/master/compiler-rt/test/tsan/
+
+Good files to start with are: annotate_happens_before.cpp and ignore_race.cpp
+
+The full set of annotations can be found here:
+
+https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp
+
+VM testing
+----------
+
+This test suite contains scripts that bootstrap various guest images that have
+necessary packages to build QEMU. The basic usage is documented in ``Makefile``
+help which is displayed with ``make vm-help``.
+
+Quickstart
+~~~~~~~~~~
+
+Run ``make vm-help`` to list available make targets. Invoke a specific make
+command to run a build test in an image. For example, ``make vm-build-freebsd``
+will build the source tree in the FreeBSD image. The command can be executed
+from either the source tree or the build dir; if the former, ``./configure`` is
+not needed. The command will then generate the test image in ``./tests/vm/``
+under the working directory.
+
+Note: images created by the scripts accept a well-known RSA key pair for SSH
+access, so they SHOULD NOT be exposed to external interfaces if you are
+concerned about attackers taking control of the guest and potentially
+exploiting a QEMU security bug to compromise the host.
+
+QEMU binaries
+~~~~~~~~~~~~~
+
+By default, ``qemu-system-x86_64`` is searched for in ``$PATH`` to run the
+guest. If there isn't one, or if it is older than 2.10, the test won't work. In
+this case, provide the QEMU binary in the environment variable:
+``QEMU=/path/to/qemu-2.10+``.
+
+Likewise the path to ``qemu-img`` can be set in the ``QEMU_IMG`` environment
+variable.
+
+Make jobs
+~~~~~~~~~
+
+The ``-j$X`` option in the make command line is not propagated into the VM;
+specify ``J=$X`` to control the make jobs in the guest.
+
+Debugging
+~~~~~~~~~
+
+Add ``DEBUG=1`` and/or ``V=1`` to the make command to allow interactive
+debugging and verbose output. If this is not enough, see the next section.
+``V=1`` will be propagated down into the make jobs in the guest.
+
+Manual invocation
+~~~~~~~~~~~~~~~~~
+
+Each guest script is an executable script with the same command line options.
+For example to work with the netbsd guest, use ``$QEMU_SRC/tests/vm/netbsd``:
+
+.. code::
+
+ $ cd $QEMU_SRC/tests/vm
+
+ # To bootstrap the image
+ $ ./netbsd --build-image --image /var/tmp/netbsd.img
+ <...>
+
+ # To run an arbitrary command in guest (the output will not be echoed unless
+ # --debug is added)
+ $ ./netbsd --debug --image /var/tmp/netbsd.img uname -a
+
+ # To build QEMU in guest
+ $ ./netbsd --debug --image /var/tmp/netbsd.img --build-qemu $QEMU_SRC
+
+ # To get to an interactive shell
+ $ ./netbsd --interactive --image /var/tmp/netbsd.img sh
+
+Adding new guests
+~~~~~~~~~~~~~~~~~
+
+Please look at existing guest scripts for how to add new guests.
+
+Most importantly, create a subclass of ``BaseVM``, implement the
+``build_image()`` method and define ``BUILD_SCRIPT``, then finally call
+``basevm.main()`` from the script's ``main()``. A schematic sketch follows
+the list below.
+
+* Usually in ``build_image()``, a template image is downloaded from a
+ predefined URL. ``BaseVM._download_with_cache()`` takes care of the cache and
+ the checksum, so consider using it.
+
+* Once the image is downloaded, users, SSH server and QEMU build deps should
+ be set up:
+
+ - Root password set to ``BaseVM.ROOT_PASS``
+ - User ``BaseVM.GUEST_USER`` is created, and password set to
+ ``BaseVM.GUEST_PASS``
+ - SSH service is enabled and started on boot,
+ ``$QEMU_SRC/tests/keys/id_rsa.pub`` is added to ssh's ``authorized_keys``
+ file of both root and the normal user
+ - DHCP client service is enabled and started on boot, so that it can
+ automatically configure the virtio-net-pci NIC and communicate with QEMU
+ user net (10.0.2.2)
+ - Necessary packages are installed to untar the source tarball and build
+ QEMU
+
+* Write a proper ``BUILD_SCRIPT`` template, which should be a shell script that
+  untars the QEMU source tarball (passed in as a raw virtio-blk block device),
+  then configures and builds it. Running "make check" is also
+  recommended.
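+
+The following skeleton shows how these pieces fit together. It is only a
+sketch: the guest name, image URL, checksum, device path and guest-side
+commands are placeholders, and the real scripts in ``tests/vm`` should be
+used as the reference for details such as ``BUILD_SCRIPT`` substitutions and
+helper signatures:
+
+.. code::
+
+  #!/usr/bin/env python3
+  # Skeleton of a tests/vm guest script (illustrative only).
+  import sys
+  import basevm
+
+  class ExampleVM(basevm.BaseVM):
+      name = "example"        # hypothetical guest name
+      arch = "x86_64"
+
+      # Shell fragment run inside the guest: untar the QEMU source tarball
+      # from the raw virtio-blk device, then configure, build and test it.
+      # The device name and build commands depend on the guest OS.
+      BUILD_SCRIPT = """
+          set -e
+          cd $(mktemp -d /var/tmp/qemu-test.XXXXXX)
+          tar -xf /dev/vdb
+          ./configure
+          make -j$(nproc)
+          make check
+      """
+
+      def build_image(self, img):
+          # Download a cached, checksum-verified template image
+          # (the URL and checksum here are made up).
+          cimg = self._download_with_cache(
+              "https://example.org/example-base.img.xz",
+              sha256sum="0123456789abcdef")
+          # Decompress/copy the template to "img" and set up users, sshd,
+          # DHCP and build dependencies here, as described above.
+
+  if __name__ == "__main__":
+      sys.exit(basevm.main(ExampleVM))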
+
+Image fuzzer testing
+--------------------
+
+An image fuzzer was added to exercise format drivers. Currently only qcow2 is
+supported. To start the fuzzer, run
+
+.. code::
+
+ tests/image-fuzzer/runner.py -c '[["qemu-img", "info", "$test_img"]]' /tmp/test qcow2
+
+Alternatively, a command other than ``qemu-img info`` can be tested by changing
+the ``-c`` option.
+
+Integration tests using the Avocado Framework
+---------------------------------------------
+
+The ``tests/avocado`` directory hosts integration tests. They're usually
+higher level tests, and may interact with external resources and with
+various guest operating systems.
+
+These tests are written using the Avocado Testing Framework (which must
+be installed separately) in conjunction with the ``avocado_qemu.Test``
+class, implemented at ``tests/avocado/avocado_qemu``.
+
+Tests based on ``avocado_qemu.Test`` can easily:
+
+ * Customize the command line arguments given to the convenience
+ ``self.vm`` attribute (a QEMUMachine instance)
+
+ * Interact with the QEMU monitor, send QMP commands and check
+ their results
+
+ * Interact with the guest OS, using the convenience console device
+ (which may be useful to assert the effectiveness and correctness of
+ command line arguments or QMP commands)
+
+ * Interact with external data files that accompany the test itself
+ (see ``self.get_data()``)
+
+ * Download (and cache) remote data files, such as firmware and kernel
+ images
+
+ * Have access to a library of guest OS images (by means of the
+ ``avocado.utils.vmimage`` library)
+
+ * Make use of various other test related utilities available at the
+ test class itself and at the utility library:
+
+ - http://avocado-framework.readthedocs.io/en/latest/api/test/avocado.html#avocado.Test
+ - http://avocado-framework.readthedocs.io/en/latest/api/utils/avocado.utils.html
+
+Running tests
+~~~~~~~~~~~~~
+
+You can run the avocado tests simply by executing:
+
+.. code::
+
+ make check-avocado
+
+This involves the automatic creation of a Python virtual environment
+within the build tree (at ``tests/venv``) which will have all the
+right dependencies, and will save tests results also within the
+build tree (at ``tests/results``).
+
+Note: the build environment must be using a Python 3 stack, and have
+the ``venv`` and ``pip`` packages installed. If necessary, make sure
+``configure`` is called with ``--python=`` and that those modules are
+available. On Debian and Ubuntu based systems, depending on the
+specific version, they may be on packages named ``python3-venv`` and
+``python3-pip``.
+
+It is also possible to run tests based on tags using the
+``make check-avocado`` command and the ``AVOCADO_TAGS`` environment
+variable:
+
+.. code::
+
+ make check-avocado AVOCADO_TAGS=quick
+
+Note that tags separated with commas have an AND behavior, while tags
+separated by spaces have an OR behavior. For more information on Avocado
+tags, see:
+
+ https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/tags.html
+
+To run a single test file, a couple of them, or a test within a file
+using the ``make check-avocado`` command, set the ``AVOCADO_TESTS``
+environment variable with the test files or test names. To run all
+tests from a single file, use:
+
+ .. code::
+
+ make check-avocado AVOCADO_TESTS=$FILEPATH
+
+The same is valid to run tests from multiple test files:
+
+ .. code::
+
+ make check-avocado AVOCADO_TESTS='$FILEPATH1 $FILEPATH2'
+
+To run a single test within a file, use:
+
+ .. code::
+
+ make check-avocado AVOCADO_TESTS=$FILEPATH:$TESTCLASS.$TESTNAME
+
+The same is valid to run single tests from multiple test files:
+
+ .. code::
+
+ make check-avocado AVOCADO_TESTS='$FILEPATH1:$TESTCLASS1.$TESTNAME1 $FILEPATH2:$TESTCLASS2.$TESTNAME2'
+
+The scripts installed inside the virtual environment may be used
+without an "activation". For instance, the Avocado test runner
+may be invoked by running:
+
+ .. code::
+
+ tests/venv/bin/avocado run $OPTION1 $OPTION2 tests/avocado/
+
+Note that if ``make check-avocado`` was not executed before, it is
+possible to create the Python virtual environment with the dependencies
+needed by running:
+
+ .. code::
+
+ make check-venv
+
+It is also possible to run tests from a single file or a single test within
+a test file. To run tests from a single file within the build tree, use:
+
+ .. code::
+
+ tests/venv/bin/avocado run tests/avocado/$TESTFILE
+
+To run a single test within a test file, use:
+
+ .. code::
+
+ tests/venv/bin/avocado run tests/avocado/$TESTFILE:$TESTCLASS.$TESTNAME
+
+Valid test names are visible in the output from any previous execution
+of Avocado or ``make check-avocado``, and can also be queried using:
+
+ .. code::
+
+ tests/venv/bin/avocado list tests/avocado
+
+Manual Installation
+~~~~~~~~~~~~~~~~~~~
+
+To manually install Avocado and its dependencies, run:
+
+.. code::
+
+ pip install --user avocado-framework
+
+Alternatively, follow the instructions on this link:
+
+ https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/installing.html
+
+Overview
+~~~~~~~~
+
+The ``tests/avocado/avocado_qemu`` directory provides the
+``avocado_qemu`` Python module, containing the ``avocado_qemu.Test``
+class. Here's a simple usage example:
+
+.. code::
+
+ from avocado_qemu import QemuSystemTest
+
+
+ class Version(QemuSystemTest):
+ """
+ :avocado: tags=quick
+ """
+ def test_qmp_human_info_version(self):
+ self.vm.launch()
+ res = self.vm.command('human-monitor-command',
+ command_line='info version')
+ self.assertRegexpMatches(res, r'^(\d+\.\d+\.\d)')
+
+To execute your test, run:
+
+.. code::
+
+ avocado run version.py
+
+Tests may be classified according to a convention by using docstring
+directives such as ``:avocado: tags=TAG1,TAG2``. To run all tests
+in the current directory, tagged as "quick", run:
+
+.. code::
+
+ avocado run -t quick .
+
+The ``avocado_qemu.Test`` base test class
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``avocado_qemu.Test`` class has a number of characteristics that
+are worth mentioning right away.
+
+First of all, it attempts to give each test a ready to use QEMUMachine
+instance, available at ``self.vm``. Because many tests will tweak the
+QEMU command line, launching the QEMUMachine (by using ``self.vm.launch()``)
+is left to the test writer.
+
+The base test class also has support for tests with more than one
+QEMUMachine. The way to get machines is through the ``self.get_vm()``
+method which will return a QEMUMachine instance. The ``self.get_vm()``
+method accepts arguments that will be passed to the QEMUMachine creation
+and also an optional ``name`` attribute so you can identify a specific
+machine and get it more than once through the test's methods. A simple
+and hypothetical example follows:
+
+.. code::
+
+ from avocado_qemu import QemuSystemTest
+
+
+ class MultipleMachines(QemuSystemTest):
+ def test_multiple_machines(self):
+ first_machine = self.get_vm()
+ second_machine = self.get_vm()
+ self.get_vm(name='third_machine').launch()
+
+ first_machine.launch()
+ second_machine.launch()
+
+ first_res = first_machine.command(
+ 'human-monitor-command',
+ command_line='info version')
+
+ second_res = second_machine.command(
+ 'human-monitor-command',
+ command_line='info version')
+
+ third_res = self.get_vm(name='third_machine').command(
+ 'human-monitor-command',
+ command_line='info version')
+
+ self.assertEquals(first_res, second_res, third_res)
+
+At test "tear down", ``avocado_qemu.Test`` handles all the QEMUMachines
+shutdown.
+
+The ``avocado_qemu.LinuxTest`` base test class
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``avocado_qemu.LinuxTest`` is a further specialization of the
+``avocado_qemu.Test`` class, so it contains all the characteristics of
+the latter plus some extra features.
+
+First of all, this base class is intended for tests that need to
+interact with a fully booted and operational Linux guest. At this
+time, it uses a Fedora 31 guest image. The most basic example looks
+like this:
+
+.. code::
+
+ from avocado_qemu import LinuxTest
+
+
+ class SomeTest(LinuxTest):
+
+ def test(self):
+ self.launch_and_wait()
+ self.ssh_command('some_command_to_be_run_in_the_guest')
+
+Please refer to tests that use ``avocado_qemu.LinuxTest`` under
+``tests/avocado`` for more examples.
+
+QEMUMachine
+~~~~~~~~~~~
+
+The QEMUMachine API is already widely used in the Python iotests,
+device-crash-test and other Python scripts. It's a wrapper around the
+execution of a QEMU binary, giving its users:
+
+ * the ability to set command line arguments to be given to the QEMU
+ binary
+
+ * a ready to use QMP connection and interface, which can be used to
+ send commands and inspect its results, as well as asynchronous
+ events
+
+ * convenience methods to set commonly used command line arguments in
+ a more succinct and intuitive way
+
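+As an illustration, a standalone script (outside of any test framework) might
+drive a QEMU binary roughly like this. This is only a sketch: the module path
+(``python/qemu``), the binary path and the exact helper names should be
+checked against the in-tree ``python`` package:
+
+.. code::
+
+  #!/usr/bin/env python3
+  # Minimal standalone QEMUMachine usage sketch (illustrative only).
+  import sys
+
+  sys.path.append('python')  # when run from a QEMU source tree checkout
+  from qemu.machine import QEMUMachine
+
+  vm = QEMUMachine('./qemu-system-x86_64')   # binary path is an assumption
+  vm.add_args('-nodefaults')
+  vm.launch()
+  try:
+      # QMP is available once the machine is launched
+      print(vm.command('query-status'))
+  finally:
+      vm.shutdown()
+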
+QEMU binary selection
+^^^^^^^^^^^^^^^^^^^^^
+
+The QEMU binary used for the ``self.vm`` QEMUMachine instance will
+primarily depend on the value of the ``qemu_bin`` parameter. If it's
+not explicitly set, its default value will be the result of a dynamic
+probe in the same source tree. A suitable binary will be one that
+targets the architecture matching the host machine.
+
+Based on this description, test writers will usually rely on one of
+the following approaches:
+
+1) Set ``qemu_bin``, and use the given binary
+
+2) Do not set ``qemu_bin``, and use a QEMU binary named like
+ "qemu-system-${arch}", either in the current
+ working directory, or in the current source tree.
+
+The resulting ``qemu_bin`` value will be preserved in the
+``avocado_qemu.Test`` as an attribute with the same name.
+
+Attribute reference
+~~~~~~~~~~~~~~~~~~~
+
+Test
+^^^^
+
+Besides the attributes and methods that are part of the base
+``avocado.Test`` class, the following attributes are available on any
+``avocado_qemu.Test`` instance.
+
+vm
+''
+
+A QEMUMachine instance, initially configured according to the given
+``qemu_bin`` parameter.
+
+arch
+''''
+
+The architecture can be used on different levels of the stack, e.g. by
+the framework or by the test itself. At the framework level, it will
+currently influence the selection of a QEMU binary (when one is not
+explicitly given).
+
+Tests are also free to use this attribute value, for their own needs.
+A test may, for instance, use the same value when selecting the
+architecture of a kernel or disk image to boot a VM with.
+
+The ``arch`` attribute will be set to the test parameter of the same
+name. If one is not given explicitly, it will either be set to
+``None``, or, if the test is tagged with one (and only one)
+``:avocado: tags=arch:VALUE`` tag, it will be set to ``VALUE``.
+
+cpu
+'''
+
+The cpu model that will be set to all QEMUMachine instances created
+by the test.
+
+The ``cpu`` attribute will be set to the test parameter of the same
+name. If one is not given explicitly, it will either be set to
+``None``, or, if the test is tagged with one (and only one)
+``:avocado: tags=cpu:VALUE`` tag, it will be set to ``VALUE``.
+
+machine
+'''''''
+
+The machine type that will be set to all QEMUMachine instances created
+by the test.
+
+The ``machine`` attribute will be set to the test parameter of the same
+name. If one is not given explicitly, it will either be set to
+``None``, or, if the test is tagged with one (and only one)
+``:avocado: tags=machine:VALUE`` tag, it will be set to ``VALUE``.
+
+qemu_bin
+''''''''
+
+The preserved value of the ``qemu_bin`` parameter or the result of the
+dynamic probe for a QEMU binary in the current working directory or
+source tree.
+
+LinuxTest
+^^^^^^^^^
+
+Besides the attributes present on the ``avocado_qemu.Test`` base
+class, the ``avocado_qemu.LinuxTest`` adds the following attributes:
+
+distro
+''''''
+
+The name of the Linux distribution used as the guest image for the
+test. The name should match the **Provider** column on the list
+of images supported by the avocado.utils.vmimage library:
+
+https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
+
+distro_version
+''''''''''''''
+
+The version of the Linux distribution used as the guest image for the
+test. The name should match the **Version** column on the list
+of images supported by the avocado.utils.vmimage library:
+
+https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
+
+distro_checksum
+'''''''''''''''
+
+The sha256 hash of the guest image file used for the test.
+
+If this value is not set in the code or by a test parameter (with the
+same name), no validation on the integrity of the image will be
+performed.
+
+Parameter reference
+~~~~~~~~~~~~~~~~~~~
+
+To understand how Avocado parameters are accessed by tests, and how
+they can be passed to tests, please refer to::
+
+ https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#accessing-test-parameters
+
+Parameter values can be easily seen in the log files, and will look
+like the following:
+
+.. code::
+
+  PARAMS (key=qemu_bin, path=*, default=./qemu-system-x86_64) => './qemu-system-x86_64'
+
+Test
+^^^^
+
+arch
+''''
+
+The architecture that will influence the selection of a QEMU binary
+(when one is not explicitly given).
+
+Tests are also free to use this parameter value, for their own needs.
+A test may, for instance, use the same value when selecting the
+architecture of a kernel or disk image to boot a VM with.
+
+This parameter has a direct relation with the ``arch`` attribute. If
+not given, it will default to None.
+
+cpu
+'''
+
+The cpu model that will be set to all QEMUMachine instances created
+by the test.
+
+machine
+'''''''
+
+The machine type that will be set to all QEMUMachine instances created
+by the test.
+
+qemu_bin
+''''''''
+
+The exact QEMU binary to be used by QEMUMachine.
+
+LinuxTest
+^^^^^^^^^
+
+Besides the parameters present on the ``avocado_qemu.Test`` base
+class, the ``avocado_qemu.LinuxTest`` adds the following parameters:
+
+distro
+''''''
+
+The name of the Linux distribution used as the guest image for the
+test. The name should match the **Provider** column on the list
+of images supported by the avocado.utils.vmimage library:
+
+https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
+
+distro_version
+''''''''''''''
+
+The version of the Linux distribution used as the guest image for the
+test. The name should match the **Version** column on the list
+of images supported by the avocado.utils.vmimage library:
+
+https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
+
+distro_checksum
+'''''''''''''''
+
+The sha256 hash of the guest image file used for the test.
+
+If this value is not set in the code or by this parameter, no
+validation on the integrity of the image will be performed.
+
+Skipping tests
+~~~~~~~~~~~~~~
+
+The Avocado framework provides Python decorators which allow for easily
+skipping tests under certain conditions. For example, when a binary is missing
+on the test system or when the running environment is a CI system. For further
+information about those decorators, please refer to::
+
+ https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#skipping-tests
+
+While the conditions for skipping tests are often specific to each test, there
+are recurring scenarios identified by the QEMU developers, and the use of
+environment variables has become a standard way to enable/disable tests.
+
+Here is a list of the most used variables:
+
+AVOCADO_ALLOW_LARGE_STORAGE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Tests which are going to fetch or produce assets considered *large* are not
+going to run unless ``AVOCADO_ALLOW_LARGE_STORAGE=1`` is exported in
+the environment.
+
+The definition of *large* is a bit arbitrary here, but it usually means an
+asset which occupies at least 1GB of size on disk when uncompressed.
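+
+For instance, a test that downloads a large guest image might guard itself as
+in the sketch below (assuming Avocado's ``skipUnless`` decorator):
+
+.. code::
+
+  import os
+
+  from avocado import skipUnless
+
+  @skipUnless(os.getenv('AVOCADO_ALLOW_LARGE_STORAGE'), 'storage limited')
+  def test_with_large_image(self):
+      do_something()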
+
+AVOCADO_ALLOW_UNTRUSTED_CODE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+There are tests which will boot a kernel image or firmware that can be
+considered not safe to run on the developer's workstation, thus they are
+skipped by default. The definition of *not safe* is also arbitrary, but
+usually it means a blob whose source or build process is not publicly
+available.
+
+You should export ``AVOCADO_ALLOW_UNTRUSTED_CODE=1`` in the environment in
+order to allow tests which make use of those kinds of assets.
+
+AVOCADO_TIMEOUT_EXPECTED
+^^^^^^^^^^^^^^^^^^^^^^^^
+The Avocado framework has a timeout mechanism which interrupts tests to avoid
+the test suite getting stuck. The timeout value can be set via a test parameter
+or a property defined in the test class; for further details see::
+
+ https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#setting-a-test-timeout
+
+Even though the timeout can be set by the test developer, there are some tests
+that may not have a well-defined limit of time to finish under certain
+conditions. For example, tests that take longer to execute when QEMU is
+compiled with debug flags. Therefore, the ``AVOCADO_TIMEOUT_EXPECTED`` variable
+has been used to determine whether those tests should run or not.
+
+GITLAB_CI
+^^^^^^^^^
+A number of tests are flagged to not run on the GitLab CI. Usually this is
+because they proved to be flaky or there are constraints on the CI environment
+which would make them fail. If you encounter a similar situation then use that
+variable as shown in the code snippet below to skip the test:
+
+.. code::
+
+ @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
+ def test(self):
+ do_something()
+
+Uninstalling Avocado
+~~~~~~~~~~~~~~~~~~~~
+
+If you've followed the manual installation instructions above, you can
+easily uninstall Avocado. Start by listing the packages you have
+installed::
+
+ pip list --user
+
+And remove any package you want with::
+
+ pip uninstall <package_name>
+
+If you've used ``make check-avocado``, the Python virtual environment where
+Avocado is installed will be cleaned up as part of ``make check-clean``.
+
+.. _checktcg-ref:
+
+Testing with "make check-tcg"
+-----------------------------
+
+The check-tcg tests are intended for simple smoke tests of both
+linux-user and softmmu TCG functionality. However, to build test
+programs for guest targets you need to have cross compilers available.
+If your distribution supports cross compilers, you can do something as
+simple as::
+
+ apt install gcc-aarch64-linux-gnu
+
+The configure script will automatically pick up their presence.
+Sometimes compilers have slightly odd names, so they can be specified
+explicitly by passing in the appropriate configure option
+for the architecture in question, for example::
+
+ $(configure) --cross-cc-aarch64=aarch64-cc
+
+There is also a ``--cross-cc-flags-ARCH`` flag in case additional
+compiler flags are needed to build for a given target.
+
+If you have the ability to run containers as your user, the build system
+will automatically use them where no system compiler is available. For
+architectures where we also support building QEMU we will generally
+use the same container to build tests. However, there are a number of
+additional containers defined that have a minimal cross-build
+environment that is only suitable for building test cases. Sometimes
+we may use a bleeding edge distribution for compiler features needed
+for test cases that aren't yet in the LTS distros we support for QEMU
+itself.
+
+See :ref:`container-ref` for more details.
+
+Running subset of tests
+~~~~~~~~~~~~~~~~~~~~~~~
+
+You can build the tests for one architecture::
+
+ make build-tcg-tests-$TARGET
+
+And run with::
+
+ make run-tcg-tests-$TARGET
+
+Adding ``V=1`` to the invocation will show the details of how to
+invoke QEMU for the test, which is useful for debugging tests.
+
+TCG test dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+The TCG tests are deliberately very light on dependencies and are
+either totally bare with minimal gcc lib support (for softmmu tests)
+or just glibc (for linux-user tests). This is because getting a cross
+compiler to work with additional libraries can be challenging.
+
+Other TCG Tests
+---------------
+
+There are a number of out-of-tree test suites that are used for more
+extensive testing of processor features.
+
+KVM Unit Tests
+~~~~~~~~~~~~~~
+
+The KVM unit tests are designed to run as a guest OS under KVM but
+there is no reason why they can't exercise the TCG as well. They
+provide a minimal OS kernel with hooks for enabling the MMU as well
+as reporting test results via a special device::
+
+ https://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
+
+Linux Test Project
+~~~~~~~~~~~~~~~~~~
+
+The LTP is focused on exercising the syscall interface of a Linux
+kernel. It checks that syscalls behave as documented and strives to
+exercise as many corner cases as possible. It is a useful test suite
+to run to exercise QEMU's linux-user code::
+
+ https://linux-test-project.github.io/
+
+GCC gcov support
+----------------
+
+``gcov`` is a GCC tool to analyze the testing coverage by
+instrumenting the tested code. To use it, configure QEMU with the
+``--enable-gcov`` option and build. Then run the tests as usual.
+
+If you want to gather coverage information on a single test, the ``make
+clean-gcda`` target can be used to delete any existing coverage
+information before running it.
+
+You can generate an HTML coverage report by executing ``make
+coverage-html`` which will create
+``meson-logs/coveragereport/index.html``.
+
+Further analysis can be conducted by running the ``gcov`` command
+directly on the various .gcda output files. Please read the ``gcov``
+documentation for more information.
diff --git a/docs/devel/tracing.rst b/docs/devel/tracing.rst
new file mode 100644
index 000000000..ba8395489
--- /dev/null
+++ b/docs/devel/tracing.rst
@@ -0,0 +1,498 @@
+=======
+Tracing
+=======
+
+Introduction
+============
+
+This document describes the tracing infrastructure in QEMU and how to use it
+for debugging, profiling, and observing execution.
+
+Quickstart
+==========
+
+Enable tracing of ``memory_region_ops_read`` and ``memory_region_ops_write``
+events::
+
+ $ qemu --trace "memory_region_ops_*" ...
+ ...
+ 719585@1608130130.441188:memory_region_ops_read cpu 0 mr 0x562fdfbb3820 addr 0x3cc value 0x67 size 1
+ 719585@1608130130.441190:memory_region_ops_write cpu 0 mr 0x562fdfbd2f00 addr 0x3d4 value 0x70e size 2
+
+This output comes from the "log" trace backend, which is enabled by default
+when ``./configure --enable-trace-backends=BACKENDS`` is not explicitly
+specified.
+
+Multiple patterns can be specified by repeating the ``--trace`` option::
+
+ $ qemu --trace "kvm_*" --trace "virtio_*" ...
+
+When patterns are used frequently it is more convenient to store them in a
+file to avoid long command-line options::
+
+ $ echo "memory_region_ops_*" >/tmp/events
+ $ echo "kvm_*" >>/tmp/events
+ $ qemu --trace events=/tmp/events ...
+
+Trace events
+============
+
+Sub-directory setup
+-------------------
+
+Each directory in the source tree can declare a set of trace events in a local
+"trace-events" file. All directories which contain "trace-events" files must be
+listed in the "trace_events_subdirs" variable in the top level meson.build
+file. During build, the "trace-events" file in each listed subdirectory will be
+processed by the "tracetool" script to generate code for the trace events.
+
+The individual "trace-events" files are merged into a "trace-events-all" file,
+which is also installed into "/usr/share/qemu" with the name "trace-events".
+This merged file is to be used by the "simpletrace.py" script to later analyse
+traces in the simpletrace data format.
+
+The following files are automatically generated in <builddir>/trace/ during the
+build:
+
+ - trace-<subdir>.c - the trace event state declarations
+ - trace-<subdir>.h - the trace event enums and probe functions
+ - trace-dtrace-<subdir>.h - DTrace event probe specification
+ - trace-dtrace-<subdir>.dtrace - DTrace event probe helper declaration
+ - trace-dtrace-<subdir>.o - binary DTrace provider (generated by dtrace)
+ - trace-ust-<subdir>.h - UST event probe helper declarations
+
+Here <subdir> is the sub-directory path with '/' replaced by '_'. For example,
+"accel/kvm" becomes "accel_kvm" and the final filename for "trace-<subdir>.c"
+becomes "trace-accel_kvm.c".
+
+Source files in the source tree do not directly include generated files in
+"<builddir>/trace/". Instead they #include the local "trace.h" file, without
+any sub-directory path prefix, e.g. ``io/channel-buffer.c`` would do::
+
+ #include "trace.h"
+
+The "io/trace.h" file must be created manually with an #include of the
+corresponding "trace/trace-<subdir>.h" file that will be generated in the
+builddir::
+
+ $ echo '#include "trace/trace-io.h"' >io/trace.h
+
+While it is possible to include a trace.h file from outside a source file's own
+sub-directory, this is discouraged in general. It is strongly preferred that
+all events be declared directly in the sub-directory that uses them. The only
+exception is where there are some shared trace events defined in the top level
+directory trace-events file. The top level directory generates trace files
+with a filename prefix of "trace/trace-root" instead of just "trace". This
+avoids ambiguity between a "trace.h" in the current directory and the one in
+the top level directory.
+
+Using trace events
+------------------
+
+Trace events are invoked directly from source code like this::
+
+ #include "trace.h" /* needed for trace event prototype */
+
+ void *qemu_vmalloc(size_t size)
+ {
+ void *ptr;
+ size_t align = QEMU_VMALLOC_ALIGN;
+
+ if (size < align) {
+ align = getpagesize();
+ }
+ ptr = qemu_memalign(align, size);
+ trace_qemu_vmalloc(size, ptr);
+ return ptr;
+ }
+
+Declaring trace events
+----------------------
+
+The "tracetool" script produces the trace.h header file which is included by
+every source file that uses trace events. Since many source files include
+trace.h, it includes a minimum of types and other header files to keep the
+namespace clean and compile times and dependencies down.
+
+Trace events should use types as follows:
+
+ * Use stdint.h types for fixed-size types. Most offsets and guest memory
+ addresses are best represented with uint32_t or uint64_t. Use fixed-size
+ types over primitive types whose size may change depending on the host
+ (32-bit versus 64-bit) so trace events don't truncate values or break
+ the build.
+
+ * Use void * for pointers to structs or for arrays. The trace.h header
+ cannot include all user-defined struct declarations and it is therefore
+ necessary to use void * for pointers to structs.
+
+ * For everything else, use primitive scalar types (char, int, long) with the
+ appropriate signedness.
+
+ * Avoid floating point types (float and double) because SystemTap does not
+ support them. In most cases it is possible to round to an integer type
+ instead. This may require scaling the value first by multiplying it by 1000
+ or the like when digits after the decimal point need to be preserved.
+
+Format strings should reflect the types defined in the trace event. Take
+special care to use PRId64 and PRIu64 for int64_t and uint64_t types,
+respectively. This ensures portability between 32- and 64-bit platforms.
+Format strings must not end with a newline character. It is the responsibility
+of backends to adapt line ending for proper logging.
+
+Each event declaration will start with the event name, then its arguments, and
+finally a format string for pretty-printing. For example::
+
+ qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p"
+ qemu_vfree(void *ptr) "ptr %p"
+
+
+Hints for adding new trace events
+---------------------------------
+
+1. Trace state changes in the code. Interesting points in the code usually
+ involve a state change like starting, stopping, allocating, freeing. State
+ changes are good trace events because they can be used to understand the
+ execution of the system.
+
+2. Trace guest operations. Guest I/O accesses like reading device registers
+ are good trace events because they can be used to understand guest
+ interactions.
+
+3. Use correlator fields so the context of an individual line of trace output
+ can be understood. For example, trace the pointer returned by malloc and
+ used as an argument to free. This way mallocs and frees can be matched up.
+ Trace events with no context are not very useful.
+
+4. Name trace events after their function. If there are multiple trace events
+ in one function, append a unique distinguisher at the end of the name.
+
+Generic interface and monitor commands
+======================================
+
+You can programmatically query and control the state of trace events through a
+backend-agnostic interface provided by the header "trace/control.h".
+
+Note that some of the backends do not provide an implementation for some parts
+of this interface, in which case QEMU will just print a warning (please refer to
+header "trace/control.h" to see which routines are backend-dependent).
+
+The state of events can also be queried and modified through monitor commands:
+
+* ``info trace-events``
+ View available trace events and their state. State 1 means enabled, state 0
+ means disabled.
+
+* ``trace-event NAME on|off``
+ Enable/disable a given trace event or a group of events (using wildcards).
+
+The "--trace events=<file>" command line argument can be used to enable the
+events listed in <file> from the very beginning of the program. This file must
+contain one event name per line.
+
+If a line in the "--trace events=<file>" file begins with a '-', the trace event
+will be disabled instead of enabled. This is useful when a wildcard was used
+to enable an entire family of events but one noisy event needs to be disabled.
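+
+For example, a (hypothetical) events file that enables all virtio-blk events
+except one noisy one could look like this::
+
+  virtio_blk_*
+  -virtio_blk_rw_complete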
+
+Wildcard matching is supported in both the monitor command "trace-event" and the
+events list file. That means you can enable/disable the events having a common
+prefix in a batch. For example, virtio-blk trace events could be enabled using
+the following monitor command::
+
+ trace-event virtio_blk_* on
+
+Trace backends
+==============
+
+The "tracetool" script automates tedious trace event code generation and also
+keeps the trace event declarations independent of the trace backend. The trace
+events are not tightly coupled to a specific trace backend, such as LTTng or
+SystemTap. Support for trace backends can be added by extending the "tracetool"
+script.
+
+The trace backends are chosen at configure time::
+
+ ./configure --enable-trace-backends=simple,dtrace
+
+For a list of supported trace backends, try ./configure --help or see below.
+If multiple backends are enabled, the trace is sent to them all.
+
+If no backends are explicitly selected, configure will default to the
+"log" backend.
+
+The following subsections describe the supported trace backends.
+
+Nop
+---
+
+The "nop" backend generates empty trace event functions so that the compiler
+can optimize out trace events completely. This imposes no performance
+penalty.
+
+Note that regardless of the selected trace backend, events with the "disable"
+property will be generated with the "nop" backend.
+
+Log
+---
+
+The "log" backend sends trace events directly to standard error. This
+effectively turns trace events into debug printfs.
+
+This is the simplest backend and can be used together with existing code that
+uses DPRINTF().
+
+The ``-msg timestamp=on|off`` command-line option controls whether or not to
+print the tid/timestamp prefix for each trace event.
+
+Simpletrace
+-----------
+
+The "simple" backend writes binary trace logs to a file from a thread, making
+it lower overhead than the "log" backend. A Python API is available for writing
+offline trace file analysis scripts. It may not be as powerful as
+platform-specific or third-party trace backends but it is portable and has no
+special library dependencies.
+
+Monitor commands
+~~~~~~~~~~~~~~~~
+
+* ``trace-file on|off|flush|set <path>``
+ Enable/disable/flush the trace file or set the trace file name.
+
+Analyzing trace files
+~~~~~~~~~~~~~~~~~~~~~
+
+The "simple" backend produces binary trace files that can be formatted with the
+simpletrace.py script. The script takes the "trace-events-all" file and the
+binary trace::
+
+ ./scripts/simpletrace.py trace-events-all trace-12345
+
+You must ensure that the same "trace-events-all" file was used to build QEMU,
+otherwise trace event declarations may have changed and output will not be
+consistent.
+
+Ftrace
+------
+
+The "ftrace" backend writes trace data to ftrace marker. This effectively
+sends trace events to ftrace ring buffer, and you can compare qemu trace
+data and kernel(especially kvm.ko when using KVM) trace data.
+
+If you use KVM, enable kvm events in ftrace::
+
+ # echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
+
+After running QEMU as the root user, you can read the trace::
+
+ # cat /sys/kernel/debug/tracing/trace
+
+Restriction: "ftrace" backend is restricted to Linux only.
+
+Syslog
+------
+
+The "syslog" backend sends trace events using the POSIX syslog API. The log
+is opened specifying the LOG_DAEMON facility and LOG_PID option (so events
+are tagged with the pid of the particular QEMU process that generated
+them). All events are logged at LOG_INFO level.
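+
+For reference, the backend's behaviour corresponds roughly to the following
+POSIX calls (the event name and arguments are illustrative)::
+
+  #include <stddef.h>
+  #include <syslog.h>
+
+  static void example_trace_to_syslog(size_t size, void *ptr)
+  {
+      /* the log is opened with the LOG_DAEMON facility and the LOG_PID option */
+      openlog(NULL, LOG_PID, LOG_DAEMON);
+      /* each trace event becomes one LOG_INFO record */
+      syslog(LOG_INFO, "qemu_vmalloc size %zu ptr %p", size, ptr);
+  }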
+
+NOTE: syslog may squash duplicate consecutive trace events and apply rate
+ limiting.
+
+Restriction: "syslog" backend is restricted to POSIX compliant OS.
+
+LTTng Userspace Tracer
+----------------------
+
+The "ust" backend uses the LTTng Userspace Tracer library. There are no
+monitor commands built into QEMU; instead, UST utilities should be used to list,
+enable/disable, and dump traces.
+
+Package lttng-tools is required for userspace tracing. You must ensure that the
+current user belongs to the "tracing" group, or manually launch the
+lttng-sessiond daemon for the current user prior to running any instance of
+QEMU.
+
+While running an instrumented QEMU, LTTng should be able to list all available
+events::
+
+ lttng list -u
+
+Create tracing session::
+
+ lttng create mysession
+
+Enable events::
+
+ lttng enable-event qemu:g_malloc -u
+
+The events argument can either be a comma-separated list of events, or "-a" to
+enable all tracepoint events. Start and stop tracing as needed::
+
+ lttng start
+ lttng stop
+
+View the trace::
+
+ lttng view
+
+Destroy tracing session::
+
+ lttng destroy
+
+Babeltrace can be used at any later time to view the trace::
+
+ babeltrace $HOME/lttng-traces/mysession-<date>-<time>
+
+SystemTap
+---------
+
+The "dtrace" backend uses DTrace sdt probes but has only been tested with
+SystemTap. When SystemTap support is detected, a .stp file with wrapper probes
+is generated to make their use in scripts more convenient. This step can also be
+performed manually after a build in order to change the binary name in the .stp
+probes::
+
+ scripts/tracetool.py --backends=dtrace --format=stap \
+ --binary path/to/qemu-binary \
+ --target-type system \
+ --target-name x86_64 \
+ --group=all \
+ trace-events-all \
+ qemu.stp
+
+To facilitate simple usage of SystemTap where only printf-style logging of
+certain probes is needed, a helper script "qemu-trace-stap" is provided.
+Consult its manual page for guidance on its usage.
+
+Trace event properties
+======================
+
+Each event in the "trace-events-all" file can be prefixed with a space-separated
+list of zero or more of the following event properties.
+
+"disable"
+---------
+
+If a specific trace event is going to be invoked a huge number of times, this
+might have a noticeable performance impact even when the event is
+programmatically disabled.
+
+In this case you should declare such an event with the "disable" property. This
+will effectively disable the event at compile time (by using the "nop" backend),
+thus having no performance impact at all on regular builds (i.e., unless you
+edit the "trace-events-all" file).
+
+In addition, there might be cases where relatively complex computations must be
+performed to generate values that are only used as arguments for a trace
+function. In these cases you can use 'trace_event_get_state_backends()' to
+guard such computations, so they are skipped if the event has been either
+compile-time disabled or run-time disabled. If the event is compile-time
+disabled, this check will have no performance impact.
+
+::
+
+ #include "trace.h" /* needed for trace event prototype */
+
+ void *qemu_vmalloc(size_t size)
+ {
+ void *ptr;
+ size_t align = QEMU_VMALLOC_ALIGN;
+
+ if (size < align) {
+ align = getpagesize();
+ }
+ ptr = qemu_memalign(align, size);
+ if (trace_event_get_state_backends(TRACE_QEMU_VMALLOC)) {
+ void *complex;
+ /* some complex computations to produce the 'complex' value */
+ trace_qemu_vmalloc(size, ptr, complex);
+ }
+ return ptr;
+ }
+
+"tcg"
+-----
+
+Guest code generated by TCG can be traced by defining an event with the "tcg"
+event property. Internally, this property generates two events:
+"<eventname>_trans" to trace the event at translation time, and
+"<eventname>_exec" to trace the event at execution time.
+
+Instead of using these two events directly, you should use the function
+"trace_<eventname>_tcg" during translation (TCG code generation). This function
+will automatically call "trace_<eventname>_trans", and will generate the
+necessary TCG code to call "trace_<eventname>_exec" during guest code execution.
+
+Events with the "tcg" property can be declared in the "trace-events" file with a
+mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward
+them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values
+are not known at translation time, these are ignored by the "<eventname>_trans"
+event. Because of this, the entry in the "trace-events" file needs two printing
+formats (separated by a comma)::
+
+ tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d"
+
+For example::
+
+ #include "trace-tcg.h"
+
+ void some_disassembly_func (...)
+ {
+ uint8_t a1 = ...;
+ TCGv_i32 a2 = ...;
+ trace_foo_tcg(a1, a2);
+ }
+
+This will immediately call::
+
+ void trace_foo_trans(uint8_t a1);
+
+and will generate the TCG code to call::
+
+ void trace_foo(uint8_t a1, uint32_t a2);
+
+"vcpu"
+------
+
+Identifies events that trace vCPU-specific information. It implicitly adds a
+"CPUState*" argument, and extends the tracing print format to show the vCPU
+information. If used together with the "tcg" property, it adds a second
+"TCGv_env" argument that must point to the per-target global TCG register that
+points to the vCPU when guest code is executed (usually the "cpu_env" variable).
+
+The "tcg" and "vcpu" properties are currently only honored in the root
+./trace-events file.
+
+The following example events::
+
+ foo(uint32_t a) "a=%x"
+ vcpu bar(uint32_t a) "a=%x"
+ tcg vcpu baz(uint32_t a) "a=%x", "a=%x"
+
+Can be used as::
+
+ #include "trace-tcg.h"
+
+ CPUArchState *env;
+ TCGv_ptr cpu_env;
+
+ void some_disassembly_func(...)
+ {
+ /* trace emitted at this point */
+ trace_foo(0xd1);
+ /* trace emitted at this point */
+ trace_bar(env_cpu(env), 0xd2);
+ /* trace emitted at this point (env) and when guest code is executed (cpu_env) */
+ trace_baz_tcg(env_cpu(env), cpu_env, 0xd3);
+ }
+
+If the translating vCPU has address 0xc1 and code is later executed by vCPU
+0xc2, this would be an example output::
+
+ // at guest code translation
+ foo a=0xd1
+ bar cpu=0xc1 a=0xd2
+ baz_trans cpu=0xc1 a=0xd3
+ // at guest code execution
+ baz_exec cpu=0xc2 a=0xd3
diff --git a/docs/devel/trivial-patches.rst b/docs/devel/trivial-patches.rst
new file mode 100644
index 000000000..9380c730f
--- /dev/null
+++ b/docs/devel/trivial-patches.rst
@@ -0,0 +1,52 @@
+.. _trivial-patches:
+
+Trivial Patches
+===============
+
+Overview
+--------
+
+Trivial patches that change just a few lines of code sometimes languish
+on the mailing list even though they require only a small amount of
+review. This is often the case for patches that do not fall under an
+actively maintained subsystem and therefore fall through the cracks.
+
+The trivial patches team take on the task of reviewing and building pull
+requests for patches that:
+
+- Do not fall under an actively maintained subsystem.
+- Are single patches or short series (max 2-4 patches).
+- Only touch a few lines of code.
+
+**You should hint that your patch is a candidate by CCing
+qemu-trivial@nongnu.org.**
+
+Repositories
+------------
+
+Since the trivial patch team rotates maintainership there is only one
+active repository at a time:
+
+- git://github.com/vivier/qemu.git trivial-patches - `browse <https://github.com/vivier/qemu/tree/trivial-patches>`__
+
+Workflow
+--------
+
+The trivial patches team rotates the duty of collecting trivial patches
+amongst its members. A team member's job is to:
+
+1. Identify trivial patches on the development mailing list.
+2. Review trivial patches, merge them into a git tree, and reply to state
+ that the patch is queued.
+3. Send pull requests to the development mailing list once a week.
+
+A single team member can be on duty as long as they like. The suggested
+time is 1 week before handing off to the next member.
+
+Team
+----
+
+If you would like to join the trivial patches team, contact Laurent
+Vivier. The current team includes:
+
+- `Laurent Vivier <mailto:laurent@vivier.eu>`__
diff --git a/docs/devel/ui.rst b/docs/devel/ui.rst
new file mode 100644
index 000000000..17fb667de
--- /dev/null
+++ b/docs/devel/ui.rst
@@ -0,0 +1,8 @@
+=================
+QEMU UI subsystem
+=================
+
+QEMU Clipboard
+--------------
+
+.. kernel-doc:: include/ui/clipboard.h
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 000000000..9ff6163c8
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,150 @@
+=====================
+VFIO device Migration
+=====================
+
+Migration of a virtual machine involves saving the state of each device that
+the guest is running on the source host and restoring this saved state on the
+destination host. This document details how saving and restoring of VFIO
+devices is done in QEMU.
+
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and makes it
+possible to accommodate VFIO devices that have a large amount of data to
+transfer. The iterative pre-copy phase allows the guest to continue running
+while the VFIO device state is transferred to the destination; this
+helps to reduce the total downtime of the VM. VFIO devices can choose to skip
+the pre-copy phase of migration by returning pending_bytes as zero during the
+pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_migration_info`` structure in the header
+file linux-headers/linux/vfio.h.
+
+VFIO implements the device hooks for the iterative approach as follows:
+
+* A ``save_setup`` function that sets up the migration region and sets _SAVING
+ flag in the VFIO device state.
+
+* A ``load_setup`` function that sets up the migration region on the
+ destination and sets _RESUMING flag in the VFIO device state.
+
+* A ``save_live_pending`` function that reads pending_bytes from the vendor
+ driver, which indicates the amount of data that the vendor driver has yet to
+ save for the VFIO device.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+ vendor driver through the migration region during iterative phase.
+
+* A ``save_state`` function to save the device config space if it is present.
+
+* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
+ VFIO device state and iteratively copies the remaining data for the VFIO
+ device until the vendor driver indicates that no data remains (pending bytes
+ is zero).
+
+* A ``load_state`` function that loads the config section and the data
+  sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any
+  migration-related cleanup, including unmapping the migration region.
+
+
+The VFIO migration code uses a VM state change handler to change the VFIO
+device state when the VM state changes from running to not-running, and
+vice versa.
+
+Similarly, a migration state change handler is used to trigger a transition of
+the VFIO device state when certain changes of the migration state occur. For
+example, the VFIO device state is transitioned back to _RUNNING in case a
+migration failed or was canceled.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_global_start`` and ``log_global_stop`` memory listener callback informs
+the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync``
+memory listener callback marks those system memory pages as dirty which are
+used for DMA by the VFIO device. The dirty pages bitmap is queried per
+container. All pages pinned by the vendor driver through external APIs have to
+be marked as dirty during migration. When there are CPU writes, CPU dirty page
+tracking can identify dirtied pages, but any page pinned by the vendor driver
+can also be written by the device. There is currently no device or IOMMU
+support for dirty page tracking in hardware.
+
+By default, dirty pages are tracked when the device is in pre-copy as well as
+stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
+the destination in both phases. Copying dirty pages in pre-copy phase helps
+QEMU to predict whether it can achieve its downtime tolerances. If QEMU keeps
+finding dirty pages continuously during the pre-copy phase, it can infer that
+dirty pages are also likely to be found in the stop-and-copy phase, and
+predict the downtime accordingly.
+
+QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
+which disables querying the dirty bitmap during pre-copy phase. If it is set to
+off, all dirty pages will be copied to the destination in stop-and-copy phase
+only.
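+
+For example, assuming a vfio-pci device (the host address below is made up),
+the tracking can be disabled on the command line with::
+
+  -device vfio-pci,host=0000:65:00.0,pre-copy-dirty-page-tracking=off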
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
+phase of migration. In that case, the unmap ioctl returns any dirty pages in
+that range and QEMU reports corresponding guest physical pages dirty. During
+stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
+pages, and then the dirty pages bitmap is fetched from the VFIO IOMMU module
+for those mapped ranges.
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state change during live migration.
+The values in the brackets represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+ QEMU normal running state
+ (RUNNING, _NONE, _RUNNING)
+ |
+ migrate_init spawns migration_thread
+ Migration thread then calls each device's .save_setup()
+ (RUNNING, _SETUP, _RUNNING|_SAVING)
+ |
+ (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+ If device is active, get pending_bytes by .save_live_pending()
+ If total pending_bytes >= threshold_size, call .save_live_iterate()
+ Data of VFIO device for pre-copy phase is copied
+ Iterate till total pending bytes converge and are less than threshold
+ |
+ On migration completion, vCPU stops and calls .save_live_complete_precopy for
+ each active device. The VFIO device is then transitioned into _SAVING state
+ (FINISH_MIGRATE, _DEVICE, _SAVING)
+ |
+ For the VFIO device, iterate in .save_live_complete_precopy until
+ pending data is 0
+ (FINISH_MIGRATE, _DEVICE, _STOPPED)
+ |
+ (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+     Migration thread schedules cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+::
+
+ Incoming migration calls .load_setup for each device
+ (RESTORE_VM, _ACTIVE, _STOPPED)
+ |
+ For each device, .load_state is called for that device section data
+ (RESTORE_VM, _ACTIVE, _RESUMING)
+ |
+ At the end, .load_cleanup is called for each device and vCPUs are started
+ (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is currently not supported for VFIO devices.
diff --git a/docs/devel/virtio-migration.txt b/docs/devel/virtio-migration.txt
new file mode 100644
index 000000000..98a6b0ffb
--- /dev/null
+++ b/docs/devel/virtio-migration.txt
@@ -0,0 +1,108 @@
+Virtio devices and migration
+============================
+
+Copyright 2015 IBM Corp.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+Saving and restoring the state of virtio devices is a bit of a twisty maze,
+for several reasons:
+- state is distributed between several parts:
+ - virtio core, for common fields like features, number of queues, ...
+ - virtio transport (pci, ccw, ...), for the different proxy devices and
+ transport specific state (msix vectors, indicators, ...)
+ - virtio device (net, blk, ...), for the different device types and their
+ state (mac address, request queue, ...)
+- most fields are saved via the stream interface; subsequently, subsections
+ have been added to make cross-version migration possible
+
+This file attempts to document the current procedure and point out some
+caveats.
+
+
+Save state procedure
+====================
+
+virtio core virtio transport virtio device
+----------- ---------------- -------------
+
+ save() function registered
+ via VMState wrapper on
+ device class
+virtio_save() <----------
+ ------> save_config()
+ - save proxy device
+ - save transport-specific
+ device fields
+- save common device
+ fields
+- save common virtqueue
+ fields
+ ------> save_queue()
+ - save transport-specific
+ virtqueue fields
+ ------> save_device()
+ - save device-specific
+ fields
+- save subsections
+ - device endianness,
+ if changed from
+ default endianness
+ - 64 bit features, if
+ any high feature bit
+ is set
+ - virtio-1 virtqueue
+ fields, if VERSION_1
+ is set
+
+
+Load state procedure
+====================
+
+virtio core virtio transport virtio device
+----------- ---------------- -------------
+
+ load() function registered
+ via VMState wrapper on
+ device class
+virtio_load() <----------
+ ------> load_config()
+ - load proxy device
+ - load transport-specific
+ device fields
+- load common device
+ fields
+- load common virtqueue
+ fields
+ ------> load_queue()
+ - load transport-specific
+ virtqueue fields
+- notify guest
+ ------> load_device()
+ - load device-specific
+ fields
+- load subsections
+ - device endianness
+ - 64 bit features
+ - virtio-1 virtqueue
+ fields
+- sanitize endianness
+- sanitize features
+- virtqueue index sanity
+ check
+ - feature-dependent setup
+
+
+Implications of this setup
+==========================
+
+Devices need to be careful in their state processing during load: The
+load_device() procedure is invoked by the core before subsections have
+been loaded. Any code that depends on information transmitted in subsections
+therefore has to be invoked in the device's load() function _after_
+virtio_load() has returned (e.g. code depending on features).
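+
+A minimal sketch of this pattern follows; the device, type and feature names
+are made up and only illustrate where such code has to go:
+
+  static int virtio_foo_load(QEMUFile *f, void *opaque, int version_id)
+  {
+      VirtIOFoo *foo = opaque;
+      VirtIODevice *vdev = VIRTIO_DEVICE(foo);
+      int ret;
+
+      /* runs load_config()/load_queue()/load_device() and the subsections */
+      ret = virtio_load(vdev, f, version_id);
+      if (ret) {
+          return ret;
+      }
+
+      /* only here is subsection data (e.g. 64 bit features) available */
+      if (virtio_vdev_has_feature(vdev, VIRTIO_FOO_F_SOMETHING)) {
+          /* feature-dependent setup */
+      }
+      return 0;
+  }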
+
+Any extension of the state being migrated should be done in subsections
+added to the core for compatibility reasons. If transport or device specific
+state is added, core needs to invoke a callback from the new subsection.
diff --git a/docs/devel/writing-monitor-commands.rst b/docs/devel/writing-monitor-commands.rst
new file mode 100644
index 000000000..1693822f8
--- /dev/null
+++ b/docs/devel/writing-monitor-commands.rst
@@ -0,0 +1,751 @@
+How to write monitor commands
+=============================
+
+This document is a step-by-step guide on how to write new QMP commands using
+the QAPI framework and HMP commands.
+
+This document doesn't discuss QMP protocol level details, nor does it dive
+into the QAPI framework implementation.
+
+For an in-depth introduction to the QAPI framework, please refer to
+docs/devel/qapi-code-gen.txt. For documentation about the QMP protocol,
+start with docs/interop/qmp-intro.txt.
+
+New commands may be implemented in QMP only. New HMP commands should be
+implemented on top of QMP. The typical HMP command wraps around an
+equivalent QMP command, but HMP convenience commands built from QMP
+building blocks are also fine. The long term goal is to make all
+existing HMP commands conform to this, to fully isolate HMP from the
+internals of QEMU. Refer to the `Writing a debugging aid returning
+unstructured text`_ section for further guidance on commands that
+would have traditionally been HMP only.
+
+Overview
+--------
+
+Generally speaking, the following steps should be taken in order to write a
+new QMP command.
+
+1. Define the command and any types it needs in the appropriate QAPI
+ schema module.
+
+2. Write the QMP command itself, which is a regular C function. Preferably,
+ the command should be exported by some QEMU subsystem. But it can also be
+ added to the monitor/qmp-cmds.c file
+
+3. At this point the command can be tested under the QMP protocol
+
+4. Write the HMP command equivalent. This is not required and should only be
+ done if it does make sense to have the functionality in HMP. The HMP command
+ is implemented in terms of the QMP command
+
+The following sections will demonstrate each of the steps above. We will start
+very simple and get more complex as we progress.
+
+
+Testing
+-------
+
+For all the examples in the next sections, the test setup is the same and is
+shown here.
+
+First, QEMU should be started like this::
+
+ # qemu-system-TARGET [...] \
+ -chardev socket,id=qmp,port=4444,host=localhost,server=on \
+ -mon chardev=qmp,mode=control,pretty=on
+
+Then, in a different terminal::
+
+ $ telnet localhost 4444
+ Trying 127.0.0.1...
+ Connected to localhost.
+ Escape character is '^]'.
+ {
+ "QMP": {
+ "version": {
+ "qemu": {
+ "micro": 50,
+ "minor": 15,
+ "major": 0
+ },
+ "package": ""
+ },
+ "capabilities": [
+ ]
+ }
+ }
+
+The above output is the QMP server saying you're connected. The server is
+actually in capabilities negotiation mode. To enter command mode, type::
+
+ { "execute": "qmp_capabilities" }
+
+Then the server should respond::
+
+ {
+ "return": {
+ }
+ }
+
+This is QMP's way of saying "the latest command executed OK and didn't return
+any data". Now you're ready to enter the QMP example commands as explained in
+the following sections.
+
+
+Writing a simple command: hello-world
+-------------------------------------
+
+This is the simplest QMP command that can be written. Usually, this kind of
+command carries some meaningful action in QEMU but here it will just print
+"Hello, world" to the standard output.
+
+Our command will be called "hello-world". It takes no arguments, nor does it
+return any data.
+
+The first step is defining the command in the appropriate QAPI schema
+module. We pick module qapi/misc.json, and add the following line at
+the bottom::
+
+ { 'command': 'hello-world' }
+
+The "command" keyword defines a new QMP command. It's an JSON object. All
+schema entries are JSON objects. The line above will instruct the QAPI to
+generate any prototypes and the necessary code to marshal and unmarshal
+protocol data.
+
+The next step is to write the "hello-world" implementation. As explained
+earlier, it's preferable for commands to live in QEMU subsystems. But
+"hello-world" doesn't pertain to any, so we put its implementation in
+monitor/qmp-cmds.c::
+
+ void qmp_hello_world(Error **errp)
+ {
+ printf("Hello, world!\n");
+ }
+
+There are a few things to be noticed:
+
+1. QMP command implementation functions must be prefixed with "qmp\_"
+2. qmp_hello_world() returns void, this is in accordance with the fact that the
+ command doesn't return any data
+3. It takes an "Error \*\*" argument. This is required. Later we will see how to
+ return errors and take additional arguments. The Error argument should not
+ be touched if the command doesn't return errors
+4. We won't add the function's prototype. That's automatically done by the QAPI
+5. Printing to the terminal is discouraged for QMP commands; we do it here
+ because it's the easiest way to demonstrate a QMP command
+
+You're done. Now build qemu, run it as suggested in the "Testing" section,
+and then type the following QMP command::
+
+ { "execute": "hello-world" }
+
+Then check the terminal running qemu and look for the "Hello, world" string. If
+you don't see it then something went wrong.
+
+
+Arguments
+~~~~~~~~~
+
+Let's add an argument called "message" to our "hello-world" command. The new
+argument will contain the string to be printed to stdout. It's an optional
+argument; if it's not present, we print our default "Hello, world" string.
+
+The first change we have to do is to modify the command specification in the
+schema file to the following::
+
+ { 'command': 'hello-world', 'data': { '*message': 'str' } }
+
+Notice the new 'data' member in the schema. It's a JSON object whose members
+are the arguments to the command in question. Also notice the asterisk;
+it's used to mark the argument as optional (that means that you shouldn't use
+it for mandatory arguments). Finally, 'str' is the argument's type, which
+stands for "string". The QAPI also supports integers, booleans, enumerations
+and user defined types.
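+
+For instance, a hypothetical variant of the command also taking an integer and
+a boolean could be declared as follows (this is only an illustration; the rest
+of this section keeps the single optional string argument)::
+
+  { 'command': 'hello-world',
+    'data': { '*message': 'str', '*times': 'int', '*uppercase': 'bool' } }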
+
+Now, let's update our C implementation in monitor/qmp-cmds.c::
+
+ void qmp_hello_world(bool has_message, const char *message, Error **errp)
+ {
+ if (has_message) {
+ printf("%s\n", message);
+ } else {
+ printf("Hello, world\n");
+ }
+ }
+
+There are two important details to be noticed:
+
+1. All optional arguments are accompanied by a 'has\_' boolean, which is set
+   to true if the optional argument is present and to false otherwise
+2. The C implementation signature must follow the schema's argument ordering,
+ which is defined by the "data" member
+
+Time to test our new version of the "hello-world" command. Build qemu, run it as
+described in the "Testing" section and then send two commands::
+
+ { "execute": "hello-world" }
+ {
+ "return": {
+ }
+ }
+
+ { "execute": "hello-world", "arguments": { "message": "We love qemu" } }
+ {
+ "return": {
+ }
+ }
+
+You should see "Hello, world" and "We love qemu" in the terminal running qemu,
+if you don't see these strings, then something went wrong.
+
+
+Errors
+~~~~~~
+
+QMP commands should use the error interface exported by the error.h header
+file. Basically, most errors are set by calling the error_setg() function.
+
+Let's say we don't allow the string "message" to contain the word "love". If
+it does contain it, we want the "hello-world" command to return an error::
+
+ void qmp_hello_world(bool has_message, const char *message, Error **errp)
+ {
+ if (has_message) {
+ if (strstr(message, "love")) {
+ error_setg(errp, "the word 'love' is not allowed");
+ return;
+ }
+ printf("%s\n", message);
+ } else {
+ printf("Hello, world\n");
+ }
+ }
+
+The first argument to the error_setg() function is the Error pointer
+to pointer, which is passed to all QMP functions. The next argument is a
+human-readable description of the error; this is a free-form printf-like string.
+
+Let's test the example above. Build qemu, run it as defined in the "Testing"
+section, and then issue the following command::
+
+ { "execute": "hello-world", "arguments": { "message": "all you need is love" } }
+
+The QMP server's response should be::
+
+ {
+ "error": {
+ "class": "GenericError",
+ "desc": "the word 'love' is not allowed"
+ }
+ }
+
+Note that error_setg() produces a "GenericError" class. In general,
+all QMP errors should have that error class. There are two exceptions
+to this rule:
+
+ 1. To support a management application's need to recognize a specific
+ error for special handling
+
+ 2. Backward compatibility
+
+If the failure you want to report falls into one of the two cases above,
+use error_set() with a second argument of an ErrorClass value.
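+
+For instance, a hypothetical failure that a management application needs to
+recognize could be reported with one of the existing ErrorClass values, such
+as::
+
+  error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
+            "Device '%s' not found", name);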
+
+
+Command Documentation
+~~~~~~~~~~~~~~~~~~~~~
+
+There's only one step missing to make "hello-world"'s implementation complete,
+and that's its documentation in the schema file.
+
+There are many examples of such documentation in the schema file already, but
+here goes "hello-world"'s new entry for qapi/misc.json::
+
+ ##
+ # @hello-world:
+ #
+ # Print a client provided string to the standard output stream.
+ #
+ # @message: string to be printed
+ #
+ # Returns: Nothing on success.
+ #
+ # Notes: if @message is not provided, the "Hello, world" string will
+ # be printed instead
+ #
+ # Since: <next qemu stable release, eg. 1.0>
+ ##
+ { 'command': 'hello-world', 'data': { '*message': 'str' } }
+
+Please note that the "Returns" clause is optional if a command doesn't return
+any data or any errors.
+
+
+Implementing the HMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the QMP command is in place, we can also make it available in the human
+monitor (HMP).
+
+With the introduction of the QAPI, HMP commands make QMP calls. Most of the
+time HMP commands are simple wrappers. All HMP command implementations exist in
+the monitor/hmp-cmds.c file.
+
+Here's the implementation of the "hello-world" HMP command::
+
+ void hmp_hello_world(Monitor *mon, const QDict *qdict)
+ {
+ const char *message = qdict_get_try_str(qdict, "message");
+ Error *err = NULL;
+
+ qmp_hello_world(!!message, message, &err);
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+ }
+
+Also, you have to add the function's prototype to the hmp.h file.
+
+There are three important points to be noticed:
+
+1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The
+ former is the monitor object. The latter is how the monitor passes
+ arguments entered by the user to the command implementation
+2. hmp_hello_world() performs error checking. In this example we just call
+ hmp_handle_error() which prints a message to the user, but we could do
+ more, like taking different actions depending on the error
+ qmp_hello_world() returns
+3. The "err" variable must be initialized to NULL before performing the
+ QMP call
+
+There's one last step to actually make the command available to monitor users:
+we should add it to the hmp-commands.hx file::
+
+ {
+ .name = "hello-world",
+ .args_type = "message:s?",
+ .params = "hello-world [message]",
+ .help = "Print message to the standard output",
+ .cmd = hmp_hello_world,
+ },
+
+::
+
+ STEXI
+ @item hello_world @var{message}
+ @findex hello_world
+ Print message to the standard output
+ ETEXI
+
+To test this you have to open a user monitor and issue the "hello-world"
+command. It might be instructive to check the command's documentation with
+HMP's "help" command.
+
+Please, check the "-monitor" command-line option to know how to open a user
+monitor.
+
+
+Writing more complex commands
+-----------------------------
+
+A QMP command is capable of returning any data type the QAPI supports, such as
+integers, strings, booleans, enumerations and user defined types.
+
+In this section we will focus on user defined types. Please check the QAPI
+documentation for information about the other types.
+
+
+Modelling data in QAPI
+~~~~~~~~~~~~~~~~~~~~~~
+
+For a QMP command to be considered stable and supported long term, its
+returned data is required to be explicitly modelled
+using fine-grained QAPI types. As a general guide, a caller of the QMP
+command should never need to parse individual returned data fields. If
+a field appears to need parsing, then it should be split into separate
+fields corresponding to each distinct data item. This should be the
+common case for any new QMP command that is intended to be used by
+machines, as opposed to exclusively human operators.
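+
+As a hypothetical illustration, instead of returning a combined string such as
+"1af4:1001", a PCI vendor/device pair would be modelled as two separate
+members::
+
+  { 'struct': 'ExamplePciId',
+    'data': { 'vendor-id': 'uint16', 'device-id': 'uint16' } }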
+
+Some QMP commands, however, are only intended as ad hoc debugging aids
+for human operators. While they may return large amounts of formatted
+data, it is not expected that machines will need to parse the result.
+The overhead of defining a fine grained QAPI type for the data may not
+be justified by the potential benefit. In such cases, it is permitted
+to have a command return a simple string that contains formatted data,
+however, it is mandatory for the command to use the 'x-' name prefix.
+This indicates that the command is not guaranteed to be stable in the long
+term, is liable to change in future, and is not following QAPI design
+best practices. An example where this approach is taken is the QMP
+command "x-query-registers". This returns a formatted dump of the
+architecture specific CPU state. The way the data is formatted varies
+across QEMU targets, is liable to change over time, and is only
+intended to be consumed as an opaque string by machines. Refer to the
+`Writing a debugging aid returning unstructured text`_ section for
+an illustration.
+
+User Defined Types
+~~~~~~~~~~~~~~~~~~
+
+FIXME This example needs to be redone after commit 6d32717
+
+For this example we will write the query-alarm-clock command, which returns
+information about QEMU's timer alarm. For more information about it, please
+check the "-clock" command-line option.
+
+We want to return two pieces of information. The first one is the alarm clock's
+name. The second one is when the next alarm will fire. The former information is
+returned as a string; the latter is an integer in nanoseconds (which is not
+very useful in practice, as the timer has probably already fired when the
+information reaches the client).
+
+The best way to return that data is to create a new QAPI type, as shown below::
+
+ ##
+ # @QemuAlarmClock
+ #
+ # QEMU alarm clock information.
+ #
+ # @clock-name: The alarm clock method's name.
+ #
+ # @next-deadline: The time (in nanoseconds) the next alarm will fire.
+ #
+ # Since: 1.0
+ ##
+ { 'type': 'QemuAlarmClock',
+ 'data': { 'clock-name': 'str', '*next-deadline': 'int' } }
+
+The "type" keyword defines a new QAPI type. Its "data" member contains the
+type's members. In this example our members are "clock-name" and
+"next-deadline", the latter of which is optional.
+
+Now let's define the query-alarm-clock command::
+
+ ##
+ # @query-alarm-clock
+ #
+ # Return information about QEMU's alarm clock.
+ #
+ # Returns a @QemuAlarmClock instance describing the alarm clock method
+ # being currently used by QEMU (this is usually set by the '-clock'
+ # command-line option).
+ #
+ # Since: 1.0
+ ##
+ { 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }
+
+Notice the "returns" keyword. As its name suggests, it's used to define the
+data returned by a command.
+
+It's time to implement the qmp_query_alarm_clock() function; you can put it
+in the qemu-timer.c file::
+
+ QemuAlarmClock *qmp_query_alarm_clock(Error **errp)
+ {
+ QemuAlarmClock *clock;
+ int64_t deadline;
+
+ clock = g_malloc0(sizeof(*clock));
+
+ deadline = qemu_next_alarm_deadline();
+ if (deadline > 0) {
+ clock->has_next_deadline = true;
+ clock->next_deadline = deadline;
+ }
+ clock->clock_name = g_strdup(alarm_timer->name);
+
+ return clock;
+ }
+
+There are a number of things to be noticed:
+
+1. The QemuAlarmClock type is automatically generated by the QAPI framework,
+ its members correspond to the type's specification in the schema file
+2. As specified in the schema file, the function returns a QemuAlarmClock
+ instance and takes no arguments (besides the "errp" one, which is mandatory
+ for all QMP functions)
+3. The "clock" variable (which will point to our QAPI type instance) is
+ allocated by the regular g_malloc0() function. Note that we chose to
+ initialize the memory to zero. This is recommended for all QAPI types, as
+   it helps avoid bad surprises (especially with booleans)
+4. Remember that "next_deadline" is optional? All optional members have a
+   'has_MEMBER_NAME' member that should be properly set by the implementation,
+ as shown above
+5. Even static strings, such as "alarm_timer->name", should be dynamically
+ allocated by the implementation. This is so because the QAPI also generates
+ a function to free its types and it cannot distinguish between dynamically
+   and statically allocated strings
+6. You have to include "qapi/qapi-commands-misc.h" in qemu-timer.c
+
+Time to test the new command. Build qemu, run it as described in the "Testing"
+section and try this::
+
+ { "execute": "query-alarm-clock" }
+ {
+ "return": {
+ "next-deadline": 2368219,
+ "clock-name": "dynticks"
+ }
+ }
+
+
+The HMP command
+~~~~~~~~~~~~~~~
+
+Here's the HMP counterpart of the query-alarm-clock command::
+
+ void hmp_info_alarm_clock(Monitor *mon)
+ {
+ QemuAlarmClock *clock;
+ Error *err = NULL;
+
+ clock = qmp_query_alarm_clock(&err);
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+
+ monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name);
+ if (clock->has_next_deadline) {
+ monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n",
+ clock->next_deadline);
+ }
+
+ qapi_free_QemuAlarmClock(clock);
+ }
+
+It's important to notice that hmp_info_alarm_clock() calls
+qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock().
+For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME()
+function and that's what you have to use to free the types you define and
+qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section).
+If the QMP call returns a string, then you should use g_free() to free it.
+
+Also note that hmp_info_alarm_clock() performs error handling. That's not
+strictly required if you're sure the QMP function doesn't return errors, but
+it's good practice to always check for errors.
+
+Another important detail is that HMP's "info" commands don't go into the
+hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined
+in the monitor/misc.c file. The entry for the "info alarmclock" follows::
+
+ {
+ .name = "alarmclock",
+ .args_type = "",
+ .params = "",
+ .help = "show information about the alarm clock",
+ .cmd = hmp_info_alarm_clock,
+ },
+
+To test this, run qemu and type "info alarmclock" in the user monitor.
+
+
+Returning Lists
+~~~~~~~~~~~~~~~
+
+For this example, we're going to return all available methods for the timer
+alarm, which is pretty much what the command-line option "-clock ?" does,
+except that we will also indicate which method is in use.
+
+The first step is to define a new type::
+
+ ##
+ # @TimerAlarmMethod
+ #
+ # Timer alarm method information.
+ #
+ # @method-name: The method's name.
+ #
+ # @current: true if this alarm method is currently in use, false otherwise
+ #
+ # Since: 1.0
+ ##
+ { 'type': 'TimerAlarmMethod',
+ 'data': { 'method-name': 'str', 'current': 'bool' } }
+
+The command will be called "query-alarm-methods", here is its schema
+specification::
+
+ ##
+ # @query-alarm-methods
+ #
+ # Returns information about available alarm methods.
+ #
+ # Returns: a list of @TimerAlarmMethod for each method
+ #
+ # Since: 1.0
+ ##
+ { 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
+
+Notice the syntax for returning lists: "'returns': ['TimerAlarmMethod']". This
+should be read as "returns a list of TimerAlarmMethod instances".
+
+The C implementation follows::
+
+ TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
+ {
+ TimerAlarmMethodList *method_list = NULL;
+ const struct qemu_alarm_timer *p;
+ bool current = true;
+
+ for (p = alarm_timers; p->name; p++) {
+        TimerAlarmMethod *value = g_malloc0(sizeof(*value));
+ value->method_name = g_strdup(p->name);
+ value->current = current;
+ QAPI_LIST_PREPEND(method_list, value);
+ current = false;
+ }
+
+ return method_list;
+ }
+
+The most important difference from the previous examples is the
+TimerAlarmMethodList type, which is automatically generated by the QAPI from
+the TimerAlarmMethod type.
+
+Each list node is represented by a TimerAlarmMethodList instance. Inside the
+for loop we allocate the node's contents, a TimerAlarmMethod instance, fill it
+in, and then let QAPI_LIST_PREPEND() allocate a new list node, store the
+instance in the node's "value" member and prepend the node to the list.
+
+Notice that the "current" variable is used as "true" only in the first
+iteration of the loop. That's because the alarm timer method in use is the
+first element of the alarm_timers array. Also notice that QAPI lists are handled
+by hand and we return the head of the list.
+
+Now build qemu, run it as explained in the "Testing" section and try our new
+command::
+
+ { "execute": "query-alarm-methods" }
+ {
+ "return": [
+ {
+ "current": false,
+ "method-name": "unix"
+ },
+ {
+ "current": true,
+ "method-name": "dynticks"
+ }
+ ]
+ }
+
+The HMP counterpart is a bit more complex than previous examples because it
+has to traverse the list; it's shown below for reference::
+
+ void hmp_info_alarm_methods(Monitor *mon)
+ {
+ TimerAlarmMethodList *method_list, *method;
+ Error *err = NULL;
+
+ method_list = qmp_query_alarm_methods(&err);
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+
+ for (method = method_list; method; method = method->next) {
+ monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
+ method->value->method_name);
+ }
+
+ qapi_free_TimerAlarmMethodList(method_list);
+ }
+
+Writing a debugging aid returning unstructured text
+---------------------------------------------------
+
+As discussed in section `Modelling data in QAPI`_, it is required that
+commands intended for machine usage use fine-grained QAPI data types.
+The exception to this rule applies when the command is solely intended
+as a debugging aid and allows for returning unstructured text. This is
+commonly needed for query commands that report aspects of QEMU's
+internal state that are useful to human operators.
+
+In this example we will consider a simplified variant of the HMP
+command ``info roms``. Following the earlier rules, this command will
+need to live under the ``x-`` name prefix, so its QMP implementation
+will be called ``x-query-roms``. It will have no parameters and will
+return a single text string::
+
+ { 'struct': 'HumanReadableText',
+ 'data': { 'human-readable-text': 'str' } }
+
+ { 'command': 'x-query-roms',
+ 'returns': 'HumanReadableText' }
+
+The ``HumanReadableText`` struct is intended to be used for all
+commands under the ``x-`` name prefix that return unstructured
+text targeted at humans. It should never be used for commands outside
+the ``x-`` name prefix, as those should be using structured QAPI types.
+
+Implementing the QMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The QMP implementation will typically involve creating a ``GString``
+object and printing formatted data into it::
+
+ HumanReadableText *qmp_x_query_roms(Error **errp)
+ {
+ g_autoptr(GString) buf = g_string_new("");
+ Rom *rom;
+
+ QTAILQ_FOREACH(rom, &roms, next) {
+        g_string_append_printf(buf, "%s size=0x%06zx name=\"%s\"\n",
+ memory_region_name(rom->mr),
+ rom->romsize,
+ rom->name);
+ }
+
+ return human_readable_text_from_str(buf);
+ }
+
+
+Implementing the HMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the QMP command is in place, we can also make it available in
+the human monitor (HMP) as shown in previous examples. The HMP
+implementations will all look fairly similar, as all they need do is
+invoke the QMP command and then print the resulting text or error
+message. Here's the implementation of the "info roms" HMP command::
+
+ void hmp_info_roms(Monitor *mon, const QDict *qdict)
+ {
+        Error *err = NULL;
+ g_autoptr(HumanReadableText) info = qmp_x_query_roms(&err);
+
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+ monitor_printf(mon, "%s", info->human_readable_text);
+ }
+
+Also, you have to add the function's prototype to the hmp.h file.
+
+There's one last step to actually make the command available to
+monitor users: we should add it to the hmp-commands-info.hx file::
+
+ {
+ .name = "roms",
+ .args_type = "",
+ .params = "",
+ .help = "show roms",
+ .cmd = hmp_info_roms,
+ },
+
+The case of writing an HMP info handler that calls a no-parameter QMP query
+command is quite common. To simplify the implementation there is a general
+purpose HMP info handler for this scenario. All that is required to expose
+a no-parameter QMP query command via HMP is to declare it using the
+'.cmd_info_hrt' field to point to the QMP handler, and leave the '.cmd'
+field NULL::
+
+ {
+ .name = "roms",
+ .args_type = "",
+ .params = "",
+ .help = "show roms",
+ .cmd_info_hrt = qmp_x_query_roms,
+ },