Diffstat (limited to 'docs/devel')
50 files changed, 16059 insertions, 0 deletions
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst new file mode 100644 index 000000000..52baa0736 --- /dev/null +++ b/docs/devel/atomics.rst @@ -0,0 +1,507 @@ +========================= +Atomic operations in QEMU +========================= + +CPUs perform independent memory operations effectively in random order. +but this can be a problem for CPU-CPU interaction (including interactions +between QEMU and the guest). Multi-threaded programs use various tools +to instruct the compiler and the CPU to restrict the order to something +that is consistent with the expectations of the programmer. + +The most basic tool is locking. Mutexes, condition variables and +semaphores are used in QEMU, and should be the default approach to +synchronization. Anything else is considerably harder, but it's +also justified more often than one would like; +the most performance-critical parts of QEMU in particular require +a very low level approach to concurrency, involving memory barriers +and atomic operations. The semantics of concurrent memory accesses are governed +by the C11 memory model. + +QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to +provide better portability and a less verbose syntax. ``qemu/atomic.h`` +provides macros that fall in three camps: + +- compiler barriers: ``barrier()``; + +- weak atomic access and manual memory barriers: ``qatomic_read()``, + ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, + ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``; + +- sequentially consistent atomic access: everything else. + +In general, use of ``qemu/atomic.h`` should be wrapped with more easily +used data structures (e.g. the lock-free singly-linked list operations +``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization +primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of +atomic operations and memory barriers should be limited to inter-thread +checking of flags and documented thoroughly. + + + +Compiler memory barrier +======================= + +``barrier()`` prevents the compiler from moving the memory accesses on +either side of it to the other side. The compiler barrier has no direct +effect on the CPU, which may then reorder things however it wishes. + +``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some +architectures, CPU guarantees are strong enough that blocking compiler +optimizations already ensures the correct order of execution. In this +case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple +compiler barriers. + +Still, ``barrier()`` can be useful when writing code that can be interrupted +by signal handlers. + + +Sequentially consistent atomic access +===================================== + +Most of the operations in the ``qemu/atomic.h`` header ensure *sequential +consistency*, where "the result of any execution is the same as if the +operations of all the processors were executed in some sequential order, +and the operations of each individual processor appear in this sequence +in the order specified by its program". 
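For example (a hypothetical sketch, not code taken from the QEMU tree), a statistics counter that is bumped from several threads can simply use one of the sequentially consistent read-modify-write operations listed below; all threads then agree on a single order for the increments::

    /* Hypothetical example: a counter shared between threads.
     * qatomic_fetch_inc() is a sequentially consistent read-modify-write
     * operation, so no two threads can observe the same old value.
     * Inside QEMU, qemu/osdep.h pulls in qemu/atomic.h.
     */
    static unsigned int num_requests;

    void count_request(void)
    {
        unsigned int old = qatomic_fetch_inc(&num_requests);

        if (old == 0) {
            /* exactly one thread sees the 0 -> 1 transition */
        }
    }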
+ +``qemu/atomic.h`` provides the following set of atomic read-modify-write +operations:: + + void qatomic_inc(ptr) + void qatomic_dec(ptr) + void qatomic_add(ptr, val) + void qatomic_sub(ptr, val) + void qatomic_and(ptr, val) + void qatomic_or(ptr, val) + + typeof(*ptr) qatomic_fetch_inc(ptr) + typeof(*ptr) qatomic_fetch_dec(ptr) + typeof(*ptr) qatomic_fetch_add(ptr, val) + typeof(*ptr) qatomic_fetch_sub(ptr, val) + typeof(*ptr) qatomic_fetch_and(ptr, val) + typeof(*ptr) qatomic_fetch_or(ptr, val) + typeof(*ptr) qatomic_fetch_xor(ptr, val) + typeof(*ptr) qatomic_fetch_inc_nonzero(ptr) + typeof(*ptr) qatomic_xchg(ptr, val) + typeof(*ptr) qatomic_cmpxchg(ptr, old, new) + +all of which return the old value of ``*ptr``. These operations are +polymorphic; they operate on any type that is as wide as a pointer or +smaller. + +Similar operations return the new value of ``*ptr``:: + + typeof(*ptr) qatomic_inc_fetch(ptr) + typeof(*ptr) qatomic_dec_fetch(ptr) + typeof(*ptr) qatomic_add_fetch(ptr, val) + typeof(*ptr) qatomic_sub_fetch(ptr, val) + typeof(*ptr) qatomic_and_fetch(ptr, val) + typeof(*ptr) qatomic_or_fetch(ptr, val) + typeof(*ptr) qatomic_xor_fetch(ptr, val) + +``qemu/atomic.h`` also provides loads and stores that cannot be reordered +with each other:: + + typeof(*ptr) qatomic_mb_read(ptr) + void qatomic_mb_set(ptr, val) + +However these do not provide sequential consistency and, in particular, +they do not participate in the total ordering enforced by +sequentially-consistent operations. For this reason they are deprecated. +They should instead be replaced with any of the following (ordered from +easiest to hardest): + +- accesses inside a mutex or spinlock + +- lightweight synchronization primitives such as ``QemuEvent`` + +- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing + or accessing a new version of a data structure + +- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for + loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb`` + to forbid reordering subsequent loads before a store. + + +Weak atomic access and manual memory barriers +============================================= + +Compared to sequentially consistent atomic access, programming with +weaker consistency models can be considerably more complicated. +The only guarantees that you can rely upon in this case are: + +- atomic accesses will not cause data races (and hence undefined behavior); + ordinary accesses instead cause data races if they are concurrent with + other accesses of which at least one is a write. In order to ensure this, + the compiler will not optimize accesses out of existence, create unsolicited + accesses, or perform other similar optimizations. + +- acquire operations will appear to happen, with respect to the other + components of the system, before all the LOAD or STORE operations + specified afterwards. + +- release operations will appear to happen, with respect to the other + components of the system, after all the LOAD or STORE operations + specified before. + +- release operations will *synchronize with* acquire operations; + see :ref:`acqrel` for a detailed explanation. + +When using this model, variables are accessed with: + +- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from + optimizing accesses out of existence and creating unsolicited + accesses, but do not otherwise impose any ordering on loads and + stores: both the compiler and the processor are free to reorder + them. 
+ +- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to + happen, with respect to the other components of the system, + before all the LOAD or STORE operations specified afterwards. + Operations coming before ``qatomic_load_acquire()`` can still be + reordered after it. + +- ``qatomic_store_release()``, which guarantees the STORE to appear to + happen, with respect to the other components of the system, + after all the LOAD or STORE operations specified before. + Operations coming after ``qatomic_store_release()`` can still be + reordered before it. + +Restrictions to the ordering of accesses can also be specified +using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, +``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``. + +Memory barriers control the order of references to shared memory. +They come in six kinds: + +- ``smp_rmb()`` guarantees that all the LOAD operations specified before + the barrier will appear to happen before all the LOAD operations + specified after the barrier with respect to the other components of + the system. + + In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not + required to have any effect on stores. + +- ``smp_wmb()`` guarantees that all the STORE operations specified before + the barrier will appear to happen before all the STORE operations + specified after the barrier with respect to the other components of + the system. + + In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not + required to have any effect on loads. + +- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before + the barrier will appear to happen before all the LOAD or STORE operations + specified after the barrier with respect to the other components of + the system. + +- ``smp_mb_release()`` guarantees that all the STORE operations specified *after* + the barrier will appear to happen after all the LOAD or STORE operations + specified *before* the barrier with respect to the other components of + the system. + +- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified + before the barrier will appear to happen before all the LOAD and + STORE operations specified after the barrier with respect to the other + components of the system. + + ``smp_mb()`` puts a partial ordering on both loads and stores. It is + stronger than both a read and a write memory barrier; it implies both + ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs + coming before the barrier from overtaking LOADs coming after the + barrier and vice versa. + +- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On + most processors, whenever two loads are performed such that the + second depends on the result of the first (e.g., the first load + retrieves the address to which the second load will be directed), + the processor will guarantee that the first LOAD will appear to happen + before the second with respect to the other components of the system. + However, this is not always true---for example, it was not true on + Alpha processors. Whenever this kind of access happens to shared + memory (that is not protected by a lock), a read barrier is needed, + and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``. + + Note that the first load really has to have a _data_ dependency and not + a control dependency. 
If the address for the second load is dependent + on the first load, but the dependency is through a conditional rather + than actually loading the address itself, then it's a _control_ + dependency and a full read barrier or better is required. + + +Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are +mostly used when a data structure has one thread that is always a writer +and one thread that is always a reader: + + +----------------------------------+----------------------------------+ + | thread 1 | thread 2 | + +==================================+==================================+ + | :: | :: | + | | | + | qatomic_store_release(&a, x); | y = qatomic_load_acquire(&b); | + | qatomic_store_release(&b, y); | x = qatomic_load_acquire(&a); | + +----------------------------------+----------------------------------+ + +In this case, correctness is easy to check for using the "pairing" +trick that is explained below. + +Sometimes, a thread is accessing many variables that are otherwise +unrelated to each other (for example because, apart from the current +thread, exactly one other thread will read or write each of these +variables). In this case, it is possible to "hoist" the barriers +outside a loop. For example: + + +------------------------------------------+----------------------------------+ + | before | after | + +==========================================+==================================+ + | :: | :: | + | | | + | n = 0; | n = 0; | + | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) | + | n += qatomic_load_acquire(&a[i]); | n += qatomic_read(&a[i]); | + | | smp_mb_acquire(); | + +------------------------------------------+----------------------------------+ + | :: | :: | + | | | + | | smp_mb_release(); | + | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) | + | qatomic_store_release(&a[i], false); | qatomic_set(&a[i], false); | + +------------------------------------------+----------------------------------+ + +Splitting a loop can also be useful to reduce the number of barriers: + + +------------------------------------------+----------------------------------+ + | before | after | + +==========================================+==================================+ + | :: | :: | + | | | + | n = 0; | smp_mb_release(); | + | for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) | + | qatomic_store_release(&a[i], false); | qatomic_set(&a[i], false); | + | smp_mb(); | smb_mb(); | + | n += qatomic_read(&b[i]); | n = 0; | + | } | for (i = 0; i < 10; i++) | + | | n += qatomic_read(&b[i]); | + +------------------------------------------+----------------------------------+ + +In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer +as well) ``smp_wmb()``: + + +------------------------------------------+----------------------------------+ + | before | after | + +==========================================+==================================+ + | :: | :: | + | | | + | | smp_mb_release(); | + | for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) | + | qatomic_store_release(&a[i], false); | qatomic_set(&a[i], false); | + | qatomic_store_release(&b[i], false); | smb_wmb(); | + | } | for (i = 0; i < 10; i++) | + | | qatomic_set(&b[i], false); | + +------------------------------------------+----------------------------------+ + + +.. 
_acqrel: + +Acquire/release pairing and the *synchronizes-with* relation +------------------------------------------------------------ + +Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have +either *acquire* or *release* semantics [#rmw]_. This has two effects: + +.. [#rmw] Read-modify-write operations can have both---acquire applies to the + read part, and release to the write. + +- within a thread, they are ordered either before subsequent operations + (for acquire) or after previous operations (for release). + +- if a release operation in one thread *synchronizes with* an acquire operation + in another thread, the ordering constraints propagates from the first to the + second thread. That is, everything before the release operation in the + first thread is guaranteed to *happen before* everything after the + acquire operation in the second thread. + +The concept of acquire and release semantics is not exclusive to atomic +operations; almost all higher-level synchronization primitives also have +acquire or release semantics. For example: + +- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has + release semantics and synchronizes with a ``pthread_mutex_lock`` for the + same mutex. + +- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics; + ``pthread_cond_wait`` has both release semantics (synchronizing with + ``pthread_mutex_lock``) and acquire semantics (synchronizing with + ``pthread_mutex_unlock`` and signaling of the condition variable). + +- ``pthread_create`` has release semantics and synchronizes with the start + of the new thread; ``pthread_join`` has acquire semantics and synchronizes + with the exiting of the thread. + +- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has + acquire semantics. + +For example, in the following example there are no atomic accesses, but still +thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit`` +(release) and ``pthread_join`` (acquire): + + +----------------------+-------------------------------+ + | thread 1 | thread 2 | + +======================+===============================+ + | :: | :: | + | | | + | *a = 1; | | + | pthread_exit(a); | pthread_join(thread1, &a); | + | | x = *a; | + +----------------------+-------------------------------+ + +Synchronization between threads basically descends from this pairing of +a release operation and an acquire operation. Therefore, atomic operations +other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be +paired with another operation of the opposite kind: an acquire operation +will pair with a release operation and vice versa. This rule of thumb is +extremely useful; in the case of QEMU, however, note that the other +operation may actually be in a driver that runs in the guest! + +``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``, +``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count +as acquire operations. ``smp_wmb()``, ``smp_mb_release()``, +``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release +operations. ``smp_mb()`` counts as both acquire and release, therefore +it can pair with any other atomic operation. 
Here is an example: + + +----------------------+------------------------------+ + | thread 1 | thread 2 | + +======================+==============================+ + | :: | :: | + | | | + | qatomic_set(&a, 1);| | + | smp_wmb(); | | + | qatomic_set(&b, 2);| x = qatomic_read(&b); | + | | smp_rmb(); | + | | y = qatomic_read(&a); | + +----------------------+------------------------------+ + +Note that a load-store pair only counts if the two operations access the +same variable: that is, a store-release on a variable ``x`` *synchronizes +with* a load-acquire on a variable ``x``, while a release barrier +synchronizes with any acquire operation. The following example shows +correct synchronization: + + +--------------------------------+--------------------------------+ + | thread 1 | thread 2 | + +================================+================================+ + | :: | :: | + | | | + | qatomic_set(&a, 1); | | + | qatomic_store_release(&b, 2);| x = qatomic_load_acquire(&b);| + | | y = qatomic_read(&a); | + +--------------------------------+--------------------------------+ + +Acquire and release semantics of higher-level primitives can also be +relied upon for the purpose of establishing the *synchronizes with* +relation. + +Note that the "writing" thread is accessing the variables in the +opposite order as the "reading" thread. This is expected: stores +before a release operation will normally match the loads after +the acquire operation, and vice versa. In fact, this happened already +in the ``pthread_exit``/``pthread_join`` example above. + +Finally, this more complex example has more than two accesses and data +dependency barriers. It also does not use atomic accesses whenever there +cannot be a data race: + + +----------------------+------------------------------+ + | thread 1 | thread 2 | + +======================+==============================+ + | :: | :: | + | | | + | b[2] = 1; | | + | smp_wmb(); | | + | x->i = 2; | | + | smp_wmb(); | | + | qatomic_set(&a, x);| x = qatomic_read(&a); | + | | smp_read_barrier_depends(); | + | | y = x->i; | + | | smp_read_barrier_depends(); | + | | z = b[y]; | + +----------------------+------------------------------+ + +Comparison with Linux kernel primitives +======================================= + +Here is a list of differences between Linux kernel atomic operations +and memory barriers, and the equivalents in QEMU: + +- atomic operations in Linux are always on a 32-bit int type and + use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic + and use normal C types. + +- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee + at all. Linux 4.1 updated them to implement volatile + semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``). + + QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed + semantics if the compiler supports it, and volatile semantics otherwise. + Both semantics prevent the compiler from doing certain transformations; + the difference is that atomic accesses are guaranteed to be atomic, + while volatile accesses aren't. Thus, in the volatile case we just cross + our fingers hoping that the compiler will generate atomic accesses, + since we assume the variables passed are machine-word sized and + properly aligned. + + No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either + Linux or QEMU. 
+ +- atomic read-modify-write operations in Linux are of three kinds: + + ===================== ========================================= + ``atomic_OP`` returns void + ``atomic_OP_return`` returns new value of the variable + ``atomic_fetch_OP`` returns the old value of the variable + ``atomic_cmpxchg`` returns the old value of the variable + ===================== ========================================= + + In QEMU, the second kind is named ``atomic_OP_fetch``. + +- different atomic read-modify-write operations in Linux imply + a different set of memory barriers; in QEMU, all of them enforce + sequential consistency. + +- in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in + the total ordering enforced by sequentially-consistent operations. + This is because QEMU uses the C11 memory model. The following example + is correct in Linux but not in QEMU: + + +----------------------------------+--------------------------------+ + | Linux (correct) | QEMU (incorrect) | + +==================================+================================+ + | :: | :: | + | | | + | a = atomic_fetch_add(&x, 2); | a = qatomic_fetch_add(&x, 2);| + | b = READ_ONCE(&y); | b = qatomic_read(&y); | + +----------------------------------+--------------------------------+ + + because the read of ``y`` can be moved (by either the processor or the + compiler) before the write of ``x``. + + Fixing this requires an ``smp_mb()`` memory barrier between the write + of ``x`` and the read of ``y``. In the common case where only one thread + writes ``x``, it is also possible to write it like this: + + +--------------------------------+ + | QEMU (correct) | + +================================+ + | :: | + | | + | a = qatomic_read(&x); | + | qatomic_set(&x, a + 2); | + | smp_mb(); | + | b = qatomic_read(&y); | + +--------------------------------+ + +Sources +======= + +- ``Documentation/memory-barriers.txt`` from the Linux kernel diff --git a/docs/devel/bitops.rst b/docs/devel/bitops.rst new file mode 100644 index 000000000..6addaecf8 --- /dev/null +++ b/docs/devel/bitops.rst @@ -0,0 +1,8 @@ +================== +Bitwise operations +================== + +The header ``qemu/bitops.h`` provides utility functions for +performing bitwise operations. + +.. kernel-doc:: include/qemu/bitops.h diff --git a/docs/devel/blkdebug.txt b/docs/devel/blkdebug.txt new file mode 100644 index 000000000..0b0c128d3 --- /dev/null +++ b/docs/devel/blkdebug.txt @@ -0,0 +1,162 @@ +Block I/O error injection using blkdebug +---------------------------------------- +Copyright (C) 2014-2015 Red Hat Inc + +This work is licensed under the terms of the GNU GPL, version 2 or later. See +the COPYING file in the top-level directory. + +The blkdebug block driver is a rule-based error injection engine. It can be +used to exercise error code paths in block drivers including ENOSPC (out of +space) and EIO. + +This document gives an overview of the features available in blkdebug. + +Background +---------- +Block drivers have many error code paths that handle I/O errors. Image formats +are especially complex since metadata I/O errors during cluster allocation or +while updating tables happen halfway through request processing and require +discipline to keep image files consistent. + +Error injection allows test cases to trigger I/O errors at specific points. +This way, all error paths can be tested to make sure they are correct. 
+ +Rules +----- +The blkdebug block driver takes a list of "rules" that tell the error injection +engine when to fail an I/O request. + +Each I/O request is evaluated against the rules. If a rule matches the request +then its "action" is executed. + +Rules can be placed in a configuration file; the configuration file +follows the same .ini-like format used by QEMU's -readconfig option, and +each section of the file represents a rule. + +The following configuration file defines a single rule: + + $ cat blkdebug.conf + [inject-error] + event = "read_aio" + errno = "28" + +This rule fails all aio read requests with ENOSPC (28). Note that the errno +value depends on the host. On Linux, see +/usr/include/asm-generic/errno-base.h for errno values. + +Invoke QEMU as follows: + + $ qemu-system-x86_64 + -drive if=none,cache=none,file=blkdebug:blkdebug.conf:test.img,id=drive0 \ + -device virtio-blk-pci,drive=drive0,id=virtio-blk-pci0 + +Rules support the following attributes: + + event - which type of operation to match (e.g. read_aio, write_aio, + flush_to_os, flush_to_disk). See the "Events" section for + information on events. + + state - (optional) the engine must be in this state number in order for this + rule to match. See the "State transitions" section for information + on states. + + errno - the numeric errno value to return when a request matches this rule. + The errno values depend on the host since the numeric values are not + standardized in the POSIX specification. + + sector - (optional) a sector number that the request must overlap in order to + match this rule + + once - (optional, default "off") only execute this action on the first + matching request + + immediately - (optional, default "off") return a NULL BlockAIOCB + pointer and fail without an errno instead. This + exercises the code path where BlockAIOCB fails and the + caller's BlockCompletionFunc is not invoked. + +Events +------ +Block drivers provide information about the type of I/O request they are about +to make so rules can match specific types of requests. For example, the qcow2 +block driver tells blkdebug when it accesses the L1 table so rules can match +only L1 table accesses and not other metadata or guest data requests. + +The core events are: + + read_aio - guest data read + + write_aio - guest data write + + flush_to_os - write out unwritten block driver state (e.g. cached metadata) + + flush_to_disk - flush the host block device's disk cache + +See qapi/block-core.json:BlkdebugEvent for the full list of events. +You may need to grep block driver source code to understand the +meaning of specific events. + +State transitions +----------------- +There are cases where more power is needed to match a particular I/O request in +a longer sequence of requests. For example: + + write_aio + flush_to_disk + write_aio + +How do we match the 2nd write_aio but not the first? This is where state +transitions come in. + +The error injection engine has an integer called the "state" that always starts +initialized to 1. The state integer is internal to blkdebug and cannot be +observed from outside but rules can interact with it for powerful matching +behavior. + +Rules can be conditional on the current state and they can transition to a new +state. + +When a rule's "state" attribute is non-zero then the current state must equal +the attribute in order for the rule to match. 
+ +For example, to match the 2nd write_aio: + + [set-state] + event = "write_aio" + state = "1" + new_state = "2" + + [inject-error] + event = "write_aio" + state = "2" + errno = "5" + +The first write_aio request matches the set-state rule and transitions from +state 1 to state 2. Once state 2 has been entered, the set-state rule no +longer matches since it requires state 1. But the inject-error rule now +matches the next write_aio request and injects EIO (5). + +State transition rules support the following attributes: + + event - which type of operation to match (e.g. read_aio, write_aio, + flush_to_os, flush_to_disk). See the "Events" section for + information on events. + + state - (optional) the engine must be in this state number in order for this + rule to match + + new_state - transition to this state number + +Suspend and resume +------------------ +Exercising code paths in block drivers may require specific ordering amongst +concurrent requests. The "breakpoint" feature allows requests to be halted on +a blkdebug event and resumed later. This makes it possible to achieve +deterministic ordering when multiple requests are in flight. + +Breakpoints on blkdebug events are associated with a user-defined "tag" string. +This tag serves as an identifier by which the request can be resumed at a later +point. + +See the qemu-io(1) break, resume, remove_break, and wait_break commands for +details. diff --git a/docs/devel/blkverify.txt b/docs/devel/blkverify.txt new file mode 100644 index 000000000..aca826c51 --- /dev/null +++ b/docs/devel/blkverify.txt @@ -0,0 +1,69 @@ += Block driver correctness testing with blkverify = + +== Introduction == + +This document describes how to use the blkverify protocol to test that a block +driver is operating correctly. + +It is difficult to test and debug block drivers against real guests. Often +processes inside the guest will crash because corrupt sectors were read as part +of the executable. Other times obscure errors are raised by a program inside +the guest. These issues are extremely hard to trace back to bugs in the block +driver. + +Blkverify solves this problem by catching data corruption inside QEMU the first +time bad data is read and reporting the disk sector that is corrupted. + +== How it works == + +The blkverify protocol has two child block devices, the "test" device and the +"raw" device. Read/write operations are mirrored to both devices so their +state should always be in sync. + +The "raw" device is a raw image, a flat file, that has identical starting +contents to the "test" image. The idea is that the "raw" device will handle +read/write operations correctly and not corrupt data. It can be used as a +reference for comparison against the "test" device. + +After a mirrored read operation completes, blkverify will compare the data and +raise an error if it is not identical. This makes it possible to catch the +first instance where corrupt data is read. + +== Example == + +Imagine raw.img has 0xcd repeated throughout its first sector: + + $ ./qemu-io -c 'read -v 0 512' raw.img + 00000000: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ + 00000010: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ + [...] + 000001e0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ + 000001f0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ 
+ read 512/512 bytes at offset 0 + 512.000000 bytes, 1 ops; 0.0000 sec (97.656 MiB/sec and 200000.0000 ops/sec) + +And test.img is corrupt, its first sector is zeroed when it shouldn't be: + + $ ./qemu-io -c 'read -v 0 512' test.img + 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ + 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ + [...] + 000001e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ + 000001f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ + read 512/512 bytes at offset 0 + 512.000000 bytes, 1 ops; 0.0000 sec (81.380 MiB/sec and 166666.6667 ops/sec) + +This error is caught by blkverify: + + $ ./qemu-io -c 'read 0 512' blkverify:a.img:b.img + blkverify: read sector_num=0 nb_sectors=4 contents mismatch in sector 0 + +A more realistic scenario is verifying the installation of a guest OS: + + $ ./qemu-img create raw.img 16G + $ ./qemu-img create -f qcow2 test.qcow2 16G + $ ./qemu-system-x86_64 -cdrom debian.iso \ + -drive file=blkverify:raw.img:test.qcow2 + +If the installation is aborted when blkverify detects corruption, use qemu-io +to explore the contents of the disk image at the sector in question. diff --git a/docs/devel/block-coroutine-wrapper.rst b/docs/devel/block-coroutine-wrapper.rst new file mode 100644 index 000000000..412851986 --- /dev/null +++ b/docs/devel/block-coroutine-wrapper.rst @@ -0,0 +1,54 @@ +======================= +block-coroutine-wrapper +======================= + +A lot of functions in QEMU block layer (see ``block/*``) can only be +called in coroutine context. Such functions are normally marked by the +coroutine_fn specifier. Still, sometimes we need to call them from +non-coroutine context; for this we need to start a coroutine, run the +needed function from it and wait for the coroutine to finish in a +BDRV_POLL_WHILE() loop. To run a coroutine we need a function with one +void* argument. So for each coroutine_fn function which needs a +non-coroutine interface, we should define a structure to pack the +parameters, define a separate function to unpack the parameters and +call the original function and finally define a new interface function +with same list of arguments as original one, which will pack the +parameters into a struct, create a coroutine, run it and wait in +BDRV_POLL_WHILE() loop. It's boring to create such wrappers by hand, +so we have a script to generate them. + +Usage +===== + +Assume we have defined the ``coroutine_fn`` function +``bdrv_co_foo(<some args>)`` and need a non-coroutine interface for it, +called ``bdrv_foo(<same args>)``. In this case the script can help. To +trigger the generation: + +1. You need ``bdrv_foo`` declaration somewhere (for example, in + ``block/coroutines.h``) with the ``generated_co_wrapper`` mark, + like this: + +.. code-block:: c + + int generated_co_wrapper bdrv_foo(<some args>); + +2. You need to feed this declaration to block-coroutine-wrapper script. + For this, add the .h (or .c) file with the declaration to the + ``input: files(...)`` list of ``block_gen_c`` target declaration in + ``block/meson.build`` + +You are done. During the build, coroutine wrappers will be generated in +``<BUILD_DIR>/block/block-gen.c``. + +Links +===== + +1. The script location is ``scripts/block-coroutine-wrapper.py``. + +2. Generic place for private ``generated_co_wrapper`` declarations is + ``block/coroutines.h``, for public declarations: + ``include/block/block.h`` + +3. 
The core API of generated coroutine wrappers is placed in + (not generated) ``block/block-gen.h`` diff --git a/docs/devel/build-system.rst b/docs/devel/build-system.rst new file mode 100644 index 000000000..431caba7a --- /dev/null +++ b/docs/devel/build-system.rst @@ -0,0 +1,486 @@ +================================== +The QEMU build system architecture +================================== + +This document aims to help developers understand the architecture of the +QEMU build system. As with projects using GNU autotools, the QEMU build +system has two stages, first the developer runs the "configure" script +to determine the local build environment characteristics, then they run +"make" to build the project. There is about where the similarities with +GNU autotools end, so try to forget what you know about them. + + +Stage 1: configure +================== + +The QEMU configure script is written directly in shell, and should be +compatible with any POSIX shell, hence it uses #!/bin/sh. An important +implication of this is that it is important to avoid using bash-isms on +development platforms where bash is the primary host. + +In contrast to autoconf scripts, QEMU's configure is expected to be +silent while it is checking for features. It will only display output +when an error occurs, or to show the final feature enablement summary +on completion. + +Because QEMU uses the Meson build system under the hood, only VPATH +builds are supported. There are two general ways to invoke configure & +perform a build: + + - VPATH, build artifacts outside of QEMU source tree entirely:: + + cd ../ + mkdir build + cd build + ../qemu/configure + make + + - VPATH, build artifacts in a subdir of QEMU source tree:: + + mkdir build + cd build + ../configure + make + +The configure script automatically recognizes +command line options for which a same-named Meson option exists; +dashes in the command line are replaced with underscores. + +Many checks on the compilation environment are still found in configure +rather than ``meson.build``, but new checks should be added directly to +``meson.build``. + +Patches are also welcome to move existing checks from the configure +phase to ``meson.build``. When doing so, ensure that ``meson.build`` does +not use anymore the keys that you have removed from ``config-host.mak``. +Typically these will be replaced in ``meson.build`` by boolean variables, +``get_option('optname')`` invocations, or ``dep.found()`` expressions. +In general, the remaining checks have little or no interdependencies, +so they can be moved one by one. + +Helper functions +---------------- + +The configure script provides a variety of helper functions to assist +developers in checking for system features: + +``do_cc $ARGS...`` + Attempt to run the system C compiler passing it $ARGS... + +``do_cxx $ARGS...`` + Attempt to run the system C++ compiler passing it $ARGS... + +``compile_object $CFLAGS`` + Attempt to compile a test program with the system C compiler using + $CFLAGS. The test program must have been previously written to a file + called $TMPC. The replacement in Meson is the compiler object ``cc``, + which has methods such as ``cc.compiles()``, + ``cc.check_header()``, ``cc.has_function()``. + +``compile_prog $CFLAGS $LDFLAGS`` + Attempt to compile a test program with the system C compiler using + $CFLAGS and link it with the system linker using $LDFLAGS. The test + program must have been previously written to a file called $TMPC. 
+ The replacement in Meson is ``cc.find_library()`` and ``cc.links()``. + +``has $COMMAND`` + Determine if $COMMAND exists in the current environment, either as a + shell builtin, or executable binary, returning 0 on success. The + replacement in Meson is ``find_program()``. + +``check_define $NAME`` + Determine if the macro $NAME is defined by the system C compiler + +``check_include $NAME`` + Determine if the include $NAME file is available to the system C + compiler. The replacement in Meson is ``cc.has_header()``. + +``write_c_skeleton`` + Write a minimal C program main() function to the temporary file + indicated by $TMPC + +``feature_not_found $NAME $REMEDY`` + Print a message to stderr that the feature $NAME was not available + on the system, suggesting the user try $REMEDY to address the + problem. + +``error_exit $MESSAGE $MORE...`` + Print $MESSAGE to stderr, followed by $MORE... and then exit from the + configure script with non-zero status + +``query_pkg_config $ARGS...`` + Run pkg-config passing it $ARGS. If QEMU is doing a static build, + then --static will be automatically added to $ARGS + + +Stage 2: Meson +============== + +The Meson build system is currently used to describe the build +process for: + +1) executables, which include: + + - Tools - ``qemu-img``, ``qemu-nbd``, ``qga`` (guest agent), etc + + - System emulators - ``qemu-system-$ARCH`` + + - Userspace emulators - ``qemu-$ARCH`` + + - Unit tests + +2) documentation + +3) ROMs, which can be either installed as binary blobs or compiled + +4) other data files, such as icons or desktop files + +All executables are built by default, except for some ``contrib/`` +binaries that are known to fail to build on some platforms (for example +32-bit or big-endian platforms). Tests are also built by default, +though that might change in the future. + +The source code is highly modularized, split across many files to +facilitate building of all of these components with as little duplicated +compilation as possible. Using the Meson "sourceset" functionality, +``meson.build`` files group the source files in rules that are +enabled according to the available system libraries and to various +configuration symbols. Sourcesets belong to one of four groups: + +Subsystem sourcesets: + Various subsystems that are common to both tools and emulators have + their own sourceset, for example ``block_ss`` for the block device subsystem, + ``chardev_ss`` for the character device subsystem, etc. These sourcesets + are then turned into static libraries as follows:: + + libchardev = static_library('chardev', chardev_ss.sources(), + name_suffix: 'fa', + build_by_default: false) + + chardev = declare_dependency(link_whole: libchardev) + + As of Meson 0.55.1, the special ``.fa`` suffix should be used for everything + that is used with ``link_whole``, to ensure that the link flags are placed + correctly in the command line. + +Target-independent emulator sourcesets: + Various general purpose helper code is compiled only once and + the .o files are linked into all output binaries that need it. + This includes error handling infrastructure, standard data structures, + platform portability wrapper functions, etc. + + Target-independent code lives in the ``common_ss``, ``softmmu_ss`` and + ``user_ss`` sourcesets. ``common_ss`` is linked into all emulators, + ``softmmu_ss`` only in system emulators, ``user_ss`` only in user-mode + emulators. + + Target-independent sourcesets must exercise particular care when using + ``if_false`` rules. 
The ``if_false`` rule will be used correctly when linking + emulator binaries; however, when *compiling* target-independent files + into .o files, Meson may need to pick *both* the ``if_true`` and + ``if_false`` sides to cater for targets that want either side. To + achieve that, you can add a special rule using the ``CONFIG_ALL`` + symbol:: + + # Some targets have CONFIG_ACPI, some don't, so this is not enough + softmmu_ss.add(when: 'CONFIG_ACPI', if_true: files('acpi.c'), + if_false: files('acpi-stub.c')) + + # This is required as well: + softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('acpi-stub.c')) + +Target-dependent emulator sourcesets: + In the target-dependent set lives CPU emulation, some device emulation and + much glue code. This sometimes also has to be compiled multiple times, + once for each target being built. Target-dependent files are included + in the ``specific_ss`` sourceset. + + Each emulator also includes sources for files in the ``hw/`` and ``target/`` + subdirectories. The subdirectory used for each emulator comes + from the target's definition of ``TARGET_BASE_ARCH`` or (if missing) + ``TARGET_ARCH``, as found in ``default-configs/targets/*.mak``. + + Each subdirectory in ``hw/`` adds one sourceset to the ``hw_arch`` dictionary, + for example:: + + arm_ss = ss.source_set() + arm_ss.add(files('boot.c'), fdt) + ... + hw_arch += {'arm': arm_ss} + + The sourceset is only used for system emulators. + + Each subdirectory in ``target/`` instead should add one sourceset to each + of the ``target_arch`` and ``target_softmmu_arch``, which are used respectively + for all emulators and for system emulators only. For example:: + + arm_ss = ss.source_set() + arm_softmmu_ss = ss.source_set() + ... + target_arch += {'arm': arm_ss} + target_softmmu_arch += {'arm': arm_softmmu_ss} + +Module sourcesets: + There are two dictionaries for modules: ``modules`` is used for + target-independent modules and ``target_modules`` is used for + target-dependent modules. When modules are disabled the ``module`` + source sets are added to ``softmmu_ss`` and the ``target_modules`` + source sets are added to ``specific_ss``. + + Both dictionaries are nested. One dictionary is created per + subdirectory, and these per-subdirectory dictionaries are added to + the toplevel dictionaries. For example:: + + hw_display_modules = {} + qxl_ss = ss.source_set() + ... + hw_display_modules += { 'qxl': qxl_ss } + modules += { 'hw-display': hw_display_modules } + +Utility sourcesets: + All binaries link with a static library ``libqemuutil.a``. This library + is built from several sourcesets; most of them however host generated + code, and the only two of general interest are ``util_ss`` and ``stub_ss``. + + The separation between these two is purely for documentation purposes. + ``util_ss`` contains generic utility files. Even though this code is only + linked in some binaries, sometimes it requires hooks only in some of + these and depend on other functions that are not fully implemented by + all QEMU binaries. ``stub_ss`` links dummy stubs that will only be linked + into the binary if the real implementation is not present. In a way, + the stubs can be thought of as a portable implementation of the weak + symbols concept. + + +The following files concur in the definition of which files are linked +into each emulator: + +``default-configs/devices/*.mak`` + The files under ``default-configs/devices/`` control the boards and devices + that are built into each QEMU system emulation targets. 
They merely contain + a list of config variable definitions such as:: + + include arm-softmmu.mak + CONFIG_XLNX_ZYNQMP_ARM=y + CONFIG_XLNX_VERSAL=y + +``*/Kconfig`` + These files are processed together with ``default-configs/devices/*.mak`` and + describe the dependencies between various features, subsystems and + device models. They are described in :ref:`kconfig` + +``default-configs/targets/*.mak`` + These files mostly define symbols that appear in the ``*-config-target.h`` + file for each emulator [#cfgtarget]_. However, the ``TARGET_ARCH`` + and ``TARGET_BASE_ARCH`` will also be used to select the ``hw/`` and + ``target/`` subdirectories that are compiled into each target. + +.. [#cfgtarget] This header is included by ``qemu/osdep.h`` when + compiling files from the target-specific sourcesets. + +These files rarely need changing unless you are adding a completely +new target, or enabling new devices or hardware for a particular +system/userspace emulation target + + +Adding checks +------------- + +New checks should be added to Meson. Compiler checks can be as simple as +the following:: + + config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h')) + +A more complex task such as adding a new dependency usually +comprises the following tasks: + + - Add a Meson build option to meson_options.txt. + + - Add code to perform the actual feature check. + + - Add code to include the feature status in ``config-host.h`` + + - Add code to print out the feature status in the configure summary + upon completion. + +Taking the probe for SDL2_Image as an example, we have the following +in ``meson_options.txt``:: + + option('sdl_image', type : 'feature', value : 'auto', + description: 'SDL Image support for icons') + +Unless the option was given a non-``auto`` value (on the configure +command line), the detection code must be performed only if the +dependency will be used:: + + sdl_image = not_found + if not get_option('sdl_image').auto() or have_system + sdl_image = dependency('SDL2_image', required: get_option('sdl_image'), + method: 'pkg-config', + static: enable_static) + endif + +This avoids warnings on static builds of user-mode emulators, for example. +Most of the libraries used by system-mode emulators are not available for +static linking. + +The other supporting code is generally simple:: + + # Create config-host.h (if applicable) + config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found()) + + # Summary + summary_info += {'SDL image support': sdl_image.found()} + +For the configure script to parse the new option, the +``scripts/meson-buildoptions.sh`` file must be up-to-date; ``make +update-buildoptions`` (or just ``make``) will take care of updating it. + + +Support scripts +--------------- + +Meson has a special convention for invoking Python scripts: if their +first line is ``#! /usr/bin/env python3`` and the file is *not* executable, +find_program() arranges to invoke the script under the same Python +interpreter that was used to invoke Meson. This is the most common +and preferred way to invoke support scripts from Meson build files, +because it automatically uses the value of configure's --python= option. + +In case the script is not written in Python, use a ``#! /usr/bin/env ...`` +line and make the script executable. 
+ +Scripts written in Python, where it is desirable to make the script +executable (for example for test scripts that developers may want to +invoke from the command line, such as tests/qapi-schema/test-qapi.py), +should be invoked through the ``python`` variable in meson.build. For +example:: + + test('QAPI schema regression tests', python, + args: files('test-qapi.py'), + env: test_env, suite: ['qapi-schema', 'qapi-frontend']) + +This is needed to obey the --python= option passed to the configure +script, which may point to something other than the first python3 +binary on the path. + + +Stage 3: makefiles +================== + +The use of GNU make is required with the QEMU build system. + +The output of Meson is a build.ninja file, which is used with the Ninja +build system. QEMU uses a different approach, where Makefile rules are +synthesized from the build.ninja file. The main Makefile includes these +rules and wraps them so that e.g. submodules are built before QEMU. +The resulting build system is largely non-recursive in nature, in +contrast to common practices seen with automake. + +Tests are also run by the Makefile with the traditional ``make check`` +phony target, while benchmarks are run with ``make bench``. Meson test +suites such as ``unit`` can be run with ``make check-unit`` too. It is also +possible to run tests defined in meson.build with ``meson test``. + +Useful make targets +------------------- + +``help`` + Print a help message for the most common build targets. + +``print-VAR`` + Print the value of the variable VAR. Useful for debugging the build + system. + +Important files for the build system +==================================== + +Statically defined files +------------------------ + +The following key files are statically defined in the source tree, with +the rules needed to build QEMU. Their behaviour is influenced by a +number of dynamically created files listed later. + +``Makefile`` + The main entry point used when invoking make to build all the components + of QEMU. The default 'all' target will naturally result in the build of + every component. Makefile takes care of recursively building submodules + directly via a non-recursive set of rules. + +``*/meson.build`` + The meson.build file in the root directory is the main entry point for the + Meson build system, and it coordinates the configuration and build of all + executables. Build rules for various subdirectories are included in + other meson.build files spread throughout the QEMU source tree. + +``tests/Makefile.include`` + Rules for external test harnesses. These include the TCG tests, + ``qemu-iotests`` and the Avocado-based integration tests. + +``tests/docker/Makefile.include`` + Rules for Docker tests. Like tests/Makefile, this file is included + directly by the top level Makefile; anything defined in this file will + influence the entire build system. + +``tests/vm/Makefile.include`` + Rules for VM-based tests. Like tests/Makefile, this file is included + directly by the top level Makefile; anything defined in this file will + influence the entire build system. + +Dynamically created files +------------------------- + +The following files are generated dynamically by configure in order to +control the behaviour of the statically defined makefiles. This avoids +the need for QEMU makefiles to go through any pre-processing as seen +with autotools, where Makefile.am generates Makefile.in which generates +Makefile. 
+ +Built by configure: + +``config-host.mak`` + When configure has determined the characteristics of the build host it + will write a long list of variables to config-host.mak file. This + provides the various install directories, compiler / linker flags and a + variety of ``CONFIG_*`` variables related to optionally enabled features. + This is imported by the top level Makefile and meson.build in order to + tailor the build output. + + config-host.mak is also used as a dependency checking mechanism. If make + sees that the modification timestamp on configure is newer than that on + config-host.mak, then configure will be re-run. + + The variables defined here are those which are applicable to all QEMU + build outputs. Variables which are potentially different for each + emulator target are defined by the next file... + + +Built by Meson: + +``${TARGET-NAME}-config-devices.mak`` + TARGET-NAME is again the name of a system or userspace emulator. The + config-devices.mak file is automatically generated by make using the + scripts/make_device_config.sh program, feeding it the + default-configs/$TARGET-NAME file as input. + +``config-host.h``, ``$TARGET_NAME-config-target.h``, ``$TARGET_NAME-config-devices.h`` + These files are used by source code to determine what features are + enabled. They are generated from the contents of the corresponding + ``*.mak`` files using Meson's ``configure_file()`` function. + +``build.ninja`` + The build rules. + + +Built by Makefile: + +``Makefile.ninja`` + A Makefile include that bridges to ninja for the actual build. The + Makefile is mostly a list of targets that Meson included in build.ninja. + +``Makefile.mtest`` + The Makefile definitions that let "make check" run tests defined in + meson.build. The rules are produced from Meson's JSON description of + tests (obtained with "meson introspect --tests") through the script + scripts/mtest2make.py. diff --git a/docs/devel/ci-definitions.rst.inc b/docs/devel/ci-definitions.rst.inc new file mode 100644 index 000000000..6d5c6fd9f --- /dev/null +++ b/docs/devel/ci-definitions.rst.inc @@ -0,0 +1,121 @@ +Definition of terms +=================== + +This section defines the terms used in this document and correlates them with +what is currently used on QEMU. + +Automated tests +--------------- + +An automated test is written on a test framework using its generic test +functions/classes. The test framework can run the tests and report their +success or failure [1]_. + +An automated test has essentially three parts: + +1. The test initialization of the parameters, where the expected parameters, + like inputs and expected results, are set up; +2. The call to the code that should be tested; +3. An assertion, comparing the result from the previous call with the expected + result set during the initialization of the parameters. If the result + matches the expected result, the test has been successful; otherwise, it has + failed. + +Unit testing +------------ + +A unit test is responsible for exercising individual software components as a +unit, like interfaces, data structures, and functionality, uncovering errors +within the boundaries of a component. The verification effort is in the +smallest software unit and focuses on the internal processing logic and data +structures. A test case of unit tests should be designed to uncover errors due +to erroneous computations, incorrect comparisons, or improper control flow [2]_. + +On QEMU, unit testing is represented by the 'check-unit' target from 'make'. 
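As an illustration only (this is not an excerpt from the QEMU source tree), a test run by 'check-unit' is typically a small C program built on the GLib testing framework, structured around the three parts described above::

    #include <glib.h>

    /* Hypothetical code under test: a trivial helper. */
    static int add(int a, int b)
    {
        return a + b;
    }

    /* One test case: set up the inputs, call the code under test,
     * and assert that the result matches the expected value. */
    static void test_add(void)
    {
        g_assert_cmpint(add(2, 3), ==, 5);
    }

    int main(int argc, char **argv)
    {
        g_test_init(&argc, &argv, NULL);
        g_test_add_func("/example/add", test_add);
        return g_test_run();
    }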
+ +Functional testing +------------------ + +A functional test focuses on the functional requirement of the software. +Deriving sets of input conditions, the functional tests should fully exercise +all the functional requirements for a program. Functional testing is +complementary to other testing techniques, attempting to find errors like +incorrect or missing functions, interface errors, behavior errors, and +initialization and termination errors [3]_. + +On QEMU, functional testing is represented by the 'check-qtest' target from +'make'. + +System testing +-------------- + +System tests ensure all application elements mesh properly while the overall +functionality and performance are achieved [4]_. Some or all system components +are integrated to create a complete system to be tested as a whole. System +testing ensures that components are compatible, interact correctly, and +transfer the right data at the right time across their interfaces. As system +testing focuses on interactions, use case-based testing is a practical approach +to system testing [5]_. Note that, in some cases, system testing may require +interaction with third-party software, like operating system images, databases, +networks, and so on. + +On QEMU, system testing is represented by the 'check-avocado' target from +'make'. + +Flaky tests +----------- + +A flaky test is defined as a test that exhibits both a passing and a failing +result with the same code on different runs. Some usual reasons for an +intermittent/flaky test are async wait, concurrency, and test order dependency +[6]_. + +Gating +------ + +A gate restricts the move of code from one stage to another on a +test/deployment pipeline. The step move is granted with approval. The approval +can be a manual intervention or a set of tests succeeding [7]_. + +On QEMU, the gating process happens during the pull request. The approval is +done by the project leader running its own set of tests. The pull request gets +merged when the tests succeed. + +Continuous Integration (CI) +--------------------------- + +Continuous integration (CI) requires the builds of the entire application and +the execution of a comprehensive set of automated tests every time there is a +need to commit any set of changes [8]_. The automated tests can be composed of +the unit, functional, system, and other tests. + +Keynotes about continuous integration (CI) [9]_: + +1. System tests may depend on external software (operating system images, + firmware, database, network). +2. It may take a long time to build and test. It may be impractical to build + the system being developed several times per day. +3. If the development platform is different from the target platform, it may + not be possible to run system tests in the developer’s private workspace. + There may be differences in hardware, operating system, or installed + software. Therefore, more time is required for testing the system. + +References +---------- + +.. [1] Sommerville, Ian (2016). Software Engineering. p. 233. +.. [2] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering, + A Practitioner’s Approach. p. 48, 376, 378, 381. +.. [3] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering, + A Practitioner’s Approach. p. 388. +.. [4] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering, + A Practitioner’s Approach. Software Engineering, p. 377. +.. [5] Sommerville, Ian (2016). Software Engineering. p. 59, 232, 240. +.. [6] Luo, Qingzhou, et al. An empirical analysis of flaky tests. 
+ Proceedings of the 22nd ACM SIGSOFT International Symposium on + Foundations of Software Engineering. 2014. +.. [7] Humble, Jez & Farley, David (2010). Continuous Delivery: + Reliable Software Releases Through Build, Test, and Deployment, p. 122. +.. [8] Humble, Jez & Farley, David (2010). Continuous Delivery: + Reliable Software Releases Through Build, Test, and Deployment, p. 55. +.. [9] Sommerville, Ian (2016). Software Engineering. p. 743. diff --git a/docs/devel/ci-jobs.rst.inc b/docs/devel/ci-jobs.rst.inc new file mode 100644 index 000000000..db3f571d5 --- /dev/null +++ b/docs/devel/ci-jobs.rst.inc @@ -0,0 +1,58 @@ +Custom CI/CD variables +====================== + +QEMU CI pipelines can be tuned by setting some CI environment variables. + +Set variable globally in the user's CI namespace +------------------------------------------------ + +Variables can be set globally in the user's CI namespace setting. + +For further information about how to set these variables, please refer to:: + + https://docs.gitlab.com/ee/ci/variables/#add-a-cicd-variable-to-a-project + +Set variable manually when pushing a branch or tag to the user's repository +--------------------------------------------------------------------------- + +Variables can be set manually when pushing a branch or tag, using +git-push command line arguments. + +Example setting the QEMU_CI_EXAMPLE_VAR variable: + +.. code:: + + git push -o ci.variable="QEMU_CI_EXAMPLE_VAR=value" myrepo mybranch + +For further information about how to set these variables, please refer to:: + + https://docs.gitlab.com/ee/user/project/push_options.html#push-options-for-gitlab-cicd + +Here is a list of the most used variables: + +QEMU_CI_AVOCADO_TESTING +~~~~~~~~~~~~~~~~~~~~~~~ +By default, tests using the Avocado framework are not run automatically in +the pipelines (because multiple artifacts have to be downloaded, and if +these artifacts are not already cached, downloading them make the jobs +reach the timeout limit). Set this variable to have the tests using the +Avocado framework run automatically. + +AARCH64_RUNNER_AVAILABLE +~~~~~~~~~~~~~~~~~~~~~~~~ +If you've got access to an aarch64 host that can be used as a gitlab-CI +runner, you can set this variable to enable the tests that require this +kind of host. The runner should be tagged with "aarch64". + +S390X_RUNNER_AVAILABLE +~~~~~~~~~~~~~~~~~~~~~~ +If you've got access to an IBM Z host that can be used as a gitlab-CI +runner, you can set this variable to enable the tests that require this +kind of host. The runner should be tagged with "s390x". + +CENTOS_STREAM_8_x86_64_RUNNER_AVAILABLE +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +If you've got access to a CentOS Stream 8 x86_64 host that can be +used as a gitlab-CI runner, you can set this variable to enable the +tests that require this kind of host. The runner should be tagged with +both "centos_stream_8" and "x86_64". diff --git a/docs/devel/ci-runners.rst.inc b/docs/devel/ci-runners.rst.inc new file mode 100644 index 000000000..7817001fb --- /dev/null +++ b/docs/devel/ci-runners.rst.inc @@ -0,0 +1,117 @@ +Jobs on Custom Runners +====================== + +Besides the jobs run under the various CI systems listed before, there +are a number additional jobs that will run before an actual merge. +These use the same GitLab CI's service/framework already used for all +other GitLab based CI jobs, but rely on additional systems, not the +ones provided by GitLab as "shared runners". 
+
+The architecture of GitLab's CI service allows different machines to
+be set up with GitLab's "agent", called gitlab-runner, which will take
+care of running jobs created by events such as a push to a branch.
+A machine configured in this way with GitLab's gitlab-runner is what
+is called a "custom runner" here.
+
+The GitLab CI job definitions for the custom runners are located under::
+
+  .gitlab-ci.d/custom-runners.yml
+
+Custom runners entail custom machines.  To see a list of the machines
+currently deployed in the QEMU GitLab CI and their maintainers, please
+refer to the QEMU `wiki <https://wiki.qemu.org/AdminContacts>`__.
+
+Machine Setup Howto
+-------------------
+
+For all Linux based systems, the setup can be mostly automated by the
+execution of two Ansible playbooks.  Create an ``inventory`` file
+under ``scripts/ci/setup``, such as this::
+
+  fully.qualified.domain
+  other.machine.hostname
+
+You may need to set some variables in the inventory file itself.  One
+very common need is to tell Ansible to use a Python 3 interpreter on
+those hosts.  This would look like::
+
+  fully.qualified.domain ansible_python_interpreter=/usr/bin/python3
+  other.machine.hostname ansible_python_interpreter=/usr/bin/python3
+
+Build environment
+~~~~~~~~~~~~~~~~~
+
+The ``scripts/ci/setup/build-environment.yml`` Ansible playbook will
+set up machines with the environment needed to perform builds and run
+QEMU tests.  This playbook consists of the installation of various
+required packages (and a general package update while at it).  It
+currently covers a number of different Linux distributions, but it can
+be expanded to cover other systems.
+
+The minimum required version of Ansible successfully tested with this
+playbook is 2.8.0 (a version check is embedded within the playbook
+itself).  To run the playbook, execute::
+
+  cd scripts/ci/setup
+  ansible-playbook -i inventory build-environment.yml
+
+Please note that most of the tasks in the playbook require superuser
+privileges, such as those from the ``root`` account or those obtained
+by ``sudo``.  If necessary, please refer to ``ansible-playbook``
+options such as ``--become``, ``--become-method``, ``--become-user``
+and ``--ask-become-pass``.
+
+gitlab-runner setup and registration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The gitlab-runner agent needs to be installed on each machine that
+will run jobs.  The association between a machine and a GitLab project
+happens with a registration token.  To find the registration token for
+your repository/project, navigate on GitLab's web UI to:
+
+ * Settings (the gears-like icon at the bottom of the left hand side
+   vertical toolbar), then
+ * CI/CD, then
+ * Runners, and click on the "Expand" button, then
+ * Under "Set up a specific Runner manually", look for the value under
+   "And this registration token:"
+
+Copy the ``scripts/ci/setup/vars.yml.template`` file to
+``scripts/ci/setup/vars.yml``.  Then, set the
+``gitlab_runner_registration_token`` variable to the value obtained
+earlier.
+
+To run the playbook, execute::
+
+  cd scripts/ci/setup
+  ansible-playbook -i inventory gitlab-runner.yml
+
+Following the registration, it's necessary to configure the runner tags,
+and optionally other settings, on the GitLab UI.
Navigate to: + + * Settings (the gears like icon), then + * CI/CD, then + * Runners, and click on the "Expand" button, then + * "Runners activated for this project", then + * Click on the "Edit" icon (next to the "Lock" Icon) + +Tags are very important as they are used to route specific jobs to +specific types of runners, so it's a good idea to double check that +the automatically created tags are consistent with the OS and +architecture. For instance, an Ubuntu 20.04 aarch64 system should +have tags set as:: + + ubuntu_20.04,aarch64 + +Because the job definition at ``.gitlab-ci.d/custom-runners.yml`` +would contain:: + + ubuntu-20.04-aarch64-all: + tags: + - ubuntu_20.04 + - aarch64 + +It's also recommended to: + + * increase the "Maximum job timeout" to something like ``2h`` + * give it a better Description diff --git a/docs/devel/ci.rst b/docs/devel/ci.rst new file mode 100644 index 000000000..d10661009 --- /dev/null +++ b/docs/devel/ci.rst @@ -0,0 +1,13 @@ +== +CI +== + +QEMU has configurations enabled for a number of different CI services. +The most up to date information about them and their status can be +found at:: + + https://wiki.qemu.org/Testing/CI + +.. include:: ci-definitions.rst.inc +.. include:: ci-jobs.rst.inc +.. include:: ci-runners.rst.inc diff --git a/docs/devel/clocks.rst b/docs/devel/clocks.rst new file mode 100644 index 000000000..675fbeb6a --- /dev/null +++ b/docs/devel/clocks.rst @@ -0,0 +1,528 @@ +Modelling a clock tree in QEMU +============================== + +What are clocks? +---------------- + +Clocks are QOM objects developed for the purpose of modelling the +distribution of clocks in QEMU. + +They allow us to model the clock distribution of a platform and detect +configuration errors in the clock tree such as badly configured PLL, clock +source selection or disabled clock. + +The object is *Clock* and its QOM name is ``clock`` (in C code, the macro +``TYPE_CLOCK``). + +Clocks are typically used with devices where they are used to model inputs +and outputs. They are created in a similar way to GPIOs. Inputs and outputs +of different devices can be connected together. + +In these cases a Clock object is a child of a Device object, but this +is not a requirement. Clocks can be independent of devices. For +example it is possible to create a clock outside of any device to +model the main clock source of a machine. + +Here is an example of clocks:: + + +---------+ +----------------------+ +--------------+ + | Clock 1 | | Device B | | Device C | + | | | +-------+ +-------+ | | +-------+ | + | |>>-+-->>|Clock 2| |Clock 3|>>--->>|Clock 6| | + +---------+ | | | (in) | | (out) | | | | (in) | | + | | +-------+ +-------+ | | +-------+ | + | | +-------+ | +--------------+ + | | |Clock 4|>> + | | | (out) | | +--------------+ + | | +-------+ | | Device D | + | | +-------+ | | +-------+ | + | | |Clock 5|>>--->>|Clock 7| | + | | | (out) | | | | (in) | | + | | +-------+ | | +-------+ | + | +----------------------+ | | + | | +-------+ | + +----------------------------->>|Clock 8| | + | | (in) | | + | +-------+ | + +--------------+ + +Clocks are defined in the ``include/hw/clock.h`` header and device +related functions are defined in the ``include/hw/qdev-clock.h`` +header. + +The clock state +--------------- + +The state of a clock is its period; it is stored as an integer +representing it in units of 2 :sup:`-32` ns. The special value of 0 is used to +represent the clock being inactive or gated. 
The clocks do not model +the signal itself (pin toggling) or other properties such as the duty +cycle. + +All clocks contain this state: outputs as well as inputs. This allows +the current period of a clock to be fetched at any time. When a clock +is updated, the value is immediately propagated to all connected +clocks in the tree. + +To ease interaction with clocks, helpers with a unit suffix are defined for +every clock state setter or getter. The suffixes are: + +- ``_ns`` for handling periods in nanoseconds +- ``_hz`` for handling frequencies in hertz + +The 0 period value is converted to 0 in hertz and vice versa. 0 always means +that the clock is disabled. + +Adding a new clock +------------------ + +Adding clocks to a device must be done during the init method of the Device +instance. + +To add an input clock to a device, the function ``qdev_init_clock_in()`` +must be used. It takes the name, a callback, an opaque parameter +for the callback and a mask of events when the callback should be +called (this will be explained in a following section). +Output is simpler; only the name is required. Typically:: + + qdev_init_clock_in(DEVICE(dev), "clk_in", clk_in_callback, dev, ClockUpdate); + qdev_init_clock_out(DEVICE(dev), "clk_out"); + +Both functions return the created Clock pointer, which should be saved in the +device's state structure for further use. + +These objects will be automatically deleted by the QOM reference mechanism. + +Note that it is possible to create a static array describing clock inputs and +outputs. The function ``qdev_init_clocks()`` must be called with the array as +parameter to initialize the clocks: it has the same behaviour as calling the +``qdev_init_clock_in/out()`` for each clock in the array. To ease the array +construction, some macros are defined in ``include/hw/qdev-clock.h``. +As an example, the following creates 2 clocks to a device: one input and one +output. + +.. code-block:: c + + /* device structure containing pointers to the clock objects */ + typedef struct MyDeviceState { + DeviceState parent_obj; + Clock *clk_in; + Clock *clk_out; + } MyDeviceState; + + /* + * callback for the input clock (see "Callback on input clock + * change" section below for more information). + */ + static void clk_in_callback(void *opaque, ClockEvent event); + + /* + * static array describing clocks: + * + a clock input named "clk_in", whose pointer is stored in + * the clk_in field of a MyDeviceState structure with callback + * clk_in_callback. + * + a clock output named "clk_out" whose pointer is stored in + * the clk_out field of a MyDeviceState structure. + */ + static const ClockPortInitArray mydev_clocks = { + QDEV_CLOCK_IN(MyDeviceState, clk_in, clk_in_callback, ClockUpdate), + QDEV_CLOCK_OUT(MyDeviceState, clk_out), + QDEV_CLOCK_END + }; + + /* device initialization function */ + static void mydev_init(Object *obj) + { + /* cast to MyDeviceState */ + MyDeviceState *mydev = MYDEVICE(obj); + /* create and fill the pointer fields in the MyDeviceState */ + qdev_init_clocks(mydev, mydev_clocks); + [...] + } + +An alternative way to create a clock is to simply call +``object_new(TYPE_CLOCK)``. In that case the clock will neither be an +input nor an output of a device. After the whole QOM hierarchy of the +clock has been set ``clock_setup_canonical_path()`` should be called. + +At creation, the period of the clock is 0: the clock is disabled. You can +change it using ``clock_set_ns()`` or ``clock_set_hz()``. 
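+
+For instance, a machine could create its main clock source along these lines
+(this is only a sketch: the machine type and the ``sysclk`` child name are
+invented for the example):
+
+.. code-block:: c
+
+    static void my_machine_init(MachineState *machine)
+    {
+        /* create a clock which is not an input or output of any device */
+        Clock *sysclk = CLOCK(object_new(TYPE_CLOCK));
+
+        /* attach it to the QOM tree, then fix its canonical path */
+        object_property_add_child(OBJECT(machine), "sysclk", OBJECT(sysclk));
+        object_unref(OBJECT(sysclk));
+        clock_setup_canonical_path(sysclk);
+
+        /* a fixed 24 MHz source: set the period once via the frequency helper */
+        clock_set_hz(sysclk, 24 * 1000 * 1000);
+    }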
+
+Note that if you are creating a clock with a fixed period which will never
+change (for example the main clock source of a board), then you'll have
+nothing else to do.  This value will be propagated to other clocks when
+connecting the clocks together and devices will fetch the right value during
+the first reset.
+
+Clock callbacks
+---------------
+
+You can give a clock a callback function in several ways:
+
+ * by passing it as an argument to ``qdev_init_clock_in()``
+ * as an argument to the ``QDEV_CLOCK_IN()`` macro initializing an
+   array to be passed to ``qdev_init_clocks()``
+ * by directly calling the ``clock_set_callback()`` function
+
+The callback function must be of this type:
+
+.. code-block:: c
+
+    typedef void ClockCallback(void *opaque, ClockEvent event);
+
+The ``opaque`` argument is the pointer passed to ``qdev_init_clock_in()``
+or ``clock_set_callback()``; for ``qdev_init_clocks()`` it is the
+``dev`` device pointer.
+
+The ``event`` argument specifies why the callback has been called.
+When you register the callback you specify a mask of ClockEvent values
+that you are interested in.  The callback will only be called for those
+events.
+
+The events currently supported are:
+
+ * ``ClockPreUpdate`` : called when the input clock's period is about to
+   update.  This is useful if the device needs to do some action for
+   which it needs to know the old value of the clock period.  During
+   this callback, Clock API functions like ``clock_get()`` or
+   ``clock_ticks_to_ns()`` will use the old period.
+ * ``ClockUpdate`` : called after the input clock's period has changed.
+   During this callback, Clock API functions like ``clock_ticks_to_ns()``
+   will use the new period.
+
+Note that a clock only has one callback: it is not possible to register
+different functions for different events.  You must register a single
+callback which listens for all of the events you are interested in,
+and use the ``event`` argument to identify which event has happened.
+
+Retrieving clocks from a device
+-------------------------------
+
+``qdev_get_clock_in()`` and ``qdev_get_clock_out()`` are available to
+get the clock inputs or outputs of a device.  For example:
+
+.. code-block:: c
+
+    Clock *clk = qdev_get_clock_in(DEVICE(mydev), "clk_in");
+
+or:
+
+.. code-block:: c
+
+    Clock *clk = qdev_get_clock_out(DEVICE(mydev), "clk_out");
+
+Connecting two clocks together
+------------------------------
+
+To connect two clocks together, use the ``clock_set_source()`` function.
+Given two clocks ``clk1`` and ``clk2``, calling ``clock_set_source(clk2, clk1)``
+configures ``clk2`` to follow ``clk1``'s period changes.  Every time ``clk1``
+is updated, ``clk2`` will be updated too.
+
+When connecting clocks between devices, prefer using the
+``qdev_connect_clock_in()`` function to set the source of an input
+device clock.  For example, to connect the input clock ``clk2`` of
+``devB`` to the output clock ``clk1`` of ``devA``, do:
+
+.. code-block:: c
+
+    qdev_connect_clock_in(devB, "clk2", qdev_get_clock_out(devA, "clk1"));
+
+We used ``qdev_get_clock_out()`` above, but any clock can drive an
+input clock, even another input clock.  The following diagram shows
+some examples of connections.  Note also that a clock can drive several
+other clocks.
+ +:: + + +------------+ +--------------------------------------------------+ + | Device A | | Device B | + | | | +---------------------+ | + | | | | Device C | | + | +-------+ | | +-------+ | +-------+ +-------+ | +-------+ | + | |Clock 1|>>-->>|Clock 2|>>+-->>|Clock 3| |Clock 5|>>>>|Clock 6|>> + | | (out) | | | | (in) | | | | (in) | | (out) | | | (out) | | + | +-------+ | | +-------+ | | +-------+ +-------+ | +-------+ | + +------------+ | | +---------------------+ | + | | | + | | +--------------+ | + | | | Device D | | + | | | +-------+ | | + | +-->>|Clock 4| | | + | | | (in) | | | + | | +-------+ | | + | +--------------+ | + +--------------------------------------------------+ + +In the above example, when *Clock 1* is updated by *Device A*, three +clocks get the new clock period value: *Clock 2*, *Clock 3* and *Clock 4*. + +It is not possible to disconnect a clock or to change the clock connection +after it is connected. + +Clock multiplier and divider settings +------------------------------------- + +By default, when clocks are connected together, the child +clocks run with the same period as their source (parent) clock. +The Clock API supports a built-in period multiplier/divider +mechanism so you can configure a clock to make its children +run at a different period from its own. If you call the +``clock_set_mul_div()`` function you can specify the clock's +multiplier and divider values. The children of that clock +will all run with a period of ``parent_period * multiplier / divider``. +For instance, if the clock has a frequency of 8MHz and you set its +multiplier to 2 and its divider to 3, the child clocks will run +at 12MHz. + +You can change the multiplier and divider of a clock at runtime, +so you can use this to model clock controller devices which +have guest-programmable frequency multipliers or dividers. + +Note that ``clock_set_mul_div()`` does not automatically call +``clock_propagate()``. If you make a runtime change to the +multiplier or divider you must call clock_propagate() yourself. + +Unconnected input clocks +------------------------ + +A newly created input clock is disabled (period of 0). This means the +clock will be considered as disabled until the period is updated. If +the clock remains unconnected it will always keep its initial value +of 0. If this is not the desired behaviour, ``clock_set()``, +``clock_set_ns()`` or ``clock_set_hz()`` should be called on the Clock +object during device instance init. For example: + +.. code-block:: c + + clk = qdev_init_clock_in(DEVICE(dev), "clk-in", clk_in_callback, + dev, ClockUpdate); + /* set initial value to 10ns / 100MHz */ + clock_set_ns(clk, 10); + +To enforce that the clock is wired up by the board code, you can +call ``clock_has_source()`` in your device's realize method: + +.. code-block:: c + + if (!clock_has_source(s->clk)) { + error_setg(errp, "MyDevice: clk input must be connected"); + return; + } + +Note that this only checks that the clock has been wired up; it is +still possible that the output clock connected to it is disabled +or has not yet been configured, in which case the period will be +zero. You should use the clock callback to find out when the clock +period changes. + +Fetching clock frequency/period +------------------------------- + +To get the current state of a clock, use the functions ``clock_get()`` +or ``clock_get_hz()``. + +``clock_get()`` returns the period of the clock in its fully precise +internal representation, as an unsigned 64-bit integer in units of +2^-32 nanoseconds. 
(For many purposes ``clock_ticks_to_ns()`` will
+be more convenient; see the section below on expiry deadlines.)
+
+``clock_get_hz()`` returns the frequency of the clock, rounded down to the
+next lowest integer.  This implies some inaccuracy due to the rounding,
+so be cautious about using it in calculations.
+
+It is also possible to register a callback on clock frequency changes.
+Here is an example, which assumes that ``clock_callback`` has been
+specified as the callback for the ``ClockUpdate`` event:
+
+.. code-block:: c
+
+    void clock_callback(void *opaque, ClockEvent event) {
+        MyDeviceState *s = (MyDeviceState *) opaque;
+        /*
+         * 'opaque' is the argument passed to qdev_init_clock_in();
+         * usually this will be the device state pointer.
+         */
+
+        /* do something with the new period */
+        fprintf(stdout, "device new period is %" PRIu64 " * 2^-32 ns\n",
+                clock_get(s->my_clk_input));
+    }
+
+If you are only interested in the frequency for displaying it to
+humans (for instance in debugging), use ``clock_display_freq()``,
+which returns a prettified string-representation, e.g. "33.3 MHz".
+The caller must free the string with g_free() after use.
+
+Calculating expiry deadlines
+----------------------------
+
+A commonly required operation for a clock is to calculate how long
+it will take for the clock to tick N times; this can then be used
+to set a timer expiry deadline.  Use the function ``clock_ticks_to_ns()``,
+which takes an unsigned 64-bit count of ticks and returns the length
+of time in nanoseconds required for the clock to tick that many times.
+
+It is important not to try to calculate expiry deadlines using a
+shortcut like multiplying a "period of clock in nanoseconds" value
+by the tick count, because clocks can have periods which are not a
+whole number of nanoseconds, and the accumulated error in the
+multiplication can be significant.
+
+For a clock with a very long period and a large number of ticks,
+the result of this function could in theory be too large to fit in
+a 64-bit value.  To avoid overflow in this case, ``clock_ticks_to_ns()``
+saturates the result to INT64_MAX (because this is the largest valid
+input to the QEMUTimer APIs).  Since INT64_MAX nanoseconds is almost
+300 years, anything with an expiry later than that is in the "will
+never happen" category.  Callers of ``clock_ticks_to_ns()`` should
+therefore generally not special-case the possibility of a saturated
+result but just allow the timer to be set to that far-future value.
+(If you are performing further calculations on the returned value
+rather than simply passing it to a QEMUTimer function like
+``timer_mod_ns()`` then you should be careful to avoid overflow
+in those calculations, of course.)
+
+Obtaining tick counts
+---------------------
+
+For calculations where you need to know the number of ticks in
+a given duration, use ``clock_ns_to_ticks()``.  This function handles
+possible non-whole-number-of-nanoseconds periods and avoids
+potential rounding errors.  It will return '0' if the clock is stopped
+(i.e. it has period zero).  If the inputs imply a tick count that
+overflows a 64-bit value (a very long duration for a clock with a
+very short period) the output value is truncated, so effectively
+the 64-bit output wraps around.
+
+Changing a clock period
+-----------------------
+
+A device can change its outputs using the ``clock_update()``,
+``clock_update_ns()`` or ``clock_update_hz()`` functions.  These will
+trigger updates on every connected input.
+ +For example, let's say that we have an output clock *clkout* and we +have a pointer to it in the device state because we did the following +in init phase: + +.. code-block:: c + + dev->clkout = qdev_init_clock_out(DEVICE(dev), "clkout"); + +Then at any time (apart from the cases listed below), it is possible to +change the clock value by doing: + +.. code-block:: c + + clock_update_hz(dev->clkout, 1000 * 1000 * 1000); /* 1GHz */ + +Because updating a clock may trigger any side effects through +connected clocks and their callbacks, this operation must be done +while holding the qemu io lock. + +For the same reason, one can update clocks only when it is allowed to have +side effects on other objects. In consequence, it is forbidden: + +* during migration, +* and in the enter phase of reset. + +Note that calling ``clock_update[_ns|_hz]()`` is equivalent to calling +``clock_set[_ns|_hz]()`` (with the same arguments) then +``clock_propagate()`` on the clock. Thus, setting the clock value can +be separated from triggering the side-effects. This is often required +to factorize code to handle reset and migration in devices. + +Aliasing clocks +--------------- + +Sometimes, one needs to forward, or inherit, a clock from another +device. Typically, when doing device composition, a device might +expose a sub-device's clock without interfering with it. The function +``qdev_alias_clock()`` can be used to achieve this behaviour. Note +that it is possible to expose the clock under a different name. +``qdev_alias_clock()`` works for both input and output clocks. + +For example, if device B is a child of device A, +``device_a_instance_init()`` may do something like this: + +.. code-block:: c + + void device_a_instance_init(Object *obj) + { + AState *A = DEVICE_A(obj); + BState *B; + /* create object B as child of A */ + [...] + qdev_alias_clock(B, "clk", A, "b_clk"); + /* + * Now A has a clock "b_clk" which is an alias to + * the clock "clk" of its child B. + */ + } + +This function does not return any clock object. The new clock has the +same direction (input or output) as the original one. This function +only adds a link to the existing clock. In the above example, object B +remains the only object allowed to use the clock and device A must not +try to change the clock period or set a callback to the clock. This +diagram describes the example with an input clock:: + + +--------------------------+ + | Device A | + | +--------------+ | + | | Device B | | + | | +-------+ | | + >>"b_clk">>>| "clk" | | | + | (in) | | (in) | | | + | | +-------+ | | + | +--------------+ | + +--------------------------+ + +Migration +--------- + +Clock state is not migrated automatically. Every device must handle its +clock migration. Alias clocks must not be migrated. + +To ensure clock states are restored correctly during migration, there +are two solutions. + +Clock states can be migrated by adding an entry into the device +vmstate description. You should use the ``VMSTATE_CLOCK`` macro for this. +This is typically used to migrate an input clock state. For example: + +.. code-block:: c + + MyDeviceState { + DeviceState parent_obj; + [...] /* some fields */ + Clock *clk; + }; + + VMStateDescription my_device_vmstate = { + .name = "my_device", + .fields = (VMStateField[]) { + [...], /* other migrated fields */ + VMSTATE_CLOCK(clk, MyDeviceState), + VMSTATE_END_OF_LIST() + } + }; + +The second solution is to restore the clock state using information already +at our disposal. 
This can be used to restore output clock states using the +device state. The functions ``clock_set[_ns|_hz]()`` can be used during the +``post_load()`` migration callback. + +When adding clock support to an existing device, if you care about +migration compatibility you will need to be careful, as simply adding +a ``VMSTATE_CLOCK()`` line will break compatibility. Instead, you can +put the ``VMSTATE_CLOCK()`` line into a vmstate subsection with a +suitable ``needed`` function, and use ``clock_set()`` in a +``pre_load()`` function to set the default value that will be used if +the source virtual machine in the migration does not send the clock +state. + +Care should be taken not to use ``clock_update[_ns|_hz]()`` or +``clock_propagate()`` during the whole migration procedure because it +will trigger side effects to other devices in an unknown state. diff --git a/docs/devel/code-of-conduct.rst b/docs/devel/code-of-conduct.rst new file mode 100644 index 000000000..195444d1b --- /dev/null +++ b/docs/devel/code-of-conduct.rst @@ -0,0 +1,60 @@ +Code of Conduct +=============== + +The QEMU community is made up of a mixture of professionals and +volunteers from all over the world. Diversity is one of our strengths, +but it can also lead to communication issues and unhappiness. +To that end, we have a few ground rules that we ask people to adhere to. + +* Be welcoming. We are committed to making participation in this project + a harassment-free experience for everyone, regardless of level of + experience, gender, gender identity and expression, sexual orientation, + disability, personal appearance, body size, race, ethnicity, age, religion, + or nationality. + +* Be respectful. Not all of us will agree all the time. Disagreements, both + social and technical, happen all the time and the QEMU community is no + exception. When we disagree, we try to understand why. It is important that + we resolve disagreements and differing views constructively. Members of the + QEMU community should be respectful when dealing with other contributors as + well as with people outside the QEMU community and with users of QEMU. + +Harassment and other exclusionary behavior are not acceptable. A community +where people feel uncomfortable or threatened is neither welcoming nor +respectful. Examples of unacceptable behavior by participants include: + +* The use of sexualized language or imagery + +* Personal attacks + +* Trolling or insulting/derogatory comments + +* Public or private harassment + +* Publishing other's private information, such as physical or electronic + addresses, without explicit permission + +This isn't an exhaustive list of things that you can't do. Rather, take +it in the spirit in which it's intended: a guide to make it easier to +be excellent to each other. + +This code of conduct applies to all spaces managed by the QEMU project. +This includes IRC, the mailing lists, the issue tracker, community +events, and any other forums created by the project team which the +community uses for communication. This code of conduct also applies +outside these spaces, when an individual acts as a representative or a +member of the project or its community. + +By adopting this code of conduct, project maintainers commit themselves +to fairly and consistently applying these principles to every aspect of +managing this project. If you believe someone is violating the code of +conduct, please read the :ref:`conflict-resolution` document for +information about how to proceed. 
+ +Sources +------- + +This document is based on the `Fedora Code of Conduct +<http://web.archive.org/web/20210429132536/https://docs.fedoraproject.org/en-US/project/code-of-conduct/>`__ +(as of April 2021) and the `Contributor Covenant version 1.3.0 +<https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/>`__. diff --git a/docs/devel/conflict-resolution.rst b/docs/devel/conflict-resolution.rst new file mode 100644 index 000000000..bb25f6186 --- /dev/null +++ b/docs/devel/conflict-resolution.rst @@ -0,0 +1,80 @@ +.. _conflict-resolution: + +Conflict Resolution Policy +========================== + +Conflicts in the community can take many forms, from someone having a +bad day and using harsh and hurtful language on the mailing list to more +serious code of conduct violations (including sexist/racist statements +or threats of violence), and everything in between. + +For the vast majority of issues, we aim to empower individuals to first +resolve conflicts themselves, asking for help when needed, and only +after that fails to escalate further. This approach gives people more +control over the outcome of their dispute. + +How we resolve conflicts +------------------------ + +If you are experiencing conflict, please consider first addressing the +perceived conflict directly with other involved parties, preferably through +a real-time medium such as IRC. You could also try to get a third-party (e.g. +a mutual friend, and/or someone with background on the issue, but not +involved in the conflict) to intercede or mediate. + +If this fails or if you do not feel comfortable proceeding this way, or +if the problem requires immediate escalation, report the issue to the QEMU +leadership committee by sending an email to qemu@sfconservancy.org, providing +references to the misconduct. +For very urgent topics, you can also inform one or more members through IRC. +The up-to-date list of members is `available on the QEMU wiki +<https://wiki.qemu.org/Conservancy>`__. + +Your report will be treated confidentially by the leadership committee and +not be published without your agreement. The QEMU leadership committee will +then do its best to review the incident in a timely manner, and will either +seek further information, or will make a determination on next steps. + +Remedies +-------- + +Escalating an issue to the QEMU leadership committee may result in actions +impacting one or more involved parties. In the event the leadership +committee has to intervene, here are some of the ways they might respond: + +1. Take no action. For example, if the leadership committee determines + the complaint has not been substantiated or is being made in bad faith, + or if it is deemed to be outside its purview. + +2. A private reprimand, explaining the consequences of continued behavior, + to one or more involved individuals. + +3. A private reprimand and request for a private or public apology + +4. A public reprimand and request for a public apology + +5. A public reprimand plus a mandatory cooling off period. The cooling + off period may require, for example, one or more of the following: + abstaining from maintainer duties; not interacting with people involved, + including unsolicited interaction with those enforcing the guidelines + and interaction on social media; being denied participation to in-person + events. The cooling off period is voluntary but may escalate to a + temporary ban in order to enforce it. + +6. 
A temporary or permanent ban from some or all current and future QEMU
+   spaces (mailing lists, IRC, wiki, etc.), possibly including in-person
+   events.
+
+In the event of severe harassment, the leadership committee may advise that
+the matter be escalated to the relevant local law enforcement agency. It
+is however not the role of the leadership committee to initiate contact
+with law enforcement on behalf of any of the community members involved
+in an incident.
+
+Sources
+-------
+
+This document was developed based on the `Drupal Conflict Resolution
+Policy and Process <https://www.drupal.org/conflict-resolution>`__
+and the `Mozilla Consequence Ladder
+<https://github.com/mozilla/diversity/blob/master/code-of-conduct-enforcement/consequence-ladder.md>`__
diff --git a/docs/devel/control-flow-integrity.rst b/docs/devel/control-flow-integrity.rst
new file mode 100644
index 000000000..e6b73a4fe
--- /dev/null
+++ b/docs/devel/control-flow-integrity.rst
@@ -0,0 +1,137 @@
+============================
+Control-Flow Integrity (CFI)
+============================
+
+This document describes the current control-flow integrity (CFI) mechanism in
+QEMU: how it can be enabled, its benefits and deficiencies, and how it affects
+new and existing code in QEMU.
+
+Basics
+------
+
+CFI is a hardening technique that focuses on guaranteeing that indirect
+function calls have not been altered by an attacker.
+The type used in QEMU is a forward-edge control-flow integrity that ensures
+that function calls performed through function pointers always call a
+"compatible" function. A compatible function is a function with the same
+signature as the function pointer declared in the source code.
+
+This type of CFI is entirely compiler-based and relies on the compiler knowing
+the signature of every function and every function pointer used in the code.
+As of now, the only compiler that provides support for CFI is Clang.
+
+CFI is best used on production binaries, to protect against unknown attack
+vectors.
+
+In case of a CFI violation (i.e. a call to a non-compatible function) QEMU
+will terminate abruptly, to stop the possible attack.
+
+Building with CFI
+-----------------
+
+NOTE: CFI requires the use of link-time optimization. Therefore, when CFI is
+selected, LTO will be automatically enabled.
+
+To build with CFI, the minimum requirement is Clang 6+. If you
+are planning to also enable fuzzing, then Clang 11+ is needed (more on this
+later).
+
+Given the use of LTO, a version of AR that supports LLVM IR is required.
+The easiest way of doing this is by selecting the AR provided by LLVM::
+
+ AR=llvm-ar-9 CC=clang-9 CXX=clang++-9 /path/to/configure --enable-cfi
+
+CFI is enabled on every binary produced.
+
+If desired, an additional flag to increase the verbosity of the output in case
+of a CFI violation is offered (``--enable-debug-cfi``).
+
+Using QEMU built with CFI
+-------------------------
+
+A binary with CFI will work exactly like a standard binary. In case of a CFI
+violation, the binary will terminate with an illegal instruction signal.
+
+Incompatible code with CFI
+--------------------------
+
+As mentioned above, CFI is entirely compiler-based and therefore relies on
+compile-time knowledge of the code. This means that, while generally supported
+for most code, some specific usage patterns can break CFI compatibility and
+create false positives. The two main patterns that can cause issues are:
+
+* Just-in-time compiled code: since such code is created at runtime, the jump
+  to the buffer containing JIT code will fail.
+
+* Libraries loaded dynamically, e.g. with dlopen/dlsym, since the library was
+  not known at compile time.
+
+Current areas of QEMU that are not entirely compatible with CFI are:
+
+1. TCG, since the idea of TCG is to pre-compile groups of instructions at
+   runtime to speed-up interpretation, quite similarly to a JIT compiler
+
+2. TCI, where the interpreter has to interpret the generic *call* operation
+
+3. Plugins, since a plugin is implemented as an external library
+
+4. Modules, since they are implemented as an external library
+
+5. Directly calling signal handlers from the QEMU source code, since the
+   signal handler may have been provided by an external library or even plugged
+   at runtime.
+
+Disabling CFI for a specific function
+-------------------------------------
+
+If you are working on a function that performs a call in one of the
+incompatible ways described above, you can selectively disable CFI checks
+for that function by using the decorator ``QEMU_DISABLE_CFI`` at the function
+definition, and adding an explanation of why the function is not compatible
+with CFI. An example of the use of ``QEMU_DISABLE_CFI`` is provided here::
+
+    /*
+     * Disable CFI checks.
+     * TCG creates binary blobs at runtime, with the transformed code.
+     * A TB is a blob of binary code, created at runtime and called with an
+     * indirect function call. Since such function did not exist at compile time,
+     * the CFI runtime has no way to verify its signature and would fail.
+     * TCG is not considered a security-sensitive part of QEMU so this does not
+     * affect the impact of CFI in environments with high security requirements
+     */
+    QEMU_DISABLE_CFI
+    static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
+
+NOTE: CFI needs to be disabled at the **caller** function (i.e. a
+CFI-compatible function that calls a non-compatible one), since the check is
+performed when the function call is made.
+
+CFI and fuzzing
+---------------
+
+There is generally no advantage to using CFI and fuzzing together, because
+they target different environments (production for CFI, debug for fuzzing).
+
+CFI could be used in conjunction with fuzzing to identify a broader set of
+bugs that may not immediately result in a segmentation fault or a triggered
+assertion. However, other sanitizers such as the address and undefined
+behavior sanitizers can identify such bugs in a more precise way than CFI.
+
+There is, however, an interesting use case for using CFI in conjunction with
+fuzzing: making sure that CFI does not trigger any false positives in
+remote-but-possible parts of the code.
+
+CFI can be enabled with fuzzing, but with some caveats:
+
+1. Fuzzing relies on the linker performing function wrapping at link-time.
+   The standard BFD linker does not support function wrapping when LTO is
+   also enabled. The workaround is to use LLVM's lld linker.
+2. Fuzzing also relies on a custom linker script, which is only supported by
+   lld with version 11+.
+
+In other words, to compile with fuzzing and CFI, clang 11+ is required, and
+lld needs to be used as a linker::
+
+ AR=llvm-ar-11 CC=clang-11 CXX=clang++-11 /path/to/configure --enable-cfi \
+   --enable-fuzzing --extra-ldflags="-fuse-ld=lld"
+
+and then compile the fuzzers as usual.
diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst new file mode 100644 index 000000000..49ea50c2a --- /dev/null +++ b/docs/devel/decodetree.rst @@ -0,0 +1,237 @@ +======================== +Decodetree Specification +======================== + +A *decodetree* is built from instruction *patterns*. A pattern may +represent a single architectural instruction or a group of same, depending +on what is convenient for further processing. + +Each pattern has both *fixedbits* and *fixedmask*, the combination of which +describes the condition under which the pattern is matched:: + + (insn & fixedmask) == fixedbits + +Each pattern may have *fields*, which are extracted from the insn and +passed along to the translator. Examples of such are registers, +immediates, and sub-opcodes. + +In support of patterns, one may declare *fields*, *argument sets*, and +*formats*, each of which may be re-used to simplify further definitions. + +Fields +====== + +Syntax:: + + field_def := '%' identifier ( unnamed_field )* ( !function=identifier )? + unnamed_field := number ':' ( 's' ) number + +For *unnamed_field*, the first number is the least-significant bit position +of the field and the second number is the length of the field. If the 's' is +present, the field is considered signed. If multiple ``unnamed_fields`` are +present, they are concatenated. In this way one can define disjoint fields. + +If ``!function`` is specified, the concatenated result is passed through the +named function, taking and returning an integral value. + +One may use ``!function`` with zero ``unnamed_fields``. This case is called +a *parameter*, and the named function is only passed the ``DisasContext`` +and returns an integral value extracted from there. + +A field with no ``unnamed_fields`` and no ``!function`` is in error. + +Field examples: + ++---------------------------+---------------------------------------------+ +| Input | Generated code | ++===========================+=============================================+ +| %disp 0:s16 | sextract(i, 0, 16) | ++---------------------------+---------------------------------------------+ +| %imm9 16:6 10:3 | extract(i, 16, 6) << 3 | extract(i, 10, 3) | ++---------------------------+---------------------------------------------+ +| %disp12 0:s1 1:1 2:10 | sextract(i, 0, 1) << 11 | | +| | extract(i, 1, 1) << 10 | | +| | extract(i, 2, 10) | ++---------------------------+---------------------------------------------+ +| %shimm8 5:s8 13:1 | expand_shimm8(sextract(i, 5, 8) << 1 | | +| !function=expand_shimm8 | extract(i, 13, 1)) | ++---------------------------+---------------------------------------------+ + +Argument Sets +============= + +Syntax:: + + args_def := '&' identifier ( args_elt )+ ( !extern )? + args_elt := identifier (':' identifier)? + +Each *args_elt* defines an argument within the argument set. +If the form of the *args_elt* contains a colon, the first +identifier is the argument name and the second identifier is +the argument type. If the colon is missing, the argument +type will be ``int``. + +Each argument set will be rendered as a C structure "arg_$name" +with each of the fields being one of the member arguments. + +If ``!extern`` is specified, the backing structure is assumed +to have been already declared, typically via a second decoder. + +Argument sets are useful when one wants to define helper functions +for the translator functions that can perform operations on a common +set of arguments. 
This can ensure, for instance, that the ``AND`` +pattern and the ``OR`` pattern put their operands into the same named +structure, so that a common ``gen_logic_insn`` may be able to handle +the operations common between the two. + +Argument set examples:: + + ®3 ra rb rc + &loadstore reg base offset + &longldst reg base offset:int64_t + + +Formats +======= + +Syntax:: + + fmt_def := '@' identifier ( fmt_elt )+ + fmt_elt := fixedbit_elt | field_elt | field_ref | args_ref + fixedbit_elt := [01.-]+ + field_elt := identifier ':' 's'? number + field_ref := '%' identifier | identifier '=' '%' identifier + args_ref := '&' identifier + +Defining a format is a handy way to avoid replicating groups of fields +across many instruction patterns. + +A *fixedbit_elt* describes a contiguous sequence of bits that must +be 1, 0, or don't care. The difference between '.' and '-' +is that '.' means that the bit will be covered with a field or a +final 0 or 1 from the pattern, and '-' means that the bit is really +ignored by the cpu and will not be specified. + +A *field_elt* describes a simple field only given a width; the position of +the field is implied by its position with respect to other *fixedbit_elt* +and *field_elt*. + +If any *fixedbit_elt* or *field_elt* appear, then all bits must be defined. +Padding with a *fixedbit_elt* of all '.' is an easy way to accomplish that. + +A *field_ref* incorporates a field by reference. This is the only way to +add a complex field to a format. A field may be renamed in the process +via assignment to another identifier. This is intended to allow the +same argument set be used with disjoint named fields. + +A single *args_ref* may specify an argument set to use for the format. +The set of fields in the format must be a subset of the arguments in +the argument set. If an argument set is not specified, one will be +inferred from the set of fields. + +It is recommended, but not required, that all *field_ref* and *args_ref* +appear at the end of the line, not interleaving with *fixedbit_elf* or +*field_elt*. + +Format examples:: + + @opr ...... ra:5 rb:5 ... 0 ....... rc:5 + @opi ...... ra:5 lit:8 1 ....... rc:5 + +Patterns +======== + +Syntax:: + + pat_def := identifier ( pat_elt )+ + pat_elt := fixedbit_elt | field_elt | field_ref | args_ref | fmt_ref | const_elt + fmt_ref := '@' identifier + const_elt := identifier '=' number + +The *fixedbit_elt* and *field_elt* specifiers are unchanged from formats. +A pattern that does not specify a named format will have one inferred +from a referenced argument set (if present) and the set of fields. + +A *const_elt* allows a argument to be set to a constant value. This may +come in handy when fields overlap between patterns and one has to +include the values in the *fixedbit_elt* instead. + +The decoder will call a translator function for each pattern matched. + +Pattern examples:: + + addl_r 010000 ..... ..... .... 0000000 ..... @opr + addl_i 010000 ..... ..... .... 0000000 ..... @opi + +which will, in part, invoke:: + + trans_addl_r(ctx, &arg_opr, insn) + +and:: + + trans_addl_i(ctx, &arg_opi, insn) + +Pattern Groups +============== + +Syntax:: + + group := overlap_group | no_overlap_group + overlap_group := '{' ( pat_def | group )+ '}' + no_overlap_group := '[' ( pat_def | group )+ ']' + +A *group* begins with a lone open-brace or open-bracket, with all +subsequent lines indented two spaces, and ending with a lone +close-brace or close-bracket. 
Groups may be nested, increasing the +required indentation of the lines within the nested group to two +spaces per nesting level. + +Patterns within overlap groups are allowed to overlap. Conflicts are +resolved by selecting the patterns in order. If all of the fixedbits +for a pattern match, its translate function will be called. If the +translate function returns false, then subsequent patterns within the +group will be matched. + +Patterns within no-overlap groups are not allowed to overlap, just +the same as ungrouped patterns. Thus no-overlap groups are intended +to be nested inside overlap groups. + +The following example from PA-RISC shows specialization of the *or* +instruction:: + + { + { + nop 000010 ----- ----- 0000 001001 0 00000 + copy 000010 00000 r1:5 0000 001001 0 rt:5 + } + or 000010 rt2:5 r1:5 cf:4 001001 0 rt:5 + } + +When the *cf* field is zero, the instruction has no side effects, +and may be specialized. When the *rt* field is zero, the output +is discarded and so the instruction has no effect. When the *rt2* +field is zero, the operation is ``reg[r1] | 0`` and so encodes +the canonical register copy operation. + +The output from the generator might look like:: + + switch (insn & 0xfc000fe0) { + case 0x08000240: + /* 000010.. ........ ....0010 010..... */ + if ((insn & 0x0000f000) == 0x00000000) { + /* 000010.. ........ 00000010 010..... */ + if ((insn & 0x0000001f) == 0x00000000) { + /* 000010.. ........ 00000010 01000000 */ + extract_decode_Fmt_0(&u.f_decode0, insn); + if (trans_nop(ctx, &u.f_decode0)) return true; + } + if ((insn & 0x03e00000) == 0x00000000) { + /* 00001000 000..... 00000010 010..... */ + extract_decode_Fmt_1(&u.f_decode1, insn); + if (trans_copy(ctx, &u.f_decode1)) return true; + } + } + extract_decode_Fmt_2(&u.f_decode2, insn); + if (trans_or(ctx, &u.f_decode2)) return true; + return false; + } diff --git a/docs/devel/ebpf_rss.rst b/docs/devel/ebpf_rss.rst new file mode 100644 index 000000000..4a68682b3 --- /dev/null +++ b/docs/devel/ebpf_rss.rst @@ -0,0 +1,125 @@ +=========================== +eBPF RSS virtio-net support +=========================== + +RSS(Receive Side Scaling) is used to distribute network packets to guest virtqueues +by calculating packet hash. Usually every queue is processed then by a specific guest CPU core. + +For now there are 2 RSS implementations in qemu: +- 'in-qemu' RSS (functions if qemu receives network packets, i.e. vhost=off) +- eBPF RSS (can function with also with vhost=on) + +eBPF support (CONFIG_EBPF) is enabled by 'configure' script. +To enable eBPF RSS support use './configure --enable-bpf'. + +If steering BPF is not set for kernel's TUN module, the TUN uses automatic selection +of rx virtqueue based on lookup table built according to calculated symmetric hash +of transmitted packets. +If steering BPF is set for TUN the BPF code calculates the hash of packet header and +returns the virtqueue number to place the packet to. + +Simplified decision formula: + +.. code:: C + + queue_index = indirection_table[hash(<packet data>)%<indirection_table size>] + + +Not for all packets, the hash can/should be calculated. + +Note: currently, eBPF RSS does not support hash reporting. 
+
+eBPF RSS is turned on by different combinations of vhost-net, virtio-net and
+tap configurations:
+
+- eBPF is used:
+
+        tap,vhost=off & virtio-net-pci,rss=on,hash=off
+
+- eBPF is used:
+
+        tap,vhost=on & virtio-net-pci,rss=on,hash=off
+
+- 'in-qemu' RSS is used:
+
+        tap,vhost=off & virtio-net-pci,rss=on,hash=on
+
+- eBPF is used, hash population feature is not reported to the guest:
+
+        tap,vhost=on & virtio-net-pci,rss=on,hash=on
+
+If CONFIG_EBPF is not set then only 'in-qemu' RSS is supported.
+Also 'in-qemu' RSS, as a fallback, is used if the eBPF program fails to load
+or to be set on TUN.
+
+RSS eBPF program
+----------------
+
+The RSS program is located in ebpf/rss.bpf.skeleton.h, which is generated by
+bpftool, so the program is part of the qemu binary.
+Initially, the eBPF program was compiled by clang and its source code is
+located at tools/ebpf/rss.bpf.c.
+Prerequisites to recompile the eBPF program (regenerate ebpf/rss.bpf.skeleton.h):
+
+        llvm, clang, kernel source tree, bpftool
+        Adjust Makefile.ebpf to reflect the location of the kernel source tree
+
+        $ cd tools/ebpf
+        $ make -f Makefile.ebpf
+
+The current eBPF RSS implementation uses 'bounded loops' with 'backward jump
+instructions', which are present in recent kernels. Overall, eBPF RSS works on
+kernels 5.8+.
+
+eBPF RSS implementation
+-----------------------
+
+The eBPF RSS loading functionality is located in ebpf/ebpf_rss.c and
+ebpf/ebpf_rss.h.
+
+The ``struct EBPFRSSContext`` structure holds the libbpf context and 4 file
+descriptors:
+
+- ctx - pointer to the libbpf context.
+- program_fd - file descriptor of the eBPF RSS program.
+- map_configuration - file descriptor of the 'configuration' map. This map contains one element of 'struct EBPFRSSConfig'. This configuration determines eBPF program behavior.
+- map_toeplitz_key - file descriptor of the 'Toeplitz key' map. One element, the 40-byte key prepared for the hashing algorithm.
+- map_indirections_table - file descriptor of the indirections table map: 128 elements of queue indexes.
+
+``struct EBPFRSSConfig`` fields:
+
+- redirect - "boolean" value that controls whether the hash should be calculated; if false, ``default_queue`` is used as the final decision.
+- populate_hash - for now, not used. eBPF RSS doesn't support hash reporting.
+- hash_types - binary mask of different hash types. See ``VIRTIO_NET_RSS_HASH_TYPE_*`` defines. If the hash should not be calculated for a packet, ``default_queue`` is used.
+- indirections_len - length of the indirections table, maximum 128.
+- default_queue - the queue index used for packets that shouldn't be hashed. For some packets the hash can't be calculated (e.g. ARP).
+
+Functions:
+
+- ``ebpf_rss_init()`` - sets ctx to NULL, which indicates that EBPFRSSContext is not loaded.
+- ``ebpf_rss_load()`` - creates 3 maps and loads the eBPF program from rss.bpf.skeleton.h. Returns 'true' on success. After that, program_fd can be used to set steering for TAP.
+- ``ebpf_rss_set_all()`` - sets values for the eBPF maps. ``indirections_table`` length is in EBPFRSSConfig. ``toeplitz_key`` is a VIRTIO_NET_RSS_MAX_KEY_SIZE (40 byte) array.
+- ``ebpf_rss_unload()`` - closes all file descriptors and sets ctx to NULL.
+
+Simplified eBPF RSS workflow:
+
+.. code:: C
+
+    struct EBPFRSSConfig config;
+    config.redirect = 1;
+    config.hash_types = VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | VIRTIO_NET_RSS_HASH_TYPE_TCPv4;
+    config.indirections_len = VIRTIO_NET_RSS_MAX_TABLE_LEN;
+    config.default_queue = 0;
+
+    uint16_t table[VIRTIO_NET_RSS_MAX_TABLE_LEN] = {...};
+    uint8_t key[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {...};
+
+    struct EBPFRSSContext ctx;
+    ebpf_rss_init(&ctx);
+    ebpf_rss_load(&ctx);
+    ebpf_rss_set_all(&ctx, &config, table, key);
+    if (net_client->info->set_steering_ebpf != NULL) {
+        net_client->info->set_steering_ebpf(net_client, ctx.program_fd);
+    }
+    ...
+    ebpf_rss_unload(&ctx);
+
+
+NetClientState SetSteeringEBPF()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For now, the ``set_steering_ebpf()`` method is supported by the Linux TAP
+NetClientState. The method requires an eBPF program file descriptor as an
+argument.
diff --git a/docs/devel/fuzzing.rst b/docs/devel/fuzzing.rst
new file mode 100644
index 000000000..784ecb99e
--- /dev/null
+++ b/docs/devel/fuzzing.rst
@@ -0,0 +1,322 @@
+========
+Fuzzing
+========
+
+This document describes the virtual-device fuzzing infrastructure in QEMU and
+how to use it to implement additional fuzzers.
+
+Basics
+------
+
+Fuzzing operates by passing inputs to an entry point/target function. The
+fuzzer tracks the code coverage triggered by the input. Based on these
+findings, the fuzzer mutates the input and repeats the fuzzing.
+
+To fuzz QEMU, we rely on libfuzzer. Unlike other fuzzers such as AFL, libfuzzer
+is an *in-process* fuzzer. For the developer, this means that it is their
+responsibility to ensure that state is reset between fuzzing-runs.
+
+Building the fuzzers
+--------------------
+
+*NOTE*: If possible, build a 32-bit binary. When forking, the 32-bit fuzzer is
+much faster, since the page-map has a smaller size. This is due to the fact that
+AddressSanitizer maps ~20TB of memory, as part of its detection. This results
+in a large page-map, and a much slower ``fork()``.
+
+To build the fuzzers, install a recent version of clang and configure with it
+(substitute the clang binaries with the version you installed). Here,
+``--enable-sanitizers`` is optional, but it allows us to reliably detect bugs
+such as out-of-bounds accesses, use-after-frees, double-frees etc.::
+
+    CC=clang-8 CXX=clang++-8 /path/to/configure --enable-fuzzing \
+                                                --enable-sanitizers
+
+Fuzz targets are built similarly to system targets::
+
+    make qemu-fuzz-i386
+
+This builds ``./qemu-fuzz-i386``.
+
+The first option to this command is: ``--fuzz-target=FUZZ_NAME``.
+To list all of the available fuzzers run ``qemu-fuzz-i386`` with no arguments.
+
+For example::
+
+  ./qemu-fuzz-i386 --fuzz-target=virtio-scsi-fuzz
+
+Internally, libfuzzer parses all arguments that do not begin with ``"--"``.
+Information about these is available by passing ``-help=1``.
+
+Now the only thing left to do is wait for the fuzzer to trigger potential
+crashes.
+
+Useful libFuzzer flags
+----------------------
+
+As mentioned above, libFuzzer accepts some arguments. Passing ``-help=1`` will
+list the available arguments. In particular, these arguments might be helpful:
+
+* ``CORPUS_DIR/`` : Specify a directory as the last argument to libFuzzer.
+  libFuzzer stores each "interesting" input in this corpus directory. The next
+  time you run libFuzzer, it will read all of the inputs from the corpus, and
+  continue fuzzing from there. You can also specify multiple directories.
+  libFuzzer loads existing inputs from all specified directories, but will only
+  write new ones to the first one specified.
+
+* ``-max_len=4096`` : specify the maximum byte-length of the inputs libFuzzer
+  will generate.
+
+* ``-close_fd_mask={1,2,3}`` : close stdout, stderr, or both, respectively.
+  Useful for targets that trigger many debug/error messages, or create output
+  on the serial console.
+
+* ``-jobs=4 -workers=4`` : These arguments configure libFuzzer to run 4 fuzzers in
+  parallel (4 fuzzing jobs in 4 worker processes). Alternatively, with only
+  ``-jobs=N``, libFuzzer automatically spawns a number of workers less than or equal
+  to half the available CPU cores. Replace 4 with a number appropriate for your
+  machine. Make sure to specify a ``CORPUS_DIR``, which will allow the parallel
+  fuzzers to share information about the interesting inputs they find.
+
+* ``-use_value_profile=1`` : For each comparison operation, libFuzzer computes
+  ``(caller_pc&4095) | (popcnt(Arg1 ^ Arg2) << 12)`` and places this in the
+  coverage table. Useful for targets with "magic" constants. If Arg1 came from
+  the fuzzer's input and Arg2 is a magic constant, then each time the Hamming
+  distance between Arg1 and Arg2 decreases, libFuzzer adds the input to the
+  corpus.
+
+* ``-shrink=1`` : Tries to make elements of the corpus "smaller". Might lead to
+  better coverage performance, depending on the target.
+
+Note that libFuzzer's exact behavior will depend on the version of
+clang and libFuzzer used to build the device fuzzers.
+
+Generating Coverage Reports
+---------------------------
+
+Code coverage is a crucial metric for evaluating a fuzzer's performance.
+libFuzzer's output provides a "cov: " column that shows the total number of
+unique blocks/edges covered. To examine coverage on a line-by-line basis we
+can use Clang coverage:
+
+ 1. Configure libFuzzer to store a corpus of all interesting inputs (see
+    CORPUS_DIR above)
+ 2. ``./configure`` the QEMU build with ::
+
+     --enable-fuzzing \
+     --extra-cflags="-fprofile-instr-generate -fcoverage-mapping"
+
+ 3. Re-run the fuzzer. Specify $CORPUS_DIR/* as an argument, telling libfuzzer
+    to execute all of the inputs in $CORPUS_DIR and exit. Once the process
+    exits, you should find a file, "default.profraw" in the working directory.
+ 4. Execute these commands to generate a detailed HTML coverage-report::
+
+      llvm-profdata merge -output=default.profdata default.profraw
+      llvm-cov show ./path/to/qemu-fuzz-i386 -instr-profile=default.profdata \
+      --format html -output-dir=/path/to/output/report
+
+Adding a new fuzzer
+-------------------
+
+Coverage over virtual devices can be improved by adding additional fuzzers.
+Fuzzers are kept in ``tests/qtest/fuzz/`` and should be added to
+``tests/qtest/fuzz/meson.build``
+
+Fuzzers can rely on both qtest and libqos to communicate with virtual devices.
+
+1. Create a new source file. For example ``tests/qtest/fuzz/foo-device-fuzz.c``.
+
+2. Write the fuzzing code using the libqtest/libqos API. See existing fuzzers
+   for reference.
+
+3. Add the fuzzer to ``tests/qtest/fuzz/meson.build``.
+
+Fuzzers can be more-or-less thought of as special qtest programs which can
+modify the qtest commands and/or qtest command arguments based on inputs
+provided by libfuzzer. libFuzzer passes a byte array and its length. Commonly
+the fuzzer loops over the byte-array, interpreting it as a list of qtest
+commands, addresses, or values.
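+
+As a rough illustration (not one of the in-tree fuzzers), such a loop might
+consume the libFuzzer-provided bytes as a sequence of (offset, value) pairs
+and turn them into MMIO writes; the ``s`` QTestState and the 0xe0000000
+register window below are assumptions invented for this sketch::
+
+    /* Hypothetical sketch: interpret fuzz input as (offset, value) MMIO writes.
+     * "s" is assumed to be a QTestState initialized when the fuzzer started,
+     * and 0xe0000000 is an arbitrary example base address for the device. */
+    static void foo_device_fuzz(QTestState *s, const unsigned char *data,
+                                size_t size)
+    {
+        while (size >= 8) {
+            uint32_t offset, value;
+
+            memcpy(&offset, data, 4);
+            memcpy(&value, data + 4, 4);
+            data += 8;
+            size -= 8;
+
+            /* Keep the write inside the device's register window. */
+            qtest_writel(s, 0xe0000000 + (offset & 0xfff), value);
+        }
+    }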
+
+The Generic Fuzzer
+------------------
+
+Writing a fuzz target can be a lot of effort (especially if a device driver has
+not been built out within libqos). Many devices can be fuzzed to some degree,
+without any device-specific code, using the generic-fuzz target.
+
+The generic-fuzz target is capable of fuzzing devices over their PIO, MMIO,
+and DMA input-spaces. To apply generic-fuzz to a device, we need to define
+two env-variables, at minimum:
+
+* ``QEMU_FUZZ_ARGS=`` is the set of QEMU arguments used to configure a machine, with
+  the device attached. For example, if we want to fuzz the virtio-net device
+  attached to a pc-i440fx machine, we can specify::
+
+    QEMU_FUZZ_ARGS="-M pc -nodefaults -netdev user,id=user0 \
+    -device virtio-net,netdev=user0"
+
+* ``QEMU_FUZZ_OBJECTS=`` is a set of space-delimited strings used to identify
+  the MemoryRegions that will be fuzzed. These strings are compared against
+  MemoryRegion names and MemoryRegion owner names, to decide whether each
+  MemoryRegion should be fuzzed. These strings support globbing. For the
+  virtio-net example, we could use one of ::
+
+    QEMU_FUZZ_OBJECTS='virtio-net'
+    QEMU_FUZZ_OBJECTS='virtio*'
+    QEMU_FUZZ_OBJECTS='virtio* pcspk' # Fuzz the virtio devices and the speaker
+    QEMU_FUZZ_OBJECTS='*' # Fuzz the whole machine
+
+The ``"info mtree"`` and ``"info qom-tree"`` monitor commands can be especially
+useful for identifying the ``MemoryRegion`` and ``Object`` names used for
+matching.
+
+As a general rule of thumb, the more ``MemoryRegions``/Devices we match, the
+greater the input-space, and the smaller the probability of finding crashing
+inputs for individual devices. As such, it is usually a good idea to limit the
+fuzzer to only a few ``MemoryRegions``.
+
+To ensure that these env variables have been configured correctly, we can use::
+
+    ./qemu-fuzz-i386 --fuzz-target=generic-fuzz -runs=0
+
+The output should contain a complete list of matched MemoryRegions.
+
+OSS-Fuzz
+--------
+QEMU is continuously fuzzed on `OSS-Fuzz
+<https://github.com/google/oss-fuzz>`_. By default, the OSS-Fuzz build
+will try to fuzz every fuzz-target. Since the generic-fuzz target
+requires additional information provided in environment variables, we
+pre-define some generic-fuzz configs in
+``tests/qtest/fuzz/generic_fuzz_configs.h``. Each config must specify:
+
+- ``.name``: To identify the fuzzer config
+
+- ``.args`` OR ``.argfunc``: A string or pointer to a function returning a
+  string. These strings are used to specify the ``QEMU_FUZZ_ARGS``
+  environment variable. ``argfunc`` is useful when the config relies on e.g.
+  a dynamically created temp directory, or a free tcp/udp port.
+
+- ``.objects``: A string that specifies the ``QEMU_FUZZ_OBJECTS`` environment
+  variable.
+
+To fuzz additional devices/device configurations on OSS-Fuzz, send patches for
+either a new device-specific fuzzer or a new generic-fuzz config.
+
+Build details:
+
+- The Dockerfile that sets up the environment for building QEMU's
+  fuzzers on OSS-Fuzz can be found in the `OSS-Fuzz repository
+  <https://github.com/google/oss-fuzz/blob/master/projects/qemu/Dockerfile>`__
+
+- The script responsible for building the fuzzers can be found in the
+  QEMU source tree at ``scripts/oss-fuzz/build.sh``
+
+Building Crash Reproducers
+-----------------------------------------
+When we find a crash, we should try to create an independent reproducer that
+can be used on a non-fuzzer build of QEMU.
+This filters out any potential false-positives, and improves the debugging
+experience for developers.
+Here are the steps for building a reproducer for a crash found by the
+generic-fuzz target.
+
+- Ensure the crash reproduces::
+
+    qemu-fuzz-i386 --fuzz-target... ./crash-...
+
+- Gather the QTest output for the crash::
+
+    QEMU_FUZZ_TIMEOUT=0 QTEST_LOG=1 FUZZ_SERIALIZE_QTEST=1 \
+    qemu-fuzz-i386 --fuzz-target... ./crash-... &> /tmp/trace
+
+- Reorder and clean-up the resulting trace::
+
+    scripts/oss-fuzz/reorder_fuzzer_qtest_trace.py /tmp/trace > /tmp/reproducer
+
+- Get the arguments needed to start qemu, and provide a path to qemu::
+
+    less /tmp/trace # The args should be logged at the top of this file
+    export QEMU_ARGS="-machine ..."
+    export QEMU_PATH="path/to/qemu-system"
+
+- Ensure the crash reproduces in qemu-system::
+
+    $QEMU_PATH $QEMU_ARGS -qtest stdio < /tmp/reproducer
+
+- From the crash output, obtain some string that identifies the crash. This
+  can be a line in the stack-trace, for example::
+
+    export CRASH_TOKEN="hw/usb/hcd-xhci.c:1865"
+
+- Minimize the reproducer::
+
+    scripts/oss-fuzz/minimize_qtest_trace.py -M1 -M2 \
+    /tmp/reproducer /tmp/reproducer-minimized
+
+- Confirm that the minimized reproducer still crashes::
+
+    $QEMU_PATH $QEMU_ARGS -qtest stdio < /tmp/reproducer-minimized
+
+- Create a one-liner reproducer that can be sent over email::
+
+    ./scripts/oss-fuzz/output_reproducer.py -bash /tmp/reproducer-minimized
+
+- Output the C source code for a test case that will reproduce the bug::
+
+    ./scripts/oss-fuzz/output_reproducer.py -owner "John Smith <john@smith.com>"\
+    -name "test_function_name" /tmp/reproducer-minimized
+
+- Report the bug and send a patch with the C reproducer upstream
+
+Implementation Details / Fuzzer Lifecycle
+-----------------------------------------
+
+The fuzzer has two entrypoints that libfuzzer calls. libfuzzer provides its
+own ``main()``, which performs some setup, and calls the entrypoints:
+
+``LLVMFuzzerInitialize``: called prior to fuzzing. Used to initialize all of the
+necessary state.
+
+``LLVMFuzzerTestOneInput``: called for each fuzzing run. Processes the input and
+resets the state at the end of each run.
+
+In more detail:
+
+``LLVMFuzzerInitialize`` parses the arguments to the fuzzer (must start with two
+dashes, so they are ignored by libfuzzer ``main()``). Currently, the arguments
+select the fuzz target. Then, the qtest client is initialized. If the target
+requires qos, qgraph is set up and the QOM/LIBQOS modules are initialized.
+Then the QGraph is walked and the QEMU cmd_line is determined and saved.
+
+After this, the ``vl.c:qemu_main`` is called to set up the guest. There are
+target-specific hooks that can be called before and after qemu_main, for
+additional setup (e.g. PCI setup, or VM snapshotting).
+
+``LLVMFuzzerTestOneInput``: Uses qtest/qos functions to act based on the fuzz
+input. It is also responsible for manually calling ``main_loop_wait`` to ensure
+that bottom halves are executed and for performing any cleanup required before
+the next input.
+
+Since the same process is reused for many fuzzing runs, QEMU state needs to
+be reset at the end of each run. There are currently two implemented
+options for resetting state:
+
+- Reboot the guest between runs.
+
+  - *Pros*: Straightforward and fast for simple fuzz targets.
+
+  - *Cons*: Depending on the device, does not reset all device state.
If the + device requires some initialization prior to being ready for fuzzing (common + for QOS-based targets), this initialization needs to be done after each + reboot. + + - *Example target*: ``i440fx-qtest-reboot-fuzz`` + +- Run each test case in a separate forked process and copy the coverage + information back to the parent. This is fairly similar to AFL's "deferred" + fork-server mode [3] + + - *Pros*: Relatively fast. Devices only need to be initialized once. No need to + do slow reboots or vmloads. + + - *Cons*: Not officially supported by libfuzzer. Does not work well for + devices that rely on dedicated threads. + + - *Example target*: ``virtio-net-fork-fuzz`` diff --git a/docs/devel/index.rst b/docs/devel/index.rst new file mode 100644 index 000000000..afd937535 --- /dev/null +++ b/docs/devel/index.rst @@ -0,0 +1,50 @@ +--------------------- +Developer Information +--------------------- + +This section of the manual documents various parts of the internals of QEMU. +You only need to read it if you are interested in reading or +modifying QEMU's source code. + +.. toctree:: + :maxdepth: 2 + :includehidden: + + code-of-conduct + conflict-resolution + build-system + style + kconfig + testing + fuzzing + control-flow-integrity + loads-stores + memory + migration + atomics + stable-process + ci + qtest + decodetree + secure-coding-practices + tcg + tcg-icount + tracing + multi-thread-tcg + tcg-plugins + bitops + ui + reset + s390-dasd-ipl + clocks + qom + modules + block-coroutine-wrapper + multi-process + ebpf_rss + vfio-migration + qapi-code-gen + writing-monitor-commands + trivial-patches + submitting-a-patch + submitting-a-pull-request diff --git a/docs/devel/kconfig.rst b/docs/devel/kconfig.rst new file mode 100644 index 000000000..a1cdbec75 --- /dev/null +++ b/docs/devel/kconfig.rst @@ -0,0 +1,307 @@ +.. _kconfig: + +================ +QEMU and Kconfig +================ + +QEMU is a very versatile emulator; it can be built for a variety of +targets, where each target can emulate various boards and at the same +time different targets can share large amounts of code. For example, +a POWER and an x86 board can run the same code to emulate a PCI network +card, even though the boards use different PCI host bridges, and they +can run the same code to emulate a SCSI disk while using different +SCSI adapters. Arm, s390 and x86 boards can all present a virtio-blk +disk to their guests, but with three different virtio guest interfaces. + +Each QEMU target enables a subset of the boards, devices and buses that +are included in QEMU's source code. As a result, each QEMU executable +only links a small subset of the files that form QEMU's source code; +anything that is not needed to support a particular target is culled. + +QEMU uses a simple domain-specific language to describe the dependencies +between components. This is useful for two reasons: + +* new targets and boards can be added without knowing in detail the + architecture of the hardware emulation subsystems. Boards only have + to list the components they need, and the compiled executable will + include all the required dependencies and all the devices that the + user can add to that board; + +* users can easily build reduced versions of QEMU that support only a subset + of boards or devices. For example, by default most targets will include + all emulated PCI devices that QEMU supports, but the build process is + configurable and it is easy to drop unnecessary (or otherwise unwanted) + code to make a leaner binary. 
+ +This domain-specific language is based on the Kconfig language that +originated in the Linux kernel, though it was heavily simplified and +the handling of dependencies is stricter in QEMU. + +Unlike Linux, there is no user interface to edit the configuration, which +is instead specified in per-target files under the ``default-configs/`` +directory of the QEMU source tree. This is because, unlike Linux, +configuration and dependencies can be treated as a black box when building +QEMU; the default configuration that QEMU ships with should be okay in +almost all cases. + +The Kconfig language +-------------------- + +Kconfig defines configurable components in files named ``hw/*/Kconfig``. +Note that configurable components are _not_ visible in C code as preprocessor +symbols; they are only visible in the Makefile. Each configurable component +defines a Makefile variable whose name starts with ``CONFIG_``. + +All elements have boolean (true/false) type; truth is written as ``y``, while +falsehood is written ``n``. They are defined in a Kconfig +stanza like the following:: + + config ARM_VIRT + bool + imply PCI_DEVICES + imply VFIO_AMD_XGBE + imply VFIO_XGMAC + select A15MPCORE + select ACPI + select ARM_SMMUV3 + +The ``config`` keyword introduces a new configuration element. In the example +above, Makefiles will have access to a variable named ``CONFIG_ARM_VIRT``, +with value ``y`` or ``n`` (respectively for boolean true and false). + +Boolean expressions can be used within the language, whenever ``<expr>`` +is written in the remainder of this section. The ``&&``, ``||`` and +``!`` operators respectively denote conjunction (AND), disjunction (OR) +and negation (NOT). + +The ``bool`` data type declaration is optional, but it is suggested to +include it for clarity and future-proofing. After ``bool`` the following +directives can be included: + +**dependencies**: ``depends on <expr>`` + + This defines a dependency for this configurable element. Dependencies + evaluate an expression and force the value of the variable to false + if the expression is false. + +**reverse dependencies**: ``select <symbol> [if <expr>]`` + + While ``depends on`` can force a symbol to false, reverse dependencies can + be used to force another symbol to true. In the following example, + ``CONFIG_BAZ`` will be true whenever ``CONFIG_FOO`` is true:: + + config FOO + select BAZ + + The optional expression will prevent ``select`` from having any effect + unless it is true. + + Note that unlike Linux's Kconfig implementation, QEMU will detect + contradictions between ``depends on`` and ``select`` statements and prevent + you from building such a configuration. + +**default value**: ``default <value> [if <expr>]`` + + Default values are assigned to the config symbol if no other value was + set by the user via ``default-configs/*.mak`` files, and only if + ``select`` or ``depends on`` directives do not force the value to true + or false respectively. ``<value>`` can be ``y`` or ``n``; it cannot + be an arbitrary Boolean expression. However, a condition for applying + the default value can be added with ``if``. + + A configuration element can have any number of default values (usually, + if more than one default is present, they will have different + conditions). If multiple default values satisfy their condition, + only the first defined one is active. + +**reverse default** (weak reverse dependency): ``imply <symbol> [if <expr>]`` + + This is similar to ``select`` as it applies a lower limit of ``y`` + to another symbol. 
However, the lower limit is only a default + and the "implied" symbol's value may still be set to ``n`` from a + ``default-configs/*.mak`` files. The following two examples are + equivalent:: + + config FOO + bool + imply BAZ + + config BAZ + bool + default y if FOO + + The next section explains where to use ``imply`` or ``default y``. + +Guidelines for writing Kconfig files +------------------------------------ + +Configurable elements in QEMU fall under five broad groups. Each group +declares its dependencies in different ways: + +**subsystems**, of which **buses** are a special case + + Example:: + + config SCSI + bool + + Subsystems always default to false (they have no ``default`` directive) + and are never visible in ``default-configs/*.mak`` files. It's + up to other symbols to ``select`` whatever subsystems they require. + + They sometimes have ``select`` directives to bring in other required + subsystems or buses. For example, ``AUX`` (the DisplayPort auxiliary + channel "bus") selects ``I2C`` because it can act as an I2C master too. + +**devices** + + Example:: + + config MEGASAS_SCSI_PCI + bool + default y if PCI_DEVICES + depends on PCI + select SCSI + + Devices are the most complex of the five. They can have a variety + of directives that cooperate so that a default configuration includes + all the devices that can be accessed from QEMU. + + Devices *depend on* the bus that they lie on, for example a PCI + device would specify ``depends on PCI``. An MMIO device will likely + have no ``depends on`` directive. Devices also *select* the buses + that the device provides, for example a SCSI adapter would specify + ``select SCSI``. Finally, devices are usually ``default y`` if and + only if they have at least one ``depends on``; the default could be + conditional on a device group. + + Devices also select any optional subsystem that they use; for example + a video card might specify ``select EDID`` if it needs to build EDID + information and publish it to the guest. + +**device groups** + + Example:: + + config PCI_DEVICES + bool + + Device groups provide a convenient mechanism to enable/disable many + devices in one go. This is useful when a set of devices is likely to + be enabled/disabled by several targets. Device groups usually need + no directive and are not used in the Makefile either; they only appear + as conditions for ``default y`` directives. + + QEMU currently has two device groups, ``PCI_DEVICES`` and + ``TEST_DEVICES``. PCI devices usually have a ``default y if + PCI_DEVICES`` directive rather than just ``default y``. This lets + some boards (notably s390) easily support a subset of PCI devices, + for example only VFIO (passthrough) and virtio-pci devices. + ``TEST_DEVICES`` instead is used for devices that are rarely used on + production virtual machines, but provide useful hooks to test QEMU + or KVM. + +**boards** + + Example:: + + config SUN4M + bool + imply TCX + imply CG3 + select CS4231 + select ECCMEMCTL + select EMPTY_SLOT + select ESCC + select ESP + select FDC + select SLAVIO + select LANCE + select M48T59 + select STP2000 + + Boards specify their constituent devices using ``imply`` and ``select`` + directives. A device should be listed under ``select`` if the board + cannot be started at all without it. It should be listed under + ``imply`` if (depending on the QEMU command line) the board may or + may not be started without it. Boards also default to false; they are + enabled by the ``default-configs/*.mak`` for the target they apply to. 
+ +**internal elements** + + Example:: + + config ECCMEMCTL + bool + select ECC + + Internal elements group code that is useful in several boards or + devices. They are usually enabled with ``select`` and in turn select + other elements; they are never visible in ``default-configs/*.mak`` + files, and often not even in the Makefile. + +Writing and modifying default configurations +-------------------------------------------- + +In addition to the Kconfig files under hw/, each target also includes +a file called ``default-configs/TARGETNAME-softmmu.mak``. These files +initialize some Kconfig variables to non-default values and provide the +starting point to turn on devices and subsystems. + +A file in ``default-configs/`` looks like the following example:: + + # Default configuration for alpha-softmmu + + # Uncomment the following lines to disable these optional devices: + # + #CONFIG_PCI_DEVICES=n + #CONFIG_TEST_DEVICES=n + + # Boards: + # + CONFIG_DP264=y + +The first part, consisting of commented-out ``=n`` assignments, tells +the user which devices or device groups are implied by the boards. +The second part, consisting of ``=y`` assignments, tells the user which +boards are supported by the target. The user will typically modify +the default configuration by uncommenting lines in the first group, +or commenting out lines in the second group. + +It is also possible to run QEMU's configure script with the +``--without-default-devices`` option. When this is done, everything defaults +to ``n`` unless it is ``select``ed or explicitly switched on in the +``.mak`` files. In other words, ``default`` and ``imply`` directives +are disabled. When QEMU is built with this option, the user will probably +want to change some lines in the first group, for example like this:: + + CONFIG_PCI_DEVICES=y + #CONFIG_TEST_DEVICES=n + +and/or pick a subset of the devices in those device groups. Right now +there is no single place that lists all the optional devices for +``CONFIG_PCI_DEVICES`` and ``CONFIG_TEST_DEVICES``. In the future, +we expect that ``.mak`` files will be automatically generated, so that +they will include all these symbols and some help text on what they do. + +``Kconfig.host`` +---------------- + +In some special cases, a configurable element depends on host features +that are detected by QEMU's configure or ``meson.build`` scripts; for +example some devices depend on the availability of KVM or on the presence +of a library on the host. + +These symbols should be listed in ``Kconfig.host`` like this:: + + config TPM + bool + +and also listed as follows in the top-level meson.build's host_kconfig +variable:: + + host_kconfig = \ + ('CONFIG_TPM' in config_host ? ['CONFIG_TPM=y'] : []) + \ + ('CONFIG_SPICE' in config_host ? ['CONFIG_SPICE=y'] : []) + \ + (have_ivshmem ? ['CONFIG_IVSHMEM=y'] : []) + \ + ... diff --git a/docs/devel/loads-stores.rst b/docs/devel/loads-stores.rst new file mode 100644 index 000000000..8f0035c82 --- /dev/null +++ b/docs/devel/loads-stores.rst @@ -0,0 +1,558 @@ +.. + Copyright (c) 2017 Linaro Limited + Written by Peter Maydell + +=================== +Load and Store APIs +=================== + +QEMU internally has multiple families of functions for performing +loads and stores. This document attempts to enumerate them all +and indicate when to use them. It does not provide detailed +documentation of each API -- for that you should look at the +documentation comments in the relevant header files. 
+ + +``ld*_p and st*_p`` +~~~~~~~~~~~~~~~~~~~ + +These functions operate on a host pointer, and should be used +when you already have a pointer into host memory (corresponding +to guest ram or a local buffer). They deal with doing accesses +with the desired endianness and with correctly handling +potentially unaligned pointer values. + +Function names follow the pattern: + +load: ``ld{sign}{size}_{endian}_p(ptr)`` + +store: ``st{size}_{endian}_p(ptr, val)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``he`` : host endian + - ``be`` : big endian + - ``le`` : little endian + +The ``_{endian}`` infix is omitted for target-endian accesses. + +The target endian accessors are only available to source +files which are built per-target. + +There are also functions which take the size as an argument: + +load: ``ldn{endian}_p(ptr, sz)`` + +which performs an unsigned load of ``sz`` bytes from ``ptr`` +as an ``{endian}`` order value and returns it in a uint64_t. + +store: ``stn{endian}_p(ptr, sz, val)`` + +which stores ``val`` to ``ptr`` as an ``{endian}`` order value +of size ``sz`` bytes. + + +Regexes for git grep + - ``\<ld[us]\?[bwlq]\(_[hbl]e\)\?_p\>`` + - ``\<st[bwlq]\(_[hbl]e\)\?_p\>`` + - ``\<ldn_\([hbl]e\)?_p\>`` + - ``\<stn_\([hbl]e\)?_p\>`` + +``cpu_{ld,st}*_mmu`` +~~~~~~~~~~~~~~~~~~~~ + +These functions operate on a guest virtual address, plus a context +known as a "mmu index" which controls how that virtual address is +translated, plus a ``MemOp`` which contains alignment requirements +among other things. The ``MemOp`` and mmu index are combined into +a single argument of type ``MemOpIdx``. + +The meaning of the indexes are target specific, but specifying a +particular index might be necessary if, for instance, the helper +requires a "always as non-privileged" access rather than the +default access for the current state of the guest CPU. + +These functions may cause a guest CPU exception to be taken +(e.g. for an alignment fault or MMU fault) which will result in +guest CPU state being updated and control longjmp'ing out of the +function call. They should therefore only be used in code that is +implementing emulation of the guest CPU. + +The ``retaddr`` parameter is used to control unwinding of the +guest CPU state in case of a guest CPU exception. This is passed +to ``cpu_restore_state()``. Therefore the value should either be 0, +to indicate that the guest CPU state is already synchronized, or +the result of ``GETPC()`` from the top level ``HELPER(foo)`` +function, which is a return address into the generated code [#gpc]_. + +.. [#gpc] Note that ``GETPC()`` should be used with great care: calling + it in other functions that are *not* the top level + ``HELPER(foo)`` will cause unexpected behavior. Instead, the + value of ``GETPC()`` should be read from the helper and passed + if needed to the functions that the helper calls. 
+ +Function names follow the pattern: + +load: ``cpu_ld{size}{end}_mmu(env, ptr, oi, retaddr)`` + +store: ``cpu_st{size}{end}_mmu(env, ptr, val, oi, retaddr)`` + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep: + - ``\<cpu_ld[bwlq](_[bl]e)\?_mmu\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_mmu\>`` + + +``cpu_{ld,st}*_mmuidx_ra`` +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions work like the ``cpu_{ld,st}_mmu`` functions except +that the ``mmuidx`` parameter is not combined with a ``MemOp``, +and therefore there is no required alignment supplied or enforced. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}{end}_mmuidx_ra(env, ptr, mmuidx, retaddr)`` + +store: ``cpu_st{size}{end}_mmuidx_ra(env, ptr, val, mmuidx, retaddr)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep: + - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_mmuidx_ra\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_mmuidx_ra\>`` + +``cpu_{ld,st}*_data_ra`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions work like the ``cpu_{ld,st}_mmuidx_ra`` functions +except that the ``mmuidx`` parameter is taken from the current mode +of the guest CPU, as determined by ``cpu_mmu_index(env, false)``. + +These are generally the preferred way to do accesses by guest +virtual address from helper functions, unless the access should +be performed with a context other than the default, or alignment +should be enforced for the access. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}{end}_data_ra(env, ptr, ra)`` + +store: ``cpu_st{size}{end}_data_ra(env, ptr, val, ra)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep: + - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data_ra\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_data_ra\>`` + +``cpu_{ld,st}*_data`` +~~~~~~~~~~~~~~~~~~~~~ + +These functions work like the ``cpu_{ld,st}_data_ra`` functions +except that the ``retaddr`` parameter is 0, and thus does not +unwind guest CPU state. + +This means they must only be used from helper functions where the +translator has saved all necessary CPU state. These functions are +the right choice for calls made from hooks like the CPU ``do_interrupt`` +hook or when you know for certain that the translator had to save all +the CPU state anyway. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}{end}_data(env, ptr)`` + +store: ``cpu_st{size}{end}_data(env, ptr, val)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep + - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_data\+\>`` + +``cpu_ld*_code`` +~~~~~~~~~~~~~~~~ + +These functions perform a read for instruction execution. 
The ``mmuidx`` +parameter is taken from the current mode of the guest CPU, as determined +by ``cpu_mmu_index(env, true)``. The ``retaddr`` parameter is 0, and +thus does not unwind guest CPU state, because CPU state is always +synchronized while translating instructions. Any guest CPU exception +that is raised will indicate an instruction execution fault rather than +a data read fault. + +In general these functions should not be used directly during translation. +There are wrapper functions that are to be used which also take care of +plugins for tracing. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}_code(env, ptr)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +Regexes for git grep: + - ``\<cpu_ld[us]\?[bwlq]_code\>`` + +``translator_ld*`` +~~~~~~~~~~~~~~~~~~ + +These functions are a wrapper for ``cpu_ld*_code`` which also perform +any actions required by any tracing plugins. They are only to be +called during the translator callback ``translate_insn``. + +There is a set of functions ending in ``_swap`` which, if the parameter +is true, returns the value in the endianness that is the reverse of +the guest native endianness, as determined by ``TARGET_WORDS_BIGENDIAN``. + +Function names follow the pattern: + +load: ``translator_ld{sign}{size}(env, ptr)`` + +swap: ``translator_ld{sign}{size}_swap(env, ptr, swap)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +Regexes for git grep + - ``\<translator_ld[us]\?[bwlq]\(_swap\)\?\>`` + +``helper_*_{ld,st}*_mmu`` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions are intended primarily to be called by the code +generated by the TCG backend. They may also be called by target +CPU helper function code. Like the ``cpu_{ld,st}_mmuidx_ra`` functions +they perform accesses by guest virtual address, with a given ``mmuidx``. + +These functions specify an ``opindex`` parameter which encodes +(among other things) the mmu index to use for the access. This parameter +should be created by calling ``make_memop_idx()``. + +The ``retaddr`` parameter should be the result of GETPC() called directly +from the top level HELPER(foo) function (or 0 if no guest CPU state +unwinding is required). + +**TODO** The names of these functions are a bit odd for historical +reasons because they were originally expected to be called only from +within generated code. We should rename them to bring them more in +line with the other memory access functions. The explicit endianness +is the only feature they have beyond ``*_mmuidx_ra``. + +load: ``helper_{endian}_ld{sign}{size}_mmu(env, addr, opindex, retaddr)`` + +store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + - ``ret`` : target endianness + +Regexes for git grep + - ``\<helper_\(le\|be\|ret\)_ld[us]\?[bwlq]_mmu\>`` + - ``\<helper_\(le\|be\|ret\)_st[bwlq]_mmu\>`` + +``address_space_*`` +~~~~~~~~~~~~~~~~~~~ + +These functions are the primary ones to use when emulating CPU +or device memory accesses. 
They take an AddressSpace, which is the +way QEMU defines the view of memory that a device or CPU has. +(They generally correspond to being the "master" end of a hardware bus +or bus fabric.) + +Each CPU has an AddressSpace. Some kinds of CPU have more than +one AddressSpace (for instance Arm guest CPUs have an AddressSpace +for the Secure world and one for NonSecure if they implement TrustZone). +Devices which can do DMA-type operations should generally have an +AddressSpace. There is also a "system address space" which typically +has all the devices and memory that all CPUs can see. (Some older +device models use the "system address space" rather than properly +modelling that they have an AddressSpace of their own.) + +Functions are provided for doing byte-buffer reads and writes, +and also for doing one-data-item loads and stores. + +In all cases the caller provides a MemTxAttrs to specify bus +transaction attributes, and can check whether the memory transaction +succeeded using a MemTxResult return code. + +``address_space_read(address_space, addr, attrs, buf, len)`` + +``address_space_write(address_space, addr, attrs, buf, len)`` + +``address_space_rw(address_space, addr, attrs, buf, len, is_write)`` + +``address_space_ld{sign}{size}_{endian}(address_space, addr, attrs, txresult)`` + +``address_space_st{size}_{endian}(address_space, addr, val, attrs, txresult)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + +(No signed load operations are provided.) + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + +The ``_{endian}`` suffix is omitted for byte accesses. + +Regexes for git grep + - ``\<address_space_\(read\|write\|rw\)\>`` + - ``\<address_space_ldu\?[bwql]\(_[lb]e\)\?\>`` + - ``\<address_space_st[bwql]\(_[lb]e\)\?\>`` + +``address_space_write_rom`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This function performs a write by physical address like +``address_space_write``, except that if the write is to a ROM then +the ROM contents will be modified, even though a write by the guest +CPU to the ROM would be ignored. This is used for non-guest writes +like writes from the gdb debug stub or initial loading of ROM contents. + +Note that portions of the write which attempt to write data to a +device will be silently ignored -- only real RAM and ROM will +be written to. + +Regexes for git grep + - ``address_space_write_rom`` + +``{ld,st}*_phys`` +~~~~~~~~~~~~~~~~~ + +These are functions which are identical to +``address_space_{ld,st}*``, except that they always pass +``MEMTXATTRS_UNSPECIFIED`` for the transaction attributes, and ignore +whether the transaction succeeded or failed. + +The fact that they ignore whether the transaction succeeded means +they should not be used in new code, unless you know for certain +that your code will only be used in a context where the CPU or +device doing the access has no way to report such an error. + +``load: ld{sign}{size}_{endian}_phys`` + +``store: st{size}_{endian}_phys`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + +(No signed load operations are provided.) + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + +The ``_{endian}_`` infix is omitted for byte accesses. 
+ +Regexes for git grep + - ``\<ldu\?[bwlq]\(_[bl]e\)\?_phys\>`` + - ``\<st[bwlq]\(_[bl]e\)\?_phys\>`` + +``cpu_physical_memory_*`` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +These are convenience functions which are identical to +``address_space_*`` but operate specifically on the system address space, +always pass a ``MEMTXATTRS_UNSPECIFIED`` set of memory attributes and +ignore whether the memory transaction succeeded or failed. +For new code they are better avoided: + +* there is likely to be behaviour you need to model correctly for a + failed read or write operation +* a device should usually perform operations on its own AddressSpace + rather than using the system address space + +``cpu_physical_memory_read`` + +``cpu_physical_memory_write`` + +``cpu_physical_memory_rw`` + +Regexes for git grep + - ``\<cpu_physical_memory_\(read\|write\|rw\)\>`` + +``cpu_memory_rw_debug`` +~~~~~~~~~~~~~~~~~~~~~~~ + +Access CPU memory by virtual address for debug purposes. + +This function is intended for use by the GDB stub and similar code. +It takes a virtual address, converts it to a physical address via +an MMU lookup using the current settings of the specified CPU, +and then performs the access (using ``address_space_rw`` for +reads or ``cpu_physical_memory_write_rom`` for writes). +This means that if the access is a write to a ROM then this +function will modify the contents (whereas a normal guest CPU access +would ignore the write attempt). + +``cpu_memory_rw_debug`` + +``dma_memory_*`` +~~~~~~~~~~~~~~~~ + +These behave like ``address_space_*``, except that they perform a DMA +barrier operation first. + +**TODO**: We should provide guidance on when you need the DMA +barrier operation and when it's OK to use ``address_space_*``, and +make sure our existing code is doing things correctly. + +``dma_memory_read`` + +``dma_memory_write`` + +``dma_memory_rw`` + +Regexes for git grep + - ``\<dma_memory_\(read\|write\|rw\)\>`` + - ``\<ldu\?[bwlq]\(_[bl]e\)\?_dma\>`` + - ``\<st[bwlq]\(_[bl]e\)\?_dma\>`` + +``pci_dma_*`` and ``{ld,st}*_pci_dma`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions are specifically for PCI device models which need to +perform accesses where the PCI device is a bus master. You pass them a +``PCIDevice *`` and they will do ``dma_memory_*`` operations on the +correct address space for that device. + +``pci_dma_read`` + +``pci_dma_write`` + +``pci_dma_rw`` + +``load: ld{sign}{size}_{endian}_pci_dma`` + +``store: st{size}_{endian}_pci_dma`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + +(No signed load operations are provided.) + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + +The ``_{endian}_`` infix is omitted for byte accesses. + +Regexes for git grep + - ``\<pci_dma_\(read\|write\|rw\)\>`` + - ``\<ldu\?[bwlq]\(_[bl]e\)\?_pci_dma\>`` + - ``\<st[bwlq]\(_[bl]e\)\?_pci_dma\>`` diff --git a/docs/devel/lockcnt.txt b/docs/devel/lockcnt.txt new file mode 100644 index 000000000..a3fb3bc5d --- /dev/null +++ b/docs/devel/lockcnt.txt @@ -0,0 +1,277 @@ +DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt) +=================================================== + +QEMU often uses reference counts to track data structures that are being +accessed and should not be freed. 
For example, a loop that invoke +callbacks like this is not safe: + + QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + +QLIST_FOREACH_SAFE protects against deletion of the current node (ioh) +by stashing away its "next" pointer. However, ioh->fd_write could +actually delete the next node from the list. The simplest way to +avoid this is to mark the node as deleted, and remove it from the +list in the above loop: + + QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + } else { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + } + +If however this loop must also be reentrant, i.e. it is possible that +ioh->fd_write invokes the loop again, some kind of counting is needed: + + walking_handlers++; + QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + if (walking_handlers == 1) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + } + } else { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + } + walking_handlers--; + +One may think of using the RCU primitives, rcu_read_lock() and +rcu_read_unlock(); effectively, the RCU nesting count would take +the place of the walking_handlers global variable. Indeed, +reference counting and RCU have similar purposes, but their usage in +general is complementary: + +- reference counting is fine-grained and limited to a single data + structure; RCU delays reclamation of *all* RCU-protected data + structures; + +- reference counting works even in the presence of code that keeps + a reference for a long time; RCU critical sections in principle + should be kept short; + +- reference counting is often applied to code that is not thread-safe + but is reentrant; in fact, usage of reference counting in QEMU predates + the introduction of threads by many years. RCU is generally used to + protect readers from other threads freeing memory after concurrent + modifications to a data structure. + +- reclaiming data can be done by a separate thread in the case of RCU; + this can improve performance, but also delay reclamation undesirably. + With reference counting, reclamation is deterministic. + +This file documents QemuLockCnt, an abstraction for using reference +counting in code that has to be both thread-safe and reentrant. + + +QemuLockCnt concepts +-------------------- + +A QemuLockCnt comprises both a counter and a mutex; it has primitives +to increment and decrement the counter, and to take and release the +mutex. The counter notes how many visits to the data structures are +taking place (the visits could be from different threads, or there could +be multiple reentrant visits from the same thread). The basic rules +governing the counter/mutex pair then are the following: + +- Data protected by the QemuLockCnt must not be freed unless the + counter is zero and the mutex is taken. + +- A new visit cannot be started while the counter is zero and the + mutex is taken. + +Most of the time, the mutex protects all writes to the data structure, +not just frees, though there could be cases where this is not necessary. + +Reads, instead, can be done without taking the mutex, as long as the +readers and writers use the same macros that are used for RCU, for +example qatomic_rcu_read, qatomic_rcu_set, QLIST_FOREACH_RCU, etc. This is +because the reads are done outside a lock and a set or QLIST_INSERT_HEAD +can happen concurrently with the read. 
The RCU API ensures that the +processor and the compiler see all required memory barriers. + +This could be implemented simply by protecting the counter with the +mutex, for example: + + // (1) + qemu_mutex_lock(&walking_handlers_mutex); + walking_handlers++; + qemu_mutex_unlock(&walking_handlers_mutex); + + ... + + // (2) + qemu_mutex_lock(&walking_handlers_mutex); + if (--walking_handlers == 0) { + QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + } + } + } + qemu_mutex_unlock(&walking_handlers_mutex); + +Here, no frees can happen in the code represented by the ellipsis. +If another thread is executing critical section (2), that part of +the code cannot be entered, because the thread will not be able +to increment the walking_handlers variable. And of course +during the visit any other thread will see a nonzero value for +walking_handlers, as in the single-threaded code. + +Note that it is possible for multiple concurrent accesses to delay +the cleanup arbitrarily; in other words, for the walking_handlers +counter to never become zero. For this reason, this technique is +more easily applicable if concurrent access to the structure is rare. + +However, critical sections are easy to forget since you have to do +them for each modification of the counter. QemuLockCnt ensures that +all modifications of the counter take the lock appropriately, and it +can also be more efficient in two ways: + +- it avoids taking the lock for many operations (for example + incrementing the counter while it is non-zero); + +- on some platforms, one can implement QemuLockCnt to hold the lock + and the mutex in a single word, making the fast path no more expensive + than simply managing a counter using atomic operations (see + docs/devel/atomics.rst). This can be very helpful if concurrent access to + the data structure is expected to be rare. + + +Using the same mutex for frees and writes can still incur some small +inefficiencies; for example, a visit can never start if the counter is +zero and the mutex is taken---even if the mutex is taken by a write, +which in principle need not block a visit of the data structure. +However, these are usually not a problem if any of the following +assumptions are valid: + +- concurrent access is possible but rare + +- writes are rare + +- writes are frequent, but this kind of write (e.g. appending to a + list) has a very small critical section. + +For example, QEMU uses QemuLockCnt to manage an AioContext's list of +bottom halves and file descriptor handlers. Modifications to the list +of file descriptor handlers are rare. Creation of a new bottom half is +frequent and can happen on a fast path; however: 1) it is almost never +concurrent with a visit to the list of bottom halves; 2) it only has +three instructions in the critical path, two assignments and a smp_wmb(). + + +QemuLockCnt API +--------------- + +The QemuLockCnt API is described in include/qemu/thread.h. + + +QemuLockCnt usage +----------------- + +This section explains the typical usage patterns for QemuLockCnt functions. + +Setting a variable to a non-NULL value can be done between +qemu_lockcnt_lock and qemu_lockcnt_unlock: + + qemu_lockcnt_lock(&xyz_lockcnt); + if (!xyz) { + new_xyz = g_new(XYZ, 1); + ... 
+ qatomic_rcu_set(&xyz, new_xyz); + } + qemu_lockcnt_unlock(&xyz_lockcnt); + +Accessing the value can be done between qemu_lockcnt_inc and +qemu_lockcnt_dec: + + qemu_lockcnt_inc(&xyz_lockcnt); + if (xyz) { + XYZ *p = qatomic_rcu_read(&xyz); + ... + /* Accesses can now be done through "p". */ + } + qemu_lockcnt_dec(&xyz_lockcnt); + +Freeing the object can similarly use qemu_lockcnt_lock and +qemu_lockcnt_unlock, but you also need to ensure that the count +is zero (i.e. there is no concurrent visit). Because qemu_lockcnt_inc +takes the QemuLockCnt's lock, the count cannot become non-zero while +the object is being freed. Freeing an object looks like this: + + qemu_lockcnt_lock(&xyz_lockcnt); + if (!qemu_lockcnt_count(&xyz_lockcnt)) { + g_free(xyz); + xyz = NULL; + } + qemu_lockcnt_unlock(&xyz_lockcnt); + +If an object has to be freed right after a visit, you can combine +the decrement, the locking and the check on count as follows: + + qemu_lockcnt_inc(&xyz_lockcnt); + if (xyz) { + XYZ *p = qatomic_rcu_read(&xyz); + ... + /* Accesses can now be done through "p". */ + } + if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) { + g_free(xyz); + xyz = NULL; + qemu_lockcnt_unlock(&xyz_lockcnt); + } + +QemuLockCnt can also be used to access a list as follows: + + qemu_lockcnt_inc(&io_handlers_lockcnt); + QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + + if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) { + QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + } + } + qemu_lockcnt_unlock(&io_handlers_lockcnt); + } + +Again, the RCU primitives are used because new items can be added to the +list during the walk. QLIST_FOREACH_RCU ensures that the processor and +the compiler see the appropriate memory barriers. + +An alternative pattern uses qemu_lockcnt_dec_if_lock: + + qemu_lockcnt_inc(&io_handlers_lockcnt); + QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt); + } + } else { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + } + qemu_lockcnt_dec(&io_handlers_lockcnt); + +Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock, +because there is no special task to do if the count goes from 1 to 0. diff --git a/docs/devel/memory.rst b/docs/devel/memory.rst new file mode 100644 index 000000000..5dc8a1268 --- /dev/null +++ b/docs/devel/memory.rst @@ -0,0 +1,368 @@ +============== +The memory API +============== + +The memory API models the memory and I/O buses and controllers of a QEMU +machine. It attempts to allow modelling of: + +- ordinary RAM +- memory-mapped I/O (MMIO) +- memory controllers that can dynamically reroute physical memory regions + to different destinations + +The memory model provides support for + +- tracking RAM changes by the guest +- setting up coalesced memory for kvm +- setting up ioeventfd regions for kvm + +Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks +(leaves) are RAM and MMIO regions, while other nodes represent +buses, memory controllers, and memory regions that have been rerouted. + +In addition to MemoryRegion objects, the memory API provides AddressSpace +objects for every root and possibly for intermediate MemoryRegions too. +These represent memory as seen from the CPU or a device's viewpoint. 
+ +Types of regions +---------------- + +There are multiple types of memory regions (all represented by a single C type +MemoryRegion): + +- RAM: a RAM region is simply a range of host memory that can be made available + to the guest. + You typically initialize these with memory_region_init_ram(). Some special + purposes require the variants memory_region_init_resizeable_ram(), + memory_region_init_ram_from_file(), or memory_region_init_ram_ptr(). + +- MMIO: a range of guest memory that is implemented by host callbacks; + each read or write causes a callback to be called on the host. + You initialize these with memory_region_init_io(), passing it a + MemoryRegionOps structure describing the callbacks. + +- ROM: a ROM memory region works like RAM for reads (directly accessing + a region of host memory), and forbids writes. You initialize these with + memory_region_init_rom(). + +- ROM device: a ROM device memory region works like RAM for reads + (directly accessing a region of host memory), but like MMIO for + writes (invoking a callback). You initialize these with + memory_region_init_rom_device(). + +- IOMMU region: an IOMMU region translates addresses of accesses made to it + and forwards them to some other target memory region. As the name suggests, + these are only needed for modelling an IOMMU, not for simple devices. + You initialize these with memory_region_init_iommu(). + +- container: a container simply includes other memory regions, each at + a different offset. Containers are useful for grouping several regions + into one unit. For example, a PCI BAR may be composed of a RAM region + and an MMIO region. + + A container's subregions are usually non-overlapping. In some cases it is + useful to have overlapping regions; for example a memory controller that + can overlay a subregion of RAM with MMIO or ROM, or a PCI controller + that does not prevent card from claiming overlapping BARs. + + You initialize a pure container with memory_region_init(). + +- alias: a subsection of another region. Aliases allow a region to be + split apart into discontiguous regions. Examples of uses are memory banks + used when the guest address space is smaller than the amount of RAM + addressed, or a memory controller that splits main memory to expose a "PCI + hole". Aliases may point to any type of region, including other aliases, + but an alias may not point back to itself, directly or indirectly. + You initialize these with memory_region_init_alias(). + +- reservation region: a reservation region is primarily for debugging. + It claims I/O space that is not supposed to be handled by QEMU itself. + The typical use is to track parts of the address space which will be + handled by the host kernel when KVM is enabled. You initialize these + by passing a NULL callback parameter to memory_region_init_io(). + +It is valid to add subregions to a region which is not a pure container +(that is, to an MMIO, RAM or ROM region). This means that the region +will act like a container, except that any addresses within the container's +region which are not claimed by any subregion are handled by the +container itself (ie by its MMIO callbacks or RAM backing). However +it is generally possible to achieve the same effect with a pure container +one of whose subregions is a low priority "background" region covering +the whole address range; this is often clearer and is preferred. +Subregions cannot be added to an alias region. 
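+
+As an illustration of containers and subregions, a device could compose a PCI
+BAR out of a RAM-backed framebuffer plus an MMIO register window roughly as
+follows; ``FooState``, ``foo_mmio_ops`` and the sizes are invented for this
+sketch rather than taken from an existing device::
+
+    /* Illustrative only: build a 64K BAR from a RAM part and an MMIO part. */
+    static void foo_init_bar(FooState *s, Error **errp)
+    {
+        /* Pure container covering the whole BAR. */
+        memory_region_init(&s->bar, OBJECT(s), "foo-bar", 0x10000);
+
+        /* RAM-backed framebuffer occupying the first 32K. */
+        memory_region_init_ram(&s->vram, OBJECT(s), "foo-vram", 0x8000, errp);
+        memory_region_add_subregion(&s->bar, 0x0000, &s->vram);
+
+        /* MMIO registers at offset 0x8000, implemented by callbacks. */
+        memory_region_init_io(&s->regs, OBJECT(s), &foo_mmio_ops, s,
+                              "foo-regs", 0x1000);
+        memory_region_add_subregion(&s->bar, 0x8000, &s->regs);
+    }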
+ +Migration +--------- + +Where the memory region is backed by host memory (RAM, ROM and +ROM device memory region types), this host memory needs to be +copied to the destination on migration. These APIs which allocate +the host memory for you will also register the memory so it is +migrated: + +- memory_region_init_ram() +- memory_region_init_rom() +- memory_region_init_rom_device() + +For most devices and boards this is the correct thing. If you +have a special case where you need to manage the migration of +the backing memory yourself, you can call the functions: + +- memory_region_init_ram_nomigrate() +- memory_region_init_rom_nomigrate() +- memory_region_init_rom_device_nomigrate() + +which only initialize the MemoryRegion and leave handling +migration to the caller. + +The functions: + +- memory_region_init_resizeable_ram() +- memory_region_init_ram_from_file() +- memory_region_init_ram_from_fd() +- memory_region_init_ram_ptr() +- memory_region_init_ram_device_ptr() + +are for special cases only, and so they do not automatically +register the backing memory for migration; the caller must +manage migration if necessary. + +Region names +------------ + +Regions are assigned names by the constructor. For most regions these are +only used for debugging purposes, but RAM regions also use the name to identify +live migration sections. This means that RAM region names need to have ABI +stability. + +Region lifecycle +---------------- + +A region is created by one of the memory_region_init*() functions and +attached to an object, which acts as its owner or parent. QEMU ensures +that the owner object remains alive as long as the region is visible to +the guest, or as long as the region is in use by a virtual CPU or another +device. For example, the owner object will not die between an +address_space_map operation and the corresponding address_space_unmap. + +After creation, a region can be added to an address space or a +container with memory_region_add_subregion(), and removed using +memory_region_del_subregion(). + +Various region attributes (read-only, dirty logging, coalesced mmio, +ioeventfd) can be changed during the region lifecycle. They take effect +as soon as the region is made visible. This can be immediately, later, +or never. + +Destruction of a memory region happens automatically when the owner +object dies. + +If however the memory region is part of a dynamically allocated data +structure, you should call object_unparent() to destroy the memory region +before the data structure is freed. For an example see VFIOMSIXInfo +and VFIOQuirk in hw/vfio/pci.c. + +You must not destroy a memory region as long as it may be in use by a +device or CPU. In order to do this, as a general rule do not create or +destroy memory regions dynamically during a device's lifetime, and only +call object_unparent() in the memory region owner's instance_finalize +callback. The dynamically allocated data structure that contains the +memory region then should obviously be freed in the instance_finalize +callback as well. + +If you break this rule, the following situation can happen: + +- the memory region's owner had a reference taken via memory_region_ref + (for example by address_space_map) + +- the region is unparented, and has no owner anymore + +- when address_space_unmap is called, the reference to the memory region's + owner is leaked. + + +There is an exception to the above rule: it is okay to call +object_unparent at any time for an alias or a container region. 
It is +therefore also okay to create or destroy alias and container regions +dynamically during a device's lifetime. + +This exceptional usage is valid because aliases and containers only help +QEMU building the guest's memory map; they are never accessed directly. +memory_region_ref and memory_region_unref are never called on aliases +or containers, and the above situation then cannot happen. Exploiting +this exception is rarely necessary, and therefore it is discouraged, +but nevertheless it is used in a few places. + +For regions that "have no owner" (NULL is passed at creation time), the +machine object is actually used as the owner. Since instance_finalize is +never called for the machine object, you must never call object_unparent +on regions that have no owner, unless they are aliases or containers. + + +Overlapping regions and priority +-------------------------------- +Usually, regions may not overlap each other; a memory address decodes into +exactly one target. In some cases it is useful to allow regions to overlap, +and sometimes to control which of an overlapping regions is visible to the +guest. This is done with memory_region_add_subregion_overlap(), which +allows the region to overlap any other region in the same container, and +specifies a priority that allows the core to decide which of two regions at +the same address are visible (highest wins). +Priority values are signed, and the default value is zero. This means that +you can use memory_region_add_subregion_overlap() both to specify a region +that must sit 'above' any others (with a positive priority) and also a +background region that sits 'below' others (with a negative priority). + +If the higher priority region in an overlap is a container or alias, then +the lower priority region will appear in any "holes" that the higher priority +region has left by not mapping subregions to that area of its address range. +(This applies recursively -- if the subregions are themselves containers or +aliases that leave holes then the lower priority region will appear in these +holes too.) + +For example, suppose we have a container A of size 0x8000 with two subregions +B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is +an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two +of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at +offset 0x2000. As a diagram:: + + 0 1000 2000 3000 4000 5000 6000 7000 8000 + |------|------|------|------|------|------|------|------| + A: [ ] + C: [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC] + B: [ ] + D: [DDDDD] + E: [EEEEE] + +The regions that will be seen within this address range then are:: + + [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC] + +Since B has higher priority than C, its subregions appear in the flat map +even where they overlap with C. In ranges where B has not mapped anything +C's region appears. + +If B had provided its own MMIO operations (ie it was not a pure container) +then these would be used for any addresses in its range not handled by +D or E, and the result would be:: + + [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB] + +Priority values are local to a container, because the priorities of two +regions are only compared when they are both children of the same container. +This means that the device in charge of the container (typically modelling +a bus or a memory controller) can use them to manage the interaction of +its child regions without any side effects on other parts of the system. 
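+
+The layout in the example above could be constructed along these lines
+(a sketch only; owner, c_ops and the opaque pointer c_state are assumed to
+be defined elsewhere, and D and E are shown as arbitrary RAM leaves)::
+
+    MemoryRegion a, b, c, d, e;
+
+    memory_region_init(&a, owner, "A", 0x8000);
+    memory_region_init(&b, owner, "B", 0x4000);    /* pure container */
+    memory_region_init_io(&c, owner, &c_ops, c_state, "C", 0x6000);
+    memory_region_init_ram(&d, owner, "D", 0x1000, &error_fatal);
+    memory_region_init_ram(&e, owner, "E", 0x1000, &error_fatal);
+
+    memory_region_add_subregion_overlap(&a, 0x2000, &b, 2);   /* priority 2 */
+    memory_region_add_subregion_overlap(&a, 0x0000, &c, 1);   /* priority 1 */
+    memory_region_add_subregion(&b, 0x0000, &d);
+    memory_region_add_subregion(&b, 0x2000, &e);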
+In the example above, the priorities of D and E are unimportant because +they do not overlap each other. It is the relative priority of B and C +that causes D and E to appear on top of C: D and E's priorities are never +compared against the priority of C. + +Visibility +---------- +The memory core uses the following rules to select a memory region when the +guest accesses an address: + +- all direct subregions of the root region are matched against the address, in + descending priority order + + - if the address lies outside the region offset/size, the subregion is + discarded + - if the subregion is a leaf (RAM or MMIO), the search terminates, returning + this leaf region + - if the subregion is a container, the same algorithm is used within the + subregion (after the address is adjusted by the subregion offset) + - if the subregion is an alias, the search is continued at the alias target + (after the address is adjusted by the subregion offset and alias offset) + - if a recursive search within a container or alias subregion does not + find a match (because of a "hole" in the container's coverage of its + address range), then if this is a container with its own MMIO or RAM + backing the search terminates, returning the container itself. Otherwise + we continue with the next subregion in priority order + +- if none of the subregions match the address then the search terminates + with no match found + +Example memory map +------------------ + +:: + + system_memory: container@0-2^48-1 + | + +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff) + | + +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff) + | + +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff) + | (prio 1) + | + +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff) + + pci (0-2^32-1) + | + +--- vga-area: container@0xa0000-0xbffff + | | + | +--- alias@0x00000-0x7fff ---> #vram (0x010000-0x017fff) + | | + | +--- alias@0x08000-0xffff ---> #vram (0x020000-0x027fff) + | + +---- vram: ram@0xe1000000-0xe1ffffff + | + +---- vga-mmio: mmio@0xe2000000-0xe200ffff + + ram: ram@0x00000000-0xffffffff + +This is a (simplified) PC memory map. The 4GB RAM block is mapped into the +system address space via two aliases: "lomem" is a 1:1 mapping of the first +3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the +so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with +4GB of memory. + +The memory controller diverts addresses in the range 640K-768K to the PCI +address space. This is modelled using the "vga-window" alias, mapped at a +higher priority so it obscures the RAM at the same addresses. The vga window +can be removed by programming the memory controller; this is modelled by +removing the alias and exposing the RAM underneath. + +The pci address space is not a direct child of the system address space, since +we only want parts of it to be visible (we accomplish this using aliases). +It has two subregions: vga-area models the legacy vga window and is occupied +by two 32K memory banks pointing at two sections of the framebuffer. +In addition the vram is mapped as a BAR at address e1000000, and an additional +BAR containing MMIO registers is mapped after it. + +Note that if the guest maps a BAR outside the PCI hole, it would not be +visible as the pci-hole alias clips it to a 0.5GB range. + +MMIO Operations +--------------- + +MMIO regions are provided with ->read() and ->write() callbacks, +which are sufficient for most devices. 
Some devices change behaviour +based on the attributes used for the memory transaction, or need +to be able to respond that the access should provoke a bus error +rather than completing successfully; those devices can use the +->read_with_attrs() and ->write_with_attrs() callbacks instead. + +In addition various constraints can be supplied to control how these +callbacks are called: + +- .valid.min_access_size, .valid.max_access_size define the access sizes + (in bytes) which the device accepts; accesses outside this range will + have device and bus specific behaviour (ignored, or machine check) +- .valid.unaligned specifies that the *device being modelled* supports + unaligned accesses; if false, unaligned accesses will invoke the + appropriate bus or CPU specific behaviour. +- .impl.min_access_size, .impl.max_access_size define the access sizes + (in bytes) supported by the *implementation*; other access sizes will be + emulated using the ones available. For example a 4-byte write will be + emulated using four 1-byte writes, if .impl.max_access_size = 1. +- .impl.unaligned specifies that the *implementation* supports unaligned + accesses; if false, unaligned accesses will be emulated by two aligned + accesses. + +API Reference +------------- + +.. kernel-doc:: include/exec/memory.h diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst new file mode 100644 index 000000000..240125348 --- /dev/null +++ b/docs/devel/migration.rst @@ -0,0 +1,883 @@ +========= +Migration +========= + +QEMU has code to load/save the state of the guest that it is running. +These are two complementary operations. Saving the state just does +that, saves the state for each device that the guest is running. +Restoring a guest is just the opposite operation: we need to load the +state of each device. + +For this to work, QEMU has to be launched with the same arguments the +two times. I.e. it can only restore the state in one guest that has +the same devices that the one it was saved (this last requirement can +be relaxed a bit, but for now we can consider that configuration has +to be exactly the same). + +Once that we are able to save/restore a guest, a new functionality is +requested: migration. This means that QEMU is able to start in one +machine and being "migrated" to another machine. I.e. being moved to +another machine. + +Next was the "live migration" functionality. This is important +because some guests run with a lot of state (specially RAM), and it +can take a while to move all state from one machine to another. Live +migration allows the guest to continue running while the state is +transferred. Only while the last part of the state is transferred has +the guest to be stopped. Typically the time that the guest is +unresponsive during live migration is the low hundred of milliseconds +(notice that this depends on a lot of things). + +Transports +========== + +The migration stream is normally just a byte stream that can be passed +over any transport. + +- tcp migration: do the migration using tcp sockets +- unix migration: do the migration using unix sockets +- exec migration: do the migration using the stdin/stdout through a process. +- fd migration: do the migration using a file descriptor that is + passed to QEMU. QEMU doesn't care how this file descriptor is opened. + +In addition, support is included for migration using RDMA, which +transports the page data using ``RDMA``, where the hardware takes care of +transporting the pages, and the load on the CPU is much lower. 
While the +internals of RDMA migration are a bit different, this isn't really visible +outside the RAM migration code. + +All these migration protocols use the same infrastructure to +save/restore state devices. This infrastructure is shared with the +savevm/loadvm functionality. + +Debugging +========= + +The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``. + +Example usage: + +.. code-block:: shell + + $ qemu-system-x86_64 -display none -monitor stdio + (qemu) migrate "exec:cat > mig" + (qemu) q + $ ./scripts/analyze-migration.py -f mig + { + "ram (3)": { + "section sizes": { + "pc.ram": "0x0000000008000000", + ... + +See also ``analyze-migration.py -h`` help for more options. + +Common infrastructure +===================== + +The files, sockets or fd's that carry the migration stream are abstracted by +the ``QEMUFile`` type (see ``migration/qemu-file.h``). In most cases this +is connected to a subtype of ``QIOChannel`` (see ``io/``). + + +Saving the state of one device +============================== + +For most devices, the state is saved in a single call to the migration +infrastructure; these are *non-iterative* devices. The data for these +devices is sent at the end of precopy migration, when the CPUs are paused. +There are also *iterative* devices, which contain a very large amount of +data (e.g. RAM or large tables). See the iterative device section below. + +General advice for device developers +------------------------------------ + +- The migration state saved should reflect the device being modelled rather + than the way your implementation works. That way if you change the implementation + later the migration stream will stay compatible. That model may include + internal state that's not directly visible in a register. + +- When saving a migration stream the device code may walk and check + the state of the device. These checks might fail in various ways (e.g. + discovering internal state is corrupt or that the guest has done something bad). + Consider carefully before asserting/aborting at this point, since the + normal response from users is that *migration broke their VM* since it had + apparently been running fine until then. In these error cases, the device + should log a message indicating the cause of error, and should consider + putting the device into an error state, allowing the rest of the VM to + continue execution. + +- The migration might happen at an inconvenient point, + e.g. right in the middle of the guest reprogramming the device, during + guest reboot or shutdown or while the device is waiting for external IO. + It's strongly preferred that migrations do not fail in this situation, + since in the cloud environment migrations might happen automatically to + VMs that the administrator doesn't directly control. + +- If you do need to fail a migration, ensure that sufficient information + is logged to identify what went wrong. + +- The destination should treat an incoming migration stream as hostile + (which we do to varying degrees in the existing code). Check that offsets + into buffers and the like can't cause overruns. Fail the incoming migration + in the case of a corrupted stream like this. + +- Take care with internal device state or behaviour that might become + migration version dependent. For example, the order of PCI capabilities + is required to stay constant across migration. 
Another example would + be that a special case handled by subsections (see below) might become + much more common if a default behaviour is changed. + +- The state of the source should not be changed or destroyed by the + outgoing migration. Migrations timing out or being failed by + higher levels of management, or failures of the destination host are + not unusual, and in that case the VM is restarted on the source. + Note that the management layer can validly revert the migration + even though the QEMU level of migration has succeeded as long as it + does it before starting execution on the destination. + +- Buses and devices should be able to explicitly specify addresses when + instantiated, and management tools should use those. For example, + when hot adding USB devices it's important to specify the ports + and addresses, since implicit ordering based on the command line order + may be different on the destination. This can result in the + device state being loaded into the wrong device. + +VMState +------- + +Most device data can be described using the ``VMSTATE`` macros (mostly defined +in ``include/migration/vmstate.h``). + +An example (from hw/input/pckbd.c) + +.. code:: c + + static const VMStateDescription vmstate_kbd = { + .name = "pckbd", + .version_id = 3, + .minimum_version_id = 3, + .fields = (VMStateField[]) { + VMSTATE_UINT8(write_cmd, KBDState), + VMSTATE_UINT8(status, KBDState), + VMSTATE_UINT8(mode, KBDState), + VMSTATE_UINT8(pending, KBDState), + VMSTATE_END_OF_LIST() + } + }; + +We are declaring the state with name "pckbd". +The ``version_id`` is 3, and the fields are 4 uint8_t in a KBDState structure. +We registered this with: + +.. code:: c + + vmstate_register(NULL, 0, &vmstate_kbd, s); + +For devices that are ``qdev`` based, we can register the device in the class +init function: + +.. code:: c + + dc->vmsd = &vmstate_kbd_isa; + +The VMState macros take care of ensuring that the device data section +is formatted portably (normally big endian) and make some compile time checks +against the types of the fields in the structures. + +VMState macros can include other VMStateDescriptions to store substructures +(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length +arrays (``VMSTATE_VARRAY_``). Various other macros exist for special +cases. + +Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32 +ends up with a 4 byte bigendian representation on the wire; in the future +it might be possible to use a more structured format. + +Legacy way +---------- + +This way is going to disappear as soon as all current users are ported to VMSTATE; +although converting existing code can be tricky, and thus 'soon' is relative. + +Each device has to register two functions, one to save the state and +another to load the state back. + +.. code:: c + + int register_savevm_live(const char *idstr, + int instance_id, + int version_id, + SaveVMHandlers *ops, + void *opaque); + +Two functions in the ``ops`` structure are the ``save_state`` +and ``load_state`` functions. Notice that ``load_state`` receives a version_id +parameter to know what state format is receiving. ``save_state`` doesn't +have a version_id parameter because it always uses the latest version. + +Note that because the VMState macros still save the data in a raw +format, in many cases it's possible to replace legacy code +with a carefully constructed VMState description that matches the +byte layout of the existing code. 
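+
+For instance, a legacy handler that streams a fixed array of registers can
+often be replaced by a ``VMSTATE_UINT32_ARRAY`` field producing the same
+bytes on the wire. The sketch below assumes a hypothetical ``MyDevState``
+with a ``uint32_t regs[8]`` member; it is an illustration, not existing code.
+
+.. code:: c
+
+    /* Legacy handler: writes eight 32-bit registers, big endian, in order */
+    static void mydev_save_state(QEMUFile *f, void *opaque)
+    {
+        MyDevState *s = opaque;
+        int i;
+
+        for (i = 0; i < 8; i++) {
+            qemu_put_be32(f, s->regs[i]);
+        }
+    }
+
+    /* A VMState description with the same byte layout on the wire */
+    static const VMStateDescription vmstate_mydev = {
+        .name = "mydev",
+        .version_id = 1,
+        .minimum_version_id = 1,
+        .fields = (VMStateField[]) {
+            VMSTATE_UINT32_ARRAY(regs, MyDevState, 8),
+            VMSTATE_END_OF_LIST()
+        }
+    };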
+ +Changing migration data structures +---------------------------------- + +When we migrate a device, we save/load the state as a series +of fields. Sometimes, due to bugs or new functionality, we need to +change the state to store more/different information. Changing the migration +state saved for a device can break migration compatibility unless +care is taken to use the appropriate techniques. In general QEMU tries +to maintain forward migration compatibility (i.e. migrating from +QEMU n->n+1) and there are users who benefit from backward compatibility +as well. + +Subsections +----------- + +The most common structure change is adding new data, e.g. when adding +a newer form of device, or adding that state that you previously +forgot to migrate. This is best solved using a subsection. + +A subsection is "like" a device vmstate, but with a particularity, it +has a Boolean function that tells if that values are needed to be sent +or not. If this functions returns false, the subsection is not sent. +Subsections have a unique name, that is looked for on the receiving +side. + +On the receiving side, if we found a subsection for a device that we +don't understand, we just fail the migration. If we understand all +the subsections, then we load the state with success. There's no check +that a subsection is loaded, so a newer QEMU that knows about a subsection +can (with care) load a stream from an older QEMU that didn't send +the subsection. + +If the new data is only needed in a rare case, then the subsection +can be made conditional on that case and the migration will still +succeed to older QEMUs in most cases. This is OK for data that's +critical, but in some use cases it's preferred that the migration +should succeed even with the data missing. To support this the +subsection can be connected to a device property and from there +to a versioned machine type. + +The 'pre_load' and 'post_load' functions on subsections are only +called if the subsection is loaded. + +One important note is that the outer post_load() function is called "after" +loading all subsections, because a newer subsection could change the same +value that it uses. A flag, and the combination of outer pre_load and +post_load can be used to detect whether a subsection was loaded, and to +fall back on default behaviour when the subsection isn't present. + +Example: + +.. code:: c + + static bool ide_drive_pio_state_needed(void *opaque) + { + IDEState *s = opaque; + + return ((s->status & DRQ_STAT) != 0) + || (s->bus->error_status & BM_STATUS_PIO_RETRY); + } + + const VMStateDescription vmstate_ide_drive_pio_state = { + .name = "ide_drive/pio_state", + .version_id = 1, + .minimum_version_id = 1, + .pre_save = ide_drive_pio_pre_save, + .post_load = ide_drive_pio_post_load, + .needed = ide_drive_pio_state_needed, + .fields = (VMStateField[]) { + VMSTATE_INT32(req_nb_sectors, IDEState), + VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1, + vmstate_info_uint8, uint8_t), + VMSTATE_INT32(cur_io_buffer_offset, IDEState), + VMSTATE_INT32(cur_io_buffer_len, IDEState), + VMSTATE_UINT8(end_transfer_fn_idx, IDEState), + VMSTATE_INT32(elementary_transfer_size, IDEState), + VMSTATE_INT32(packet_transfer_size, IDEState), + VMSTATE_END_OF_LIST() + } + }; + + const VMStateDescription vmstate_ide_drive = { + .name = "ide_drive", + .version_id = 3, + .minimum_version_id = 0, + .post_load = ide_drive_post_load, + .fields = (VMStateField[]) { + .... several fields .... 
+ VMSTATE_END_OF_LIST() + }, + .subsections = (const VMStateDescription*[]) { + &vmstate_ide_drive_pio_state, + NULL + } + }; + +Here we have a subsection for the pio state. We only need to +save/send this state when we are in the middle of a pio operation +(that is what ``ide_drive_pio_state_needed()`` checks). If DRQ_STAT is +not enabled, the values on that fields are garbage and don't need to +be sent. + +Connecting subsections to properties +------------------------------------ + +Using a condition function that checks a 'property' to determine whether +to send a subsection allows backward migration compatibility when +new subsections are added, especially when combined with versioned +machine types. + +For example: + + a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and + default it to true. + b) Add an entry to the ``hw_compat_`` for the previous version that sets + the property to false. + c) Add a static bool support_foo function that tests the property. + d) Add a subsection with a .needed set to the support_foo function + e) (potentially) Add an outer pre_load that sets up a default value + for 'foo' to be used if the subsection isn't loaded. + +Now that subsection will not be generated when using an older +machine type and the migration stream will be accepted by older +QEMU versions. + +Not sending existing elements +----------------------------- + +Sometimes members of the VMState are no longer needed: + + - removing them will break migration compatibility + + - making them version dependent and bumping the version will break backward migration + compatibility. + +Adding a dummy field into the migration stream is normally the best way to preserve +compatibility. + +If the field really does need to be removed then: + + a) Add a new property/compatibility/function in the same way for subsections above. + b) replace the VMSTATE macro with the _TEST version of the macro, e.g.: + + ``VMSTATE_UINT32(foo, barstruct)`` + + becomes + + ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)`` + + Sometime in the future when we no longer care about the ancient versions these can be killed off. + Note that for backward compatibility it's important to fill in the structure with + data that the destination will understand. + +Any difference in the predicates on the source and destination will end up +with different fields being enabled and data being loaded into the wrong +fields; for this reason conditional fields like this are very fragile. + +Versions +-------- + +Version numbers are intended for major incompatible changes to the +migration of a device, and using them breaks backward-migration +compatibility; in general most changes can be made by adding Subsections +(see above) or _TEST macros (see above) which won't break compatibility. + +Each version is associated with a series of fields saved. The ``save_state`` always saves +the state as the newer version. But ``load_state`` sometimes is able to +load state from an older version. + +You can see that there are several version fields: + +- ``version_id``: the maximum version_id supported by VMState for that device. +- ``minimum_version_id``: the minimum version_id that VMState is able to understand + for that device. +- ``minimum_version_id_old``: For devices that were not able to port to vmstate, we can + assign a function that knows how to read this old state. This field is + ignored if there is no ``load_state_old`` handler. + +VMState is able to read versions from minimum_version_id to +version_id. 
And the function ``load_state_old()`` (if present) is able to +load state from minimum_version_id_old to minimum_version_id. This +function is deprecated and will be removed when no more users are left. + +There are *_V* forms of many ``VMSTATE_`` macros to load fields for version dependent fields, +e.g. + +.. code:: c + + VMSTATE_UINT16_V(ip_id, Slirp, 2), + +only loads that field for versions 2 and newer. + +Saving state will always create a section with the 'version_id' value +and thus can't be loaded by any older QEMU. + +Massaging functions +------------------- + +Sometimes, it is not enough to be able to save the state directly +from one structure, we need to fill the correct values there. One +example is when we are using kvm. Before saving the cpu state, we +need to ask kvm to copy to QEMU the state that it is using. And the +opposite when we are loading the state, we need a way to tell kvm to +load the state for the cpu that we have just loaded from the QEMUFile. + +The functions to do that are inside a vmstate definition, and are called: + +- ``int (*pre_load)(void *opaque);`` + + This function is called before we load the state of one device. + +- ``int (*post_load)(void *opaque, int version_id);`` + + This function is called after we load the state of one device. + +- ``int (*pre_save)(void *opaque);`` + + This function is called before we save the state of one device. + +- ``int (*post_save)(void *opaque);`` + + This function is called after we save the state of one device + (even upon failure, unless the call to pre_save returned an error). + +Example: You can look at hpet.c, that uses the first three functions +to massage the state that is transferred. + +The ``VMSTATE_WITH_TMP`` macro may be useful when the migration +data doesn't match the stored device data well; it allows an +intermediate temporary structure to be populated with migration +data and then transferred to the main structure. + +If you use memory API functions that update memory layout outside +initialization (i.e., in response to a guest action), this is a strong +indication that you need to call these functions in a ``post_load`` callback. +Examples of such memory API functions are: + + - memory_region_add_subregion() + - memory_region_del_subregion() + - memory_region_set_readonly() + - memory_region_set_nonvolatile() + - memory_region_set_enabled() + - memory_region_set_address() + - memory_region_set_alias_offset() + +Iterative device migration +-------------------------- + +Some devices, such as RAM, Block storage or certain platform devices, +have large amounts of data that would mean that the CPUs would be +paused for too long if they were sent in one section. For these +devices an *iterative* approach is taken. + +The iterative devices generally don't use VMState macros +(although it may be possible in some cases) and instead use +qemu_put_*/qemu_get_* macros to read/write data to the stream. Specialist +versions exist for high bandwidth IO. + + +An iterative device must provide: + + - A ``save_setup`` function that initialises the data structures and + transmits a first section containing information on the device. In the + case of RAM this transmits a list of RAMBlocks and sizes. + + - A ``load_setup`` function that initialises the data structures on the + destination. + + - A ``save_live_pending`` function that is called repeatedly and must + indicate how much more data the iterative data must save. 
The core + migration code will use this to determine when to pause the CPUs + and complete the migration. + + - A ``save_live_iterate`` function (called after ``save_live_pending`` + when there is significant data still to be sent). It should send + a chunk of data until the point that stream bandwidth limits tell it + to stop. Each call generates one section. + + - A ``save_live_complete_precopy`` function that must transmit the + last section for the device containing any remaining data. + + - A ``load_state`` function used to load sections generated by + any of the save functions that generate sections. + + - ``cleanup`` functions for both save and load that are called + at the end of migration. + +Note that the contents of the sections for iterative migration tend +to be open-coded by the devices; care should be taken in parsing +the results and structuring the stream to make them easy to validate. + +Device ordering +--------------- + +There are cases in which the ordering of device loading matters; for +example in some systems where a device may assert an interrupt during loading, +if the interrupt controller is loaded later then it might lose the state. + +Some ordering is implicitly provided by the order in which the machine +definition creates devices, however this is somewhat fragile. + +The ``MigrationPriority`` enum provides a means of explicitly enforcing +ordering. Numerically higher priorities are loaded earlier. +The priority is set by setting the ``priority`` field of the top level +``VMStateDescription`` for the device. + +Stream structure +================ + +The stream tries to be word and endian agnostic, allowing migration between hosts +of different characteristics running the same VM. + + - Header + + - Magic + - Version + - VM configuration section + + - Machine type + - Target page bits + - List of sections + Each section contains a device, or one iteration of a device save. + + - section type + - section id + - ID string (First section of each device) + - instance id (First section of each device) + - version id (First section of each device) + - <device data> + - Footer mark + - EOF mark + - VM Description structure + Consisting of a JSON description of the contents for analysis only + +The ``device data`` in each section consists of the data produced +by the code described above. For non-iterative devices they have a single +section; iterative devices have an initial and last section and a set +of parts in between. +Note that there is very little checking by the common code of the integrity +of the ``device data`` contents, that's up to the devices themselves. +The ``footer mark`` provides a little bit of protection for the case where +the receiving side reads more or less data than expected. + +The ``ID string`` is normally unique, having been formed from a bus name +and device address, PCI devices and storage devices hung off PCI controllers +fit this pattern well. Some devices are fixed single instances (e.g. "pc-ram"). +Others (especially either older devices or system devices which for +some reason don't have a bus concept) make use of the ``instance id`` +for otherwise identically named devices. + +Return path +----------- + +Only a unidirectional stream is required for normal migration, however a +``return path`` can be created when bidirectional communication is desired. +This is primarily used by postcopy, but is also used to return a success +flag to the source at the end of migration. 
+ +``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return +path. + + Source side + + Forward path - written by migration thread + Return path - opened by main thread, read by return-path thread + + Destination side + + Forward path - read by main thread + Return path - opened by main thread, written by main thread AND postcopy + thread (protected by rp_mutex) + +Postcopy +======== + +'Postcopy' migration is a way to deal with migrations that refuse to converge +(or take too long to converge) its plus side is that there is an upper bound on +the amount of migration traffic and time it takes, the down side is that during +the postcopy phase, a failure of *either* side or the network connection causes +the guest to be lost. + +In postcopy the destination CPUs are started before all the memory has been +transferred, and accesses to pages that are yet to be transferred cause +a fault that's translated by QEMU into a request to the source QEMU. + +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy +doesn't finish in a given time the switch is made to postcopy. + +Enabling postcopy +----------------- + +To enable postcopy, issue this command on the monitor (both source and +destination) prior to the start of migration: + +``migrate_set_capability postcopy-ram on`` + +The normal commands are then used to start a migration, which is still +started in precopy mode. Issuing: + +``migrate_start_postcopy`` + +will now cause the transition from precopy to postcopy. +It can be issued immediately after migration is started or any +time later on. Issuing it after the end of a migration is harmless. + +Blocktime is a postcopy live migration metric, intended to show how +long the vCPU was in state of interruptible sleep due to pagefault. +That metric is calculated both for all vCPUs as overlapped value, and +separately for each vCPU. These values are calculated on destination +side. To enable postcopy blocktime calculation, enter following +command on destination monitor: + +``migrate_set_capability postcopy-blocktime on`` + +Postcopy blocktime can be retrieved by query-migrate qmp command. +postcopy-blocktime value of qmp command will show overlapped blocking +time for all vCPU, postcopy-vcpu-blocktime will show list of blocking +time per vCPU. + +.. note:: + During the postcopy phase, the bandwidth limits set using + ``migrate_set_parameter`` is ignored (to avoid delaying requested pages that + the destination is waiting for). + +Postcopy device transfer +------------------------ + +Loading of device data may cause the device emulation to access guest RAM +that may trigger faults that have to be resolved by the source, as such +the migration stream has to be able to respond with page data *during* the +device load, and hence the device data has to be read from the stream completely +before the device load begins to free the stream up. This is achieved by +'packaging' the device data into a blob that's read in one go. + +Source behaviour +---------------- + +Until postcopy is entered the migration stream is identical to normal +precopy, except for the addition of a 'postcopy advise' command at +the beginning, to tell the destination that postcopy might happen. +When postcopy starts the source sends the page discard data and then +forms the 'package' containing: + + - Command: 'postcopy listen' + - The device state + + A series of sections, identical to the precopy streams device state stream + containing everything except postcopiable devices (i.e. 
RAM) + - Command: 'postcopy run' + +The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the +contents are formatted in the same way as the main migration stream. + +During postcopy the source scans the list of dirty pages and sends them +to the destination without being requested (in much the same way as precopy), +however when a page request is received from the destination, the dirty page +scanning restarts from the requested location. This causes requested pages +to be sent quickly, and also causes pages directly after the requested page +to be sent quickly in the hope that those pages are likely to be used +by the destination soon. + +Destination behaviour +--------------------- + +Initially the destination looks the same as precopy, with a single thread +reading the migration stream; the 'postcopy advise' and 'discard' commands +are processed to change the way RAM is managed, but don't affect the stream +processing. + +:: + + ------------------------------------------------------------------------------ + 1 2 3 4 5 6 7 + main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN ) + thread | | + | (page request) + | \___ + v \ + listen thread: --- page -- page -- page -- page -- page -- + + a b c + ------------------------------------------------------------------------------ + +- On receipt of ``CMD_PACKAGED`` (1) + + All the data associated with the package - the ( ... ) section in the diagram - + is read into memory, and the main thread recurses into qemu_loadvm_state_main + to process the contents of the package (2) which contains commands (3,6) and + devices (4...) + +- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package) + + a new thread (a) is started that takes over servicing the migration stream, + while the main thread carries on loading the package. It loads normal + background page data (b) but if during a device load a fault happens (5) + the returned page (c) is loaded by the listen thread allowing the main + threads device load to carry on. + +- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6) + + letting the destination CPUs start running. At the end of the + ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and + is no longer used by migration, while the listen thread carries on servicing + page data until the end of migration. + +Postcopy states +--------------- + +Postcopy moves through a series of states (see postcopy_state) from +ADVISE->DISCARD->LISTEN->RUNNING->END + + - Advise + + Set at the start of migration if postcopy is enabled, even + if it hasn't had the start command; here the destination + checks that its OS has the support needed for postcopy, and performs + setup to ensure the RAM mappings are suitable for later postcopy. + The destination will fail early in migration at this point if the + required OS support is not present. + (Triggered by reception of POSTCOPY_ADVISE command) + + - Discard + + Entered on receipt of the first 'discard' command; prior to + the first Discard being performed, hugepages are switched off + (using madvise) to ensure that no new huge pages are created + during the postcopy phase, and to cause any huge pages that + have discards on them to be broken. + + - Listen + + The first command in the package, POSTCOPY_LISTEN, switches + the destination state to Listen, and starts a new thread + (the 'listen thread') which takes over the job of receiving + pages off the migration stream, while the main thread carries + on processing the blob. 
With this thread able to process page + reception, the destination now 'sensitises' the RAM to detect + any access to missing pages (on Linux using the 'userfault' + system). + + - Running + + POSTCOPY_RUN causes the destination to synchronise all + state and start the CPUs and IO devices running. The main + thread now finishes processing the migration package and + now carries on as it would for normal precopy migration + (although it can't do the cleanup it would do as it + finishes a normal migration). + + - End + + The listen thread can now quit, and perform the cleanup of migration + state, the migration is now complete. + +Source side page maps +--------------------- + +The source side keeps two bitmaps during postcopy; 'the migration bitmap' +and 'unsent map'. The 'migration bitmap' is basically the same as in +the precopy case, and holds a bit to indicate that page is 'dirty' - +i.e. needs sending. During the precopy phase this is updated as the CPU +dirties pages, however during postcopy the CPUs are stopped and nothing +should dirty anything any more. + +The 'unsent map' is used for the transition to postcopy. It is a bitmap that +has a bit cleared whenever a page is sent to the destination, however during +the transition to postcopy mode it is combined with the migration bitmap +to form a set of pages that: + + a) Have been sent but then redirtied (which must be discarded) + b) Have not yet been sent - which also must be discarded to cause any + transparent huge pages built during precopy to be broken. + +Note that the contents of the unsentmap are sacrificed during the calculation +of the discard set and thus aren't valid once in postcopy. The dirtymap +is still valid and is used to ensure that no page is sent more than once. Any +request for a page that has already been sent is ignored. Duplicate requests +such as this can happen as a page is sent at about the same time the +destination accesses it. + +Postcopy with hugepages +----------------------- + +Postcopy now works with hugetlbfs backed memory: + + a) The linux kernel on the destination must support userfault on hugepages. + b) The huge-page configuration on the source and destination VMs must be + identical; i.e. RAMBlocks on both sides must use the same page size. + c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal + RAM if it doesn't have enough hugepages, triggering (b) to fail. + Using ``-mem-prealloc`` enforces the allocation using hugepages. + d) Care should be taken with the size of hugepage used; postcopy with 2MB + hugepages works well, however 1GB hugepages are likely to be problematic + since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link, + and until the full page is transferred the destination thread is blocked. + +Postcopy with shared memory +--------------------------- + +Postcopy migration with shared memory needs explicit support from the other +processes that share memory and from QEMU. There are restrictions on the type of +memory that userfault can support shared. + +The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs`` +(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)`` +for hugetlbfs which may be a problem in some configurations). + +The vhost-user code in QEMU supports clients that have Postcopy support, +and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes +to support postcopy. 
+ +The client needs to open a userfaultfd and register the areas +of memory that it maps with userfault. The client must then pass the +userfaultfd back to QEMU together with a mapping table that allows +fault addresses in the clients address space to be converted back to +RAMBlock/offsets. The client's userfaultfd is added to the postcopy +fault-thread and page requests are made on behalf of the client by QEMU. +QEMU performs 'wake' operations on the client's userfaultfd to allow it +to continue after a page has arrived. + +.. note:: + There are two future improvements that would be nice: + a) Some way to make QEMU ignorant of the addresses in the clients + address space + b) Avoiding the need for QEMU to perform ufd-wake calls after the + pages have arrived + +Retro-fitting postcopy to existing clients is possible: + a) A mechanism is needed for the registration with userfault as above, + and the registration needs to be coordinated with the phases of + postcopy. In vhost-user extra messages are added to the existing + control channel. + b) Any thread that can block due to guest memory accesses must be + identified and the implication understood; for example if the + guest memory access is made while holding a lock then all other + threads waiting for that lock will also be blocked. + +Firmware +======== + +Migration migrates the copies of RAM and ROM, and thus when running +on the destination it includes the firmware from the source. Even after +resetting a VM, the old firmware is used. Only once QEMU has been restarted +is the new firmware in use. + +- Changes in firmware size can cause changes in the required RAMBlock size + to hold the firmware and thus migration can fail. In practice it's best + to pad firmware images to convenient powers of 2 with plenty of space + for growth. + +- Care should be taken with device emulation code so that newer + emulation code can work with older firmware to allow forward migration. + +- Care should be taken with newer firmware so that backward migration + to older systems with older device emulation code will work. + +In some cases it may be best to tie specific firmware versions to specific +versioned machine types to cut down on the combinations that will need +support. This is also useful when newer versions of firmware outgrow +the padding. + diff --git a/docs/devel/modules.rst b/docs/devel/modules.rst new file mode 100644 index 000000000..8e999c4fa --- /dev/null +++ b/docs/devel/modules.rst @@ -0,0 +1,5 @@ +============ +QEMU modules +============ + +.. kernel-doc:: include/qemu/module.h diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst new file mode 100644 index 000000000..e4801751f --- /dev/null +++ b/docs/devel/multi-process.rst @@ -0,0 +1,968 @@ +Multi-process QEMU +=================== + +.. note:: + + This is the design document for multi-process QEMU. It does not + necessarily reflect the status of the current implementation, which + may lack features or be considerably different from what is described + in this document. This document is still useful as a description of + the goals and general direction of this feature. + + Please refer to the following wiki for latest details: + https://wiki.qemu.org/Features/MultiProcessQEMU + +QEMU is often used as the hypervisor for virtual machines running in the +Oracle cloud. 
Since one of the advantages of cloud computing is the
+ability to run many VMs from different tenants in the same cloud
+infrastructure, a guest that compromised its hypervisor could
+potentially use the hypervisor's access privileges to access data it is
+not authorized for.
+
+QEMU can be susceptible to security attacks because it is a large,
+monolithic program that provides many features to the VMs it services.
+Many of these features can be configured out of QEMU, but even a reduced
+configuration of QEMU has a large amount of code a guest can potentially
+attack. Separating QEMU into multiple processes reduces the attack
+surface by helping to limit each component in the system to accessing
+only the resources that it needs to perform its job.
+
+QEMU services
+-------------
+
+QEMU can be broadly described as providing three main services. One is a
+VM control point, where VMs can be created, migrated, re-configured, and
+destroyed. A second is to emulate the CPU instructions within the VM,
+often accelerated by HW virtualization features such as Intel's VT
+extensions. Finally, it provides IO services to the VM by emulating HW
+IO devices, such as disk and network devices.
+
+A multi-process QEMU
+~~~~~~~~~~~~~~~~~~~~
+
+A multi-process QEMU involves separating QEMU services into separate
+host processes. Each of these processes can be given only the privileges
+it needs to provide its service, e.g., a disk service could be given
+access only to the disk images it provides, and not be allowed to
+access other files, or any network devices. An attacker who compromised
+this service would not be able to use this exploit to access files or
+devices beyond what the disk service was given access to.
+
+A QEMU control process would remain, but in multi-process mode, it would
+have no direct interfaces to the VM. During VM execution, it would still
+provide the user interface to hot-plug devices or live migrate the VM.
+
+A first step in creating a multi-process QEMU is to separate IO services
+from the main QEMU program, which would continue to provide CPU
+emulation; i.e., the control process would also be the CPU emulation
+process. In a later phase, CPU emulation could be separated from the
+control process.
+
+Separating IO services
+----------------------
+
+Separating IO services into individual host processes is a good place to
+begin for a couple of reasons. One is that the sheer number of IO devices
+QEMU can emulate provides a large surface of interfaces which could
+potentially be exploited, and, indeed, has been a source of exploits in
+the past. Another is that the modular nature of QEMU device emulation
+code provides interface points where the QEMU functions that perform
+device emulation can be separated from the QEMU functions that manage the
+emulation of guest CPU instructions. The devices emulated in the separate
+process are referred to as remote devices.
+
+QEMU device emulation
+~~~~~~~~~~~~~~~~~~~~~
+
+QEMU uses an object oriented SW architecture for device emulation code.
+Configured objects are all compiled into the QEMU binary, then objects
+are instantiated by name when used by the guest VM. For example, the
+code to emulate a device named "foo" is always present in QEMU, but its
+instantiation code is only run when the device is included in the target
+VM. 
(e.g., via the QEMU command line as *-device foo*) + +The object model is hierarchical, so device emulation code names its +parent object (such as "pci-device" for a PCI device) and QEMU will +instantiate a parent object before calling the device's instantiation +code. + +Current separation models +~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to separate the device emulation code from the CPU emulation +code, the device object code must run in a different process. There are +a couple of existing QEMU features that can run emulation code +separately from the main QEMU process. These are examined below. + +vhost user model +^^^^^^^^^^^^^^^^ + +Virtio guest device drivers can be connected to vhost user applications +in order to perform their IO operations. This model uses special virtio +device drivers in the guest and vhost user device objects in QEMU, but +once the QEMU vhost user code has configured the vhost user application, +mission-mode IO is performed by the application. The vhost user +application is a daemon process that can be contacted via a known UNIX +domain socket. + +vhost socket +'''''''''''' + +As mentioned above, one of the tasks of the vhost device object within +QEMU is to contact the vhost application and send it configuration +information about this device instance. As part of the configuration +process, the application can also be sent other file descriptors over +the socket, which then can be used by the vhost user application in +various ways, some of which are described below. + +vhost MMIO store acceleration +''''''''''''''''''''''''''''' + +VMs are often run using HW virtualization features via the KVM kernel +driver. This driver allows QEMU to accelerate the emulation of guest CPU +instructions by running the guest in a virtual HW mode. When the guest +executes instructions that cannot be executed by virtual HW mode, +execution returns to the KVM driver so it can inform QEMU to emulate the +instructions in SW. + +One of the events that can cause a return to QEMU is when a guest device +driver accesses an IO location. QEMU then dispatches the memory +operation to the corresponding QEMU device object. In the case of a +vhost user device, the memory operation would need to be sent over a +socket to the vhost application. This path is accelerated by the QEMU +virtio code by setting up an eventfd file descriptor that the vhost +application can directly receive MMIO store notifications from the KVM +driver, instead of needing them to be sent to the QEMU process first. + +vhost interrupt acceleration +'''''''''''''''''''''''''''' + +Another optimization used by the vhost application is the ability to +directly inject interrupts into the VM via the KVM driver, again, +bypassing the need to send the interrupt back to the QEMU process first. +The QEMU virtio setup code configures the KVM driver with an eventfd +that triggers the device interrupt in the guest when the eventfd is +written. This irqfd file descriptor is then passed to the vhost user +application program. + +vhost access to guest memory +'''''''''''''''''''''''''''' + +The vhost application is also allowed to directly access guest memory, +instead of needing to send the data as messages to QEMU. This is also +done with file descriptors sent to the vhost user application by QEMU. +These descriptors can be passed to ``mmap()`` by the vhost application +to map the guest address space into the vhost application. 
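+
+For example, a region described by one such entry might be mapped along
+these lines (a sketch; ``entry_fd``, ``entry_size`` and ``entry_mmap_offset``
+stand in for the values received over the vhost-user socket)::
+
+    void *host = mmap(NULL, entry_size, PROT_READ | PROT_WRITE,
+                      MAP_SHARED, entry_fd, entry_mmap_offset);
+    if (host == MAP_FAILED) {
+        /* mapping failed; the application cannot access guest memory */
+    }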
+
+IOMMUs introduce another level of complexity, since the address given to
+the guest virtio device to DMA to or from is not a guest physical
+address. This case is handled by having vhost code within QEMU register
+as a listener for IOMMU mapping changes. The vhost application maintains
+a cache of IOMMU translations: sending translation requests back to
+QEMU on cache misses, and in turn receiving flush requests from QEMU
+when mappings are purged.
+
+applicability to device separation
+''''''''''''''''''''''''''''''''''
+
+Much of the vhost model can be re-used by separated device emulation. In
+particular, the ideas of using a socket between QEMU and the device
+emulation application, using a file descriptor to inject interrupts into
+the VM via KVM, and allowing the application to ``mmap()`` the guest
+should be reused.
+
+There are, however, some notable differences between how a vhost
+application works and the needs of separated device emulation. The most
+basic is that vhost uses custom virtio device drivers which always
+trigger IO with MMIO stores. A separated device emulation model must
+work with existing IO device models and guest device drivers. MMIO loads
+break vhost store acceleration since they are synchronous - guest
+progress cannot continue until the load has been emulated. By contrast,
+stores are asynchronous, the guest can continue after the store event
+has been sent to the vhost application.
+
+Another difference is that in the vhost user model, a single daemon can
+support multiple QEMU instances. This is contrary to the security regime
+desired, in which the emulation application should only be allowed to
+access the files or devices the VM it's running on behalf of can access.
+
+qemu-io model
+^^^^^^^^^^^^^
+
+``qemu-io`` is a test harness used to test changes to the QEMU block backend
+object code (e.g., the code that implements disk images for disk driver
+emulation). ``qemu-io`` is not a device emulation application per se, but it
+does compile the QEMU block objects into a separate binary from the main
+QEMU one. This could be useful for disk device emulation, since its
+emulation applications will need to include the QEMU block objects.
+
+New separation model based on proxy objects
+-------------------------------------------
+
+A different model based on proxy objects in the QEMU program
+communicating with remote emulation programs could provide separation
+while minimizing the changes needed to the device emulation code. The
+rest of this section is a discussion of how a proxy object model would
+work.
+
+Remote emulation processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The remote emulation process will run the QEMU object hierarchy without
+modification. The device emulation objects will also be based on the
+QEMU code, because for anything but the simplest device, it would not be
+tractable to re-implement both the object model and the many device
+backends that QEMU has.
+
+The processes will communicate with the QEMU process over UNIX domain
+sockets. The processes can be executed either as standalone processes,
+or be executed by QEMU. In both cases, the host backends the emulation
+processes will provide are specified on their command lines, as they
+would be for QEMU. For example:
+
+::
+
+    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
+    -blockdev driver=qcow2,node-name=drive0,file=file0
+
+would indicate process *disk-proc* uses a qcow2 emulated disk named
+*file0* as its backend. 
+ +Emulation processes may emulate more than one guest controller. A common +configuration might be to put all controllers of the same device class +(e.g., disk, network, etc.) in a single process, so that all backends of +the same type can be managed by a single QMP monitor. + +communication with QEMU +^^^^^^^^^^^^^^^^^^^^^^^ + +The first argument to the remote emulation process will be a Unix domain +socket that connects with the Proxy object. This is a required argument. + +:: + + disk-proc <socket number> <backend list> + +remote process QMP monitor +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Remote emulation processes can be monitored via QMP, similar to QEMU +itself. The QMP monitor socket is specified the same as for a QEMU +process: + +:: + + disk-proc -qmp unix:/tmp/disk-mon,server + +can be monitored over the UNIX socket path */tmp/disk-mon*. + +QEMU command line +~~~~~~~~~~~~~~~~~ + +Each remote device emulated in a remote process on the host is +represented as a *-device* of type *pci-proxy-dev*. A socket +sub-option to this option specifies the Unix socket that connects +to the remote process. An *id* sub-option is required, and it should +be the same id as used in the remote process. + +:: + + qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3 + +can be used to add a device emulated in a remote process + + +QEMU management of remote processes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +QEMU is not aware of the type of type of the remote PCI device. It is +a pass through device as far as QEMU is concerned. + +communication with emulation process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +primary channel +''''''''''''''' + +The primary channel (referred to as com in the code) is used to bootstrap +the remote process. It is also used to pass on device-agnostic commands +like reset. + +per-device channels +''''''''''''''''''' + +Each remote device communicates with QEMU using a dedicated communication +channel. The proxy object sets up this channel using the primary +channel during its initialization. + +QEMU device proxy objects +~~~~~~~~~~~~~~~~~~~~~~~~~ + +QEMU has an object model based on sub-classes inherited from the +"object" super-class. The sub-classes that are of interest here are the +"device" and "bus" sub-classes whose child sub-classes make up the +device tree of a QEMU emulated system. + +The proxy object model will use device proxy objects to replace the +device emulation code within the QEMU process. These objects will live +in the same place in the object and bus hierarchies as the objects they +replace. i.e., the proxy object for an LSI SCSI controller will be a +sub-class of the "pci-device" class, and will have the same PCI bus +parent and the same SCSI bus child objects as the LSI controller object +it replaces. + +It is worth noting that the same proxy object is used to mediate with +all types of remote PCI devices. + +object initialization +^^^^^^^^^^^^^^^^^^^^^ + +The Proxy device objects are initialized in the exact same manner in +which any other QEMU device would be initialized. + +In addition, the Proxy objects perform the following two tasks: +- Parses the "socket" sub option and connects to the remote process +using this channel +- Uses the "id" sub-option to connect to the emulated device on the +separate process + +class\_init +''''''''''' + +The ``class_init()`` method of a proxy object will, in general behave +similarly to the object it replaces, including setting any static +properties and methods needed by the proxy. 
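+
+A purely illustrative sketch of what such a ``class_init()`` might look
+like for a PCI proxy device follows; the ``pci_proxy_dev_*`` callbacks are
+hypothetical names standing in for the proxy's forwarding hooks::
+
+    static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
+    {
+        DeviceClass *dc = DEVICE_CLASS(klass);
+        PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+        k->realize = pci_proxy_dev_realize;        /* set up the proxy */
+        k->config_read = pci_proxy_read_config;    /* forward to remote */
+        k->config_write = pci_proxy_write_config;  /* forward to remote */
+        dc->desc = "PCI device proxy";
+    }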
+ +instance\_init / realize +'''''''''''''''''''''''' + +The ``instance_init()`` and ``realize()`` functions would only need to +perform tasks related to being a proxy, such are registering its own +MMIO handlers, or creating a child bus that other proxy devices can be +attached to later. + +Other tasks will be device-specific. For example, PCI device objects +will initialize the PCI config space in order to make a valid PCI device +tree within the QEMU process. + +address space registration +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Most devices are driven by guest device driver accesses to IO addresses +or ports. The QEMU device emulation code uses QEMU's memory region +function calls (such as ``memory_region_init_io()``) to add callback +functions that QEMU will invoke when the guest accesses the device's +areas of the IO address space. When a guest driver does access the +device, the VM will exit HW virtualization mode and return to QEMU, +which will then lookup and execute the corresponding callback function. + +A proxy object would need to mirror the memory region calls the actual +device emulator would perform in its initialization code, but with its +own callbacks. When invoked by QEMU as a result of a guest IO operation, +they will forward the operation to the device emulation process. + +PCI config space +^^^^^^^^^^^^^^^^ + +PCI devices also have a configuration space that can be accessed by the +guest driver. Guest accesses to this space is not handled by the device +emulation object, but by its PCI parent object. Much of this space is +read-only, but certain registers (especially BAR and MSI-related ones) +need to be propagated to the emulation process. + +PCI parent proxy +'''''''''''''''' + +One way to propagate guest PCI config accesses is to create a +"pci-device-proxy" class that can serve as the parent of a PCI device +proxy object. This class's parent would be "pci-device" and it would +override the PCI parent's ``config_read()`` and ``config_write()`` +methods with ones that forward these operations to the emulation +program. + +interrupt receipt +^^^^^^^^^^^^^^^^^ + +A proxy for a device that generates interrupts will need to create a +socket to receive interrupt indications from the emulation process. An +incoming interrupt indication would then be sent up to its bus parent to +be injected into the guest. For example, a PCI device object may use +``pci_set_irq()``. + +live migration +^^^^^^^^^^^^^^ + +The proxy will register to save and restore any *vmstate* it needs over +a live migration event. The device proxy does not need to manage the +remote device's *vmstate*; that will be handled by the remote process +proxy (see below). + +QEMU remote device operation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Generic device operations, such as DMA, will be performed by the remote +process proxy by sending messages to the remote process. + +DMA operations +^^^^^^^^^^^^^^ + +DMA operations would be handled much like vhost applications do. One of +the initial messages sent to the emulation process is a guest memory +table. Each entry in this table consists of a file descriptor and size +that the emulation process can ``mmap()`` to directly access guest +memory, similar to ``vhost_user_set_mem_table()``. Note guest memory +must be backed by file descriptors, such as when QEMU is given the +*-mem-path* command line option. 
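+
+As a minimal sketch of the idea, the emulation process could map one
+entry of such a guest memory table as below. The *GuestMemEntry* layout
+and the function name are illustrative only, not a defined wire format:
+
+::
+
+  #include <stdint.h>
+  #include <sys/mman.h>
+
+  typedef struct GuestMemEntry {
+      int      fd;    /* file descriptor backing this region of guest RAM */
+      uint64_t gpa;   /* guest physical base address of the region */
+      uint64_t size;  /* length of the region in bytes */
+      void    *hva;   /* host virtual address once mapped */
+  } GuestMemEntry;
+
+  static int map_guest_region(GuestMemEntry *e)
+  {
+      e->hva = mmap(NULL, e->size, PROT_READ | PROT_WRITE,
+                    MAP_SHARED, e->fd, 0);
+      return e->hva == MAP_FAILED ? -1 : 0;
+  }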
+ +IOMMU operations +^^^^^^^^^^^^^^^^ + +When the emulated system includes an IOMMU, the remote process proxy in +QEMU will need to create a socket for IOMMU requests from the emulation +process. It will handle those requests with an +``address_space_get_iotlb_entry()`` call. In order to handle IOMMU +unmaps, the remote process proxy will also register as a listener on the +device's DMA address space. When an IOMMU memory region is created +within the DMA address space, an IOMMU notifier for unmaps will be added +to the memory region that will forward unmaps to the emulation process +over the IOMMU socket. + +device hot-plug via QMP +^^^^^^^^^^^^^^^^^^^^^^^ + +An QMP "device\_add" command can add a device emulated by a remote +process. It will also have "rid" option to the command, just as the +*-device* command line option does. The remote process may either be one +started at QEMU startup, or be one added by the "add-process" QMP +command described above. In either case, the remote process proxy will +forward the new device's JSON description to the corresponding emulation +process. + +live migration +^^^^^^^^^^^^^^ + +The remote process proxy will also register for live migration +notifications with ``vmstate_register()``. When called to save state, +the proxy will send the remote process a secondary socket file +descriptor to save the remote process's device *vmstate* over. The +incoming byte stream length and data will be saved as the proxy's +*vmstate*. When the proxy is resumed on its new host, this *vmstate* +will be extracted, and a secondary socket file descriptor will be sent +to the new remote process through which it receives the *vmstate* in +order to restore the devices there. + +device emulation in remote process +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The parts of QEMU that the emulation program will need include the +object model; the memory emulation objects; the device emulation objects +of the targeted device, and any dependent devices; and, the device's +backends. It will also need code to setup the machine environment, +handle requests from the QEMU process, and route machine-level requests +(such as interrupts or IOMMU mappings) back to the QEMU process. + +initialization +^^^^^^^^^^^^^^ + +The process initialization sequence will follow the same sequence +followed by QEMU. It will first initialize the backend objects, then +device emulation objects. The JSON descriptions sent by the QEMU process +will drive which objects need to be created. + +- address spaces + +Before the device objects are created, the initial address spaces and +memory regions must be configured with ``memory_map_init()``. This +creates a RAM memory region object (*system\_memory*) and an IO memory +region object (*system\_io*). + +- RAM + +RAM memory region creation will follow how ``pc_memory_init()`` creates +them, but must use ``memory_region_init_ram_from_fd()`` instead of +``memory_region_allocate_system_memory()``. The file descriptors needed +will be supplied by the guest memory table from above. Those RAM regions +would then be added to the *system\_memory* memory region with +``memory_region_add_subregion()``. + +- PCI + +IO initialization will be driven by the JSON descriptions sent from the +QEMU process. For a PCI device, a PCI bus will need to be created with +``pci_root_bus_new()``, and a PCI memory region will need to be created +and added to the *system\_memory* memory region with +``memory_region_add_subregion_overlap()``. 
The overlap version is
+required for architectures where PCI memory overlaps with RAM memory.
+
+MMIO handling
+^^^^^^^^^^^^^
+
+The device emulation objects will use ``memory_region_init_io()`` to
+install their MMIO handlers, and ``pci_register_bar()`` to associate
+those handlers with a PCI BAR, as they do within QEMU currently.
+
+In order to use ``address_space_rw()`` in the emulation process to
+handle MMIO requests from QEMU, the PCI physical addresses must be the
+same in the QEMU process and the device emulation process. In order to
+accomplish that, guest BAR programming must also be forwarded from QEMU
+to the emulation process.
+
+interrupt injection
+^^^^^^^^^^^^^^^^^^^
+
+When device emulation wants to inject an interrupt into the VM, the
+request climbs the device's bus object hierarchy until the point where a
+bus object knows how to signal the interrupt to the guest. The details
+depend on the type of interrupt being raised.
+
+- PCI pin interrupts
+
+On x86 systems, there is an emulated IOAPIC object attached to the root
+PCI bus object, and the root PCI object forwards interrupt requests to
+it. The IOAPIC object, in turn, calls the KVM driver to inject the
+corresponding interrupt into the VM. The simplest way to handle this in
+an emulation process would be to set up the root PCI bus driver (via
+``pci_bus_irqs()``) to send an interrupt request back to the QEMU
+process, and have the device proxy object reflect it up the PCI tree
+there.
+
+- PCI MSI/X interrupts
+
+PCI MSI/X interrupts are implemented in HW as DMA writes to a
+CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
+these DMA writes, then calls into the KVM driver to inject the interrupt
+into the VM. A simple emulation process implementation would be to send
+the MSI DMA address from QEMU as a message at initialization, then
+install an address space handler at that address which forwards the MSI
+message back to QEMU.
+
+DMA operations
+^^^^^^^^^^^^^^
+
+When an emulation object wants to DMA into or out of guest memory, it
+first must use ``dma_memory_map()`` to convert the DMA address to a local
+virtual address. The emulation process memory region objects set up above
+will be used to translate the DMA address to a local virtual address the
+device emulation code can access.
+
+IOMMU
+^^^^^
+
+When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
+regions to translate the DMA address to a guest physical address before
+that physical address can be translated to a local virtual address. The
+emulation process will need similar functionality.
+
+- IOTLB cache
+
+The emulation process will maintain a cache of recent IOMMU translations
+(the IOTLB). When the translate() callback of an IOMMU memory region is
+invoked, the IOTLB cache will be searched for an entry that will map the
+DMA address to a guest PA. On a cache miss, a message will be sent back
+to QEMU requesting the corresponding translation entry, which will both
+be used to return a guest address and be added to the cache.
+
+- IOTLB purge
+
+The IOMMU emulation will also need to act on unmap requests from QEMU.
+These happen when the guest IOMMU driver purges an entry from the
+guest's translation table.
+
+live migration
+^^^^^^^^^^^^^^
+
+When a remote process receives a live migration indication from QEMU, it
+will set up a channel using the received file descriptor with
+``qio_channel_socket_new_fd()``. This channel will be used to create a
+*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
+the process's device state back to QEMU. This method will be reversed on
+restore - the channel will be passed to ``qemu_loadvm_state()`` to
+restore the device state.
+
+Accelerating device emulation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The messages that are required to be sent between QEMU and the emulation
+process can add considerable latency to IO operations. The optimizations
+described below attempt to ameliorate this effect by allowing the
+emulation process to communicate directly with the kernel KVM driver.
+The KVM file descriptors created would be passed to the emulation process
+via initialization messages, much as is done for the guest memory table.
+
+MMIO acceleration
+^^^^^^^^^^^^^^^^^
+
+Vhost user applications can receive guest virtio driver stores directly
+from KVM. The issue with the eventfd mechanism used by vhost user is
+that it does not pass any data with the event indication, so it cannot
+handle guest loads or guest stores that carry data. This concept
+could, however, be expanded to cover more cases.
+
+The expanded idea would require a new type of KVM device:
+*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
+descriptor that QEMU can use for configuration, and a slave descriptor
+that the emulation process can use to receive MMIO notifications. QEMU
+would create both descriptors using the KVM driver, and pass the slave
+descriptor to the emulation process via an initialization message.
+
+data structures
+^^^^^^^^^^^^^^^
+
+- guest physical range
+
+The guest physical range structure describes the address range that a
+device will respond to. It includes the base and length of the range, as
+well as which bus the range resides on (e.g., on an x86 machine, it can
+specify whether the range refers to memory or IO addresses).
+
+A device can have multiple physical address ranges it responds to (e.g.,
+a PCI device can have multiple BARs), so the structure will also include
+an enumerated identifier to specify which of the device's ranges is
+being referred to.
+
++--------+----------------------------+
+| Name   | Description                |
++========+============================+
+| addr   | range base address         |
++--------+----------------------------+
+| len    | range length               |
++--------+----------------------------+
+| bus    | addr type (memory or IO)   |
++--------+----------------------------+
+| id     | range ID (e.g., PCI BAR)   |
++--------+----------------------------+
+
+- MMIO request structure
+
+This structure describes an MMIO operation. It includes which guest
+physical range the MMIO was within, the offset within that range, the
+MMIO type (e.g., load or store), and its length and data. It also
+includes a sequence number that can be used to reply to the MMIO, and
+the CPU that issued the MMIO.
+
++----------+------------------------+
+| Name     | Description            |
++==========+========================+
+| rid      | range MMIO is within   |
++----------+------------------------+
+| offset   | offset within *rid*    |
++----------+------------------------+
+| type     | e.g., load or store    |
++----------+------------------------+
+| len      | MMIO length            |
++----------+------------------------+
+| data     | store data             |
++----------+------------------------+
+| seq      | sequence ID            |
++----------+------------------------+
+
+- MMIO request queues
+
+MMIO request queues are FIFO arrays of MMIO request structures. There
+are two queues: the pending queue is for MMIOs that haven't been read by
+the emulation program, and the sent queue is for MMIOs that haven't been
+acknowledged. The main use of the second queue is to validate MMIO
+replies from the emulation program.
+
+- scoreboard
+
+Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
+MMIOs may be waiting to be consumed by an emulation program and multiple
+threads may be waiting for MMIO replies. The scoreboard would contain a
+wait queue and sequence number for the per-CPU threads, allowing them to
+be individually woken when the MMIO reply is received from the emulation
+program. It also tracks the number of posted MMIO stores to the device
+that haven't been replied to, in order to satisfy the PCI constraint
+that a load to a device will not complete until all previous stores to
+that device have been completed.
+
+- device shadow memory
+
+Some MMIO loads do not have device side-effects. These MMIOs can be
+completed without sending an MMIO request to the emulation program if the
+emulation program shares a shadow image of the device's memory image
+with the KVM driver.
+
+The emulation program will ask the KVM driver to allocate memory for the
+shadow image, and will then use ``mmap()`` to directly access it. The
+emulation program can control KVM access to the shadow image by sending
+KVM an access map telling it which areas of the image have no
+side-effects (and can be completed immediately), and which require an
+MMIO request to the emulation program. The access map can also inform
+the KVM driver which size accesses are allowed to the image.
+
+master descriptor
+^^^^^^^^^^^^^^^^^
+
+The master descriptor is used by QEMU to configure the new KVM device.
+The descriptor would be returned by the KVM driver when QEMU issues a
+*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.
+
+KVM\_DEV\_TYPE\_USER device ops
+'''''''''''''''''''''''''''''''
+
+The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
+``kvm_register_device_ops()`` call when the KVM system is initialized by
+``kvm_init()``. These device ops are called by the KVM driver when QEMU
+executes certain ``ioctl()`` operations on its KVM file descriptor. They
+include:
+
+- create
+
+This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
+``ioctl()`` on its per-VM file descriptor. It will allocate and
+initialize a KVM user device specific data structure, and assign the
+*kvm\_device* private field to it.
+
+- ioctl
+
+This routine is invoked when QEMU issues an ``ioctl()`` on the master
+descriptor. The ``ioctl()`` commands supported are defined by the KVM
+device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:
+
+*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
+be passed to the device emulation program. Only one slave can be created
+by each master descriptor. The file operations performed by this
+descriptor are described below.
+
+The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
+address range that the slave descriptor will receive MMIO notifications
+for. The range is specified by a guest physical range structure
+argument. For buses that assign addresses to devices dynamically, this
+command can be executed while the guest is running, as is the case
+when a guest changes a device's PCI BAR registers.
+
+*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
+register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
+performs an MMIO operation within the range.
When a range is changed, +``kvm_io_bus_unregister_dev()`` is used to remove the previous +instantiation. + +*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies +how long KVM will wait for the emulation process to respond to a MMIO +indication. + +- destroy + +This routine is called when the VM instance is destroyed. It will need +to destroy the slave descriptor; and free any memory allocated by the +driver, as well as the *kvm\_device* structure itself. + +slave descriptor +^^^^^^^^^^^^^^^^ + +The slave descriptor will have its own file operations vector, which +responds to system calls on the descriptor performed by the device +emulation program. + +- read + +A read returns any pending MMIO requests from the KVM driver as MMIO +request structures. Multiple structures can be returned if there are +multiple MMIO operations pending. The MMIO requests are moved from the +pending queue to the sent queue, and if there are threads waiting for +space in the pending to add new MMIO operations, they will be woken +here. + +- write + +A write also consists of a set of MMIO requests. They are compared to +the MMIO requests in the sent queue. Matches are removed from the sent +queue, and any threads waiting for the reply are woken. If a store is +removed, then the number of posted stores in the per-CPU scoreboard is +decremented. When the number is zero, and a non side-effect load was +waiting for posted stores to complete, the load is continued. + +- ioctl + +There are several ioctl()s that can be performed on the slave +descriptor. + +A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to +allocate memory for the shadow image. This memory can later be +``mmap()``\ ed by the emulation process to share the emulation's view of +device memory with the KVM driver. + +A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the +shadow image. It will send the KVM driver a shadow control map, which +specifies which areas of the image can complete guest loads without +sending the load request to the emulation program. It will also specify +the size of load operations that are allowed. + +- poll + +An emulation program will use the ``poll()`` call with a *POLLIN* flag +to determine if there are MMIO requests waiting to be read. It will +return if the pending MMIO request queue is not empty. + +- mmap + +This call allows the emulation program to directly access the shadow +image allocated by the KVM driver. As device emulation updates device +memory, changes with no side-effects will be reflected in the shadow, +and the KVM driver can satisfy guest loads from the shadow image without +needing to wait for the emulation program. + +kvm\_io\_device ops +^^^^^^^^^^^^^^^^^^^ + +Each KVM per-CPU thread can handle MMIO operation on behalf of the guest +VM. KVM will use the MMIO's guest physical address to search for a +matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM +driver instead of exiting back to QEMU. If a match is found, the +corresponding callback will be invoked. + +- read + +This callback is invoked when the guest performs a load to the device. +Loads with side-effects must be handled synchronously, with the KVM +driver putting the QEMU thread to sleep waiting for the emulation +process reply before re-starting the guest. Loads that do not have +side-effects may be optimized by satisfying them from the shadow image, +if there are no outstanding stores to the device by this CPU. 
PCI memory
+ordering demands that a load cannot complete before all older stores to
+the same device have been completed.
+
+- write
+
+Stores can be handled asynchronously unless the pending MMIO request
+queue is full. In this case, the QEMU thread must sleep waiting for
+space in the queue. Stores will increment the number of posted stores in
+the per-CPU scoreboard, in order to implement the PCI ordering
+constraint above.
+
+interrupt acceleration
+^^^^^^^^^^^^^^^^^^^^^^
+
+This performance optimization would work much like a vhost user
+application does, where the QEMU process sets up *eventfds* that cause
+the device's corresponding interrupt to be triggered by the KVM driver.
+These irq file descriptors are sent to the emulation process at
+initialization, and are used when the emulation code raises a device
+interrupt.
+
+intx acceleration
+'''''''''''''''''
+
+Traditional PCI pin interrupts are level based, so, in addition to an
+irq file descriptor, a re-sampling file descriptor needs to be sent to
+the emulation program. This second file descriptor allows multiple
+devices sharing an irq to be notified when the interrupt has been
+acknowledged by the guest, so they can re-trigger the interrupt if their
+device has not de-asserted its interrupt.
+
+- intx irq descriptor
+
+The irq descriptors are created by the proxy object using
+``event_notifier_init()`` to create the irq and re-sampling
+*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
+The interrupt route can be found with
+``pci_device_route_intx_to_irq()``.
+
+- intx routing changes
+
+Intx routing can be changed when the guest programs the APIC the device
+pin is connected to. The proxy object in QEMU will use
+``pci_device_set_intx_routing_notifier()`` to be informed of any guest
+changes to the route. This handler will broadly follow the VFIO
+interrupt logic to change the route: de-assigning the existing irq
+descriptor from its route, then assigning it the new route (see
+``vfio_intx_update()``).
+
+MSI/X acceleration
+''''''''''''''''''
+
+MSI/X interrupts are sent as DMA transactions to the host. The interrupt
+data contains a vector that is programmed by the guest. A device may have
+multiple MSI interrupts associated with it, so multiple irq descriptors
+may need to be sent to the emulation program.
+
+- MSI/X irq descriptor
+
+This case will also follow the VFIO example. For each MSI/X interrupt,
+an *eventfd* is created, a virtual interrupt is allocated by
+``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
+the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
+
+- MSI/X config space changes
+
+The guest may dynamically update several MSI-related tables in the
+device's PCI config space. These include per-MSI interrupt enables and
+vector data. Additionally, MSI/X tables exist in device memory space, not
+config space. Much like the BAR case above, the proxy object must look
+at guest config space programming to keep the MSI interrupt state
+consistent between QEMU and the emulation program.
+
+--------------
+
+Disaggregated CPU emulation
+---------------------------
+
+After IO services have been disaggregated, a second phase would be to
+separate a process to handle CPU instruction emulation from the main
+QEMU control function. There are no object separation points for this
+code, so the first task would be to create one.
+ +Host access controls +-------------------- + +Separating QEMU relies on the host OS's access restriction mechanisms to +enforce that the differing processes can only access the objects they +are entitled to. There are a couple types of mechanisms usually provided +by general purpose OSs. + +Discretionary access control +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Discretionary access control allows each user to control who can access +their files. In Linux, this type of control is usually too coarse for +QEMU separation, since it only provides three separate access controls: +one for the same user ID, the second for users IDs with the same group +ID, and the third for all other user IDs. Each device instance would +need a separate user ID to provide access control, which is likely to be +unwieldy for dynamically created VMs. + +Mandatory access control +~~~~~~~~~~~~~~~~~~~~~~~~ + +Mandatory access control allows the OS to add an additional set of +controls on top of discretionary access for the OS to control. It also +adds other attributes to processes and files such as types, roles, and +categories, and can establish rules for how processes and files can +interact. + +Type enforcement +^^^^^^^^^^^^^^^^ + +Type enforcement assigns a *type* attribute to processes and files, and +allows rules to be written on what operations a process with a given +type can perform on a file with a given type. QEMU separation could take +advantage of type enforcement by running the emulation processes with +different types, both from the main QEMU process, and from the emulation +processes of different classes of devices. + +For example, guest disk images and disk emulation processes could have +types separate from the main QEMU process and non-disk emulation +processes, and the type rules could prevent processes other than disk +emulation ones from accessing guest disk images. Similarly, network +emulation processes can have a type separate from the main QEMU process +and non-network emulation process, and only that type can access the +host tun/tap device used to provide guest networking. + +Category enforcement +^^^^^^^^^^^^^^^^^^^^ + +Category enforcement assigns a set of numbers within a given range to +the process or file. The process is granted access to the file if the +process's set is a superset of the file's set. This enforcement can be +used to separate multiple instances of devices in the same class. + +For example, if there are multiple disk devices provides to a guest, +each device emulation process could be provisioned with a separate +category. The different device emulation processes would not be able to +access each other's backing disk images. + +Alternatively, categories could be used in lieu of the type enforcement +scheme described above. In this scenario, different categories would be +used to prevent device emulation processes in different classes from +accessing resources assigned to other classes. diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst new file mode 100644 index 000000000..c9541a7b2 --- /dev/null +++ b/docs/devel/multi-thread-tcg.rst @@ -0,0 +1,373 @@ +.. + Copyright (c) 2015-2020 Linaro Ltd. + + This work is licensed under the terms of the GNU GPL, version 2 or + later. See the COPYING file in the top-level directory. + +================== +Multi-threaded TCG +================== + +This document outlines the design for multi-threaded TCG (a.k.a MTTCG) +system-mode emulation. 
user-mode emulation has always mirrored the +thread structure of the translated executable although some of the +changes done for MTTCG system emulation have improved the stability of +linux-user emulation. + +The original system-mode TCG implementation was single threaded and +dealt with multiple CPUs with simple round-robin scheduling. This +simplified a lot of things but became increasingly limited as systems +being emulated gained additional cores and per-core performance gains +for host systems started to level off. + +vCPU Scheduling +=============== + +We introduce a new running mode where each vCPU will run on its own +user-space thread. This is enabled by default for all FE/BE +combinations where the host memory model is able to accommodate the +guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the +guest has had the required work done to support this safely +(TARGET_SUPPORTS_MTTCG). + +System emulation will fall back to the original round robin approach +if: + +* forced by --accel tcg,thread=single +* enabling --icount mode +* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST) + +In the general case of running translated code there should be no +inter-vCPU dependencies and all vCPUs should be able to run at full +speed. Synchronisation will only be required while accessing internal +shared data structures or when the emulated architecture requires a +coherent representation of the emulated machine state. + +Shared Data Structures +====================== + +Main Run Loop +------------- + +Even when there is no code being generated there are a number of +structures associated with the hot-path through the main run-loop. +These are associated with looking up the next translation block to +execute. These include: + + tb_jmp_cache (per-vCPU, cache of recent jumps) + tb_ctx.htable (global hash table, phys address->tb lookup) + +As TB linking only occurs when blocks are in the same page this code +is critical to performance as looking up the next TB to execute is the +most common reason to exit the generated code. + +DESIGN REQUIREMENT: Make access to lookup structures safe with +multiple reader/writer threads. Minimise any lock contention to do it. + +The hot-path avoids using locks where possible. The tb_jmp_cache is +updated with atomic accesses to ensure consistent results. The fall +back QHT based hash table is also designed for lockless lookups. Locks +are only taken when code generation is required or TranslationBlocks +have their block-to-block jumps patched. + +Global TCG State +---------------- + +User-mode emulation +~~~~~~~~~~~~~~~~~~~ + +We need to protect the entire code generation cycle including any post +generation patching of the translated code. This also implies a shared +translation buffer which contains code running on all cores. Any +execution path that comes to the main run loop will need to hold a +mutex for code generation. This also includes times when we need flush +code or entries from any shared lookups/caches. Structures held on a +per-vCPU basis won't need locking unless other vCPUs will need to +modify them. + +DESIGN REQUIREMENT: Add locking around all code generation and TB +patching. + +(Current solution) + +Code generation is serialised with mmap_lock(). + +!User-mode emulation +~~~~~~~~~~~~~~~~~~~~ + +Each vCPU has its own TCG context and associated TCG region, thereby +requiring no locking during translation. 
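+
+Recapping the vCPU scheduling rules described earlier, the default
+choice between MTTCG and the round robin scheduler amounts to a check
+along the following lines. This is a sketch only; the function name is
+illustrative and the real logic lives in the TCG accelerator code:
+
+::
+
+  static bool mttcg_supported_by_default(void)
+  {
+      /* -icount and oversized guests force the round robin scheduler */
+      if (icount_enabled() || TCG_OVERSIZED_GUEST) {
+          return false;
+      }
+  #ifndef TARGET_SUPPORTS_MTTCG
+      /* the guest front end has not been audited for MTTCG */
+      return false;
+  #else
+      /* guest ordering requirements must be a subset of the host's */
+      return (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO) == 0;
+  #endif
+  }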
+ +Translation Blocks +------------------ + +Currently the whole system shares a single code generation buffer +which when full will force a flush of all translations and start from +scratch again. Some operations also force a full flush of translations +including: + + - debugging operations (breakpoint insertion/removal) + - some CPU helper functions + - linux-user spawning its first thread + +This is done with the async_safe_run_on_cpu() mechanism to ensure all +vCPUs are quiescent when changes are being made to shared global +structures. + +More granular translation invalidation events are typically due +to a change of the state of a physical page: + + - code modification (self modify code, patching code) + - page changes (new page mapping in linux-user mode) + +While setting the invalid flag in a TranslationBlock will stop it +being used when looked up in the hot-path there are a number of other +book-keeping structures that need to be safely cleared. + +Any TranslationBlocks which have been patched to jump directly to the +now invalid blocks need the jump patches reversing so they will return +to the C code. + +There are a number of look-up caches that need to be properly updated +including the: + + - jump lookup cache + - the physical-to-tb lookup hash table + - the global page table + +The global page table (l1_map) which provides a multi-level look-up +for PageDesc structures which contain pointers to the start of a +linked list of all Translation Blocks in that page (see page_next). + +Both the jump patching and the page cache involve linked lists that +the invalidated TranslationBlock needs to be removed from. + +DESIGN REQUIREMENT: Safely handle invalidation of TBs + - safely patch/revert direct jumps + - remove central PageDesc lookup entries + - ensure lookup caches/hashes are safely updated + +(Current solution) + +The direct jump themselves are updated atomically by the TCG +tb_set_jmp_target() code. Modification to the linked lists that allow +searching for linked pages are done under the protection of tb->jmp_lock, +where tb is the destination block of a jump. Each origin block keeps a +pointer to its destinations so that the appropriate lock can be acquired before +iterating over a jump list. + +The global page table is a lockless radix tree; cmpxchg is used +to atomically insert new elements. + +The lookup caches are updated atomically and the lookup hash uses QHT +which is designed for concurrent safe lookup. + +Parallel code generation is supported. QHT is used at insertion time +as the synchronization point across threads, thereby ensuring that we only +keep track of a single TranslationBlock for each guest code block. + +Memory maps and TLBs +-------------------- + +The memory handling code is fairly critical to the speed of memory +access in the emulated system. The SoftMMU code is designed so the +hot-path can be handled entirely within translated code. This is +handled with a per-vCPU TLB structure which once populated will allow +a series of accesses to the page to occur without exiting the +translated code. It is possible to set flags in the TLB address which +will ensure the slow-path is taken for each access. 
This can be done
+to support:
+
+ - Memory regions (dividing up access to PIO, MMIO and RAM)
+ - Dirty page tracking (for code gen, SMC detection, migration and display)
+ - Virtual TLB (for translating guest address->real address)
+
+When a vCPU's TLB tables are updated by a thread other than its own, we
+need to ensure it is done in a safe way so no inconsistent state is
+seen by that vCPU thread.
+
+Some operations require updating a number of vCPUs' TLBs at the same
+time in a synchronised manner.
+
+DESIGN REQUIREMENTS:
+
+ - TLB Flush All/Page
+   - can be across-vCPUs
+   - cross vCPU TLB flush may need other vCPU brought to halt
+   - change may need to be visible to the calling vCPU immediately
+ - TLB Flag Update
+   - usually cross-vCPU
+   - want change to be visible as soon as possible
+ - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+   - This is a per-vCPU table - by definition can't race
+   - updated by its own thread when the slow-path is forced
+
+(Current solution)
+
+We have updated cputlb.c to defer cross-vCPU operations with
+async_run_on_cpu(), which ensures each vCPU sees a coherent state when
+it next runs its work (in a few instructions' time).
+
+A new set of operations (tlb_flush_*_all_cpus) takes an additional flag
+which, when set, will force synchronisation by setting the source vCPU's
+work as "safe work" and exiting the cpu run loop. This ensures that by
+the time execution restarts all flush operations have completed.
+
+TLB flag updates are all done atomically and are also protected by the
+corresponding page lock.
+
+(Known limitation)
+
+Not really a limitation, but the wait mechanism is overly strict for
+some architectures which only need flushes completed by a barrier
+instruction. This could be a future optimisation.
+
+Emulated hardware state
+-----------------------
+
+Currently, thanks to KVM work, any access to IO memory is automatically
+protected by the global iothread mutex, also known as the BQL (Big
+QEMU Lock). Any IO region that doesn't use the global mutex is expected
+to do its own locking.
+
+However IO memory isn't the only way emulated hardware state can be
+modified. Some architectures have model specific registers that
+trigger hardware emulation features. Generally any translation helper
+that needs to update more than a single vCPU's state should take the
+BQL.
+
+As the BQL, or global iothread mutex, is shared across the system, we
+push the use of the lock as far down into the TCG code as possible to
+minimise contention.
+
+(Current solution)
+
+MMIO access automatically serialises hardware emulation by way of the
+BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
+and also defer the reset/startup of vCPUs to the vCPU context by way
+of async_run_on_cpu().
+
+Updates to interrupt state are also protected by the BQL as they can
+often be cross vCPU.
+
+Memory Consistency
+==================
+
+Between emulated guests and host systems there are a range of memory
+consistency models. Even emulating weakly ordered systems on strongly
+ordered hosts needs to ensure things like store-after-load re-ordering
+can be prevented when the guest wants to.
+
+Memory Barriers
+---------------
+
+Barriers (sometimes known as fences) provide a mechanism for software
+to enforce a particular ordering of memory operations from the point
+of view of external observers (e.g. another processor core). They can
+apply to all memory operations or to just loads or stores.
+ +The Linux kernel has an excellent `write-up +<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_ +on the various forms of memory barrier and the guarantees they can +provide. + +Barriers are often wrapped around synchronisation primitives to +provide explicit memory ordering semantics. However they can be used +by themselves to provide safe lockless access by ensuring for example +a change to a signal flag will only be visible once the changes to +payload are. + +DESIGN REQUIREMENT: Add a new tcg_memory_barrier op + +This would enforce a strong load/store ordering so all loads/stores +complete at the memory barrier. On single-core non-SMP strongly +ordered backends this could become a NOP. + +Aside from explicit standalone memory barrier instructions there are +also implicit memory ordering semantics which comes with each guest +memory access instruction. For example all x86 load/stores come with +fairly strong guarantees of sequential consistency whereas Arm has +special variants of load/store instructions that imply acquire/release +semantics. + +In the case of a strongly ordered guest architecture being emulated on +a weakly ordered host the scope for a heavy performance impact is +quite high. + +DESIGN REQUIREMENTS: Be efficient with use of memory barriers + - host systems with stronger implied guarantees can skip some barriers + - merge consecutive barriers to the strongest one + +(Current solution) + +The system currently has a tcg_gen_mb() which will add memory barrier +operations if code generation is being done in a parallel context. The +tcg_optimize() function attempts to merge barriers up to their +strongest form before any load/store operations. The solution was +originally developed and tested for linux-user based systems. All +backends have been converted to emit fences when required. So far the +following front-ends have been updated to emit fences when required: + + - target-i386 + - target-arm + - target-aarch64 + - target-alpha + - target-mips + +Memory Control and Maintenance +------------------------------ + +This includes a class of instructions for controlling system cache +behaviour. While QEMU doesn't model cache behaviour these instructions +are often seen when code modification has taken place to ensure the +changes take effect. + +Synchronisation Primitives +-------------------------- + +There are two broad types of synchronisation primitives found in +modern ISAs: atomic instructions and exclusive regions. + +The first type offer a simple atomic instruction which will guarantee +some sort of test and conditional store will be truly atomic w.r.t. +other cores sharing access to the memory. The classic example is the +x86 cmpxchg instruction. + +The second type offer a pair of load/store instructions which offer a +guarantee that a region of memory has not been touched between the +load and store instructions. An example of this is Arm's ldrex/strex +pair where the strex instruction will return a flag indicating a +successful store only if no other CPU has accessed the memory region +since the ldrex. + +Traditionally TCG has generated a series of operations that work +because they are within the context of a single translation block so +will have completed before another CPU is scheduled. However with +the ability to have multiple threads running to emulate multiple CPUs +we will need to explicitly expose these semantics. 
+ +DESIGN REQUIREMENTS: + - Support classic atomic instructions + - Support load/store exclusive (or load link/store conditional) pairs + - Generic enough infrastructure to support all guest architectures +CURRENT OPEN QUESTIONS: + - How problematic is the ABA problem in general? + +(Current solution) + +The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which +can be used directly or combined to emulate other instructions like +Arm's ldrex/strex instructions. While they are susceptible to the ABA +problem so far common guests have not implemented patterns where +this may be a problem - typically presenting a locking ABI which +assumes cmpxchg like semantics. + +The code also includes a fall-back for cases where multi-threaded TCG +ops can't work (e.g. guest atomic width > host atomic width). In this +case an EXCP_ATOMIC exit occurs and the instruction is emulated with +an exclusive lock which ensures all emulation is serialised. + +While the atomic helpers look good enough for now there may be a need +to look at solutions that can more closely model the guest +architectures semantics. diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt new file mode 100644 index 000000000..aeb997bed --- /dev/null +++ b/docs/devel/multiple-iothreads.txt @@ -0,0 +1,138 @@ +Copyright (c) 2014-2017 Red Hat Inc. + +This work is licensed under the terms of the GNU GPL, version 2 or later. See +the COPYING file in the top-level directory. + + +This document explains the IOThread feature and how to write code that runs +outside the QEMU global mutex. + +The main loop and IOThreads +--------------------------- +QEMU is an event-driven program that can do several things at once using an +event loop. The VNC server and the QMP monitor are both processed from the +same event loop, which monitors their file descriptors until they become +readable and then invokes a callback. + +The default event loop is called the main loop (see main-loop.c). It is +possible to create additional event loop threads using -object +iothread,id=my-iothread. + +Side note: The main loop and IOThread are both event loops but their code is +not shared completely. Sometimes it is useful to remember that although they +are conceptually similar they are currently not interchangeable. + +Why IOThreads are useful +------------------------ +IOThreads allow the user to control the placement of work. The main loop is a +scalability bottleneck on hosts with many CPUs. Work can be spread across +several IOThreads instead of just one main loop. When set up correctly this +can improve I/O latency and reduce jitter seen by the guest. + +The main loop is also deeply associated with the QEMU global mutex, which is a +scalability bottleneck in itself. vCPU threads and the main loop use the QEMU +global mutex to serialize execution of QEMU code. This mutex is necessary +because a lot of QEMU's code historically was not thread-safe. + +The fact that all I/O processing is done in a single main loop and that the +QEMU global mutex is contended by all vCPU threads and the main loop explain +why it is desirable to place work into IOThreads. 
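+
+As a taste of what later sections describe in detail, a subsystem can
+create its own IOThread and schedule work in it roughly as follows. This
+is only a sketch: iothread_create(), iothread_get_aio_context() and
+aio_bh_schedule_oneshot() are existing APIs, while the my_work() and
+start_my_iothread() names are made up and error handling is omitted:
+
+  static void my_work(void *opaque)
+  {
+      /* runs in the IOThread's AioContext, not under the main loop */
+  }
+
+  void start_my_iothread(void)
+  {
+      IOThread *iothread = iothread_create("my-iothread", &error_abort);
+      AioContext *ctx = iothread_get_aio_context(iothread);
+
+      aio_bh_schedule_oneshot(ctx, my_work, NULL);
+  }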
+ +The experimental virtio-blk data-plane implementation has been benchmarked and +shows these effects: +ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf + +How to program for IOThreads +---------------------------- +The main difference between legacy code and new code that can run in an +IOThread is dealing explicitly with the event loop object, AioContext +(see include/block/aio.h). Code that only works in the main loop +implicitly uses the main loop's AioContext. Code that supports running +in IOThreads must be aware of its AioContext. + +AioContext supports the following services: + * File descriptor monitoring (read/write/error on POSIX hosts) + * Event notifiers (inter-thread signalling) + * Timers + * Bottom Halves (BH) deferred callbacks + +There are several old APIs that use the main loop AioContext: + * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor + * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier + * LEGACY timer_new_ms() - create a timer + * LEGACY qemu_bh_new() - create a BH + * LEGACY qemu_aio_wait() - run an event loop iteration + +Since they implicitly work on the main loop they cannot be used in code that +runs in an IOThread. They might cause a crash or deadlock if called from an +IOThread since the QEMU global mutex is not held. + +Instead, use the AioContext functions directly (see include/block/aio.h): + * aio_set_fd_handler() - monitor a file descriptor + * aio_set_event_notifier() - monitor an event notifier + * aio_timer_new() - create a timer + * aio_bh_new() - create a BH + * aio_poll() - run an event loop iteration + +The AioContext can be obtained from the IOThread using +iothread_get_aio_context() or for the main loop using qemu_get_aio_context(). +Code that takes an AioContext argument works both in IOThreads or the main +loop, depending on which AioContext instance the caller passes in. + +How to synchronize with an IOThread +----------------------------------- +AioContext is not thread-safe so some rules must be followed when using file +descriptors, event notifiers, timers, or BHs across threads: + +1. AioContext functions can always be called safely. They handle their +own locking internally. + +2. Other threads wishing to access the AioContext must use +aio_context_acquire()/aio_context_release() for mutual exclusion. Once the +context is acquired no other thread can access it or run event loop iterations +in this AioContext. + +Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls. +Do not use nesting anymore, it is incompatible with the BDRV_POLL_WHILE() macro +used in the block layer and can lead to hangs. + +There is currently no lock ordering rule if a thread needs to acquire multiple +AioContexts simultaneously. Therefore, it is only safe for code holding the +QEMU global mutex to acquire other AioContexts. + +Side note: the best way to schedule a function call across threads is to call +aio_bh_schedule_oneshot(). No acquire/release or locking is needed. + +AioContext and the block layer +------------------------------ +The AioContext originates from the QEMU block layer, even though nowadays +AioContext is a generic event loop that can be used by any QEMU subsystem. + +The block layer has support for AioContext integrated. Each BlockDriverState +is associated with an AioContext using bdrv_try_set_aio_context() and +bdrv_get_aio_context(). This allows block layer code to process I/O inside the +right AioContext. 
Other subsystems may wish to follow a similar approach. + +Block layer code must therefore expect to run in an IOThread and avoid using +old APIs that implicitly use the main loop. See the "How to program for +IOThreads" above for information on how to do that. + +If main loop code such as a QMP function wishes to access a BlockDriverState +it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure +that callbacks in the IOThread do not run in parallel. + +Code running in the monitor typically needs to ensure that past +requests from the guest are completed. When a block device is running +in an IOThread, the IOThread can also process requests from the guest +(via ioeventfd). To achieve both objects, wrap the code between +bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained +section". The functions must be called between aio_context_acquire() +and aio_context_release(). You can freely release and re-acquire the +AioContext within a drained section. + +Long-running jobs (usually in the form of coroutines) are best scheduled in +the BlockDriverState's AioContext to avoid the need to acquire/release around +each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier, +or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends, +can be used to get a notification whenever bdrv_try_set_aio_context() moves a +BlockDriverState to a different AioContext. diff --git a/docs/devel/qapi-code-gen.rst b/docs/devel/qapi-code-gen.rst new file mode 100644 index 000000000..a3b547308 --- /dev/null +++ b/docs/devel/qapi-code-gen.rst @@ -0,0 +1,1932 @@ +================================== +How to use the QAPI code generator +================================== + +.. + Copyright IBM Corp. 2011 + Copyright (C) 2012-2016 Red Hat, Inc. + + This work is licensed under the terms of the GNU GPL, version 2 or + later. See the COPYING file in the top-level directory. + + +Introduction +============ + +QAPI is a native C API within QEMU which provides management-level +functionality to internal and external users. For external +users/processes, this interface is made available by a JSON-based wire +format for the QEMU Monitor Protocol (QMP) for controlling qemu, as +well as the QEMU Guest Agent (QGA) for communicating with the guest. +The remainder of this document uses "Client JSON Protocol" when +referring to the wire contents of a QMP or QGA connection. + +To map between Client JSON Protocol interfaces and the native C API, +we generate C code from a QAPI schema. This document describes the +QAPI schema language, and how it gets mapped to the Client JSON +Protocol and to C. It additionally provides guidance on maintaining +Client JSON Protocol compatibility. + + +The QAPI schema language +======================== + +The QAPI schema defines the Client JSON Protocol's commands and +events, as well as types used by them. Forward references are +allowed. + +It is permissible for the schema to contain additional types not used +by any commands or events, for the side effect of generated C code +used internally. + +There are several kinds of types: simple types (a number of built-in +types, such as ``int`` and ``str``; as well as enumerations), arrays, +complex types (structs and two flavors of unions), and alternate types +(a choice between other types). + + +Schema syntax +------------- + +Syntax is loosely based on `JSON <http://www.ietf.org/rfc/rfc8259.txt>`_. 
+Differences: + +* Comments: start with a hash character (``#``) that is not part of a + string, and extend to the end of the line. + +* Strings are enclosed in ``'single quotes'``, not ``"double quotes"``. + +* Strings are restricted to printable ASCII, and escape sequences to + just ``\\``. + +* Numbers and ``null`` are not supported. + +A second layer of syntax defines the sequences of JSON texts that are +a correctly structured QAPI schema. We provide a grammar for this +syntax in an EBNF-like notation: + +* Production rules look like ``non-terminal = expression`` +* Concatenation: expression ``A B`` matches expression ``A``, then ``B`` +* Alternation: expression ``A | B`` matches expression ``A`` or ``B`` +* Repetition: expression ``A...`` matches zero or more occurrences of + expression ``A`` +* Repetition: expression ``A, ...`` matches zero or more occurrences of + expression ``A`` separated by ``,`` +* Grouping: expression ``( A )`` matches expression ``A`` +* JSON's structural characters are terminals: ``{ } [ ] : ,`` +* JSON's literal names are terminals: ``false true`` +* String literals enclosed in ``'single quotes'`` are terminal, and match + this JSON string, with a leading ``*`` stripped off +* When JSON object member's name starts with ``*``, the member is + optional. +* The symbol ``STRING`` is a terminal, and matches any JSON string +* The symbol ``BOOL`` is a terminal, and matches JSON ``false`` or ``true`` +* ALL-CAPS words other than ``STRING`` are non-terminals + +The order of members within JSON objects does not matter unless +explicitly noted. + +A QAPI schema consists of a series of top-level expressions:: + + SCHEMA = TOP-LEVEL-EXPR... + +The top-level expressions are all JSON objects. Code and +documentation is generated in schema definition order. Code order +should not matter. + +A top-level expressions is either a directive or a definition:: + + TOP-LEVEL-EXPR = DIRECTIVE | DEFINITION + +There are two kinds of directives and six kinds of definitions:: + + DIRECTIVE = INCLUDE | PRAGMA + DEFINITION = ENUM | STRUCT | UNION | ALTERNATE | COMMAND | EVENT + +These are discussed in detail below. 
+ + +Built-in Types +-------------- + +The following types are predefined, and map to C as follows: + + ============= ============== ============================================ + Schema C JSON + ============= ============== ============================================ + ``str`` ``char *`` any JSON string, UTF-8 + ``number`` ``double`` any JSON number + ``int`` ``int64_t`` a JSON number without fractional part + that fits into the C integer type + ``int8`` ``int8_t`` likewise + ``int16`` ``int16_t`` likewise + ``int32`` ``int32_t`` likewise + ``int64`` ``int64_t`` likewise + ``uint8`` ``uint8_t`` likewise + ``uint16`` ``uint16_t`` likewise + ``uint32`` ``uint32_t`` likewise + ``uint64`` ``uint64_t`` likewise + ``size`` ``uint64_t`` like ``uint64_t``, except + ``StringInputVisitor`` accepts size suffixes + ``bool`` ``bool`` JSON ``true`` or ``false`` + ``null`` ``QNull *`` JSON ``null`` + ``any`` ``QObject *`` any JSON value + ``QType`` ``QType`` JSON string matching enum ``QType`` values + ============= ============== ============================================ + + +Include directives +------------------ + +Syntax:: + + INCLUDE = { 'include': STRING } + +The QAPI schema definitions can be modularized using the 'include' directive:: + + { 'include': 'path/to/file.json' } + +The directive is evaluated recursively, and include paths are relative +to the file using the directive. Multiple includes of the same file +are idempotent. + +As a matter of style, it is a good idea to have all files be +self-contained, but at the moment, nothing prevents an included file +from making a forward reference to a type that is only introduced by +an outer file. The parser may be made stricter in the future to +prevent incomplete include files. + +.. _pragma: + +Pragma directives +----------------- + +Syntax:: + + PRAGMA = { 'pragma': { + '*doc-required': BOOL, + '*command-name-exceptions': [ STRING, ... ], + '*command-returns-exceptions': [ STRING, ... ], + '*member-name-exceptions': [ STRING, ... ] } } + +The pragma directive lets you control optional generator behavior. + +Pragma's scope is currently the complete schema. Setting the same +pragma to different values in parts of the schema doesn't work. + +Pragma 'doc-required' takes a boolean value. If true, documentation +is required. Default is false. + +Pragma 'command-name-exceptions' takes a list of commands whose names +may contain ``"_"`` instead of ``"-"``. Default is none. + +Pragma 'command-returns-exceptions' takes a list of commands that may +violate the rules on permitted return types. Default is none. + +Pragma 'member-name-exceptions' takes a list of types whose member +names may contain uppercase letters, and ``"_"`` instead of ``"-"``. +Default is none. + +.. _ENUM-VALUE: + +Enumeration types +----------------- + +Syntax:: + + ENUM = { 'enum': STRING, + 'data': [ ENUM-VALUE, ... ], + '*prefix': STRING, + '*if': COND, + '*features': FEATURES } + ENUM-VALUE = STRING + | { 'name': STRING, + '*if': COND, + '*features': FEATURES } + +Member 'enum' names the enum type. + +Each member of the 'data' array defines a value of the enumeration +type. The form STRING is shorthand for :code:`{ 'name': STRING }`. The +'name' values must be be distinct. + +Example:: + + { 'enum': 'MyEnum', 'data': [ 'value1', 'value2', 'value3' ] } + +Nothing prevents an empty enumeration, although it is probably not +useful. + +On the wire, an enumeration type's value is represented by its +(string) name. In C, it's represented by an enumeration constant. 
+These are of the form PREFIX_NAME, where PREFIX is derived from the +enumeration type's name, and NAME from the value's name. For the +example above, the generator maps 'MyEnum' to MY_ENUM and 'value1' to +VALUE1, resulting in the enumeration constant MY_ENUM_VALUE1. The +optional 'prefix' member overrides PREFIX. + +The generated C enumeration constants have values 0, 1, ..., N-1 (in +QAPI schema order), where N is the number of values. There is an +additional enumeration constant PREFIX__MAX with value N. + +Do not use string or an integer type when an enumeration type can do +the job satisfactorily. + +The optional 'if' member specifies a conditional. See `Configuring the +schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. + + +.. _TYPE-REF: + +Type references and array types +------------------------------- + +Syntax:: + + TYPE-REF = STRING | ARRAY-TYPE + ARRAY-TYPE = [ STRING ] + +A string denotes the type named by the string. + +A one-element array containing a string denotes an array of the type +named by the string. Example: ``['int']`` denotes an array of ``int``. + + +Struct types +------------ + +Syntax:: + + STRUCT = { 'struct': STRING, + 'data': MEMBERS, + '*base': STRING, + '*if': COND, + '*features': FEATURES } + MEMBERS = { MEMBER, ... } + MEMBER = STRING : TYPE-REF + | STRING : { 'type': TYPE-REF, + '*if': COND, + '*features': FEATURES } + +Member 'struct' names the struct type. + +Each MEMBER of the 'data' object defines a member of the struct type. + +.. _MEMBERS: + +The MEMBER's STRING name consists of an optional ``*`` prefix and the +struct member name. If ``*`` is present, the member is optional. + +The MEMBER's value defines its properties, in particular its type. +The form TYPE-REF_ is shorthand for :code:`{ 'type': TYPE-REF }`. + +Example:: + + { 'struct': 'MyType', + 'data': { 'member1': 'str', 'member2': ['int'], '*member3': 'str' } } + +A struct type corresponds to a struct in C, and an object in JSON. +The C struct's members are generated in QAPI schema order. + +The optional 'base' member names a struct type whose members are to be +included in this type. They go first in the C struct. + +Example:: + + { 'struct': 'BlockdevOptionsGenericFormat', + 'data': { 'file': 'str' } } + { 'struct': 'BlockdevOptionsGenericCOWFormat', + 'base': 'BlockdevOptionsGenericFormat', + 'data': { '*backing': 'str' } } + +An example BlockdevOptionsGenericCOWFormat object on the wire could use +both members like this:: + + { "file": "/some/place/my-image", + "backing": "/some/place/my-backing-file" } + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. + + +Union types +----------- + +Syntax:: + + UNION = { 'union': STRING, + 'base': ( MEMBERS | STRING ), + 'discriminator': STRING, + 'data': BRANCHES, + '*if': COND, + '*features': FEATURES } + BRANCHES = { BRANCH, ... } + BRANCH = STRING : TYPE-REF + | STRING : { 'type': TYPE-REF, '*if': COND } + +Member 'union' names the union type. + +The 'base' member defines the common members. If it is a MEMBERS_ +object, it defines common members just like a struct type's 'data' +member defines struct type members. If it is a STRING, it names a +struct type whose members are the common members. + +Member 'discriminator' must name a non-optional enum-typed member of +the base struct. 
That member's value selects a branch by its name.
+If no such branch exists, an empty branch is assumed.
+
+Each BRANCH of the 'data' object defines a branch of the union. A
+union must have at least one branch.
+
+The BRANCH's STRING name is the branch name. It must be a value of
+the discriminator enum type.
+
+The BRANCH's value defines the branch's properties, in particular its
+type. The type must be a struct type. The form TYPE-REF_ is shorthand
+for :code:`{ 'type': TYPE-REF }`.
+
+In the Client JSON Protocol, a union is represented by an object with
+the common members (from the base type) and the selected branch's
+members. The two sets of member names must be disjoint.
+
+Example::
+
+    { 'enum': 'BlockdevDriver', 'data': [ 'file', 'qcow2' ] }
+    { 'union': 'BlockdevOptions',
+      'base': { 'driver': 'BlockdevDriver', '*read-only': 'bool' },
+      'discriminator': 'driver',
+      'data': { 'file': 'BlockdevOptionsFile',
+                'qcow2': 'BlockdevOptionsQcow2' } }
+
+Resulting in these JSON objects::
+
+    { "driver": "file", "read-only": true,
+      "filename": "/some/place/my-image" }
+    { "driver": "qcow2", "read-only": false,
+      "backing": "/some/place/my-image", "lazy-refcounts": true }
+
+The order of branches need not match the order of the enum values.
+The branches need not cover all possible enum values. In the
+resulting generated C data types, a union is represented as a struct
+with the base members in QAPI schema order, and then a union of
+structures for each branch of the struct.
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Alternate types
+---------------
+
+Syntax::
+
+    ALTERNATE = { 'alternate': STRING,
+                  'data': ALTERNATIVES,
+                  '*if': COND,
+                  '*features': FEATURES }
+    ALTERNATIVES = { ALTERNATIVE, ... }
+    ALTERNATIVE = STRING : STRING
+                | STRING : { 'type': STRING, '*if': COND }
+
+Member 'alternate' names the alternate type.
+
+Each ALTERNATIVE of the 'data' object defines a branch of the
+alternate. An alternate must have at least one branch.
+
+The ALTERNATIVE's STRING name is the branch name.
+
+The ALTERNATIVE's value defines the branch's properties, in particular
+its type. The form STRING is shorthand for :code:`{ 'type': STRING }`.
+
+Example::
+
+    { 'alternate': 'BlockdevRef',
+      'data': { 'definition': 'BlockdevOptions',
+                'reference': 'str' } }
+
+An alternate type is like a union type, except there is no
+discriminator on the wire. Instead, the branch to use is inferred
+from the value. An alternate can only express a choice between types
+represented differently on the wire.
+
+If a branch is typed as the 'bool' built-in, the alternate accepts
+true and false; if it is typed as any of the various numeric
+built-ins, it accepts a JSON number; if it is typed as a 'str'
+built-in or named enum type, it accepts a JSON string; if it is typed
+as the 'null' built-in, it accepts JSON null; and if it is typed as a
+complex type (struct or union), it accepts a JSON object.
+
+The example alternate declaration above allows using both of the
+following example objects::
+
+    { "file": "my_existing_block_device_id" }
+    { "file": { "driver": "file",
+                "read-only": false,
+                "filename": "/tmp/mydisk.qcow2" } }
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
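+
+As a further illustration of how the branch is inferred from the wire
+value, consider a hypothetical alternate (the type and branch names are
+invented) with one numeric and one string branch::
+
+    { 'alternate': 'SizeOrName',
+      'data': { 'size': 'size', 'name': 'str' } }
+
+On the wire, a JSON number such as ``1048576`` selects the 'size'
+branch, while a JSON string such as ``"default"`` selects the 'name'
+branch; no discriminator is transmitted either way.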
+
+
+Commands
+--------
+
+Syntax::
+
+    COMMAND = { 'command': STRING,
+                (
+                '*data': ( MEMBERS | STRING ),
+                |
+                'data': STRING,
+                'boxed': true,
+                )
+                '*returns': TYPE-REF,
+                '*success-response': false,
+                '*gen': false,
+                '*allow-oob': true,
+                '*allow-preconfig': true,
+                '*coroutine': true,
+                '*if': COND,
+                '*features': FEATURES }
+
+Member 'command' names the command.
+
+Member 'data' defines the arguments. It defaults to an empty MEMBERS_
+object.
+
+If 'data' is a MEMBERS_ object, then MEMBERS defines arguments just
+like a struct type's 'data' defines struct type members.
+
+If 'data' is a STRING, then STRING names a complex type whose members
+are the arguments. A union type requires ``'boxed': true``.
+
+Member 'returns' defines the command's return type. It defaults to an
+empty struct type. It must normally be a complex type or an array of
+a complex type. To return anything else, the command must be listed
+in pragma 'command-returns-exceptions'. If you do this, extending
+the command to return additional information will be harder. Use of
+the pragma for new commands is strongly discouraged.
+
+A command's error responses are not specified in the QAPI schema.
+Error conditions should be documented in comments.
+
+In the Client JSON Protocol, the value of the "execute" or "exec-oob"
+member is the command name. The value of the "arguments" member then
+has to conform to the arguments, and the value of the success
+response's "return" member will conform to the return type.
+
+Some example commands::
+
+    { 'command': 'my-first-command',
+      'data': { 'arg1': 'str', '*arg2': 'str' } }
+    { 'struct': 'MyType', 'data': { '*value': 'str' } }
+    { 'command': 'my-second-command',
+      'returns': [ 'MyType' ] }
+
+which would validate this Client JSON Protocol transaction::
+
+    => { "execute": "my-first-command",
+         "arguments": { "arg1": "hello" } }
+    <= { "return": { } }
+    => { "execute": "my-second-command" }
+    <= { "return": [ { "value": "one" }, { } ] }
+
+The generator emits a prototype for the C function implementing the
+command. The function itself needs to be written by hand. See
+section `Code generated for commands`_ for examples.
+
+The function returns the return type. When member 'boxed' is absent,
+it takes the command arguments as arguments one by one, in QAPI schema
+order. Else it takes them wrapped in the C struct generated for the
+complex argument type. It takes an additional ``Error **`` argument in
+either case.
+
+The generator also emits a marshalling function that extracts
+arguments for the user's function out of an input QDict, calls the
+user's function, and if it succeeded, builds an output QObject from
+its return value. This is for use by the QMP monitor core.
+
+In rare cases, QAPI cannot express a type-safe representation of a
+corresponding Client JSON Protocol command. You then have to suppress
+generation of a marshalling function by including a member 'gen' with
+boolean value false, and instead write your own function. For
+example::
+
+    { 'command': 'netdev_add',
+      'data': {'type': 'str', 'id': 'str'},
+      'gen': false }
+
+Please try to avoid adding new commands that rely on this, and instead
+use type-safe unions.
+
+Normally, the QAPI schema is used to describe synchronous exchanges,
+where a response is expected. But in some cases, the action of a
+command is expected to change state in a way that a successful
+response is not possible (although the command will still return an
+error object on failure).
When a successful reply is not possible, +the command definition includes the optional member 'success-response' +with boolean value false. So far, only QGA makes use of this member. + +Member 'allow-oob' declares whether the command supports out-of-band +(OOB) execution. It defaults to false. For example:: + + { 'command': 'migrate_recover', + 'data': { 'uri': 'str' }, 'allow-oob': true } + +See qmp-spec.txt for out-of-band execution syntax and semantics. + +Commands supporting out-of-band execution can still be executed +in-band. + +When a command is executed in-band, its handler runs in the main +thread with the BQL held. + +When a command is executed out-of-band, its handler runs in a +dedicated monitor I/O thread with the BQL *not* held. + +An OOB-capable command handler must satisfy the following conditions: + +- It terminates quickly. +- It does not invoke system calls that may block. +- It does not access guest RAM that may block when userfaultfd is + enabled for postcopy live migration. +- It takes only "fast" locks, i.e. all critical sections protected by + any lock it takes also satisfy the conditions for OOB command + handler code. + +The restrictions on locking limit access to shared state. Such access +requires synchronization, but OOB commands can't take the BQL or any +other "slow" lock. + +When in doubt, do not implement OOB execution support. + +Member 'allow-preconfig' declares whether the command is available +before the machine is built. It defaults to false. For example:: + + { 'enum': 'QMPCapability', + 'data': [ 'oob' ] } + { 'command': 'qmp_capabilities', + 'data': { '*enable': [ 'QMPCapability' ] }, + 'allow-preconfig': true } + +QMP is available before the machine is built only when QEMU was +started with --preconfig. + +Member 'coroutine' tells the QMP dispatcher whether the command handler +is safe to be run in a coroutine. It defaults to false. If it is true, +the command handler is called from coroutine context and may yield while +waiting for an external event (such as I/O completion) in order to avoid +blocking the guest and other background operations. + +Coroutine safety can be hard to prove, similar to thread safety. Common +pitfalls are: + +- The global mutex isn't held across ``qemu_coroutine_yield()``, so + operations that used to assume that they execute atomically may have + to be more careful to protect against changes in the global state. + +- Nested event loops (``AIO_WAIT_WHILE()`` etc.) are problematic in + coroutine context and can easily lead to deadlocks. They should be + replaced by yielding and reentering the coroutine when the condition + becomes false. + +Since the command handler may assume coroutine context, any callers +other than the QMP dispatcher must also call it in coroutine context. +In particular, HMP commands calling such a QMP command handler must be +marked ``.coroutine = true`` in hmp-commands.hx. + +It is an error to specify both ``'coroutine': true`` and ``'allow-oob': true`` +for a command. We don't currently have a use case for both together and +without a use case, it's not entirely clear what the semantics should +be. + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. 
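+
+As a further, hypothetical example (the command name is invented), a
+handler that needs to wait for I/O completion could opt in to coroutine
+dispatch like this::
+
+    { 'command': 'example-flush-device',
+      'data': { 'device': 'str' },
+      'coroutine': true }
+
+The QMP dispatcher would then call the handler in coroutine context,
+subject to the restrictions described above.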
+ + +Events +------ + +Syntax:: + + EVENT = { 'event': STRING, + ( + '*data': ( MEMBERS | STRING ), + | + 'data': STRING, + 'boxed': true, + ) + '*if': COND, + '*features': FEATURES } + +Member 'event' names the event. This is the event name used in the +Client JSON Protocol. + +Member 'data' defines the event-specific data. It defaults to an +empty MEMBERS object. + +If 'data' is a MEMBERS object, then MEMBERS defines event-specific +data just like a struct type's 'data' defines struct type members. + +If 'data' is a STRING, then STRING names a complex type whose members +are the event-specific data. A union type requires ``'boxed': true``. + +An example event is:: + + { 'event': 'EVENT_C', + 'data': { '*a': 'int', 'b': 'str' } } + +Resulting in this JSON object:: + + { "event": "EVENT_C", + "data": { "b": "test string" }, + "timestamp": { "seconds": 1267020223, "microseconds": 435656 } } + +The generator emits a function to send the event. When member 'boxed' +is absent, it takes event-specific data one by one, in QAPI schema +order. Else it takes them wrapped in the C struct generated for the +complex type. See section `Code generated for events`_ for examples. + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. + + +.. _FEATURE: + +Features +-------- + +Syntax:: + + FEATURES = [ FEATURE, ... ] + FEATURE = STRING + | { 'name': STRING, '*if': COND } + +Sometimes, the behaviour of QEMU changes compatibly, but without a +change in the QMP syntax (usually by allowing values or operations +that previously resulted in an error). QMP clients may still need to +know whether the extension is available. + +For this purpose, a list of features can be specified for a command or +struct type. Each list member can either be ``{ 'name': STRING, '*if': +COND }``, or STRING, which is shorthand for ``{ 'name': STRING }``. + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +Example:: + + { 'struct': 'TestType', + 'data': { 'number': 'int' }, + 'features': [ 'allow-negative-numbers' ] } + +The feature strings are exposed to clients in introspection, as +explained in section `Client JSON Protocol introspection`_. + +Intended use is to have each feature string signal that this build of +QEMU shows a certain behaviour. + + +Special features +~~~~~~~~~~~~~~~~ + +Feature "deprecated" marks a command, event, enum value, or struct +member as deprecated. It is not supported elsewhere so far. +Interfaces so marked may be withdrawn in future releases in accordance +with QEMU's deprecation policy. + +Feature "unstable" marks a command, event, enum value, or struct +member as unstable. It is not supported elsewhere so far. Interfaces +so marked may be withdrawn or changed incompatibly in future releases. + + +Naming rules and reserved names +------------------------------- + +All names must begin with a letter, and contain only ASCII letters, +digits, hyphen, and underscore. There are two exceptions: enum values +may start with a digit, and names that are downstream extensions (see +section `Downstream extensions`_) start with underscore. + +Names beginning with ``q_`` are reserved for the generator, which uses +them for munging QMP names that resemble C keywords or other +problematic strings. For example, a member named ``default`` in qapi +becomes ``q_default`` in the generated C code. 
+
+Types, commands, and events share a common namespace. Therefore,
+generally speaking, type definitions should always use CamelCase for
+user-defined type names, while built-in types are lowercase.
+
+Type names ending with ``Kind`` or ``List`` are reserved for the
+generator, which uses them for implicit union enums and array types,
+respectively.
+
+Command names, and member names within a type, should be all lower
+case with words separated by a hyphen. However, some existing older
+commands and complex types use underscore; when extending them,
+consistency is preferred over blindly avoiding underscore.
+
+Event names should be ALL_CAPS with words separated by underscore.
+
+Member name ``u`` and names starting with ``has-`` or ``has_`` are reserved
+for the generator, which uses them for unions and for tracking
+optional members.
+
+Names beginning with ``x-`` used to signify "experimental". This
+convention has been replaced by special feature "unstable".
+
+Pragmas ``command-name-exceptions`` and ``member-name-exceptions`` let
+you violate naming rules. Use for new code is strongly discouraged. See
+`Pragma directives`_ for details.
+
+
+Downstream extensions
+---------------------
+
+QAPI schema names that are externally visible, say in the Client JSON
+Protocol, need to be managed with care. Names starting with a
+downstream prefix of the form __RFQDN_ are reserved for the downstream
+who controls the valid, reverse fully qualified domain name RFQDN.
+RFQDN may only contain ASCII letters, digits, hyphen and period.
+
+Example: Red Hat, Inc. controls redhat.com, and may therefore add a
+downstream command ``__com.redhat_drive-mirror``.
+
+
+Configuring the schema
+----------------------
+
+Syntax::
+
+    COND = STRING
+         | { 'all': [ COND, ... ] }
+         | { 'any': [ COND, ... ] }
+         | { 'not': COND }
+
+All definitions take an optional 'if' member. Its value must be a
+string, or an object with a single member 'all', 'any' or 'not'.
+
+The C code generated for the definition will then be guarded by an #if
+preprocessing directive with an operand generated from that condition:
+
+ * STRING will generate defined(STRING)
+ * { 'all': [COND, ...] } will generate (COND && ...)
+ * { 'any': [COND, ...] } will generate (COND || ...)
+ * { 'not': COND } will generate !COND
+
+Example: a conditional struct ::
+
+    { 'struct': 'IfStruct', 'data': { 'foo': 'int' },
+      'if': { 'all': [ 'CONFIG_FOO', 'HAVE_BAR' ] } }
+
+gets its generated code guarded like this::
+
+    #if defined(CONFIG_FOO) && defined(HAVE_BAR)
+    ... generated code ...
+    #endif /* defined(HAVE_BAR) && defined(CONFIG_FOO) */
+
+Individual members of complex types, command arguments, and
+event-specific data can also be made conditional. This requires the
+longhand form of MEMBER.
+
+Example: a struct type with unconditional member 'foo' and conditional
+member 'bar' ::
+
+    { 'struct': 'IfStruct',
+      'data': { 'foo': 'int',
+                'bar': { 'type': 'int', 'if': 'IFCOND'} } }
+
+A union's discriminator may not be conditional.
+
+Likewise, individual enumeration values can be conditional. This
+requires the longhand form of ENUM-VALUE_.
+
+Example: an enum type with unconditional value 'foo' and conditional
+value 'bar' ::
+
+    { 'enum': 'IfEnum',
+      'data': [ 'foo',
+                { 'name' : 'bar', 'if': 'IFCOND' } ] }
+
+Likewise, features can be conditional. This requires the longhand
+form of FEATURE_.
+ +Example: a struct with conditional feature 'allow-negative-numbers' :: + + { 'struct': 'TestType', + 'data': { 'number': 'int' }, + 'features': [ { 'name': 'allow-negative-numbers', + 'if': 'IFCOND' } ] } + +Please note that you are responsible to ensure that the C code will +compile with an arbitrary combination of conditions, since the +generator is unable to check it at this point. + +The conditions apply to introspection as well, i.e. introspection +shows a conditional entity only when the condition is satisfied in +this particular build. + + +Documentation comments +---------------------- + +A multi-line comment that starts and ends with a ``##`` line is a +documentation comment. + +If the documentation comment starts like :: + + ## + # @SYMBOL: + +it documents the definition of SYMBOL, else it's free-form +documentation. + +See below for more on `Definition documentation`_. + +Free-form documentation may be used to provide additional text and +structuring content. + + +Headings and subheadings +~~~~~~~~~~~~~~~~~~~~~~~~ + +A free-form documentation comment containing a line which starts with +some ``=`` symbols and then a space defines a section heading:: + + ## + # = This is a top level heading + # + # This is a free-form comment which will go under the + # top level heading. + ## + + ## + # == This is a second level heading + ## + +A heading line must be the first line of the documentation +comment block. + +Section headings must always be correctly nested, so you can only +define a third-level heading inside a second-level heading, and so on. + + +Documentation markup +~~~~~~~~~~~~~~~~~~~~ + +Documentation comments can use most rST markup. In particular, +a ``::`` literal block can be used for examples:: + + # :: + # + # Text of the example, may span + # multiple lines + +``*`` starts an itemized list:: + + # * First item, may span + # multiple lines + # * Second item + +You can also use ``-`` instead of ``*``. + +A decimal number followed by ``.`` starts a numbered list:: + + # 1. First item, may span + # multiple lines + # 2. Second item + +The actual number doesn't matter. + +Lists of either kind must be preceded and followed by a blank line. +If a list item's text spans multiple lines, then the second and +subsequent lines must be correctly indented to line up with the +first character of the first line. + +The usual ****strong****, *\*emphasized\** and ````literal```` markup +should be used. If you need a single literal ``*``, you will need to +backslash-escape it. As an extension beyond the usual rST syntax, you +can also use ``@foo`` to reference a name in the schema; this is rendered +the same way as ````foo````. + +Example:: + + ## + # Some text foo with **bold** and *emphasis* + # 1. with a list + # 2. like that + # + # And some code: + # + # :: + # + # $ echo foo + # -> do this + # <- get that + ## + + +Definition documentation +~~~~~~~~~~~~~~~~~~~~~~~~ + +Definition documentation, if present, must immediately precede the +definition it documents. + +When documentation is required (see pragma_ 'doc-required'), every +definition must have documentation. + +Definition documentation starts with a line naming the definition, +followed by an optional overview, a description of each argument (for +commands and events), member (for structs and unions), branch (for +alternates), or value (for enums), a description of each feature (if +any), and finally optional tagged sections. + +The description of an argument or feature 'name' starts with +'\@name:'. 
The description text can start on the line following the +'\@name:', in which case it must not be indented at all. It can also +start on the same line as the '\@name:'. In this case if it spans +multiple lines then second and subsequent lines must be indented to +line up with the first character of the first line of the +description:: + + # @argone: + # This is a two line description + # in the first style. + # + # @argtwo: This is a two line description + # in the second style. + +The number of spaces between the ':' and the text is not significant. + +.. admonition:: FIXME + + The parser accepts these things in almost any order. + +.. admonition:: FIXME + + union branches should be described, too. + +Extensions added after the definition was first released carry a +'(since x.y.z)' comment. + +The feature descriptions must be preceded by a line "Features:", like +this:: + + # Features: + # @feature: Description text + +A tagged section starts with one of the following words: +"Note:"/"Notes:", "Since:", "Example"/"Examples", "Returns:", "TODO:". +The section ends with the start of a new section. + +The text of a section can start on a new line, in +which case it must not be indented at all. It can also start +on the same line as the 'Note:', 'Returns:', etc tag. In this +case if it spans multiple lines then second and subsequent +lines must be indented to match the first, in the same way as +multiline argument descriptions. + +A 'Since: x.y.z' tagged section lists the release that introduced the +definition. + +An 'Example' or 'Examples' section is automatically rendered +entirely as literal fixed-width text. In other sections, +the text is formatted, and rST markup can be used. + +For example:: + + ## + # @BlockStats: + # + # Statistics of a virtual block device or a block backing device. + # + # @device: If the stats are for a virtual block device, the name + # corresponding to the virtual block device. + # + # @node-name: The node name of the device. (since 2.3) + # + # ... more members ... + # + # Since: 0.14.0 + ## + { 'struct': 'BlockStats', + 'data': {'*device': 'str', '*node-name': 'str', + ... more members ... } } + + ## + # @query-blockstats: + # + # Query the @BlockStats for all virtual block devices. + # + # @query-nodes: If true, the command will query all the + # block nodes ... explain, explain ... (since 2.3) + # + # Returns: A list of @BlockStats for each virtual block devices. + # + # Since: 0.14.0 + # + # Example: + # + # -> { "execute": "query-blockstats" } + # <- { + # ... lots of output ... + # } + # + ## + { 'command': 'query-blockstats', + 'data': { '*query-nodes': 'bool' }, + 'returns': ['BlockStats'] } + + +Client JSON Protocol introspection +================================== + +Clients of a Client JSON Protocol commonly need to figure out what +exactly the server (QEMU) supports. + +For this purpose, QMP provides introspection via command +query-qmp-schema. QGA currently doesn't support introspection. + +While Client JSON Protocol wire compatibility should be maintained +between qemu versions, we cannot make the same guarantees for +introspection stability. For example, one version of qemu may provide +a non-variant optional member of a struct, and a later version rework +the member to instead be non-optional and associated with a variant. 
+Likewise, one version of qemu may list a member with open-ended type +'str', and a later version could convert it to a finite set of strings +via an enum type; or a member may be converted from a specific type to +an alternate that represents a choice between the original type and +something else. + +query-qmp-schema returns a JSON array of SchemaInfo objects. These +objects together describe the wire ABI, as defined in the QAPI schema. +There is no specified order to the SchemaInfo objects returned; a +client must search for a particular name throughout the entire array +to learn more about that name, but is at least guaranteed that there +will be no collisions between type, command, and event names. + +However, the SchemaInfo can't reflect all the rules and restrictions +that apply to QMP. It's interface introspection (figuring out what's +there), not interface specification. The specification is in the QAPI +schema. To understand how QMP is to be used, you need to study the +QAPI schema. + +Like any other command, query-qmp-schema is itself defined in the QAPI +schema, along with the SchemaInfo type. This text attempts to give an +overview how things work. For details you need to consult the QAPI +schema. + +SchemaInfo objects have common members "name", "meta-type", +"features", and additional variant members depending on the value of +meta-type. + +Each SchemaInfo object describes a wire ABI entity of a certain +meta-type: a command, event or one of several kinds of type. + +SchemaInfo for commands and events have the same name as in the QAPI +schema. + +Command and event names are part of the wire ABI, but type names are +not. Therefore, the SchemaInfo for types have auto-generated +meaningless names. For readability, the examples in this section use +meaningful type names instead. + +Optional member "features" exposes the entity's feature strings as a +JSON array of strings. + +To examine a type, start with a command or event using it, then follow +references by name. + +QAPI schema definitions not reachable that way are omitted. + +The SchemaInfo for a command has meta-type "command", and variant +members "arg-type", "ret-type" and "allow-oob". On the wire, the +"arguments" member of a client's "execute" command must conform to the +object type named by "arg-type". The "return" member that the server +passes in a success response conforms to the type named by "ret-type". +When "allow-oob" is true, it means the command supports out-of-band +execution. It defaults to false. + +If the command takes no arguments, "arg-type" names an object type +without members. Likewise, if the command returns nothing, "ret-type" +names an object type without members. + +Example: the SchemaInfo for command query-qmp-schema :: + + { "name": "query-qmp-schema", "meta-type": "command", + "arg-type": "q_empty", "ret-type": "SchemaInfoList" } + + Type "q_empty" is an automatic object type without members, and type + "SchemaInfoList" is the array of SchemaInfo type. + +The SchemaInfo for an event has meta-type "event", and variant member +"arg-type". On the wire, a "data" member that the server passes in an +event conforms to the object type named by "arg-type". + +If the event carries no additional information, "arg-type" names an +object type without members. The event may not have a data member on +the wire then. + +Each command or event defined with 'data' as MEMBERS object in the +QAPI schema implicitly defines an object type. 
+ +Example: the SchemaInfo for EVENT_C from section Events_ :: + + { "name": "EVENT_C", "meta-type": "event", + "arg-type": "q_obj-EVENT_C-arg" } + + Type "q_obj-EVENT_C-arg" is an implicitly defined object type with + the two members from the event's definition. + +The SchemaInfo for struct and union types has meta-type "object". + +The SchemaInfo for a struct type has variant member "members". + +The SchemaInfo for a union type additionally has variant members "tag" +and "variants". + +"members" is a JSON array describing the object's common members, if +any. Each element is a JSON object with members "name" (the member's +name), "type" (the name of its type), "features" (a JSON array of +feature strings), and "default". The latter two are optional. The +member is optional if "default" is present. Currently, "default" can +only have value null. Other values are reserved for future +extensions. The "members" array is in no particular order; clients +must search the entire object when learning whether a particular +member is supported. + +Example: the SchemaInfo for MyType from section `Struct types`_ :: + + { "name": "MyType", "meta-type": "object", + "members": [ + { "name": "member1", "type": "str" }, + { "name": "member2", "type": "int" }, + { "name": "member3", "type": "str", "default": null } ] } + +"features" exposes the command's feature strings as a JSON array of +strings. + +Example: the SchemaInfo for TestType from section Features_:: + + { "name": "TestType", "meta-type": "object", + "members": [ + { "name": "number", "type": "int" } ], + "features": ["allow-negative-numbers"] } + +"tag" is the name of the common member serving as type tag. +"variants" is a JSON array describing the object's variant members. +Each element is a JSON object with members "case" (the value of type +tag this element applies to) and "type" (the name of an object type +that provides the variant members for this type tag value). The +"variants" array is in no particular order, and is not guaranteed to +list cases in the same order as the corresponding "tag" enum type. + +Example: the SchemaInfo for union BlockdevOptions from section +`Union types`_ :: + + { "name": "BlockdevOptions", "meta-type": "object", + "members": [ + { "name": "driver", "type": "BlockdevDriver" }, + { "name": "read-only", "type": "bool", "default": null } ], + "tag": "driver", + "variants": [ + { "case": "file", "type": "BlockdevOptionsFile" }, + { "case": "qcow2", "type": "BlockdevOptionsQcow2" } ] } + +Note that base types are "flattened": its members are included in the +"members" array. + +The SchemaInfo for an alternate type has meta-type "alternate", and +variant member "members". "members" is a JSON array. Each element is +a JSON object with member "type", which names a type. Values of the +alternate type conform to exactly one of its member types. There is +no guarantee on the order in which "members" will be listed. + +Example: the SchemaInfo for BlockdevRef from section `Alternate types`_ :: + + { "name": "BlockdevRef", "meta-type": "alternate", + "members": [ + { "type": "BlockdevOptions" }, + { "type": "str" } ] } + +The SchemaInfo for an array type has meta-type "array", and variant +member "element-type", which names the array's element type. Array +types are implicitly defined. For convenience, the array's name may +resemble the element type; however, clients should examine member +"element-type" instead of making assumptions based on parsing member +"name". 
+ +Example: the SchemaInfo for ['str'] :: + + { "name": "[str]", "meta-type": "array", + "element-type": "str" } + +The SchemaInfo for an enumeration type has meta-type "enum" and +variant member "members". + +"members" is a JSON array describing the enumeration values. Each +element is a JSON object with member "name" (the member's name), and +optionally "features" (a JSON array of feature strings). The +"members" array is in no particular order; clients must search the +entire array when learning whether a particular value is supported. + +Example: the SchemaInfo for MyEnum from section `Enumeration types`_ :: + + { "name": "MyEnum", "meta-type": "enum", + "members": [ + { "name": "value1" }, + { "name": "value2" }, + { "name": "value3" } + ] } + +The SchemaInfo for a built-in type has the same name as the type in +the QAPI schema (see section `Built-in Types`_), with one exception +detailed below. It has variant member "json-type" that shows how +values of this type are encoded on the wire. + +Example: the SchemaInfo for str :: + + { "name": "str", "meta-type": "builtin", "json-type": "string" } + +The QAPI schema supports a number of integer types that only differ in +how they map to C. They are identical as far as SchemaInfo is +concerned. Therefore, they get all mapped to a single type "int" in +SchemaInfo. + +As explained above, type names are not part of the wire ABI. Not even +the names of built-in types. Clients should examine member +"json-type" instead of hard-coding names of built-in types. + + +Compatibility considerations +============================ + +Maintaining backward compatibility at the Client JSON Protocol level +while evolving the schema requires some care. This section is about +syntactic compatibility, which is necessary, but not sufficient, for +actual compatibility. + +Clients send commands with argument data, and receive command +responses with return data and events with event data. + +Adding opt-in functionality to the send direction is backwards +compatible: adding commands, optional arguments, enumeration values, +union and alternate branches; turning an argument type into an +alternate of that type; making mandatory arguments optional. Clients +oblivious of the new functionality continue to work. + +Incompatible changes include removing commands, command arguments, +enumeration values, union and alternate branches, adding mandatory +command arguments, and making optional arguments mandatory. + +The specified behavior of an absent optional argument should remain +the same. With proper documentation, this policy still allows some +flexibility; for example, when an optional 'buffer-size' argument is +specified to default to a sensible buffer size, the actual default +value can still be changed. The specified default behavior is not the +exact size of the buffer, only that the default size is sensible. + +Adding functionality to the receive direction is generally backwards +compatible: adding events, adding return and event data members. +Clients are expected to ignore the ones they don't know. + +Removing "unreachable" stuff like events that can't be triggered +anymore, optional return or event data members that can't be sent +anymore, and return or event data member (enumeration) values that +can't be sent anymore makes no difference to clients, except for +introspection. The latter can conceivably confuse clients, so tread +carefully. + +Incompatible changes include removing return and event data members. 
+ +Any change to a command definition's 'data' or one of the types used +there (recursively) needs to consider send direction compatibility. + +Any change to a command definition's 'return', an event definition's +'data', or one of the types used there (recursively) needs to consider +receive direction compatibility. + +Any change to types used in both contexts need to consider both. + +Enumeration type values and complex and alternate type members may be +reordered freely. For enumerations and alternate types, this doesn't +affect the wire encoding. For complex types, this might make the +implementation emit JSON object members in a different order, which +the Client JSON Protocol permits. + +Since type names are not visible in the Client JSON Protocol, types +may be freely renamed. Even certain refactorings are invisible, such +as splitting members from one type into a common base type. + + +Code generation +=============== + +The QAPI code generator qapi-gen.py generates code and documentation +from the schema. Together with the core QAPI libraries, this code +provides everything required to take JSON commands read in by a Client +JSON Protocol server, unmarshal the arguments into the underlying C +types, call into the corresponding C function, map the response back +to a Client JSON Protocol response to be returned to the user, and +introspect the commands. + +As an example, we'll use the following schema, which describes a +single complex user-defined type, along with command which takes a +list of that type as a parameter, and returns a single element of that +type. The user is responsible for writing the implementation of +qmp_my_command(); everything else is produced by the generator. :: + + $ cat example-schema.json + { 'struct': 'UserDefOne', + 'data': { 'integer': 'int', '*string': 'str' } } + + { 'command': 'my-command', + 'data': { 'arg1': ['UserDefOne'] }, + 'returns': 'UserDefOne' } + + { 'event': 'MY_EVENT' } + +We run qapi-gen.py like this:: + + $ python scripts/qapi-gen.py --output-dir="qapi-generated" \ + --prefix="example-" example-schema.json + +For a more thorough look at generated code, the testsuite includes +tests/qapi-schema/qapi-schema-tests.json that covers more examples of +what the generator will accept, and compiles the resulting C code as +part of 'make check-unit'. + + +Code generated for QAPI types +----------------------------- + +The following files are created: + + ``$(prefix)qapi-types.h`` + C types corresponding to types defined in the schema + + ``$(prefix)qapi-types.c`` + Cleanup functions for the above C types + +The $(prefix) is an optional parameter used as a namespace to keep the +generated code from one schema/code-generation separated from others so code +can be generated/used from multiple schemas without clobbering previously +created code. + +Example:: + + $ cat qapi-generated/example-qapi-types.h + [Uninteresting stuff omitted...] 
+ + #ifndef EXAMPLE_QAPI_TYPES_H + #define EXAMPLE_QAPI_TYPES_H + + #include "qapi/qapi-builtin-types.h" + + typedef struct UserDefOne UserDefOne; + + typedef struct UserDefOneList UserDefOneList; + + typedef struct q_obj_my_command_arg q_obj_my_command_arg; + + struct UserDefOne { + int64_t integer; + bool has_string; + char *string; + }; + + void qapi_free_UserDefOne(UserDefOne *obj); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(UserDefOne, qapi_free_UserDefOne) + + struct UserDefOneList { + UserDefOneList *next; + UserDefOne *value; + }; + + void qapi_free_UserDefOneList(UserDefOneList *obj); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(UserDefOneList, qapi_free_UserDefOneList) + + struct q_obj_my_command_arg { + UserDefOneList *arg1; + }; + + #endif /* EXAMPLE_QAPI_TYPES_H */ + $ cat qapi-generated/example-qapi-types.c + [Uninteresting stuff omitted...] + + void qapi_free_UserDefOne(UserDefOne *obj) + { + Visitor *v; + + if (!obj) { + return; + } + + v = qapi_dealloc_visitor_new(); + visit_type_UserDefOne(v, NULL, &obj, NULL); + visit_free(v); + } + + void qapi_free_UserDefOneList(UserDefOneList *obj) + { + Visitor *v; + + if (!obj) { + return; + } + + v = qapi_dealloc_visitor_new(); + visit_type_UserDefOneList(v, NULL, &obj, NULL); + visit_free(v); + } + + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into :: + + SUBDIR/$(prefix)qapi-types-SUBMODULE.h + SUBDIR/$(prefix)qapi-types-SUBMODULE.c + +If qapi-gen.py is run with option --builtins, additional files are +created: + + ``qapi-builtin-types.h`` + C types corresponding to built-in types + + ``qapi-builtin-types.c`` + Cleanup functions for the above C types + + +Code generated for visiting QAPI types +-------------------------------------- + +These are the visitor functions used to walk through and convert +between a native QAPI C data structure and some other format (such as +QObject); the generated functions are named visit_type_FOO() and +visit_type_FOO_members(). + +The following files are generated: + + ``$(prefix)qapi-visit.c`` + Visitor function for a particular C type, used to automagically + convert QObjects into the corresponding C type and vice-versa, as + well as for deallocating memory for an existing C type + + ``$(prefix)qapi-visit.h`` + Declarations for previously mentioned visitor functions + +Example:: + + $ cat qapi-generated/example-qapi-visit.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_VISIT_H + #define EXAMPLE_QAPI_VISIT_H + + #include "qapi/qapi-builtin-visit.h" + #include "example-qapi-types.h" + + + bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp); + + bool visit_type_UserDefOne(Visitor *v, const char *name, + UserDefOne **obj, Error **errp); + + bool visit_type_UserDefOneList(Visitor *v, const char *name, + UserDefOneList **obj, Error **errp); + + bool visit_type_q_obj_my_command_arg_members(Visitor *v, q_obj_my_command_arg *obj, Error **errp); + + #endif /* EXAMPLE_QAPI_VISIT_H */ + $ cat qapi-generated/example-qapi-visit.c + [Uninteresting stuff omitted...] 
+ + bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp) + { + if (!visit_type_int(v, "integer", &obj->integer, errp)) { + return false; + } + if (visit_optional(v, "string", &obj->has_string)) { + if (!visit_type_str(v, "string", &obj->string, errp)) { + return false; + } + } + return true; + } + + bool visit_type_UserDefOne(Visitor *v, const char *name, + UserDefOne **obj, Error **errp) + { + bool ok = false; + + if (!visit_start_struct(v, name, (void **)obj, sizeof(UserDefOne), errp)) { + return false; + } + if (!*obj) { + /* incomplete */ + assert(visit_is_dealloc(v)); + ok = true; + goto out_obj; + } + if (!visit_type_UserDefOne_members(v, *obj, errp)) { + goto out_obj; + } + ok = visit_check_struct(v, errp); + out_obj: + visit_end_struct(v, (void **)obj); + if (!ok && visit_is_input(v)) { + qapi_free_UserDefOne(*obj); + *obj = NULL; + } + return ok; + } + + bool visit_type_UserDefOneList(Visitor *v, const char *name, + UserDefOneList **obj, Error **errp) + { + bool ok = false; + UserDefOneList *tail; + size_t size = sizeof(**obj); + + if (!visit_start_list(v, name, (GenericList **)obj, size, errp)) { + return false; + } + + for (tail = *obj; tail; + tail = (UserDefOneList *)visit_next_list(v, (GenericList *)tail, size)) { + if (!visit_type_UserDefOne(v, NULL, &tail->value, errp)) { + goto out_obj; + } + } + + ok = visit_check_list(v, errp); + out_obj: + visit_end_list(v, (void **)obj); + if (!ok && visit_is_input(v)) { + qapi_free_UserDefOneList(*obj); + *obj = NULL; + } + return ok; + } + + bool visit_type_q_obj_my_command_arg_members(Visitor *v, q_obj_my_command_arg *obj, Error **errp) + { + if (!visit_type_UserDefOneList(v, "arg1", &obj->arg1, errp)) { + return false; + } + return true; + } + + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into :: + + SUBDIR/$(prefix)qapi-visit-SUBMODULE.h + SUBDIR/$(prefix)qapi-visit-SUBMODULE.c + +If qapi-gen.py is run with option --builtins, additional files are +created: + + ``qapi-builtin-visit.h`` + Visitor functions for built-in types + + ``qapi-builtin-visit.c`` + Declarations for these visitor functions + + +Code generated for commands +--------------------------- + +These are the marshaling/dispatch functions for the commands defined +in the schema. The generated code provides qmp_marshal_COMMAND(), and +declares qmp_COMMAND() that the user must implement. + +The following files are generated: + + ``$(prefix)qapi-commands.c`` + Command marshal/dispatch functions for each QMP command defined in + the schema + + ``$(prefix)qapi-commands.h`` + Function prototypes for the QMP commands specified in the schema + + ``$(prefix)qapi-init-commands.h`` + Command initialization prototype + + ``$(prefix)qapi-init-commands.c`` + Command initialization code + +Example:: + + $ cat qapi-generated/example-qapi-commands.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_COMMANDS_H + #define EXAMPLE_QAPI_COMMANDS_H + + #include "example-qapi-types.h" + + UserDefOne *qmp_my_command(UserDefOneList *arg1, Error **errp); + void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp); + + #endif /* EXAMPLE_QAPI_COMMANDS_H */ + $ cat qapi-generated/example-qapi-commands.c + [Uninteresting stuff omitted...] 
+ + + static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in, + QObject **ret_out, Error **errp) + { + Visitor *v; + + v = qobject_output_visitor_new_qmp(ret_out); + if (visit_type_UserDefOne(v, "unused", &ret_in, errp)) { + visit_complete(v, ret_out); + } + visit_free(v); + v = qapi_dealloc_visitor_new(); + visit_type_UserDefOne(v, "unused", &ret_in, NULL); + visit_free(v); + } + + void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp) + { + Error *err = NULL; + bool ok = false; + Visitor *v; + UserDefOne *retval; + q_obj_my_command_arg arg = {0}; + + v = qobject_input_visitor_new_qmp(QOBJECT(args)); + if (!visit_start_struct(v, NULL, NULL, 0, errp)) { + goto out; + } + if (visit_type_q_obj_my_command_arg_members(v, &arg, errp)) { + ok = visit_check_struct(v, errp); + } + visit_end_struct(v, NULL); + if (!ok) { + goto out; + } + + retval = qmp_my_command(arg.arg1, &err); + error_propagate(errp, err); + if (err) { + goto out; + } + + qmp_marshal_output_UserDefOne(retval, ret, errp); + + out: + visit_free(v); + v = qapi_dealloc_visitor_new(); + visit_start_struct(v, NULL, NULL, 0, NULL); + visit_type_q_obj_my_command_arg_members(v, &arg, NULL); + visit_end_struct(v, NULL); + visit_free(v); + } + + [Uninteresting stuff omitted...] + $ cat qapi-generated/example-qapi-init-commands.h + [Uninteresting stuff omitted...] + #ifndef EXAMPLE_QAPI_INIT_COMMANDS_H + #define EXAMPLE_QAPI_INIT_COMMANDS_H + + #include "qapi/qmp/dispatch.h" + + void example_qmp_init_marshal(QmpCommandList *cmds); + + #endif /* EXAMPLE_QAPI_INIT_COMMANDS_H */ + $ cat qapi-generated/example-qapi-init-commands.c + [Uninteresting stuff omitted...] + void example_qmp_init_marshal(QmpCommandList *cmds) + { + QTAILQ_INIT(cmds); + + qmp_register_command(cmds, "my-command", + qmp_marshal_my_command, QCO_NO_OPTIONS); + } + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into:: + + SUBDIR/$(prefix)qapi-commands-SUBMODULE.h + SUBDIR/$(prefix)qapi-commands-SUBMODULE.c + + +Code generated for events +------------------------- + +This is the code related to events defined in the schema, providing +qapi_event_send_EVENT(). + +The following files are created: + + ``$(prefix)qapi-events.h`` + Function prototypes for each event type + + ``$(prefix)qapi-events.c`` + Implementation of functions to send an event + + ``$(prefix)qapi-emit-events.h`` + Enumeration of all event names, and common event code declarations + + ``$(prefix)qapi-emit-events.c`` + Common event code definitions + +Example:: + + $ cat qapi-generated/example-qapi-events.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_EVENTS_H + #define EXAMPLE_QAPI_EVENTS_H + + #include "qapi/util.h" + #include "example-qapi-types.h" + + void qapi_event_send_my_event(void); + + #endif /* EXAMPLE_QAPI_EVENTS_H */ + $ cat qapi-generated/example-qapi-events.c + [Uninteresting stuff omitted...] + + void qapi_event_send_my_event(void) + { + QDict *qmp; + + qmp = qmp_event_build_dict("MY_EVENT"); + + example_qapi_event_emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp); + + qobject_unref(qmp); + } + + [Uninteresting stuff omitted...] + $ cat qapi-generated/example-qapi-emit-events.h + [Uninteresting stuff omitted...] 
+ + #ifndef EXAMPLE_QAPI_EMIT_EVENTS_H + #define EXAMPLE_QAPI_EMIT_EVENTS_H + + #include "qapi/util.h" + + typedef enum example_QAPIEvent { + EXAMPLE_QAPI_EVENT_MY_EVENT, + EXAMPLE_QAPI_EVENT__MAX, + } example_QAPIEvent; + + #define example_QAPIEvent_str(val) \ + qapi_enum_lookup(&example_QAPIEvent_lookup, (val)) + + extern const QEnumLookup example_QAPIEvent_lookup; + + void example_qapi_event_emit(example_QAPIEvent event, QDict *qdict); + + #endif /* EXAMPLE_QAPI_EMIT_EVENTS_H */ + $ cat qapi-generated/example-qapi-emit-events.c + [Uninteresting stuff omitted...] + + const QEnumLookup example_QAPIEvent_lookup = { + .array = (const char *const[]) { + [EXAMPLE_QAPI_EVENT_MY_EVENT] = "MY_EVENT", + }, + .size = EXAMPLE_QAPI_EVENT__MAX + }; + + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into :: + + SUBDIR/$(prefix)qapi-events-SUBMODULE.h + SUBDIR/$(prefix)qapi-events-SUBMODULE.c + + +Code generated for introspection +-------------------------------- + +The following files are created: + + ``$(prefix)qapi-introspect.c`` + Defines a string holding a JSON description of the schema + + ``$(prefix)qapi-introspect.h`` + Declares the above string + +Example:: + + $ cat qapi-generated/example-qapi-introspect.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_INTROSPECT_H + #define EXAMPLE_QAPI_INTROSPECT_H + + #include "qapi/qmp/qlit.h" + + extern const QLitObject example_qmp_schema_qlit; + + #endif /* EXAMPLE_QAPI_INTROSPECT_H */ + $ cat qapi-generated/example-qapi-introspect.c + [Uninteresting stuff omitted...] + + const QLitObject example_qmp_schema_qlit = QLIT_QLIST(((QLitObject[]) { + QLIT_QDICT(((QLitDictEntry[]) { + { "arg-type", QLIT_QSTR("0"), }, + { "meta-type", QLIT_QSTR("command"), }, + { "name", QLIT_QSTR("my-command"), }, + { "ret-type", QLIT_QSTR("1"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "arg-type", QLIT_QSTR("2"), }, + { "meta-type", QLIT_QSTR("event"), }, + { "name", QLIT_QSTR("MY_EVENT"), }, + {} + })), + /* "0" = q_obj_my-command-arg */ + QLIT_QDICT(((QLitDictEntry[]) { + { "members", QLIT_QLIST(((QLitObject[]) { + QLIT_QDICT(((QLitDictEntry[]) { + { "name", QLIT_QSTR("arg1"), }, + { "type", QLIT_QSTR("[1]"), }, + {} + })), + {} + })), }, + { "meta-type", QLIT_QSTR("object"), }, + { "name", QLIT_QSTR("0"), }, + {} + })), + /* "1" = UserDefOne */ + QLIT_QDICT(((QLitDictEntry[]) { + { "members", QLIT_QLIST(((QLitObject[]) { + QLIT_QDICT(((QLitDictEntry[]) { + { "name", QLIT_QSTR("integer"), }, + { "type", QLIT_QSTR("int"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "default", QLIT_QNULL, }, + { "name", QLIT_QSTR("string"), }, + { "type", QLIT_QSTR("str"), }, + {} + })), + {} + })), }, + { "meta-type", QLIT_QSTR("object"), }, + { "name", QLIT_QSTR("1"), }, + {} + })), + /* "2" = q_empty */ + QLIT_QDICT(((QLitDictEntry[]) { + { "members", QLIT_QLIST(((QLitObject[]) { + {} + })), }, + { "meta-type", QLIT_QSTR("object"), }, + { "name", QLIT_QSTR("2"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "element-type", QLIT_QSTR("1"), }, + { "meta-type", QLIT_QSTR("array"), }, + { "name", QLIT_QSTR("[1]"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "json-type", QLIT_QSTR("int"), }, + { "meta-type", QLIT_QSTR("builtin"), }, + { "name", QLIT_QSTR("int"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "json-type", QLIT_QSTR("string"), }, + { "meta-type", QLIT_QSTR("builtin"), }, + { "name", QLIT_QSTR("str"), }, + 
{}
+        })),
+        {}
+    }));
+
+    [Uninteresting stuff omitted...]
diff --git a/docs/devel/qgraph.rst b/docs/devel/qgraph.rst
new file mode 100644
index 000000000..43342d9d6
--- /dev/null
+++ b/docs/devel/qgraph.rst
@@ -0,0 +1,628 @@
+.. _qgraph:
+
+Qtest Driver Framework
+======================
+
+In order to test a specific driver, plain libqos tests need to
+take care of booting QEMU with the right machine and devices.
+This makes each test "hardcoded" for a specific configuration, reducing
+the coverage it can reach.
+
+For example, the sdhci device is supported on both x86_64 and ARM boards,
+therefore a generic sdhci test should test all machines and drivers that
+support that device.
+Using only libqos APIs, the test has to cover all the setups and build
+the correct command line manually.
+
+This also introduces backward compatibility issues: if a device/driver
+command line name is changed, all tests that use it will no longer work
+properly and need to be adjusted.
+
+The aim of qgraph is to create a graph of drivers, machines and tests such
+that a test aimed at a certain driver does not have to care about booting
+the right QEMU machine, picking the right device, building the command line
+and so on. Instead, it only defines what type of device it is testing
+(interface in qgraph terms) and the framework takes care of
+covering all supported types of devices and machine architectures.
+
+Following the above example, an interface would be ``sdhci``,
+so the sdhci-test only needs to link its qgraph node with
+that interface. In this way, if the command line of an sdhci driver
+is changed, only the respective qgraph driver node has to be adjusted.
+
+QGraph concepts
+---------------
+
+The graph is composed of nodes, which represent machines, drivers and
+tests, and of edges, which define the relationships between them
+(``CONSUMES``, ``PRODUCES``, and ``CONTAINS``).
+
+Nodes
+~~~~~
+
+A node can be of four types:
+
+- **QNODE_MACHINE**: for example ``arm/raspi2b``
+- **QNODE_DRIVER**: for example ``generic-sdhci``
+- **QNODE_INTERFACE**: for example ``sdhci`` (interface for all ``-sdhci``
+  drivers).
+  An interface is not explicitly created; it is instantiated automatically
+  when a node consumes or produces it.
+  An interface is simply a struct that abstracts the various drivers
+  for the same type of device, and offers an API to the nodes that
+  use it ("consume" relation in qgraph terms) that is implemented/backed
+  up by the drivers that implement it ("produce" relation in qgraph terms).
+- **QNODE_TEST**: for example ``sdhci-test``. A test consumes an interface
+  and tests the functions provided by it.
+
+Notes for the nodes:
+
+- QNODE_MACHINE: each machine struct must have a ``QGuestAllocator`` and
+  implement ``get_driver()`` to return the allocator mapped to the interface
+  "memory". The function can also return ``NULL`` if the allocator
+  is not set.
+- QNODE_DRIVER: driver names must be unique. Machines and nodes
+  planned to be "consumed" by other nodes must match the corresponding
+  QEMU driver names, otherwise they won't be discovered
+
+Edges
+~~~~~
+
+An edge relation between two nodes (drivers or machines) ``X`` and ``Y`` can be:
+
+- ``X CONSUMES Y``: ``Y`` can be plugged into ``X``
+- ``X PRODUCES Y``: ``X`` provides the interface ``Y``
+- ``X CONTAINS Y``: ``Y`` is part of the ``X`` component
+
+Execution steps
+~~~~~~~~~~~~~~~
+
+The basic framework steps are the following:
+
+- All nodes and edges are created in their respective
+  machine/driver/test files
+- The framework starts QEMU and asks for a list of available devices
+  and machines (note that only machines and "consumed" nodes are mapped
+  1:1 with QEMU devices)
+- The framework walks the graph starting from the available machines and
+  performs a Depth First Search for tests
+- Once a test is found, the path is walked again, all drivers along it are
+  allocated accordingly and the final interface is passed to the test
+- The test is executed
+- Unused objects are cleaned up and the path discovery is continued
+
+Depending on the QEMU binary used, only some drivers/machines will be
+available and only tests that are reached by them will be executed.
+
+Command line
+~~~~~~~~~~~~
+
+The command line is built from the node names and the optional arguments
+passed by the user when building the edges.
+
+There are three types of command line arguments:
+
+- ``in node`` : created from the node name. For example, machines add
+  ``-M <machine>`` to the command line, while devices add
+  ``-device <device>``. This is done automatically by the framework.
+- ``after node`` : added as an additional argument after the node name.
+  This argument is added optionally when creating edges,
+  by setting the parameters ``after_cmd_line`` and
+  ``extra_device_opts`` in ``QOSGraphEdgeOptions``.
+  The framework automatically adds
+  a comma before ``extra_device_opts``,
+  because it is going to add attributes
+  after the destination node pointed to by
+  the edge containing these options, and automatically
+  adds a space before ``after_cmd_line``, because it
+  adds an additional device, not an attribute.
+- ``before node`` : added as an additional argument before the node name.
+  This argument is added optionally when creating edges,
+  by setting the parameter ``before_cmd_line`` in
+  ``QOSGraphEdgeOptions``. This attribute
+  is going to add attributes before the destination node
+  pointed to by the edge containing these options. It is
+  helpful for commands that are not node-representable,
+  such as ``-fsdev`` or ``-netdev``.
+
+While the command line arguments attached to edges are always used, not all
+node names appear in every path walk: the contained or produced nodes
+are already added by QEMU, so only nodes that are "consumed" contribute to
+the command line. Also, nodes that have ``{ "abstract" : true }``
+as a QMP attribute lose their command line, since they are not proper
+devices to be added in QEMU.
+
+Example::
+
+    QOSGraphEdgeOptions opts = {
+        .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
+                           "file.read-zeroes=on,format=raw",
+        .after_cmd_line = "-device scsi-hd,bus=vs0.0,drive=drv0",
+        .extra_device_opts = "id=vs0",
+    };
+
+    qos_node_create_driver("virtio-scsi-device",
+                           virtio_scsi_device_create);
+    qos_node_consumes("virtio-scsi-device", "virtio-bus", &opts);
+
+This will produce the following command line:
+``-drive id=drv0,if=none,file=null-co://,file.read-zeroes=on,format=raw -device virtio-scsi-device,id=vs0 -device scsi-hd,bus=vs0.0,drive=drv0``
+
+Troubleshooting unavailable tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If there is no path from an available machine to a test then that test will be
+unavailable and won't execute. This can happen if a test or driver did not set
+up its qgraph node correctly. It can also happen if the necessary machine type
+or device is missing from the QEMU binary because it was compiled out or
+otherwise.
+
+It is possible to troubleshoot unavailable tests by running::
+
+  $ QTEST_QEMU_BINARY=build/qemu-system-x86_64 build/tests/qtest/qos-test --verbose
+  # ALL QGRAPH EDGES: {
+  #   src='virtio-net'
+  #      |-> dest='virtio-net-tests/vhost-user/multiqueue' type=2 (node=0x559142109e30)
+  #      |-> dest='virtio-net-tests/vhost-user/migrate' type=2 (node=0x559142109d00)
+  #   src='virtio-net-pci'
+  #      |-> dest='virtio-net' type=1 (node=0x55914210d740)
+  #   src='pci-bus'
+  #      |-> dest='virtio-net-pci' type=2 (node=0x55914210d880)
+  #   src='pci-bus-pc'
+  #      |-> dest='pci-bus' type=1 (node=0x559142103f40)
+  #   src='i440FX-pcihost'
+  #      |-> dest='pci-bus-pc' type=0 (node=0x55914210ac70)
+  #   src='x86_64/pc'
+  #      |-> dest='i440FX-pcihost' type=0 (node=0x5591421117f0)
+  #   src=''
+  #      |-> dest='x86_64/pc' type=0 (node=0x559142111600)
+  #      |-> dest='arm/raspi2b' type=0 (node=0x559142110740)
+  ...
+  # }
+  # ALL QGRAPH NODES: {
+  #   name='virtio-net-tests/announce-self' type=3 cmd_line='(null)' [available]
+  #   name='arm/raspi2b' type=0 cmd_line='-M raspi2b ' [UNAVAILABLE]
+  ...
+  # }
+
+The ``virtio-net-tests/announce-self`` test is listed as "available" in the
+"ALL QGRAPH NODES" output. This means the test will execute. We can follow the
+qgraph path in the "ALL QGRAPH EDGES" output as follows: '' -> 'x86_64/pc' ->
+'i440FX-pcihost' -> 'pci-bus-pc' -> 'pci-bus' -> 'virtio-net-pci' ->
+'virtio-net'. The root of the qgraph is '' and the depth first search begins
+there.
+
+The ``arm/raspi2b`` machine node is listed as "UNAVAILABLE". Although it is
+reachable from the root via '' -> 'arm/raspi2b' the node is unavailable because
+the QEMU binary did not list it when queried by the framework. This is expected
+because we used the ``qemu-system-x86_64`` binary which does not support ARM
+machine types.
+
+If a test is unexpectedly listed as "UNAVAILABLE", first check that the "ALL
+QGRAPH EDGES" output reports edge connectivity from the root ('') to the test.
+If there is no connectivity then the qgraph nodes were not set up correctly and
+the driver or test code is incorrect. If there is connectivity, check the
+availability of each node in the path in the "ALL QGRAPH NODES" output. The
+first unavailable node in the path is the reason why the test is unavailable.
+Typically this is because the QEMU binary lacks support for the necessary
+machine type or device.
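+
+Since this listing can be long, it is often convenient to filter it with
+ordinary shell tools; for example, to spot the unavailable nodes quickly
+(this is plain ``grep``, not a ``qos-test`` feature)::
+
+  $ QTEST_QEMU_BINARY=build/qemu-system-x86_64 \
+      build/tests/qtest/qos-test --verbose 2>&1 | grep UNAVAILABLE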
+
+Creating a new driver and its interface
+---------------------------------------
+
+Here we continue the ``sdhci`` use case, with the following scenario:
+
+- ``sdhci-test`` aims to test the ``readw()``, ``readq()`` and ``writeq()``
+  functions offered by the ``sdhci`` drivers.
+- The ``sdhci`` device is supported by both ``x86_64/pc`` and ARM
+  (in this example we focus on the ``arm/raspi2b``) machines.
+- QEMU offers two types of drivers: ``QSDHCI_MemoryMapped`` for ARM and
+  ``QSDHCI_PCI`` for ``x86_64/pc``. Both implement the
+  ``readw()``, ``readq()`` and ``writeq()`` functions.
+
+In order to implement such a scenario in qgraph, the test developer needs to:
+
+- Create the ``x86_64/pc`` machine node. This machine uses the
+  ``pci-bus`` architecture so it ``contains`` a PCI driver,
+  ``pci-bus-pc``. The actual path is
+
+  ``x86_64/pc --contains--> i440FX-pcihost --contains-->
+  pci-bus-pc --produces--> pci-bus``.
+
+  For the sake of this example,
+  we do not focus on the PCI interface implementation.
+- Create the ``sdhci-pci`` driver node, representing ``QSDHCI_PCI``.
+  The driver uses the PCI bus (and its API),
+  so it must ``consume`` the ``pci-bus`` generic interface (which abstracts
+  all the available PCI drivers)
+
+  ``sdhci-pci --consumes--> pci-bus``
+- Create an ``arm/raspi2b`` machine node. This machine ``contains``
+  a ``generic-sdhci`` memory mapped ``sdhci`` driver node, representing
+  ``QSDHCI_MemoryMapped``.
+
+  ``arm/raspi2b --contains--> generic-sdhci``
+- Create the ``sdhci`` interface node. This interface offers the
+  functions that are shared by all ``sdhci`` devices.
+  The interface is produced by ``sdhci-pci`` and ``generic-sdhci``,
+  the available architecture-specific drivers.
+
+  ``sdhci-pci --produces--> sdhci``
+
+  ``generic-sdhci --produces--> sdhci``
+- Create the ``sdhci-test`` test node. The test ``consumes`` the
+  ``sdhci`` interface, using its API. It doesn't need to look at
+  the supported machines or drivers.
+ + ``sdhci-test --consumes--> sdhci`` + +``arm-raspi2b`` machine, simplified from +``tests/qtest/libqos/arm-raspi2-machine.c``:: + + #include "qgraph.h" + + struct QRaspi2Machine { + QOSGraphObject obj; + QGuestAllocator alloc; + QSDHCI_MemoryMapped sdhci; + }; + + static void *raspi2_get_driver(void *object, const char *interface) + { + QRaspi2Machine *machine = object; + if (!g_strcmp0(interface, "memory")) { + return &machine->alloc; + } + + fprintf(stderr, "%s not present in arm/raspi2b\n", interface); + g_assert_not_reached(); + } + + static QOSGraphObject *raspi2_get_device(void *obj, + const char *device) + { + QRaspi2Machine *machine = obj; + if (!g_strcmp0(device, "generic-sdhci")) { + return &machine->sdhci.obj; + } + + fprintf(stderr, "%s not present in arm/raspi2b\n", device); + g_assert_not_reached(); + } + + static void *qos_create_machine_arm_raspi2(QTestState *qts) + { + QRaspi2Machine *machine = g_new0(QRaspi2Machine, 1); + + alloc_init(&machine->alloc, ...); + + /* Get node(s) contained inside (CONTAINS) */ + machine->obj.get_device = raspi2_get_device; + + /* Get node(s) produced (PRODUCES) */ + machine->obj.get_driver = raspi2_get_driver; + + /* free the object */ + machine->obj.destructor = raspi2_destructor; + qos_init_sdhci_mm(&machine->sdhci, ...); + return &machine->obj; + } + + static void raspi2_register_nodes(void) + { + /* arm/raspi2b --contains--> generic-sdhci */ + qos_node_create_machine("arm/raspi2b", + qos_create_machine_arm_raspi2); + qos_node_contains("arm/raspi2b", "generic-sdhci", NULL); + } + + libqos_init(raspi2_register_nodes); + +``x86_64/pc`` machine, simplified from +``tests/qtest/libqos/x86_64_pc-machine.c``:: + + #include "qgraph.h" + + struct i440FX_pcihost { + QOSGraphObject obj; + QPCIBusPC pci; + }; + + struct QX86PCMachine { + QOSGraphObject obj; + QGuestAllocator alloc; + i440FX_pcihost bridge; + }; + + /* i440FX_pcihost */ + + static QOSGraphObject *i440FX_host_get_device(void *obj, + const char *device) + { + i440FX_pcihost *host = obj; + if (!g_strcmp0(device, "pci-bus-pc")) { + return &host->pci.obj; + } + fprintf(stderr, "%s not present in i440FX-pcihost\n", device); + g_assert_not_reached(); + } + + /* x86_64/pc machine */ + + static void *pc_get_driver(void *object, const char *interface) + { + QX86PCMachine *machine = object; + if (!g_strcmp0(interface, "memory")) { + return &machine->alloc; + } + + fprintf(stderr, "%s not present in x86_64/pc\n", interface); + g_assert_not_reached(); + } + + static QOSGraphObject *pc_get_device(void *obj, const char *device) + { + QX86PCMachine *machine = obj; + if (!g_strcmp0(device, "i440FX-pcihost")) { + return &machine->bridge.obj; + } + + fprintf(stderr, "%s not present in x86_64/pc\n", device); + g_assert_not_reached(); + } + + static void *qos_create_machine_pc(QTestState *qts) + { + QX86PCMachine *machine = g_new0(QX86PCMachine, 1); + + /* Get node(s) contained inside (CONTAINS) */ + machine->obj.get_device = pc_get_device; + + /* Get node(s) produced (PRODUCES) */ + machine->obj.get_driver = pc_get_driver; + + /* free the object */ + machine->obj.destructor = pc_destructor; + pc_alloc_init(&machine->alloc, qts, ALLOC_NO_FLAGS); + + /* Get node(s) contained inside (CONTAINS) */ + machine->bridge.obj.get_device = i440FX_host_get_device; + + return &machine->obj; + } + + static void pc_machine_register_nodes(void) + { + /* x86_64/pc --contains--> 1440FX-pcihost --contains--> + * pci-bus-pc [--produces--> pci-bus (in pci.h)] */ + qos_node_create_machine("x86_64/pc", qos_create_machine_pc); 
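+        /* First CONTAINS edge of the chain described in the comment above:
+         * the x86_64/pc machine contains the i440FX-pcihost bridge. */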
+ qos_node_contains("x86_64/pc", "i440FX-pcihost", NULL); + + /* contained drivers don't need a constructor, + * they will be init by the parent */ + qos_node_create_driver("i440FX-pcihost", NULL); + qos_node_contains("i440FX-pcihost", "pci-bus-pc", NULL); + } + + libqos_init(pc_machine_register_nodes); + +``sdhci`` taken from ``tests/qtest/libqos/sdhci.c``:: + + /* Interface node, offers the sdhci API */ + struct QSDHCI { + uint16_t (*readw)(QSDHCI *s, uint32_t reg); + uint64_t (*readq)(QSDHCI *s, uint32_t reg); + void (*writeq)(QSDHCI *s, uint32_t reg, uint64_t val); + /* other fields */ + }; + + /* Memory Mapped implementation of QSDHCI */ + struct QSDHCI_MemoryMapped { + QOSGraphObject obj; + QSDHCI sdhci; + /* other driver-specific fields */ + }; + + /* PCI implementation of QSDHCI */ + struct QSDHCI_PCI { + QOSGraphObject obj; + QSDHCI sdhci; + /* other driver-specific fields */ + }; + + /* Memory mapped implementation of QSDHCI */ + + static void *sdhci_mm_get_driver(void *obj, const char *interface) + { + QSDHCI_MemoryMapped *smm = obj; + if (!g_strcmp0(interface, "sdhci")) { + return &smm->sdhci; + } + fprintf(stderr, "%s not present in generic-sdhci\n", interface); + g_assert_not_reached(); + } + + void qos_init_sdhci_mm(QSDHCI_MemoryMapped *sdhci, QTestState *qts, + uint32_t addr, QSDHCIProperties *common) + { + /* Get node contained inside (CONTAINS) */ + sdhci->obj.get_driver = sdhci_mm_get_driver; + + /* SDHCI interface API */ + sdhci->sdhci.readw = sdhci_mm_readw; + sdhci->sdhci.readq = sdhci_mm_readq; + sdhci->sdhci.writeq = sdhci_mm_writeq; + sdhci->qts = qts; + } + + /* PCI implementation of QSDHCI */ + + static void *sdhci_pci_get_driver(void *object, + const char *interface) + { + QSDHCI_PCI *spci = object; + if (!g_strcmp0(interface, "sdhci")) { + return &spci->sdhci; + } + + fprintf(stderr, "%s not present in sdhci-pci\n", interface); + g_assert_not_reached(); + } + + static void *sdhci_pci_create(void *pci_bus, + QGuestAllocator *alloc, + void *addr) + { + QSDHCI_PCI *spci = g_new0(QSDHCI_PCI, 1); + QPCIBus *bus = pci_bus; + uint64_t barsize; + + qpci_device_init(&spci->dev, bus, addr); + + /* SDHCI interface API */ + spci->sdhci.readw = sdhci_pci_readw; + spci->sdhci.readq = sdhci_pci_readq; + spci->sdhci.writeq = sdhci_pci_writeq; + + /* Get node(s) produced (PRODUCES) */ + spci->obj.get_driver = sdhci_pci_get_driver; + + spci->obj.start_hw = sdhci_pci_start_hw; + spci->obj.destructor = sdhci_destructor; + return &spci->obj; + } + + static void qsdhci_register_nodes(void) + { + QOSGraphEdgeOptions opts = { + .extra_device_opts = "addr=04.0", + }; + + /* generic-sdhci */ + /* generic-sdhci --produces--> sdhci */ + qos_node_create_driver("generic-sdhci", NULL); + qos_node_produces("generic-sdhci", "sdhci"); + + /* sdhci-pci */ + /* sdhci-pci --produces--> sdhci + * sdhci-pci --consumes--> pci-bus */ + qos_node_create_driver("sdhci-pci", sdhci_pci_create); + qos_node_produces("sdhci-pci", "sdhci"); + qos_node_consumes("sdhci-pci", "pci-bus", &opts); + } + + libqos_init(qsdhci_register_nodes); + +In the above example, all possible types of relations are created:: + + x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc + | + sdhci-pci --consumes--> pci-bus <--produces--+ + | + +--produces--+ + | + v + sdhci + ^ + | + +--produces-- + + | + arm/raspi2b --contains--> generic-sdhci + +or inverting the consumes edge in consumed_by:: + + x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc + | + sdhci-pci <--consumed by-- pci-bus <--produces--+ + | + 
+--produces--+ + | + v + sdhci + ^ + | + +--produces-- + + | + arm/raspi2b --contains--> generic-sdhci + +Adding a new test +----------------- + +Given the above setup, adding a new test is very simple. +``sdhci-test``, taken from ``tests/qtest/sdhci-test.c``:: + + static void check_capab_sdma(QSDHCI *s, bool supported) + { + uint64_t capab, capab_sdma; + + capab = s->readq(s, SDHC_CAPAB); + capab_sdma = FIELD_EX64(capab, SDHC_CAPAB, SDMA); + g_assert_cmpuint(capab_sdma, ==, supported); + } + + static void test_registers(void *obj, void *data, + QGuestAllocator *alloc) + { + QSDHCI *s = obj; + + /* example test */ + check_capab_sdma(s, s->props.capab.sdma); + } + + static void register_sdhci_test(void) + { + /* sdhci-test --consumes--> sdhci */ + qos_add_test("registers", "sdhci", test_registers, NULL); + } + + libqos_init(register_sdhci_test); + +Here a new test is created, consuming ``sdhci`` interface node +and creating a valid path from both machines to a test. +Final graph will be like this:: + + x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc + | + sdhci-pci --consumes--> pci-bus <--produces--+ + | + +--produces--+ + | + v + sdhci <--consumes-- sdhci-test + ^ + | + +--produces-- + + | + arm/raspi2b --contains--> generic-sdhci + +or inverting the consumes edge in consumed_by:: + + x86_64/pc --contains--> 1440FX-pcihost --contains--> pci-bus-pc + | + sdhci-pci <--consumed by-- pci-bus <--produces--+ + | + +--produces--+ + | + v + sdhci --consumed by--> sdhci-test + ^ + | + +--produces-- + + | + arm/raspi2b --contains--> generic-sdhci + +Assuming there the binary is +``QTEST_QEMU_BINARY=./qemu-system-x86_64`` +a valid test path will be: +``/x86_64/pc/1440FX-pcihost/pci-bus-pc/pci-bus/sdhci-pc/sdhci/sdhci-test`` + +and for the binary ``QTEST_QEMU_BINARY=./qemu-system-arm``: + +``/arm/raspi2b/generic-sdhci/sdhci/sdhci-test`` + +Additional examples are also in ``test-qgraph.c`` + +Qgraph API reference +-------------------- + +.. kernel-doc:: tests/qtest/libqos/qgraph.h diff --git a/docs/devel/qom.rst b/docs/devel/qom.rst new file mode 100644 index 000000000..e5fe3597c --- /dev/null +++ b/docs/devel/qom.rst @@ -0,0 +1,389 @@ +=========================== +The QEMU Object Model (QOM) +=========================== + +.. highlight:: c + +The QEMU Object Model provides a framework for registering user creatable +types and instantiating objects from those types. QOM provides the following +features: + +- System for dynamically registering types +- Support for single-inheritance of types +- Multiple inheritance of stateless interfaces + +.. code-block:: c + :caption: Creating a minimal type + + #include "qdev.h" + + #define TYPE_MY_DEVICE "my-device" + + // No new virtual functions: we can reuse the typedef for the + // superclass. + typedef DeviceClass MyDeviceClass; + typedef struct MyDevice + { + DeviceState parent; + + int reg0, reg1, reg2; + } MyDevice; + + static const TypeInfo my_device_info = { + .name = TYPE_MY_DEVICE, + .parent = TYPE_DEVICE, + .instance_size = sizeof(MyDevice), + }; + + static void my_device_register_types(void) + { + type_register_static(&my_device_info); + } + + type_init(my_device_register_types) + +In the above example, we create a simple type that is described by #TypeInfo. +#TypeInfo describes information about the type including what it inherits +from, the instance and class size, and constructor/destructor hooks. + +Alternatively several static types could be registered using helper macro +DEFINE_TYPES() + +.. 
code-block:: c + + static const TypeInfo device_types_info[] = { + { + .name = TYPE_MY_DEVICE_A, + .parent = TYPE_DEVICE, + .instance_size = sizeof(MyDeviceA), + }, + { + .name = TYPE_MY_DEVICE_B, + .parent = TYPE_DEVICE, + .instance_size = sizeof(MyDeviceB), + }, + }; + + DEFINE_TYPES(device_types_info) + +Every type has an #ObjectClass associated with it. #ObjectClass derivatives +are instantiated dynamically but there is only ever one instance for any +given type. The #ObjectClass typically holds a table of function pointers +for the virtual methods implemented by this type. + +Using object_new(), a new #Object derivative will be instantiated. You can +cast an #Object to a subclass (or base-class) type using +object_dynamic_cast(). You typically want to define macro wrappers around +OBJECT_CHECK() and OBJECT_CLASS_CHECK() to make it easier to convert to a +specific type: + +.. code-block:: c + :caption: Typecasting macros + + #define MY_DEVICE_GET_CLASS(obj) \ + OBJECT_GET_CLASS(MyDeviceClass, obj, TYPE_MY_DEVICE) + #define MY_DEVICE_CLASS(klass) \ + OBJECT_CLASS_CHECK(MyDeviceClass, klass, TYPE_MY_DEVICE) + #define MY_DEVICE(obj) \ + OBJECT_CHECK(MyDevice, obj, TYPE_MY_DEVICE) + +In case the ObjectClass implementation can be built as module a +module_obj() line must be added to make sure qemu loads the module +when the object is needed. + +.. code-block:: c + + module_obj(TYPE_MY_DEVICE); + +Class Initialization +==================== + +Before an object is initialized, the class for the object must be +initialized. There is only one class object for all instance objects +that is created lazily. + +Classes are initialized by first initializing any parent classes (if +necessary). After the parent class object has initialized, it will be +copied into the current class object and any additional storage in the +class object is zero filled. + +The effect of this is that classes automatically inherit any virtual +function pointers that the parent class has already initialized. All +other fields will be zero filled. + +Once all of the parent classes have been initialized, #TypeInfo::class_init +is called to let the class being instantiated provide default initialize for +its virtual functions. Here is how the above example might be modified +to introduce an overridden virtual function: + +.. code-block:: c + :caption: Overriding a virtual function + + #include "qdev.h" + + void my_device_class_init(ObjectClass *klass, void *class_data) + { + DeviceClass *dc = DEVICE_CLASS(klass); + dc->reset = my_device_reset; + } + + static const TypeInfo my_device_info = { + .name = TYPE_MY_DEVICE, + .parent = TYPE_DEVICE, + .instance_size = sizeof(MyDevice), + .class_init = my_device_class_init, + }; + +Introducing new virtual methods requires a class to define its own +struct and to add a .class_size member to the #TypeInfo. Each method +will also have a wrapper function to call it easily: + +.. 
code-block:: c + :caption: Defining an abstract class + + #include "qdev.h" + + typedef struct MyDeviceClass + { + DeviceClass parent; + + void (*frobnicate) (MyDevice *obj); + } MyDeviceClass; + + static const TypeInfo my_device_info = { + .name = TYPE_MY_DEVICE, + .parent = TYPE_DEVICE, + .instance_size = sizeof(MyDevice), + .abstract = true, // or set a default in my_device_class_init + .class_size = sizeof(MyDeviceClass), + }; + + void my_device_frobnicate(MyDevice *obj) + { + MyDeviceClass *klass = MY_DEVICE_GET_CLASS(obj); + + klass->frobnicate(obj); + } + +Interfaces +========== + +Interfaces allow a limited form of multiple inheritance. Instances are +similar to normal types except for the fact that are only defined by +their classes and never carry any state. As a consequence, a pointer to +an interface instance should always be of incomplete type in order to be +sure it cannot be dereferenced. That is, you should define the +'typedef struct SomethingIf SomethingIf' so that you can pass around +``SomethingIf *si`` arguments, but not define a ``struct SomethingIf { ... }``. +The only things you can validly do with a ``SomethingIf *`` are to pass it as +an argument to a method on its corresponding SomethingIfClass, or to +dynamically cast it to an object that implements the interface. + +Methods +======= + +A *method* is a function within the namespace scope of +a class. It usually operates on the object instance by passing it as a +strongly-typed first argument. +If it does not operate on an object instance, it is dubbed +*class method*. + +Methods cannot be overloaded. That is, the #ObjectClass and method name +uniquely identity the function to be called; the signature does not vary +except for trailing varargs. + +Methods are always *virtual*. Overriding a method in +#TypeInfo.class_init of a subclass leads to any user of the class obtained +via OBJECT_GET_CLASS() accessing the overridden function. +The original function is not automatically invoked. It is the responsibility +of the overriding class to determine whether and when to invoke the method +being overridden. + +To invoke the method being overridden, the preferred solution is to store +the original value in the overriding class before overriding the method. +This corresponds to ``{super,base}.method(...)`` in Java and C# +respectively; this frees the overriding class from hardcoding its parent +class, which someone might choose to change at some point. + +.. 
code-block:: c + :caption: Overriding a virtual method + + typedef struct MyState MyState; + + typedef void (*MyDoSomething)(MyState *obj); + + typedef struct MyClass { + ObjectClass parent_class; + + MyDoSomething do_something; + } MyClass; + + static void my_do_something(MyState *obj) + { + // do something + } + + static void my_class_init(ObjectClass *oc, void *data) + { + MyClass *mc = MY_CLASS(oc); + + mc->do_something = my_do_something; + } + + static const TypeInfo my_type_info = { + .name = TYPE_MY, + .parent = TYPE_OBJECT, + .instance_size = sizeof(MyState), + .class_size = sizeof(MyClass), + .class_init = my_class_init, + }; + + typedef struct DerivedClass { + MyClass parent_class; + + MyDoSomething parent_do_something; + } DerivedClass; + + static void derived_do_something(MyState *obj) + { + DerivedClass *dc = DERIVED_GET_CLASS(obj); + + // do something here + dc->parent_do_something(obj); + // do something else here + } + + static void derived_class_init(ObjectClass *oc, void *data) + { + MyClass *mc = MY_CLASS(oc); + DerivedClass *dc = DERIVED_CLASS(oc); + + dc->parent_do_something = mc->do_something; + mc->do_something = derived_do_something; + } + + static const TypeInfo derived_type_info = { + .name = TYPE_DERIVED, + .parent = TYPE_MY, + .class_size = sizeof(DerivedClass), + .class_init = derived_class_init, + }; + +Alternatively, object_class_by_name() can be used to obtain the class and +its non-overridden methods for a specific type. This would correspond to +``MyClass::method(...)`` in C++. + +The first example of such a QOM method was #CPUClass.reset, +another example is #DeviceClass.realize. + +Standard type declaration and definition macros +=============================================== + +A lot of the code outlined above follows a standard pattern and naming +convention. To reduce the amount of boilerplate code that needs to be +written for a new type there are two sets of macros to generate the +common parts in a standard format. + +A type is declared using the OBJECT_DECLARE macro family. In types +which do not require any virtual functions in the class, the +OBJECT_DECLARE_SIMPLE_TYPE macro is suitable, and is commonly placed +in the header file: + +.. code-block:: c + :caption: Declaring a simple type + + OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, my_device, + MY_DEVICE, DEVICE) + +This is equivalent to the following: + +.. code-block:: c + :caption: Expansion from declaring a simple type + + typedef struct MyDevice MyDevice; + typedef struct MyDeviceClass MyDeviceClass; + + G_DEFINE_AUTOPTR_CLEANUP_FUNC(MyDeviceClass, object_unref) + + #define MY_DEVICE_GET_CLASS(void *obj) \ + OBJECT_GET_CLASS(MyDeviceClass, obj, TYPE_MY_DEVICE) + #define MY_DEVICE_CLASS(void *klass) \ + OBJECT_CLASS_CHECK(MyDeviceClass, klass, TYPE_MY_DEVICE) + #define MY_DEVICE(void *obj) + OBJECT_CHECK(MyDevice, obj, TYPE_MY_DEVICE) + + struct MyDeviceClass { + DeviceClass parent_class; + }; + +The 'struct MyDevice' needs to be declared separately. +If the type requires virtual functions to be declared in the class +struct, then the alternative OBJECT_DECLARE_TYPE() macro can be +used. This does the same as OBJECT_DECLARE_SIMPLE_TYPE(), but without +the 'struct MyDeviceClass' definition. + +To implement the type, the OBJECT_DEFINE macro family is available. +In the simple case the OBJECT_DEFINE_TYPE macro is suitable: + +.. code-block:: c + :caption: Defining a simple type + + OBJECT_DEFINE_TYPE(MyDevice, my_device, MY_DEVICE, DEVICE) + +This is equivalent to the following: + +.. 
code-block:: c + :caption: Expansion from defining a simple type + + static void my_device_finalize(Object *obj); + static void my_device_class_init(ObjectClass *oc, void *data); + static void my_device_init(Object *obj); + + static const TypeInfo my_device_info = { + .parent = TYPE_DEVICE, + .name = TYPE_MY_DEVICE, + .instance_size = sizeof(MyDevice), + .instance_init = my_device_init, + .instance_finalize = my_device_finalize, + .class_size = sizeof(MyDeviceClass), + .class_init = my_device_class_init, + }; + + static void + my_device_register_types(void) + { + type_register_static(&my_device_info); + } + type_init(my_device_register_types); + +This is sufficient to get the type registered with the type +system, and the three standard methods now need to be implemented +along with any other logic required for the type. + +If the type needs to implement one or more interfaces, then the +OBJECT_DEFINE_TYPE_WITH_INTERFACES() macro can be used instead. +This accepts an array of interface type names. + +.. code-block:: c + :caption: Defining a simple type implementing interfaces + + OBJECT_DEFINE_TYPE_WITH_INTERFACES(MyDevice, my_device, + MY_DEVICE, DEVICE, + { TYPE_USER_CREATABLE }, + { NULL }) + +If the type is not intended to be instantiated, then then +the OBJECT_DEFINE_ABSTRACT_TYPE() macro can be used instead: + +.. code-block:: c + :caption: Defining a simple abstract type + + OBJECT_DEFINE_ABSTRACT_TYPE(MyDevice, my_device, + MY_DEVICE, DEVICE) + + + +API Reference +------------- + +.. kernel-doc:: include/qom/object.h diff --git a/docs/devel/qtest.rst b/docs/devel/qtest.rst new file mode 100644 index 000000000..c3dceb6c8 --- /dev/null +++ b/docs/devel/qtest.rst @@ -0,0 +1,92 @@ +======================================== +QTest Device Emulation Testing Framework +======================================== + +.. toctree:: + :hidden: + + qgraph + +QTest is a device emulation testing framework. It can be very useful to test +device models; it could also control certain aspects of QEMU (such as virtual +clock stepping), with a special purpose "qtest" protocol. Refer to +:ref:`qtest-protocol` for more details of the protocol. + +QTest cases can be executed with + +.. code:: + + make check-qtest + +The QTest library is implemented by ``tests/qtest/libqtest.c`` and the API is +defined in ``tests/qtest/libqtest.h``. + +Consider adding a new QTest case when you are introducing a new virtual +hardware, or extending one if you are adding functionalities to an existing +virtual device. + +On top of libqtest, a higher level library, ``libqos``, was created to +encapsulate common tasks of device drivers, such as memory management and +communicating with system buses or devices. Many virtual device tests use +libqos instead of directly calling into libqtest. +Libqos also offers the Qgraph API to increase each test coverage and +automate QEMU command line arguments and devices setup. +Refer to :ref:`qgraph` for Qgraph explanation and API. + +Steps to add a new QTest case are: + +1. Create a new source file for the test. (More than one file can be added as + necessary.) For example, ``tests/qtest/foo-test.c``. + +2. Write the test code with the glib and libqtest/libqos API. See also existing + tests and the library headers for reference. + +3. Register the new test in ``tests/qtest/meson.build``. Add the test + executable name to an appropriate ``qtests_*`` variable. There is + one variable per architecture, plus ``qtests_generic`` for tests + that can be run for all architectures. 
For example:: + + qtests_generic = [ + ... + 'foo-test', + ... + ] + +4. If the test has more than one source file or needs to be linked with any + dependency other than ``qemuutil`` and ``qos``, list them in the ``qtests`` + dictionary. For example a test that needs to use the ``QIO`` library + will have an entry like:: + + { + ... + 'foo-test': [io], + ... + } + +Debugging a QTest failure is slightly harder than the unit test because the +tests look up QEMU program names in the environment variables, such as +``QTEST_QEMU_BINARY`` and ``QTEST_QEMU_IMG``, and also because it is not easy +to attach gdb to the QEMU process spawned from the test. But manual invoking +and using gdb on the test is still simple to do: find out the actual command +from the output of + +.. code:: + + make check-qtest V=1 + +which you can run manually. + + +.. _qtest-protocol: + +QTest Protocol +-------------- + +.. kernel-doc:: softmmu/qtest.c + :doc: QTest Protocol + + +libqtest API reference +---------------------- + +.. kernel-doc:: tests/qtest/libqos/libqtest.h diff --git a/docs/devel/rcu.txt b/docs/devel/rcu.txt new file mode 100644 index 000000000..2e6cc607a --- /dev/null +++ b/docs/devel/rcu.txt @@ -0,0 +1,406 @@ +Using RCU (Read-Copy-Update) for synchronization +================================================ + +Read-copy update (RCU) is a synchronization mechanism that is used to +protect read-mostly data structures. RCU is very efficient and scalable +on the read side (it is wait-free), and thus can make the read paths +extremely fast. + +RCU supports concurrency between a single writer and multiple readers, +thus it is not used alone. Typically, the write-side will use a lock to +serialize multiple updates, but other approaches are possible (e.g., +restricting updates to a single task). In QEMU, when a lock is used, +this will often be the "iothread mutex", also known as the "big QEMU +lock" (BQL). Also, restricting updates to a single task is done in +QEMU using the "bottom half" API. + +RCU is fundamentally a "wait-to-finish" mechanism. The read side marks +sections of code with "critical sections", and the update side will wait +for the execution of all *currently running* critical sections before +proceeding, or before asynchronously executing a callback. + +The key point here is that only the currently running critical sections +are waited for; critical sections that are started _after_ the beginning +of the wait do not extend the wait, despite running concurrently with +the updater. This is the reason why RCU is more scalable than, +for example, reader-writer locks. It is so much more scalable that +the system will have a single instance of the RCU mechanism; a single +mechanism can be used for an arbitrary number of "things", without +having to worry about things such as contention or deadlocks. + +How is this possible? The basic idea is to split updates in two phases, +"removal" and "reclamation". During removal, we ensure that subsequent +readers will not be able to get a reference to the old data. After +removal has completed, a critical section will not be able to access +the old data. Therefore, critical sections that begin after removal +do not matter; as soon as all previous critical sections have finished, +there cannot be any readers who hold references to the data structure, +and these can now be safely reclaimed (e.g., freed or unref'ed). + +Here is a picture: + + thread 1 thread 2 thread 3 + ------------------- ------------------------ ------------------- + enter RCU crit.sec. 
+ | finish removal phase + | begin wait + | | enter RCU crit.sec. + exit RCU crit.sec | | + complete wait | + begin reclamation phase | + exit RCU crit.sec. + + +Note how thread 3 is still executing its critical section when thread 2 +starts reclaiming data. This is possible, because the old version of the +data structure was not accessible at the time thread 3 began executing +that critical section. + + +RCU API +======= + +The core RCU API is small: + + void rcu_read_lock(void); + + Used by a reader to inform the reclaimer that the reader is + entering an RCU read-side critical section. + + void rcu_read_unlock(void); + + Used by a reader to inform the reclaimer that the reader is + exiting an RCU read-side critical section. Note that RCU + read-side critical sections may be nested and/or overlapping. + + void synchronize_rcu(void); + + Blocks until all pre-existing RCU read-side critical sections + on all threads have completed. This marks the end of the removal + phase and the beginning of reclamation phase. + + Note that it would be valid for another update to come while + synchronize_rcu is running. Because of this, it is better that + the updater releases any locks it may hold before calling + synchronize_rcu. If this is not possible (for example, because + the updater is protected by the BQL), you can use call_rcu. + + void call_rcu1(struct rcu_head * head, + void (*func)(struct rcu_head *head)); + + This function invokes func(head) after all pre-existing RCU + read-side critical sections on all threads have completed. This + marks the end of the removal phase, with func taking care + asynchronously of the reclamation phase. + + The foo struct needs to have an rcu_head structure added, + perhaps as follows: + + struct foo { + struct rcu_head rcu; + int a; + char b; + long c; + }; + + so that the reclaimer function can fetch the struct foo address + and free it: + + call_rcu1(&foo.rcu, foo_reclaim); + + void foo_reclaim(struct rcu_head *rp) + { + struct foo *fp = container_of(rp, struct foo, rcu); + g_free(fp); + } + + For the common case where the rcu_head member is the first of the + struct, you can use the following macro. + + void call_rcu(T *p, + void (*func)(T *p), + field-name); + void g_free_rcu(T *p, + field-name); + + call_rcu1 is typically used through these macro, in the common case + where the "struct rcu_head" is the first field in the struct. If + the callback function is g_free, in particular, g_free_rcu can be + used. In the above case, one could have written simply: + + g_free_rcu(&foo, rcu); + + typeof(*p) qatomic_rcu_read(p); + + qatomic_rcu_read() is similar to qatomic_load_acquire(), but it makes + some assumptions on the code that calls it. This allows a more + optimized implementation. + + qatomic_rcu_read assumes that whenever a single RCU critical + section reads multiple shared data, these reads are either + data-dependent or need no ordering. This is almost always the + case when using RCU, because read-side critical sections typically + navigate one or more pointers (the pointers that are changed on + every update) until reaching a data structure of interest, + and then read from there. + + RCU read-side critical sections must use qatomic_rcu_read() to + read data, unless concurrent writes are prevented by another + synchronization mechanism. + + Furthermore, RCU read-side critical sections should traverse the + data structure in a single direction, opposite to the direction + in which the updater initializes it. 
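+
+        For example, a reader that walks a hypothetical RCU-protected
+        linked list would load each pointer with qatomic_rcu_read(), so
+        that every access is data-dependent on the previous one:
+
+            for (p = qatomic_rcu_read(&list_head); p;
+                 p = qatomic_rcu_read(&p->next)) {
+                /* read-only accesses to *p */
+            }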
+ + void qatomic_rcu_set(p, typeof(*p) v); + + qatomic_rcu_set() is similar to qatomic_store_release(), though it also + makes assumptions on the code that calls it in order to allow a more + optimized implementation. + + In particular, qatomic_rcu_set() suffices for synchronization + with readers, if the updater never mutates a field within a + data item that is already accessible to readers. This is the + case when initializing a new copy of the RCU-protected data + structure; just ensure that initialization of *p is carried out + before qatomic_rcu_set() makes the data item visible to readers. + If this rule is observed, writes will happen in the opposite + order as reads in the RCU read-side critical sections (or if + there is just one update), and there will be no need for other + synchronization mechanism to coordinate the accesses. + +The following APIs must be used before RCU is used in a thread: + + void rcu_register_thread(void); + + Mark a thread as taking part in the RCU mechanism. Such a thread + will have to report quiescent points regularly, either manually + or through the QemuCond/QemuSemaphore/QemuEvent APIs. + + void rcu_unregister_thread(void); + + Mark a thread as not taking part anymore in the RCU mechanism. + It is not a problem if such a thread reports quiescent points, + either manually or by using the QemuCond/QemuSemaphore/QemuEvent + APIs. + +Note that these APIs are relatively heavyweight, and should _not_ be +nested. + +Convenience macros +================== + +Two macros are provided that automatically release the read lock at the +end of the scope. + + RCU_READ_LOCK_GUARD() + + Takes the lock and will release it at the end of the block it's + used in. + + WITH_RCU_READ_LOCK_GUARD() { code } + + Is used at the head of a block to protect the code within the block. + +Note that 'goto'ing out of the guarded block will also drop the lock. + +DIFFERENCES WITH LINUX +====================== + +- Waiting on a mutex is possible, though discouraged, within an RCU critical + section. This is because spinlocks are rarely (if ever) used in userspace + programming; not allowing this would prevent upgrading an RCU read-side + critical section to become an updater. + +- qatomic_rcu_read and qatomic_rcu_set replace rcu_dereference and + rcu_assign_pointer. They take a _pointer_ to the variable being accessed. + +- call_rcu is a macro that has an extra argument (the name of the first + field in the struct, which must be a struct rcu_head), and expects the + type of the callback's argument to be the type of the first argument. + call_rcu1 is the same as Linux's call_rcu. + + +RCU PATTERNS +============ + +Many patterns using read-writer locks translate directly to RCU, with +the advantages of higher scalability and deadlock immunity. + +In general, RCU can be used whenever it is possible to create a new +"version" of a data structure every time the updater runs. This may +sound like a very strict restriction, however: + +- the updater does not mean "everything that writes to a data structure", + but rather "everything that involves a reclamation step". See the + array example below + +- in some cases, creating a new version of a data structure may actually + be very cheap. For example, modifying the "next" pointer of a singly + linked list is effectively creating a new version of the list. + +Here are some frequently-used RCU idioms that are worth noting. 
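+
+In the examples below, the explicit rcu_read_lock()/rcu_read_unlock() pairs
+can equally be written with the guard macros described earlier. A sketch,
+using the same hypothetical RCU-protected pointer "foo" as the examples
+that follow:
+
+    WITH_RCU_READ_LOCK_GUARD() {
+        struct foo *p = qatomic_rcu_read(&foo);
+        /* do something with p; the lock is dropped when the block is left */
+    }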
+ + +RCU list processing +------------------- + +TBD (not yet used in QEMU) + + +RCU reference counting +---------------------- + +Because grace periods are not allowed to complete while there is an RCU +read-side critical section in progress, the RCU read-side primitives +may be used as a restricted reference-counting mechanism. For example, +consider the following code fragment: + + rcu_read_lock(); + p = qatomic_rcu_read(&foo); + /* do something with p. */ + rcu_read_unlock(); + +The RCU read-side critical section ensures that the value of "p" remains +valid until after the rcu_read_unlock(). In some sense, it is acquiring +a reference to p that is later released when the critical section ends. +The write side looks simply like this (with appropriate locking): + + qemu_mutex_lock(&foo_mutex); + old = foo; + qatomic_rcu_set(&foo, new); + qemu_mutex_unlock(&foo_mutex); + synchronize_rcu(); + free(old); + +If the processing cannot be done purely within the critical section, it +is possible to combine this idiom with a "real" reference count: + + rcu_read_lock(); + p = qatomic_rcu_read(&foo); + foo_ref(p); + rcu_read_unlock(); + /* do something with p. */ + foo_unref(p); + +The write side can be like this: + + qemu_mutex_lock(&foo_mutex); + old = foo; + qatomic_rcu_set(&foo, new); + qemu_mutex_unlock(&foo_mutex); + synchronize_rcu(); + foo_unref(old); + +or with call_rcu: + + qemu_mutex_lock(&foo_mutex); + old = foo; + qatomic_rcu_set(&foo, new); + qemu_mutex_unlock(&foo_mutex); + call_rcu(foo_unref, old, rcu); + +In both cases, the write side only performs removal. Reclamation +happens when the last reference to a "foo" object is dropped. +Using synchronize_rcu() is undesirably expensive, because the +last reference may be dropped on the read side. Hence you can +use call_rcu() instead: + + foo_unref(struct foo *p) { + if (qatomic_fetch_dec(&p->refcount) == 1) { + call_rcu(foo_destroy, p, rcu); + } + } + + +Note that the same idioms would be possible with reader/writer +locks: + + read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock); + p = foo; p = foo; + /* do something with p. */ foo = new; + read_unlock(&foo_rwlock); free(p); + write_mutex_unlock(&foo_rwlock); + free(p); + + ------------------------------------------------------------------ + + read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock); + p = foo; old = foo; + foo_ref(p); foo = new; + read_unlock(&foo_rwlock); foo_unref(old); + /* do something with p. */ write_mutex_unlock(&foo_rwlock); + read_lock(&foo_rwlock); + foo_unref(p); + read_unlock(&foo_rwlock); + +foo_unref could use a mechanism such as bottom halves to move deallocation +out of the write-side critical section. + + +RCU resizable arrays +-------------------- + +Resizable arrays can be used with RCU. The expensive RCU synchronization +(or call_rcu) only needs to take place when the array is resized. +The two items to take care of are: + +- ensuring that the old version of the array is available between removal + and reclamation; + +- avoiding mismatches in the read side between the array data and the + array size. + +The first problem is avoided simply by not using realloc. Instead, +each resize will allocate a new array and copy the old data into it. +The second problem would arise if the size and the data pointers were +two members of a larger struct: + + struct mystuff { + ... + int data_size; + int data_alloc; + T *data; + ... 
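+        /* Readers could observe a new "data" pointer together with a stale
+         * data_size (or vice versa): the two fields cannot be read or
+         * updated atomically as a unit. */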
+ }; + +Instead, we store the size of the array with the array itself: + + struct arr { + int size; + int alloc; + T data[]; + }; + struct arr *global_array; + + read side: + rcu_read_lock(); + struct arr *array = qatomic_rcu_read(&global_array); + x = i < array->size ? array->data[i] : -1; + rcu_read_unlock(); + return x; + + write side (running under a lock): + if (global_array->size == global_array->alloc) { + /* Creating a new version. */ + new_array = g_malloc(sizeof(struct arr) + + global_array->alloc * 2 * sizeof(T)); + new_array->size = global_array->size; + new_array->alloc = global_array->alloc * 2; + memcpy(new_array->data, global_array->data, + global_array->alloc * sizeof(T)); + + /* Removal phase. */ + old_array = global_array; + qatomic_rcu_set(&global_array, new_array); + synchronize_rcu(); + + /* Reclamation phase. */ + free(old_array); + } + + +SOURCES +======= + +* Documentation/RCU/ from the Linux kernel diff --git a/docs/devel/replay.txt b/docs/devel/replay.txt new file mode 100644 index 000000000..e641c35ad --- /dev/null +++ b/docs/devel/replay.txt @@ -0,0 +1,46 @@ +Record/replay mechanism, that could be enabled through icount mode, expects +the virtual devices to satisfy the following requirements. + +The main idea behind this document is that everything that affects +the guest state during execution in icount mode should be deterministic. + +Timers +====== + +All virtual devices should use virtual clock for timers that change the guest +state. Virtual clock is deterministic, therefore such timers are deterministic +too. + +Virtual devices can also use realtime clock for the events that do not change +the guest state directly. When the clock ticking should depend on VM execution +speed, use virtual clock with EXTERNAL attribute. It is not deterministic, +but its speed depends on the guest execution. This clock is used by +the virtual devices (e.g., slirp routing device) that lie outside the +replayed guest. + +Bottom halves +============= + +Bottom half callbacks, that affect the guest state, should be invoked through +replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions. +Their invocations are saved in record mode and synchronized with the existing +log in replay mode. + +Saving/restoring the VM state +============================= + +All fields in the device state structure (including virtual timers) +should be restored by loadvm to the same values they had before savevm. + +Avoid accessing other devices' state, because the order of saving/restoring +is not defined. It means that you should not call functions like +'update_irq' in post_load callback. Save everything explicitly to avoid +the dependencies that may make restoring the VM state non-deterministic. + +Stopping the VM +=============== + +Stopping the guest should not interfere with its state (with the exception +of the network connections, that could be broken by the remote timeouts). +VM can be stopped at any moment of replay by the user. Restarting the VM +after that stop should not break the replay by the unneeded guest state change. diff --git a/docs/devel/reset.rst b/docs/devel/reset.rst new file mode 100644 index 000000000..abea1102d --- /dev/null +++ b/docs/devel/reset.rst @@ -0,0 +1,289 @@ + +======================================= +Reset in QEMU: the Resettable interface +======================================= + +The reset of qemu objects is handled using the resettable interface declared +in ``include/hw/resettable.h``. 
+ +This interface allows objects to be grouped (on a tree basis); so that the +whole group can be reset consistently. Each individual member object does not +have to care about others; in particular, problems of order (which object is +reset first) are addressed. + +As of now DeviceClass and BusClass implement this interface. + + +Triggering reset +---------------- + +This section documents the APIs which "users" of a resettable object should use +to control it. All resettable control functions must be called while holding +the iothread lock. + +You can apply a reset to an object using ``resettable_assert_reset()``. You need +to call ``resettable_release_reset()`` to release the object from reset. To +instantly reset an object, without keeping it in reset state, just call +``resettable_reset()``. These functions take two parameters: a pointer to the +object to reset and a reset type. + +Several types of reset will be supported. For now only cold reset is defined; +others may be added later. The Resettable interface handles reset types with an +enum: + +``RESET_TYPE_COLD`` + Cold reset is supported by every resettable object. In QEMU, it means we reset + to the initial state corresponding to the start of QEMU; this might differ + from what is a real hardware cold reset. It differs from other resets (like + warm or bus resets) which may keep certain parts untouched. + +Calling ``resettable_reset()`` is equivalent to calling +``resettable_assert_reset()`` then ``resettable_release_reset()``. It is +possible to interleave multiple calls to these three functions. There may +be several reset sources/controllers of a given object. The interface handles +everything and the different reset controllers do not need to know anything +about each others. The object will leave reset state only when each other +controllers end their reset operation. This point is handled internally by +maintaining a count of in-progress resets; it is crucial to call +``resettable_release_reset()`` one time and only one time per +``resettable_assert_reset()`` call. + +For now migration of a device or bus in reset is not supported. Care must be +taken not to delay ``resettable_release_reset()`` after its +``resettable_assert_reset()`` counterpart. + +Note that, since resettable is an interface, the API takes a simple Object as +parameter. Still, it is a programming error to call a resettable function on a +non-resettable object and it will trigger a run time assert error. Since most +calls to resettable interface are done through base class functions, such an +error is not likely to happen. + +For Devices and Buses, the following helper functions exist: + +- ``device_cold_reset()`` +- ``bus_cold_reset()`` + +These are simple wrappers around resettable_reset() function; they only cast the +Device or Bus into an Object and pass the cold reset type. When possible +prefer to use these functions instead of ``resettable_reset()``. + +Device and bus functions co-exist because there can be semantic differences +between resetting a bus and resetting the controller bridge which owns it. +For example, consider a SCSI controller. Resetting the controller puts all +its registers back to what reset state was as well as reset everything on the +SCSI bus, whereas resetting just the SCSI bus only resets everything that's on +it but not the controller. + + +Multi-phase mechanism +--------------------- + +This section documents the internals of the resettable interface. 
+
+The resettable interface uses a multi-phase system to relieve objects and
+machines from reset ordering problems. To address these, the reset operation
+of an object is split into three well-defined phases.
+
+When resetting several objects (for example the whole machine at simulation
+startup), all first phases of all objects are executed, then all second phases
+and then all third phases.
+
+The three phases are:
+
+1. The **enter** phase is executed when the object enters reset. It resets only
+   local state of the object; it must not do anything that has a side-effect
+   on other objects, such as raising or lowering a qemu_irq line or reading or
+   writing guest memory.
+
+2. The **hold** phase is executed for entry into reset, once every object in the
+   group which is being reset has had its *enter* phase executed. At this point
+   devices can do actions that affect other objects.
+
+3. The **exit** phase is executed when the object leaves the reset state.
+   Actions affecting other objects are permitted.
+
+As mentioned in the previous section, the interface maintains a count of
+in-progress resets. This count is used to ensure phases are executed only when
+required. The *enter* and *hold* phases are executed only when asserting reset
+for the first time (if an object is already in reset state when calling
+``resettable_assert_reset()`` or ``resettable_reset()``, they are not
+executed).
+The *exit* phase is executed only when the last reset operation ends. Therefore
+the object does not need to care how many reset controllers it has or how
+many of them have started a reset.
+
+
+Handling reset in a resettable object
+-------------------------------------
+
+This section documents the APIs that an implementation of a resettable object
+must provide and what functions it has access to. It is intended for people
+who want to implement or convert a class which has the resettable interface;
+for example when specializing an existing device or bus.
+
+Methods to implement
+....................
+
+Three methods should be defined or left empty. Each method corresponds to a
+phase of the reset; they are named ``phases.enter()``, ``phases.hold()`` and
+``phases.exit()``. They all take the object as parameter. The *enter* method
+also takes the reset type as a second parameter.
+
+When extending an existing class, these methods may need to be extended too.
+The ``resettable_class_set_parent_phases()`` class function may be used to
+back up the parent class methods.
+
+Here follows an example implementing reset for a device which sets an IO line
+while in reset.
+ +:: + + static void mydev_reset_enter(Object *obj, ResetType type) + { + MyDevClass *myclass = MYDEV_GET_CLASS(obj); + MyDevState *mydev = MYDEV(obj); + /* call parent class enter phase */ + if (myclass->parent_phases.enter) { + myclass->parent_phases.enter(obj, type); + } + /* initialize local state only */ + mydev->var = 0; + } + + static void mydev_reset_hold(Object *obj) + { + MyDevClass *myclass = MYDEV_GET_CLASS(obj); + MyDevState *mydev = MYDEV(obj); + /* call parent class hold phase */ + if (myclass->parent_phases.hold) { + myclass->parent_phases.hold(obj); + } + /* set an IO */ + qemu_set_irq(mydev->irq, 1); + } + + static void mydev_reset_exit(Object *obj) + { + MyDevClass *myclass = MYDEV_GET_CLASS(obj); + MyDevState *mydev = MYDEV(obj); + /* call parent class exit phase */ + if (myclass->parent_phases.exit) { + myclass->parent_phases.exit(obj); + } + /* clear an IO */ + qemu_set_irq(mydev->irq, 0); + } + + typedef struct MyDevClass { + MyParentClass parent_class; + /* to store eventual parent reset methods */ + ResettablePhases parent_phases; + } MyDevClass; + + static void mydev_class_init(ObjectClass *class, void *data) + { + MyDevClass *myclass = MYDEV_CLASS(class); + ResettableClass *rc = RESETTABLE_CLASS(class); + resettable_class_set_parent_reset_phases(rc, + mydev_reset_enter, + mydev_reset_hold, + mydev_reset_exit, + &myclass->parent_phases); + } + +In the above example, we override all three phases. It is possible to override +only some of them by passing NULL instead of a function pointer to +``resettable_class_set_parent_reset_phases()``. For example, the following will +only override the *enter* phase and leave *hold* and *exit* untouched:: + + resettable_class_set_parent_reset_phases(rc, mydev_reset_enter, + NULL, NULL, + &myclass->parent_phases); + +This is equivalent to providing a trivial implementation of the hold and exit +phases which does nothing but call the parent class's implementation of the +phase. + +Polling the reset state +....................... + +Resettable interface provides the ``resettable_is_in_reset()`` function. +This function returns true if the object parameter is currently under reset. + +An object is under reset from the beginning of the *init* phase to the end of +the *exit* phase. During all three phases, the function will return that the +object is in reset. + +This function may be used if the object behavior has to be adapted +while in reset state. For example if a device has an irq input, +it will probably need to ignore it while in reset; then it can for +example check the reset state at the beginning of the irq callback. + +Note that until migration of the reset state is supported, an object +should not be left in reset. So apart from being currently executing +one of the reset phases, the only cases when this function will return +true is if an external interaction (like changing an io) is made during +*hold* or *exit* phase of another object in the same reset group. + +Helpers ``device_is_in_reset()`` and ``bus_is_in_reset()`` are also provided +for devices and buses and should be preferred. + + +Base class handling of reset +---------------------------- + +This section documents parts of the reset mechanism that you only need to know +about if you are extending it to work with a new base class other than +DeviceClass or BusClass, or maintaining the existing code in those classes. Most +people can ignore it. + +Methods to implement +.................... 
+
+There are two other methods that need to exist in a class implementing the
+interface: ``get_state()`` and ``child_foreach()``.
+
+``get_state()`` is simple. *resettable* is an interface and, as a consequence,
+does not have any class state structure. But in order to factorize the code, we
+need one. This method must return a pointer to the ``ResettableState``
+structure. The structure must be allocated by the base class; preferably it
+should be located inside the object instance structure.
+
+``child_foreach()`` is more complex. It should execute the given callback on
+every reset child of the given resettable object. All children must be
+resettable too. Additional parameters (a reset type and an opaque pointer) must
+be passed to the callback too.
+
+In ``DeviceClass`` and ``BusClass`` the ``ResettableState`` is located in the
+``DeviceState`` and ``BusState`` structures. ``child_foreach()`` is implemented
+to follow the bus hierarchy; for a bus, it calls the function on every child
+device; for a device, it calls the function on every bus child. When we reset
+the main system bus, we reset the whole machine bus tree.
+
+Changing a resettable parent
+............................
+
+One thing which should be taken care of by the base class is handling reset
+hierarchy changes.
+
+The reset hierarchy is supposed to be static and built during machine creation.
+But there are actually some exceptions. To cope with this, the resettable API
+provides ``resettable_change_parent()``. This function allows you to set,
+update or remove the parent of a resettable object after machine creation is
+done. As parameters, it takes the object being moved, the old parent if any
+and the new parent if any.
+
+This function can be used at any time when not in a reset operation. During
+a reset operation it must be used only in the *hold* phase. Using it in the
+*enter* or *exit* phase is an error.
+Also it should not be used during machine creation, although it is harmless to
+do so: the function is a no-op as long as the old and new parent are NULL or
+not in reset.
+
+There are currently two cases where this function is used:
+
+1. *device hotplug*; it means a new device is introduced on a live bus.
+
+2. *hot bus change*; it means an existing live device is added, moved or
+   removed in the bus hierarchy. At the moment, it occurs only in the raspi
+   machines for changing the sdbus used by the SD card.
diff --git a/docs/devel/s390-dasd-ipl.rst b/docs/devel/s390-dasd-ipl.rst
new file mode 100644
index 000000000..2529eb5f5
--- /dev/null
+++ b/docs/devel/s390-dasd-ipl.rst
@@ -0,0 +1,138 @@
+Booting from real channel-attached devices on s390x
+===================================================
+
+s390 hardware IPL
+-----------------
+
+The s390 hardware IPL process consists of the following steps.
+
+1. A READ IPL ccw is constructed in memory location ``0x0``.
+   This ccw, by definition, reads the IPL1 record which is located on the disk
+   at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
+   so when it is complete another ccw will be fetched and executed from memory
+   location ``0x08``.
+
+2. Execute the Read IPL ccw at ``0x00``, thereby reading IPL1 data into ``0x00``.
+   IPL1 data is 24 bytes in length and consists of the following pieces of
+   information: ``[psw][read ccw][tic ccw]``. When the machine executes the Read
+   IPL ccw it reads the 24 bytes of IPL1 into memory starting at
+   location ``0x0``. 
Then the ccw program at ``0x08`` which consists of a read + ccw and a tic ccw is automatically executed because of the chain flag from + the original READ IPL ccw. The read ccw will read the IPL2 data into memory + and the TIC (Transfer In Channel) will transfer control to the channel + program contained in the IPL2 data. The TIC channel command is the + equivalent of a branch/jump/goto instruction for channel programs. + + NOTE: The ccws in IPL1 are defined by the architecture to be format 0. + +3. Execute IPL2. + The TIC ccw instruction at the end of the IPL1 channel program will begin + the execution of the IPL2 channel program. IPL2 is stage-2 of the boot + process and will contain a larger channel program than IPL1. The point of + IPL2 is to find and load either the operating system or a small program that + loads the operating system from disk. At the end of this step all or some of + the real operating system is loaded into memory and we are ready to hand + control over to the guest operating system. At this point the guest + operating system is entirely responsible for loading any more data it might + need to function. + + NOTE: The IPL2 channel program might read data into memory + location ``0x0`` thereby overwriting the IPL1 psw and channel program. This is ok + as long as the data placed in location ``0x0`` contains a psw whose instruction + address points to the guest operating system code to execute at the end of + the IPL/boot process. + + NOTE: The ccws in IPL2 are defined by the architecture to be format 0. + +4. Start executing the guest operating system. + The psw that was loaded into memory location ``0x0`` as part of the ipl process + should contain the needed flags for the operating system we have loaded. The + psw's instruction address will point to the location in memory where we want + to start executing the operating system. This psw is loaded (via LPSW + instruction) causing control to be passed to the operating system code. + +In a non-virtualized environment this process, handled entirely by the hardware, +is kicked off by the user initiating a "Load" procedure from the hardware +management console. This "Load" procedure crafts a special "Read IPL" ccw in +memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking +off the reading of IPL1 data. Since the channel program from IPL1 will be +written immediately after the special "Read IPL" ccw, the IPL1 channel program +will be executed immediately (the special read ccw has the chaining bit turned +on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel +program to be executed automatically. After this sequence completes the "Load" +procedure then loads the psw from ``0x0``. + +How this all pertains to QEMU (and the kernel) +---------------------------------------------- + +In theory we should merely have to do the following to IPL/boot a guest +operating system from a DASD device: + +1. Place a "Read IPL" ccw into memory location ``0x0`` with chaining bit on. +2. Execute channel program at ``0x0``. +3. LPSW ``0x0``. + +However, our emulation of the machine's channel program logic within the kernel +is missing one key feature that is required for this process to work: +non-prefetch of ccw data. + +When we start a channel program we pass the channel subsystem parameters via an +ORB (Operation Request Block). One of those parameters is a prefetch bit. 
If the +bit is on then the vfio-ccw kernel driver is allowed to read the entire channel +program from guest memory before it starts executing it. This means that any +channel commands that read additional channel commands will not work as expected +because the newly read commands will only exist in guest memory and NOT within +the kernel's channel subsystem memory. The kernel vfio-ccw driver currently +requires this bit to be on for all channel programs. This is a problem because +the IPL process consists of transferring control from the "Read IPL" ccw +immediately to the IPL1 channel program that was read by "Read IPL". + +Not being able to turn off prefetch will also prevent the TIC at the end of the +IPL1 channel program from transferring control to the IPL2 channel program. + +Lastly, in some cases (the zipl bootloader for example) the IPL2 program also +transfers control to another channel program segment immediately after reading +it from the disk. So we need to be able to handle this case. + +What QEMU does +-------------- + +Since we are forced to live with prefetch we cannot use the very simple IPL +procedure we defined in the preceding section. So we compensate by doing the +following. + +1. Place "Read IPL" ccw into memory location ``0x0``, but turn off chaining bit. +2. Execute "Read IPL" at ``0x0``. + + So now IPL1's psw is at ``0x0`` and IPL1's channel program is at ``0x08``. + +3. Write a custom channel program that will seek to the IPL2 record and then + execute the READ and TIC ccws from IPL1. Normally the seek is not required + because after reading the IPL1 record the disk is automatically positioned + to read the very next record which will be IPL2. But since we are not reading + both IPL1 and IPL2 as part of the same channel program we must manually set + the position. + +4. Grab the target address of the TIC instruction from the IPL1 channel program. + This address is where the IPL2 channel program starts. + + Now IPL2 is loaded into memory somewhere, and we know the address. + +5. Execute the IPL2 channel program at the address obtained in step #4. + + Because this channel program can be dynamic, we must use a special algorithm + that detects a READ immediately followed by a TIC and breaks the ccw chain + by turning off the chain bit in the READ ccw. When control is returned from + the kernel/hardware to the QEMU bios code we immediately issue another start + subchannel to execute the remaining TIC instruction. This causes the entire + channel program (starting from the TIC) and all needed data to be refetched + thereby stepping around the limitation that would otherwise prevent this + channel program from executing properly. + + Now the operating system code is loaded somewhere in guest memory and the psw + in memory location ``0x0`` will point to entry code for the guest operating + system. + +6. LPSW ``0x0`` + + LPSW transfers control to the guest operating system and we're done. diff --git a/docs/devel/secure-coding-practices.rst b/docs/devel/secure-coding-practices.rst new file mode 100644 index 000000000..0454cc527 --- /dev/null +++ b/docs/devel/secure-coding-practices.rst @@ -0,0 +1,115 @@ +======================= +Secure Coding Practices +======================= +This document covers topics that both developers and security researchers must +be aware of so that they can develop safe code and audit existing code +properly. 
+ +Reporting Security Bugs +----------------------- +For details on how to report security bugs or ask questions about potential +security bugs, see the `Security Process wiki page +<https://wiki.qemu.org/SecurityProcess>`_. + +General Secure C Coding Practices +--------------------------------- +Most CVEs (security bugs) reported against QEMU are not specific to +virtualization or emulation. They are simply C programming bugs. Therefore +it's critical to be aware of common classes of security bugs. + +There is a wide selection of resources available covering secure C coding. For +example, the `CERT C Coding Standard +<https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard>`_ +covers the most important classes of security bugs. + +Instead of describing them in detail here, only the names of the most important +classes of security bugs are mentioned: + +* Buffer overflows +* Use-after-free and double-free +* Integer overflows +* Format string vulnerabilities + +Some of these classes of bugs can be detected by analyzers. Static analysis is +performed regularly by Coverity and the most obvious of these bugs are even +reported by compilers. Dynamic analysis is possible with valgrind, tsan, and +asan. + +Input Validation +---------------- +Inputs from the guest or external sources (e.g. network, files) cannot be +trusted and may be invalid. Inputs must be checked before using them in a way +that could crash the program, expose host memory to the guest, or otherwise be +exploitable by an attacker. + +The most sensitive attack surface is device emulation. All hardware register +accesses and data read from guest memory must be validated. A typical example +is a device that contains multiple units that are selectable by the guest via +an index register:: + + typedef struct { + ProcessingUnit unit[2]; + ... + } MyDeviceState; + + static void mydev_writel(void *opaque, uint32_t addr, uint32_t val) + { + MyDeviceState *mydev = opaque; + ProcessingUnit *unit; + + switch (addr) { + case MYDEV_SELECT_UNIT: + unit = &mydev->unit[val]; <-- this input wasn't validated! + ... + } + } + +If ``val`` is not in range [0, 1] then an out-of-bounds memory access will take +place when ``unit`` is dereferenced. The code must check that ``val`` is 0 or +1 and handle the case where it is invalid. + +Unexpected Device Accesses +-------------------------- +The guest may access device registers in unusual orders or at unexpected +moments. Device emulation code must not assume that the guest follows the +typical "theory of operation" presented in driver writer manuals. The guest +may make nonsense accesses to device registers such as starting operations +before the device has been fully initialized. + +A related issue is that device emulation code must be prepared for unexpected +device register accesses while asynchronous operations are in progress. A +well-behaved guest might wait for a completion interrupt before accessing +certain device registers. Device emulation code must handle the case where the +guest overwrites registers or submits further requests before an ongoing +request completes. Unexpected accesses must not cause memory corruption or +leaks in QEMU. + +Invalid device register accesses can be reported with +``qemu_log_mask(LOG_GUEST_ERROR, ...)``. The ``-d guest_errors`` command-line +option enables these log messages. + +Live Migration +-------------- +Device state can be saved to disk image files and shared with other users. 
+Live migration code must validate inputs when loading device state so an +attacker cannot gain control by crafting invalid device states. Device state +is therefore considered untrusted even though it is typically generated by QEMU +itself. + +Guest Memory Access Races +------------------------- +Guests with multiple vCPUs may modify guest RAM while device emulation code is +running. Device emulation code must copy in descriptors and other guest RAM +structures and only process the local copy. This prevents +time-of-check-to-time-of-use (TOCTOU) race conditions that could cause QEMU to +crash when a vCPU thread modifies guest RAM while device emulation is +processing it. + +Use of null-co block drivers +---------------------------- + +The ``null-co`` block driver is designed for performance: its read accesses are +not initialized by default. In case this driver has to be used for security +research, it must be used with the ``read-zeroes=on`` option which fills read +buffers with zeroes. Security issues reported with the default +(``read-zeroes=off``) will be discarded. diff --git a/docs/devel/stable-process.rst b/docs/devel/stable-process.rst new file mode 100644 index 000000000..c21fb8664 --- /dev/null +++ b/docs/devel/stable-process.rst @@ -0,0 +1,73 @@ +.. _stable-process: + +QEMU and the stable process +=========================== + +QEMU stable releases +-------------------- + +QEMU stable releases are based upon the last released QEMU version +and marked by an additional version number, e.g. 2.10.1. Occasionally, +a four-number version is released, if a single urgent fix needs to go +on top. + +Usually, stable releases are only provided for the last major QEMU +release. For example, when QEMU 2.11.0 is released, 2.11.x or 2.11.x.y +stable releases are produced only until QEMU 2.12.0 is released, at +which point the stable process moves to producing 2.12.x/2.12.x.y releases. + +What should go into a stable release? +------------------------------------- + +Generally, the following patches are considered stable material: + +* Patches that fix severe issues, like fixes for CVEs + +* Patches that fix regressions + +If you think the patch would be important for users of the current release +(or for a distribution picking fixes), it is usually a good candidate +for stable. + + +How to get a patch into QEMU stable +----------------------------------- + +There are various ways to get a patch into stable: + +* Preferred: Make sure that the stable maintainers are on copy when you send + the patch by adding + + .. code:: + + Cc: qemu-stable@nongnu.org + + to the patch description. By default, this will send a copy of the patch + to ``qemu-stable@nongnu.org`` if you use git send-email, which is where + patches that are stable candidates are tracked by the maintainers. + +* You can also reply to a patch and put ``qemu-stable@nongnu.org`` on copy + directly in your mail client if you think a previously submitted patch + should be considered for a stable release. + +* If a maintainer judges the patch appropriate for stable later on (or you + notify them), they will add the same line to the patch, meaning that + the stable maintainers will be on copy on the maintainer's pull request. + +* If you judge an already merged patch suitable for stable, send a mail + (preferably as a reply to the most recent patch submission) to + ``qemu-stable@nongnu.org`` along with ``qemu-devel@nongnu.org`` and + appropriate other people (like the patch author or the relevant maintainer) + on copy. 
+ +Stable release process +---------------------- + +When the stable maintainers prepare a new stable release, they will prepare +a git branch with a release candidate and send the patches out to +``qemu-devel@nongnu.org`` for review. If any of your patches are included, +please verify that they look fine, especially if the maintainer had to tweak +the patch as part of back-porting things across branches. You may also +nominate other patches that you think are suitable for inclusion. After +review is complete (may involve more release candidates), a new stable release +is made available. diff --git a/docs/devel/style.rst b/docs/devel/style.rst new file mode 100644 index 000000000..9c5c0fffd --- /dev/null +++ b/docs/devel/style.rst @@ -0,0 +1,703 @@ +.. _coding-style: + +================= +QEMU Coding Style +================= + +.. contents:: Table of Contents + +Please use the script checkpatch.pl in the scripts directory to check +patches before submitting. + +Formatting and style +******************** + +Whitespace +========== + +Of course, the most important aspect in any coding style is whitespace. +Crusty old coders who have trouble spotting the glasses on their noses +can tell the difference between a tab and eight spaces from a distance +of approximately fifteen parsecs. Many a flamewar has been fought and +lost on this issue. + +QEMU indents are four spaces. Tabs are never used, except in Makefiles +where they have been irreversibly coded into the syntax. +Spaces of course are superior to tabs because: + +* You have just one way to specify whitespace, not two. Ambiguity breeds + mistakes. +* The confusion surrounding 'use tabs to indent, spaces to justify' is gone. +* Tab indents push your code to the right, making your screen seriously + unbalanced. +* Tabs will be rendered incorrectly on editors who are misconfigured not + to use tab stops of eight positions. +* Tabs are rendered badly in patches, causing off-by-one errors in almost + every line. +* It is the QEMU coding style. + +Do not leave whitespace dangling off the ends of lines. + +Multiline Indent +---------------- + +There are several places where indent is necessary: + +* if/else +* while/for +* function definition & call + +When breaking up a long line to fit within line width, we need a proper indent +for the following lines. + +In case of if/else, while/for, align the secondary lines just after the +opening parenthesis of the first. + +For example: + +.. code-block:: c + + if (a == 1 && + b == 2) { + + while (a == 1 && + b == 2) { + +In case of function, there are several variants: + +* 4 spaces indent from the beginning +* align the secondary lines just after the opening parenthesis of the first + +For example: + +.. code-block:: c + + do_something(x, y, + z); + + do_something(x, y, + z); + + do_something(x, do_another(y, + z)); + +Line width +========== + +Lines should be 80 characters; try not to make them longer. + +Sometimes it is hard to do, especially when dealing with QEMU subsystems +that use long function or symbol names. If wrapping the line at 80 columns +is obviously less readable and more awkward, prefer not to wrap it; better +to have an 85 character line than one which is awkwardly wrapped. + +Even in that case, try not to make lines much longer than 80 characters. +(The checkpatch script will warn at 100 characters, but this is intended +as a guard against obviously-overlength lines, not a target.) 
+ +Rationale: + +* Some people like to tile their 24" screens with a 6x4 matrix of 80x24 + xterms and use vi in all of them. The best way to punish them is to + let them keep doing it. +* Code and especially patches is much more readable if limited to a sane + line length. Eighty is traditional. +* The four-space indentation makes the most common excuse ("But look + at all that white space on the left!") moot. +* It is the QEMU coding style. + +Naming +====== + +Variables are lower_case_with_underscores; easy to type and read. Structured +type names are in CamelCase; harder to type but standing out. Enum type +names and function type names should also be in CamelCase. Scalar type +names are lower_case_with_underscores_ending_with_a_t, like the POSIX +uint64_t and family. Note that this last convention contradicts POSIX +and is therefore likely to be changed. + +Variable Naming Conventions +--------------------------- + +A number of short naming conventions exist for variables that use +common QEMU types. For example, the architecture independent CPUState +is often held as a ``cs`` pointer variable, whereas the concrete +CPUArchState is usually held in a pointer called ``env``. + +Likewise, in device emulation code the common DeviceState is usually +called ``dev``. + +Function Naming Conventions +--------------------------- + +Wrapped version of standard library or GLib functions use a ``qemu_`` +prefix to alert readers that they are seeing a wrapped version, for +example ``qemu_strtol`` or ``qemu_mutex_lock``. Other utility functions +that are widely called from across the codebase should not have any +prefix, for example ``pstrcpy`` or bit manipulation functions such as +``find_first_bit``. + +The ``qemu_`` prefix is also used for functions that modify global +emulator state, for example ``qemu_add_vm_change_state_handler``. +However, if there is an obvious subsystem-specific prefix it should be +used instead. + +Public functions from a file or subsystem (declared in headers) tend +to have a consistent prefix to show where they came from. For example, +``tlb_`` for functions from ``cputlb.c`` or ``cpu_`` for functions +from cpus.c. + +If there are two versions of a function to be called with or without a +lock held, the function that expects the lock to be already held +usually uses the suffix ``_locked``. + + +Block structure +=============== + +Every indented statement is braced; even if the block contains just one +statement. The opening brace is on the line that contains the control +flow statement that introduces the new block; the closing brace is on the +same line as the else keyword, or on a line by itself if there is no else +keyword. Example: + +.. code-block:: c + + if (a == 5) { + printf("a was 5.\n"); + } else if (a == 6) { + printf("a was 6.\n"); + } else { + printf("a was something else entirely.\n"); + } + +Note that 'else if' is considered a single statement; otherwise a long if/ +else if/else if/.../else sequence would need an indent for every else +statement. + +An exception is the opening brace for a function; for reasons of tradition +and clarity it comes on a line by itself: + +.. code-block:: c + + void a_function(void) + { + do_something(); + } + +Rationale: a consistent (except for functions...) bracing style reduces +ambiguity and avoids needless churn when lines are added or removed. +Furthermore, it is the QEMU coding style. 
+ +Declarations +============ + +Mixed declarations (interleaving statements and declarations within +blocks) are generally not allowed; declarations should be at the beginning +of blocks. + +Every now and then, an exception is made for declarations inside a +#ifdef or #ifndef block: if the code looks nicer, such declarations can +be placed at the top of the block even if there are statements above. +On the other hand, however, it's often best to move that #ifdef/#ifndef +block to a separate function altogether. + +Conditional statements +====================== + +When comparing a variable for (in)equality with a constant, list the +constant on the right, as in: + +.. code-block:: c + + if (a == 1) { + /* Reads like: "If a equals 1" */ + do_something(); + } + +Rationale: Yoda conditions (as in 'if (1 == a)') are awkward to read. +Besides, good compilers already warn users when '==' is mis-typed as '=', +even when the constant is on the right. + +Comment style +============= + +We use traditional C-style /``*`` ``*``/ comments and avoid // comments. + +Rationale: The // form is valid in C99, so this is purely a matter of +consistency of style. The checkpatch script will warn you about this. + +Multiline comment blocks should have a row of stars on the left, +and the initial /``*`` and terminating ``*``/ both on their own lines: + +.. code-block:: c + + /* + * like + * this + */ + +This is the same format required by the Linux kernel coding style. + +(Some of the existing comments in the codebase use the GNU Coding +Standards form which does not have stars on the left, or other +variations; avoid these when writing new comments, but don't worry +about converting to the preferred form unless you're editing that +comment anyway.) + +Rationale: Consistency, and ease of visually picking out a multiline +comment from the surrounding code. + +Language usage +************** + +Preprocessor +============ + +Variadic macros +--------------- + +For variadic macros, stick with this C99-like syntax: + +.. code-block:: c + + #define DPRINTF(fmt, ...) \ + do { printf("IRQ: " fmt, ## __VA_ARGS__); } while (0) + +Include directives +------------------ + +Order include directives as follows: + +.. code-block:: c + + #include "qemu/osdep.h" /* Always first... */ + #include <...> /* then system headers... */ + #include "..." /* and finally QEMU headers. */ + +The "qemu/osdep.h" header contains preprocessor macros that affect the behavior +of core system headers like <stdint.h>. It must be the first include so that +core system headers included by external libraries get the preprocessor macros +that QEMU depends on. + +Do not include "qemu/osdep.h" from header files since the .c file will have +already included it. + +C types +======= + +It should be common sense to use the right type, but we have collected +a few useful guidelines here. + +Scalars +------- + +If you're using "int" or "long", odds are good that there's a better type. +If a variable is counting something, it should be declared with an +unsigned type. + +If it's host memory-size related, size_t should be a good choice (use +ssize_t only if required). Guest RAM memory offsets must use ram_addr_t, +but only for RAM, it may not cover whole guest address space. + +If it's file-size related, use off_t. +If it's file-offset related (i.e., signed), use off_t. +If it's just counting small numbers use "unsigned int"; +(on all but oddball embedded systems, you can assume that that +type is at least four bytes wide). 
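+
+For illustration only (the variable names below are made up, not existing
+QEMU identifiers), this guidance maps to declarations such as:
+
+.. code-block:: c
+
+    size_t host_buf_size;      /* size of a host memory buffer */
+    ram_addr_t guest_ram_off;  /* offset into guest RAM */
+    off_t image_size;          /* file size or file offset */
+    unsigned int nr_queues;    /* small non-negative count */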
+ +In the event that you require a specific width, use a standard type +like int32_t, uint32_t, uint64_t, etc. The specific types are +mandatory for VMState fields. + +Don't use Linux kernel internal types like u32, __u32 or __le32. + +Use hwaddr for guest physical addresses except pcibus_t +for PCI addresses. In addition, ram_addr_t is a QEMU internal address +space that maps guest RAM physical addresses into an intermediate +address space that can map to host virtual address spaces. Generally +speaking, the size of guest memory can always fit into ram_addr_t but +it would not be correct to store an actual guest physical address in a +ram_addr_t. + +For CPU virtual addresses there are several possible types. +vaddr is the best type to use to hold a CPU virtual address in +target-independent code. It is guaranteed to be large enough to hold a +virtual address for any target, and it does not change size from target +to target. It is always unsigned. +target_ulong is a type the size of a virtual address on the CPU; this means +it may be 32 or 64 bits depending on which target is being built. It should +therefore be used only in target-specific code, and in some +performance-critical built-per-target core code such as the TLB code. +There is also a signed version, target_long. +abi_ulong is for the ``*``-user targets, and represents a type the size of +'void ``*``' in that target's ABI. (This may not be the same as the size of a +full CPU virtual address in the case of target ABIs which use 32 bit pointers +on 64 bit CPUs, like sparc32plus.) Definitions of structures that must match +the target's ABI must use this type for anything that on the target is defined +to be an 'unsigned long' or a pointer type. +There is also a signed version, abi_long. + +Of course, take all of the above with a grain of salt. If you're about +to use some system interface that requires a type like size_t, pid_t or +off_t, use matching types for any corresponding variables. + +Also, if you try to use e.g., "unsigned int" as a type, and that +conflicts with the signedness of a related variable, sometimes +it's best just to use the *wrong* type, if "pulling the thread" +and fixing all related variables would be too invasive. + +Finally, while using descriptive types is important, be careful not to +go overboard. If whatever you're doing causes warnings, or requires +casts, then reconsider or ask for help. + +Pointers +-------- + +Ensure that all of your pointers are "const-correct". +Unless a pointer is used to modify the pointed-to storage, +give it the "const" attribute. That way, the reader knows +up-front that this is a read-only pointer. Perhaps more +importantly, if we're diligent about this, when you see a non-const +pointer, you're guaranteed that it is used to modify the storage +it points to, or it is aliased to another pointer that is. + +Typedefs +-------- + +Typedefs are used to eliminate the redundant 'struct' keyword, since type +names have a different style than other identifiers ("CamelCase" versus +"snake_case"). Each named struct type should have a CamelCase name and a +corresponding typedef. + +Since certain C compilers choke on duplicated typedefs, you should avoid +them and declare a typedef only in one header file. For common types, +you can use "include/qemu/typedefs.h" for example. 
However, as a matter
+of convenience it is also perfectly fine to use forward struct
+definitions instead of typedefs in headers and function prototypes; this
+avoids problems with duplicated typedefs and reduces the need to include
+headers from other headers.
+
+Reserved namespaces in C and POSIX
+----------------------------------
+
+Underscore capital, double underscore, and underscore 't' suffixes should be
+avoided.
+
+Low level memory management
+===========================
+
+Use of the ``malloc/free/realloc/calloc/valloc/memalign/posix_memalign``
+APIs is not allowed in the QEMU codebase. Instead of these routines,
+use the GLib memory allocation routines
+``g_malloc/g_malloc0/g_new/g_new0/g_realloc/g_free``
+or QEMU's ``qemu_memalign/qemu_blockalign/qemu_vfree`` APIs.
+
+Please note that ``g_malloc`` will exit on allocation failure, so
+there is no need to test for failure (as you would have to with
+``malloc``). Generally using ``g_malloc`` on start-up is fine as the
+result of a failure to allocate memory is going to be a fatal exit
+anyway. There may be some start-up cases where failing is unreasonable
+(for example speculatively loading a large debug symbol table).
+
+Care should be taken to avoid introducing places where the guest could
+trigger an exit by causing a large allocation. For small allocations,
+of the order of 4k, a failure to allocate is likely indicative of an
+overloaded host and allowing ``g_malloc`` to ``exit`` is a reasonable
+approach. However for larger allocations where we could realistically
+fall back to a smaller one if need be we should use functions like
+``g_try_new`` and check the result. For example this is a valid approach
+for a time/space trade-off like ``tlb_mmu_resize_locked`` in the
+SoftMMU TLB code.
+
+If the lifetime of the allocation is within the function and there are
+multiple exit paths you can also improve the readability of the code
+by using ``g_autofree`` and related annotations. See :ref:`autofree-ref`
+for more details.
+
+Calling ``g_malloc`` with a zero size is valid and will return NULL.
+
+Prefer ``g_new(T, n)`` instead of ``g_malloc(sizeof(T) * n)`` for the following
+reasons:
+
+* It catches multiplication overflowing size_t;
+* It returns T ``*`` instead of void ``*``, letting the compiler catch more
+  type errors.
+
+Declarations like
+
+.. code-block:: c
+
+    T *v = g_malloc(sizeof(*v))
+
+are acceptable, though.
+
+Memory allocated by ``qemu_memalign`` or ``qemu_blockalign`` must be freed with
+``qemu_vfree``, since breaking this will cause problems on Win32.
+
+String manipulation
+===================
+
+Do not use the strncpy function. As mentioned in the man page, it does *not*
+guarantee a NULL-terminated buffer, which makes it extremely dangerous to use.
+It also zeros trailing destination bytes out to the specified length. Instead,
+use this similar function when possible, but note its different signature:
+
+.. code-block:: c
+
+    void pstrcpy(char *dest, int dest_buf_size, const char *src)
+
+Don't use strcat because it can't check for buffer overflows, but:
+
+.. code-block:: c
+
+    char *pstrcat(char *buf, int buf_size, const char *s)
+
+The same limitation exists with sprintf and vsprintf, so use snprintf and
+vsnprintf.
+
+QEMU provides other useful string functions:
+
+.. 
code-block:: c + + int strstart(const char *str, const char *val, const char **ptr) + int stristart(const char *str, const char *val, const char **ptr) + int qemu_strnlen(const char *s, int max_len) + +There are also replacement character processing macros for isxyz and toxyz, +so instead of e.g. isalnum you should use qemu_isalnum. + +Because of the memory management rules, you must use g_strdup/g_strndup +instead of plain strdup/strndup. + +Printf-style functions +====================== + +Whenever you add a new printf-style function, i.e., one with a format +string argument and following "..." in its prototype, be sure to use +gcc's printf attribute directive in the prototype. + +This makes it so gcc's -Wformat and -Wformat-security options can do +their jobs and cross-check format strings with the number and types +of arguments. + +C standard, implementation defined and undefined behaviors +========================================================== + +C code in QEMU should be written to the C99 language specification. A copy +of the final version of the C99 standard with corrigenda TC1, TC2, and TC3 +included, formatted as a draft, can be downloaded from: + + `<http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf>`_ + +The C language specification defines regions of undefined behavior and +implementation defined behavior (to give compiler authors enough leeway to +produce better code). In general, code in QEMU should follow the language +specification and avoid both undefined and implementation defined +constructs. ("It works fine on the gcc I tested it with" is not a valid +argument...) However there are a few areas where we allow ourselves to +assume certain behaviors because in practice all the platforms we care about +behave in the same way and writing strictly conformant code would be +painful. These are: + +* you may assume that integers are 2s complement representation +* you may assume that right shift of a signed integer duplicates + the sign bit (ie it is an arithmetic shift, not a logical shift) + +In addition, QEMU assumes that the compiler does not use the latitude +given in C99 and C11 to treat aspects of signed '<<' as undefined, as +documented in the GNU Compiler Collection manual starting at version 4.0. + +.. _autofree-ref: + +Automatic memory deallocation +============================= + +QEMU has a mandatory dependency either the GCC or CLang compiler. As +such it has the freedom to make use of a C language extension for +automatically running a cleanup function when a stack variable goes +out of scope. This can be used to simplify function cleanup paths, +often allowing many goto jumps to be eliminated, through automatic +free'ing of memory. + +The GLib2 library provides a number of functions/macros for enabling +automatic cleanup: + + `<https://developer.gnome.org/glib/stable/glib-Miscellaneous-Macros.html>`_ + +Most notably: + +* g_autofree - will invoke g_free() on the variable going out of scope + +* g_autoptr - for structs / objects, will invoke the cleanup func created + by a previous use of G_DEFINE_AUTOPTR_CLEANUP_FUNC. This is + supported for most GLib data types and GObjects + +For example, instead of + +.. code-block:: c + + int somefunc(void) { + int ret = -1; + char *foo = g_strdup_printf("foo%", "wibble"); + GList *bar = ..... + + if (eek) { + goto cleanup; + } + + ret = 0; + + cleanup: + g_free(foo); + g_list_free(bar); + return ret; + } + +Using g_autofree/g_autoptr enables the code to be written as: + +.. 
code-block:: c + + int somefunc(void) { + g_autofree char *foo = g_strdup_printf("foo%", "wibble"); + g_autoptr (GList) bar = ..... + + if (eek) { + return -1; + } + + return 0; + } + +While this generally results in simpler, less leak-prone code, there +are still some caveats to beware of + +* Variables declared with g_auto* MUST always be initialized, + otherwise the cleanup function will use uninitialized stack memory + +* If a variable declared with g_auto* holds a value which must + live beyond the life of the function, that value must be saved + and the original variable NULL'd out. This can be simpler using + g_steal_pointer + + +.. code-block:: c + + char *somefunc(void) { + g_autofree char *foo = g_strdup_printf("foo%", "wibble"); + g_autoptr (GList) bar = ..... + + if (eek) { + return NULL; + } + + return g_steal_pointer(&foo); + } + + +QEMU Specific Idioms +******************** + +Error handling and reporting +============================ + +Reporting errors to the human user +---------------------------------- + +Do not use printf(), fprintf() or monitor_printf(). Instead, use +error_report() or error_vreport() from error-report.h. This ensures the +error is reported in the right place (current monitor or stderr), and in +a uniform format. + +Use error_printf() & friends to print additional information. + +error_report() prints the current location. In certain common cases +like command line parsing, the current location is tracked +automatically. To manipulate it manually, use the loc_``*``() from +error-report.h. + +Propagating errors +------------------ + +An error can't always be reported to the user right where it's detected, +but often needs to be propagated up the call chain to a place that can +handle it. This can be done in various ways. + +The most flexible one is Error objects. See error.h for usage +information. + +Use the simplest suitable method to communicate success / failure to +callers. Stick to common methods: non-negative on success / -1 on +error, non-negative / -errno, non-null / null, or Error objects. + +Example: when a function returns a non-null pointer on success, and it +can fail only in one way (as far as the caller is concerned), returning +null on failure is just fine, and certainly simpler and a lot easier on +the eyes than propagating an Error object through an Error ``*````*`` parameter. + +Example: when a function's callers need to report details on failure +only the function really knows, use Error ``*````*``, and set suitable errors. + +Do not report an error to the user when you're also returning an error +for somebody else to handle. Leave the reporting to the place that +consumes the error returned. + +Handling errors +--------------- + +Calling exit() is fine when handling configuration errors during +startup. It's problematic during normal operation. In particular, +monitor commands should never exit(). + +Do not call exit() or abort() to handle an error that can be triggered +by the guest (e.g., some unimplemented corner case in guest code +translation or device emulation). Guests should not be able to +terminate QEMU. + +Note that &error_fatal is just another way to exit(1), and &error_abort +is just another way to abort(). + + +trace-events style +================== + +0x prefix +--------- + +In trace-events files, use a '0x' prefix to specify hex numbers, as in: + +.. 
code-block:: c + + some_trace(unsigned x, uint64_t y) "x 0x%x y 0x" PRIx64 + +An exception is made for groups of numbers that are hexadecimal by +convention and separated by the symbols '.', '/', ':', or ' ' (such as +PCI bus id): + +.. code-block:: c + + another_trace(int cssid, int ssid, int dev_num) "bus id: %x.%x.%04x" + +However, you can use '0x' for such groups if you want. Anyway, be sure that +it is obvious that numbers are in hex, ex.: + +.. code-block:: c + + data_dump(uint8_t c1, uint8_t c2, uint8_t c3) "bytes (in hex): %02x %02x %02x" + +Rationale: hex numbers are hard to read in logs when there is no 0x prefix, +especially when (occasionally) the representation doesn't contain any letters +and especially in one line with other decimal numbers. Number groups are allowed +to not use '0x' because for some things notations like %x.%x.%x are used not +only in QEMU. Also dumping raw data bytes with '0x' is less readable. + +'#' printf flag +--------------- + +Do not use printf flag '#', like '%#x'. + +Rationale: there are two ways to add a '0x' prefix to printed number: '0x%...' +and '%#...'. For consistency the only one way should be used. Arguments for +'0x%' are: + +* it is more popular +* '%#' omits the 0x for the value 0 which makes output inconsistent diff --git a/docs/devel/submitting-a-patch.rst b/docs/devel/submitting-a-patch.rst new file mode 100644 index 000000000..e51259eb9 --- /dev/null +++ b/docs/devel/submitting-a-patch.rst @@ -0,0 +1,562 @@ +.. _submitting-a-patch: + +Submitting a Patch +================== + +QEMU welcomes contributions of code (either fixing bugs or adding new +functionality). However, we get a lot of patches, and so we have some +guidelines about submitting patches. If you follow these, you'll help +make our task of code review easier and your patch is likely to be +committed faster. + +This page seems very long, so if you are only trying to post a quick +one-shot fix, the bare minimum we ask is that: + +- You **must** provide a Signed-off-by: line (this is a hard + requirement because it's how you say "I'm legally okay to contribute + this and happy for it to go into QEMU", modeled after the `Linux kernel + <http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__ + policy.) ``git commit -s`` or ``git format-patch -s`` will add one. +- All contributions to QEMU must be **sent as patches** to the + qemu-devel `mailing list <MailingLists>`__. Patch contributions + should not be posted on the bug tracker, posted on forums, or + externally hosted and linked to. (We have other mailing lists too, + but all patches must go to qemu-devel, possibly with a Cc: to another + list.) ``git send-email`` (`step-by-step setup + guide <https://git-send-email.io/>`__ and `hints and + tips <https://elixir.bootlin.com/linux/latest/source/Documentation/process/email-clients.rst>`__) + works best for delivering the patch without mangling it, but + attachments can be used as a last resort on a first-time submission. +- You must read replies to your message, and be willing to act on them. + Note, however, that maintainers are often willing to manually fix up + first-time contributions, since there is a learning curve involved in + making an ideal patch submission. 
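+
+Putting that bare minimum together, a first-time submission often boils down
+to something like the following (illustrative only; adjust the commit range
+and add Cc addresses, such as the relevant maintainers, as needed)::
+
+    git commit -s                 # records your Signed-off-by: line
+    git format-patch -1           # turns the last commit into a patch email
+    git send-email --to=qemu-devel@nongnu.org 0001-*.patch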
+ +You do not have to subscribe to post (list policy is to reply-to-all to +preserve CCs and keep non-subscribers in the loop on the threads they +start), although you may find it easier as a subscriber to pick up good +ideas from other posts. If you do subscribe, be prepared for a high +volume of email, often over one thousand messages in a week. The list is +moderated; first-time posts from an email address (whether or not you +subscribed) may be subject to some delay while waiting for a moderator +to whitelist your address. + +The larger your contribution is, or if you plan on becoming a long-term +contributor, then the more important the rest of this page becomes. +Reading the table of contents below should already give you an idea of +the basic requirements. Use the table of contents as a reference, and +read the parts that you have doubts about. + +.. contents:: Table of Contents + +.. _writing_your_patches: + +Writing your Patches +-------------------- + +.. _use_the_qemu_coding_style: + +Use the QEMU coding style +~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can run run *scripts/checkpatch.pl <patchfile>* before submitting to +check that you are in compliance with our coding standards. Be aware +that ``checkpatch.pl`` is not infallible, though, especially where C +preprocessor macros are involved; use some common sense too. See also: + +- :ref:`coding-style` +- `Automate a checkpatch run on + commit <https://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html>`__ + +.. _base_patches_against_current_git_master: + +Base patches against current git master +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There's no point submitting a patch which is based on a released version +of QEMU because development will have moved on from then and it probably +won't even apply to master. We only apply selected bugfixes to release +branches and then only as backports once the code has gone into master. + +It is also okay to base patches on top of other on-going work that is +not yet part of the git master branch. To aid continuous integration +tools, such as `patchew <http://patchew.org/QEMU/>`__, you should `add a +tag <https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg01288.html>`__ +line ``Based-on: $MESSAGE_ID`` to your cover letter to make the series +dependency obvious. + +.. _split_up_long_patches: + +Split up long patches +~~~~~~~~~~~~~~~~~~~~~ + +Split up longer patches into a patch series of logical code changes. +Each change should compile and execute successfully. For instance, don't +add a file to the makefile in patch one and then add the file itself in +patch two. (This rule is here so that people can later use tools like +`git bisect <http://git-scm.com/docs/git-bisect>`__ without hitting +points in the commit history where QEMU doesn't work for reasons +unrelated to the bug they're chasing.) Put documentation first, not +last, so that someone reading the series can do a clean-room evaluation +of the documentation, then validate that the code matched the +documentation. A commit message that mentions "Also, ..." is often a +good candidate for splitting into multiple patches. For more thoughts on +properly splitting patches and writing good commit messages, see `this +advice from +OpenStack <https://wiki.openstack.org/wiki/GitCommitMessages>`__. + +.. 
_make_code_motion_patches_easy_to_review: + +Make code motion patches easy to review +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a series requires large blocks of code motion, there are tricks for +making the refactoring easier to review. Split up the series so that +semantic changes (or even function renames) are done in a separate patch +from the raw code motion. Use a one-time setup of ``git config +diff.renames true;`` ``git config diff.algorithm patience`` (refer to +`git-config <http://git-scm.com/docs/git-config>`__). The 'diff.renames' +property ensures file rename patches will be given in a more compact +representation that focuses only on the differences across the file +rename, instead of showing the entire old file as a deletion and the new +file as an insertion. Meanwhile, the 'diff.algorithm' property ensures +that extracting a non-contiguous subset of one file into a new file, but +where all extracted parts occur in the same order both before and after +the patch, will reduce churn in trying to treat unrelated ``}`` lines in +the original file as separating hunks of changes. + +Ideally, a code motion patch can be reviewed by doing:: + + git format-patch --stdout -1 > patch; + diff -u <(sed -n 's/^-//p' patch) <(sed -n 's/^\+//p' patch) + +to focus on the few changes that weren't wholesale code motion. + +.. _dont_include_irrelevant_changes: + +Don't include irrelevant changes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In particular, don't include formatting, coding style or whitespace +changes to bits of code that would otherwise not be touched by the +patch. (It's OK to fix coding style issues in the immediate area (few +lines) of the lines you're changing.) If you think a section of code +really does need a reindent or other large-scale style fix, submit this +as a separate patch which makes no semantic changes; don't put it in the +same patch as your bug fix. + +For smaller patches in less frequently changed areas of QEMU, consider +using the :ref:`trivial-patches` process. + +.. _write_a_meaningful_commit_message: + +Write a meaningful commit message +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Commit messages should be meaningful and should stand on their own as a +historical record of why the changes you applied were necessary or +useful. + +QEMU follows the usual standard for git commit messages: the first line +(which becomes the email subject line) is "subsystem: single line +summary of change". Whether the "single line summary of change" starts +with a capital is a matter of taste, but we prefer that the summary does +not end in a dot. Look at ``git shortlog -30`` for an idea of sample +subject lines. Then there is a blank line and a more detailed +description of the patch, another blank and your Signed-off-by: line. +Please do not use lines that are longer than 76 characters in your +commit message (so that the text still shows up nicely with "git show" +in a 80-columns terminal window). + +The body of the commit message is a good place to document why your +change is important. Don't include comments like "This is a suggestion +for fixing this bug" (they can go below the ``---`` line in the email so +they don't go into the final commit message). Make sure the body of the +commit message can be read in isolation even if the reader's mailer +displays the subject line some distance apart (that is, a body that +starts with "... so that" as a continuation of the subject line is +harder to follow). 
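+
+As a purely illustrative sketch (not a real commit), a commit message
+following these conventions might look like::
+
+    hw/widget: fix frobnication of the control register
+
+    The control register was frobnicated twice when the guest enabled the
+    feature, which confused the driver. Frobnicate it only once.
+
+    Signed-off-by: Your Name <your.address@example.com>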
+ +If your patch fixes a commit that is already in the repository, please +add an additional line with "Fixes: <at-least-12-digits-of-SHA-commit-id> +("Fixed commit subject")" below the patch description / before your +"Signed-off-by:" line in the commit message. + +If your patch fixes a bug in the gitlab bug tracker, please add a line +with "Resolves: <URL-of-the-bug>" to the commit message, too. Gitlab can +close bugs automatically once commits with the "Resolved:" keyword get +merged into the master branch of the project. And if your patch addresses +a bug in another public bug tracker, you can also use a line with +"Buglink: <URL-of-the-bug>" for reference here, too. + +Example:: + + Fixes: 14055ce53c2d ("s390x/tcg: avoid overflows in time2tod/tod2time") + Resolves: https://gitlab.com/qemu-project/qemu/-/issues/42 + Buglink: https://bugs.launchpad.net/qemu/+bug/1804323`` + +Some other tags that are used in commit messages include "Message-Id:" +"Tested-by:", "Acked-by:", "Reported-by:", "Suggested-by:". See ``git +log`` for these keywords for example usage. + +.. _test_your_patches: + +Test your patches +~~~~~~~~~~~~~~~~~ + +Although QEMU has `continuous integration +services <Testing#Continuous_Integration>`__ that attempt to test +patches submitted to the list, it still saves everyone time if you have +already tested that your patch compiles and works. Because QEMU is such +a large project, it's okay to use configure arguments to limit what is +built for faster turnaround during your development time; but it is +still wise to also check that your patches work with a full build before +submitting a series, especially if your changes might have an unintended +effect on other areas of the code you don't normally experiment with. +See `Testing <Testing>`__ for more details on what tests are available. +Also, it is a wise idea to include a testsuite addition as part of your +patches - either to ensure that future changes won't regress your new +feature, or to add a test which exposes the bug that the rest of your +series fixes. Keeping separate commits for the test and the fix allows +reviewers to rebase the test to occur first to prove it catches the +problem, then again to place it last in the series so that bisection +doesn't land on a known-broken state. + +.. _submitting_your_patches: + +Submitting your Patches +----------------------- + +.. _if_you_cannot_send_patch_emails: + +If you cannot send patch emails +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In rare cases it may not be possible to send properly formatted patch +emails. You can use `sourcehut <https://sourcehut.org/>`__ to send your +patches to the QEMU mailing list by following these steps: + +#. Register or sign in to your account +#. Add your SSH public key in `meta \| + keys <https://meta.sr.ht/keys>`__. +#. Publish your git branch using **git push git@git.sr.ht:~USERNAME/qemu + HEAD** +#. Send your patches to the QEMU mailing list using the web-based + ``git-send-email`` UI at https://git.sr.ht/~USERNAME/qemu/send-email + +`This video +<https://spacepub.space/videos/watch/ad258d23-0ac6-488c-83fc-2bacf578de3a>`__ +shows the web-based ``git-send-email`` workflow. Documentation is +available `here +<https://man.sr.ht/git.sr.ht/#sending-patches-upstream>`__. + +.. _cc_the_relevant_maintainer: + +CC the relevant maintainer +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Send patches both to the mailing list and CC the maintainer(s) of the +files you are modifying. look in the MAINTAINERS file to find out who +that is. 
Also try using scripts/get_maintainer.pl from the repository +for learning the most common committers for the files you touched. + +Example:: + + ~/src/qemu/scripts/get_maintainer.pl -f hw/ide/core.c + +In fact, you can automate this, via a one-time setup of ``git config +sendemail.cccmd 'scripts/get_maintainer.pl --nogit-fallback'`` (Refer to +`git-config <http://git-scm.com/docs/git-config>`__.) + +.. _do_not_send_as_an_attachment: + +Do not send as an attachment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Send patches inline so they are easy to reply to with review comments. +Do not put patches in attachments. + +.. _use_git_format_patch: + +Use ``git format-patch`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +Use the right diff format. +`git format-patch <http://git-scm.com/docs/git-format-patch>`__ will +produce patch emails in the right format (check the documentation to +find out how to drive it). You can then edit the cover letter before +using ``git send-email`` to mail the files to the mailing list. (We +recommend `git send-email <http://git-scm.com/docs/git-send-email>`__ +because mail clients often mangle patches by wrapping long lines or +messing up whitespace. Some distributions do not include send-email in a +default install of git; you may need to download additional packages, +such as 'git-email' on Fedora-based systems.) Patch series need a cover +letter, with shallow threading (all patches in the series are +in-reply-to the cover letter, but not to each other); single unrelated +patches do not need a cover letter (but if you do send a cover letter, +use ``--numbered`` so the cover and the patch have distinct subject lines). +Patches are easier to find if they start a new top-level thread, rather +than being buried in-reply-to another existing thread. + +.. _avoid_posting_large_binary_blob: + +Avoid posting large binary blob +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you added binaries to the repository, consider producing the patch +emails using ``git format-patch --no-binary`` and include a link to a +git repository to fetch the original commit. + +.. _patch_emails_must_include_a_signed_off_by_line: + +Patch emails must include a ``Signed-off-by:`` line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For more information see `SubmittingPatches 1.12 +<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__. +This is vital or we will not be able to apply your patch! Please use +your real name to sign a patch (not an alias or acronym). + +If you wrote the patch, make sure your "From:" and "Signed-off-by:" +lines use the same spelling. It's okay if you subscribe or contribute to +the list via more than one address, but using multiple addresses in one +commit just confuses things. If someone else wrote the patch, git will +include a "From:" line in the body of the email (different from your +envelope From:) that will give credit to the correct author; but again, +that author's Signed-off-by: line is mandatory, with the same spelling. + +.. _include_a_meaningful_cover_letter: + +Include a meaningful cover letter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is a requirement for any series with multiple patches (as it aids +continuous integration), but optional for an isolated patch. The cover +letter explains the overall goal of such a series, and also provides a +convenient 0/N email for others to reply to the series as a whole. 
A +one-time setup of ``git config format.coverletter auto`` (refer to +`git-config <http://git-scm.com/docs/git-config>`__) will generate the +cover letter as needed. + +When reviewers don't know your goal at the start of their review, they +may object to early changes that don't make sense until the end of the +series, because they do not have enough context yet at that point of +their review. A series where the goal is unclear also risks a higher +number of review-fix cycles because the reviewers haven't bought into +the idea yet. If the cover letter can explain these points to the +reviewer, the process will be smoother patches will get merged faster. +Make sure your cover letter includes a diffstat of changes made over the +entire series; potential reviewers know what files they are interested +in, and they need an easy way determine if your series touches them. + +.. _use_the_rfc_tag_if_needed: + +Use the RFC tag if needed +~~~~~~~~~~~~~~~~~~~~~~~~~ + +For example, "[PATCH RFC v2]". ``git format-patch --subject-prefix=RFC`` +can help. + +"RFC" means "Request For Comments" and is a statement that you don't +intend for your patchset to be applied to master, but would like some +review on it anyway. Reasons for doing this include: + +- the patch depends on some pending kernel changes which haven't yet + been accepted, so the QEMU patch series is blocked until that + dependency has been dealt with, but is worth reviewing anyway +- the patch set is not finished yet (perhaps it doesn't cover all use + cases or work with all targets) but you want early review of a major + API change or design structure before continuing + +In general, since it's asking other people to do review work on a +patchset that the submitter themselves is saying shouldn't be applied, +it's best to: + +- use it sparingly +- in the cover letter, be clear about why a patch is an RFC, what areas + of the patchset you're looking for review on, and why reviewers + should care + +.. _consider_whether_your_patch_is_applicable_for_stable: + +Consider whether your patch is applicable for stable +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If your patch fixes a severe issue or a regression, it may be applicable +for stable. In that case, consider adding ``Cc: qemu-stable@nongnu.org`` +to your patch to notify the stable maintainers. + +For more details on how QEMU's stable process works, refer to the +:ref:`stable-process` page. + +.. _participating_in_code_review: + +Participating in Code Review +---------------------------- + +All patches submitted to the QEMU project go through a code review +process before they are accepted. Some areas of code that are well +maintained may review patches quickly, lesser-loved areas of code may +have a longer delay. + +.. _stay_around_to_fix_problems_raised_in_code_review: + +Stay around to fix problems raised in code review +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Not many patches get into QEMU straight away -- it is quite common that +developers will identify bugs, or suggest a cleaner approach, or even +just point out code style issues or commit message typos. You'll need to +respond to these, and then send a second version of your patches with +the issues fixed. This takes a little time and effort on your part, but +if you don't do it then your changes will never get into QEMU. 
It's also +just polite -- it is quite disheartening for a developer to spend time +reviewing your code and suggesting improvements, only to find that +you're not going to do anything further and it was all wasted effort. + +When replying to comments on your patches **reply to all and not just +the sender** -- keeping discussion on the mailing list means everybody +can follow it. + +.. _pay_attention_to_review_comments: + +Pay attention to review comments +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Someone took their time to review your work, and it pays to respect that +effort; repeatedly submitting a series without addressing all comments +from the previous round tends to alienate reviewers and stall your +patch. Reviewers aren't always perfect, so it is okay if you want to +argue that your code was correct in the first place instead of blindly +doing everything the reviewer asked. On the other hand, if someone +pointed out a potential issue during review, then even if your code +turns out to be correct, it's probably a sign that you should improve +your commit message and/or comments in the code explaining why the code +is correct. + +If you fix issues that are raised during review **resend the entire +patch series** not just the one patch that was changed. This allows +maintainers to easily apply the fixed series without having to manually +identify which patches are relevant. Send the new version as a complete +fresh email or series of emails -- don't try to make it a followup to +version 1. (This helps automatic patch email handling tools distinguish +between v1 and v2 emails.) + +.. _when_resending_patches_add_a_version_tag: + +When resending patches add a version tag +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All patches beyond the first version should include a version tag -- for +example, "[PATCH v2]". This means people can easily identify whether +they're looking at the most recent version. (The first version of a +patch need not say "v1", just [PATCH] is sufficient.) For patch series, +the version applies to the whole series -- even if you only change one +patch, you resend the entire series and mark it as "v2". Don't try to +track versions of different patches in the series separately. `git +format-patch <http://git-scm.com/docs/git-format-patch>`__ and `git +send-email <http://git-scm.com/docs/git-send-email>`__ both understand +the ``-v2`` option to make this easier. Send each new revision as a new +top-level thread, rather than burying it in-reply-to an earlier +revision, as many reviewers are not looking inside deep threads for new +patches. + +.. _include_version_history_in_patchset_revisions: + +Include version history in patchset revisions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For later versions of patches, include a summary of changes from +previous versions, but not in the commit message itself. In an email +formatted as a git patch, the commit message is the part above the ``---`` +line, and this will go into the git changelog when the patch is +committed. This part should be a self-contained description of what this +version of the patch does, written to make sense to anybody who comes +back to look at this commit in git in six months' time. The part below +the ``---`` line and above the patch proper (git format-patch puts the +diffstat here) is a good place to put remarks for people reading the +patch email, and this is where the "changes since previous version" +summary belongs. 
The `git-publish +<https://github.com/stefanha/git-publish>`__ script can help with +tracking a good summary across versions. Also, the `git-backport-diff +<https://github.com/codyprime/git-scripts>`__ script can help focus +reviewers on what changed between revisions. + +.. _tips_and_tricks: + +Tips and Tricks +--------------- + +.. _proper_use_of_reviewed_by_tags_can_aid_review: + +Proper use of Reviewed-by: tags can aid review +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When reviewing a large series, a reviewer can reply to some of the +patches with a Reviewed-by tag, stating that they are happy with that +patch in isolation (sometimes conditional on minor cleanup, like fixing +whitespace, that doesn't affect code content). You should then update +those commit messages by hand to include the Reviewed-by tag, so that in +the next revision, reviewers can spot which patches were already clean +from the previous round. Conversely, if you significantly modify a patch +that was previously reviewed, remove the reviewed-by tag out of the +commit message, as well as listing the changes from the previous +version, to make it easier to focus a reviewer's attention to your +changes. + +.. _if_your_patch_seems_to_have_been_ignored: + +If your patch seems to have been ignored +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If your patchset has received no replies you should "ping" it after a +week or two, by sending an email as a reply-to-all to the patch mail, +including the word "ping" and ideally also a link to the page for the +patch on `patchew <https://patchew.org/QEMU/>`__ or +`lore.kernel.org <https://lore.kernel.org/qemu-devel/>`__. It's worth +double-checking for reasons why your patch might have been ignored +(forgot to CC the maintainer? annoyed people by failing to respond to +review comments on an earlier version?), but often for less-maintained +areas of QEMU patches do just slip through the cracks. If your ping is +also ignored, ping again after another week or so. As the submitter, you +are the person with the most motivation to get your patch applied, so +you have to be persistent. + +.. _is_my_patch_in: + +Is my patch in? +~~~~~~~~~~~~~~~ + +QEMU has some Continuous Integration machines that try to catch patch +submission problems as soon as possible. `patchew +<http://patchew.org/QEMU/>`__ includes a web interface for tracking the +status of various threads that have been posted to the list, and may +send you an automated mail if it detected a problem with your patch. + +Once your patch has had enough review on list, the maintainer for that +area of code will send notification to the list that they are including +your patch in a particular staging branch. Periodically, the maintainer +then takes care of :ref:`submitting-a-pull-request` +for aggregating topic branches into mainline QEMU. Generally, you do not +need to send a pull request unless you have contributed enough patches +to become a maintainer over a particular section of code. Maintainers +may further modify your commit, by resolving simple merge conflicts or +fixing minor typos pointed out during review, but will always add a +Signed-off-by line in addition to yours, indicating that it went through +their tree. Occasionally, the maintainer's pull request may hit more +difficult merge conflicts, where you may be requested to help rebase and +resolve the problems. 
It may take a couple of weeks between when your +patch first had a positive review to when it finally lands in qemu.git; +release cycle freezes may extend that time even longer. + +.. _return_the_favor: + +Return the favor +~~~~~~~~~~~~~~~~ + +Peer review only works if everyone chips in a bit of review time. If +everyone submitted more patches than they reviewed, we would have a +patch backlog. A good goal is to try to review at least as many patches +from others as what you submit. Don't worry if you don't know the code +base as well as a maintainer; it's perfectly fine to admit when your +review is weak because you are unfamiliar with the code. diff --git a/docs/devel/submitting-a-pull-request.rst b/docs/devel/submitting-a-pull-request.rst new file mode 100644 index 000000000..c9d1e8afd --- /dev/null +++ b/docs/devel/submitting-a-pull-request.rst @@ -0,0 +1,77 @@ +.. _submitting-a-pull-request: + +Submitting a Pull Request +========================= + +QEMU welcomes contributions of code, but we generally expect these to be +sent as simple patch emails to the mailing list (see our page on +:ref:`submitting-a-patch` +for more details). Generally only existing submaintainers of a tree +will need to submit pull requests, although occasionally for a large +patch series we might ask a submitter to send a pull request. This page +documents our recommendations on pull requests for those people. + +A good rule of thumb is not to send a pull request unless somebody asks +you to. + +**Resend the patches with the pull request** as emails which are +threaded as follow-ups to the pull request itself. The simplest way to +do this is to use ``git format-patch --cover-letter`` to create the +emails, and then edit the cover letter to include the pull request +details that ``git request-pull`` outputs. + +**Use PULL as the subject line tag** in both the cover letter and the +retransmitted patch mails (for example, by using +``--subject-prefix=PULL`` in your ``git format-patch`` command). This +helps people to filter in or out the resulting emails (especially useful +if they are only CC'd on one email out of the set). + +**Each patch must have your own Signed-off-by: line** as well as that of +the original author if the patch was not written by you. This is because +with a pull request you're now indicating that the patch has passed via +you rather than directly from the original author. + +**Don't forget to add Reviewed-by: and Acked-by: lines**. When other +people have reviewed the patches you're putting in the pull request, +make sure you've copied their signoffs across. (If you use the `patches +tool <https://github.com/stefanha/patches>`__ to add patches from email +directly to your git repo it will include the tags automatically; if +you're updating patches manually or in some other way you'll need to +edit the commit messages by hand.) + +**Don't send pull requests for code that hasn't passed review**. A pull +request says these patches are ready to go into QEMU now, so they must +have passed the standard code review processes. In particular if you've +corrected issues in one round of code review, you need to send your +fixed patch series as normal to the list; you can't put it in a pull +request until it's gone through. (Extremely trivial fixes may be OK to +just fix in passing, but if in doubt err on the side of not.) + +**Test before sending**. 
This is an obvious thing to say, but make sure
everything builds (including that it compiles at each step of the patch
series) and that "make check" passes before sending out the pull
request. As a submaintainer you're one of QEMU's lines of defense
against bad code, so double check the details.

**All pull requests must be signed**. If your key is not already signed
by members of the QEMU community, you should make arrangements to attend
a `KeySigningParty <https://wiki.qemu.org/KeySigningParty>`__ (for
example at KVM Forum) or make alternative arrangements to have your key
signed by an attendee. Key signing requires meeting another community
member \*in person\* so please make appropriate arrangements. By
"signed" here we mean that the pullreq email should quote a tag which is
a GPG-signed tag (as created with 'git tag -s ...').

**Pull requests not for master should say "not for master" and have
"PULL SUBSYSTEM whatever" in the subject tag**. If your pull request is
targeting a stable branch or some submaintainer tree, please include the
string "not for master" in the cover letter email, and make sure the
subject tag is "PULL SUBSYSTEM s390/block/whatever" rather than just
"PULL". This allows it to be automatically filtered out of the set of
pull requests that should be applied to master.

You might be interested in the `make-pullreq
<https://git.linaro.org/people/peter.maydell/misc-scripts.git/tree/make-pullreq>`__
script which automates some of this process for you and includes a few
sanity checks. Note that you must edit it to configure it suitably for
your local situation!
diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
new file mode 100644
index 000000000..50c8e8dab
--- /dev/null
+++ b/docs/devel/tcg-icount.rst
@@ -0,0 +1,94 @@
..
   Copyright (c) 2020, Linaro Limited
   Written by Alex Bennée


========================
TCG Instruction Counting
========================

TCG has long supported a feature known as icount which allows for
instruction counting during execution. This should not be confused
with cycle accurate emulation - QEMU does not attempt to emulate how
long an instruction would take on real hardware. That is a job for
other more detailed (and slower) tools that simulate the rest of a
micro-architecture.

This feature is only available for system emulation and is
incompatible with multi-threaded TCG. It can be used to better align
execution time with wall-clock time so a "slow" device doesn't run too
fast on modern hardware. It can also provide a degree of
deterministic execution and is an essential part of the record/replay
support in QEMU.

Core Concepts
=============

At its heart icount is simply a count of executed instructions which
is stored in the TimersState of QEMU's timer sub-system. The number of
executed instructions can then be used to calculate QEMU_CLOCK_VIRTUAL
which represents the amount of elapsed time in the system since
execution started. Depending on the icount mode this may either be a
fixed number of ns per instruction or adjusted as execution continues
to keep wall clock time and virtual time in sync.

To be able to calculate the number of executed instructions the
translator starts by allocating a budget of instructions to be
executed. The budget of instructions is limited by how long it will be
until the next timer will expire. We store this budget as part of a
vCPU icount_decr field which is shared with the machinery for handling
cpu_exit().
The whole field is checked at the start of every +translated block and will cause a return to the outer loop to deal +with whatever caused the exit. + +In the case of icount, before the flag is checked we subtract the +number of instructions the translation block would execute. If this +would cause the instruction budget to go negative we exit the main +loop and regenerate a new translation block with exactly the right +number of instructions to take the budget to 0 meaning whatever timer +was due to expire will expire exactly when we exit the main run loop. + +Dealing with MMIO +----------------- + +While we can adjust the instruction budget for known events like timer +expiry we cannot do the same for MMIO. Every load/store we execute +might potentially trigger an I/O event, at which point we will need an +up to date and accurate reading of the icount number. + +To deal with this case, when an I/O access is made we: + + - restore un-executed instructions to the icount budget + - re-compile a single [1]_ instruction block for the current PC + - exit the cpu loop and execute the re-compiled block + +The new block is created with the CF_LAST_IO compile flag which +ensures the final instruction translation starts with a call to +gen_io_start() so we don't enter a perpetual loop constantly +recompiling a single instruction block. For translators using the +common translator_loop this is done automatically. + +.. [1] sometimes two instructions if dealing with delay slots + +Other I/O operations +-------------------- + +MMIO isn't the only type of operation for which we might need a +correct and accurate clock. IO port instructions and accesses to +system registers are the common examples here. These instructions have +to be handled by the individual translators which have the knowledge +of which operations are I/O operations. + +When the translator is handling an instruction of this kind: + +* it must call gen_io_start() if icount is enabled, at some + point before the generation of the code which actually does + the I/O, using a code fragment similar to: + +.. code:: c + + if (tb_cflags(s->base.tb) & CF_USE_ICOUNT) { + gen_io_start(); + } + +* it must end the TB immediately after this instruction diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst new file mode 100644 index 000000000..f93ef4fe5 --- /dev/null +++ b/docs/devel/tcg-plugins.rst @@ -0,0 +1,438 @@ +.. + Copyright (C) 2017, Emilio G. Cota <cota@braap.org> + Copyright (c) 2019, Linaro Limited + Written by Emilio Cota and Alex Bennée + +QEMU TCG Plugins +================ + +QEMU TCG plugins provide a way for users to run experiments taking +advantage of the total system control emulation can have over a guest. +It provides a mechanism for plugins to subscribe to events during +translation and execution and optionally callback into the plugin +during these events. TCG plugins are unable to change the system state +only monitor it passively. However they can do this down to an +individual instruction granularity including potentially subscribing +to all load and store operations. + +Usage +----- + +Any QEMU binary with TCG support has plugins enabled by default. 
+Earlier releases needed to be explicitly enabled with:: + + configure --enable-plugins + +Once built a program can be run with multiple plugins loaded each with +their own arguments:: + + $QEMU $OTHER_QEMU_ARGS \ + -plugin tests/plugin/libhowvec.so,inline=on,count=hint \ + -plugin tests/plugin/libhotblocks.so + +Arguments are plugin specific and can be used to modify their +behaviour. In this case the howvec plugin is being asked to use inline +ops to count and break down the hint instructions by type. + +Writing plugins +--------------- + +API versioning +~~~~~~~~~~~~~~ + +This is a new feature for QEMU and it does allow people to develop +out-of-tree plugins that can be dynamically linked into a running QEMU +process. However the project reserves the right to change or break the +API should it need to do so. The best way to avoid this is to submit +your plugin upstream so they can be updated if/when the API changes. + +All plugins need to declare a symbol which exports the plugin API +version they were built against. This can be done simply by:: + + QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION; + +The core code will refuse to load a plugin that doesn't export a +``qemu_plugin_version`` symbol or if plugin version is outside of QEMU's +supported range of API versions. + +Additionally the ``qemu_info_t`` structure which is passed to the +``qemu_plugin_install`` method of a plugin will detail the minimum and +current API versions supported by QEMU. The API version will be +incremented if new APIs are added. The minimum API version will be +incremented if existing APIs are changed or removed. + +Lifetime of the query handle +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each callback provides an opaque anonymous information handle which +can usually be further queried to find out information about a +translation, instruction or operation. The handles themselves are only +valid during the lifetime of the callback so it is important that any +information that is needed is extracted during the callback and saved +by the plugin. + +Plugin life cycle +~~~~~~~~~~~~~~~~~ + +First the plugin is loaded and the public qemu_plugin_install function +is called. The plugin will then register callbacks for various plugin +events. Generally plugins will register a handler for the *atexit* +if they want to dump a summary of collected information once the +program/system has finished running. + +When a registered event occurs the plugin callback is invoked. The +callbacks may provide additional information. In the case of a +translation event the plugin has an option to enumerate the +instructions in a block of instructions and optionally register +callbacks to some or all instructions when they are executed. + +There is also a facility to add an inline event where code to +increment a counter can be directly inlined with the translation. +Currently only a simple increment is supported. This is not atomic so +can miss counts. If you want absolute precision you should use a +callback which can then ensure atomicity itself. + +Finally when QEMU exits all the registered *atexit* callbacks are +invoked. + +Exposure of QEMU internals +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The plugin architecture actively avoids leaking implementation details +about how QEMU's translation works to the plugins. While there are +conceptions such as translation time and translation blocks the +details are opaque to plugins. 
The plugin is able to query select
details of instructions and system configuration only through the
exported *qemu_plugin* functions.

API
~~~

.. kernel-doc:: include/qemu/qemu-plugin.h

Internals
---------

Locking
~~~~~~~

We have to ensure we cannot deadlock, particularly under MTTCG. For
this we acquire a lock when called from plugin code. We also keep the
list of callbacks under RCU so that we do not have to hold the lock
when calling the callbacks. This is also for performance, since some
callbacks (e.g. memory access callbacks) might be called very
frequently.

 * A consequence of this is that we keep our own list of CPUs, so that
   we do not have to worry about locking order wrt cpu_list_lock.
 * Use a recursive lock, since we can get registration calls from
   callbacks.

As a result registering/unregistering callbacks is "slow", since it
takes a lock. But this is very infrequent; we want performance when
calling (or not calling) callbacks, not when registering them. Using
RCU is great for this.

We support the uninstallation of a plugin at any time (e.g. from
plugin callbacks). This allows plugins to remove themselves if they no
longer want to instrument the code. This operation is asynchronous
which means callbacks may still occur after the uninstall operation is
requested. The plugin isn't completely uninstalled until the safe work
has executed while all vCPUs are quiescent.

Example Plugins
---------------

There are a number of plugins included with QEMU and you are
encouraged to contribute your own plugins upstream. There is a
``contrib/plugins`` directory where they can go.

- tests/plugins

These are some basic plugins that are used to test and exercise the
API during the ``make check-tcg`` target.

- contrib/plugins/hotblocks.c

The hotblocks plugin allows you to examine where the hot paths of
execution are in your program. Once the program has finished you will
get a sorted list of blocks reporting the starting PC, translation
count, number of instructions and execution count. This will work best
with linux-user execution as system emulation tends to generate
re-translations as blocks from different programs get swapped in and
out of system memory.

If your program is single-threaded you can use the ``inline`` option for
slightly faster (but not thread safe) counters.

Example::

  ./aarch64-linux-user/qemu-aarch64 \
    -plugin contrib/plugins/libhotblocks.so -d plugin \
    ./tests/tcg/aarch64-linux-user/sha1
  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
  collected 903 entries in the hash table
    pc, tcount, icount, ecount
    0x0000000041ed10, 1, 5, 66087
    0x000000004002b0, 1, 4, 66087
  ...

- contrib/plugins/hotpages.c

Similar to hotblocks but this time tracks memory accesses::

  ./aarch64-linux-user/qemu-aarch64 \
    -plugin contrib/plugins/libhotpages.so -d plugin \
    ./tests/tcg/aarch64-linux-user/sha1
  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
  Addr, RCPUs, Reads, WCPUs, Writes
  0x000055007fe000, 0x0001, 31747952, 0x0001, 8835161
  0x000055007ff000, 0x0001, 29001054, 0x0001, 8780625
  0x00005500800000, 0x0001, 687465, 0x0001, 335857
  0x0000000048b000, 0x0001, 130594, 0x0001, 355
  0x0000000048a000, 0x0001, 1826, 0x0001, 11

The hotpages plugin can be configured using the following arguments:

 * sortby=reads|writes|address

   Log the data sorted by either the number of reads, the number of writes, or
   memory address.
(Default: entries are sorted by the sum of reads and writes) + + * io=on + + Track IO addresses. Only relevant to full system emulation. (Default: off) + + * pagesize=N + + The page size used. (Default: N = 4096) + +- contrib/plugins/howvec.c + +This is an instruction classifier so can be used to count different +types of instructions. It has a number of options to refine which get +counted. You can give a value to the ``count`` argument for a class of +instructions to break it down fully, so for example to see all the system +registers accesses:: + + ./aarch64-softmmu/qemu-system-aarch64 $(QEMU_ARGS) \ + -append "root=/dev/sda2 systemd.unit=benchmark.service" \ + -smp 4 -plugin ./contrib/plugins/libhowvec.so,count=sreg -d plugin + +which will lead to a sorted list after the class breakdown:: + + Instruction Classes: + Class: UDEF not counted + Class: SVE (68 hits) + Class: PCrel addr (47789483 hits) + Class: Add/Sub (imm) (192817388 hits) + Class: Logical (imm) (93852565 hits) + Class: Move Wide (imm) (76398116 hits) + Class: Bitfield (44706084 hits) + Class: Extract (5499257 hits) + Class: Cond Branch (imm) (147202932 hits) + Class: Exception Gen (193581 hits) + Class: NOP not counted + Class: Hints (6652291 hits) + Class: Barriers (8001661 hits) + Class: PSTATE (1801695 hits) + Class: System Insn (6385349 hits) + Class: System Reg counted individually + Class: Branch (reg) (69497127 hits) + Class: Branch (imm) (84393665 hits) + Class: Cmp & Branch (110929659 hits) + Class: Tst & Branch (44681442 hits) + Class: AdvSimd ldstmult (736 hits) + Class: ldst excl (9098783 hits) + Class: Load Reg (lit) (87189424 hits) + Class: ldst noalloc pair (3264433 hits) + Class: ldst pair (412526434 hits) + Class: ldst reg (imm) (314734576 hits) + Class: Loads & Stores (2117774 hits) + Class: Data Proc Reg (223519077 hits) + Class: Scalar FP (31657954 hits) + Individual Instructions: + Instr: mrs x0, sp_el0 (2682661 hits) (op=0xd5384100/ System Reg) + Instr: mrs x1, tpidr_el2 (1789339 hits) (op=0xd53cd041/ System Reg) + Instr: mrs x2, tpidr_el2 (1513494 hits) (op=0xd53cd042/ System Reg) + Instr: mrs x0, tpidr_el2 (1490823 hits) (op=0xd53cd040/ System Reg) + Instr: mrs x1, sp_el0 (933793 hits) (op=0xd5384101/ System Reg) + Instr: mrs x2, sp_el0 (699516 hits) (op=0xd5384102/ System Reg) + Instr: mrs x4, tpidr_el2 (528437 hits) (op=0xd53cd044/ System Reg) + Instr: mrs x30, ttbr1_el1 (480776 hits) (op=0xd538203e/ System Reg) + Instr: msr ttbr1_el1, x30 (480713 hits) (op=0xd518203e/ System Reg) + Instr: msr vbar_el1, x30 (480671 hits) (op=0xd518c01e/ System Reg) + ... + +To find the argument shorthand for the class you need to examine the +source code of the plugin at the moment, specifically the ``*opt`` +argument in the InsnClassExecCount tables. + +- contrib/plugins/lockstep.c + +This is a debugging tool for developers who want to find out when and +where execution diverges after a subtle change to TCG code generation. +It is not an exact science and results are likely to be mixed once +asynchronous events are introduced. While the use of -icount can +introduce determinism to the execution flow it doesn't always follow +the translation sequence will be exactly the same. Typically this is +caused by a timer firing to service the GUI causing a block to end +early. However in some cases it has proved to be useful in pointing +people at roughly where execution diverges. 
The only argument you need +for the plugin is a path for the socket the two instances will +communicate over:: + + + ./sparc-softmmu/qemu-system-sparc -monitor none -parallel none \ + -net none -M SS-20 -m 256 -kernel day11/zImage.elf \ + -plugin ./contrib/plugins/liblockstep.so,sockpath=lockstep-sparc.sock \ + -d plugin,nochain + +which will eventually report:: + + qemu-system-sparc: warning: nic lance.0 has no peer + @ 0x000000ffd06678 vs 0x000000ffd001e0 (2/1 since last) + @ 0x000000ffd07d9c vs 0x000000ffd06678 (3/1 since last) + Δ insn_count @ 0x000000ffd07d9c (809900609) vs 0x000000ffd06678 (809900612) + previously @ 0x000000ffd06678/10 (809900609 insns) + previously @ 0x000000ffd001e0/4 (809900599 insns) + previously @ 0x000000ffd080ac/2 (809900595 insns) + previously @ 0x000000ffd08098/5 (809900593 insns) + previously @ 0x000000ffd080c0/1 (809900588 insns) + +- contrib/plugins/hwprofile.c + +The hwprofile tool can only be used with system emulation and allows +the user to see what hardware is accessed how often. It has a number of options: + + * track=read or track=write + + By default the plugin tracks both reads and writes. You can use one + of these options to limit the tracking to just one class of accesses. + + * source + + Will include a detailed break down of what the guest PC that made the + access was. Not compatible with the pattern option. Example output:: + + cirrus-low-memory @ 0xfffffd00000a0000 + pc:fffffc0000005cdc, 1, 256 + pc:fffffc0000005ce8, 1, 256 + pc:fffffc0000005cec, 1, 256 + + * pattern + + Instead break down the accesses based on the offset into the HW + region. This can be useful for seeing the most used registers of a + device. Example output:: + + pci0-conf @ 0xfffffd01fe000000 + off:00000004, 1, 1 + off:00000010, 1, 3 + off:00000014, 1, 3 + off:00000018, 1, 2 + off:0000001c, 1, 2 + off:00000020, 1, 2 + ... + +- contrib/plugins/execlog.c + +The execlog tool traces executed instructions with memory access. It can be used +for debugging and security analysis purposes. +Please be aware that this will generate a lot of output. + +The plugin takes no argument:: + + qemu-system-arm $(QEMU_ARGS) \ + -plugin ./contrib/plugins/libexeclog.so -d plugin + +which will output an execution trace following this structure:: + + # vCPU, vAddr, opcode, disassembly[, load/store, memory addr, device]... 
+ 0, 0xa12, 0xf8012400, "movs r4, #0" + 0, 0xa14, 0xf87f42b4, "cmp r4, r6" + 0, 0xa16, 0xd206, "bhs #0xa26" + 0, 0xa18, 0xfff94803, "ldr r0, [pc, #0xc]", load, 0x00010a28, RAM + 0, 0xa1a, 0xf989f000, "bl #0xd30" + 0, 0xd30, 0xfff9b510, "push {r4, lr}", store, 0x20003ee0, RAM, store, 0x20003ee4, RAM + 0, 0xd32, 0xf9893014, "adds r0, #0x14" + 0, 0xd34, 0xf9c8f000, "bl #0x10c8" + 0, 0x10c8, 0xfff96c43, "ldr r3, [r0, #0x44]", load, 0x200000e4, RAM + +- contrib/plugins/cache.c + +Cache modelling plugin that measures the performance of a given L1 cache +configuration, and optionally a unified L2 per-core cache when a given working +set is run:: + + qemu-x86_64 -plugin ./contrib/plugins/libcache.so \ + -d plugin -D cache.log ./tests/tcg/x86_64-linux-user/float_convs + +will report the following:: + + core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate + 0 996695 508 0.0510% 2642799 18617 0.7044% + + address, data misses, instruction + 0x424f1e (_int_malloc), 109, movq %rax, 8(%rcx) + 0x41f395 (_IO_default_xsputn), 49, movb %dl, (%rdi, %rax) + 0x42584d (ptmalloc_init.part.0), 33, movaps %xmm0, (%rax) + 0x454d48 (__tunables_init), 20, cmpb $0, (%r8) + ... + + address, fetch misses, instruction + 0x4160a0 (__vfprintf_internal), 744, movl $1, %ebx + 0x41f0a0 (_IO_setb), 744, endbr64 + 0x415882 (__vfprintf_internal), 744, movq %r12, %rdi + 0x4268a0 (__malloc), 696, andq $0xfffffffffffffff0, %rax + ... + +The plugin has a number of arguments, all of them are optional: + + * limit=N + + Print top N icache and dcache thrashing instructions along with their + address, number of misses, and its disassembly. (default: 32) + + * icachesize=N + * iblksize=B + * iassoc=A + + Instruction cache configuration arguments. They specify the cache size, block + size, and associativity of the instruction cache, respectively. + (default: N = 16384, B = 64, A = 8) + + * dcachesize=N + * dblksize=B + * dassoc=A + + Data cache configuration arguments. They specify the cache size, block size, + and associativity of the data cache, respectively. + (default: N = 16384, B = 64, A = 8) + + * evict=POLICY + + Sets the eviction policy to POLICY. Available policies are: :code:`lru`, + :code:`fifo`, and :code:`rand`. The plugin will use the specified policy for + both instruction and data caches. (default: POLICY = :code:`lru`) + + * cores=N + + Sets the number of cores for which we maintain separate icache and dcache. + (default: for linux-user, N = 1, for full system emulation: N = cores + available to guest) + + * l2=on + + Simulates a unified L2 cache (stores blocks for both instructions and data) + using the default L2 configuration (cache size = 2MB, associativity = 16-way, + block size = 64B). + + * l2cachesize=N + * l2blksize=B + * l2assoc=A + + L2 cache configuration arguments. They specify the cache size, block size, and + associativity of the L2 cache, respectively. Setting any of the L2 + configuration arguments implies ``l2=on``. + (default: N = 2097152 (2MB), B = 64, A = 16) diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst new file mode 100644 index 000000000..a65fb7b1c --- /dev/null +++ b/docs/devel/tcg.rst @@ -0,0 +1,190 @@ +==================== +Translator Internals +==================== + +QEMU is a dynamic translator. When it first encounters a piece of code, +it converts it to the host instruction set. Usually dynamic translators +are very complicated and highly CPU dependent. 
QEMU uses some tricks +which make it relatively easily portable and simple while achieving good +performances. + +QEMU's dynamic translation backend is called TCG, for "Tiny Code +Generator". For more information, please take a look at ``tcg/README``. + +The following sections outline some notable features and implementation +details of QEMU's dynamic translator. + +CPU state optimisations +----------------------- + +The target CPUs have many internal states which change the way they +evaluate instructions. In order to achieve a good speed, the +translation phase considers that some state information of the virtual +CPU cannot change in it. The state is recorded in the Translation +Block (TB). If the state changes (e.g. privilege level), a new TB will +be generated and the previous TB won't be used anymore until the state +matches the state recorded in the previous TB. The same idea can be applied +to other aspects of the CPU state. For example, on x86, if the SS, +DS and ES segments have a zero base, then the translator does not even +generate an addition for the segment base. + +Direct block chaining +--------------------- + +After each translated basic block is executed, QEMU uses the simulated +Program Counter (PC) and other CPU state information (such as the CS +segment base value) to find the next basic block. + +In its simplest, less optimized form, this is done by exiting from the +current TB, going through the TB epilogue, and then back to the +main loop. That’s where QEMU looks for the next TB to execute, +translating it from the guest architecture if it isn’t already available +in memory. Then QEMU proceeds to execute this next TB, starting at the +prologue and then moving on to the translated instructions. + +Exiting from the TB this way will cause the ``cpu_exec_interrupt()`` +callback to be re-evaluated before executing additional instructions. +It is mandatory to exit this way after any CPU state changes that may +unmask interrupts. + +In order to accelerate the cases where the TB for the new +simulated PC is already available, QEMU has mechanisms that allow +multiple TBs to be chained directly, without having to go back to the +main loop as described above. These mechanisms are: + +``lookup_and_goto_ptr`` +^^^^^^^^^^^^^^^^^^^^^^^ + +Calling ``tcg_gen_lookup_and_goto_ptr()`` will emit a call to +``helper_lookup_tb_ptr``. This helper will look for an existing TB that +matches the current CPU state. If the destination TB is available its +code address is returned, otherwise the address of the JIT epilogue is +returned. The call to the helper is always followed by the tcg ``goto_ptr`` +opcode, which branches to the returned address. In this way, we either +branch to the next TB or return to the main loop. + +``goto_tb + exit_tb`` +^^^^^^^^^^^^^^^^^^^^^ + +The translation code usually implements branching by performing the +following steps: + +1. Call ``tcg_gen_goto_tb()`` passing a jump slot index (either 0 or 1) + as a parameter. + +2. Emit TCG instructions to update the CPU state with any information + that has been assumed constant and is required by the main loop to + correctly locate and execute the next TB. For most guests, this is + just the PC of the branch destination, but others may store additional + data. The information updated in this step must be inferable from both + ``cpu_get_tb_cpu_state()`` and ``cpu_restore_state()``. + +3. Call ``tcg_gen_exit_tb()`` passing the address of the current TB and + the jump slot index again. 
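
As an illustration, a target frontend typically wraps the three steps above
in a small helper along the following lines. This is only a sketch, not code
from any particular translator: the ``DisasContext`` layout, the
``use_goto_tb()`` check and the ``gen_update_pc()`` helper are illustrative
assumptions standing in for whatever the target in question provides, while
``tcg_gen_goto_tb()``, ``tcg_gen_exit_tb()`` and
``tcg_gen_lookup_and_goto_ptr()`` are the TCG operations discussed here.

.. code:: c

   static void gen_jmp_tb(DisasContext *s, target_ulong dest, int slot)
   {
       if (use_goto_tb(s, dest)) {
           /* Step 1: emit the goto_tb op for the chosen jump slot (0 or 1). */
           tcg_gen_goto_tb(slot);
           /* Step 2: update the CPU state the main loop needs to locate the
            * next TB; for most guests this is just the destination PC. */
           gen_update_pc(s, dest);
           /* Step 3: exit, returning the current TB tagged with the slot. */
           tcg_gen_exit_tb(s->base.tb, slot);
       } else {
           /* Chaining is not possible (e.g. the branch crosses a page):
            * fall back to an indirect lookup of the next TB. */
           gen_update_pc(s, dest);
           tcg_gen_lookup_and_goto_ptr();
       }
       s->base.is_jmp = DISAS_NORETURN;
   }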
+ +Step 1, ``tcg_gen_goto_tb()``, will emit a ``goto_tb`` TCG +instruction that later on gets translated to a jump to an address +associated with the specified jump slot. Initially, this is the address +of step 2's instructions, which update the CPU state information. Step 3, +``tcg_gen_exit_tb()``, exits from the current TB returning a tagged +pointer composed of the last executed TB’s address and the jump slot +index. + +The first time this whole sequence is executed, step 1 simply jumps +to step 2. Then the CPU state information gets updated and we exit from +the current TB. As a result, the behavior is very similar to the less +optimized form described earlier in this section. + +Next, the main loop looks for the next TB to execute using the +current CPU state information (creating the TB if it wasn’t already +available) and, before starting to execute the new TB’s instructions, +patches the previously executed TB by associating one of its jump +slots (the one specified in the call to ``tcg_gen_exit_tb()``) with the +address of the new TB. + +The next time this previous TB is executed and we get to that same +``goto_tb`` step, it will already be patched (assuming the destination TB +is still in memory) and will jump directly to the first instruction of +the destination TB, without going back to the main loop. + +For the ``goto_tb + exit_tb`` mechanism to be used, the following +conditions need to be satisfied: + +* The change in CPU state must be constant, e.g., a direct branch and + not an indirect branch. + +* The direct branch cannot cross a page boundary. Memory mappings + may change, causing the code at the destination address to change. + +Note that, on step 3 (``tcg_gen_exit_tb()``), in addition to the +jump slot index, the address of the TB just executed is also returned. +This address corresponds to the TB that will be patched; it may be +different than the one that was directly executed from the main loop +if the latter had already been chained to other TBs. + +Self-modifying code and translated code invalidation +---------------------------------------------------- + +Self-modifying code is a special challenge in x86 emulation because no +instruction cache invalidation is signaled by the application when code +is modified. + +User-mode emulation marks a host page as write-protected (if it is +not already read-only) every time translated code is generated for a +basic block. Then, if a write access is done to the page, Linux raises +a SEGV signal. QEMU then invalidates all the translated code in the page +and enables write accesses to the page. For system emulation, write +protection is achieved through the software MMU. + +Correct translated code invalidation is done efficiently by maintaining +a linked list of every translated block contained in a given page. Other +linked lists are also maintained to undo direct block chaining. + +On RISC targets, correctly written software uses memory barriers and +cache flushes, so some of the protection above would not be +necessary. However, QEMU still requires that the generated code always +matches the target instructions in memory in order to handle +exceptions correctly. + +Exception support +----------------- + +longjmp() is used when an exception such as division by zero is +encountered. + +The host SIGSEGV and SIGBUS signal handlers are used to get invalid +memory accesses. 
QEMU keeps a map from host program counter to +target program counter, and looks up where the exception happened +based on the host program counter at the exception point. + +On some targets, some bits of the virtual CPU's state are not flushed to the +memory until the end of the translation block. This is done for internal +emulation state that is rarely accessed directly by the program and/or changes +very often throughout the execution of a translation block---this includes +condition codes on x86, delay slots on SPARC, conditional execution on +Arm, and so on. This state is stored for each target instruction, and +looked up on exceptions. + +MMU emulation +------------- + +For system emulation QEMU uses a software MMU. In that mode, the MMU +virtual to physical address translation is done at every memory +access. + +QEMU uses an address translation cache (TLB) to speed up the translation. +In order to avoid flushing the translated code each time the MMU +mappings change, all caches in QEMU are physically indexed. This +means that each basic block is indexed with its physical address. + +In order to avoid invalidating the basic block chain when MMU mappings +change, chaining is only performed when the destination of the jump +shares a page with the basic block that is performing the jump. + +The MMU can also distinguish RAM and ROM memory areas from MMIO memory +areas. Access is faster for RAM and ROM because the translation cache also +hosts the offset between guest address and host memory. Accessing MMIO +memory areas instead calls out to C code for device emulation. +Finally, the MMU helps tracking dirty pages and pages pointed to by +translation blocks. + diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst new file mode 100644 index 000000000..755343c7d --- /dev/null +++ b/docs/devel/testing.rst @@ -0,0 +1,1309 @@ +Testing in QEMU +=============== + +This document describes the testing infrastructure in QEMU. + +Testing with "make check" +------------------------- + +The "make check" testing family includes most of the C based tests in QEMU. For +a quick help, run ``make check-help`` from the source tree. + +The usual way to run these tests is: + +.. code:: + + make check + +which includes QAPI schema tests, unit tests, QTests and some iotests. +Different sub-types of "make check" tests will be explained below. + +Before running tests, it is best to build QEMU programs first. Some tests +expect the executables to exist and will fail with obscure messages if they +cannot find them. + +Unit tests +~~~~~~~~~~ + +Unit tests, which can be invoked with ``make check-unit``, are simple C tests +that typically link to individual QEMU object files and exercise them by +calling exported functions. + +If you are writing new code in QEMU, consider adding a unit test, especially +for utility modules that are relatively stateless or have few dependencies. To +add a new unit test: + +1. Create a new source file. For example, ``tests/unit/foo-test.c``. + +2. Write the test. Normally you would include the header file which exports + the module API, then verify the interface behaves as expected from your + test. The test code should be organized with the glib testing framework. + Copying and modifying an existing test is usually a good idea. + +3. Add the test to ``tests/unit/meson.build``. The unit tests are listed in a + dictionary called ``tests``. The values are any additional sources and + dependencies to be linked with the test. 
For a simple test whose source + is in ``tests/unit/foo-test.c``, it is enough to add an entry like:: + + { + ... + 'foo-test': [], + ... + } + +Since unit tests don't require environment variables, the simplest way to debug +a unit test failure is often directly invoking it or even running it under +``gdb``. However there can still be differences in behavior between ``make`` +invocations and your manual run, due to ``$MALLOC_PERTURB_`` environment +variable (which affects memory reclamation and catches invalid pointers better) +and gtester options. If necessary, you can run + +.. code:: + + make check-unit V=1 + +and copy the actual command line which executes the unit test, then run +it from the command line. + +QTest +~~~~~ + +QTest is a device emulation testing framework. It can be very useful to test +device models; it could also control certain aspects of QEMU (such as virtual +clock stepping), with a special purpose "qtest" protocol. Refer to +:doc:`qtest` for more details. + +QTest cases can be executed with + +.. code:: + + make check-qtest + +QAPI schema tests +~~~~~~~~~~~~~~~~~ + +The QAPI schema tests validate the QAPI parser used by QMP, by feeding +predefined input to the parser and comparing the result with the reference +output. + +The input/output data is managed under the ``tests/qapi-schema`` directory. +Each test case includes four files that have a common base name: + + * ``${casename}.json`` - the file contains the JSON input for feeding the + parser + * ``${casename}.out`` - the file contains the expected stdout from the parser + * ``${casename}.err`` - the file contains the expected stderr from the parser + * ``${casename}.exit`` - the expected error code + +Consider adding a new QAPI schema test when you are making a change on the QAPI +parser (either fixing a bug or extending/modifying the syntax). To do this: + +1. Add four files for the new case as explained above. For example: + + ``$EDITOR tests/qapi-schema/foo.{json,out,err,exit}``. + +2. Add the new test in ``tests/Makefile.include``. For example: + + ``qapi-schema += foo.json`` + +check-block +~~~~~~~~~~~ + +``make check-block`` runs a subset of the block layer iotests (the tests that +are in the "auto" group). +See the "QEMU iotests" section below for more information. + +QEMU iotests +------------ + +QEMU iotests, under the directory ``tests/qemu-iotests``, is the testing +framework widely used to test block layer related features. It is higher level +than "make check" tests and 99% of the code is written in bash or Python +scripts. The testing success criteria is golden output comparison, and the +test files are named with numbers. + +To run iotests, make sure QEMU is built successfully, then switch to the +``tests/qemu-iotests`` directory under the build directory, and run ``./check`` +with desired arguments from there. + +By default, "raw" format and "file" protocol is used; all tests will be +executed, except the unsupported ones. You can override the format and protocol +with arguments: + +.. code:: + + # test with qcow2 format + ./check -qcow2 + # or test a different protocol + ./check -nbd + +It's also possible to list test numbers explicitly: + +.. code:: + + # run selected cases with qcow2 format + ./check -qcow2 001 030 153 + +Cache mode can be selected with the "-c" option, which may help reveal bugs +that are specific to certain cache mode. + +More options are supported by the ``./check`` script, run ``./check -h`` for +help. 
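
Putting the options above together, a typical invocation while debugging
might look like the following; the choice of format, cache mode and test
numbers here is purely illustrative:

.. code::

   # run a few selected cases with the qcow2 format and cache mode "none"
   ./check -qcow2 -c none 001 030 153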

Writing a new test case
~~~~~~~~~~~~~~~~~~~~~~~

Consider writing a test case when you are making any changes to the block
layer. An iotest case is usually the choice for that. There are already many
test cases, so it is possible that extending one of them may achieve the goal
and save the boilerplate to create one. (Unfortunately, there isn't a 100%
reliable way to find a related one out of hundreds of tests. One approach is
using ``git grep``.)

Usually an iotest case consists of two files. One is an executable that
produces output to stdout and stderr, the other is the expected reference
output. They are given the same number in file names. E.g. test script ``055``
and reference output ``055.out``.

In rare cases, when outputs differ between cache mode ``none`` and others, a
``.out.nocache`` file is added. In other cases, when outputs differ between
image formats, more than one ``.out`` file is created ending with the
respective format names, e.g. ``178.out.qcow2`` and ``178.out.raw``.

There isn't a hard rule about how to write a test script, but a new test is
usually a (copy and) modification of an existing case. There are a few
commonly used ways to create a test:

* A Bash script. It will make use of several environmental variables related
  to the testing procedure, and could source a group of ``common.*`` libraries
  for some common helper routines.

* A Python unittest script. Import ``iotests`` and create a subclass of
  ``iotests.QMPTestCase``, then call the ``iotests.main`` method. The downside
  of this approach is that the output is too scarce, and the script is
  considered harder to debug.

* A simple Python script without using the unittest module. This could also
  import ``iotests`` for launching QEMU and utilities etc, but it doesn't
  inherit from ``iotests.QMPTestCase`` therefore doesn't use the Python
  unittest execution. This is a combination of 1 and 2.

Pick the language per your preference since both Bash and Python have
comparable library support for invoking and interacting with QEMU programs. If
you opt for Python, it is strongly recommended to write Python 3 compatible
code.

Both Python and Bash frameworks in iotests provide helpers to manage test
images. They can be used to create and clean up images under the test
directory. If no I/O or any protocol specific feature is needed, it is often
more convenient to use the pseudo block driver, ``null-co://``, as the test
image, which doesn't require image creation or cleaning up. Avoid system-wide
devices or files whenever possible, such as ``/dev/null`` or ``/dev/zero``.
Otherwise, image locking implications have to be considered. For example,
another application on the host may have locked the file, possibly leading to
a test failure. If using such devices is explicitly desired, consider adding
the ``locking=off`` option to disable image locking.

Debugging a test case
~~~~~~~~~~~~~~~~~~~~~

The following options to the ``check`` script can be useful when debugging
a failing test:

* ``-gdb`` wraps every QEMU invocation in a ``gdbserver``, which waits for a
  connection from a gdb client. The options given to ``gdbserver`` (e.g. the
  address on which to listen for connections) are taken from the ``$GDB_OPTIONS``
  environment variable. By default (if ``$GDB_OPTIONS`` is empty), it listens on
  ``localhost:12345``.
  It is possible to connect to it for example with
  ``gdb -iex "target remote $addr"``, where ``$addr`` is the address
  ``gdbserver`` listens on.
+ If the ``-gdb`` option is not used, ``$GDB_OPTIONS`` is ignored, + regardless of whether it is set or not. + +* ``-valgrind`` attaches a valgrind instance to QEMU. If it detects + warnings, it will print and save the log in + ``$TEST_DIR/<valgrind_pid>.valgrind``. + The final command line will be ``valgrind --log-file=$TEST_DIR/ + <valgrind_pid>.valgrind --error-exitcode=99 $QEMU ...`` + +* ``-d`` (debug) just increases the logging verbosity, showing + for example the QMP commands and answers. + +* ``-p`` (print) redirects QEMU’s stdout and stderr to the test output, + instead of saving it into a log file in + ``$TEST_DIR/qemu-machine-<random_string>``. + +Test case groups +~~~~~~~~~~~~~~~~ + +"Tests may belong to one or more test groups, which are defined in the form +of a comment in the test source file. By convention, test groups are listed +in the second line of the test file, after the "#!/..." line, like this: + +.. code:: + + #!/usr/bin/env python3 + # group: auto quick + # + ... + +Another way of defining groups is creating the tests/qemu-iotests/group.local +file. This should be used only for downstream (this file should never appear +in upstream). This file may be used for defining some downstream test groups +or for temporarily disabling tests, like this: + +.. code:: + + # groups for some company downstream process + # + # ci - tests to run on build + # down - our downstream tests, not for upstream + # + # Format of each line is: + # TEST_NAME TEST_GROUP [TEST_GROUP ]... + + 013 ci + 210 disabled + 215 disabled + our-ugly-workaround-test down ci + +Note that the following group names have a special meaning: + +- quick: Tests in this group should finish within a few seconds. + +- auto: Tests in this group are used during "make check" and should be + runnable in any case. That means they should run with every QEMU binary + (also non-x86), with every QEMU configuration (i.e. must not fail if + an optional feature is not compiled in - but reporting a "skip" is ok), + work at least with the qcow2 file format, work with all kind of host + filesystems and users (e.g. "nobody" or "root") and must not take too + much memory and disk space (since CI pipelines tend to fail otherwise). + +- disabled: Tests in this group are disabled and ignored by check. + +.. _container-ref: + +Container based tests +--------------------- + +Introduction +~~~~~~~~~~~~ + +The container testing framework in QEMU utilizes public images to +build and test QEMU in predefined and widely accessible Linux +environments. This makes it possible to expand the test coverage +across distros, toolchain flavors and library versions. The support +was originally written for Docker although we also support Podman as +an alternative container runtime. Although the many of the target +names and scripts are prefixed with "docker" the system will +automatically run on whichever is configured. + +The container images are also used to augment the generation of tests +for testing TCG. See :ref:`checktcg-ref` for more details. + +Docker Prerequisites +~~~~~~~~~~~~~~~~~~~~ + +Install "docker" with the system package manager and start the Docker service +on your development machine, then make sure you have the privilege to run +Docker commands. Typically it means setting up passwordless ``sudo docker`` +command or login as root. For example: + +.. code:: + + $ sudo yum install docker + $ # or `apt-get install docker` for Ubuntu, etc. 
+ $ sudo systemctl start docker + $ sudo docker ps + +The last command should print an empty table, to verify the system is ready. + +An alternative method to set up permissions is by adding the current user to +"docker" group and making the docker daemon socket file (by default +``/var/run/docker.sock``) accessible to the group: + +.. code:: + + $ sudo groupadd docker + $ sudo usermod $USER -a -G docker + $ sudo chown :docker /var/run/docker.sock + +Note that any one of above configurations makes it possible for the user to +exploit the whole host with Docker bind mounting or other privileged +operations. So only do it on development machines. + +Podman Prerequisites +~~~~~~~~~~~~~~~~~~~~ + +Install "podman" with the system package manager. + +.. code:: + + $ sudo dnf install podman + $ podman ps + +The last command should print an empty table, to verify the system is ready. + +Quickstart +~~~~~~~~~~ + +From source tree, type ``make docker-help`` to see the help. Testing +can be started without configuring or building QEMU (``configure`` and +``make`` are done in the container, with parameters defined by the +make target): + +.. code:: + + make docker-test-build@centos8 + +This will create a container instance using the ``centos8`` image (the image +is downloaded and initialized automatically), in which the ``test-build`` job +is executed. + +Registry +~~~~~~~~ + +The QEMU project has a container registry hosted by GitLab at +``registry.gitlab.com/qemu-project/qemu`` which will automatically be +used to pull in pre-built layers. This avoids unnecessary strain on +the distro archives created by multiple developers running the same +container build steps over and over again. This can be overridden +locally by using the ``NOCACHE`` build option: + +.. code:: + + make docker-image-debian10 NOCACHE=1 + +Images +~~~~~~ + +Along with many other images, the ``centos8`` image is defined in a Dockerfile +in ``tests/docker/dockerfiles/``, called ``centos8.docker``. ``make docker-help`` +command will list all the available images. + +To add a new image, simply create a new ``.docker`` file under the +``tests/docker/dockerfiles/`` directory. + +A ``.pre`` script can be added beside the ``.docker`` file, which will be +executed before building the image under the build context directory. This is +mainly used to do necessary host side setup. One such setup is ``binfmt_misc``, +for example, to make qemu-user powered cross build containers work. + +Tests +~~~~~ + +Different tests are added to cover various configurations to build and test +QEMU. Docker tests are the executables under ``tests/docker`` named +``test-*``. They are typically shell scripts and are built on top of a shell +library, ``tests/docker/common.rc``, which provides helpers to find the QEMU +source and build it. + +The full list of tests is printed in the ``make docker-help`` help. + +Debugging a Docker test failure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When CI tasks, maintainers or yourself report a Docker test failure, follow the +below steps to debug it: + +1. Locally reproduce the failure with the reported command line. E.g. run + ``make docker-test-mingw@fedora J=8``. +2. Add "V=1" to the command line, try again, to see the verbose output. +3. Further add "DEBUG=1" to the command line. This will pause in a shell prompt + in the container right before testing starts. You could either manually + build QEMU and run tests from there, or press Ctrl-D to let the Docker + testing continue. +4. 
If you press Ctrl-D, the same building and testing procedure will begin, and + will hopefully run into the error again. After that, you will be dropped to + the prompt for debug. + +Options +~~~~~~~ + +Various options can be used to affect how Docker tests are done. The full +list is in the ``make docker`` help text. The frequently used ones are: + +* ``V=1``: the same as in top level ``make``. It will be propagated to the + container and enable verbose output. +* ``J=$N``: the number of parallel tasks in make commands in the container, + similar to the ``-j $N`` option in top level ``make``. (The ``-j`` option in + top level ``make`` will not be propagated into the container.) +* ``DEBUG=1``: enables debug. See the previous "Debugging a Docker test + failure" section. + +Thread Sanitizer +---------------- + +Thread Sanitizer (TSan) is a tool which can detect data races. QEMU supports +building and testing with this tool. + +For more information on TSan: + +https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual + +Thread Sanitizer in Docker +~~~~~~~~~~~~~~~~~~~~~~~~~~ +TSan is currently supported in the ubuntu2004 docker. + +The test-tsan test will build using TSan and then run make check. + +.. code:: + + make docker-test-tsan@ubuntu2004 + +TSan warnings under docker are placed in files located at build/tsan/. + +We recommend using DEBUG=1 to allow launching the test from inside the docker, +and to allow review of the warnings generated by TSan. + +Building and Testing with TSan +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to build and test with TSan, with a few additional steps. +These steps are normally done automatically in the docker. + +There is a one time patch needed in clang-9 or clang-10 at this time: + +.. code:: + + sed -i 's/^const/static const/g' \ + /usr/lib/llvm-10/lib/clang/10.0.0/include/sanitizer/tsan_interface.h + +To configure the build for TSan: + +.. code:: + + ../configure --enable-tsan --cc=clang-10 --cxx=clang++-10 \ + --disable-werror --extra-cflags="-O0" + +The runtime behavior of TSAN is controlled by the TSAN_OPTIONS environment +variable. + +More information on the TSAN_OPTIONS can be found here: + +https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags + +For example: + +.. code:: + + export TSAN_OPTIONS=suppressions=<path to qemu>/tests/tsan/suppressions.tsan \ + detect_deadlocks=false history_size=7 exitcode=0 \ + log_path=<build path>/tsan/tsan_warning + +The above exitcode=0 has TSan continue without error if any warnings are found. +This allows for running the test and then checking the warnings afterwards. +If you want TSan to stop and exit with error on warnings, use exitcode=66. + +TSan Suppressions +~~~~~~~~~~~~~~~~~ +Keep in mind that for any data race warning, although there might be a data race +detected by TSan, there might be no actual bug here. TSan provides several +different mechanisms for suppressing warnings. In general it is recommended +to fix the code if possible to eliminate the data race rather than suppress +the warning. + +A few important files for suppressing warnings are: + +tests/tsan/suppressions.tsan - Has TSan warnings we wish to suppress at runtime. +The comment on each suppression will typically indicate why we are +suppressing it. More information on the file format can be found here: + +https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions + +tests/tsan/blacklist.tsan - Has TSan warnings we wish to disable +at compile time for test or debug. 
+Add flags to configure to enable: + +"--extra-cflags=-fsanitize-blacklist=<src path>/tests/tsan/blacklist.tsan" + +More information on the file format can be found here under "Blacklist Format": + +https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags + +TSan Annotations +~~~~~~~~~~~~~~~~ +include/qemu/tsan.h defines annotations. See this file for more descriptions +of the annotations themselves. Annotations can be used to suppress +TSan warnings or give TSan more information so that it can detect proper +relationships between accesses of data. + +Annotation examples can be found here: + +https://github.com/llvm/llvm-project/tree/master/compiler-rt/test/tsan/ + +Good files to start with are: annotate_happens_before.cpp and ignore_race.cpp + +The full set of annotations can be found here: + +https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp + +VM testing +---------- + +This test suite contains scripts that bootstrap various guest images that have +necessary packages to build QEMU. The basic usage is documented in ``Makefile`` +help which is displayed with ``make vm-help``. + +Quickstart +~~~~~~~~~~ + +Run ``make vm-help`` to list available make targets. Invoke a specific make +command to run build test in an image. For example, ``make vm-build-freebsd`` +will build the source tree in the FreeBSD image. The command can be executed +from either the source tree or the build dir; if the former, ``./configure`` is +not needed. The command will then generate the test image in ``./tests/vm/`` +under the working directory. + +Note: images created by the scripts accept a well-known RSA key pair for SSH +access, so they SHOULD NOT be exposed to external interfaces if you are +concerned about attackers taking control of the guest and potentially +exploiting a QEMU security bug to compromise the host. + +QEMU binaries +~~~~~~~~~~~~~ + +By default, ``qemu-system-x86_64`` is searched in $PATH to run the guest. If +there isn't one, or if it is older than 2.10, the test won't work. In this case, +provide the QEMU binary in env var: ``QEMU=/path/to/qemu-2.10+``. + +Likewise the path to ``qemu-img`` can be set in QEMU_IMG environment variable. + +Make jobs +~~~~~~~~~ + +The ``-j$X`` option in the make command line is not propagated into the VM, +specify ``J=$X`` to control the make jobs in the guest. + +Debugging +~~~~~~~~~ + +Add ``DEBUG=1`` and/or ``V=1`` to the make command to allow interactive +debugging and verbose output. If this is not enough, see the next section. +``V=1`` will be propagated down into the make jobs in the guest. + +Manual invocation +~~~~~~~~~~~~~~~~~ + +Each guest script is an executable script with the same command line options. +For example to work with the netbsd guest, use ``$QEMU_SRC/tests/vm/netbsd``: + +.. code:: + + $ cd $QEMU_SRC/tests/vm + + # To bootstrap the image + $ ./netbsd --build-image --image /var/tmp/netbsd.img + <...> + + # To run an arbitrary command in guest (the output will not be echoed unless + # --debug is added) + $ ./netbsd --debug --image /var/tmp/netbsd.img uname -a + + # To build QEMU in guest + $ ./netbsd --debug --image /var/tmp/netbsd.img --build-qemu $QEMU_SRC + + # To get to an interactive shell + $ ./netbsd --interactive --image /var/tmp/netbsd.img sh + +Adding new guests +~~~~~~~~~~~~~~~~~ + +Please look at existing guest scripts for how to add new guests. 
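
As a rough orientation, a minimal (and hypothetical) guest script might be
shaped like the sketch below; the distro name, URL and checksum are
placeholders, and the individual requirements the script must follow are
described in the paragraphs after it:

.. code::

  #!/usr/bin/env python3
  # Hypothetical sketch of a new guest script; see tests/vm/netbsd and
  # friends for real examples.
  import sys
  import basevm

  class MyDistroVM(basevm.BaseVM):
      name = "mydistro"          # placeholder guest name
      arch = "x86_64"

      # Shell template that untars the QEMU source tarball from the raw
      # virtio-blk device and builds it (see the description below).
      BUILD_SCRIPT = """
          set -e
          cd $(mktemp -d)
          tar -xf /dev/vdb
          ./configure
          make -j$(nproc)
          make check
      """

      def build_image(self, img):
          # Download and customize a template image; the helper caches the
          # download and verifies the checksum (URL/checksum are placeholders).
          cimg = self._download_with_cache(
              "https://example.org/mydistro.img.xz",
              sha256sum="0" * 64)
          # ... decompress cimg into img, set up users, SSH and packages ...

  if __name__ == "__main__":
      sys.exit(basevm.main(MyDistroVM))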

Most importantly, create a subclass of BaseVM, implement the ``build_image()``
method and define ``BUILD_SCRIPT``, then finally call ``basevm.main()`` from
the script's ``main()``.

* Usually in ``build_image()``, a template image is downloaded from a
  predefined URL. ``BaseVM._download_with_cache()`` takes care of the cache and
  the checksum, so consider using it.

* Once the image is downloaded, users, SSH server and QEMU build deps should
  be set up:

  - Root password set to ``BaseVM.ROOT_PASS``
  - User ``BaseVM.GUEST_USER`` is created, and password set to
    ``BaseVM.GUEST_PASS``
  - SSH service is enabled and started on boot,
    ``$QEMU_SRC/tests/keys/id_rsa.pub`` is added to ssh's ``authorized_keys``
    file of both root and the normal user
  - DHCP client service is enabled and started on boot, so that it can
    automatically configure the virtio-net-pci NIC and communicate with QEMU
    user net (10.0.2.2)
  - Necessary packages are installed to untar the source tarball and build
    QEMU

* Write a proper ``BUILD_SCRIPT`` template, which should be a shell script that
  untars the QEMU source tarball from a raw virtio-blk block device, then
  configures and builds it. Running "make check" is also recommended.

Image fuzzer testing
--------------------

An image fuzzer was added to exercise format drivers. Currently only qcow2 is
supported. To start the fuzzer, run:

.. code::

  tests/image-fuzzer/runner.py -c '[["qemu-img", "info", "$test_img"]]' /tmp/test qcow2

Alternatively, a command other than ``qemu-img info`` can be tested by
changing the ``-c`` option.

Integration tests using the Avocado Framework
---------------------------------------------

The ``tests/avocado`` directory hosts integration tests. They're usually
higher level tests, and may interact with external resources and with
various guest operating systems.

These tests are written using the Avocado Testing Framework (which must
be installed separately) in conjunction with the ``avocado_qemu.Test``
class, implemented at ``tests/avocado/avocado_qemu``.

Tests based on ``avocado_qemu.Test`` can easily:

 * Customize the command line arguments given to the convenience
   ``self.vm`` attribute (a QEMUMachine instance)

 * Interact with the QEMU monitor, send QMP commands and check
   their results

 * Interact with the guest OS, using the convenience console device
   (which may be useful to assert the effectiveness and correctness of
   command line arguments or QMP commands)

 * Interact with external data files that accompany the test itself
   (see ``self.get_data()``)

 * Download (and cache) remote data files, such as firmware and kernel
   images

 * Have access to a library of guest OS images (by means of the
   ``avocado.utils.vmimage`` library)

 * Make use of various other test related utilities available at the
   test class itself and at the utility library:

   - http://avocado-framework.readthedocs.io/en/latest/api/test/avocado.html#avocado.Test
   - http://avocado-framework.readthedocs.io/en/latest/api/utils/avocado.utils.html

Running tests
~~~~~~~~~~~~~

You can run the avocado tests simply by executing:

.. code::

  make check-avocado

This involves the automatic creation of a Python virtual environment
within the build tree (at ``tests/venv``) which will have all the
right dependencies, and will also save test results within the
build tree (at ``tests/results``).
+ +Note: the build environment must be using a Python 3 stack, and have +the ``venv`` and ``pip`` packages installed. If necessary, make sure +``configure`` is called with ``--python=`` and that those modules are +available. On Debian and Ubuntu based systems, depending on the +specific version, they may be on packages named ``python3-venv`` and +``python3-pip``. + +It is also possible to run tests based on tags using the +``make check-avocado`` command and the ``AVOCADO_TAGS`` environment +variable: + +.. code:: + + make check-avocado AVOCADO_TAGS=quick + +Note that tags separated with commas have an AND behavior, while tags +separated by spaces have an OR behavior. For more information on Avocado +tags, see: + + https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/tags.html + +To run a single test file, a couple of them, or a test within a file +using the ``make check-avocado`` command, set the ``AVOCADO_TESTS`` +environment variable with the test files or test names. To run all +tests from a single file, use: + + .. code:: + + make check-avocado AVOCADO_TESTS=$FILEPATH + +The same is valid to run tests from multiple test files: + + .. code:: + + make check-avocado AVOCADO_TESTS='$FILEPATH1 $FILEPATH2' + +To run a single test within a file, use: + + .. code:: + + make check-avocado AVOCADO_TESTS=$FILEPATH:$TESTCLASS.$TESTNAME + +The same is valid to run single tests from multiple test files: + + .. code:: + + make check-avocado AVOCADO_TESTS='$FILEPATH1:$TESTCLASS1.$TESTNAME1 $FILEPATH2:$TESTCLASS2.$TESTNAME2' + +The scripts installed inside the virtual environment may be used +without an "activation". For instance, the Avocado test runner +may be invoked by running: + + .. code:: + + tests/venv/bin/avocado run $OPTION1 $OPTION2 tests/avocado/ + +Note that if ``make check-avocado`` was not executed before, it is +possible to create the Python virtual environment with the dependencies +needed running: + + .. code:: + + make check-venv + +It is also possible to run tests from a single file or a single test within +a test file. To run tests from a single file within the build tree, use: + + .. code:: + + tests/venv/bin/avocado run tests/avocado/$TESTFILE + +To run a single test within a test file, use: + + .. code:: + + tests/venv/bin/avocado run tests/avocado/$TESTFILE:$TESTCLASS.$TESTNAME + +Valid test names are visible in the output from any previous execution +of Avocado or ``make check-avocado``, and can also be queried using: + + .. code:: + + tests/venv/bin/avocado list tests/avocado + +Manual Installation +~~~~~~~~~~~~~~~~~~~ + +To manually install Avocado and its dependencies, run: + +.. code:: + + pip install --user avocado-framework + +Alternatively, follow the instructions on this link: + + https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/installing.html + +Overview +~~~~~~~~ + +The ``tests/avocado/avocado_qemu`` directory provides the +``avocado_qemu`` Python module, containing the ``avocado_qemu.Test`` +class. Here's a simple usage example: + +.. code:: + + from avocado_qemu import QemuSystemTest + + + class Version(QemuSystemTest): + """ + :avocado: tags=quick + """ + def test_qmp_human_info_version(self): + self.vm.launch() + res = self.vm.command('human-monitor-command', + command_line='info version') + self.assertRegexpMatches(res, r'^(\d+\.\d+\.\d)') + +To execute your test, run: + +.. 
code::

  avocado run version.py

Tests may be classified according to a convention by using docstring
directives such as ``:avocado: tags=TAG1,TAG2``. To run all tests
in the current directory, tagged as "quick", run:

.. code::

  avocado run -t quick .

The ``avocado_qemu.Test`` base test class
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``avocado_qemu.Test`` class has a number of characteristics that
are worth mentioning right away.

First of all, it attempts to give each test a ready to use QEMUMachine
instance, available at ``self.vm``. Because many tests will tweak the
QEMU command line, launching the QEMUMachine (by using ``self.vm.launch()``)
is left to the test writer.

The base test class also has support for tests with more than one
QEMUMachine. The way to get machines is through the ``self.get_vm()``
method which will return a QEMUMachine instance. The ``self.get_vm()``
method accepts arguments that will be passed to the QEMUMachine creation
and also an optional ``name`` attribute so you can identify a specific
machine and get it more than once throughout the test's methods. A simple
and hypothetical example follows:

.. code::

  from avocado_qemu import QemuSystemTest


  class MultipleMachines(QemuSystemTest):
      def test_multiple_machines(self):
          first_machine = self.get_vm()
          second_machine = self.get_vm()
          self.get_vm(name='third_machine').launch()

          first_machine.launch()
          second_machine.launch()

          first_res = first_machine.command(
              'human-monitor-command',
              command_line='info version')

          second_res = second_machine.command(
              'human-monitor-command',
              command_line='info version')

          third_res = self.get_vm(name='third_machine').command(
              'human-monitor-command',
              command_line='info version')

          self.assertEquals(first_res, second_res, third_res)

At test "tear down", ``avocado_qemu.Test`` handles the shutdown of all
QEMUMachines.

The ``avocado_qemu.LinuxTest`` base test class
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``avocado_qemu.LinuxTest`` is a further specialization of the
``avocado_qemu.Test`` class, so it contains all the characteristics of
the latter plus some extra features.

First of all, this base class is intended for tests that need to
interact with a fully booted and operational Linux guest. At this
time, it uses a Fedora 31 guest image. The most basic example looks
like this:

.. code::

  from avocado_qemu import LinuxTest


  class SomeTest(LinuxTest):

      def test(self):
          self.launch_and_wait()
          self.ssh_command('some_command_to_be_run_in_the_guest')

Please refer to tests that use ``avocado_qemu.LinuxTest`` under
``tests/avocado`` for more examples.

QEMUMachine
~~~~~~~~~~~

The QEMUMachine API is already widely used in the Python iotests,
device-crash-test and other Python scripts. It's a wrapper around the
execution of a QEMU binary, giving its users:

 * the ability to set command line arguments to be given to the QEMU
   binary

 * a ready to use QMP connection and interface, which can be used to
   send commands and inspect their results, as well as asynchronous
   events

 * convenience methods to set commonly used command line arguments in
   a more succinct and intuitive way

QEMU binary selection
^^^^^^^^^^^^^^^^^^^^^

The QEMU binary used for the ``self.vm`` QEMUMachine instance will
primarily depend on the value of the ``qemu_bin`` parameter.
If it's
not explicitly set, its default value will be the result of a dynamic
probe in the same source tree. A suitable binary will be one that
targets the architecture matching the host machine.

Based on this description, test writers will usually rely on one of
the following approaches:

1) Set ``qemu_bin``, and use the given binary

2) Do not set ``qemu_bin``, and use a QEMU binary named like
   "qemu-system-${arch}", either in the current
   working directory, or in the current source tree.

The resulting ``qemu_bin`` value will be preserved in the
``avocado_qemu.Test`` as an attribute with the same name.

Attribute reference
~~~~~~~~~~~~~~~~~~~

Test
^^^^

Besides the attributes and methods that are part of the base
``avocado.Test`` class, the following attributes are available on any
``avocado_qemu.Test`` instance.

vm
''

A QEMUMachine instance, initially configured according to the given
``qemu_bin`` parameter.

arch
''''

The architecture can be used on different levels of the stack, e.g. by
the framework or by the test itself. At the framework level, it will
currently influence the selection of a QEMU binary (when one is not
explicitly given).

Tests are also free to use this attribute value, for their own needs.
A test may, for instance, use the same value when selecting the
architecture of a kernel or disk image to boot a VM with.

The ``arch`` attribute will be set to the test parameter of the same
name. If one is not given explicitly, it will either be set to
``None``, or, if the test is tagged with one (and only one)
``:avocado: tags=arch:VALUE`` tag, it will be set to ``VALUE``.

cpu
'''

The cpu model that will be set to all QEMUMachine instances created
by the test.

The ``cpu`` attribute will be set to the test parameter of the same
name. If one is not given explicitly, it will either be set to
``None``, or, if the test is tagged with one (and only one)
``:avocado: tags=cpu:VALUE`` tag, it will be set to ``VALUE``.

machine
'''''''

The machine type that will be set to all QEMUMachine instances created
by the test.

The ``machine`` attribute will be set to the test parameter of the same
name. If one is not given explicitly, it will either be set to
``None``, or, if the test is tagged with one (and only one)
``:avocado: tags=machine:VALUE`` tag, it will be set to ``VALUE``.

qemu_bin
''''''''

The preserved value of the ``qemu_bin`` parameter or the result of the
dynamic probe for a QEMU binary in the current working directory or
source tree.

LinuxTest
^^^^^^^^^

Besides the attributes present on the ``avocado_qemu.Test`` base
class, the ``avocado_qemu.LinuxTest`` adds the following attributes:

distro
''''''

The name of the Linux distribution used as the guest image for the
test. The name should match the **Provider** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_version
''''''''''''''

The version of the Linux distribution used as the guest image for the
test. The name should match the **Version** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_checksum
'''''''''''''''

The sha256 hash of the guest image file used for the test.

If this value is not set in the code or by a test parameter (with the
same name), no validation on the integrity of the image will be
performed.

Parameter reference
~~~~~~~~~~~~~~~~~~~

To understand how Avocado parameters are accessed by tests, and how
they can be passed to tests, please refer to::

  https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#accessing-test-parameters

Parameter values can be easily seen in the log files, and will look
like the following:

.. code::

  PARAMS (key=qemu_bin, path=*, default=./qemu-system-x86_64) => './qemu-system-x86_64'

Test
^^^^

arch
''''

The architecture that will influence the selection of a QEMU binary
(when one is not explicitly given).

Tests are also free to use this parameter value, for their own needs.
A test may, for instance, use the same value when selecting the
architecture of a kernel or disk image to boot a VM with.

This parameter has a direct relation with the ``arch`` attribute. If
not given, it will default to ``None``.

cpu
'''

The cpu model that will be set to all QEMUMachine instances created
by the test.

machine
'''''''

The machine type that will be set to all QEMUMachine instances created
by the test.

qemu_bin
''''''''

The exact QEMU binary to be used by QEMUMachine.

LinuxTest
^^^^^^^^^

Besides the parameters present on the ``avocado_qemu.Test`` base
class, the ``avocado_qemu.LinuxTest`` adds the following parameters:

distro
''''''

The name of the Linux distribution used as the guest image for the
test. The name should match the **Provider** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_version
''''''''''''''

The version of the Linux distribution used as the guest image for the
test. The name should match the **Version** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_checksum
'''''''''''''''

The sha256 hash of the guest image file used for the test.

If this value is not set in the code or by this parameter, no
validation on the integrity of the image will be performed.

Skipping tests
~~~~~~~~~~~~~~

The Avocado framework provides Python decorators which make it easy to skip
tests under certain conditions, for example, when a binary is missing on the
test system or when the running environment is a CI system. For further
information about those decorators, please refer to::

  https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#skipping-tests

While the conditions for skipping tests are often specific to each test, there
are recurring scenarios identified by the QEMU developers, and the use of
environment variables has become a kind of standard way to enable/disable
tests.

Here is a list of the most used variables:

AVOCADO_ALLOW_LARGE_STORAGE
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Tests which are going to fetch or produce assets considered *large* are not
going to run unless ``AVOCADO_ALLOW_LARGE_STORAGE=1`` is exported in
the environment.

The definition of *large* is a bit arbitrary here, but it usually means an
asset which occupies at least 1GB on disk when uncompressed.
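
As an illustration, a test that needs a large asset is typically guarded with
Avocado's ``skipUnless`` decorator (the class and test names below are made
up):

.. code::

  import os

  from avocado import skipUnless
  from avocado_qemu import QemuSystemTest


  class BigImageBoot(QemuSystemTest):

      @skipUnless(os.getenv('AVOCADO_ALLOW_LARGE_STORAGE'), 'storage limited')
      def test_boot_large_image(self):
          # fetches a multi-gigabyte asset; skipped unless
          # AVOCADO_ALLOW_LARGE_STORAGE=1 is exported in the environment
          ...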

AVOCADO_ALLOW_UNTRUSTED_CODE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are tests which will boot a kernel image or firmware that can be
considered not safe to run on the developer's workstation, thus they are
skipped by default. The definition of *not safe* is also arbitrary, but
usually it means a blob whose source or build process is not publicly
available.

You should export ``AVOCADO_ALLOW_UNTRUSTED_CODE=1`` in the environment in
order to allow tests which make use of those kinds of assets.

AVOCADO_TIMEOUT_EXPECTED
^^^^^^^^^^^^^^^^^^^^^^^^
The Avocado framework has a timeout mechanism which interrupts tests to
prevent the test suite from getting stuck. The timeout value can be set via
a test parameter or a property defined in the test class; for further
details see::

  https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#setting-a-test-timeout

Even though the timeout can be set by the test developer, there are some tests
that may not have a well-defined time limit under certain conditions. For
example, tests that take longer to execute when QEMU is compiled with debug
flags. Therefore, the ``AVOCADO_TIMEOUT_EXPECTED`` variable has been used to
determine whether those tests should run or not.

GITLAB_CI
^^^^^^^^^
A number of tests are flagged to not run on the GitLab CI. Usually this is
because they proved to be flaky or there are constraints on the CI environment
which would make them fail. If you encounter a similar situation then use that
variable as shown in the code snippet below to skip the test:

.. code::

  @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
  def test(self):
      do_something()

Uninstalling Avocado
~~~~~~~~~~~~~~~~~~~~

If you've followed the manual installation instructions above, you can
easily uninstall Avocado. Start by listing the packages you have
installed::

  pip list --user

And remove any package you want with::

  pip uninstall <package_name>

If you've used ``make check-avocado``, the Python virtual environment where
Avocado is installed will be cleaned up as part of ``make check-clean``.

.. _checktcg-ref:

Testing with "make check-tcg"
-----------------------------

The check-tcg tests are intended for simple smoke tests of both
linux-user and softmmu TCG functionality. However, to build test
programs for guest targets you need to have cross compilers available.
If your distribution supports cross compilers you can do something as
simple as::

  apt install gcc-aarch64-linux-gnu

The configure script will automatically pick up their presence.
Sometimes compilers have slightly odd names, so you may need to point
configure at them by passing the appropriate option for the
architecture in question, for example::

  $(configure) --cross-cc-aarch64=aarch64-cc

There is also a ``--cross-cc-flags-ARCH`` flag in case additional
compiler flags are needed to build for a given target.

If you have the ability to run containers as the user, the build system
will automatically use them where no system compiler is available. For
architectures where we also support building QEMU we will generally
use the same container to build tests. However there are a number of
additional containers defined that have a minimal cross-build
environment that is only suitable for building test cases. Sometimes
we may use a bleeding edge distribution for compiler features needed
for test cases that aren't yet in the LTS distros we support for QEMU
itself.
+ +See :ref:`container-ref` for more details. + +Running subset of tests +~~~~~~~~~~~~~~~~~~~~~~~ + +You can build the tests for one architecture:: + + make build-tcg-tests-$TARGET + +And run with:: + + make run-tcg-tests-$TARGET + +Adding ``V=1`` to the invocation will show the details of how to +invoke QEMU for the test which is useful for debugging tests. + +TCG test dependencies +~~~~~~~~~~~~~~~~~~~~~ + +The TCG tests are deliberately very light on dependencies and are +either totally bare with minimal gcc lib support (for softmmu tests) +or just glibc (for linux-user tests). This is because getting a cross +compiler to work with additional libraries can be challenging. + +Other TCG Tests +--------------- + +There are a number of out-of-tree test suites that are used for more +extensive testing of processor features. + +KVM Unit Tests +~~~~~~~~~~~~~~ + +The KVM unit tests are designed to run as a Guest OS under KVM but +there is no reason why they can't exercise the TCG as well. It +provides a minimal OS kernel with hooks for enabling the MMU as well +as reporting test results via a special device:: + + https://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git + +Linux Test Project +~~~~~~~~~~~~~~~~~~ + +The LTP is focused on exercising the syscall interface of a Linux +kernel. It checks that syscalls behave as documented and strives to +exercise as many corner cases as possible. It is a useful test suite +to run to exercise QEMU's linux-user code:: + + https://linux-test-project.github.io/ + +GCC gcov support +---------------- + +``gcov`` is a GCC tool to analyze the testing coverage by +instrumenting the tested code. To use it, configure QEMU with +``--enable-gcov`` option and build. Then run the tests as usual. + +If you want to gather coverage information on a single test the ``make +clean-gcda`` target can be used to delete any existing coverage +information before running a single test. + +You can generate a HTML coverage report by executing ``make +coverage-html`` which will create +``meson-logs/coveragereport/index.html``. + +Further analysis can be conducted by running the ``gcov`` command +directly on the various .gcda output files. Please read the ``gcov`` +documentation for more information. diff --git a/docs/devel/tracing.rst b/docs/devel/tracing.rst new file mode 100644 index 000000000..ba8395489 --- /dev/null +++ b/docs/devel/tracing.rst @@ -0,0 +1,498 @@ +======= +Tracing +======= + +Introduction +============ + +This document describes the tracing infrastructure in QEMU and how to use it +for debugging, profiling, and observing execution. + +Quickstart +========== + +Enable tracing of ``memory_region_ops_read`` and ``memory_region_ops_write`` +events:: + + $ qemu --trace "memory_region_ops_*" ... + ... + 719585@1608130130.441188:memory_region_ops_read cpu 0 mr 0x562fdfbb3820 addr 0x3cc value 0x67 size 1 + 719585@1608130130.441190:memory_region_ops_write cpu 0 mr 0x562fdfbd2f00 addr 0x3d4 value 0x70e size 2 + +This output comes from the "log" trace backend that is enabled by default when +``./configure --enable-trace-backends=BACKENDS`` was not explicitly specified. + +Multiple patterns can be specified by repeating the ``--trace`` option:: + + $ qemu --trace "kvm_*" --trace "virtio_*" ... + +When patterns are used frequently it is more convenient to store them in a +file to avoid long command-line options:: + + $ echo "memory_region_ops_*" >/tmp/events + $ echo "kvm_*" >>/tmp/events + $ qemu --trace events=/tmp/events ... 
+ +Trace events +============ + +Sub-directory setup +------------------- + +Each directory in the source tree can declare a set of trace events in a local +"trace-events" file. All directories which contain "trace-events" files must be +listed in the "trace_events_subdirs" variable in the top level meson.build +file. During build, the "trace-events" file in each listed subdirectory will be +processed by the "tracetool" script to generate code for the trace events. + +The individual "trace-events" files are merged into a "trace-events-all" file, +which is also installed into "/usr/share/qemu" with the name "trace-events". +This merged file is to be used by the "simpletrace.py" script to later analyse +traces in the simpletrace data format. + +The following files are automatically generated in <builddir>/trace/ during the +build: + + - trace-<subdir>.c - the trace event state declarations + - trace-<subdir>.h - the trace event enums and probe functions + - trace-dtrace-<subdir>.h - DTrace event probe specification + - trace-dtrace-<subdir>.dtrace - DTrace event probe helper declaration + - trace-dtrace-<subdir>.o - binary DTrace provider (generated by dtrace) + - trace-ust-<subdir>.h - UST event probe helper declarations + +Here <subdir> is the sub-directory path with '/' replaced by '_'. For example, +"accel/kvm" becomes "accel_kvm" and the final filename for "trace-<subdir>.c" +becomes "trace-accel_kvm.c". + +Source files in the source tree do not directly include generated files in +"<builddir>/trace/". Instead they #include the local "trace.h" file, without +any sub-directory path prefix. eg io/channel-buffer.c would do:: + + #include "trace.h" + +The "io/trace.h" file must be created manually with an #include of the +corresponding "trace/trace-<subdir>.h" file that will be generated in the +builddir:: + + $ echo '#include "trace/trace-io.h"' >io/trace.h + +While it is possible to include a trace.h file from outside a source file's own +sub-directory, this is discouraged in general. It is strongly preferred that +all events be declared directly in the sub-directory that uses them. The only +exception is where there are some shared trace events defined in the top level +directory trace-events file. The top level directory generates trace files +with a filename prefix of "trace/trace-root" instead of just "trace". This is +to avoid ambiguity between a trace.h in the current directory, vs the top level +directory. + +Using trace events +------------------ + +Trace events are invoked directly from source code like this:: + + #include "trace.h" /* needed for trace event prototype */ + + void *qemu_vmalloc(size_t size) + { + void *ptr; + size_t align = QEMU_VMALLOC_ALIGN; + + if (size < align) { + align = getpagesize(); + } + ptr = qemu_memalign(align, size); + trace_qemu_vmalloc(size, ptr); + return ptr; + } + +Declaring trace events +---------------------- + +The "tracetool" script produces the trace.h header file which is included by +every source file that uses trace events. Since many source files include +trace.h, it uses a minimum of types and other header files included to keep the +namespace clean and compile times and dependencies down. + +Trace events should use types as follows: + + * Use stdint.h types for fixed-size types. Most offsets and guest memory + addresses are best represented with uint32_t or uint64_t. 
Use fixed-size + types over primitive types whose size may change depending on the host + (32-bit versus 64-bit) so trace events don't truncate values or break + the build. + + * Use void * for pointers to structs or for arrays. The trace.h header + cannot include all user-defined struct declarations and it is therefore + necessary to use void * for pointers to structs. + + * For everything else, use primitive scalar types (char, int, long) with the + appropriate signedness. + + * Avoid floating point types (float and double) because SystemTap does not + support them. In most cases it is possible to round to an integer type + instead. This may require scaling the value first by multiplying it by 1000 + or the like when digits after the decimal point need to be preserved. + +Format strings should reflect the types defined in the trace event. Take +special care to use PRId64 and PRIu64 for int64_t and uint64_t types, +respectively. This ensures portability between 32- and 64-bit platforms. +Format strings must not end with a newline character. It is the responsibility +of backends to adapt line ending for proper logging. + +Each event declaration will start with the event name, then its arguments, +finally a format string for pretty-printing. For example:: + + qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p" + qemu_vfree(void *ptr) "ptr %p" + + +Hints for adding new trace events +--------------------------------- + +1. Trace state changes in the code. Interesting points in the code usually + involve a state change like starting, stopping, allocating, freeing. State + changes are good trace events because they can be used to understand the + execution of the system. + +2. Trace guest operations. Guest I/O accesses like reading device registers + are good trace events because they can be used to understand guest + interactions. + +3. Use correlator fields so the context of an individual line of trace output + can be understood. For example, trace the pointer returned by malloc and + used as an argument to free. This way mallocs and frees can be matched up. + Trace events with no context are not very useful. + +4. Name trace events after their function. If there are multiple trace events + in one function, append a unique distinguisher at the end of the name. + +Generic interface and monitor commands +====================================== + +You can programmatically query and control the state of trace events through a +backend-agnostic interface provided by the header "trace/control.h". + +Note that some of the backends do not provide an implementation for some parts +of this interface, in which case QEMU will just print a warning (please refer to +header "trace/control.h" to see which routines are backend-dependent). + +The state of events can also be queried and modified through monitor commands: + +* ``info trace-events`` + View available trace events and their state. State 1 means enabled, state 0 + means disabled. + +* ``trace-event NAME on|off`` + Enable/disable a given trace event or a group of events (using wildcards). + +The "--trace events=<file>" command line argument can be used to enable the +events listed in <file> from the very beginning of the program. This file must +contain one event name per line. + +If a line in the "--trace events=<file>" file begins with a '-', the trace event +will be disabled instead of enabled. This is useful when a wildcard was used +to enable an entire family of events but one noisy event needs to be disabled. 
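
For example, an events file with the following contents (the event names are
taken from the quickstart above; the combination is just an illustration)
enables the whole ``memory_region_ops_`` family except the read event::

    memory_region_ops_*
    -memory_region_ops_read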

Wildcard matching is supported in both the monitor command "trace-event" and the
events list file. That means you can enable/disable the events having a common
prefix in a batch. For example, virtio-blk trace events could be enabled using
the following monitor command::

    trace-event virtio_blk_* on

Trace backends
==============

The "tracetool" script automates tedious trace event code generation and also
keeps the trace event declarations independent of the trace backend. The trace
events are not tightly coupled to a specific trace backend, such as LTTng or
SystemTap. Support for trace backends can be added by extending the "tracetool"
script.

The trace backends are chosen at configure time::

    ./configure --enable-trace-backends=simple,dtrace

For a list of supported trace backends, try ./configure --help or see below.
If multiple backends are enabled, the trace is sent to them all.

If no backends are explicitly selected, configure will default to the
"log" backend.

The following subsections describe the supported trace backends.

Nop
---

The "nop" backend generates empty trace event functions so that the compiler
can optimize out trace events completely. This imposes no performance
penalty.

Note that regardless of the selected trace backend, events with the "disable"
property will be generated with the "nop" backend.

Log
---

The "log" backend sends trace events directly to standard error. This
effectively turns trace events into debug printfs.

This is the simplest backend and can be used together with existing code that
uses DPRINTF().

The -msg timestamp=on|off command-line option controls whether or not to print
the tid/timestamp prefix for each trace event.

Simpletrace
-----------

The "simple" backend writes binary trace logs to a file from a thread, making
it lower overhead than the "log" backend. A Python API is available for writing
offline trace file analysis scripts. It may not be as powerful as
platform-specific or third-party trace backends but it is portable and has no
special library dependencies.

Monitor commands
~~~~~~~~~~~~~~~~

* ``trace-file on|off|flush|set <path>``
  Enable/disable/flush the trace file or set the trace file name.

Analyzing trace files
~~~~~~~~~~~~~~~~~~~~~

The "simple" backend produces binary trace files that can be formatted with the
simpletrace.py script. The script takes the "trace-events-all" file and the
binary trace::

    ./scripts/simpletrace.py trace-events-all trace-12345

You must ensure that the same "trace-events-all" file was used to build QEMU,
otherwise trace event declarations may have changed and output will not be
consistent.

Ftrace
------

The "ftrace" backend writes trace data to the ftrace marker. This effectively
sends trace events to the ftrace ring buffer, and you can compare QEMU trace
data and kernel (especially kvm.ko when using KVM) trace data.

If you use KVM, enable kvm events in ftrace::

    # echo 1 > /sys/kernel/debug/tracing/events/kvm/enable

After running qemu as the root user, you can get the trace::

    # cat /sys/kernel/debug/tracing/trace

Restriction: "ftrace" backend is restricted to Linux only.

Syslog
------

The "syslog" backend sends trace events using the POSIX syslog API. The log
is opened specifying the LOG_DAEMON facility and LOG_PID option (so events
are tagged with the pid of the particular QEMU process that generated
them). All events are logged at LOG_INFO level.
+ +NOTE: syslog may squash duplicate consecutive trace events and apply rate + limiting. + +Restriction: "syslog" backend is restricted to POSIX compliant OS. + +LTTng Userspace Tracer +---------------------- + +The "ust" backend uses the LTTng Userspace Tracer library. There are no +monitor commands built into QEMU, instead UST utilities should be used to list, +enable/disable, and dump traces. + +Package lttng-tools is required for userspace tracing. You must ensure that the +current user belongs to the "tracing" group, or manually launch the +lttng-sessiond daemon for the current user prior to running any instance of +QEMU. + +While running an instrumented QEMU, LTTng should be able to list all available +events:: + + lttng list -u + +Create tracing session:: + + lttng create mysession + +Enable events:: + + lttng enable-event qemu:g_malloc -u + +Where the events can either be a comma-separated list of events, or "-a" to +enable all tracepoint events. Start and stop tracing as needed:: + + lttng start + lttng stop + +View the trace:: + + lttng view + +Destroy tracing session:: + + lttng destroy + +Babeltrace can be used at any later time to view the trace:: + + babeltrace $HOME/lttng-traces/mysession-<date>-<time> + +SystemTap +--------- + +The "dtrace" backend uses DTrace sdt probes but has only been tested with +SystemTap. When SystemTap support is detected a .stp file with wrapper probes +is generated to make use in scripts more convenient. This step can also be +performed manually after a build in order to change the binary name in the .stp +probes:: + + scripts/tracetool.py --backends=dtrace --format=stap \ + --binary path/to/qemu-binary \ + --target-type system \ + --target-name x86_64 \ + --group=all \ + trace-events-all \ + qemu.stp + +To facilitate simple usage of systemtap where there merely needs to be printf +logging of certain probes, a helper script "qemu-trace-stap" is provided. +Consult its manual page for guidance on its usage. + +Trace event properties +====================== + +Each event in the "trace-events-all" file can be prefixed with a space-separated +list of zero or more of the following event properties. + +"disable" +--------- + +If a specific trace event is going to be invoked a huge number of times, this +might have a noticeable performance impact even when the event is +programmatically disabled. + +In this case you should declare such event with the "disable" property. This +will effectively disable the event at compile time (by using the "nop" backend), +thus having no performance impact at all on regular builds (i.e., unless you +edit the "trace-events-all" file). + +In addition, there might be cases where relatively complex computations must be +performed to generate values that are only used as arguments for a trace +function. In these cases you can use 'trace_event_get_state_backends()' to +guard such computations, so they are skipped if the event has been either +compile-time disabled or run-time disabled. If the event is compile-time +disabled, this check will have no performance impact. 
+ +:: + + #include "trace.h" /* needed for trace event prototype */ + + void *qemu_vmalloc(size_t size) + { + void *ptr; + size_t align = QEMU_VMALLOC_ALIGN; + + if (size < align) { + align = getpagesize(); + } + ptr = qemu_memalign(align, size); + if (trace_event_get_state_backends(TRACE_QEMU_VMALLOC)) { + void *complex; + /* some complex computations to produce the 'complex' value */ + trace_qemu_vmalloc(size, ptr, complex); + } + return ptr; + } + +"tcg" +----- + +Guest code generated by TCG can be traced by defining an event with the "tcg" +event property. Internally, this property generates two events: +"<eventname>_trans" to trace the event at translation time, and +"<eventname>_exec" to trace the event at execution time. + +Instead of using these two events, you should instead use the function +"trace_<eventname>_tcg" during translation (TCG code generation). This function +will automatically call "trace_<eventname>_trans", and will generate the +necessary TCG code to call "trace_<eventname>_exec" during guest code execution. + +Events with the "tcg" property can be declared in the "trace-events" file with a +mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward +them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values +are not known at translation time, these are ignored by the "<eventname>_trans" +event. Because of this, the entry in the "trace-events" file needs two printing +formats (separated by a comma):: + + tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d" + +For example:: + + #include "trace-tcg.h" + + void some_disassembly_func (...) + { + uint8_t a1 = ...; + TCGv_i32 a2 = ...; + trace_foo_tcg(a1, a2); + } + +This will immediately call:: + + void trace_foo_trans(uint8_t a1); + +and will generate the TCG code to call:: + + void trace_foo(uint8_t a1, uint32_t a2); + +"vcpu" +------ + +Identifies events that trace vCPU-specific information. It implicitly adds a +"CPUState*" argument, and extends the tracing print format to show the vCPU +information. If used together with the "tcg" property, it adds a second +"TCGv_env" argument that must point to the per-target global TCG register that +points to the vCPU when guest code is executed (usually the "cpu_env" variable). + +The "tcg" and "vcpu" properties are currently only honored in the root +./trace-events file. + +The following example events:: + + foo(uint32_t a) "a=%x" + vcpu bar(uint32_t a) "a=%x" + tcg vcpu baz(uint32_t a) "a=%x", "a=%x" + +Can be used as:: + + #include "trace-tcg.h" + + CPUArchState *env; + TCGv_ptr cpu_env; + + void some_disassembly_func(...) + { + /* trace emitted at this point */ + trace_foo(0xd1); + /* trace emitted at this point */ + trace_bar(env_cpu(env), 0xd2); + /* trace emitted at this point (env) and when guest code is executed (cpu_env) */ + trace_baz_tcg(env_cpu(env), cpu_env, 0xd3); + } + +If the translating vCPU has address 0xc1 and code is later executed by vCPU +0xc2, this would be an example output:: + + // at guest code translation + foo a=0xd1 + bar cpu=0xc1 a=0xd2 + baz_trans cpu=0xc1 a=0xd3 + // at guest code execution + baz_exec cpu=0xc2 a=0xd3 diff --git a/docs/devel/trivial-patches.rst b/docs/devel/trivial-patches.rst new file mode 100644 index 000000000..9380c730f --- /dev/null +++ b/docs/devel/trivial-patches.rst @@ -0,0 +1,52 @@ +.. 
_trivial-patches: + +Trivial Patches +=============== + +Overview +-------- + +Trivial patches that change just a few lines of code sometimes languish +on the mailing list even though they require only a small amount of +review. This is often the case for patches that do not fall under an +actively maintained subsystem and therefore fall through the cracks. + +The trivial patches team take on the task of reviewing and building pull +requests for patches that: + +- Do not fall under an actively maintained subsystem. +- Are single patches or short series (max 2-4 patches). +- Only touch a few lines of code. + +**You should hint that your patch is a candidate by CCing +qemu-trivial@nongnu.org.** + +Repositories +------------ + +Since the trivial patch team rotates maintainership there is only one +active repository at a time: + +- git://github.com/vivier/qemu.git trivial-patches - `browse <https://github.com/vivier/qemu/tree/trivial-patches>`__ + +Workflow +-------- + +The trivial patches team rotates the duty of collecting trivial patches +amongst its members. A team member's job is to: + +1. Identify trivial patches on the development mailing list. +2. Review trivial patches, merge them into a git tree, and reply to state + that the patch is queued. +3. Send pull requests to the development mailing list once a week. + +A single team member can be on duty as long as they like. The suggested +time is 1 week before handing off to the next member. + +Team +---- + +If you would like to join the trivial patches team, contact Laurent +Vivier. The current team includes: + +- `Laurent Vivier <mailto:laurent@vivier.eu>`__ diff --git a/docs/devel/ui.rst b/docs/devel/ui.rst new file mode 100644 index 000000000..17fb667de --- /dev/null +++ b/docs/devel/ui.rst @@ -0,0 +1,8 @@ +================= +QEMU UI subsystem +================= + +QEMU Clipboard +-------------- + +.. kernel-doc:: include/ui/clipboard.h diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst new file mode 100644 index 000000000..9ff6163c8 --- /dev/null +++ b/docs/devel/vfio-migration.rst @@ -0,0 +1,150 @@ +===================== +VFIO device Migration +===================== + +Migration of virtual machine involves saving the state for each device that +the guest is running on source host and restoring this saved state on the +destination host. This document details how saving and restoring of VFIO +devices is done in QEMU. + +Migration of VFIO devices consists of two phases: the optional pre-copy phase, +and the stop-and-copy phase. The pre-copy phase is iterative and allows to +accommodate VFIO devices that have a large amount of data that needs to be +transferred. The iterative pre-copy phase of migration allows for the guest to +continue whilst the VFIO device state is transferred to the destination, this +helps to reduce the total downtime of the VM. VFIO devices can choose to skip +the pre-copy phase of migration by returning pending_bytes as zero during the +pre-copy phase. + +A detailed description of the UAPI for VFIO device migration can be found in +the comment for the ``vfio_device_migration_info`` structure in the header +file linux-headers/linux/vfio.h. + +VFIO implements the device hooks for the iterative approach as follows: + +* A ``save_setup`` function that sets up the migration region and sets _SAVING + flag in the VFIO device state. + +* A ``load_setup`` function that sets up the migration region on the + destination and sets _RESUMING flag in the VFIO device state. 
+ +* A ``save_live_pending`` function that reads pending_bytes from the vendor + driver, which indicates the amount of data that the vendor driver has yet to + save for the VFIO device. + +* A ``save_live_iterate`` function that reads the VFIO device's data from the + vendor driver through the migration region during iterative phase. + +* A ``save_state`` function to save the device config space if it is present. + +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the + VFIO device state and iteratively copies the remaining data for the VFIO + device until the vendor driver indicates that no data remains (pending bytes + is zero). + +* A ``load_state`` function that loads the config section and the data + sections that are generated by the save functions above + +* ``cleanup`` functions for both save and load that perform any migration + related cleanup, including unmapping the migration region + + +The VFIO migration code uses a VM state change handler to change the VFIO +device state when the VM state changes from running to not-running, and +vice versa. + +Similarly, a migration state change handler is used to trigger a transition of +the VFIO device state when certain changes of the migration state occur. For +example, the VFIO device state is transitioned back to _RUNNING in case a +migration failed or was canceled. + +System memory dirty pages tracking +---------------------------------- + +A ``log_global_start`` and ``log_global_stop`` memory listener callback informs +the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync`` +memory listener callback marks those system memory pages as dirty which are +used for DMA by the VFIO device. The dirty pages bitmap is queried per +container. All pages pinned by the vendor driver through external APIs have to +be marked as dirty during migration. When there are CPU writes, CPU dirty page +tracking can identify dirtied pages, but any page pinned by the vendor driver +can also be written by the device. There is currently no device or IOMMU +support for dirty page tracking in hardware. + +By default, dirty pages are tracked when the device is in pre-copy as well as +stop-and-copy phase. So, a page pinned by the vendor driver will be copied to +the destination in both phases. Copying dirty pages in pre-copy phase helps +QEMU to predict if it can achieve its downtime tolerances. If QEMU during +pre-copy phase keeps finding dirty pages continuously, then it understands +that even in stop-and-copy phase, it is likely to find dirty pages and can +predict the downtime accordingly. + +QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking`` +which disables querying the dirty bitmap during pre-copy phase. If it is set to +off, all dirty pages will be copied to the destination in stop-and-copy phase +only. + +System memory dirty pages tracking when vIOMMU is enabled +--------------------------------------------------------- + +With vIOMMU, an IO virtual address range can get unmapped while in pre-copy +phase of migration. In that case, the unmap ioctl returns any dirty pages in +that range and QEMU reports corresponding guest physical pages dirty. During +stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped +pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those +mapped ranges. + +Flow of state changes during Live migration +=========================================== + +Below is the flow of state change during live migration. 
+The values in the brackets represent the VM state, the migration state, and +the VFIO device state, respectively. + +Live migration save path +------------------------ + +:: + + QEMU normal running state + (RUNNING, _NONE, _RUNNING) + | + migrate_init spawns migration_thread + Migration thread then calls each device's .save_setup() + (RUNNING, _SETUP, _RUNNING|_SAVING) + | + (RUNNING, _ACTIVE, _RUNNING|_SAVING) + If device is active, get pending_bytes by .save_live_pending() + If total pending_bytes >= threshold_size, call .save_live_iterate() + Data of VFIO device for pre-copy phase is copied + Iterate till total pending bytes converge and are less than threshold + | + On migration completion, vCPU stops and calls .save_live_complete_precopy for + each active device. The VFIO device is then transitioned into _SAVING state + (FINISH_MIGRATE, _DEVICE, _SAVING) + | + For the VFIO device, iterate in .save_live_complete_precopy until + pending data is 0 + (FINISH_MIGRATE, _DEVICE, _STOPPED) + | + (FINISH_MIGRATE, _COMPLETED, _STOPPED) + Migraton thread schedules cleanup bottom half and exits + +Live migration resume path +-------------------------- + +:: + + Incoming migration calls .load_setup for each device + (RESTORE_VM, _ACTIVE, _STOPPED) + | + For each device, .load_state is called for that device section data + (RESTORE_VM, _ACTIVE, _RESUMING) + | + At the end, .load_cleanup is called for each device and vCPUs are started + (RUNNING, _NONE, _RUNNING) + +Postcopy +======== + +Postcopy migration is currently not supported for VFIO devices. diff --git a/docs/devel/virtio-migration.txt b/docs/devel/virtio-migration.txt new file mode 100644 index 000000000..98a6b0ffb --- /dev/null +++ b/docs/devel/virtio-migration.txt @@ -0,0 +1,108 @@ +Virtio devices and migration +============================ + +Copyright 2015 IBM Corp. + +This work is licensed under the terms of the GNU GPL, version 2 or later. See +the COPYING file in the top-level directory. + +Saving and restoring the state of virtio devices is a bit of a twisty maze, +for several reasons: +- state is distributed between several parts: + - virtio core, for common fields like features, number of queues, ... + - virtio transport (pci, ccw, ...), for the different proxy devices and + transport specific state (msix vectors, indicators, ...) + - virtio device (net, blk, ...), for the different device types and their + state (mac address, request queue, ...) +- most fields are saved via the stream interface; subsequently, subsections + have been added to make cross-version migration possible + +This file attempts to document the current procedure and point out some +caveats. 
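+
+For orientation, the subsections mentioned above are ordinary
+VMStateDescription structures with a .needed callback that decides whether
+the subsection is sent at all. The following is only a minimal sketch of the
+pattern (VirtIOFoo, extra and virtio_foo_extra_needed are made-up names, not
+code from the QEMU tree):
+
+  /* Skip the subsection entirely when the extra state is unused, so that
+   * migration to older QEMU versions keeps working in that case. */
+  static bool virtio_foo_extra_needed(void *opaque)
+  {
+      VirtIOFoo *foo = opaque;
+
+      return foo->extra != 0;
+  }
+
+  static const VMStateDescription vmstate_virtio_foo_extra = {
+      .name = "virtio-foo/extra",
+      .version_id = 1,
+      .minimum_version_id = 1,
+      .needed = &virtio_foo_extra_needed,
+      .fields = (VMStateField[]) {
+          VMSTATE_UINT32(extra, VirtIOFoo),
+          VMSTATE_END_OF_LIST()
+      },
+  };
+
+The new VMStateDescription then has to be hooked into the .subsections list
+of the relevant parent description, so that it is only emitted when the
+.needed callback returns true.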
+ + +Save state procedure +==================== + +virtio core virtio transport virtio device +----------- ---------------- ------------- + + save() function registered + via VMState wrapper on + device class +virtio_save() <---------- + ------> save_config() + - save proxy device + - save transport-specific + device fields +- save common device + fields +- save common virtqueue + fields + ------> save_queue() + - save transport-specific + virtqueue fields + ------> save_device() + - save device-specific + fields +- save subsections + - device endianness, + if changed from + default endianness + - 64 bit features, if + any high feature bit + is set + - virtio-1 virtqueue + fields, if VERSION_1 + is set + + +Load state procedure +==================== + +virtio core virtio transport virtio device +----------- ---------------- ------------- + + load() function registered + via VMState wrapper on + device class +virtio_load() <---------- + ------> load_config() + - load proxy device + - load transport-specific + device fields +- load common device + fields +- load common virtqueue + fields + ------> load_queue() + - load transport-specific + virtqueue fields +- notify guest + ------> load_device() + - load device-specific + fields +- load subsections + - device endianness + - 64 bit features + - virtio-1 virtqueue + fields +- sanitize endianness +- sanitize features +- virtqueue index sanity + check + - feature-dependent setup + + +Implications of this setup +========================== + +Devices need to be careful in their state processing during load: The +load_device() procedure is invoked by the core before subsections have +been loaded. Any code that depends on information transmitted in subsections +therefore has to be invoked in the device's load() function _after_ +virtio_load() returned (like e.g. code depending on features). + +Any extension of the state being migrated should be done in subsections +added to the core for compatibility reasons. If transport or device specific +state is added, core needs to invoke a callback from the new subsection. diff --git a/docs/devel/writing-monitor-commands.rst b/docs/devel/writing-monitor-commands.rst new file mode 100644 index 000000000..1693822f8 --- /dev/null +++ b/docs/devel/writing-monitor-commands.rst @@ -0,0 +1,751 @@ +How to write monitor commands +============================= + +This document is a step-by-step guide on how to write new QMP commands using +the QAPI framework and HMP commands. + +This document doesn't discuss QMP protocol level details, nor does it dive +into the QAPI framework implementation. + +For an in-depth introduction to the QAPI framework, please refer to +docs/devel/qapi-code-gen.txt. For documentation about the QMP protocol, +start with docs/interop/qmp-intro.txt. + +New commands may be implemented in QMP only. New HMP commands should be +implemented on top of QMP. The typical HMP command wraps around an +equivalent QMP command, but HMP convenience commands built from QMP +building blocks are also fine. The long term goal is to make all +existing HMP commands conform to this, to fully isolate HMP from the +internals of QEMU. Refer to the `Writing a debugging aid returning +unstructured text`_ section for further guidance on commands that +would have traditionally been HMP only. + +Overview +-------- + +Generally speaking, the following steps should be taken in order to write a +new QMP command. + +1. Define the command and any types it needs in the appropriate QAPI + schema module. + +2. 
Write the QMP command itself, which is a regular C function. Preferably, + the command should be exported by some QEMU subsystem. But it can also be + added to the monitor/qmp-cmds.c file + +3. At this point the command can be tested under the QMP protocol + +4. Write the HMP command equivalent. This is not required and should only be + done if it does make sense to have the functionality in HMP. The HMP command + is implemented in terms of the QMP command + +The following sections will demonstrate each of the steps above. We will start +very simple and get more complex as we progress. + + +Testing +------- + +For all the examples in the next sections, the test setup is the same and is +shown here. + +First, QEMU should be started like this:: + + # qemu-system-TARGET [...] \ + -chardev socket,id=qmp,port=4444,host=localhost,server=on \ + -mon chardev=qmp,mode=control,pretty=on + +Then, in a different terminal:: + + $ telnet localhost 4444 + Trying 127.0.0.1... + Connected to localhost. + Escape character is '^]'. + { + "QMP": { + "version": { + "qemu": { + "micro": 50, + "minor": 15, + "major": 0 + }, + "package": "" + }, + "capabilities": [ + ] + } + } + +The above output is the QMP server saying you're connected. The server is +actually in capabilities negotiation mode. To enter in command mode type:: + + { "execute": "qmp_capabilities" } + +Then the server should respond:: + + { + "return": { + } + } + +Which is QMP's way of saying "the latest command executed OK and didn't return +any data". Now you're ready to enter the QMP example commands as explained in +the following sections. + + +Writing a simple command: hello-world +------------------------------------- + +That's the most simple QMP command that can be written. Usually, this kind of +command carries some meaningful action in QEMU but here it will just print +"Hello, world" to the standard output. + +Our command will be called "hello-world". It takes no arguments, nor does it +return any data. + +The first step is defining the command in the appropriate QAPI schema +module. We pick module qapi/misc.json, and add the following line at +the bottom:: + + { 'command': 'hello-world' } + +The "command" keyword defines a new QMP command. It's an JSON object. All +schema entries are JSON objects. The line above will instruct the QAPI to +generate any prototypes and the necessary code to marshal and unmarshal +protocol data. + +The next step is to write the "hello-world" implementation. As explained +earlier, it's preferable for commands to live in QEMU subsystems. But +"hello-world" doesn't pertain to any, so we put its implementation in +monitor/qmp-cmds.c:: + + void qmp_hello_world(Error **errp) + { + printf("Hello, world!\n"); + } + +There are a few things to be noticed: + +1. QMP command implementation functions must be prefixed with "qmp\_" +2. qmp_hello_world() returns void, this is in accordance with the fact that the + command doesn't return any data +3. It takes an "Error \*\*" argument. This is required. Later we will see how to + return errors and take additional arguments. The Error argument should not + be touched if the command doesn't return errors +4. We won't add the function's prototype. That's automatically done by the QAPI +5. Printing to the terminal is discouraged for QMP commands, we do it here + because it's the easiest way to demonstrate a QMP command + +You're done. 
Now build qemu, run it as suggested in the "Testing" section, +and then type the following QMP command:: + + { "execute": "hello-world" } + +Then check the terminal running qemu and look for the "Hello, world" string. If +you don't see it then something went wrong. + + +Arguments +~~~~~~~~~ + +Let's add an argument called "message" to our "hello-world" command. The new +argument will contain the string to be printed to stdout. It's an optional +argument, if it's not present we print our default "Hello, World" string. + +The first change we have to do is to modify the command specification in the +schema file to the following:: + + { 'command': 'hello-world', 'data': { '*message': 'str' } } + +Notice the new 'data' member in the schema. It's an JSON object whose each +element is an argument to the command in question. Also notice the asterisk, +it's used to mark the argument optional (that means that you shouldn't use it +for mandatory arguments). Finally, 'str' is the argument's type, which +stands for "string". The QAPI also supports integers, booleans, enumerations +and user defined types. + +Now, let's update our C implementation in monitor/qmp-cmds.c:: + + void qmp_hello_world(bool has_message, const char *message, Error **errp) + { + if (has_message) { + printf("%s\n", message); + } else { + printf("Hello, world\n"); + } + } + +There are two important details to be noticed: + +1. All optional arguments are accompanied by a 'has\_' boolean, which is set + if the optional argument is present or false otherwise +2. The C implementation signature must follow the schema's argument ordering, + which is defined by the "data" member + +Time to test our new version of the "hello-world" command. Build qemu, run it as +described in the "Testing" section and then send two commands:: + + { "execute": "hello-world" } + { + "return": { + } + } + + { "execute": "hello-world", "arguments": { "message": "We love qemu" } } + { + "return": { + } + } + +You should see "Hello, world" and "We love qemu" in the terminal running qemu, +if you don't see these strings, then something went wrong. + + +Errors +~~~~~~ + +QMP commands should use the error interface exported by the error.h header +file. Basically, most errors are set by calling the error_setg() function. + +Let's say we don't accept the string "message" to contain the word "love". If +it does contain it, we want the "hello-world" command to return an error:: + + void qmp_hello_world(bool has_message, const char *message, Error **errp) + { + if (has_message) { + if (strstr(message, "love")) { + error_setg(errp, "the word 'love' is not allowed"); + return; + } + printf("%s\n", message); + } else { + printf("Hello, world\n"); + } + } + +The first argument to the error_setg() function is the Error pointer +to pointer, which is passed to all QMP functions. The next argument is a human +description of the error, this is a free-form printf-like string. + +Let's test the example above. Build qemu, run it as defined in the "Testing" +section, and then issue the following command:: + + { "execute": "hello-world", "arguments": { "message": "all you need is love" } } + +The QMP server's response should be:: + + { + "error": { + "class": "GenericError", + "desc": "the word 'love' is not allowed" + } + } + +Note that error_setg() produces a "GenericError" class. In general, +all QMP errors should have that error class. There are two exceptions +to this rule: + + 1. 
To support a management application's need to recognize a specific + error for special handling + + 2. Backward compatibility + +If the failure you want to report falls into one of the two cases above, +use error_set() with a second argument of an ErrorClass value. + + +Command Documentation +~~~~~~~~~~~~~~~~~~~~~ + +There's only one step missing to make "hello-world"'s implementation complete, +and that's its documentation in the schema file. + +There are many examples of such documentation in the schema file already, but +here goes "hello-world"'s new entry for qapi/misc.json:: + + ## + # @hello-world: + # + # Print a client provided string to the standard output stream. + # + # @message: string to be printed + # + # Returns: Nothing on success. + # + # Notes: if @message is not provided, the "Hello, world" string will + # be printed instead + # + # Since: <next qemu stable release, eg. 1.0> + ## + { 'command': 'hello-world', 'data': { '*message': 'str' } } + +Please, note that the "Returns" clause is optional if a command doesn't return +any data nor any errors. + + +Implementing the HMP command +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Now that the QMP command is in place, we can also make it available in the human +monitor (HMP). + +With the introduction of the QAPI, HMP commands make QMP calls. Most of the +time HMP commands are simple wrappers. All HMP commands implementation exist in +the monitor/hmp-cmds.c file. + +Here's the implementation of the "hello-world" HMP command:: + + void hmp_hello_world(Monitor *mon, const QDict *qdict) + { + const char *message = qdict_get_try_str(qdict, "message"); + Error *err = NULL; + + qmp_hello_world(!!message, message, &err); + if (hmp_handle_error(mon, err)) { + return; + } + } + +Also, you have to add the function's prototype to the hmp.h file. + +There are three important points to be noticed: + +1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The + former is the monitor object. The latter is how the monitor passes + arguments entered by the user to the command implementation +2. hmp_hello_world() performs error checking. In this example we just call + hmp_handle_error() which prints a message to the user, but we could do + more, like taking different actions depending on the error + qmp_hello_world() returns +3. The "err" variable must be initialized to NULL before performing the + QMP call + +There's one last step to actually make the command available to monitor users, +we should add it to the hmp-commands.hx file:: + + { + .name = "hello-world", + .args_type = "message:s?", + .params = "hello-world [message]", + .help = "Print message to the standard output", + .cmd = hmp_hello_world, + }, + +:: + + STEXI + @item hello_world @var{message} + @findex hello_world + Print message to the standard output + ETEXI + +To test this you have to open a user monitor and issue the "hello-world" +command. It might be instructive to check the command's documentation with +HMP's "help" command. + +Please, check the "-monitor" command-line option to know how to open a user +monitor. + + +Writing more complex commands +----------------------------- + +A QMP command is capable of returning any data the QAPI supports like integers, +strings, booleans, enumerations and user defined types. + +In this section we will focus on user defined types. Please, check the QAPI +documentation for information about the other types. 
+
+
+Modelling data in QAPI
+~~~~~~~~~~~~~~~~~~~~~~
+
+For a QMP command to be considered stable and supported long term, there
+is a requirement that its returned data should be explicitly modelled
+using fine-grained QAPI types. As a general guide, a caller of the QMP
+command should never need to parse individual returned data fields. If
+a field appears to need parsing, then it should be split into separate
+fields corresponding to each distinct data item. This should be the
+common case for any new QMP command that is intended to be used by
+machines, as opposed to exclusively human operators.
+
+Some QMP commands, however, are only intended as ad hoc debugging aids
+for human operators. While they may return large amounts of formatted
+data, it is not expected that machines will need to parse the result.
+The overhead of defining a fine-grained QAPI type for the data may not
+be justified by the potential benefit. In such cases, it is permitted
+to have a command return a simple string that contains formatted data;
+however, it is mandatory for the command to use the 'x-' name prefix.
+This indicates that the command is not guaranteed to remain stable over
+the long term, is liable to change in future, and does not follow QAPI
+design best practices. An example where this approach is taken is the
+QMP command "x-query-registers". This returns a formatted dump of the
+architecture specific CPU state. The way the data is formatted varies
+across QEMU targets, is liable to change over time, and is only
+intended to be consumed as an opaque string by machines. Refer to the
+`Writing a debugging aid returning unstructured text`_ section for
+an illustration.
+
+User Defined Types
+~~~~~~~~~~~~~~~~~~
+
+FIXME This example needs to be redone after commit 6d32717
+
+For this example we will write the query-alarm-clock command, which returns
+information about QEMU's timer alarm. For more information about it, please
+check the "-clock" command-line option.
+
+We want to return two pieces of information. The first one is the alarm clock's
+name. The second one is when the next alarm will fire. The former is returned
+as a string, the latter as an integer in nanoseconds (which is not very useful
+in practice, as the timer has probably already fired when the information
+reaches the client).
+
+The best way to return that data is to create a new QAPI type, as shown below::
+
+    ##
+    # @QemuAlarmClock
+    #
+    # QEMU alarm clock information.
+    #
+    # @clock-name: The alarm clock method's name.
+    #
+    # @next-deadline: The time (in nanoseconds) the next alarm will fire.
+    #
+    # Since: 1.0
+    ##
+    { 'type': 'QemuAlarmClock',
+      'data': { 'clock-name': 'str', '*next-deadline': 'int' } }
+
+The "type" keyword defines a new QAPI type. Its "data" member contains the
+type's members. In this example the members are "clock-name" and
+"next-deadline", the latter of which is optional.
+
+Now let's define the query-alarm-clock command::
+
+    ##
+    # @query-alarm-clock
+    #
+    # Return information about QEMU's alarm clock.
+    #
+    # Returns a @QemuAlarmClock instance describing the alarm clock method
+    # being currently used by QEMU (this is usually set by the '-clock'
+    # command-line option).
+    #
+    # Since: 1.0
+    ##
+    { 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }
+
+Notice the "returns" keyword. As its name suggests, it's used to define the
+data returned by a command.
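+
+For reference, the QAPI code generator turns the QemuAlarmClock definition
+above into a C struct roughly like the following (this is a sketch of the
+generated code produced by scripts/qapi-gen.py, not something you write by
+hand)::
+
+    typedef struct QemuAlarmClock {
+        char *clock_name;
+        bool has_next_deadline;
+        int64_t next_deadline;
+    } QemuAlarmClock;
+
+Note how the optional "next-deadline" member gets a companion
+"has_next_deadline" flag, and how dashes in member names are translated to
+underscores.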
+ +It's time to implement the qmp_query_alarm_clock() function, you can put it +in the qemu-timer.c file:: + + QemuAlarmClock *qmp_query_alarm_clock(Error **errp) + { + QemuAlarmClock *clock; + int64_t deadline; + + clock = g_malloc0(sizeof(*clock)); + + deadline = qemu_next_alarm_deadline(); + if (deadline > 0) { + clock->has_next_deadline = true; + clock->next_deadline = deadline; + } + clock->clock_name = g_strdup(alarm_timer->name); + + return clock; + } + +There are a number of things to be noticed: + +1. The QemuAlarmClock type is automatically generated by the QAPI framework, + its members correspond to the type's specification in the schema file +2. As specified in the schema file, the function returns a QemuAlarmClock + instance and takes no arguments (besides the "errp" one, which is mandatory + for all QMP functions) +3. The "clock" variable (which will point to our QAPI type instance) is + allocated by the regular g_malloc0() function. Note that we chose to + initialize the memory to zero. This is recommended for all QAPI types, as + it helps avoiding bad surprises (specially with booleans) +4. Remember that "next_deadline" is optional? All optional members have a + 'has_TYPE_NAME' member that should be properly set by the implementation, + as shown above +5. Even static strings, such as "alarm_timer->name", should be dynamically + allocated by the implementation. This is so because the QAPI also generates + a function to free its types and it cannot distinguish between dynamically + or statically allocated strings +6. You have to include "qapi/qapi-commands-misc.h" in qemu-timer.c + +Time to test the new command. Build qemu, run it as described in the "Testing" +section and try this:: + + { "execute": "query-alarm-clock" } + { + "return": { + "next-deadline": 2368219, + "clock-name": "dynticks" + } + } + + +The HMP command +~~~~~~~~~~~~~~~ + +Here's the HMP counterpart of the query-alarm-clock command:: + + void hmp_info_alarm_clock(Monitor *mon) + { + QemuAlarmClock *clock; + Error *err = NULL; + + clock = qmp_query_alarm_clock(&err); + if (hmp_handle_error(mon, err)) { + return; + } + + monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name); + if (clock->has_next_deadline) { + monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n", + clock->next_deadline); + } + + qapi_free_QemuAlarmClock(clock); + } + +It's important to notice that hmp_info_alarm_clock() calls +qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock(). +For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME() +function and that's what you have to use to free the types you define and +qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section). +If the QMP call returns a string, then you should g_free() to free it. + +Also note that hmp_info_alarm_clock() performs error handling. That's not +strictly required if you're sure the QMP function doesn't return errors, but +it's good practice to always check for errors. + +Another important detail is that HMP's "info" commands don't go into the +hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined +in the monitor/misc.c file. The entry for the "info alarmclock" follows:: + + { + .name = "alarmclock", + .args_type = "", + .params = "", + .help = "show information about the alarm clock", + .cmd = hmp_info_alarm_clock, + }, + +To test this, run qemu and type "info alarmclock" in the user monitor. 
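+
+Assuming the "dynticks" method is in use and an alarm is pending, the output
+will look something like this (the exact method name and deadline depend on
+the host)::
+
+    (qemu) info alarmclock
+    Alarm clock method in use: 'dynticks'
+    Next alarm will fire in 2368219 nanoseconds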
+
+
+Returning Lists
+~~~~~~~~~~~~~~~
+
+For this example, we're going to return all available methods for the timer
+alarm, which is pretty much what the command-line option "-clock ?" does,
+except that we're also going to indicate which method is in use.
+
+The first step is to define a new type::
+
+    ##
+    # @TimerAlarmMethod
+    #
+    # Timer alarm method information.
+    #
+    # @method-name: The method's name.
+    #
+    # @current: true if this alarm method is currently in use, false otherwise
+    #
+    # Since: 1.0
+    ##
+    { 'type': 'TimerAlarmMethod',
+      'data': { 'method-name': 'str', 'current': 'bool' } }
+
+The command will be called "query-alarm-methods"; here is its schema
+specification::
+
+    ##
+    # @query-alarm-methods
+    #
+    # Returns information about available alarm methods.
+    #
+    # Returns: a list of @TimerAlarmMethod for each method
+    #
+    # Since: 1.0
+    ##
+    { 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
+
+Notice the syntax for returning lists "'returns': ['TimerAlarmMethod']"; this
+should be read as "returns a list of TimerAlarmMethod instances".
+
+The C implementation follows::
+
+    TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
+    {
+        TimerAlarmMethodList *method_list = NULL;
+        const struct qemu_alarm_timer *p;
+        bool current = true;
+
+        for (p = alarm_timers; p->name; p++) {
+            TimerAlarmMethod *value = g_malloc0(sizeof(*value));
+            value->method_name = g_strdup(p->name);
+            value->current = current;
+            QAPI_LIST_PREPEND(method_list, value);
+            current = false;
+        }
+
+        return method_list;
+    }
+
+The most important difference from the previous examples is the
+TimerAlarmMethodList type, which is automatically generated by the QAPI from
+the TimerAlarmMethod type.
+
+Each list node is represented by a TimerAlarmMethodList instance. We only have
+to allocate the node's contents, a TimerAlarmMethod instance pointed to by the
+node's "value" member; the list node itself is allocated and linked in by the
+QAPI_LIST_PREPEND() macro.
+
+Notice that the "current" variable is used as "true" only in the first
+iteration of the loop. That's because the alarm timer method in use is the
+first element of the alarm_timers array. Also notice that QAPI_LIST_PREPEND()
+inserts each new node at the front, so the finished list is in reverse order
+of the alarm_timers array, and we simply return its head.
+
+Now build qemu, run it as explained in the "Testing" section and try our new
+command::
+
+    { "execute": "query-alarm-methods" }
+    {
+        "return": [
+            {
+                "current": false,
+                "method-name": "unix"
+            },
+            {
+                "current": true,
+                "method-name": "dynticks"
+            }
+        ]
+    }
+
+The HMP counterpart is a bit more complex than previous examples because it
+has to traverse the list; it's shown below for reference::
+
+    void hmp_info_alarm_methods(Monitor *mon)
+    {
+        TimerAlarmMethodList *method_list, *method;
+        Error *err = NULL;
+
+        method_list = qmp_query_alarm_methods(&err);
+        if (hmp_handle_error(mon, err)) {
+            return;
+        }
+
+        for (method = method_list; method; method = method->next) {
+            monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
+                           method->value->method_name);
+        }
+
+        qapi_free_TimerAlarmMethodList(method_list);
+    }
+
+Writing a debugging aid returning unstructured text
+---------------------------------------------------
+
+As discussed in section `Modelling data in QAPI`_, it is required that
+commands expected to be used by machines model their returned data with
+fine-grained QAPI types.
+The exception to this rule applies when the command is solely intended
+as a debugging aid and allows for returning unstructured text. This is
+commonly needed for query commands that report aspects of QEMU's
+internal state that are useful to human operators.
+
+In this example we will consider a simplified variant of the HMP
+command ``info roms``. Following the earlier rules, this command will
+need to live under the ``x-`` name prefix, so its QMP implementation
+will be called ``x-query-roms``. It will have no parameters and will
+return a single text string::
+
+    { 'struct': 'HumanReadableText',
+      'data': { 'human-readable-text': 'str' } }
+
+    { 'command': 'x-query-roms',
+      'returns': 'HumanReadableText' }
+
+The ``HumanReadableText`` struct is intended to be used for all commands
+under the ``x-`` name prefix that return unstructured text targeted at
+humans. It should never be used for commands outside the ``x-`` name
+prefix, as those should be using structured QAPI types.
+
+Implementing the QMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The QMP implementation will typically involve creating a ``GString``
+object and printing formatted data into it::
+
+    HumanReadableText *qmp_x_query_roms(Error **errp)
+    {
+        g_autoptr(GString) buf = g_string_new("");
+        Rom *rom;
+
+        QTAILQ_FOREACH(rom, &roms, next) {
+            g_string_append_printf(buf, "%s size=0x%06zx name=\"%s\"\n",
+                                   memory_region_name(rom->mr),
+                                   rom->romsize,
+                                   rom->name);
+        }
+
+        return human_readable_text_from_str(buf);
+    }
+
+
+Implementing the HMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the QMP command is in place, we can also make it available in
+the human monitor (HMP) as shown in previous examples. The HMP
+implementations will all look fairly similar, as all they need do is
+invoke the QMP command and then print the resulting text or error
+message. Here's the implementation of the "info roms" HMP command::
+
+    void hmp_info_roms(Monitor *mon, const QDict *qdict)
+    {
+        Error *err = NULL;
+        g_autoptr(HumanReadableText) info = qmp_x_query_roms(&err);
+
+        if (hmp_handle_error(mon, err)) {
+            return;
+        }
+        monitor_printf(mon, "%s", info->human_readable_text);
+    }
+
+Also, you have to add the function's prototype to the hmp.h file.
+
+There's one last step to actually make the command available to
+monitor users: we should add it to the hmp-commands-info.hx file::
+
+    {
+        .name = "roms",
+        .args_type = "",
+        .params = "",
+        .help = "show roms",
+        .cmd = hmp_info_roms,
+    },
+
+The case of writing an HMP info handler that calls a no-parameter QMP query
+command is quite common. To simplify the implementation there is a general
+purpose HMP info handler for this scenario. All that is required to expose
+a no-parameter QMP query command via HMP is to declare it using the
+'.cmd_info_hrt' field to point to the QMP handler, and leave the '.cmd'
+field NULL::
+
+    {
+        .name = "roms",
+        .args_type = "",
+        .params = "",
+        .help = "show roms",
+        .cmd_info_hrt = qmp_x_query_roms,
+    },
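+
+As with the earlier examples, the underlying QMP command can be exercised
+directly; the ROM name and size below are purely illustrative::
+
+    { "execute": "x-query-roms" }
+    {
+        "return": {
+            "human-readable-text": "pc.bios size=0x040000 name=\"bios-256k.bin\"\n"
+        }
+    }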