=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest). Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking. Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization. Anything else is considerably harder, but it's
also justified more often than one would like; the most
performance-critical parts of QEMU in particular require a very
low-level approach to concurrency, involving memory barriers and
atomic operations. The semantics of concurrent memory accesses are
governed by the C11 memory model.

QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
provide better portability and a less verbose syntax. ``qemu/atomic.h``
provides macros that fall in three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``qatomic_read()``,
  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;

- sequentially consistent atomic access: everything else.

In general, use of ``qemu/atomic.h`` should be wrapped with more easily
used data structures (e.g. the lock-free singly-linked list operations
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of
atomic operations and memory barriers should be limited to inter-thread
checking of flags and documented thoroughly.


Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on
either side of it to the other side. The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution. In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.


Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".
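
For instance, here is a minimal sketch of a thread-safe counter built on
the sequentially consistent read-modify-write operations listed below.
Because ``qemu/atomic.h`` only wraps the C11 atomics, the sketch uses
standard ``<stdatomic.h>`` directly so that it is self-contained;
``atomic_fetch_add(&counter, 1)`` corresponds to QEMU's
``qatomic_fetch_inc(&counter)``::

    /* Hedged sketch: two threads increment a shared counter with
     * sequentially consistent read-modify-write operations. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int counter;

    static void *worker(void *opaque)
    {
        for (int i = 0; i < 100000; i++) {
            /* Sequentially consistent by default, like qatomic_fetch_inc(). */
            atomic_fetch_add(&counter, 1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%d\n", atomic_load(&counter));   /* always prints 200000 */
        return 0;
    }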

``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

    void qatomic_inc(ptr)
    void qatomic_dec(ptr)
    void qatomic_add(ptr, val)
    void qatomic_sub(ptr, val)
    void qatomic_and(ptr, val)
    void qatomic_or(ptr, val)

    typeof(*ptr) qatomic_fetch_inc(ptr)
    typeof(*ptr) qatomic_fetch_dec(ptr)
    typeof(*ptr) qatomic_fetch_add(ptr, val)
    typeof(*ptr) qatomic_fetch_sub(ptr, val)
    typeof(*ptr) qatomic_fetch_and(ptr, val)
    typeof(*ptr) qatomic_fetch_or(ptr, val)
    typeof(*ptr) qatomic_fetch_xor(ptr, val)
    typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
    typeof(*ptr) qatomic_xchg(ptr, val)
    typeof(*ptr) qatomic_cmpxchg(ptr, old, new)

The operations with a ``typeof(*ptr)`` return type return the old value
of ``*ptr``; the ``void`` variants return nothing. These operations are
polymorphic; they operate on any type that is as wide as a pointer or
smaller.

Similar operations return the new value of ``*ptr``::

    typeof(*ptr) qatomic_inc_fetch(ptr)
    typeof(*ptr) qatomic_dec_fetch(ptr)
    typeof(*ptr) qatomic_add_fetch(ptr, val)
    typeof(*ptr) qatomic_sub_fetch(ptr, val)
    typeof(*ptr) qatomic_and_fetch(ptr, val)
    typeof(*ptr) qatomic_or_fetch(ptr, val)
    typeof(*ptr) qatomic_xor_fetch(ptr, val)

``qemu/atomic.h`` also provides loads and stores that cannot be reordered
with each other::

    typeof(*ptr) qatomic_mb_read(ptr)
    void qatomic_mb_set(ptr, val)

However, these do not provide sequential consistency and, in particular,
they do not participate in the total ordering enforced by
sequentially-consistent operations. For this reason they are deprecated.
They should instead be replaced with any of the following (ordered from
easiest to hardest):

- accesses inside a mutex or spinlock

- lightweight synchronization primitives such as ``QemuEvent``

- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing
  or accessing a new version of a data structure

- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for
  loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb``
  to forbid reordering subsequent loads before a store.


Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
The only guarantees that you can rely upon in this case are:

- atomic accesses will not cause data races (and hence undefined behavior);
  ordinary accesses instead cause data races if they are concurrent with
  other accesses of which at least one is a write. In order to ensure this,
  the compiler will not optimize accesses out of existence, create unsolicited
  accesses, or perform other similar optimizations.

- acquire operations will appear to happen, with respect to the other
  components of the system, before all the LOAD or STORE operations
  specified afterwards.

- release operations will appear to happen, with respect to the other
  components of the system, after all the LOAD or STORE operations
  specified before.

- release operations will *synchronize with* acquire operations;
  see :ref:`acqrel` for a detailed explanation.

When using this model, variables are accessed with:

- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``qatomic_load_acquire()`` can still be
  reordered after it.

- ``qatomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before.
  Operations coming after ``qatomic_store_release()`` can still be
  reordered before it.

Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores. It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors. Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.

  Note that the first load really has to have a *data* dependency and not
  a control dependency.
  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required.

Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
and one thread that is always a reader:

+----------------------------------+----------------------------------+
| thread 1                         | thread 2                         |
+==================================+==================================+
| ::                               | ::                               |
|                                  |                                  |
|   qatomic_store_release(&a, x);  |   y = qatomic_load_acquire(&b);  |
|   qatomic_store_release(&b, y);  |   x = qatomic_load_acquire(&a);  |
+----------------------------------+----------------------------------+

In this case, correctness is easy to check using the "pairing" trick
that is explained below.

Sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables). In this case, it is possible to "hoist" the barriers
outside a loop. For example:

+------------------------------------------+----------------------------------+
| before                                   | after                            |
+==========================================+==================================+
| ::                                       | ::                               |
|                                          |                                  |
|     n = 0;                               |     n = 0;                       |
|     for (i = 0; i < 10; i++)             |     for (i = 0; i < 10; i++)     |
|       n += qatomic_load_acquire(&a[i]);  |       n += qatomic_read(&a[i]);  |
|                                          |     smp_mb_acquire();            |
+------------------------------------------+----------------------------------+
| ::                                       | ::                               |
|                                          |                                  |
|                                          |     smp_mb_release();            |
|     for (i = 0; i < 10; i++)             |     for (i = 0; i < 10; i++)     |
|       qatomic_store_release(&a[i],       |       qatomic_set(&a[i], false); |
|                             false);      |                                  |
+------------------------------------------+----------------------------------+

Splitting a loop can also be useful to reduce the number of barriers:

+------------------------------------------+----------------------------------+
| before                                   | after                            |
+==========================================+==================================+
| ::                                       | ::                               |
|                                          |                                  |
|     n = 0;                               |     smp_mb_release();            |
|     for (i = 0; i < 10; i++) {           |     for (i = 0; i < 10; i++)     |
|       qatomic_store_release(&a[i],       |       qatomic_set(&a[i], false); |
|                             false);      |     smp_mb();                    |
|       smp_mb();                          |     n = 0;                       |
|       n += qatomic_read(&b[i]);          |     for (i = 0; i < 10; i++)     |
|     }                                    |       n += qatomic_read(&b[i]);  |
+------------------------------------------+----------------------------------+

In this case, one ``smp_mb_release()`` is also replaced with a (possibly
cheaper, and clearer as well) ``smp_wmb()``:

+------------------------------------------+----------------------------------+
| before                                   | after                            |
+==========================================+==================================+
| ::                                       | ::                               |
|                                          |                                  |
|                                          |     smp_mb_release();            |
|     for (i = 0; i < 10; i++) {           |     for (i = 0; i < 10; i++)     |
|       qatomic_store_release(&a[i],       |       qatomic_set(&a[i], false); |
|                             false);      |     smp_wmb();                   |
|       qatomic_store_release(&b[i],       |     for (i = 0; i < 10; i++)     |
|                             false);      |       qatomic_set(&b[i], false); |
|     }                                    |                                  |
+------------------------------------------+----------------------------------+
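
To make the single-writer/single-reader pattern above concrete, here is a
hedged, runnable sketch, again written with the standard C11 atomics that
``qemu/atomic.h`` wraps: the release store stands in for
``qatomic_store_release()`` and the acquire load for
``qatomic_load_acquire()``::

    /* Hedged sketch of the single-writer/single-reader pattern: the
     * payload is published with a release store and consumed with an
     * acquire load. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static int data;            /* payload; written only before "ready" */
    static atomic_bool ready;

    static void *writer(void *opaque)
    {
        data = 42;                                        /* plain store   */
        atomic_store_explicit(&ready, true,
                              memory_order_release);      /* release store */
        return NULL;
    }

    static void *reader(void *opaque)
    {
        while (!atomic_load_explicit(&ready,
                                     memory_order_acquire)) { /* acquire load */
            /* spin until the writer publishes */
        }
        assert(data == 42);   /* the acquire load synchronized with the
                               * release store, so the payload is visible */
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }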
.. _acqrel:

Acquire/release pairing and the *synchronizes-with* relation
-------------------------------------------------------------

Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
either *acquire* or *release* semantics [#rmw]_. This has two effects:

.. [#rmw] Read-modify-write operations can have both---acquire applies to the
   read part, and release to the write.

- within a thread, they are ordered either before subsequent operations
  (for acquire) or after previous operations (for release).

- if a release operation in one thread *synchronizes with* an acquire operation
  in another thread, the ordering constraint propagates from the first to the
  second thread. That is, everything before the release operation in the
  first thread is guaranteed to *happen before* everything after the
  acquire operation in the second thread.

The concept of acquire and release semantics is not exclusive to atomic
operations; almost all higher-level synchronization primitives also have
acquire or release semantics. For example:

- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
  same mutex.

- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
  ``pthread_cond_wait`` has both release semantics (synchronizing with
  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
  ``pthread_mutex_unlock`` and signaling of the condition variable).

- ``pthread_create`` has release semantics and synchronizes with the start
  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
  with the exiting of the thread.

- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
  acquire semantics.

For example, in the following code there are no atomic accesses, but thread 2
can still rely on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):

+----------------------+-------------------------------+
| thread 1             | thread 2                      |
+======================+===============================+
| ::                   | ::                            |
|                      |                               |
|   *a = 1;            |                               |
|   pthread_exit(a);   |   pthread_join(thread1, &a);  |
|                      |   x = *a;                     |
+----------------------+-------------------------------+

Synchronization between threads basically descends from this pairing of
a release operation and an acquire operation. Therefore, atomic operations
other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
paired with another operation of the opposite kind: an acquire operation
will pair with a release operation and vice versa. This rule of thumb is
extremely useful; in the case of QEMU, however, note that the other
operation may actually be in a driver that runs in the guest!

``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
as acquire operations. ``smp_wmb()``, ``smp_mb_release()``,
``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
operations. ``smp_mb()`` counts as both acquire and release, therefore
it can pair with any other atomic operation.
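
Here is an example. First, a hedged, runnable rendering in standard C11,
where ``atomic_thread_fence(memory_order_release)`` is a conservative
stand-in for ``smp_wmb()``, ``atomic_thread_fence(memory_order_acquire)``
for ``smp_rmb()``, and relaxed atomics for ``qatomic_set()`` and
``qatomic_read()``::

    /* Hedged rendering of the smp_wmb()/smp_rmb() pairing shown
     * schematically below.  If thread 2 reads b == 2, the release fence
     * synchronizes with the acquire fence, so a == 1 must be visible. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static atomic_int a, b;

    static void *thread1(void *opaque)
    {
        atomic_store_explicit(&a, 1, memory_order_relaxed); /* qatomic_set(&a, 1) */
        atomic_thread_fence(memory_order_release);          /* smp_wmb()          */
        atomic_store_explicit(&b, 2, memory_order_relaxed); /* qatomic_set(&b, 2) */
        return NULL;
    }

    static void *thread2(void *opaque)
    {
        int x = atomic_load_explicit(&b, memory_order_relaxed); /* qatomic_read(&b) */
        atomic_thread_fence(memory_order_acquire);               /* smp_rmb()        */
        int y = atomic_load_explicit(&a, memory_order_relaxed); /* qatomic_read(&a) */
        assert(!(x == 2 && y == 0));  /* seeing b == 2 implies seeing a == 1 */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }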
The same pairing, schematically:

+----------------------+------------------------------+
| thread 1             | thread 2                     |
+======================+==============================+
| ::                   | ::                           |
|                      |                              |
|   qatomic_set(&a, 1);|                              |
|   smp_wmb();         |                              |
|   qatomic_set(&b, 2);|   x = qatomic_read(&b);      |
|                      |   smp_rmb();                 |
|                      |   y = qatomic_read(&a);      |
+----------------------+------------------------------+

Note that a load-store pair only counts if the two operations access the
same variable: that is, a store-release on a variable ``x`` *synchronizes
with* a load-acquire on the same variable ``x``, while a release barrier
synchronizes with any acquire operation. The following example shows
correct synchronization:

+--------------------------------+--------------------------------+
| thread 1                       | thread 2                       |
+================================+================================+
| ::                             | ::                             |
|                                |                                |
|   qatomic_set(&a, 1);          |                                |
|   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);|
|                                |   y = qatomic_read(&a);        |
+--------------------------------+--------------------------------+

Acquire and release semantics of higher-level primitives can also be
relied upon for the purpose of establishing the *synchronizes with*
relation.

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread. This is expected: stores
before a release operation will normally match the loads after
the acquire operation, and vice versa. In fact, this happened already
in the ``pthread_exit``/``pthread_join`` example above.

Finally, this more complex example has more than two accesses and data
dependency barriers. It also does not use atomic accesses whenever there
cannot be a data race:

+----------------------+------------------------------+
| thread 1             | thread 2                     |
+======================+==============================+
| ::                   | ::                           |
|                      |                              |
|   b[2] = 1;          |                              |
|   smp_wmb();         |                              |
|   x->i = 2;          |                              |
|   smp_wmb();         |                              |
|   qatomic_set(&a, x);|   x = qatomic_read(&a);      |
|                      |   smp_read_barrier_depends();|
|                      |   y = x->i;                  |
|                      |   smp_read_barrier_depends();|
|                      |   z = b[y];                  |
+----------------------+------------------------------+


Comparison with Linux kernel primitives
=======================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all. Linux 4.1 updated them to implement volatile semantics via
  ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't. Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
  Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind is named ``qatomic_OP_fetch``.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers; in QEMU, all of them enforce
  sequential consistency.

- in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
  the total ordering enforced by sequentially-consistent operations.
  This is because QEMU uses the C11 memory model. The following example
  is correct in Linux but not in QEMU:

  +----------------------------------+--------------------------------+
  | Linux (correct)                  | QEMU (incorrect)               |
  +==================================+================================+
  | ::                               | ::                             |
  |                                  |                                |
  |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);|
  |   b = READ_ONCE(y);              |   b = qatomic_read(&y);        |
  +----------------------------------+--------------------------------+

  because the read of ``y`` can be moved (by either the processor or the
  compiler) before the write of ``x``.

  Fixing this requires an ``smp_mb()`` memory barrier between the write
  of ``x`` and the read of ``y``. In the common case where only one thread
  writes ``x``, it is also possible to write it like this:

  +--------------------------------+
  | QEMU (correct)                 |
  +================================+
  | ::                             |
  |                                |
  |   a = qatomic_read(&x);        |
  |   qatomic_set(&x, a + 2);      |
  |   smp_mb();                    |
  |   b = qatomic_read(&y);        |
  +--------------------------------+

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel