Diffstat (limited to 'docs'): 282 files changed, 53945 insertions, 0 deletions
diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt
new file mode 100644
index 000000000..8ec653f81
--- /dev/null
+++ b/docs/COLO-FT.txt
@@ -0,0 +1,333 @@
+COarse-grained LOck-stepping Virtual Machines for Non-stop Service
+----------------------------------------
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+Copyright (c) 2016 Fujitsu, Corp.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+This document gives an overview of COLO's design and how to use it.
+
+== Background ==
+Virtual machine (VM) replication is a well-known technique for providing
+application-agnostic software-implemented hardware fault tolerance,
+also known as "non-stop service".
+
+COLO (COarse-grained LOck-stepping) is a high availability solution.
+The primary VM (PVM) and the secondary VM (SVM) run in parallel: they
+receive the same requests from the client and generate responses in
+parallel, too. If the response packets from the PVM and the SVM are
+identical, they are released immediately; otherwise, an on-demand VM
+checkpoint is performed.
+
+== Architecture ==
+
+The architecture of COLO is shown in the diagram below.
+It consists of a pair of networked physical nodes:
+the primary node running the PVM, and the secondary node running the SVM
+to maintain a valid replica of the PVM.
+The PVM and SVM execute in parallel and generate response packets for
+client requests according to the application semantics.
+
+The incoming packets from the client or external network are received by the
+primary node and then forwarded to the secondary node, so that both the PVM
+and the SVM are stimulated with the same requests.
+
+COLO receives the outbound packets from both the PVM and SVM and compares them
+before allowing the output to be sent to clients.
+
+The SVM qualifies as a valid replica of the PVM as long as it generates
+identical responses to all client requests. Once a difference between the
+outputs of the PVM and the SVM is detected, COLO withholds transmission of the
+outbound packets until it has successfully synchronized the PVM state to the SVM.
+
+ Primary Node                                                          Secondary Node
++------------+ +-----------------------+ +------------------------+ +------------+
+| | | HeartBeat +<----->+ HeartBeat | | |
+| Primary VM | +-----------+-----------+ +-----------+------------+ |Secondary VM|
+| | | | | |
+| | +-----------|-----------+ +-----------|------------+ | |
+| | |QEMU +---v----+ | |QEMU +----v---+ | | |
+| | | |Failover| | | |Failover| | | |
+| | | +--------+ | | +--------+ | | |
+| | | +---------------+ | | +---------------+ | | |
+| | | | VM Checkpoint +-------------->+ VM Checkpoint | | | |
+| | | +---------------+ | | +---------------+ | | |
+|Requests<--------------------------\ /-----------------\ /--------------------->Requests|
+| | | ^ ^ | | | | | | |
+|Responses+---------------------\ /-|-|------------\ /-------------------------+Responses|
+| | | | | | | | | | | | | | | |
+| | | +-----------+ | | | | | | | | | | +----------+ | | |
+| | | | COLO disk | | | | | | | | | | | | COLO disk| | | |
+| | | | Manager +---------------------------->| Manager | | | |
+| | | ++----------+ v v | | | | | v v | +---------++ | | |
+| | | |+-----------+-+-+-++| | ++-+--+-+---------+ | | | |
+| | | || COLO Proxy || | | COLO Proxy | | | | |
+| | | || (compare packet || | |(adjust sequence | | | | |
+| | | ||and mirror packet)|| | | and ACK) | | | | |
+| | | |+------------+---+-+| | +-----------------+ | | | |
++------------+ +-----------------------+ +------------------------+ +------------+
++------------+ | | | | +------------+
+| VM Monitor | | | | | | VM Monitor |
++------------+ | | | | +------------+
++---------------------------------------+ +----------------------------------------+
+| Kernel | | | | | Kernel | |
++---------------------------------------+ +----------------------------------------+
+ | | | |
+ +--------------v+ +---------v---+--+ +------------------+ +v-------------+
+ | Storage | |External Network| | External Network | | Storage |
+ +---------------+ +----------------+ +------------------+ +--------------+
+
+
+== Components introduction ==
+
+The diagram above shows the several components that make up COLO.
+Their functions are described below.
+
+HeartBeat:
+Runs on both the primary and secondary nodes, to periodically check platform
+availability. When the primary node suffers a hardware fail-stop failure,
+the heartbeat stops responding and the secondary node triggers a failover
+as soon as it detects the absence.
+
+COLO disk Manager:
+When the primary VM writes data into its image, the COLO disk manager
+captures this data and sends it to the secondary node, making sure that
+the content of the secondary VM's image stays consistent with the content
+of the primary VM's image.
+For more details, please refer to docs/block-replication.txt.
+
+Checkpoint/Failover Controller:
+Modifications of the save/restore flow to realize continuous migration,
+making sure that the state of the VM on the secondary side is always
+consistent with the VM on the primary side.
+
+COLO Proxy:
+Delivers packets to the Primary and Secondary, then compares the responses
+from both sides and decides whether to start a checkpoint according to
+some rules.
+Please refer to docs/colo-proxy.txt for more information.
+
+Note:
+HeartBeat has not been implemented yet, so you currently need to trigger
+the failover process manually, using the 'x-colo-lost-heartbeat' command.
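+
+As a minimal sketch of that manual trigger (the complete failover
+procedures are given below), the command is issued on the surviving
+QEMU instance's QMP monitor:
+
+{"execute": "qmp_capabilities"}
+{"execute": "x-colo-lost-heartbeat"}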
+
+== COLO operation status ==
+
++-----------------+
+|                 |
+|   Start COLO    |
+|                 |
++--------+--------+
+         |
+         |  Main qmp command:
+         |  migrate-set-capabilities with x-colo
+         |  migrate
+         |
+         v
++--------+--------+
+|                 |
+|  COLO running   |
+|                 |
++--------+--------+
+         |
+         |  Main qmp command:
+         |  x-colo-lost-heartbeat
+         |  or
+         |  some error happened
+         v
++--------+--------+
+|                 |  send qmp event:
+| COLO failover   |  COLO_EXIT
+|                 |
++-----------------+
+
+COLO uses QMP commands to switch and to report its operation status.
+The diagram above shows only the main QMP commands; the details can be
+found in the test procedure below.
+
+== Test procedure ==
+Note: Here we are running both instances on the same host for testing;
+change the IP addresses if you want to run it on two hosts. Initially,
+127.0.0.1 is the Primary Host and 127.0.0.2 is the Secondary Host.
+
+== Startup qemu ==
+1. Primary:
+Note: Initially, $imagefolder/primary.qcow2 needs to be copied to all hosts.
+You don't need to change any IPs here, because 0.0.0.0 listens on any
+interface. The chardevs with 127.0.0.1 IPs loop back to the local qemu
+instance.
+
+# imagefolder="/mnt/vms/colo-test-primary"
+
+# qemu-system-x86_64 -enable-kvm -cpu qemu64,kvmclock=on -m 512 -smp 1 -qmp stdio \
+   -device piix3-usb-uhci -device usb-tablet -name primary \
+   -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \
+   -device rtl8139,id=e0,netdev=hn0 \
+   -chardev socket,id=mirror0,host=0.0.0.0,port=9003,server=on,wait=off \
+   -chardev socket,id=compare1,host=0.0.0.0,port=9004,server=on,wait=on \
+   -chardev socket,id=compare0,host=127.0.0.1,port=9001,server=on,wait=off \
+   -chardev socket,id=compare0-0,host=127.0.0.1,port=9001 \
+   -chardev socket,id=compare_out,host=127.0.0.1,port=9005,server=on,wait=off \
+   -chardev socket,id=compare_out0,host=127.0.0.1,port=9005 \
+   -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0 \
+   -object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out \
+   -object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0 \
+   -object iothread,id=iothread1 \
+   -object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,\
+outdev=compare_out0,iothread=iothread1 \
+   -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
+children.0.file.filename=$imagefolder/primary.qcow2,children.0.driver=qcow2 -S
+
+2. Secondary:
+Note: The active and hidden images need to be created only once, and their
+size should be the same as that of primary.qcow2. Again, you don't need to
+change any IPs here, except for the $primary_ip variable.
+
+# imagefolder="/mnt/vms/colo-test-secondary"
+# primary_ip=127.0.0.1
+
+# qemu-img create -f qcow2 $imagefolder/secondary-active.qcow2 10G
+
+# qemu-img create -f qcow2 $imagefolder/secondary-hidden.qcow2 10G
+
+# qemu-system-x86_64 -enable-kvm -cpu qemu64,kvmclock=on -m 512 -smp 1 -qmp stdio \
+   -device piix3-usb-uhci -device usb-tablet -name secondary \
+   -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \
+   -device rtl8139,id=e0,netdev=hn0 \
+   -chardev socket,id=red0,host=$primary_ip,port=9003,reconnect=1 \
+   -chardev socket,id=red1,host=$primary_ip,port=9004,reconnect=1 \
+   -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0 \
+   -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1 \
+   -object filter-rewriter,id=rew0,netdev=hn0,queue=all \
+   -drive if=none,id=parent0,file.filename=$imagefolder/primary.qcow2,driver=qcow2 \
+   -drive if=none,id=childs0,driver=replication,mode=secondary,file.driver=qcow2,\
+top-id=colo-disk0,file.file.filename=$imagefolder/secondary-active.qcow2,\
+file.backing.driver=qcow2,file.backing.file.filename=$imagefolder/secondary-hidden.qcow2,\
+file.backing.backing=parent0 \
+   -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
+children.0=childs0 \
+   -incoming tcp:0.0.0.0:9998
+
+
+3. On the Secondary VM's QEMU monitor, issue the following commands:
+{"execute":"qmp_capabilities"}
+{"execute": "nbd-server-start", "arguments": {"addr": {"type": "inet", "data": {"host": "0.0.0.0", "port": "9999"} } } }
+{"execute": "nbd-server-add", "arguments": {"device": "parent0", "writable": true } }
+
+Note:
+  a. The qmp commands nbd-server-start and nbd-server-add must be run
+     before the qmp command migrate is run on the primary QEMU.
+  b. The active disk, the hidden disk and the nbd target should all have
+     the same length.
+  c. It is better to put the active disk and the hidden disk on a ramdisk;
+     they will be merged into the parent disk on failover.
+
+4. On the Primary VM's QEMU monitor, issue the following commands:
+{"execute":"qmp_capabilities"}
+{"execute": "human-monitor-command", "arguments": {"command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0"}}
+{"execute": "x-blockdev-change", "arguments":{"parent": "colo-disk0", "node": "replication0" } }
+{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [ {"capability": "x-colo", "state": true } ] } }
+{"execute": "migrate", "arguments": {"uri": "tcp:127.0.0.2:9998" } }
+
+  Note:
+  a. There should be only one NBD client for each primary disk.
+  b. The qmp commands above must be run after the corresponding qmp
+     commands have been run in the secondary qemu.
+
+5. After the above steps, whenever you make changes to the PVM, the SVM
+will be kept in sync. You can issue the command
+'{ "execute": "migrate-set-parameters" , "arguments":{ "x-checkpoint-delay": 2000 } }'
+to change the idle checkpoint period.
+
+6. Failover test
+You can kill one of the VMs and trigger a failover on the surviving VM:
+
+If you killed the Secondary, then follow "Primary Failover". After that,
+if you want to resume the replication, follow "Primary resume replication".
+
+If you killed the Primary, then follow "Secondary Failover".
After that, +if you want to resume the replication, follow "Secondary resume replication" + +== Primary Failover == +The Secondary died, resume on the Primary + +{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "child": "children.1"} } +{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_del replication0" } } +{"execute": "object-del", "arguments":{ "id": "comp0" } } +{"execute": "object-del", "arguments":{ "id": "iothread1" } } +{"execute": "object-del", "arguments":{ "id": "m0" } } +{"execute": "object-del", "arguments":{ "id": "redire0" } } +{"execute": "object-del", "arguments":{ "id": "redire1" } } +{"execute": "x-colo-lost-heartbeat" } + +== Secondary Failover == +The Primary died, resume on the Secondary and prepare to become the new Primary + +{"execute": "nbd-server-stop"} +{"execute": "x-colo-lost-heartbeat"} + +{"execute": "object-del", "arguments":{ "id": "f2" } } +{"execute": "object-del", "arguments":{ "id": "f1" } } +{"execute": "chardev-remove", "arguments":{ "id": "red1" } } +{"execute": "chardev-remove", "arguments":{ "id": "red0" } } + +{"execute": "chardev-add", "arguments":{ "id": "mirror0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "0.0.0.0", "port": "9003" } }, "server": true } } } } +{"execute": "chardev-add", "arguments":{ "id": "compare1", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "0.0.0.0", "port": "9004" } }, "server": true } } } } +{"execute": "chardev-add", "arguments":{ "id": "compare0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9001" } }, "server": true } } } } +{"execute": "chardev-add", "arguments":{ "id": "compare0-0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9001" } }, "server": false } } } } +{"execute": "chardev-add", "arguments":{ "id": "compare_out", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9005" } }, "server": true } } } } +{"execute": "chardev-add", "arguments":{ "id": "compare_out0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9005" } }, "server": false } } } } + +== Primary resume replication == +Resume replication after new Secondary is up. 
+
+Start the new Secondary (Steps 2 and 3 above), then on the Primary:
+{"execute": "drive-mirror", "arguments":{ "device": "colo-disk0", "job-id": "resync", "target": "nbd://127.0.0.2:9999/parent0", "mode": "existing", "format": "raw", "sync": "full"} }
+
+Wait until the disk is synced, then:
+{"execute": "stop"}
+{"execute": "block-job-cancel", "arguments":{ "device": "resync"} }
+
+{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0"}}
+{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "node": "replication0" } }
+
+{"execute": "object-add", "arguments":{ "qom-type": "filter-mirror", "id": "m0", "netdev": "hn0", "queue": "tx", "outdev": "mirror0" } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire0", "netdev": "hn0", "queue": "rx", "indev": "compare_out" } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire1", "netdev": "hn0", "queue": "rx", "outdev": "compare0" } }
+{"execute": "object-add", "arguments":{ "qom-type": "iothread", "id": "iothread1" } }
+{"execute": "object-add", "arguments":{ "qom-type": "colo-compare", "id": "comp0", "primary_in": "compare0-0", "secondary_in": "compare1", "outdev": "compare_out0", "iothread": "iothread1" } }
+
+{"execute": "migrate-set-capabilities", "arguments":{ "capabilities": [ {"capability": "x-colo", "state": true } ] } }
+{"execute": "migrate", "arguments":{ "uri": "tcp:127.0.0.2:9998" } }
+
+Note:
+If this Primary was previously a Secondary, then the filters need to be
+inserted before the filter-rewriter, by using the
+'"insert": "before", "position": "id=rew0"' options. See below.
+
+== Secondary resume replication ==
+Become Primary and resume replication after the new Secondary is up. Note
+that now 127.0.0.1 is the Secondary and 127.0.0.2 is the Primary.
+ +Start the new Secondary (Steps 2 and 3 above, but with primary_ip=127.0.0.2), +then on the old Secondary: +{"execute": "drive-mirror", "arguments":{ "device": "colo-disk0", "job-id": "resync", "target": "nbd://127.0.0.1:9999/parent0", "mode": "existing", "format": "raw", "sync": "full"} } + +Wait until disk is synced, then: +{"execute": "stop"} +{"execute": "block-job-cancel", "arguments":{ "device": "resync" } } + +{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.1,file.port=9999,file.export=parent0,node-name=replication0"}} +{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "node": "replication0" } } + +{"execute": "object-add", "arguments":{ "qom-type": "filter-mirror", "id": "m0", "insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "tx", "outdev": "mirror0" } } +{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire0", "insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "rx", "indev": "compare_out" } } +{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire1", "insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "rx", "outdev": "compare0" } } +{"execute": "object-add", "arguments":{ "qom-type": "iothread", "id": "iothread1" } } +{"execute": "object-add", "arguments":{ "qom-type": "colo-compare", "id": "comp0", "primary_in": "compare0-0", "secondary_in": "compare1", "outdev": "compare_out0", "iothread": "iothread1" } } + +{"execute": "migrate-set-capabilities", "arguments":{ "capabilities": [ {"capability": "x-colo", "state": true } ] } } +{"execute": "migrate", "arguments":{ "uri": "tcp:127.0.0.1:9998" } } + +== TODO == +1. Support shared storage. +2. Develop the heartbeat part. +3. Reduce checkpoint VM’s downtime while doing checkpoint. diff --git a/docs/_templates/footer.html b/docs/_templates/footer.html new file mode 100644 index 000000000..977053b54 --- /dev/null +++ b/docs/_templates/footer.html @@ -0,0 +1,14 @@ +{% extends "!footer.html" %} +{% block extrafooter %} + +<!-- Empty para to force a blank line after "Built with Sphinx ..." --> +<p></p> + +<p>This documentation is for QEMU version {{ version }}.</p> + +{% trans path=pathto('about/license') %} +<p><a href="{{ path }}">QEMU and this manual are released under the +GNU General Public License, version 2.</a></p> +{% endtrans %} +{{ super() }} +{% endblock %} diff --git a/docs/about/build-platforms.rst b/docs/about/build-platforms.rst new file mode 100644 index 000000000..c29a4b8fe --- /dev/null +++ b/docs/about/build-platforms.rst @@ -0,0 +1,97 @@ +.. _Supported-build-platforms: + +Supported build platforms +========================= + +QEMU aims to support building and executing on multiple host OS +platforms. This appendix outlines which platforms are the major build +targets. These platforms are used as the basis for deciding upon the +minimum required versions of 3rd party software QEMU depends on. The +supported platforms are the targets for automated testing performed by +the project when patches are submitted for review, and tested before and +after merge. + +If a platform is not listed here, it does not imply that QEMU won't +work. If an unlisted platform has comparable software versions to a +listed platform, there is every expectation that it will work. 
Bug +reports are welcome for problems encountered on unlisted platforms +unless they are clearly older vintage than what is described here. + +Note that when considering software versions shipped in distros as +support targets, QEMU considers only the version number, and assumes the +features in that distro match the upstream release with the same +version. In other words, if a distro backports extra features to the +software in their distro, QEMU upstream code will not add explicit +support for those backports, unless the feature is auto-detectable in a +manner that works for the upstream releases too. + +The `Repology`_ site is a useful resource to identify +currently shipped versions of software in various operating systems, +though it does not cover all distros listed below. + +Supported host architectures +---------------------------- + +Those hosts are officially supported, with various accelerators: + + .. list-table:: + :header-rows: 1 + + * - CPU Architecture + - Accelerators + * - Arm + - kvm (64 bit only), tcg, xen + * - MIPS + - kvm, tcg + * - PPC + - kvm, tcg + * - RISC-V + - tcg + * - s390x + - kvm, tcg + * - SPARC + - tcg + * - x86 + - hax, hvf (64 bit only), kvm, nvmm, tcg, whpx (64 bit only), xen + +Other host architectures are not supported. It is possible to build QEMU system +emulation on an unsupported host architecture using the configure +``--enable-tcg-interpreter`` option to enable the TCI support, but note that +this is very slow and is not recommended for normal use. QEMU user emulation +requires host-specific support for signal handling, therefore TCI won't help +on unsupported host architectures. + +Non-supported architectures may be removed in the future following the +:ref:`deprecation process<Deprecated features>`. + +Linux OS, macOS, FreeBSD, NetBSD, OpenBSD +----------------------------------------- + +The project aims to support the most recent major version at all times. Support +for the previous major version will be dropped 2 years after the new major +version is released or when the vendor itself drops support, whichever comes +first. In this context, third-party efforts to extend the lifetime of a distro +are not considered, even when they are endorsed by the vendor (eg. Debian LTS). + +For the purposes of identifying supported software versions available on Linux, +the project will look at CentOS, Debian, Fedora, openSUSE, RHEL, SLES and +Ubuntu LTS. Other distros will be assumed to ship similar software versions. + +For FreeBSD and OpenBSD, decisions will be made based on the contents of the +respective ports repository, while NetBSD will use the pkgsrc repository. + +For macOS, `HomeBrew`_ will be used, although `MacPorts`_ is expected to carry +similar versions. + +Windows +------- + +The project supports building with current versions of the MinGW toolchain, +hosted on Linux (Debian/Fedora). + +The version of the Windows API that's currently targeted is Vista / Server +2008. + +.. _HomeBrew: https://brew.sh/ +.. _MacPorts: https://www.macports.org/ +.. _Repology: https://repology.org/ diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst new file mode 100644 index 000000000..ff7488cb6 --- /dev/null +++ b/docs/about/deprecated.rst @@ -0,0 +1,440 @@ +.. _Deprecated features: + +Deprecated features +=================== + +In general features are intended to be supported indefinitely once +introduced into QEMU. In the event that a feature needs to be removed, +it will be listed in this section. 
The feature will remain functional for the
+release in which it was deprecated and one further release. After these two
+releases, the feature is liable to be removed. Deprecated features may also
+generate warnings on the console when QEMU starts up, or if activated via a
+monitor command; however, this is not mandatory.
+
+Prior to the 2.10.0 release there was no official policy on how
+long features would be deprecated prior to their removal, nor
+any documented list of which features were deprecated. Thus
+any features deprecated prior to 2.10.0 will be treated as if
+they were first deprecated in the 2.10.0 release.
+
+What follows is a list of all features currently marked as
+deprecated.
+
+System emulator command line arguments
+--------------------------------------
+
+``QEMU_AUDIO_`` environment variables and ``-audio-help`` (since 4.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``-audiodev`` argument is now the preferred way to specify audio
+backend settings instead of environment variables. To ease migration to
+the new format, the ``-audiodev-help`` option can be used to convert
+the current values of the environment variables to ``-audiodev`` options.
+
+Creating sound card devices and vnc without ``audiodev=`` property (since 4.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+When not using the deprecated legacy audio config, each sound card
+should specify an ``audiodev=`` property. Additionally, when using
+vnc, you should specify an ``audiodev=`` property if you plan to
+transmit audio through the VNC protocol.
+
+Creating sound card devices using ``-soundhw`` (since 5.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Sound card devices should be created using ``-device`` instead. The
+names are the same for most devices. The exceptions are ``hda``, which
+needs two devices (``-device intel-hda -device hda-duplex``), and
+``pcspk``, which can be activated using ``-machine
+pcspk-audiodev=<name>``.
+
+``-chardev`` backend aliases ``tty`` and ``parport`` (since 6.0)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+``tty`` and ``parport`` are aliases that will be removed. Instead, the
+actual backend names ``serial`` and ``parallel`` should be used.
+
+Short-form boolean options (since 6.0)
+''''''''''''''''''''''''''''''''''''''
+
+Boolean options such as ``share=on``/``share=off`` could be written
+in short form as ``share`` and ``noshare``. This is now deprecated
+and will cause a warning.
+
+``delay`` option for socket character devices (since 6.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The replacement for the ``nodelay`` short-form boolean option is ``nodelay=on``
+rather than ``delay=off``.
+
+``--enable-fips`` (since 6.0)
+'''''''''''''''''''''''''''''
+
+This option restricts usage of certain cryptographic algorithms when
+the host is operating in FIPS mode.
+
+If FIPS compliance is required, QEMU should be built with the ``libgcrypt``
+library enabled as a cryptography provider.
+
+Neither the ``nettle`` library nor the built-in cryptography provider is
+supported on FIPS-enabled hosts.
+
+``-writeconfig`` (since 6.0)
+'''''''''''''''''''''''''''''
+
+The ``-writeconfig`` option is not able to serialize the entire contents
+of the QEMU command line. It is thus considered a failed experiment
+and deprecated, with no current replacement.
+ +Userspace local APIC with KVM (x86, since 6.0) +'''''''''''''''''''''''''''''''''''''''''''''' + +Using ``-M kernel-irqchip=off`` with x86 machine types that include a local +APIC is deprecated. The ``split`` setting is supported, as is using +``-M kernel-irqchip=off`` with the ISA PC machine type. + +hexadecimal sizes with scaling multipliers (since 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Input parameters that take a size value should only use a size suffix +(such as 'k' or 'M') when the base is written in decimal, and not when +the value is hexadecimal. That is, '0x20M' is deprecated, and should +be written either as '32M' or as '0x2000000'. + +``-spice password=string`` (since 6.0) +'''''''''''''''''''''''''''''''''''''' + +This option is insecure because the SPICE password remains visible in +the process listing. This is replaced by the new ``password-secret`` +option which lets the password be securely provided on the command +line using a ``secret`` object instance. + +``opened`` property of ``rng-*`` objects (since 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''' + +The only effect of specifying ``opened=on`` in the command line or QMP +``object-add`` is that the device is opened immediately, possibly before all +other options have been processed. This will either have no effect (if +``opened`` was the last option) or cause errors. The property is therefore +useless and should not be specified. + +``loaded`` property of ``secret`` and ``secret_keyring`` objects (since 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The only effect of specifying ``loaded=on`` in the command line or QMP +``object-add`` is that the secret is loaded immediately, possibly before all +other options have been processed. This will either have no effect (if +``loaded`` was the last option) or cause options to be effectively ignored as +if they were not given. The property is therefore useless and should not be +specified. + +``-display sdl,window_close=...`` (since 6.1) +''''''''''''''''''''''''''''''''''''''''''''' + +Use ``-display sdl,window-close=...`` instead (i.e. with a minus instead of +an underscore between "window" and "close"). + +``-no-quit`` (since 6.1) +'''''''''''''''''''''''' + +The ``-no-quit`` is a synonym for ``-display ...,window-close=off`` which +should be used instead. + +``-alt-grab`` and ``-display sdl,alt_grab=on`` (since 6.2) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Use ``-display sdl,grab-mod=lshift-lctrl-lalt`` instead. + +``-ctrl-grab`` and ``-display sdl,ctrl_grab=on`` (since 6.2) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Use ``-display sdl,grab-mod=rctrl`` instead. + +``-sdl`` (since 6.2) +'''''''''''''''''''' + +Use ``-display sdl`` instead. + +``-curses`` (since 6.2) +''''''''''''''''''''''' + +Use ``-display curses`` instead. + +``-watchdog`` (since 6.2) +''''''''''''''''''''''''' + +Use ``-device`` instead. + +``-smp`` ("parameter=0" SMP configurations) (since 6.2) +''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Specified CPU topology parameters must be greater than zero. + +In the SMP configuration, users should either provide a CPU topology +parameter with a reasonable value (greater than zero) or just omit it +and QEMU will compute the missing value. + +However, historically it was implicitly allowed for users to provide +a parameter with zero value, which is meaningless and could also possibly +cause unexpected results in the -smp parsing. 
So support for this kind of
+configuration (e.g. ``-smp 8,sockets=0``) is deprecated since 6.2 and will
+be removed in the near future; users have to ensure that all the topology
+members described with ``-smp`` are greater than zero.
+
+Plugin argument passing through ``arg=<string>`` (since 6.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Passing TCG plugin arguments through ``arg=`` is redundant and makes the
+command line less readable, especially when the argument itself consists of
+a name and a value, e.g. ``-plugin plugin_name,arg="arg_name=arg_value"``.
+Therefore, the usage of ``arg`` is redundant. Single-word arguments are
+treated as short-form boolean values, and passed to plugins as
+``arg_name=on``. However, short-form booleans are deprecated and the full
+explicit ``arg_name=on`` form is preferred.
+
+``-drive if=none`` for the sifive_u OTP device (since 6.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Using ``-drive if=none`` to configure the OTP device of the sifive_u
+RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.
+
+
+QEMU Machine Protocol (QMP) commands
+------------------------------------
+
+``blockdev-open-tray``, ``blockdev-close-tray`` argument ``device`` (since 2.8)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use argument ``id`` instead.
+
+``eject`` argument ``device`` (since 2.8)
+'''''''''''''''''''''''''''''''''''''''''
+
+Use argument ``id`` instead.
+
+``blockdev-change-medium`` argument ``device`` (since 2.8)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use argument ``id`` instead.
+
+``block_set_io_throttle`` argument ``device`` (since 2.8)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use argument ``id`` instead.
+
+``blockdev-add`` empty string argument ``backing`` (since 2.10)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use argument value ``null`` instead.
+
+``block-commit`` arguments ``base`` and ``top`` (since 3.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use arguments ``base-node`` and ``top-node`` instead.
+
+``nbd-server-add`` and ``nbd-server-remove`` (since 5.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use the more generic commands ``block-export-add`` and ``block-export-del``
+instead. As part of this deprecation, where ``nbd-server-add`` used a
+single ``bitmap``, the new ``block-export-add`` uses a list of ``bitmaps``.
+
+``query-qmp-schema`` return value member ``values`` (since 6.2)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Member ``values`` in return value elements with meta-type ``enum`` is
+deprecated. Use ``members`` instead.
+
+``drive-backup`` (since 6.2)
+''''''''''''''''''''''''''''
+
+Use ``blockdev-backup`` in combination with ``blockdev-add`` instead.
+This change primarily separates the creation/opening process of the backup
+target into explicit, separate steps. ``blockdev-backup`` uses mostly the
+same arguments as ``drive-backup``, except the ``format`` and ``mode``
+options are removed in favor of using explicit ``blockdev-create`` and
+``blockdev-add`` calls. See :doc:`/interop/live-block-operations` for
+details.
+
+Incorrectly typed ``device_add`` arguments (since 6.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Due to shortcomings in the internal implementation of ``device_add``, QEMU
+incorrectly accepts certain invalid arguments: any object or list arguments
+are silently ignored.
Other argument types are not checked, but an implicit
+conversion happens, so that e.g. string values can be assigned to integer
+device properties or vice versa.
+
+This is a bug in QEMU that will be fixed in the future so that previously
+accepted incorrect commands will return an error. Users should make sure that
+all arguments passed to ``device_add`` are consistent with the documented
+property types.
+
+System accelerators
+-------------------
+
+MIPS ``Trap-and-Emul`` KVM support (since 6.0)
+''''''''''''''''''''''''''''''''''''''''''''''
+
+The MIPS ``Trap-and-Emul`` KVM host and guest support has been removed
+from the upstream Linux kernel, so it is declared deprecated here as well.
+
+System emulator CPUs
+--------------------
+
+``Icelake-Client`` CPU Model (since 5.2)
+''''''''''''''''''''''''''''''''''''''''
+
+``Icelake-Client`` CPU Models are deprecated. Use ``Icelake-Server`` CPU
+Models instead.
+
+MIPS ``I7200`` CPU Model (since 5.2)
+''''''''''''''''''''''''''''''''''''
+
+The ``I7200`` guest CPU relies on the nanoMIPS ISA, which is deprecated
+(the ISA has never been upstreamed to a compiler toolchain). Therefore
+this CPU is also deprecated.
+
+
+QEMU API (QAPI) events
+----------------------
+
+``MEM_UNPLUG_ERROR`` (since 6.2)
+''''''''''''''''''''''''''''''''
+
+Use the more generic event ``DEVICE_UNPLUG_GUEST_ERROR`` instead.
+
+
+System emulator machines
+------------------------
+
+Aspeed ``swift-bmc`` machine (since 6.1)
+''''''''''''''''''''''''''''''''''''''''
+
+This machine is deprecated because we have enough AST2500-based OpenPOWER
+machines. It can easily be replaced by the ``witherspoon-bmc`` or the
+``romulus-bmc`` machines.
+
+Backend options
+---------------
+
+Using a non-persistent backing file with pmem=on (since 6.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+This option is used when a ``memory-backend-file`` is consumed by an emulated
+NVDIMM device. However, enabling the ``memory-backend-file.pmem`` option when
+the backing file is (a) not DAX capable or (b) not on a filesystem that
+supports direct mapping of persistent memory is not safe and may lead to
+data loss or corruption in case of a host crash.
+Options are:
+
+ - modify the VM configuration to set ``pmem=off`` to continue using a fake
+   NVDIMM (without persistence guarantees) with the backing file on non-DAX
+   storage
+ - move the backing file to NVDIMM storage and keep ``pmem=on``
+   (to have an NVDIMM with persistence guarantees).
+
+Device options
+--------------
+
+Emulated device options
+'''''''''''''''''''''''
+
+``-device virtio-blk,scsi=on|off`` (since 5.0)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The virtio-blk SCSI passthrough feature is a legacy VIRTIO feature. VIRTIO 1.0
+and later do not support it because the virtio-scsi device was introduced for
+full SCSI support. Use virtio-scsi instead when SCSI passthrough is required.
+
+Note this also applies to ``-device virtio-blk-pci,scsi=on|off``, which is an
+alias.
+
+``-device sga`` (since 6.2)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``sga`` device loads an option ROM for x86 targets which enables
+SeaBIOS to send messages to the serial console. SeaBIOS 1.11.0 onwards
+contains native support for this feature and thus use of the option
+ROM approach is obsolete. The native SeaBIOS support can be activated
+by using ``-machine graphics=off``.
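+
+As a sketch of the replacement (machine type, display and serial wiring are
+illustrative, not part of the original ``sga`` setup)::
+
+  qemu-system-x86_64 -machine pc,graphics=off -display none -serial mon:stdio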
+
+
+Block device options
+''''''''''''''''''''
+
+``"backing": ""`` (since 2.12)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to prevent QEMU from automatically opening an image's backing
+chain, use ``"backing": null`` instead.
+
+``rbd`` keyvalue pair encoded filenames: ``""`` (since 3.1)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Options for ``rbd`` should be specified according to its runtime options,
+like other block drivers. Legacy parsing of keyvalue pair encoded
+filenames is useful to open images with the old format for backing files;
+these image files should be updated to use the current format.
+
+Example of legacy encoding::
+
+  json:{"file.driver":"rbd", "file.filename":"rbd:rbd/name"}
+
+The above, converted to the current supported format::
+
+  json:{"file.driver":"rbd", "file.pool":"rbd", "file.image":"name"}
+
+linux-user mode CPUs
+--------------------
+
+``ppc64abi32`` CPUs (since 5.2)
+'''''''''''''''''''''''''''''''
+
+The ``ppc64abi32`` architecture has a number of issues which regularly
+trip up our CI testing and is suspected to be quite broken. For that
+reason the maintainers strongly suspect no one actually uses it.
+
+MIPS ``I7200`` CPU (since 5.2)
+''''''''''''''''''''''''''''''
+
+The ``I7200`` guest CPU relies on the nanoMIPS ISA, which is deprecated
+(the ISA has never been upstreamed to a compiler toolchain). Therefore
+this CPU is also deprecated.
+
+Backwards compatibility
+-----------------------
+
+Runnability guarantee of CPU models (since 4.1)
+'''''''''''''''''''''''''''''''''''''''''''''''
+
+Previous versions of QEMU never changed existing CPU models in
+ways that introduced additional host software or hardware
+requirements to the VM. This allowed management software to
+safely change the machine type of an existing VM without
+introducing new requirements ("runnability guarantee"). This
+prevented CPU models from being updated to include CPU
+vulnerability mitigations, leaving guests vulnerable in the
+default configuration.
+
+The CPU model runnability guarantee won't apply anymore to
+existing CPU models. Management software that needs runnability
+guarantees must resolve the CPU model aliases using the
+``alias-of`` field returned by the ``query-cpu-definitions`` QMP
+command.
+
+While those guarantees are kept, the return value of
+``query-cpu-definitions`` will have existing CPU model aliases
+point to a version that doesn't break runnability guarantees
+(specifically, version 1 of those CPU models). In future QEMU
+versions, aliases will point to newer CPU model versions
+depending on the machine type, so management software must
+resolve CPU model aliases before starting a virtual machine.
+
+Guest Emulator ISAs
+-------------------
+
+nanoMIPS ISA
+''''''''''''
+
+The ``nanoMIPS`` ISA has never been upstreamed to any compiler toolchain.
+As it is hard to generate binaries for it, it is declared deprecated.
diff --git a/docs/about/index.rst b/docs/about/index.rst
new file mode 100644
index 000000000..5bea653c0
--- /dev/null
+++ b/docs/about/index.rst
@@ -0,0 +1,28 @@
+----------
+About QEMU
+----------
+
+QEMU is a generic and open source machine emulator and virtualizer.
+
+QEMU can be used in several different ways. The most common is for
+"system emulation", where it provides a virtual model of an
+entire machine (CPU, memory and emulated devices) to run a guest OS.
+In this mode the CPU may be fully emulated, or it may work with +a hypervisor such as KVM, Xen, Hax or Hypervisor.Framework to +allow the guest to run directly on the host CPU. + +The second supported way to use QEMU is "user mode emulation", +where QEMU can launch processes compiled for one CPU on another CPU. +In this mode the CPU is always emulated. + +QEMU also provides a number of standalone commandline utilities, +such as the ``qemu-img`` disk image utility that allows you to create, +convert and modify disk images. + +.. toctree:: + :maxdepth: 2 + + build-platforms + deprecated + removed-features + license diff --git a/docs/about/license.rst b/docs/about/license.rst new file mode 100644 index 000000000..cde3d2d25 --- /dev/null +++ b/docs/about/license.rst @@ -0,0 +1,11 @@ +.. _License: + +License +======= + +QEMU is a trademark of Fabrice Bellard. + +QEMU is released under the `GNU General Public +License <https://www.gnu.org/licenses/gpl-2.0.txt>`__, version 2. Parts +of QEMU have specific licenses, see file +`LICENSE <https://git.qemu.org/?p=qemu.git;a=blob_plain;f=LICENSE>`__. diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst new file mode 100644 index 000000000..d42c3341d --- /dev/null +++ b/docs/about/removed-features.rst @@ -0,0 +1,705 @@ + +Removed features +================ + +What follows is a record of recently removed, formerly deprecated +features that serves as a record for users who have encountered +trouble after a recent upgrade. + +System emulator command line arguments +-------------------------------------- + +``-hdachs`` (removed in 2.12) +''''''''''''''''''''''''''''' + +The geometry defined by ``-hdachs c,h,s,t`` should now be specified via +``-device ide-hd,drive=dr,cyls=c,heads=h,secs=s,bios-chs-trans=t`` +(together with ``-drive if=none,id=dr,...``). + +``-net channel`` (removed in 2.12) +'''''''''''''''''''''''''''''''''' + +This option has been replaced by ``-net user,guestfwd=...``. + +``-net dump`` (removed in 2.12) +''''''''''''''''''''''''''''''' + +``-net dump[,vlan=n][,file=filename][,len=maxlen]`` has been replaced by +``-object filter-dump,id=id,netdev=dev[,file=filename][,maxlen=maxlen]``. +Note that the new syntax works with netdev IDs instead of the old "vlan" hubs. + +``-no-kvm-pit`` (removed in 2.12) +''''''''''''''''''''''''''''''''' + +This was just a dummy option that has been ignored, since the in-kernel PIT +cannot be disabled separately from the irqchip anymore. A similar effect +(which also disables the KVM IOAPIC) can be obtained with +``-M kernel_irqchip=split``. + +``-tdf`` (removed in 2.12) +'''''''''''''''''''''''''' + +There is no replacement, the ``-tdf`` option has just been ignored since the +behaviour that could be changed by this option in qemu-kvm is now the default +when using the KVM PIT. It still can be requested explicitly using +``-global kvm-pit.lost_tick_policy=delay``. + +``-drive secs=s``, ``-drive heads=h`` & ``-drive cyls=c`` (removed in 3.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The drive geometry should now be specified via +``-device ...,drive=dr,cyls=c,heads=h,secs=s`` (together with +``-drive if=none,id=dr,...``). + +``-drive serial=``, ``-drive trans=`` & ``-drive addr=`` (removed in 3.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Use ``-device ...,drive=dr,serial=r,bios-chs-trans=t,addr=a`` instead +(together with ``-drive if=none,id=dr,...``). 
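+
+As a concrete sketch of this conversion (the image file name and serial
+number are illustrative)::
+
+  -drive if=none,id=dr0,file=disk.qcow2
+  -device ide-hd,drive=dr0,serial=SN000001,bios-chs-trans=lba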
+ +``-net ...,vlan=x`` (removed in 3.0) +'''''''''''''''''''''''''''''''''''' + +The term "vlan" was very confusing for most users in this context (it's about +specifying a hub ID, not about IEEE 802.1Q or something similar), so this +has been removed. To connect one NIC frontend with a network backend, either +use ``-nic ...`` (e.g. for on-board NICs) or use ``-netdev ...,id=n`` together +with ``-device ...,netdev=n`` (for full control over pluggable NICs). To +connect multiple NICs or network backends via a hub device (which is what +vlan did), use ``-nic hubport,hubid=x,...`` or +``-netdev hubport,id=n,hubid=x,...`` (with ``-device ...,netdev=n``) instead. + +``-no-kvm-irqchip`` (removed in 3.0) +'''''''''''''''''''''''''''''''''''' + +Use ``-machine kernel_irqchip=off`` instead. + +``-no-kvm-pit-reinjection`` (removed in 3.0) +'''''''''''''''''''''''''''''''''''''''''''' + +Use ``-global kvm-pit.lost_tick_policy=discard`` instead. + +``-balloon`` (removed in 3.1) +''''''''''''''''''''''''''''' + +The ``-balloon virtio`` option has been replaced by ``-device virtio-balloon``. +The ``-balloon none`` option was a no-op and has no replacement. + +``-bootp`` (removed in 3.1) +''''''''''''''''''''''''''' + +The ``-bootp /some/file`` argument is replaced by either +``-netdev user,id=x,bootp=/some/file`` (for pluggable NICs, accompanied with +``-device ...,netdev=x``), or ``-nic user,bootp=/some/file`` (for on-board NICs). +The new syntax allows different settings to be provided per NIC. + +``-redir`` (removed in 3.1) +''''''''''''''''''''''''''' + +The ``-redir [tcp|udp]:hostport:[guestaddr]:guestport`` option is replaced +by either ``-netdev +user,id=x,hostfwd=[tcp|udp]:[hostaddr]:hostport-[guestaddr]:guestport`` +(for pluggable NICs, accompanied with ``-device ...,netdev=x``) or by the option +``-nic user,hostfwd=[tcp|udp]:[hostaddr]:hostport-[guestaddr]:guestport`` +(for on-board NICs). The new syntax allows different settings to be provided +per NIC. + +``-smb`` (removed in 3.1) +''''''''''''''''''''''''' + +The ``-smb /some/dir`` argument is replaced by either +``-netdev user,id=x,smb=/some/dir`` (for pluggable NICs, accompanied with +``-device ...,netdev=x``), or ``-nic user,smb=/some/dir`` (for on-board NICs). +The new syntax allows different settings to be provided per NIC. + +``-tftp`` (removed in 3.1) +'''''''''''''''''''''''''' + +The ``-tftp /some/dir`` argument is replaced by either +``-netdev user,id=x,tftp=/some/dir`` (for pluggable NICs, accompanied with +``-device ...,netdev=x``), or ``-nic user,tftp=/some/dir`` (for embedded NICs). +The new syntax allows different settings to be provided per NIC. + +``-localtime`` (removed in 3.1) +''''''''''''''''''''''''''''''' + +Replaced by ``-rtc base=localtime``. + +``-nodefconfig`` (removed in 3.1) +''''''''''''''''''''''''''''''''' + +Use ``-no-user-config`` instead. + +``-rtc-td-hack`` (removed in 3.1) +''''''''''''''''''''''''''''''''' + +Use ``-rtc driftfix=slew`` instead. + +``-startdate`` (removed in 3.1) +''''''''''''''''''''''''''''''' + +Replaced by ``-rtc base=date``. + +``-vnc ...,tls=...``, ``-vnc ...,x509=...`` & ``-vnc ...,x509verify=...`` (removed in 3.1) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The "tls-creds" option should be used instead to point to a "tls-creds-x509" +object created using "-object". 
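+
+As a sketch of the replacement (the certificate directory is illustrative)::
+
+  -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server
+  -vnc :1,tls-creds=tls0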
+
+``-mem-path`` fallback to RAM (removed in 5.0)
+''''''''''''''''''''''''''''''''''''''''''''''
+
+If guest RAM allocation from the file pointed to by ``mem-path`` failed,
+QEMU was falling back to allocating from RAM, which might have resulted
+in unpredictable behavior since the backing file specified by the user
+was ignored. Currently, users are responsible for making sure the backing
+storage specified with ``-mem-path`` can actually provide the guest RAM
+configured with ``-m``, and QEMU fails to start up if RAM allocation is
+unsuccessful.
+
+``-net ...,name=...`` (removed in 5.1)
+''''''''''''''''''''''''''''''''''''''
+
+The ``name`` parameter of the ``-net`` option was a synonym
+for the ``id`` parameter, which should now be used instead.
+
+``-numa node,mem=...`` (removed in 5.1)
+'''''''''''''''''''''''''''''''''''''''
+
+The parameter ``mem`` of ``-numa node`` was used to assign a part of guest
+RAM to a NUMA node. But when using it, it was impossible to manage the
+specified RAM chunk on the host side (e.g. bind it to a host node, set a
+bind policy, ...), so the guest ended up with a fake NUMA configuration
+with suboptimal performance.
+However, since 2014 there has been an alternative way to assign RAM to a
+NUMA node using the parameter ``memdev``, which does the same as ``mem``
+and adds the means to actually manage node RAM on the host side. Use the
+parameter ``memdev`` with a *memory-backend-ram* backend as a replacement
+for the parameter ``mem`` to achieve the same fake NUMA effect, or a
+properly configured *memory-backend-file* backend to actually benefit from
+the NUMA configuration. (A short sketch of the ``memdev`` form is shown at
+the end of this group of entries.)
+New machine versions (since 5.1) will not accept the option, but it will
+still work with old machine types. Users can check the QAPI schema to see
+if the legacy option is supported by looking at the
+MachineInfo::numa-mem-supported property.
+
+``-numa`` node (without memory specified) (removed in 5.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Splitting RAM by default between NUMA nodes had the same issues as the
+``mem`` parameter, with the difference that QEMU played the user's role
+here, using an implicit generic or board-specific splitting rule.
+Use ``memdev`` with a *memory-backend-ram* backend or ``mem`` (if
+it's supported by the used machine type) to define the mapping explicitly
+instead.
+Users of existing VMs wishing to preserve the same RAM distribution should
+configure it explicitly using ``-numa node,memdev`` options. The current
+RAM distribution can be retrieved using the HMP command ``info numa``, and
+if separate memory devices (pc|nv-dimm) are present, use
+``info memory-device`` and subtract the device memory from the output of
+``info numa``.
+
+``-smp`` (invalid topologies) (removed in 5.2)
+''''''''''''''''''''''''''''''''''''''''''''''
+
+CPU topology properties should describe the whole machine topology,
+including possible CPUs.
+
+However, historically it was possible to start QEMU with an incorrect
+topology where *n* <= *sockets* * *cores* * *threads* < *maxcpus*,
+which could lead to an incorrect topology enumeration by the guest.
+Support for invalid topologies is removed; the user must ensure that
+topologies described with -smp include all possible cpus, i.e.
+*sockets* * *cores* * *threads* = *maxcpus*.
+
+``-machine enforce-config-section=on|off`` (removed in 5.2)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``enforce-config-section`` property was replaced by the
+``-global migration.send-configuration={on|off}`` option.
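+
+As a sketch of the ``memdev`` form mentioned above (node count and size are
+illustrative only)::
+
+  -object memory-backend-ram,id=mem0,size=2G
+  -numa node,nodeid=0,memdev=mem0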
+ +``-no-kvm`` (removed in 5.2) +'''''''''''''''''''''''''''' + +The ``-no-kvm`` argument was a synonym for setting ``-machine accel=tcg``. + +``-realtime`` (removed in 6.0) +'''''''''''''''''''''''''''''' + +The ``-realtime mlock=on|off`` argument has been replaced by the +``-overcommit mem-lock=on|off`` argument. + +``-show-cursor`` option (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''' + +Use ``-display sdl,show-cursor=on``, ``-display gtk,show-cursor=on`` +or ``-display default,show-cursor=on`` instead. + +``-tb-size`` option (removed in 6.0) +'''''''''''''''''''''''''''''''''''' + +QEMU 5.0 introduced an alternative syntax to specify the size of the translation +block cache, ``-accel tcg,tb-size=``. + +``-usbdevice audio`` (removed in 6.0) +''''''''''''''''''''''''''''''''''''' + +This option lacked the possibility to specify an audio backend device. +Use ``-device usb-audio`` now instead (and specify a corresponding USB +host controller or ``-usb`` if necessary). + +``-vnc acl`` (removed in 6.0) +''''''''''''''''''''''''''''' + +The ``acl`` option to the ``-vnc`` argument has been replaced +by the ``tls-authz`` and ``sasl-authz`` options. + +``-mon ...,control=readline,pretty=on|off`` (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The ``pretty=on|off`` switch has no effect for HMP monitors and +its use is rejected. + +``-drive file=json:{...{'driver':'file'}}`` (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The 'file' driver for drives is no longer appropriate for character or host +devices and will only accept regular files (S_IFREG). The correct driver +for these file types is 'host_cdrom' or 'host_device' as appropriate. + +Floppy controllers' drive properties (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Use ``-device floppy,...`` instead. When configuring onboard floppy +controllers +:: + + -global isa-fdc.driveA=... + -global sysbus-fdc.driveA=... + -global SUNW,fdtwo.drive=... + +become +:: + + -device floppy,unit=0,drive=... + +and +:: + + -global isa-fdc.driveB=... + -global sysbus-fdc.driveB=... + +become +:: + + -device floppy,unit=1,drive=... + +When plugging in a floppy controller +:: + + -device isa-fdc,...,driveA=... + +becomes +:: + + -device isa-fdc,... + -device floppy,unit=0,drive=... + +and +:: + + -device isa-fdc,...,driveB=... + +becomes +:: + + -device isa-fdc,... + -device floppy,unit=1,drive=... + +``-drive`` with bogus interface type (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Drives with interface types other than ``if=none`` are for onboard +devices. Drives the board doesn't pick up can no longer be used with +-device. Use ``if=none`` instead. + +``-usbdevice ccid`` (removed in 6.0) +''''''''''''''''''''''''''''''''''''' + +This option was undocumented and not used in the field. +Use ``-device usb-ccid`` instead. + +RISC-V firmware not booted by default (removed in 5.1) +'''''''''''''''''''''''''''''''''''''''''''''''''''''' + +QEMU 5.1 changes the default behaviour from ``-bios none`` to ``-bios default`` +for the RISC-V ``virt`` machine and ``sifive_u`` machine. + +QEMU Machine Protocol (QMP) commands +------------------------------------ + +``block-dirty-bitmap-add`` "autoload" parameter (removed in 4.2) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The "autoload" parameter has been ignored since 2.12.0. All bitmaps +are automatically loaded from qcow2 images. 
+ +``cpu-add`` (removed in 5.2) +'''''''''''''''''''''''''''' + +Use ``device_add`` for hotplugging vCPUs instead of ``cpu-add``. See +documentation of ``query-hotpluggable-cpus`` for additional details. + +``change`` (removed in 6.0) +''''''''''''''''''''''''''' + +Use ``blockdev-change-medium`` or ``change-vnc-password`` instead. + +``query-events`` (removed in 6.0) +''''''''''''''''''''''''''''''''' + +The ``query-events`` command has been superseded by the more powerful +and accurate ``query-qmp-schema`` command. + +``migrate_set_cache_size`` and ``query-migrate-cache-size`` (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Use ``migrate_set_parameter`` and ``info migrate_parameters`` instead. + +``migrate_set_downtime`` and ``migrate_set_speed`` (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Use ``migrate_set_parameter`` instead. + +``query-cpus`` (removed in 6.0) +''''''''''''''''''''''''''''''' + +The ``query-cpus`` command is replaced by the ``query-cpus-fast`` command. + +``query-cpus-fast`` ``arch`` output member (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The ``arch`` output member of the ``query-cpus-fast`` command is +replaced by the ``target`` output member. + +chardev client socket with ``wait`` option (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Character devices creating sockets in client mode should not specify +the 'wait' field, which is only applicable to sockets in server mode + +``query-named-block-nodes`` result ``encryption_key_missing`` (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Removed with no replacement. + +``query-block`` result ``inserted.encryption_key_missing`` (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Removed with no replacement. + +``query-named-block-nodes`` and ``query-block`` result dirty-bitmaps[i].status (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The ``status`` field of the ``BlockDirtyInfo`` structure, returned by +these commands is removed. Two new boolean fields, ``recording`` and +``busy`` effectively replace it. + +``query-block`` result field ``dirty-bitmaps`` (removed in 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The ``dirty-bitmaps`` field of the ``BlockInfo`` structure, returned by +the query-block command is itself now removed. The ``dirty-bitmaps`` +field of the ``BlockDeviceInfo`` struct should be used instead, which is the +type of the ``inserted`` field in query-block replies, as well as the +type of array items in query-named-block-nodes. + +``object-add`` option ``props`` (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''' + +Specify the properties for the object as top-level arguments instead. + +Human Monitor Protocol (HMP) commands +------------------------------------- + +``usb_add`` and ``usb_remove`` (removed in 2.12) +'''''''''''''''''''''''''''''''''''''''''''''''' + +Replaced by ``device_add`` and ``device_del`` (use ``device_add help`` for a +list of available devices). + +``host_net_add`` and ``host_net_remove`` (removed in 2.12) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Replaced by ``netdev_add`` and ``netdev_del``. 
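+
+As a sketch of the replacement HMP flow (IDs are illustrative)::
+
+  (qemu) netdev_add user,id=net1
+  (qemu) device_add virtio-net-pci,netdev=net1,id=nic1
+  (qemu) device_del nic1
+  (qemu) netdev_del net1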
+
+The ``hub_id`` parameter of ``hostfwd_add`` / ``hostfwd_remove`` (removed in 5.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``[hub_id name]`` parameter tuple of the 'hostfwd_add' and
+'hostfwd_remove' HMP commands has been replaced by ``netdev_id``.
+
+``cpu-add`` (removed in 5.2)
+''''''''''''''''''''''''''''
+
+Use ``device_add`` for hotplugging vCPUs instead of ``cpu-add``. See
+documentation of ``query-hotpluggable-cpus`` for additional details.
+
+``change vnc TARGET`` (removed in 6.0)
+''''''''''''''''''''''''''''''''''''''
+
+No replacement. The ``change vnc password`` and ``change DEVICE MEDIUM``
+commands are not affected.
+
+``acl_show``, ``acl_reset``, ``acl_policy``, ``acl_add``, ``acl_remove`` (removed in 6.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``acl_show``, ``acl_reset``, ``acl_policy``, ``acl_add``, and
+``acl_remove`` commands were removed with no replacement. Authorization
+for VNC should be performed using the pluggable QAuthZ objects.
+
+``migrate-set-cache-size`` and ``info migrate-cache-size`` (removed in 6.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use ``migrate-set-parameters`` and ``info migrate-parameters`` instead.
+
+``migrate_set_downtime`` and ``migrate_set_speed`` (removed in 6.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use ``migrate-set-parameters`` instead.
+
+``info cpustats`` (removed in 6.1)
+''''''''''''''''''''''''''''''''''
+
+This command had already stopped producing any output. Removed with no
+replacement.
+
+Guest Emulator ISAs
+-------------------
+
+RISC-V ISA privilege specification version 1.09.1 (removed in 5.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The RISC-V ISA privilege specification version 1.09.1 has been removed.
+QEMU supports both the newer version 1.10.0 and the ratified version
+1.11.0; these should be used instead of the 1.09.1 version.
+
+System emulator CPUs
+--------------------
+
+KVM guest support on 32-bit Arm hosts (removed in 5.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The Linux kernel has dropped support for allowing 32-bit Arm systems
+to host KVM guests as of the 5.7 kernel. Accordingly, QEMU's support
+for this configuration has been removed as well. Running 32-bit guests
+on a 64-bit Arm host remains supported.
+
+RISC-V ISA Specific CPUs (removed in 5.1)
+'''''''''''''''''''''''''''''''''''''''''
+
+The RISC-V CPUs with the ISA version in the CPU name have been removed. The
+four CPUs are: ``rv32gcsu-v1.9.1``, ``rv32gcsu-v1.10.0``, ``rv64gcsu-v1.9.1`` and
+``rv64gcsu-v1.10.0``. Instead, the version can be specified via the CPU ``priv_spec``
+option when using the ``rv32`` or ``rv64`` CPUs.
+
+RISC-V no MMU CPUs (removed in 5.1)
+'''''''''''''''''''''''''''''''''''
+
+The RISC-V no MMU CPUs have been removed. The two CPUs ``rv32imacu-nommu`` and
+``rv64imacu-nommu`` can no longer be used. Instead, the MMU status can be specified
+via the CPU ``mmu`` option when using the ``rv32`` or ``rv64`` CPUs.
+
+``compat`` property of server class POWER CPUs (removed in 6.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``max-cpu-compat`` property of the ``pseries`` machine type should be used
+instead.
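+
+For instance, a guest that previously used the CPU ``compat`` property
+(e.g. ``-cpu host,compat=power8``) can be switched to a sketch like::
+
+    -machine pseries,max-cpu-compat=power8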
+ +``moxie`` CPU (removed in 6.1) +'''''''''''''''''''''''''''''' + +Nobody was using this CPU emulation in QEMU, and there were no test images +available to make sure that the code is still working, so it has been removed +without replacement. + +``lm32`` CPUs (removed in 6.1) +'''''''''''''''''''''''''''''' + +The only public user of this architecture was the milkymist project, +which has been dead for years; there was never an upstream Linux +port. Removed without replacement. + +``unicore32`` CPUs (removed in 6.1) +''''''''''''''''''''''''''''''''''' + +Support for this CPU was removed from the upstream Linux kernel, and +there is no available upstream toolchain to build binaries for it. +Removed without replacement. + +System emulator machines +------------------------ + +``s390-virtio`` (removed in 2.6) +'''''''''''''''''''''''''''''''' + +Use the ``s390-ccw-virtio`` machine instead. + +The m68k ``dummy`` machine (removed in 2.9) +''''''''''''''''''''''''''''''''''''''''''' + +Use the ``none`` machine with the ``loader`` device instead. + +``xlnx-ep108`` (removed in 3.0) +''''''''''''''''''''''''''''''' + +The EP108 was an early access development board that is no longer used. +Use the ``xlnx-zcu102`` machine instead. + +``spike_v1.9.1`` and ``spike_v1.10`` (removed in 5.1) +''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The version specific Spike machines have been removed in favour of the +generic ``spike`` machine. If you need to specify an older version of the RISC-V +spec you can use the ``-cpu rv64gcsu,priv_spec=v1.10.0`` command line argument. + +mips ``r4k`` platform (removed in 5.2) +'''''''''''''''''''''''''''''''''''''' + +This machine type was very old and unmaintained. Users should use the ``malta`` +machine type instead. + +mips ``fulong2e`` machine alias (removed in 6.0) +'''''''''''''''''''''''''''''''''''''''''''''''' + +This machine has been renamed ``fuloong2e``. + +``pc-0.10`` up to ``pc-1.3`` (removed in 4.0 up to 6.0) +''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +These machine types were very old and likely could not be used for live +migration from old QEMU versions anymore. Use a newer machine type instead. + +Raspberry Pi ``raspi2`` and ``raspi3`` machines (removed in 6.2) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The Raspberry Pi machines come in various models (A, A+, B, B+). To be able +to distinguish which model QEMU is implementing, the ``raspi2`` and ``raspi3`` +machines have been renamed ``raspi2b`` and ``raspi3b``. + + +linux-user mode CPUs +-------------------- + +``tilegx`` CPUs (removed in 6.0) +'''''''''''''''''''''''''''''''' + +The ``tilegx`` guest CPU support has been removed without replacement. It was +only implemented in linux-user mode, but support for this CPU was removed from +the upstream Linux kernel in 2018, and it has also been dropped from glibc, so +there is no new Linux development taking place with this architecture. For +running the old binaries, you can use older versions of QEMU. + +System emulator devices +----------------------- + +``spapr-pci-vfio-host-bridge`` (removed in 2.12) +''''''''''''''''''''''''''''''''''''''''''''''''' + +The ``spapr-pci-vfio-host-bridge`` device type has been replaced by the +``spapr-pci-host-bridge`` device type. + +``ivshmem`` (removed in 4.0) +'''''''''''''''''''''''''''' + +Replaced by either the ``ivshmem-plain`` or ``ivshmem-doorbell``. 
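+
+For instance, a plain shared-memory setup can be recreated as follows
+(a minimal sketch; the path and size are placeholders)::
+
+    -object memory-backend-file,id=hostmem,mem-path=/dev/shm/ivshmem,share=on,size=4M
+    -device ivshmem-plain,memdev=hostmem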
+ +``ide-drive`` (removed in 6.0) +'''''''''''''''''''''''''''''' + +The 'ide-drive' device has been removed. Users should use 'ide-hd' or +'ide-cd' as appropriate to get an IDE hard disk or CD-ROM as needed. + +``scsi-disk`` (removed in 6.0) +'''''''''''''''''''''''''''''' + +The 'scsi-disk' device has been removed. Users should use 'scsi-hd' or +'scsi-cd' as appropriate to get a SCSI hard disk or CD-ROM as needed. + +Related binaries +---------------- + +``qemu-nbd --partition`` (removed in 5.0) +''''''''''''''''''''''''''''''''''''''''' + +The ``qemu-nbd --partition $digit`` code (also spelled ``-P``) +could only handle MBR partitions, and never correctly handled logical +partitions beyond partition 5. Exporting a partition can still be +done by utilizing the ``--image-opts`` option with a raw blockdev +using the ``offset`` and ``size`` parameters layered on top of +any other existing blockdev. For example, if partition 1 is 100MiB +long starting at 1MiB, the old command:: + + qemu-nbd -t -P 1 -f qcow2 file.qcow2 + +can be rewritten as:: + + qemu-nbd -t --image-opts driver=raw,offset=1M,size=100M,file.driver=qcow2,file.file.driver=file,file.file.filename=file.qcow2 + +``qemu-img convert -n -o`` (removed in 5.1) +''''''''''''''''''''''''''''''''''''''''''' + +All options specified in ``-o`` are image creation options, so +they are now rejected when used with ``-n`` to skip image creation. + + +``qemu-img create -b bad file $size`` (removed in 5.1) +'''''''''''''''''''''''''''''''''''''''''''''''''''''' + +When creating an image with a backing file that could not be opened, +``qemu-img create`` used to issue a warning about the failure but +proceed with the image creation if an explicit size was provided. +However, as the ``-u`` option exists for this purpose, it is safer to +enforce that any failure to open the backing image (including if the +backing file is missing or an incorrect format was specified) is an +error when ``-u`` is not used. + +``qemu-img amend`` to adjust backing file (removed in 6.1) +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The use of ``qemu-img amend`` to modify the name or format of a qcow2 +backing image was never fully documented or tested, and interferes +with other amend operations that need access to the original backing +image (such as deciding whether a v3 zero cluster may be left +unallocated when converting to a v2 image). Any changes to the +backing chain should be performed with ``qemu-img rebase -u`` either +before or after the remaining changes being performed by amend, as +appropriate. + +``qemu-img`` backing file without format (removed in 6.1) +''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The use of ``qemu-img create``, ``qemu-img rebase``, or ``qemu-img +convert`` to create or modify an image that depends on a backing file +now requires that an explicit backing format be provided. This is +for safety: if QEMU probes a different format than what you thought, +the data presented to the guest will be corrupt; similarly, presenting +a raw image to a guest allows a potential security exploit if a future +probe sees a non-raw image based on guest writes. + +To avoid creating unsafe backing chains, you must pass ``-o +backing_fmt=`` (or the shorthand ``-F`` during create) to specify the +intended backing format. You may use ``qemu-img rebase -u`` to +retroactively add a backing format to an existing image. 
However, be
+aware that there are already potential security risks in blindly using
+``qemu-img info`` to probe the format of an untrusted backing image
+when deciding what format to add into an existing image.
+
+Block devices
+-------------
+
+VXHS backend (removed in 5.1)
+'''''''''''''''''''''''''''''
+
+The VXHS code had not compiled since v2.12.0. It was removed in 5.1.
+
+``sheepdog`` driver (removed in 6.0)
+''''''''''''''''''''''''''''''''''''
+
+The corresponding upstream server project is no longer maintained.
+Users are recommended to switch to an alternative distributed block
+device driver such as RBD.
diff --git a/docs/amd-memory-encryption.txt b/docs/amd-memory-encryption.txt new file mode 100644 index 000000000..ffca382b5 --- /dev/null +++ b/docs/amd-memory-encryption.txt @@ -0,0 +1,148 @@
+Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
+
+SEV is an extension to the AMD-V architecture which supports running encrypted
+virtual machines (VMs) under the control of KVM. Encrypted VMs have their pages
+(code and data) secured such that only the guest itself has access to the
+unencrypted version. Each encrypted VM is associated with a unique encryption
+key; if its data is accessed by a different entity using a different key, the
+encrypted guest's data will be incorrectly decrypted, leading to unintelligible
+data.
+
+Key management for this feature is handled by a separate processor known as the
+AMD secure processor (AMD-SP), which is present in AMD SOCs. Firmware running
+inside the AMD-SP provides commands to support a common VM lifecycle. This
+includes commands for launching, snapshotting, migrating and debugging the
+encrypted guest. These SEV commands can be issued via KVM_MEMORY_ENCRYPT_OP
+ioctls.
+
+Secure Encrypted Virtualization - Encrypted State (SEV-ES) builds on the SEV
+support to additionally protect the guest register state. In order to allow a
+hypervisor to perform functions on behalf of a guest, there is architectural
+support for notifying a guest's operating system when certain types of VMEXITs
+are about to occur. This allows the guest to selectively share information with
+the hypervisor to satisfy the requested function.
+
+Launching
+---------
+Boot images (such as the BIOS) must be encrypted before a guest can be booted.
+The MEMORY_ENCRYPT_OP ioctl provides commands to encrypt the images: LAUNCH_START,
+LAUNCH_UPDATE_DATA, LAUNCH_MEASURE and LAUNCH_FINISH. These four commands
+together generate a fresh memory encryption key for the VM, encrypt the boot
+images and provide a measurement that can be used as an attestation of a
+successful launch.
+
+For a SEV-ES guest, the LAUNCH_UPDATE_VMSA command is also used to encrypt the
+guest register state, or VM save area (VMSA), for all of the guest vCPUs.
+
+LAUNCH_START is called first to create a cryptographic launch context within
+the firmware. To create this context, the guest owner must provide a guest
+policy, its public Diffie-Hellman key (PDH) and session parameters. These inputs
+should be treated as a binary blob and must be passed as-is to the SEV firmware.
+
+The guest policy is passed as plaintext. A hypervisor may choose to read it,
+but should not modify it (any modification of the policy bits will result
+in a bad measurement). The guest policy is a 4-byte data structure containing
+several flags that restrict what can be done on a running SEV guest.
+See KM Spec sections 3 and 6.2 for more details.
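+
+As a worked example of the policy encoding (bit meanings per the SEV
+API specification):
+
+    policy = 0x1   ->  bit 0 set: debugging disallowed
+    policy = 0x5   ->  bits 0 and 2 set: debugging disallowed and
+                       SEV-ES required (0x5 == 0x1 | (1 << 2))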
+
+The guest policy can be provided via the 'policy' property (see below):
+
+# ${QEMU} \
+   -object sev-guest,id=sev0,policy=0x1...\
+
+Setting the "SEV-ES required" policy bit (bit 2) will launch the guest as a
+SEV-ES guest (see below):
+
+# ${QEMU} \
+   -object sev-guest,id=sev0,policy=0x5...\
+
+The DH certificate and session parameters provided by the guest owner will be
+used to establish a cryptographic session with the guest owner to negotiate
+keys used for the attestation.
+
+The DH certificate and session blob can be provided via the 'dh-cert-file' and
+'session-file' properties (see below):
+
+# ${QEMU} \
+   -object sev-guest,id=sev0,dh-cert-file=<file1>,session-file=<file2>
+
+LAUNCH_UPDATE_DATA encrypts the memory region using the cryptographic context
+created via the LAUNCH_START command. If required, this command can be called
+multiple times to encrypt different memory regions. The command also calculates
+the measurement of the memory contents as it encrypts.
+
+LAUNCH_UPDATE_VMSA encrypts all the vCPU VMSAs for a SEV-ES guest using the
+cryptographic context created via the LAUNCH_START command. The command also
+calculates the measurement of the VMSAs as it encrypts them.
+
+LAUNCH_MEASURE can be used to retrieve the measurement of encrypted memory and,
+for a SEV-ES guest, encrypted VMSAs. This measurement is a signature of the
+memory contents and, for a SEV-ES guest, the VMSA contents, that can be sent
+to the guest owner as an attestation that the memory and VMSAs were encrypted
+correctly by the firmware. The guest owner may wait to provide the guest
+confidential information until it can verify the attestation measurement.
+Since the guest owner knows the initial contents of the guest at boot, the
+attestation measurement can be verified by comparing it to what the guest owner
+expects.
+
+LAUNCH_FINISH finalizes the guest launch and destroys the cryptographic
+context.
+
+See the SEV KM API Spec [1] 'Launching a guest' usage flow (Appendix A) for the
+complete flow chart.
+
+To launch a SEV guest:
+
+# ${QEMU} \
+    -machine ...,confidential-guest-support=sev0 \
+    -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1
+
+To launch a SEV-ES guest:
+
+# ${QEMU} \
+    -machine ...,confidential-guest-support=sev0 \
+    -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1,policy=0x5
+
+A SEV-ES guest has some restrictions compared to a SEV guest. Because the
+guest register state is encrypted and cannot be updated by the VMM/hypervisor,
+a SEV-ES guest:
+ - Does not support SMM - SMM support requires updating the guest register
+   state.
+ - Does not support reboot - a system reset requires updating the guest register
+   state.
+ - Requires an in-kernel irqchip - the burden is placed on the hypervisor to
+   manage booting APs.
+
+Debugging
+---------
+Since the memory contents of a SEV guest are encrypted, hypervisor access to
+the guest memory will return ciphertext. If the guest policy allows debugging,
+then a hypervisor can use the DEBUG_DECRYPT and DEBUG_ENCRYPT commands to access
+the guest memory region for debug purposes. This is not supported in QEMU yet.
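+
+Whether SEV or SEV-ES is active for a running guest, and which policy it
+was launched with, can be checked via the query-sev QMP command (a
+minimal sketch; the return values shown are illustrative):
+
+    -> { "execute": "query-sev" }
+    <- { "return": { "enabled": true, "api-major" : 0, "api-minor" : 0,
+                     "build-id" : 0, "policy" : 0, "state" : "running",
+                     "handle" : 1 } }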
+
+Snapshot/Restore
+----------------
+TODO
+
+Live Migration
+--------------
+TODO
+
+References
+----------
+
+AMD Memory Encryption whitepaper:
+https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
+
+Secure Encrypted Virtualization Key Management:
+[1] http://developer.amd.com/wordpress/media/2017/11/55766_SEV-KM-API_Specification.pdf
+
+KVM Forum slides:
+http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+https://www.linux-kvm.org/images/9/94/Extending-Secure-Encrypted-Virtualization-with-SEV-ES-Thomas-Lendacky-AMD.pdf
+
+AMD64 Architecture Programmer's Manual:
+   http://support.amd.com/TechDocs/24593.pdf
+   SME is section 7.10
+   SEV is section 15.34
+   SEV-ES is section 15.35
diff --git a/docs/block-replication.txt b/docs/block-replication.txt new file mode 100644 index 000000000..b0f23761c --- /dev/null +++ b/docs/block-replication.txt @@ -0,0 +1,247 @@
+Block replication
+----------------------------------------
+Copyright Fujitsu, Corp. 2016
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied to FT/HA (Fault-tolerance/High Availability)
+scenarios, where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO take
+consecutive checkpoints. The VM state of the Primary and Secondary VM is
+identical right after a VM checkpoint, but diverges as the VM
+executes until the next checkpoint. To support disk content checkpoints,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at the next checkpoint. To reduce network traffic during a
+vmstate checkpoint, the disk modification operations of
+the Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following diagram shows the block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------------->| Disk Buffer |
+                  |                              |             |
+                  |                         (3)  \-------------/
+                  |                    speculative       ^
+                  |                   write through     (2)
+                  |                              |       |
+                  V                              V       |
+           +--------------+                 +----------------+
+           | Primary Disk |                 | Secondary Disk |
+           +--------------+                 +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content (it could be from either "Secondary Write Requests" or
+       previous COW of "Primary Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and will
+       overwrite the existing sector content in the buffer.
+
+== Architecture ==
+Block replication is built from basic blocks that already
+exist in QEMU.
+ + virtio-blk || + ^ || .---------- + | || | Secondary + 1 Quorum || '---------- + / \ || virtio-blk + / \ || ^ + Primary 2 filter | + disk ^ 7 Quorum + | / + 3 NBD -------> 3 NBD / + client || server 2 filter + || ^ ^ +--------. || | | +Primary | || Secondary disk <--------- hidden-disk 5 <--------- active-disk 4 +--------' || | backing ^ backing + || | | + || | | + || '-------------------------' + || blockdev-backup sync=none 6 + +1) The disk on the primary is represented by a block device with two +children, providing replication between a primary disk and the host that +runs the secondary VM. The read pattern (fifo) for quorum can be extended +to make the primary always read from the local disk instead of going through +NBD. + +2) The new block filter (the name is replication) will control the block +replication. + +3) The secondary disk receives writes from the primary VM through QEMU's +embedded NBD server (speculative write-through). + +4) The disk on the secondary is represented by a custom block device +(called active-disk). It should start as an empty disk, and the format +should support bdrv_make_empty() and backing file. + +5) The hidden-disk is created automatically. It buffers the original content +that is modified by the primary VM. It should also start as an empty disk, +and the driver supports bdrv_make_empty() and backing file. + +6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer +any state that would otherwise be lost by the speculative write-through +of the NBD server into the secondary disk. So before block replication, +the primary disk and secondary disk should contain the same data. + +7) The secondary also has a quorum node, so after secondary failover it +can become the new primary and continue replication. + + +== Failure Handling == +There are 7 internal errors when block replication is running: +1. I/O error on primary disk +2. Forwarding primary write requests failed +3. Backup failed +4. I/O error on secondary disk +5. I/O error on active disk +6. Making active disk or hidden disk empty failed +7. Doing failover failed +In case 1 and 5, we just report the error to the disk layer. In case 2, 3, +4 and 6, we just report block replication's error to FT/HA manager (which +decides when to do a new checkpoint, when to do failover). +In case 7, if active commit failed, we use replication failover failed state +in Secondary's write operation (what decides which target to write). + +== New block driver interface == +We add four block driver interfaces to control block replication: +a. replication_start_all() + Start block replication, called in migration/checkpoint thread. + We must call block_replication_start_all() in secondary QEMU before + calling block_replication_start_all() in primary QEMU. The caller + must hold the I/O mutex lock if it is in migration/checkpoint + thread. +b. replication_do_checkpoint_all() + This interface is called after all VM state is transferred to + Secondary QEMU. The Disk buffer will be dropped in this interface. + The caller must hold the I/O mutex lock if it is in migration/checkpoint + thread. +c. replication_get_error_all() + This interface is called to check if error happened in replication. + The caller must hold the I/O mutex lock if it is in migration/checkpoint + thread. +d. replication_stop_all() + It is called on failover. We will flush the Disk buffer into + Secondary Disk and stop block replication. 
The VM should be stopped
+before calling this interface if it is used for anything other than
+failover (e.g. shutting down the guest). The caller must hold the I/O
+mutex lock if it is in the migration/checkpoint thread.
+
+== Usage ==
+Primary:
+  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
+         children.0.file.filename=1.raw,\
+         children.0.driver=raw
+
+  Run the following qmp commands in the primary qemu:
+    { "execute": "human-monitor-command",
+      "arguments": {
+          "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
+       }
+    }
+    { "execute": "x-blockdev-change",
+      "arguments": {
+          "parent": "colo1",
+          "node": "nbd_client1"
+       }
+    }
+  Note:
+  1. There should be only one NBD Client for each primary disk.
+  2. host is the secondary physical machine's hostname or IP address.
+  3. Each disk must have its own export name.
+  4. It is all a single argument to -drive and you should ignore the
+     leading whitespace.
+  5. These qmp commands must be run after running the qmp commands in
+     the secondary qemu.
+  6. After primary failover we need to remove children.1 (the replication
+     driver).
+
+Secondary:
+  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
+  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=childs1,\
+         file.file.filename=active_disk.qcow2,\
+         file.driver=qcow2,\
+         file.backing.file.filename=hidden_disk.qcow2,\
+         file.backing.driver=qcow2,\
+         file.backing.backing=colo1
+  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
+         vote-threshold=1,children.0=childs1
+
+  Then run the following qmp commands in the secondary qemu:
+    { "execute": "nbd-server-start",
+      "arguments": {
+          "addr": {
+              "type": "inet",
+              "data": {
+                  "host": "xxx",
+                  "port": "xxx"
+              }
+          }
+       }
+    }
+    { "execute": "nbd-server-add",
+      "arguments": {
+          "device": "colo1",
+          "writable": true
+       }
+    }
+
+  Note:
+  1. The export name in the secondary QEMU command line is the secondary
+     disk's id.
+  2. The export name for the same disk must be the same.
+  3. The qmp commands nbd-server-start and nbd-server-add must be run
+     before running the qmp command migrate on the primary QEMU.
+  4. The active disk, hidden disk and NBD target must all have the same
+     length.
+  5. It is better to put the active disk and hidden disk on a ramdisk.
+  6. It is all a single argument to -drive, and you should ignore
+     the leading whitespace.
+
+After Failover:
+Primary:
+  The secondary host is down, so we should run the following qmp commands
+  to remove the nbd child from the quorum:
+  { "execute": "x-blockdev-change",
+    "arguments": {
+        "parent": "colo1",
+        "child": "children.1"
+    }
+  }
+  { "execute": "human-monitor-command",
+    "arguments": {
+        "command-line": "drive_del xxxx"
+    }
+  }
+  Note: there is currently no qmp command to remove the blockdev.
+
+Secondary:
+  The primary host is down, so we should run the following qmp command:
+  { "execute": "nbd-server-stop" }
+
+Promote Secondary to Primary:
+  see COLO-FT.txt
+
+TODO:
+1. Shared disk
diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt new file mode 100644 index 000000000..e6677bddd --- /dev/null +++ b/docs/bypass-iommu.txt @@ -0,0 +1,89 @@
+BYPASS IOMMU PROPERTY
+=====================
+
+Description
+===========
+Traditionally, there is a global switch to enable/disable the vIOMMU:
+either all devices in the system go through the vIOMMU or none do,
+which is not flexible. The bypass iommu property allows devices that
+go through the vIOMMU and devices that bypass it to coexist.
This is useful,
+for example, to passthrough devices in no-iommu mode while other
+devices go through the vIOMMU in the same virtual machine.
+
+PCI host bridges have a bypass_iommu property. This property is used to
+determine whether the devices attached to the PCI host bridge will bypass
+the virtual iommu. The bypass_iommu property is valid only when there is a
+virtual iommu in the system; it is implemented to allow some devices to
+bypass the vIOMMU. When the bypass_iommu property is not set for a host
+bridge, the attached devices will go through the vIOMMU by default.
+
+Usage
+=====
+The bypass iommu feature supports PXB host bridges and the default main
+host bridge: we add a bypass_iommu property for PXB and a
+default_bus_bypass_iommu property for the machine. Note that
+default_bus_bypass_iommu is available only with the 'q35' machine type
+on the x86 architecture and the 'virt' machine type on AArch64. Other
+machine types do not support bypassing the iommu for the default root
+bus.
+
+1. The following are the bypass iommu options:
+   (1) PCI expander bridge
+       qemu -device pxb-pcie,bus_nr=0x10,addr=0x1,bypass_iommu=true
+   (2) Arm default host bridge
+       qemu -machine virt,iommu=smmuv3,default_bus_bypass_iommu=true
+   (3) X86 default root bus bypass iommu:
+       qemu -machine q35,default_bus_bypass_iommu=true
+
+2. Here is the detailed qemu command line for the 'virt' machine with PXB on
+AArch64:
+
+qemu-system-aarch64 \
+ -machine virt,kernel_irqchip=on,iommu=smmuv3,default_bus_bypass_iommu=true \
+ -device pxb-pcie,bus_nr=0x10,id=pci.10,bus=pcie.0,addr=0x3.0x1 \
+ -device pxb-pcie,bus_nr=0x20,id=pci.20,bus=pcie.0,addr=0x3.0x2,bypass_iommu=true \
+
+This gives us:
+   - a default host bridge which bypasses SMMUv3
+   - a pxb host bridge which goes through SMMUv3
+   - a pxb host bridge which bypasses SMMUv3
+
+3. Here is the detailed qemu command line for the 'q35' machine with PXB on
+the x86 architecture:
+
+qemu-system-x86_64 \
+ -machine q35,accel=kvm,default_bus_bypass_iommu=true \
+ -device pxb-pcie,bus_nr=0x10,id=pci.10,bus=pcie.0,addr=0x3 \
+ -device pxb-pcie,bus_nr=0x20,id=pci.20,bus=pcie.0,addr=0x4,bypass_iommu=true \
+ -device intel-iommu \
+
+This gives us:
+   - a default host bridge which bypasses the iommu
+   - a pxb host bridge which goes through the iommu
+   - a pxb host bridge which bypasses the iommu
+
+Limitations
+===========
+There is a potential security risk when devices bypass the iommu: without
+iommu isolation, a device can send malicious DMA requests to the virtual
+machine. It is therefore advisable to only bypass the iommu for trusted
+devices.
+
+Implementation
+==============
+The bypass iommu feature includes:
+   - Address space
+     Check the bypass_iommu property of the PCI host and do not use the
+     iommu address space for devices that bypass the iommu.
+   - Arm SMMUv3 support
+     We traverse all PCI root buses and get the bus number ranges, then
+     build explicit RID mappings for devices that do not bypass the iommu.
+   - X86 IOMMU support
+     To support the Intel iommu, we traverse all PCI host bridges and get
+     information about the devices that do not bypass the iommu, then fill
+     the DMAR drhd struct with explicit device scope info. To support the
+     AMD iommu, a bypass_iommu check is added when traversing the PCI host
+     bridge.
+   - Machine and PXB options
+     We add a bypass iommu machine option for the default root bus, and an
+     option for PXB as well. Note that the default value of bypass_iommu is
+     false, so devices will by default go through the iommu if one exists.
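+
+A quick way to check the result from inside the guest is to look at the
+kernel's IOMMU groups (a hypothetical guest-side session; the BDF and
+paths depend on the guest):
+
+   ls /sys/kernel/iommu_groups/
+   readlink /sys/bus/pci/devices/0000:00:03.0/iommu_group
+
+Devices that bypass the vIOMMU will not be assigned to any IOMMU group.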
+
diff --git a/docs/can.txt b/docs/can.txt new file mode 100644 index 000000000..0d310237d --- /dev/null +++ b/docs/can.txt @@ -0,0 +1,198 @@
+QEMU CAN bus emulation support
+==============================
+
+The CAN bus emulation provides a mechanism to connect multiple
+emulated CAN controller chips together over one or more CAN busses
+(selected via the controller device's "canbus" parameter). The
+individual busses can be connected to the host system's CAN API (at
+this time, only Linux SocketCAN is supported).
+
+The concept of busses is generic and different CAN controllers
+can be implemented.
+
+The initial submission implemented the SJA1000 controller, which
+is common and well supported by drivers for most operating systems.
+
+The PCI addon card hardware has been selected as the first CAN
+interface to implement because such a device can be easily connected
+to systems with different CPU architectures (x86, PowerPC, Arm, etc.).
+
+In 2020, the CTU CAN FD controller model was added as part
+of the bachelor thesis of Jan Charvat. This controller is a complete
+open-source design/hardware solution. The core designer
+of the project is Ondrej Ille; financial support has been
+provided by CTU and other companies, including Volkswagen subsidiaries.
+
+The project was initially started as an RTEMS GSoC 2013 slot by Jin Yang
+under our mentoring. The initial idea was to provide a generic CAN
+subsystem for RTEMS, but the lack of a common environment for code and
+RTEMS testing led to a change of goal: the GSoC slot was donated to work
+on CAN hardware emulation in QEMU, to provide a complete emulated
+environment for testing.
+
+Examples how to use CAN emulation for SJA1000 based boards
+==========================================================
+
+When QEMU is compiled with CAN PCI support, one of the following
+CAN boards can be selected:
+
+ (1) CAN bus Kvaser PCI CAN-S (single SJA1000 channel) board. QEMU startup options:
+    -object can-bus,id=canbus0
+    -device kvaser_pci,canbus=canbus0
+    Add a "can-host-socketcan" object to connect the device to the host system CAN bus:
+    -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0
+
+ (2) CAN bus PCM-3680I PCI (dual SJA1000 channel) emulation:
+    -object can-bus,id=canbus0
+    -device pcm3680_pci,canbus0=canbus0,canbus1=canbus0
+
+    another example:
+    -object can-bus,id=canbus0
+    -object can-bus,id=canbus1
+    -device pcm3680_pci,canbus0=canbus0,canbus1=canbus1
+
+ (3) CAN bus MIOe-3680 PCI (dual SJA1000 channel) emulation:
+    -device mioe3680_pci,canbus0=canbus0
+
+
+The ''kvaser_pci'' board/device model is compatible with, and has been tested
+with, the ''kvaser_pci'' driver included in the mainline Linux kernel.
+The tested setup used the Linux 4.9 kernel on both the host and guest side.
+Example for qemu-system-x86_64:
+
+  qemu-system-x86_64 -accel kvm -kernel /boot/vmlinuz-4.9.0-4-amd64 \
+      -initrd ramdisk.cpio \
+      -virtfs local,path=shareddir,security_model=none,mount_tag=shareddir \
+      -object can-bus,id=canbus0 \
+      -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0 \
+      -device kvaser_pci,canbus=canbus0 \
+      -nographic -append "console=ttyS0"
+
+Example for qemu-system-arm:
+
+  qemu-system-arm -cpu arm1176 -m 256 -M versatilepb \
+      -kernel kernel-qemu-arm1176-versatilepb \
+      -hda rpi-wheezy-overlay \
+      -append "console=ttyAMA0 root=/dev/sda2 ro init=/sbin/init-overlay" \
+      -nographic \
+      -virtfs local,path=shareddir,security_model=none,mount_tag=shareddir \
+      -object can-bus,id=canbus0 \
+      -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0 \
+      -device kvaser_pci,canbus=canbus0,host=can0
+
+The CAN interface of the host system has to be configured for the proper
+bitrate and brought up. The configuration is not propagated from emulated
+devices through the bus to the physical host device. Example configuration
+for 1 Mbit/s:
+
+  ip link set can0 type can bitrate 1000000
+  ip link set can0 up
+
+A virtual (host-local only) CAN interface can be used on the host
+side instead of a physical interface:
+
+  ip link add dev can0 type vcan
+
+The CAN interface on the host side can be used to analyze CAN
+traffic with the "candump" command, which is included in "can-utils":
+
+  candump can0
+
+CTU CAN FD support examples
+===========================
+
+This open-source core provides CAN FD support. CAN FD frames are
+delivered to the host system as well when the SocketCAN interface is
+found to be CAN FD capable.
+
+The PCIe board emulation is provided for now (the device identifier is
+ctucan_pci). The default build defines two CTU CAN FD cores
+on the board.
+
+The example below connects the canbus0-bus (virtual wire) to the host
+Linux system (using SocketCAN) and to both CTU CAN FD cores emulated
+on the corresponding PCI card. It expects the host system CAN bus
+to be set up according to the previous SJA1000 section.
+
+  qemu-system-x86_64 -enable-kvm -kernel /boot/vmlinuz-4.19.52+ \
+      -initrd ramdisk.cpio \
+      -virtfs local,path=shareddir,security_model=none,mount_tag=shareddir \
+      -vga cirrus \
+      -append "console=ttyS0" \
+      -object can-bus,id=canbus0-bus \
+      -object can-host-socketcan,if=can0,canbus=canbus0-bus,id=canbus0-socketcan \
+      -device ctucan_pci,canbus0=canbus0-bus,canbus1=canbus0-bus \
+      -nographic
+
+Setup of the CTU CAN FD controller in a guest Linux system:
+
+  insmod ctucanfd.ko || modprobe ctucanfd
+  insmod ctucanfd_pci.ko || modprobe ctucanfd_pci
+
+  for ifc in /sys/class/net/can* ; do
+     if [ -e $ifc/device/vendor ] ; then
+       if ! grep -q 0x1760 $ifc/device/vendor ; then
+         continue;
+       fi
+     else
+       continue;
+     fi
+     if [ -e $ifc/device/device ] ; then
+       if ! grep -q 0xff00 $ifc/device/device ; then
+         continue;
+       fi
+     else
+       continue;
+     fi
+     ifc=$(basename $ifc)
+     /bin/ip link set $ifc type can bitrate 1000000 dbitrate 10000000 fd on
+     /bin/ip link set $ifc up
+  done
+
+The test can then be run, for example, with
+
+  candump can1
+
+in the guest system and the following commands in the host system. For
+basic CAN:
+
+  cangen can0
+
+for CAN FD without bitrate switch:
+
+  cangen can0 -f
+
+and with bitrate switch:
+
+  cangen can0 -b
+
+The test can also be run vice versa, generating messages in the guest
+system and capturing them in the host system, and in many more
+combinations.
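+
+Individual frames can also be injected by hand with the "cansend" tool
+from can-utils (an illustrative frame with arbitrary ID and payload):
+
+  cansend can0 123#DEADBEEF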
+ +Links to other resources +======================== + + (1) CAN related projects at Czech Technical University, Faculty of Electrical Engineering + http://canbus.pages.fel.cvut.cz/ + (2) Repository with development can-pci branch at Czech Technical University + https://gitlab.fel.cvut.cz/canbus/qemu-canbus + (3) RTEMS page describing project + https://devel.rtems.org/wiki/Developer/Simulators/QEMU/CANEmulation + (4) RTLWS 2015 article about the project and its use with CANopen emulation + http://cmp.felk.cvut.cz/~pisa/can/doc/rtlws-17-pisa-qemu-can.pdf + (5) GNU/Linux, CAN and CANopen in Real-time Control Applications + Slides from LinuxDays 2017 (include updated RTLWS 2015 content) + https://www.linuxdays.cz/2017/video/Pavel_Pisa-CAN_canopen.pdf + (6) Linux SocketCAN utilities + https://github.com/linux-can/can-utils/ + (7) CTU CAN FD project including core VHDL design, Linux driver, + test utilities etc. + https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core + (8) CTU CAN FD Core Datasheet Documentation + http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/Progdokum.pdf + (9) CTU CAN FD Core System Architecture Documentation + http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/ctu_can_fd_architecture.pdf + (10) CTU CAN FD Driver Documentation + http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/driver_doc/ctucanfd-driver.html + (11) Integration with PCIe interfacing for Intel/Altera Cyclone IV based board + https://gitlab.fel.cvut.cz/canbus/pcie-ctu_can_fd diff --git a/docs/ccid.txt b/docs/ccid.txt new file mode 100644 index 000000000..2b85b1bd4 --- /dev/null +++ b/docs/ccid.txt @@ -0,0 +1,182 @@ +QEMU CCID Device Documentation. + +Contents +1. USB CCID device +2. Building +3. Using ccid-card-emulated with hardware +4. Using ccid-card-emulated with certificates +5. Using ccid-card-passthru with client side hardware +6. Using ccid-card-passthru with client side certificates +7. Passthrough protocol scenario +8. libcacard + +1. USB CCID device + +The USB CCID device is a USB device implementing the CCID specification, which +lets one connect smart card readers that implement the same spec. For more +information see the specification: + + Universal Serial Bus + Device Class: Smart Card + CCID + Specification for + Integrated Circuit(s) Cards Interface Devices + Revision 1.1 + April 22rd, 2005 + +Smartcards are used for authentication, single sign on, decryption in +public/private schemes and digital signatures. A smartcard reader on the client +cannot be used on a guest with simple usb passthrough since it will then not be +available on the client, possibly locking the computer when it is "removed". On +the other hand this device can let you use the smartcard on both the client and +the guest machine. It is also possible to have a completely virtual smart card +reader and smart card (i.e. not backed by a physical device) using this device. + +2. Building + +The cryptographic functions and access to the physical card is done via the +libcacard library, whose development package must be installed prior to +building QEMU: + +In redhat/fedora: + yum install libcacard-devel +In ubuntu: + apt-get install libcacard-dev + +Configuring and building: + ./configure --enable-smartcard && make + + +3. Using ccid-card-emulated with hardware + +Assuming you have a working smartcard on the host with the current +user, using libcacard, QEMU acts as another client using ccid-card-emulated: + + qemu -usb -device usb-ccid -device ccid-card-emulated + + +4. 
Using ccid-card-emulated with certificates stored in files + +You must create the CA and card certificates. This is a one time process. +We use NSS certificates: + + mkdir fake-smartcard + cd fake-smartcard + certutil -N -d sql:$PWD + certutil -S -d sql:$PWD -s "CN=Fake Smart Card CA" -x -t TC,TC,TC -n fake-smartcard-ca + certutil -S -d sql:$PWD -t ,, -s "CN=John Doe" -n id-cert -c fake-smartcard-ca + certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (signing)" --nsCertType smime -n signing-cert -c fake-smartcard-ca + certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (encryption)" --nsCertType sslClient -n encryption-cert -c fake-smartcard-ca + +Note: you must have exactly three certificates. + +You can use the emulated card type with the certificates backend: + + qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,db=sql:$PWD,cert1=id-cert,cert2=signing-cert,cert3=encryption-cert + +To use the certificates in the guest, export the CA certificate: + + certutil -L -r -d sql:$PWD -o fake-smartcard-ca.cer -n fake-smartcard-ca + +and import it in the guest: + + certutil -A -d /etc/pki/nssdb -i fake-smartcard-ca.cer -t TC,TC,TC -n fake-smartcard-ca + +In a Linux guest you can then use the CoolKey PKCS #11 module to access +the card: + + certutil -d /etc/pki/nssdb -L -h all + +It will prompt you for the PIN (which is the password you assigned to the +certificate database early on), and then show you all three certificates +together with the manually imported CA cert: + + Certificate Nickname Trust Attributes + fake-smartcard-ca CT,C,C + John Doe:CAC ID Certificate u,u,u + John Doe:CAC Email Signature Certificate u,u,u + John Doe:CAC Email Encryption Certificate u,u,u + +If this does not happen, CoolKey is not installed or not registered with +NSS. Registration can be done from Firefox or the command line: + + modutil -dbdir /etc/pki/nssdb -add "CAC Module" -libfile /usr/lib64/pkcs11/libcoolkeypk11.so + modutil -dbdir /etc/pki/nssdb -list + + +5. Using ccid-card-passthru with client side hardware + +on the host specify the ccid-card-passthru device with a suitable chardev: + + qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \ + -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid + +on the client run vscclient, built when you built QEMU: + + vscclient <qemu-host> 2001 + + +6. Using ccid-card-passthru with client side certificates + +This case is not particularly useful, but you can use it to debug +your setup if #4 works but #5 does not. + +Follow instructions as per #4, except run QEMU and vscclient as follows: +Run qemu as per #5, and run vscclient from the "fake-smartcard" +directory as follows: + + qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \ + -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid + vscclient -e "db=\"sql:$PWD\" use_hw=no soft=(,Test,CAC,,id-cert,signing-cert,encryption-cert)" <qemu-host> 2001 + + +7. Passthrough protocol scenario + +This is a typical interchange of messages when using the passthru card device. +usb-ccid is a usb device. It defaults to an unattached usb device on startup. +usb-ccid expects a chardev and expects the protocol defined in +cac_card/vscard_common.h to be passed over that. 
+The usb-ccid device can be in one of three modes:
+ * detached
+ * attached with no card
+ * attached with card
+
+A typical interchange is: (the arrow shows who started each exchange; it can
+be client-originated or guest-originated)
+
+client event      |    vscclient     | passthru | usb-ccid |  guest event
+----------------------------------------------------------------------------
+                  | VSC_Init         |          |          |
+                  | VSC_ReaderAdd    |          |  attach  |
+                  |                  |          |          |  sees new usb device.
+card inserted ->  |                  |          |          |
+                  | VSC_ATR          |  insert  |  insert  |  see new card
+                  |                  |          |          |
+                  | VSC_APDU         | VSC_APDU |          |  <- guest sends APDU
+client<->physical |                  |          |          |
+card APDU exchange|                  |          |          |
+client response ->| VSC_APDU         | VSC_APDU |          |  receive APDU response
+                                    ...
+                    [APDU<->APDU repeats several times]
+                                    ...
+card removed ->   |                  |          |          |
+                  | VSC_CardRemove   |  remove  |  remove  |  card removed
+                                    ...
+          [(card insert, apdu's, card remove) repeat]
+                                    ...
+kill/quit         |                  |          |          |
+  vscclient       |                  |          |          |
+                  | VSC_ReaderRemove |          |  detach  |
+                  |                  |          |          |  usb device removed.
+
+
+8. libcacard
+
+Both ccid-card-emulated and vscclient use libcacard as the card emulator.
+libcacard implements a completely virtual CAC (DoD standard for smart
+cards) compliant card and uses NSS to retrieve certificates and do
+any encryption. The backend can then be a real reader and card, or
+certificates stored in files.
+
+For documentation of the library see docs/libcacard.txt.
+
diff --git a/docs/colo-proxy.txt b/docs/colo-proxy.txt new file mode 100644 index 000000000..1fc38aed1 --- /dev/null +++ b/docs/colo-proxy.txt @@ -0,0 +1,217 @@
+COLO-proxy
+----------
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+Copyright (c) 2016 Fujitsu, Corp.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+This document gives an overview of COLO proxy's design.
+
+== Background ==
+COLO-proxy is a part of the COLO project. It is used
+to compare network packets to help COLO decide
+whether to do a checkpoint. With COLO-proxy's help,
+COLO greatly improves performance.
+
+The filter-redirector, filter-mirror, colo-compare
+and filter-rewriter compose the COLO-proxy.
+
+== Architecture ==
+
+COLO-Proxy is based on the qemu netfilter and is implemented as
+netfilter plugins (except colo-compare). It keeps the Secondary VM
+connected normally to the client and compares the packets sent by the
+PVM with those sent by the SVM.
+If the packets differ, it notifies the COLO-frame to do a checkpoint
+and sends all queued primary packets. Otherwise it just sends the
+queued primary packets and drops the queued secondary packets.
+ +Below is a COLO proxy ascii figure: + + Primary qemu Secondary qemu ++--------------------------------------------------------------+ +----------------------------------------------------------------+ +| +----------------------------------------------------------+ | | +-----------------------------------------------------------+ | +| | | | | | | | +| | guest | | | | guest | | +| | | | | | | | +| +-------^--------------------------+-----------------------+ | | +---------------------+--------+----------------------------+ | +| | | | | ^ | | +| | | | | | | | +| | +------------------------------------------------------+ | | | | +|netfilter| | | | | | netfilter | | | +| +----------+ +----------------------------+ | | | +-----------------------------------------------------------+ | +| | | | | | out | | | | | | filter execute order | | +| | | | +-----------------------------+ | | | | | | +-------------------> | | +| | | | | | | | | | | | | | TCP | | +| | +-----+--+-+ +-----v----+ +-----v----+ |pri +----+----+sec| | | | +------------+ +---+----+---v+rewriter++ +------------+ | | +| | | | | | | | |in | |in | | | | | | | | | | | | | +| | | filter | | filter | | filter +------> colo <------+ +--------> filter +--> adjust | adjust +--> filter | | | +| | | mirror | |redirector| |redirector| | | compare | | | | | | redirector | | ack | seq | | redirector | | | +| | | | | | | | | | | | | | | | | | | | | | | | +| | +----^-----+ +----+-----+ +----------+ | +---------+ | | | | +------------+ +--------+--------------+ +---+--------+ | | +| | | tx | rx rx | | | | | tx all | rx | | +| | | | | | | | +-----------------------------------------------------------+ | +| | | +--------------+ | | | | | | +| | | filter execute order | | | | | | | +| | | +----------------> | | | +--------------------------------------------------------+ | +| +-----------------------------------------+ | | | +| | | | | | ++--------------------------------------------------------------+ +----------------------------------------------------------------+ + |guest receive | guest send + | | ++--------+----------------------------v------------------------+ +| | NOTE: filter direction is rx/tx/all +| tap | rx:receive packets sent to the netdev +| | tx:receive packets sent by the netdev ++--------------------------------------------------------------+ + +1.Guest receive packet route: + +Primary: + +Tap --> Mirror Client Filter +Mirror client will send packet to guest,at the +same time, copy and forward packet to secondary +mirror server. + +Secondary: + +Mirror Server Filter --> TCP Rewriter +If receive packet is TCP packet,we will adjust ack +and update TCP checksum, then send to secondary +guest. Otherwise directly send to guest. + +2.Guest send packet route: + +Primary: + +Guest --> Redirect Server Filter +Redirect server filter receive primary guest packet +but do nothing, just pass to next filter. + +Redirect Server Filter --> COLO-Compare +COLO-compare receive primary guest packet then +waiting secondary redirect packet to compare it. +If packet same,send queued primary packet and clear +queued secondary packet, Otherwise send primary packet +and do checkpoint. + +COLO-Compare --> Another Redirector Filter +The redirector get packet from colo-compare by use +chardev socket. + +Redirector Filter --> Tap +Send the packet. + +Secondary: + +Guest --> TCP Rewriter Filter +If the packet is TCP packet,we will adjust seq +and update TCP checksum. Then send it to +redirect client filter. Otherwise directly send to +redirect client filter. 
+
+Redirect Client Filter --> Redirect Server Filter
+Forwards the packet to the primary.
+
+== Components introduction ==
+
+Filter-mirror is a netfilter plugin.
+It gives qemu the ability to mirror
+packets to a chardev.
+
+Filter-redirector is a netfilter plugin.
+It gives qemu the ability to redirect network packets.
+The redirector can redirect the filter's packets to outdev,
+and redirect indev's packets to the filter.
+
+                       filter
+                         +
+         redirector      |
+         +---------------|--------------+
+         |               |              |
+         |               |              |
+         |               |              |
+ indev +-----------------+  +--------------> outdev
+         |                              |
+         |                              |
+         |                              |
+         +------------------------------+
+                         |
+                         v
+                       filter
+
+COLO-compare does the packet comparing job.
+Packets coming from the primary char indev will be sent to outdev.
+Packets coming from the secondary char dev will be dropped after comparing.
+COLO-compare needs two input chardevs and one output chardev:
+primary_in=chardev1-id (source: packets sent by the primary)
+secondary_in=chardev2-id (source: packets sent by the secondary)
+outdev=chardev3-id
+
+Filter-rewriter rewrites some of the secondary packets so that the
+secondary guest's tcp connections are established successfully.
+This module rewrites the ack of tcp packets going from the primary
+to the secondary, and rewrites the seq of tcp packets going from the
+secondary to the primary.
+
+== Usage ==
+
+Here is an example using demonstration IP and port addresses to more
+clearly describe the usage.
+
+Primary(ip:3.3.3.3):
+-netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
+-device e1000,id=e0,netdev=hn0,mac=52:a4:00:12:78:66
+-chardev socket,id=mirror0,host=3.3.3.3,port=9003,server=on,wait=off
+-chardev socket,id=compare1,host=3.3.3.3,port=9004,server=on,wait=off
+-chardev socket,id=compare0,host=3.3.3.3,port=9001,server=on,wait=off
+-chardev socket,id=compare0-0,host=3.3.3.3,port=9001
+-chardev socket,id=compare_out,host=3.3.3.3,port=9005,server=on,wait=off
+-chardev socket,id=compare_out0,host=3.3.3.3,port=9005
+-object iothread,id=iothread1
+-object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0
+-object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out
+-object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0
+-object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0,iothread=iothread1
+
+Secondary(ip:3.3.3.8):
+-netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
+-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
+-chardev socket,id=red0,host=3.3.3.3,port=9003
+-chardev socket,id=red1,host=3.3.3.3,port=9004
+-object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0
+-object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1
+-object filter-rewriter,id=f3,netdev=hn0,queue=all
+
+If you want to use virtio-net-pci or another driver with vnet_header:
+
+Primary(ip:3.3.3.3):
+-netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
+-device e1000,id=e0,netdev=hn0,mac=52:a4:00:12:78:66
+-chardev socket,id=mirror0,host=3.3.3.3,port=9003,server=on,wait=off
+-chardev socket,id=compare1,host=3.3.3.3,port=9004,server=on,wait=off
+-chardev socket,id=compare0,host=3.3.3.3,port=9001,server=on,wait=off
+-chardev socket,id=compare0-0,host=3.3.3.3,port=9001
+-chardev socket,id=compare_out,host=3.3.3.3,port=9005,server=on,wait=off
+-chardev socket,id=compare_out0,host=3.3.3.3,port=9005
+-object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0,vnet_hdr_support
+-object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out,vnet_hdr_support
+-object
 filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0,vnet_hdr_support
+-object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0,vnet_hdr_support
+
+Secondary(ip:3.3.3.8):
+-netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
+-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
+-chardev socket,id=red0,host=3.3.3.3,port=9003
+-chardev socket,id=red1,host=3.3.3.3,port=9004
+-object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0,vnet_hdr_support
+-object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1,vnet_hdr_support
+-object filter-rewriter,id=f3,netdev=hn0,queue=all,vnet_hdr_support
+
+Note:
+ a. COLO-proxy must work with COLO-frame and Block-replication.
+ b. The primary COLO must be started first, because COLO-proxy needs the
+    chardev socket server to be running before the secondary is started.
+ c. Filter-rewriter only rewrites tcp packets.
diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 000000000..763e7d243 --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,313 @@
+# -*- coding: utf-8 -*-
+#
+# QEMU documentation build configuration file, created by
+# sphinx-quickstart on Thu Jan 31 16:40:14 2019.
+#
+# This config file can be used in one of two ways:
+# (1) as a common config file which is included by the conf.py
+# for each of QEMU's manuals: in this case sphinx-build is run multiple
+# times, once per subdirectory.
+# (2) as a top level conf file which will result in building all
+# the manuals into a single document: in this case sphinx-build is
+# run once, on the top-level docs directory.
+#
+# QEMU's makefiles take option (1), which allows us to install
+# only the ones the user cares about (in particular we don't want
+# to ship the 'devel' manual to end-users).
+# Third-party sites such as readthedocs.org will take option (2).
+#
+#
+# This file is execfile()d with the current directory set to its
+# containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+import os
+import sys
+import sphinx
+from distutils.version import LooseVersion
+from sphinx.errors import ConfigError
+
+# Make Sphinx fail cleanly if using an old Python, rather than obscurely
+# failing because some code in one of our extensions doesn't work there.
+# In newer versions of Sphinx this will display nicely; in older versions
+# Sphinx will also produce a Python backtrace but at least the information
+# gets printed...
+if sys.version_info < (3,6):
+    raise ConfigError(
+        "QEMU requires a Sphinx that uses Python 3.6 or better\n")
+
+# The per-manual conf.py will set qemu_docdir for a single-manual build;
+# otherwise set it here if this is an entire-manual-set build.
+# This is always the absolute path of the docs/ directory in the source tree.
+try:
+    qemu_docdir
+except NameError:
+    qemu_docdir = os.path.abspath(".")
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use an absolute path starting from qemu_docdir.
+#
+# Our extensions are in docs/sphinx; the qapidoc extension requires
+# the QAPI modules from scripts/.
+sys.path.insert(0, os.path.join(qemu_docdir, "sphinx")) +sys.path.insert(0, os.path.join(qemu_docdir, "../scripts")) + + +# -- General configuration ------------------------------------------------ + +# If your documentation needs a minimal Sphinx version, state it here. +# +# Sphinx 1.5 and earlier can't build our docs because they are too +# picky about the syntax of the argument to the option:: directive +# (see Sphinx bugs #646, #3366). +needs_sphinx = '1.6' + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = ['kerneldoc', 'qmp_lexer', 'hxtool', 'depfile', 'qapidoc'] + +# Add any paths that contain templates here, relative to this directory. +templates_path = [os.path.join(qemu_docdir, '_templates')] + +# The suffix(es) of source filenames. +# You can specify multiple suffix as a list of string: +# +# source_suffix = ['.rst', '.md'] +source_suffix = '.rst' + +# The master toctree document. +master_doc = 'index' + +# Interpret `single-backticks` to be a cross-reference to any kind of +# referenceable object. Unresolvable or ambiguous references will emit a +# warning at build time. +default_role = 'any' + +# General information about the project. +project = u'QEMU' +copyright = u'2021, The QEMU Project Developers' +author = u'The QEMU Project Developers' + +# The version info for the project you're documenting, acts as replacement for +# |version| and |release|, also used in various other places throughout the +# built documents. + +# Extract this information from the VERSION file, for the benefit of +# standalone Sphinx runs as used by readthedocs.org. Builds run from +# the Makefile will pass version and release on the sphinx-build +# command line, which override this. +try: + extracted_version = None + with open(os.path.join(qemu_docdir, '../VERSION')) as f: + extracted_version = f.readline().strip() +except: + pass +finally: + if extracted_version: + version = release = extracted_version + else: + version = release = "unknown version" + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. +# Usually you set "language" from the command line for these cases. +language = None + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This patterns also effect to html_static_path and html_extra_path +exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] + +# The name of the Pygments (syntax highlighting) style to use. +pygments_style = 'sphinx' + +# If true, `todo` and `todoList` produce output, else they produce nothing. +todo_include_todos = False + +# Sphinx defaults to warning about use of :option: for options not defined +# with "option::" in the document being processed. Turn that off. +suppress_warnings = ["ref.option"] + +# The rst_epilog fragment is effectively included in every rST file. +# We use it to define substitutions based on build config that +# can then be used in the documentation. The fallback if the +# environment variable is not set is for the benefit of readthedocs +# style document building; our Makefile always sets the variable. +confdir = os.getenv('CONFDIR', "/etc/qemu") +rst_epilog = ".. 
|CONFDIR| replace:: ``" + confdir + "``\n" +# We slurp in the defs.rst.inc and literally include it into rst_epilog, +# because Sphinx's include:: directive doesn't work with absolute paths +# and there isn't any one single relative path that will work for all +# documents and for both via-make and direct sphinx-build invocation. +with open(os.path.join(qemu_docdir, 'defs.rst.inc')) as f: + rst_epilog += f.read() + +# -- Options for HTML output ---------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +try: + import sphinx_rtd_theme +except ImportError: + raise ConfigError( + 'The Sphinx \'sphinx_rtd_theme\' HTML theme was not found.\n' + ) + +html_theme = 'sphinx_rtd_theme' + +# Theme options are theme-specific and customize the look and feel of a theme +# further. For a list of options available for each theme, see the +# documentation. +if LooseVersion(sphinx_rtd_theme.__version__) >= LooseVersion("0.4.3"): + html_theme_options = { + "style_nav_header_background": "#802400", + "navigation_with_keys": True, + } + +html_logo = os.path.join(qemu_docdir, "../ui/icons/qemu_128x128.png") + +html_favicon = os.path.join(qemu_docdir, "../ui/icons/qemu_32x32.png") + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +html_static_path = [os.path.join(qemu_docdir, "sphinx-static")] + +html_css_files = [ + 'theme_overrides.css', +] + +html_js_files = [ + 'custom.js', +] + +html_context = { + "display_gitlab": True, + "gitlab_user": "qemu-project", + "gitlab_repo": "qemu", + "gitlab_version": "master", + "conf_py_path": "/docs/", # Path in the checkout to the docs root +} + +# Custom sidebar templates, must be a dictionary that maps document names +# to template names. +#html_sidebars = {} + +# Don't copy the rST source files to the HTML output directory, +# and don't put links to the sources into the output HTML. +html_copy_source = False + +# -- Options for HTMLHelp output ------------------------------------------ + +# Output file base name for HTML help builder. +htmlhelp_basename = 'QEMUdoc' + + +# -- Options for LaTeX output --------------------------------------------- + +latex_elements = { + # The paper size ('letterpaper' or 'a4paper'). + # + # 'papersize': 'letterpaper', + + # The font size ('10pt', '11pt' or '12pt'). + # + # 'pointsize': '10pt', + + # Additional stuff for the LaTeX preamble. + # + # 'preamble': '', + + # Latex figure (float) alignment + # + # 'figure_align': 'htbp', +} + +# Grouping the document tree into LaTeX files. List of tuples +# (source start file, target name, title, +# author, documentclass [howto, manual, or own class]). 
+latex_documents = [ + (master_doc, 'QEMU.tex', u'QEMU Documentation', + u'The QEMU Project Developers', 'manual'), +] + + +# -- Options for manual page output --------------------------------------- +# Individual manual/conf.py can override this to create man pages +man_pages = [ + ('interop/qemu-ga', 'qemu-ga', + 'QEMU Guest Agent', + ['Michael Roth <mdroth@linux.vnet.ibm.com>'], 8), + ('interop/qemu-ga-ref', 'qemu-ga-ref', + 'QEMU Guest Agent Protocol Reference', + [], 7), + ('interop/qemu-qmp-ref', 'qemu-qmp-ref', + 'QEMU QMP Reference Manual', + [], 7), + ('interop/qemu-storage-daemon-qmp-ref', 'qemu-storage-daemon-qmp-ref', + 'QEMU Storage Daemon QMP Reference Manual', + [], 7), + ('system/qemu-manpage', 'qemu', + 'QEMU User Documentation', + ['Fabrice Bellard'], 1), + ('system/qemu-block-drivers', 'qemu-block-drivers', + 'QEMU block drivers reference', + ['Fabrice Bellard and the QEMU Project developers'], 7), + ('system/qemu-cpu-models', 'qemu-cpu-models', + 'QEMU CPU Models', + ['The QEMU Project developers'], 7), + ('tools/qemu-img', 'qemu-img', + 'QEMU disk image utility', + ['Fabrice Bellard'], 1), + ('tools/qemu-nbd', 'qemu-nbd', + 'QEMU Disk Network Block Device Server', + ['Anthony Liguori <anthony@codemonkey.ws>'], 8), + ('tools/qemu-pr-helper', 'qemu-pr-helper', + 'QEMU persistent reservation helper', + [], 8), + ('tools/qemu-storage-daemon', 'qemu-storage-daemon', + 'QEMU storage daemon', + [], 1), + ('tools/qemu-trace-stap', 'qemu-trace-stap', + 'QEMU SystemTap trace tool', + [], 1), + ('tools/virtfs-proxy-helper', 'virtfs-proxy-helper', + 'QEMU 9p virtfs proxy filesystem helper', + ['M. Mohan Kumar'], 1), + ('tools/virtiofsd', 'virtiofsd', + 'QEMU virtio-fs shared file system daemon', + ['Stefan Hajnoczi <stefanha@redhat.com>', + 'Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>'], 1), +] +man_make_section_directory = False + +# -- Options for Texinfo output ------------------------------------------- + +# Grouping the document tree into Texinfo files. List of tuples +# (source start file, target name, title, author, +# dir menu entry, description, category) +texinfo_documents = [ + (master_doc, 'QEMU', u'QEMU Documentation', + author, 'QEMU', 'One line description of project.', + 'Miscellaneous'), +] + + + +# We use paths starting from qemu_docdir here so that you can run +# sphinx-build from anywhere and the kerneldoc extension can still +# find everything. +kerneldoc_bin = ['perl', os.path.join(qemu_docdir, '../scripts/kernel-doc')] +kerneldoc_srctree = os.path.join(qemu_docdir, '..') +hxtool_srctree = os.path.join(qemu_docdir, '..') +qapidoc_srctree = os.path.join(qemu_docdir, '..') diff --git a/docs/confidential-guest-support.txt b/docs/confidential-guest-support.txt new file mode 100644 index 000000000..71d07ba57 --- /dev/null +++ b/docs/confidential-guest-support.txt @@ -0,0 +1,49 @@ +Confidential Guest Support +========================== + +Traditionally, hypervisors such as QEMU have complete access to a +guest's memory and other state, meaning that a compromised hypervisor +can compromise any of its guests. A number of platforms have added +mechanisms in hardware and/or firmware which give guests at least some +protection from a compromised hypervisor. This is obviously +especially desirable for public cloud environments. + +These mechanisms have different names and different modes of +operation, but are often referred to as Secure Guests or Confidential +Guests. 
We use the term "Confidential Guest Support" to distinguish
+this from other aspects of guest security (such as security against
+attacks from other guests, or from network sources).
+
+Running a Confidential Guest
+----------------------------
+
+To run a confidential guest, you need to add two command line parameters:
+
+1. Use "-object" to create a "confidential guest support" object. The
+   type and parameters will vary with the specific mechanism to be
+   used.
+2. Set the "confidential-guest-support" machine parameter to the ID of
+   the object from (1).
+
+Example (for AMD SEV)::
+
+    qemu-system-x86_64 \
+        <other parameters> \
+        -machine ...,confidential-guest-support=sev0 \
+        -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1
+
+Supported mechanisms
+--------------------
+
+Currently supported confidential guest mechanisms are:
+
+AMD Secure Encrypted Virtualization (SEV)
+    docs/amd-memory-encryption.txt
+
+POWER Protected Execution Facility (PEF)
+    docs/papr-pef.txt
+
+s390x Protected Virtualization (PV)
+    docs/system/s390x/protvirt.rst
+
+Other mechanisms may be supported in future.
diff --git a/docs/config/ich9-ehci-uhci.cfg b/docs/config/ich9-ehci-uhci.cfg
new file mode 100644
index 000000000..a0e9b96f4
--- /dev/null
+++ b/docs/config/ich9-ehci-uhci.cfg
@@ -0,0 +1,37 @@
+###########################################################################
+#
+# You can pass this file directly to qemu using the -readconfig
+# command line switch.
+#
+# This config file creates an EHCI adapter with companion UHCI
+# controllers as a multifunction device in PCI slot "1d".
+#
+# Specify "bus=ehci.0" when creating usb devices to hook them up
+# there.
+#
+
+[device "ehci"]
+  driver = "ich9-usb-ehci1"
+  addr = "1d.7"
+  multifunction = "on"
+
+[device "uhci-1"]
+  driver = "ich9-usb-uhci1"
+  addr = "1d.0"
+  multifunction = "on"
+  masterbus = "ehci.0"
+  firstport = "0"
+
+[device "uhci-2"]
+  driver = "ich9-usb-uhci2"
+  addr = "1d.1"
+  multifunction = "on"
+  masterbus = "ehci.0"
+  firstport = "2"
+
+[device "uhci-3"]
+  driver = "ich9-usb-uhci3"
+  addr = "1d.2"
+  multifunction = "on"
+  masterbus = "ehci.0"
+  firstport = "4"
diff --git a/docs/config/mach-virt-graphical.cfg b/docs/config/mach-virt-graphical.cfg
new file mode 100644
index 000000000..d6d31b17f
--- /dev/null
+++ b/docs/config/mach-virt-graphical.cfg
@@ -0,0 +1,287 @@
+# mach-virt - VirtIO guest (graphical console)
+# =========================================================
+#
+# Usage:
+#
+#   $ qemu-system-aarch64 \
+#       -nodefaults \
+#       -readconfig mach-virt-graphical.cfg \
+#       -cpu host
+#
+# You will probably need to tweak the lines marked as
+# CHANGE ME before being able to use this configuration!
+#
+# The guest will have a selection of VirtIO devices
+# tailored towards optimal performance with modern guests,
+# and will be accessed through a graphical console.
+#
+# ---------------------------------------------------------
+#
+# Using -nodefaults is required to have full control over
+# the virtual hardware: when it's specified, QEMU will
+# populate the board with only the builtin peripherals,
+# such as the PL011 UART, plus a PCI Express Root Bus; the
+# user will then have to explicitly add further devices.
+#
+# The PCI Express Root Bus shows up in the guest as:
+#
+#   00:00.0 Host bridge
+#
+# This configuration file adds a number of other useful
+# devices, more specifically:
+#
+#   00:01.0 Display controller
+#   00:1c.* PCI bridge (PCI Express Root Ports)
+#   01:00.0 SCSI storage controller
+#   02:00.0 Ethernet controller
+#   03:00.0 USB controller
+#
+# More information about these devices is available below.
+
+
+# Machine options
+# =========================================================
+#
+# We use the virt machine type and enable KVM acceleration
+# for better performance.
+#
+# Using less than 1 GiB of memory is probably not going to
+# yield good performance in the guest, and might even lead
+# to obscure boot issues in some cases.
+#
+# Unfortunately, there is no way to configure the CPU model
+# in this file, so it will have to be provided on the
+# command line, but we can configure the guest to use the
+# same GIC version as the host.
+
+[machine]
+  type = "virt"
+  accel = "kvm"
+  gic-version = "host"
+
+[memory]
+  size = "1024"
+
+
+# Firmware configuration
+# =========================================================
+#
+# There are two parts to the firmware: a read-only image
+# containing the executable code, which is shared between
+# guests, and a read/write variable store that is owned
+# by one specific guest, exclusively, and is used to
+# record information such as the UEFI boot order.
+#
+# For any new guest, its permanent, private variable store
+# should initially be copied from the template file
+# provided along with the firmware binary.
+#
+# Depending on the OS distribution you're using on the
+# host, the name of the package containing the firmware
+# binary and variable store template, as well as the paths
+# to the files themselves, will be different. For example:
+#
+# Fedora
+#   edk2-aarch64                                      (pkg)
+#   /usr/share/edk2/aarch64/QEMU_EFI-pflash.raw       (bin)
+#   /usr/share/edk2/aarch64/vars-template-pflash.raw  (var)
+#
+# RHEL
+#   AAVMF                                             (pkg)
+#   /usr/share/AAVMF/AAVMF_CODE.fd                    (bin)
+#   /usr/share/AAVMF/AAVMF_VARS.fd                    (var)
+#
+# Debian/Ubuntu
+#   qemu-efi                                          (pkg)
+#   /usr/share/AAVMF/AAVMF_CODE.fd                    (bin)
+#   /usr/share/AAVMF/AAVMF_VARS.fd                    (var)
+
+[drive "uefi-binary"]
+  file = "/usr/share/AAVMF/AAVMF_CODE.fd"             # CHANGE ME
+  format = "raw"
+  if = "pflash"
+  unit = "0"
+  readonly = "on"
+
+[drive "uefi-varstore"]
+  file = "guest_VARS.fd"                              # CHANGE ME
+  format = "raw"
+  if = "pflash"
+  unit = "1"
+
+
+# PCI bridge (PCI Express Root Ports)
+# =========================================================
+#
+# We create eight PCI Express Root Ports, and we plug them
+# all into separate functions of the same slot. Some of
+# them will be used by devices, the rest will remain
+# available for hotplug.
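+#
+# As a sketch of how a spare port might later be used (the id
+# "scsi2" is just an example), a device can be hotplugged from
+# the QEMU monitor with a command along the lines of:
+#
+#   (qemu) device_add virtio-scsi-pci,bus=pcie.4,id=scsi2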
+ +[device "pcie.1"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.0" + port = "1" + chassis = "1" + multifunction = "on" + +[device "pcie.2"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.1" + port = "2" + chassis = "2" + +[device "pcie.3"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.2" + port = "3" + chassis = "3" + +[device "pcie.4"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.3" + port = "4" + chassis = "4" + +[device "pcie.5"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.4" + port = "5" + chassis = "5" + +[device "pcie.6"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.5" + port = "6" + chassis = "6" + +[device "pcie.7"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.6" + port = "7" + chassis = "7" + +[device "pcie.8"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.7" + port = "8" + chassis = "8" + + +# SCSI storage controller (and storage) +# ========================================================= +# +# We use virtio-scsi here so that we can (hot)plug a large +# number of disks without running into issues; a SCSI disk, +# backed by a qcow2 disk image on the host's filesystem, is +# attached to it. +# +# We also create an optical disk, mostly for installation +# purposes: once the guest OS has been successfully +# installed, the guest will no longer boot from optical +# media. If you don't want, or no longer want, to have an +# optical disk in the guest you can safely comment out +# all relevant sections below. + +[device "scsi"] + driver = "virtio-scsi-pci" + bus = "pcie.1" + addr = "00.0" + +[device "scsi-disk"] + driver = "scsi-hd" + bus = "scsi.0" + drive = "disk" + bootindex = "1" + +[drive "disk"] + file = "guest.qcow2" # CHANGE ME + format = "qcow2" + if = "none" + +[device "scsi-optical-disk"] + driver = "scsi-cd" + bus = "scsi.0" + drive = "optical-disk" + bootindex = "2" + +[drive "optical-disk"] + file = "install.iso" # CHANGE ME + format = "raw" + if = "none" + + +# Ethernet controller +# ========================================================= +# +# We use virtio-net for improved performance over emulated +# hardware; on the host side, we take advantage of user +# networking so that the QEMU process doesn't require any +# additional privileges. + +[netdev "hostnet"] + type = "user" + +[device "net"] + driver = "virtio-net-pci" + netdev = "hostnet" + bus = "pcie.2" + addr = "00.0" + + +# USB controller (and input devices) +# ========================================================= +# +# We add a virtualization-friendly USB 3.0 controller and +# a USB keyboard / USB tablet combo so that graphical +# guests can be controlled appropriately. + +[device "usb"] + driver = "nec-usb-xhci" + bus = "pcie.3" + addr = "00.0" + +[device "keyboard"] + driver = "usb-kbd" + bus = "usb.0" + +[device "tablet"] + driver = "usb-tablet" + bus = "usb.0" + + +# Display controller +# ========================================================= +# +# We use virtio-gpu because the legacy VGA framebuffer is +# very troublesome on aarch64, and virtio-gpu is the only +# video device that doesn't implement it. +# +# If you're running the guest on a remote, potentially +# headless host, you will probably want to append something +# like +# +# -display vnc=127.0.0.1:0 +# +# to the command line in order to prevent QEMU from +# creating a graphical display window on the host and +# enable remote access instead. 
+
+[device "video"]
+  driver = "virtio-gpu"
+  bus = "pcie.0"
+  addr = "01.0"
diff --git a/docs/config/mach-virt-serial.cfg b/docs/config/mach-virt-serial.cfg
new file mode 100644
index 000000000..18a7c8373
--- /dev/null
+++ b/docs/config/mach-virt-serial.cfg
@@ -0,0 +1,253 @@
+# mach-virt - VirtIO guest (serial console)
+# =========================================================
+#
+# Usage:
+#
+#   $ qemu-system-aarch64 \
+#       -nodefaults \
+#       -readconfig mach-virt-serial.cfg \
+#       -display none -serial mon:stdio \
+#       -cpu host
+#
+# You will probably need to tweak the lines marked as
+# CHANGE ME before being able to use this configuration!
+#
+# The guest will have a selection of VirtIO devices
+# tailored towards optimal performance with modern guests,
+# and will be accessed through the serial console.
+#
+# ---------------------------------------------------------
+#
+# Using -nodefaults is required to have full control over
+# the virtual hardware: when it's specified, QEMU will
+# populate the board with only the builtin peripherals,
+# such as the PL011 UART, plus a PCI Express Root Bus; the
+# user will then have to explicitly add further devices.
+#
+# The PCI Express Root Bus shows up in the guest as:
+#
+#   00:00.0 Host bridge
+#
+# This configuration file adds a number of other useful
+# devices, more specifically:
+#
+#   00:1c.* PCI bridge (PCI Express Root Ports)
+#   01:00.0 SCSI storage controller
+#   02:00.0 Ethernet controller
+#
+# More information about these devices is available below.
+#
+# We use '-display none' to prevent QEMU from creating a
+# graphical display window, which would serve no use in
+# this specific configuration, and '-serial mon:stdio' to
+# multiplex the guest's serial console and the QEMU monitor
+# to the host's stdio; use 'Ctrl+A h' to learn how to
+# switch between the two and more.
+
+
+# Machine options
+# =========================================================
+#
+# We use the virt machine type and enable KVM acceleration
+# for better performance.
+#
+# Using less than 1 GiB of memory is probably not going to
+# yield good performance in the guest, and might even lead
+# to obscure boot issues in some cases.
+#
+# Unfortunately, there is no way to configure the CPU model
+# in this file, so it will have to be provided on the
+# command line, but we can configure the guest to use the
+# same GIC version as the host.
+
+[machine]
+  type = "virt"
+  accel = "kvm"
+  gic-version = "host"
+
+[memory]
+  size = "1024"
+
+
+# Firmware configuration
+# =========================================================
+#
+# There are two parts to the firmware: a read-only image
+# containing the executable code, which is shared between
+# guests, and a read/write variable store that is owned
+# by one specific guest, exclusively, and is used to
+# record information such as the UEFI boot order.
+#
+# For any new guest, its permanent, private variable store
+# should initially be copied from the template file
+# provided along with the firmware binary.
+#
+# Depending on the OS distribution you're using on the
+# host, the name of the package containing the firmware
+# binary and variable store template, as well as the paths
+# to the files themselves, will be different. 
For example: +# +# Fedora +# edk2-aarch64 (pkg) +# /usr/share/edk2/aarch64/QEMU_EFI-pflash.raw (bin) +# /usr/share/edk2/aarch64/vars-template-pflash.raw (var) +# +# RHEL +# AAVMF (pkg) +# /usr/share/AAVMF/AAVMF_CODE.fd (bin) +# /usr/share/AAVMF/AAVMF_VARS.fd (var) +# +# Debian/Ubuntu +# qemu-efi (pkg) +# /usr/share/AAVMF/AAVMF_CODE.fd (bin) +# /usr/share/AAVMF/AAVMF_VARS.fd (var) + +[drive "uefi-binary"] + file = "/usr/share/AAVMF/AAVMF_CODE.fd" # CHANGE ME + format = "raw" + if = "pflash" + unit = "0" + readonly = "on" + +[drive "uefi-varstore"] + file = "guest_VARS.fd" # CHANGE ME + format = "raw" + if = "pflash" + unit = "1" + + +# PCI bridge (PCI Express Root Ports) +# ========================================================= +# +# We create eight PCI Express Root Ports, and we plug them +# all into separate functions of the same slot. Some of +# them will be used by devices, the rest will remain +# available for hotplug. + +[device "pcie.1"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.0" + port = "1" + chassis = "1" + multifunction = "on" + +[device "pcie.2"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.1" + port = "2" + chassis = "2" + +[device "pcie.3"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.2" + port = "3" + chassis = "3" + +[device "pcie.4"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.3" + port = "4" + chassis = "4" + +[device "pcie.5"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.4" + port = "5" + chassis = "5" + +[device "pcie.6"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.5" + port = "6" + chassis = "6" + +[device "pcie.7"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.6" + port = "7" + chassis = "7" + +[device "pcie.8"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.7" + port = "8" + chassis = "8" + + +# SCSI storage controller (and storage) +# ========================================================= +# +# We use virtio-scsi here so that we can (hot)plug a large +# number of disks without running into issues; a SCSI disk, +# backed by a qcow2 disk image on the host's filesystem, is +# attached to it. +# +# We also create an optical disk, mostly for installation +# purposes: once the guest OS has been successfully +# installed, the guest will no longer boot from optical +# media. If you don't want, or no longer want, to have an +# optical disk in the guest you can safely comment out +# all relevant sections below. + +[device "scsi"] + driver = "virtio-scsi-pci" + bus = "pcie.1" + addr = "00.0" + +[device "scsi-disk"] + driver = "scsi-hd" + bus = "scsi.0" + drive = "disk" + bootindex = "1" + +[drive "disk"] + file = "guest.qcow2" # CHANGE ME + format = "qcow2" + if = "none" + +[device "scsi-optical-disk"] + driver = "scsi-cd" + bus = "scsi.0" + drive = "optical-disk" + bootindex = "2" + +[drive "optical-disk"] + file = "install.iso" # CHANGE ME + format = "raw" + if = "none" + + +# Ethernet controller +# ========================================================= +# +# We use virtio-net for improved performance over emulated +# hardware; on the host side, we take advantage of user +# networking so that the QEMU process doesn't require any +# additional privileges. 
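+#
+# As a sketch of an alternative (tap networking; this needs extra
+# privileges on the host, plus working qemu-ifup / qemu-ifdown
+# scripts), the [netdev "hostnet"] group below could instead look
+# something like:
+#
+#   [netdev "hostnet"]
+#     type = "tap"
+#     script = "/etc/qemu-ifup"
+#     downscript = "/etc/qemu-ifdown"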
+ +[netdev "hostnet"] + type = "user" + +[device "net"] + driver = "virtio-net-pci" + netdev = "hostnet" + bus = "pcie.2" + addr = "00.0" diff --git a/docs/config/q35-emulated.cfg b/docs/config/q35-emulated.cfg new file mode 100644 index 000000000..99ac918e7 --- /dev/null +++ b/docs/config/q35-emulated.cfg @@ -0,0 +1,288 @@ +# q35 - Emulated guest (graphical console) +# ========================================================= +# +# Usage: +# +# $ qemu-system-x86_64 \ +# -nodefaults \ +# -readconfig q35-emulated.cfg +# +# You will probably need to tweak the lines marked as +# CHANGE ME before being able to use this configuration! +# +# The guest will have a selection of emulated devices that +# closely resembles that of a physical machine, and will be +# accessed through a graphical console. +# +# --------------------------------------------------------- +# +# Using -nodefaults is required to have full control over +# the virtual hardware: when it's specified, QEMU will +# populate the board with only the builtin peripherals +# plus a small selection of core PCI devices and +# controllers; the user will then have to explicitly add +# further devices. +# +# The core PCI devices show up in the guest as: +# +# 00:00.0 Host bridge +# 00:1f.0 ISA bridge / LPC +# 00:1f.2 SATA (AHCI) controller +# 00:1f.3 SMBus controller +# +# This configuration file adds a number of devices that +# are pretty much guaranteed to be present in every single +# physical machine based on q35, more specifically: +# +# 00:01.0 VGA compatible controller +# 00:19.0 Ethernet controller +# 00:1a.* USB controller (#2) +# 00:1b.0 Audio device +# 00:1c.* PCI bridge (PCI Express Root Ports) +# 00:1d.* USB Controller (#1) +# 00:1e.0 PCI bridge (legacy PCI bridge) +# +# More information about these devices is available below. + + +# Machine options +# ========================================================= +# +# We use the q35 machine type and enable KVM acceleration +# for better performance. +# +# Using less than 1 GiB of memory is probably not going to +# yield good performance in the guest, and might even lead +# to obscure boot issues in some cases. +# +# Unfortunately, there is no way to configure the CPU model +# in this file, so it will have to be provided on the +# command line. + +[machine] + type = "q35" + accel = "kvm" + +[memory] + size = "1024" + + +# PCI bridge (PCI Express Root Ports) +# ========================================================= +# +# We add four PCI Express Root Ports, all sharing the same +# slot on the PCI Express Root Bus. These ports support +# hotplug. + +[device "ich9-pcie-port-1"] + driver = "ioh3420" + multifunction = "on" + bus = "pcie.0" + addr = "1c.0" + port = "1" + chassis = "1" + +[device "ich9-pcie-port-2"] + driver = "ioh3420" + multifunction = "on" + bus = "pcie.0" + addr = "1c.1" + port = "2" + chassis = "2" + +[device "ich9-pcie-port-3"] + driver = "ioh3420" + multifunction = "on" + bus = "pcie.0" + addr = "1c.2" + port = "3" + chassis = "3" + +[device "ich9-pcie-port-4"] + driver = "ioh3420" + multifunction = "on" + bus = "pcie.0" + addr = "1c.3" + port = "4" + chassis = "4" + + +# PCI bridge (legacy PCI bridge) +# ========================================================= +# +# This bridge can be used to build an independent topology +# for legacy PCI devices. PCI Express devices should be +# plugged into PCI Express slots instead, so ideally there +# will be no devices connected to this bridge. 
+ +[device "ich9-pci-bridge"] + driver = "i82801b11-bridge" + bus = "pcie.0" + addr = "1e.0" + + +# SATA storage +# ========================================================= +# +# An implicit SATA controller is created automatically for +# every single q35 guest; here we create a disk, backed by +# a qcow2 disk image on the host's filesystem, and attach +# it to that controller so that the guest can use it. +# +# We also create an optical disk, mostly for installation +# purposes: once the guest OS has been successfully +# installed, the guest will no longer boot from optical +# media. If you don't want, or no longer want, to have an +# optical disk in the guest you can safely comment out +# all relevant sections below. + +[device "sata-disk"] + driver = "ide-hd" + bus = "ide.0" + drive = "disk" + bootindex = "1" + +[drive "disk"] + file = "guest.qcow2" # CHANGE ME + format = "qcow2" + if = "none" + +[device "sata-optical-disk"] + driver = "ide-cd" + bus = "ide.1" + drive = "optical-disk" + bootindex = "2" + +[drive "optical-disk"] + file = "install.iso" # CHANGE ME + format = "raw" + if = "none" + + +# USB controller (#1) +# ========================================================= +# +# EHCI controller + UHCI companion controllers. + +[device "ich9-ehci-1"] + driver = "ich9-usb-ehci1" + multifunction = "on" + bus = "pcie.0" + addr = "1d.7" + +[device "ich9-uhci-1"] + driver = "ich9-usb-uhci1" + multifunction = "on" + bus = "pcie.0" + addr = "1d.0" + masterbus = "ich9-ehci-1.0" + firstport = "0" + +[device "ich9-uhci-2"] + driver = "ich9-usb-uhci2" + multifunction = "on" + bus = "pcie.0" + addr = "1d.1" + masterbus = "ich9-ehci-1.0" + firstport = "2" + +[device "ich9-uhci-3"] + driver = "ich9-usb-uhci3" + multifunction = "on" + bus = "pcie.0" + addr = "1d.2" + masterbus = "ich9-ehci-1.0" + firstport = "4" + + +# USB controller (#2) +# ========================================================= +# +# EHCI controller + UHCI companion controllers. + +[device "ich9-ehci-2"] + driver = "ich9-usb-ehci2" + multifunction = "on" + bus = "pcie.0" + addr = "1a.7" + +[device "ich9-uhci-4"] + driver = "ich9-usb-uhci4" + multifunction = "on" + bus = "pcie.0" + addr = "1a.0" + masterbus = "ich9-ehci-2.0" + firstport = "0" + +[device "ich9-uhci-5"] + driver = "ich9-usb-uhci5" + multifunction = "on" + bus = "pcie.0" + addr = "1a.1" + masterbus = "ich9-ehci-2.0" + firstport = "2" + +[device "ich9-uhci-6"] + driver = "ich9-usb-uhci6" + multifunction = "on" + bus = "pcie.0" + addr = "1a.2" + masterbus = "ich9-ehci-2.0" + firstport = "4" + + +# Ethernet controller +# ========================================================= +# +# We add a Gigabit Ethernet interface to the guest; on the +# host side, we take advantage of user networking so that +# the QEMU process doesn't require any additional +# privileges. + +[netdev "hostnet"] + type = "user" + +[device "net"] + driver = "e1000" + netdev = "hostnet" + bus = "pcie.0" + addr = "19.0" + + +# VGA compatible controller +# ========================================================= +# +# We use stdvga instead of Cirrus as it supports more video +# modes and is closer to what actual hardware looks like. +# +# If you're running the guest on a remote, potentially +# headless host, you will probably want to append something +# like +# +# -display vnc=127.0.0.1:0 +# +# to the command line in order to prevent QEMU from +# creating a graphical display window on the host and +# enable remote access instead. 
+
+[device "video"]
+  driver = "VGA"
+  bus = "pcie.0"
+  addr = "01.0"
+
+
+# Audio device
+# =========================================================
+#
+# The sound card is a legacy PCI device that is plugged
+# directly into the PCI Express Root Bus.
+
+[device "ich9-hda-audio"]
+  driver = "ich9-intel-hda"
+  bus = "pcie.0"
+  addr = "1b.0"
+
+[device "ich9-hda-duplex"]
+  driver = "hda-duplex"
+  bus = "ich9-hda-audio.0"
+  cad = "0"
diff --git a/docs/config/q35-virtio-graphical.cfg b/docs/config/q35-virtio-graphical.cfg
new file mode 100644
index 000000000..4207f11e4
--- /dev/null
+++ b/docs/config/q35-virtio-graphical.cfg
@@ -0,0 +1,248 @@
+# q35 - VirtIO guest (graphical console)
+# =========================================================
+#
+# Usage:
+#
+#   $ qemu-system-x86_64 \
+#       -nodefaults \
+#       -readconfig q35-virtio-graphical.cfg
+#
+# You will probably need to tweak the lines marked as
+# CHANGE ME before being able to use this configuration!
+#
+# The guest will have a selection of VirtIO devices
+# tailored towards optimal performance with modern guests,
+# and will be accessed through a graphical console.
+#
+# ---------------------------------------------------------
+#
+# Using -nodefaults is required to have full control over
+# the virtual hardware: when it's specified, QEMU will
+# populate the board with only the builtin peripherals
+# plus a small selection of core PCI devices and
+# controllers; the user will then have to explicitly add
+# further devices.
+#
+# The core PCI devices show up in the guest as:
+#
+#   00:00.0 Host bridge
+#   00:1f.0 ISA bridge / LPC
+#   00:1f.2 SATA (AHCI) controller
+#   00:1f.3 SMBus controller
+#
+# This configuration file adds a number of other useful
+# devices, more specifically:
+#
+#   00:01.0 VGA compatible controller
+#   00:1b.0 Audio device
+#   00:1c.* PCI bridge (PCI Express Root Ports)
+#   01:00.0 SCSI storage controller
+#   02:00.0 Ethernet controller
+#   03:00.0 USB controller
+#
+# More information about these devices is available below.
+
+
+# Machine options
+# =========================================================
+#
+# We use the q35 machine type and enable KVM acceleration
+# for better performance.
+#
+# Using less than 1 GiB of memory is probably not going to
+# yield good performance in the guest, and might even lead
+# to obscure boot issues in some cases.
+
+[machine]
+  type = "q35"
+  accel = "kvm"
+
+[memory]
+  size = "1024"
+
+
+# PCI bridge (PCI Express Root Ports)
+# =========================================================
+#
+# We create eight PCI Express Root Ports, and we plug them
+# all into separate functions of the same slot. Some of
+# them will be used by devices, the rest will remain
+# available for hotplug.
+ +[device "pcie.1"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.0" + port = "1" + chassis = "1" + multifunction = "on" + +[device "pcie.2"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.1" + port = "2" + chassis = "2" + +[device "pcie.3"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.2" + port = "3" + chassis = "3" + +[device "pcie.4"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.3" + port = "4" + chassis = "4" + +[device "pcie.5"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.4" + port = "5" + chassis = "5" + +[device "pcie.6"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.5" + port = "6" + chassis = "6" + +[device "pcie.7"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.6" + port = "7" + chassis = "7" + +[device "pcie.8"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.7" + port = "8" + chassis = "8" + + +# SCSI storage controller (and storage) +# ========================================================= +# +# We use virtio-scsi here so that we can (hot)plug a large +# number of disks without running into issues; a SCSI disk, +# backed by a qcow2 disk image on the host's filesystem, is +# attached to it. +# +# We also create an optical disk, mostly for installation +# purposes: once the guest OS has been successfully +# installed, the guest will no longer boot from optical +# media. If you don't want, or no longer want, to have an +# optical disk in the guest you can safely comment out +# all relevant sections below. + +[device "scsi"] + driver = "virtio-scsi-pci" + bus = "pcie.1" + addr = "00.0" + +[device "scsi-disk"] + driver = "scsi-hd" + bus = "scsi.0" + drive = "disk" + bootindex = "1" + +[drive "disk"] + file = "guest.qcow2" # CHANGE ME + format = "qcow2" + if = "none" + +[device "scsi-optical-disk"] + driver = "scsi-cd" + bus = "scsi.0" + drive = "optical-disk" + bootindex = "2" + +[drive "optical-disk"] + file = "install.iso" # CHANGE ME + format = "raw" + if = "none" + + +# Ethernet controller +# ========================================================= +# +# We use virtio-net for improved performance over emulated +# hardware; on the host side, we take advantage of user +# networking so that the QEMU process doesn't require any +# additional privileges. + +[netdev "hostnet"] + type = "user" + +[device "net"] + driver = "virtio-net-pci" + netdev = "hostnet" + bus = "pcie.2" + addr = "00.0" + + +# USB controller (and input devices) +# ========================================================= +# +# We add a virtualization-friendly USB 3.0 controller and +# a USB tablet so that graphical guests can be controlled +# appropriately. A USB keyboard is not needed, as q35 +# guests get a PS/2 one added automatically. + +[device "usb"] + driver = "nec-usb-xhci" + bus = "pcie.3" + addr = "00.0" + +[device "tablet"] + driver = "usb-tablet" + bus = "usb.0" + + +# VGA compatible controller +# ========================================================= +# +# We plug the QXL video card directly into the PCI Express +# Root Bus as it is a legacy PCI device; this way, we can +# reduce the number of PCI Express controllers in the +# guest. +# +# If you're running the guest on a remote, potentially +# headless host, you will probably want to append something +# like +# +# -display vnc=127.0.0.1:0 +# +# to the command line in order to prevent QEMU from +# creating a graphical display window on the host and +# enable remote access instead. 
+
+[device "video"]
+  driver = "qxl-vga"
+  bus = "pcie.0"
+  addr = "01.0"
+
+
+# Audio device
+# =========================================================
+#
+# Like the video card, the sound card is a legacy PCI
+# device and as such can be plugged directly into the PCI
+# Express Root Bus.
+
+[device "sound"]
+  driver = "ich9-intel-hda"
+  bus = "pcie.0"
+  addr = "1b.0"
+
+[device "duplex"]
+  driver = "hda-duplex"
+  bus = "sound.0"
+  cad = "0"
diff --git a/docs/config/q35-virtio-serial.cfg b/docs/config/q35-virtio-serial.cfg
new file mode 100644
index 000000000..d2830aec5
--- /dev/null
+++ b/docs/config/q35-virtio-serial.cfg
@@ -0,0 +1,193 @@
+# q35 - VirtIO guest (serial console)
+# =========================================================
+#
+# Usage:
+#
+#   $ qemu-system-x86_64 \
+#       -nodefaults \
+#       -readconfig q35-virtio-serial.cfg \
+#       -display none -serial mon:stdio
+#
+# You will probably need to tweak the lines marked as
+# CHANGE ME before being able to use this configuration!
+#
+# The guest will have a selection of VirtIO devices
+# tailored towards optimal performance with modern guests,
+# and will be accessed through the serial console.
+#
+# ---------------------------------------------------------
+#
+# Using -nodefaults is required to have full control over
+# the virtual hardware: when it's specified, QEMU will
+# populate the board with only the builtin peripherals
+# plus a small selection of core PCI devices and
+# controllers; the user will then have to explicitly add
+# further devices.
+#
+# The core PCI devices show up in the guest as:
+#
+#   00:00.0 Host bridge
+#   00:1f.0 ISA bridge / LPC
+#   00:1f.2 SATA (AHCI) controller
+#   00:1f.3 SMBus controller
+#
+# This configuration file adds a number of other useful
+# devices, more specifically:
+#
+#   00:1c.* PCI bridge (PCI Express Root Ports)
+#   01:00.0 SCSI storage controller
+#   02:00.0 Ethernet controller
+#
+# More information about these devices is available below.
+#
+# We use '-display none' to prevent QEMU from creating a
+# graphical display window, which would serve no use in
+# this specific configuration, and '-serial mon:stdio' to
+# multiplex the guest's serial console and the QEMU monitor
+# to the host's stdio; use 'Ctrl+A h' to learn how to
+# switch between the two and more.
+
+
+# Machine options
+# =========================================================
+#
+# We use the q35 machine type and enable KVM acceleration
+# for better performance.
+#
+# Using less than 1 GiB of memory is probably not going to
+# yield good performance in the guest, and might even lead
+# to obscure boot issues in some cases.
+
+[machine]
+  type = "q35"
+  accel = "kvm"
+
+[memory]
+  size = "1024"
+
+
+# PCI bridge (PCI Express Root Ports)
+# =========================================================
+#
+# We create eight PCI Express Root Ports, and we plug them
+# all into separate functions of the same slot. Some of
+# them will be used by devices, the rest will remain
+# available for hotplug.
+ +[device "pcie.1"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.0" + port = "1" + chassis = "1" + multifunction = "on" + +[device "pcie.2"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.1" + port = "2" + chassis = "2" + +[device "pcie.3"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.2" + port = "3" + chassis = "3" + +[device "pcie.4"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.3" + port = "4" + chassis = "4" + +[device "pcie.5"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.4" + port = "5" + chassis = "5" + +[device "pcie.6"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.5" + port = "6" + chassis = "6" + +[device "pcie.7"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.6" + port = "7" + chassis = "7" + +[device "pcie.8"] + driver = "pcie-root-port" + bus = "pcie.0" + addr = "1c.7" + port = "8" + chassis = "8" + + +# SCSI storage controller (and storage) +# ========================================================= +# +# We use virtio-scsi here so that we can (hot)plug a large +# number of disks without running into issues; a SCSI disk, +# backed by a qcow2 disk image on the host's filesystem, is +# attached to it. +# +# We also create an optical disk, mostly for installation +# purposes: once the guest OS has been successfully +# installed, the guest will no longer boot from optical +# media. If you don't want, or no longer want, to have an +# optical disk in the guest you can safely comment out +# all relevant sections below. + +[device "scsi"] + driver = "virtio-scsi-pci" + bus = "pcie.1" + addr = "00.0" + +[device "scsi-disk"] + driver = "scsi-hd" + bus = "scsi.0" + drive = "disk" + bootindex = "1" + +[drive "disk"] + file = "guest.qcow2" # CHANGE ME + format = "qcow2" + if = "none" + +[device "scsi-optical-disk"] + driver = "scsi-cd" + bus = "scsi.0" + drive = "optical-disk" + bootindex = "2" + +[drive "optical-disk"] + file = "install.iso" # CHANGE ME + format = "raw" + if = "none" + + +# Ethernet controller +# ========================================================= +# +# We use virtio-net for improved performance over emulated +# hardware; on the host side, we take advantage of user +# networking so that the QEMU process doesn't require any +# additional privileges. + +[netdev "hostnet"] + type = "user" + +[device "net"] + driver = "virtio-net-pci" + netdev = "hostnet" + bus = "pcie.2" + addr = "00.0" diff --git a/docs/defs.rst.inc b/docs/defs.rst.inc new file mode 100644 index 000000000..52d6454b9 --- /dev/null +++ b/docs/defs.rst.inc @@ -0,0 +1,15 @@ +.. + Generally useful rST substitution definitions. This is included for + all rST files as part of the epilogue by docs/conf.py. conf.py + also defines some dynamically generated substitutions like CONFDIR. + + Note that |qemu_system| and |qemu_system_x86| are intended to be + used inside a parsed-literal block: the definition must not include + extra literal formatting with ``..``: this works in the HTML output + but the manpages will end up misrendered with following normal text + incorrectly in boldface. + +.. |qemu_system| replace:: qemu-system-x86_64 +.. |qemu_system_x86| replace:: qemu-system-x86_64 +.. |I2C| replace:: I\ :sup:`2`\ C +.. 
|I2S| replace:: I\ :sup:`2`\ S
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
new file mode 100644
index 000000000..52baa0736
--- /dev/null
+++ b/docs/devel/atomics.rst
@@ -0,0 +1,533 @@
+=========================
+Atomic operations in QEMU
+=========================
+
+CPUs perform independent memory operations effectively in random order,
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest). Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking. Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization. Anything else is considerably harder, but it's
+also justified more often than one would like;
+the most performance-critical parts of QEMU in particular require
+a very low-level approach to concurrency, involving memory barriers
+and atomic operations. The semantics of concurrent memory accesses are governed
+by the C11 memory model.
+
+QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
+provide better portability and a less verbose syntax. ``qemu/atomic.h``
+provides macros that fall in three camps:
+
+- compiler barriers: ``barrier()``;
+
+- weak atomic access and manual memory barriers: ``qatomic_read()``,
+  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
+  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
+
+- sequentially consistent atomic access: everything else.
+
+In general, use of ``qemu/atomic.h`` should be wrapped with more easily
+used data structures (e.g. the lock-free singly-linked list operations
+``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
+primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of
+atomic operations and memory barriers should be limited to inter-thread
+checking of flags and documented thoroughly.
+
+
+
+Compiler memory barrier
+=======================
+
+``barrier()`` prevents the compiler from moving the memory accesses on
+either side of it to the other side. The compiler barrier has no direct
+effect on the CPU, which may then reorder things however it wishes.
+
+``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution. In this
+case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, ``barrier()`` can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+Sequentially consistent atomic access
+=====================================
+
+Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
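+
+As a minimal illustrative sketch (``Foo`` is a hypothetical type here,
+not a real QEMU API), the sequentially consistent read-modify-write
+operations listed below might be used to maintain a reference count
+shared between threads::
+
+    #include "qemu/osdep.h"
+    #include "qemu/atomic.h"
+
+    typedef struct Foo {
+        int refcount;
+    } Foo;
+
+    static void foo_ref(Foo *foo)
+    {
+        qatomic_inc(&foo->refcount);
+    }
+
+    static void foo_unref(Foo *foo)
+    {
+        /* qatomic_fetch_dec() returns the value *before* the
+         * decrement, so 1 means this was the last reference. */
+        if (qatomic_fetch_dec(&foo->refcount) == 1) {
+            g_free(foo);
+        }
+    }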
+
+``qemu/atomic.h`` provides the following set of atomic read-modify-write
+operations::
+
+  void qatomic_inc(ptr)
+  void qatomic_dec(ptr)
+  void qatomic_add(ptr, val)
+  void qatomic_sub(ptr, val)
+  void qatomic_and(ptr, val)
+  void qatomic_or(ptr, val)
+
+  typeof(*ptr) qatomic_fetch_inc(ptr)
+  typeof(*ptr) qatomic_fetch_dec(ptr)
+  typeof(*ptr) qatomic_fetch_add(ptr, val)
+  typeof(*ptr) qatomic_fetch_sub(ptr, val)
+  typeof(*ptr) qatomic_fetch_and(ptr, val)
+  typeof(*ptr) qatomic_fetch_or(ptr, val)
+  typeof(*ptr) qatomic_fetch_xor(ptr, val)
+  typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
+  typeof(*ptr) qatomic_xchg(ptr, val)
+  typeof(*ptr) qatomic_cmpxchg(ptr, old, new)
+
+where the ``typeof(*ptr)`` variants return the old value of ``*ptr`` (the
+``void`` ones return nothing). These operations are polymorphic; they
+operate on any type that is as wide as a pointer or smaller.
+
+Similar operations return the new value of ``*ptr``::
+
+  typeof(*ptr) qatomic_inc_fetch(ptr)
+  typeof(*ptr) qatomic_dec_fetch(ptr)
+  typeof(*ptr) qatomic_add_fetch(ptr, val)
+  typeof(*ptr) qatomic_sub_fetch(ptr, val)
+  typeof(*ptr) qatomic_and_fetch(ptr, val)
+  typeof(*ptr) qatomic_or_fetch(ptr, val)
+  typeof(*ptr) qatomic_xor_fetch(ptr, val)
+
+``qemu/atomic.h`` also provides loads and stores that cannot be reordered
+with each other::
+
+  typeof(*ptr) qatomic_mb_read(ptr)
+  void         qatomic_mb_set(ptr, val)
+
+However, these do not provide sequential consistency and, in particular,
+they do not participate in the total ordering enforced by
+sequentially-consistent operations. For this reason they are deprecated.
+They should instead be replaced with any of the following (ordered from
+easiest to hardest):
+
+- accesses inside a mutex or spinlock
+
+- lightweight synchronization primitives such as ``QemuEvent``
+
+- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing
+  or accessing a new version of a data structure
+
+- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for
+  loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb``
+  to forbid reordering subsequent loads before a store.
+
+
+Weak atomic access and manual memory barriers
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+The only guarantees that you can rely upon in this case are:
+
+- atomic accesses will not cause data races (and hence undefined behavior);
+  ordinary accesses instead cause data races if they are concurrent with
+  other accesses of which at least one is a write. In order to ensure this,
+  the compiler will not optimize accesses out of existence, create unsolicited
+  accesses, or perform other similar optimizations.
+
+- acquire operations will appear to happen, with respect to the other
+  components of the system, before all the LOAD or STORE operations
+  specified afterwards.
+
+- release operations will appear to happen, with respect to the other
+  components of the system, after all the LOAD or STORE operations
+  specified before.
+
+- release operations will *synchronize with* acquire operations;
+  see :ref:`acqrel` for a detailed explanation.
+
+When using this model, variables are accessed with:
+
+- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
+  optimizing accesses out of existence and creating unsolicited
+  accesses, but do not otherwise impose any ordering on loads and
+  stores: both the compiler and the processor are free to reorder
+  them.
+ +- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to + happen, with respect to the other components of the system, + before all the LOAD or STORE operations specified afterwards. + Operations coming before ``qatomic_load_acquire()`` can still be + reordered after it. + +- ``qatomic_store_release()``, which guarantees the STORE to appear to + happen, with respect to the other components of the system, + after all the LOAD or STORE operations specified before. + Operations coming after ``qatomic_store_release()`` can still be + reordered before it. + +Restrictions to the ordering of accesses can also be specified +using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, +``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``. + +Memory barriers control the order of references to shared memory. +They come in six kinds: + +- ``smp_rmb()`` guarantees that all the LOAD operations specified before + the barrier will appear to happen before all the LOAD operations + specified after the barrier with respect to the other components of + the system. + + In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not + required to have any effect on stores. + +- ``smp_wmb()`` guarantees that all the STORE operations specified before + the barrier will appear to happen before all the STORE operations + specified after the barrier with respect to the other components of + the system. + + In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not + required to have any effect on loads. + +- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before + the barrier will appear to happen before all the LOAD or STORE operations + specified after the barrier with respect to the other components of + the system. + +- ``smp_mb_release()`` guarantees that all the STORE operations specified *after* + the barrier will appear to happen after all the LOAD or STORE operations + specified *before* the barrier with respect to the other components of + the system. + +- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified + before the barrier will appear to happen before all the LOAD and + STORE operations specified after the barrier with respect to the other + components of the system. + + ``smp_mb()`` puts a partial ordering on both loads and stores. It is + stronger than both a read and a write memory barrier; it implies both + ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs + coming before the barrier from overtaking LOADs coming after the + barrier and vice versa. + +- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On + most processors, whenever two loads are performed such that the + second depends on the result of the first (e.g., the first load + retrieves the address to which the second load will be directed), + the processor will guarantee that the first LOAD will appear to happen + before the second with respect to the other components of the system. + However, this is not always true---for example, it was not true on + Alpha processors. Whenever this kind of access happens to shared + memory (that is not protected by a lock), a read barrier is needed, + and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``. + + Note that the first load really has to have a _data_ dependency and not + a control dependency. 
If the address for the second load is dependent
+  on the first load, but the dependency is through a conditional rather
+  than actually loading the address itself, then it's a _control_
+  dependency and a full read barrier or better is required.
+
+
+Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
+mostly used when a data structure has one thread that is always a writer
+and one thread that is always a reader:
+
+    +----------------------------------+----------------------------------+
+    | thread 1                         | thread 2                         |
+    +==================================+==================================+
+    | ::                               | ::                               |
+    |                                  |                                  |
+    |   qatomic_store_release(&a, x);  |   y = qatomic_load_acquire(&b);  |
+    |   qatomic_store_release(&b, y);  |   x = qatomic_load_acquire(&a);  |
+    +----------------------------------+----------------------------------+
+
+In this case, correctness is easy to check for using the "pairing"
+trick that is explained below.
+
+Sometimes, a thread is accessing many variables that are otherwise
+unrelated to each other (for example because, apart from the current
+thread, exactly one other thread will read or write each of these
+variables). In this case, it is possible to "hoist" the barriers
+outside a loop. For example:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |   n = 0;                                 |   n = 0;                         |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     n += qatomic_load_acquire(&a[i]);    |     n += qatomic_read(&a[i]);    |
+    |                                          |   smp_mb_acquire();              |
+    +------------------------------------------+----------------------------------+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |   smp_mb_release();              |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
+    +------------------------------------------+----------------------------------+
+
+Splitting a loop can also be useful to reduce the number of barriers:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |   n = 0;                                 |   smp_mb_release();              |
+    |   for (i = 0; i < 10; i++) {             |   for (i = 0; i < 10; i++)       |
+    |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
+    |     smp_mb();                            |   smp_mb();                      |
+    |     n += qatomic_read(&b[i]);            |   n = 0;                         |
+    |   }                                      |   for (i = 0; i < 10; i++)       |
+    |                                          |     n += qatomic_read(&b[i]);    |
+    +------------------------------------------+----------------------------------+
+
+In this case, an ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
+as well) ``smp_wmb()``:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |   smp_mb_release();              |
+    |   for (i = 0; i < 10; i++) {             |   for (i = 0; i < 10; i++)       |
+    |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
+    |     qatomic_store_release(&b[i], false); |   smp_wmb();                     |
+    |   }                                      |   for (i = 0; i < 10; i++)       |
+    |                                          |     qatomic_set(&b[i], false);   |
+    +------------------------------------------+----------------------------------+
+
+
+.. 
_acqrel:
+
+Acquire/release pairing and the *synchronizes-with* relation
+------------------------------------------------------------
+
+Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
+either *acquire* or *release* semantics [#rmw]_. This has two effects:
+
+.. [#rmw] Read-modify-write operations can have both---acquire applies to the
+          read part, and release to the write.
+
+- within a thread, they are ordered either before subsequent operations
+  (for acquire) or after previous operations (for release).
+
+- if a release operation in one thread *synchronizes with* an acquire operation
+  in another thread, the ordering constraints propagate from the first to the
+  second thread. That is, everything before the release operation in the
+  first thread is guaranteed to *happen before* everything after the
+  acquire operation in the second thread.
+
+The concept of acquire and release semantics is not exclusive to atomic
+operations; almost all higher-level synchronization primitives also have
+acquire or release semantics. For example:
+
+- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
+  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
+  same mutex.
+
+- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
+  ``pthread_cond_wait`` has both release semantics (synchronizing with
+  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
+  ``pthread_mutex_unlock`` and signaling of the condition variable).
+
+- ``pthread_create`` has release semantics and synchronizes with the start
+  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
+  with the exiting of the thread.
+
+- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
+  acquire semantics.
+
+For example, in the following code there are no atomic accesses, but thread 2
+still relies on the *synchronizes-with* relation between ``pthread_exit``
+(release) and ``pthread_join`` (acquire):
+
+      +----------------------+-------------------------------+
+      | thread 1             | thread 2                      |
+      +======================+===============================+
+      | ::                   | ::                            |
+      |                      |                               |
+      |   *a = 1;            |                               |
+      |   pthread_exit(a);   |   pthread_join(thread1, &a);  |
+      |                      |   x = *a;                     |
+      +----------------------+-------------------------------+
+
+Synchronization between threads basically descends from this pairing of
+a release operation and an acquire operation. Therefore, atomic operations
+other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
+paired with another operation of the opposite kind: an acquire operation
+will pair with a release operation and vice versa. This rule of thumb is
+extremely useful; in the case of QEMU, however, note that the other
+operation may actually be in a driver that runs in the guest!
+
+``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
+``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
+as acquire operations. ``smp_wmb()``, ``smp_mb_release()``,
+``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
+operations. ``smp_mb()`` counts as both acquire and release, therefore
+it can pair with any other atomic operation. 
Here is an example:
+
+      +----------------------+------------------------------+
+      | thread 1             | thread 2                     |
+      +======================+==============================+
+      | ::                   | ::                           |
+      |                      |                              |
+      |   qatomic_set(&a, 1);|                              |
+      |   smp_wmb();         |                              |
+      |   qatomic_set(&b, 2);|   x = qatomic_read(&b);      |
+      |                      |   smp_rmb();                 |
+      |                      |   y = qatomic_read(&a);      |
+      +----------------------+------------------------------+
+
+Note that a store-release/load-acquire pair only counts if the two
+operations access the same variable: that is, a store-release on a
+variable ``x`` *synchronizes with* a load-acquire on the same variable
+``x``, while a release barrier synchronizes with any acquire operation.
+The following example shows correct synchronization:
+
+      +--------------------------------+--------------------------------+
+      | thread 1                       | thread 2                       |
+      +================================+================================+
+      | ::                             | ::                             |
+      |                                |                                |
+      |   qatomic_set(&a, 1);          |                                |
+      |   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);|
+      |                                |   y = qatomic_read(&a);        |
+      +--------------------------------+--------------------------------+
+
+Acquire and release semantics of higher-level primitives can also be
+relied upon for the purpose of establishing the *synchronizes with*
+relation.
+
+Note that the "writing" thread is accessing the variables in the
+opposite order from the "reading" thread.  This is expected: stores
+before a release operation will normally match the loads after
+the acquire operation, and vice versa.  In fact, this happened already
+in the ``pthread_exit``/``pthread_join`` example above.
+
+Finally, this more complex example has more than two accesses and data
+dependency barriers.  It also does not use atomic accesses whenever there
+cannot be a data race:
+
+      +----------------------+------------------------------+
+      | thread 1             | thread 2                     |
+      +======================+==============================+
+      | ::                   | ::                           |
+      |                      |                              |
+      |   b[2] = 1;          |                              |
+      |   smp_wmb();         |                              |
+      |   x->i = 2;          |                              |
+      |   smp_wmb();         |                              |
+      |   qatomic_set(&a, x);|   x = qatomic_read(&a);      |
+      |                      |   smp_read_barrier_depends();|
+      |                      |   y = x->i;                  |
+      |                      |   smp_read_barrier_depends();|
+      |                      |   z = b[y];                  |
+      +----------------------+------------------------------+
+
+Comparison with Linux kernel primitives
+=======================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are always on a 32-bit int type and
+  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
+  and use normal C types.
+
+- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
+  at all.  Linux 4.1 updated them to implement volatile
+  semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).
+
+  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
+  semantics if the compiler supports it, and volatile semantics otherwise.
+  Both semantics prevent the compiler from doing certain transformations;
+  the difference is that atomic accesses are guaranteed to be atomic,
+  while volatile accesses aren't.  Thus, in the volatile case we just cross
+  our fingers hoping that the compiler will generate atomic accesses,
+  since we assume the variables passed are machine-word sized and
+  properly aligned.
+
+  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
+  Linux or QEMU.
+
+- atomic read-modify-write operations in Linux are of three kinds:
+
+         =====================  =========================================
+         ``atomic_OP``          returns void
+         ``atomic_OP_return``   returns new value of the variable
+         ``atomic_fetch_OP``    returns the old value of the variable
+         ``atomic_cmpxchg``     returns the old value of the variable
+         =====================  =========================================
+
+  In QEMU, the second kind is named ``qatomic_OP_fetch``.
+
+- different atomic read-modify-write operations in Linux imply
+  a different set of memory barriers; in QEMU, all of them enforce
+  sequential consistency.
+
+- in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
+  the total ordering enforced by sequentially-consistent operations.
+  This is because QEMU uses the C11 memory model.  The following example
+  is correct in Linux but not in QEMU:
+
+      +----------------------------------+--------------------------------+
+      | Linux (correct)                  | QEMU (incorrect)               |
+      +==================================+================================+
+      | ::                               | ::                             |
+      |                                  |                                |
+      |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);|
+      |   b = READ_ONCE(&y);             |   b = qatomic_read(&y);        |
+      +----------------------------------+--------------------------------+
+
+  because the read of ``y`` can be moved (by either the processor or the
+  compiler) before the write of ``x``.
+
+  Fixing this requires an ``smp_mb()`` memory barrier between the write
+  of ``x`` and the read of ``y``.  In the common case where only one thread
+  writes ``x``, it is also possible to write it like this:
+
+      +--------------------------------+
+      | QEMU (correct)                 |
+      +================================+
+      | ::                             |
+      |                                |
+      |   a = qatomic_read(&x);        |
+      |   qatomic_set(&x, a + 2);      |
+      |   smp_mb();                    |
+      |   b = qatomic_read(&y);        |
+      +--------------------------------+
+
+Sources
+=======
+
+- ``Documentation/memory-barriers.txt`` from the Linux kernel
diff --git a/docs/devel/bitops.rst b/docs/devel/bitops.rst
new file mode 100644
index 000000000..6addaecf8
--- /dev/null
+++ b/docs/devel/bitops.rst
@@ -0,0 +1,8 @@
+==================
+Bitwise operations
+==================
+
+The header ``qemu/bitops.h`` provides utility functions for
+performing bitwise operations.
+
+.. kernel-doc:: include/qemu/bitops.h
diff --git a/docs/devel/blkdebug.txt b/docs/devel/blkdebug.txt
new file mode 100644
index 000000000..0b0c128d3
--- /dev/null
+++ b/docs/devel/blkdebug.txt
@@ -0,0 +1,162 @@
+Block I/O error injection using blkdebug
+----------------------------------------
+Copyright (C) 2014-2015 Red Hat Inc
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.  See
+the COPYING file in the top-level directory.
+
+The blkdebug block driver is a rule-based error injection engine.  It can be
+used to exercise error code paths in block drivers including ENOSPC (out of
+space) and EIO.
+
+This document gives an overview of the features available in blkdebug.
+
+Background
+----------
+Block drivers have many error code paths that handle I/O errors.  Image formats
+are especially complex since metadata I/O errors during cluster allocation or
+while updating tables happen halfway through request processing and require
+discipline to keep image files consistent.
+
+Error injection allows test cases to trigger I/O errors at specific points.
+This way, all error paths can be tested to make sure they are correct.
+
+Rules
+-----
+The blkdebug block driver takes a list of "rules" that tell the error injection
+engine when to fail an I/O request.
+
+Each I/O request is evaluated against the rules.  If a rule matches the request
+then its "action" is executed.
+
+Rules can be placed in a configuration file; the configuration file
+follows the same .ini-like format used by QEMU's -readconfig option, and
+each section of the file represents a rule.
+
+The following configuration file defines a single rule:
+
+  $ cat blkdebug.conf
+  [inject-error]
+  event = "read_aio"
+  errno = "28"
+
+This rule fails all aio read requests with ENOSPC (28).  Note that the errno
+value depends on the host.  On Linux, see
+/usr/include/asm-generic/errno-base.h for errno values.
+
+Invoke QEMU as follows:
+
+  $ qemu-system-x86_64 \
+        -drive if=none,cache=none,file=blkdebug:blkdebug.conf:test.img,id=drive0 \
+        -device virtio-blk-pci,drive=drive0,id=virtio-blk-pci0
+
+Rules support the following attributes:
+
+  event - which type of operation to match (e.g. read_aio, write_aio,
+          flush_to_os, flush_to_disk).  See the "Events" section for
+          information on events.
+
+  state - (optional) the engine must be in this state number in order for this
+          rule to match.  See the "State transitions" section for information
+          on states.
+
+  errno - the numeric errno value to return when a request matches this rule.
+          The errno values depend on the host since the numeric values are not
+          standardized in the POSIX specification.
+
+  sector - (optional) a sector number that the request must overlap in order to
+           match this rule.
+
+  once - (optional, default "off") only execute this action on the first
+         matching request.
+
+  immediately - (optional, default "off") return a NULL BlockAIOCB
+                pointer and fail without an errno instead.  This
+                exercises the code path where BlockAIOCB fails and the
+                caller's BlockCompletionFunc is not invoked.
+
+Events
+------
+Block drivers provide information about the type of I/O request they are about
+to make so rules can match specific types of requests.  For example, the qcow2
+block driver tells blkdebug when it accesses the L1 table so rules can match
+only L1 table accesses and not other metadata or guest data requests.
+
+The core events are:
+
+  read_aio - guest data read
+
+  write_aio - guest data write
+
+  flush_to_os - write out unwritten block driver state (e.g. cached metadata)
+
+  flush_to_disk - flush the host block device's disk cache
+
+See qapi/block-core.json:BlkdebugEvent for the full list of events.
+You may need to grep block driver source code to understand the
+meaning of specific events.
+
+State transitions
+-----------------
+There are cases where more power is needed to match a particular I/O request in
+a longer sequence of requests.  For example:
+
+  write_aio
+  flush_to_disk
+  write_aio
+
+How do we match the 2nd write_aio but not the first?  This is where state
+transitions come in.
+
+The error injection engine has an integer called the "state" that always starts
+initialized to 1.  The state integer is internal to blkdebug and cannot be
+observed from outside but rules can interact with it for powerful matching
+behavior.
+
+Rules can be conditional on the current state and they can transition to a new
+state.
+
+When a rule's "state" attribute is non-zero then the current state must equal
+the attribute in order for the rule to match.
+ +For example, to match the 2nd write_aio: + + [set-state] + event = "write_aio" + state = "1" + new_state = "2" + + [inject-error] + event = "write_aio" + state = "2" + errno = "5" + +The first write_aio request matches the set-state rule and transitions from +state 1 to state 2. Once state 2 has been entered, the set-state rule no +longer matches since it requires state 1. But the inject-error rule now +matches the next write_aio request and injects EIO (5). + +State transition rules support the following attributes: + + event - which type of operation to match (e.g. read_aio, write_aio, + flush_to_os, flush_to_disk). See the "Events" section for + information on events. + + state - (optional) the engine must be in this state number in order for this + rule to match + + new_state - transition to this state number + +Suspend and resume +------------------ +Exercising code paths in block drivers may require specific ordering amongst +concurrent requests. The "breakpoint" feature allows requests to be halted on +a blkdebug event and resumed later. This makes it possible to achieve +deterministic ordering when multiple requests are in flight. + +Breakpoints on blkdebug events are associated with a user-defined "tag" string. +This tag serves as an identifier by which the request can be resumed at a later +point. + +See the qemu-io(1) break, resume, remove_break, and wait_break commands for +details. diff --git a/docs/devel/blkverify.txt b/docs/devel/blkverify.txt new file mode 100644 index 000000000..aca826c51 --- /dev/null +++ b/docs/devel/blkverify.txt @@ -0,0 +1,69 @@ += Block driver correctness testing with blkverify = + +== Introduction == + +This document describes how to use the blkverify protocol to test that a block +driver is operating correctly. + +It is difficult to test and debug block drivers against real guests. Often +processes inside the guest will crash because corrupt sectors were read as part +of the executable. Other times obscure errors are raised by a program inside +the guest. These issues are extremely hard to trace back to bugs in the block +driver. + +Blkverify solves this problem by catching data corruption inside QEMU the first +time bad data is read and reporting the disk sector that is corrupted. + +== How it works == + +The blkverify protocol has two child block devices, the "test" device and the +"raw" device. Read/write operations are mirrored to both devices so their +state should always be in sync. + +The "raw" device is a raw image, a flat file, that has identical starting +contents to the "test" image. The idea is that the "raw" device will handle +read/write operations correctly and not corrupt data. It can be used as a +reference for comparison against the "test" device. + +After a mirrored read operation completes, blkverify will compare the data and +raise an error if it is not identical. This makes it possible to catch the +first instance where corrupt data is read. + +== Example == + +Imagine raw.img has 0xcd repeated throughout its first sector: + + $ ./qemu-io -c 'read -v 0 512' raw.img + 00000000: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ + 00000010: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ + [...] + 000001e0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ + 000001f0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................ 
+    read 512/512 bytes at offset 0
+    512.000000 bytes, 1 ops; 0.0000 sec (97.656 MiB/sec and 200000.0000 ops/sec)
+
+And test.img is corrupt, its first sector is zeroed when it shouldn't be:
+
+    $ ./qemu-io -c 'read -v 0 512' test.img
+    00000000:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
+    00000010:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
+    [...]
+    000001e0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
+    000001f0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
+    read 512/512 bytes at offset 0
+    512.000000 bytes, 1 ops; 0.0000 sec (81.380 MiB/sec and 166666.6667 ops/sec)
+
+This error is caught by blkverify:
+
+    $ ./qemu-io -c 'read 0 512' blkverify:raw.img:test.img
+    blkverify: read sector_num=0 nb_sectors=4 contents mismatch in sector 0
+
+A more realistic scenario is verifying the installation of a guest OS:
+
+    $ ./qemu-img create raw.img 16G
+    $ ./qemu-img create -f qcow2 test.qcow2 16G
+    $ ./qemu-system-x86_64 -cdrom debian.iso \
+                           -drive file=blkverify:raw.img:test.qcow2
+
+If the installation is aborted when blkverify detects corruption, use qemu-io
+to explore the contents of the disk image at the sector in question.
diff --git a/docs/devel/block-coroutine-wrapper.rst b/docs/devel/block-coroutine-wrapper.rst
new file mode 100644
index 000000000..412851986
--- /dev/null
+++ b/docs/devel/block-coroutine-wrapper.rst
@@ -0,0 +1,54 @@
+=======================
+block-coroutine-wrapper
+=======================
+
+A lot of functions in the QEMU block layer (see ``block/*``) can only be
+called in coroutine context.  Such functions are normally marked by the
+coroutine_fn specifier.  Still, sometimes we need to call them from
+non-coroutine context; for this we need to start a coroutine, run the
+needed function from it and wait for the coroutine to finish in a
+BDRV_POLL_WHILE() loop.  To run a coroutine we need a function with one
+void* argument.  So for each coroutine_fn function that needs a
+non-coroutine interface, we should define a structure to pack the
+parameters, define a separate function to unpack the parameters and
+call the original function, and finally define a new interface function
+with the same list of arguments as the original one, which will pack the
+parameters into a struct, create a coroutine, run it and wait in a
+BDRV_POLL_WHILE() loop.  It's boring to create such wrappers by hand, so
+we have a script to generate them.  A sketch of this pattern is shown at
+the end of this document.
+
+Usage
+=====
+
+Assume we have defined the ``coroutine_fn`` function
+``bdrv_co_foo(<some args>)`` and need a non-coroutine interface for it,
+called ``bdrv_foo(<same args>)``.  In this case the script can help.  To
+trigger the generation:
+
+1. You need a ``bdrv_foo`` declaration somewhere (for example, in
+   ``block/coroutines.h``) with the ``generated_co_wrapper`` mark,
+   like this:
+
+.. code-block:: c
+
+    int generated_co_wrapper bdrv_foo(<some args>);
+
+2. You need to feed this declaration to the block-coroutine-wrapper script.
+   For this, add the .h (or .c) file with the declaration to the
+   ``input: files(...)`` list of the ``block_gen_c`` target declaration in
+   ``block/meson.build``
+
+You are done.  During the build, coroutine wrappers will be generated in
+``<BUILD_DIR>/block/block-gen.c``.
+
+Links
+=====
+
+1. The script location is ``scripts/block-coroutine-wrapper.py``.
+
+2. Generic place for private ``generated_co_wrapper`` declarations is
+   ``block/coroutines.h``; for public declarations:
+   ``include/block/block.h``
+
+3. The core API of generated coroutine wrappers is placed in
+   (not generated) ``block/block-gen.h``
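+
+To illustrate the pattern that the generated code follows, here is a
+hand-written sketch of such a wrapper for a hypothetical
+``int coroutine_fn bdrv_co_foo(BlockDriverState *bs)``.  The structure and
+function names are made up, and the real generated code in
+``<BUILD_DIR>/block/block-gen.c`` differs in details:
+
+.. code-block:: c
+
+    #include "qemu/osdep.h"
+    #include "block/block.h"
+    #include "qemu/coroutine.h"
+
+    /* the coroutine_fn from the example above, assumed declared elsewhere */
+    int coroutine_fn bdrv_co_foo(BlockDriverState *bs);
+
+    /* pack the parameters (and the result) into a structure */
+    typedef struct BdrvFoo {
+        BlockDriverState *bs;
+        int ret;
+        bool done;
+    } BdrvFoo;
+
+    /* unpack the parameters and call the original coroutine_fn */
+    static void coroutine_fn bdrv_foo_entry(void *opaque)
+    {
+        BdrvFoo *s = opaque;
+        s->ret = bdrv_co_foo(s->bs);
+        s->done = true;
+    }
+
+    /* non-coroutine interface: create the coroutine and poll until done */
+    int bdrv_foo(BlockDriverState *bs)
+    {
+        BdrvFoo s = { .bs = bs, .done = false };
+        Coroutine *co = qemu_coroutine_create(bdrv_foo_entry, &s);
+        qemu_coroutine_enter(co);
+        BDRV_POLL_WHILE(bs, !s.done);
+        return s.ret;
+    }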
diff --git a/docs/devel/build-system.rst b/docs/devel/build-system.rst
new file mode 100644
index 000000000..431caba7a
--- /dev/null
+++ b/docs/devel/build-system.rst
@@ -0,0 +1,486 @@
+==================================
+The QEMU build system architecture
+==================================
+
+This document aims to help developers understand the architecture of the
+QEMU build system.  As with projects using GNU autotools, the QEMU build
+system has two stages: first the developer runs the "configure" script
+to determine the local build environment characteristics, then they run
+"make" to build the project.  That is about where the similarities with
+GNU autotools end, so try to forget what you know about them.
+
+
+Stage 1: configure
+==================
+
+The QEMU configure script is written directly in shell, and should be
+compatible with any POSIX shell, hence it uses #!/bin/sh.  An important
+implication of this is that bash-isms must be avoided, even on
+development platforms where bash is the primary host shell.
+
+In contrast to autoconf scripts, QEMU's configure is expected to be
+silent while it is checking for features.  It will only display output
+when an error occurs, or to show the final feature enablement summary
+on completion.
+
+Because QEMU uses the Meson build system under the hood, only VPATH
+builds are supported.  There are two general ways to invoke configure &
+perform a build:
+
+ - VPATH, build artifacts outside of QEMU source tree entirely::
+
+     cd ../
+     mkdir build
+     cd build
+     ../qemu/configure
+     make
+
+ - VPATH, build artifacts in a subdir of QEMU source tree::
+
+     mkdir build
+     cd build
+     ../configure
+     make
+
+The configure script automatically recognizes
+command line options for which a same-named Meson option exists;
+dashes in the command line are replaced with underscores.
+
+Many checks on the compilation environment are still found in configure
+rather than ``meson.build``, but new checks should be added directly to
+``meson.build``.
+
+Patches are also welcome to move existing checks from the configure
+phase to ``meson.build``.  When doing so, ensure that ``meson.build`` no
+longer uses the keys that you have removed from ``config-host.mak``.
+Typically these will be replaced in ``meson.build`` by boolean variables,
+``get_option('optname')`` invocations, or ``dep.found()`` expressions.
+In general, the remaining checks have little or no interdependencies,
+so they can be moved one by one.
+
+Helper functions
+----------------
+
+The configure script provides a variety of helper functions to assist
+developers in checking for system features:
+
+``do_cc $ARGS...``
+   Attempt to run the system C compiler passing it $ARGS...
+
+``do_cxx $ARGS...``
+   Attempt to run the system C++ compiler passing it $ARGS...
+
+``compile_object $CFLAGS``
+   Attempt to compile a test program with the system C compiler using
+   $CFLAGS.  The test program must have been previously written to a file
+   called $TMPC.  The replacement in Meson is the compiler object ``cc``,
+   which has methods such as ``cc.compiles()``,
+   ``cc.check_header()``, ``cc.has_function()``.
+
+``compile_prog $CFLAGS $LDFLAGS``
+   Attempt to compile a test program with the system C compiler using
+   $CFLAGS and link it with the system linker using $LDFLAGS.  The test
+   program must have been previously written to a file called $TMPC.
+   The replacement in Meson is ``cc.find_library()`` and ``cc.links()``.
+
+``has $COMMAND``
+   Determine if $COMMAND exists in the current environment, either as a
+   shell builtin, or executable binary, returning 0 on success.  The
+   replacement in Meson is ``find_program()``.
+
+``check_define $NAME``
+   Determine if the macro $NAME is defined by the system C compiler.
+
+``check_include $NAME``
+   Determine if the include $NAME file is available to the system C
+   compiler.  The replacement in Meson is ``cc.has_header()``.
+
+``write_c_skeleton``
+   Write a minimal C program main() function to the temporary file
+   indicated by $TMPC.
+
+``feature_not_found $NAME $REMEDY``
+   Print a message to stderr that the feature $NAME was not available
+   on the system, suggesting the user try $REMEDY to address the
+   problem.
+
+``error_exit $MESSAGE $MORE...``
+   Print $MESSAGE to stderr, followed by $MORE... and then exit from the
+   configure script with non-zero status.
+
+``query_pkg_config $ARGS...``
+   Run pkg-config passing it $ARGS.  If QEMU is doing a static build,
+   then --static will be automatically added to $ARGS.
+
+
+Stage 2: Meson
+==============
+
+The Meson build system is currently used to describe the build
+process for:
+
+1) executables, which include:
+
+   - Tools - ``qemu-img``, ``qemu-nbd``, ``qga`` (guest agent), etc.
+
+   - System emulators - ``qemu-system-$ARCH``
+
+   - Userspace emulators - ``qemu-$ARCH``
+
+   - Unit tests
+
+2) documentation
+
+3) ROMs, which can be either installed as binary blobs or compiled
+
+4) other data files, such as icons or desktop files
+
+All executables are built by default, except for some ``contrib/``
+binaries that are known to fail to build on some platforms (for example
+32-bit or big-endian platforms).  Tests are also built by default,
+though that might change in the future.
+
+The source code is highly modularized, split across many files to
+facilitate building of all of these components with as little duplicated
+compilation as possible.  Using the Meson "sourceset" functionality,
+``meson.build`` files group the source files in rules that are
+enabled according to the available system libraries and to various
+configuration symbols.  Sourcesets belong to one of four groups:
+
+Subsystem sourcesets:
+  Various subsystems that are common to both tools and emulators have
+  their own sourceset, for example ``block_ss`` for the block device subsystem,
+  ``chardev_ss`` for the character device subsystem, etc.  These sourcesets
+  are then turned into static libraries as follows::
+
+    libchardev = static_library('chardev', chardev_ss.sources(),
+                                name_suffix: 'fa',
+                                build_by_default: false)
+
+    chardev = declare_dependency(link_whole: libchardev)
+
+  As of Meson 0.55.1, the special ``.fa`` suffix should be used for everything
+  that is used with ``link_whole``, to ensure that the link flags are placed
+  correctly in the command line.
+
+Target-independent emulator sourcesets:
+  Various pieces of general purpose helper code are compiled only once and
+  the .o files are linked into all output binaries that need them.
+  This includes error handling infrastructure, standard data structures,
+  platform portability wrapper functions, etc.
+
+  Target-independent code lives in the ``common_ss``, ``softmmu_ss`` and
+  ``user_ss`` sourcesets.  ``common_ss`` is linked into all emulators,
+  ``softmmu_ss`` only in system emulators, ``user_ss`` only in user-mode
+  emulators.
+
+  Target-independent sourcesets must exercise particular care when using
+  ``if_false`` rules.
The ``if_false`` rule will be used correctly when linking
+  emulator binaries; however, when *compiling* target-independent files
+  into .o files, Meson may need to pick *both* the ``if_true`` and
+  ``if_false`` sides to cater for targets that want either side.  To
+  achieve that, you can add a special rule using the ``CONFIG_ALL``
+  symbol::
+
+    # Some targets have CONFIG_ACPI, some don't, so this is not enough
+    softmmu_ss.add(when: 'CONFIG_ACPI', if_true: files('acpi.c'),
+                   if_false: files('acpi-stub.c'))
+
+    # This is required as well:
+    softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('acpi-stub.c'))
+
+Target-dependent emulator sourcesets:
+  In the target-dependent set lives CPU emulation, some device emulation and
+  much glue code.  This sometimes also has to be compiled multiple times,
+  once for each target being built.  Target-dependent files are included
+  in the ``specific_ss`` sourceset.
+
+  Each emulator also includes sources for files in the ``hw/`` and ``target/``
+  subdirectories.  The subdirectory used for each emulator comes
+  from the target's definition of ``TARGET_BASE_ARCH`` or (if missing)
+  ``TARGET_ARCH``, as found in ``default-configs/targets/*.mak``.
+
+  Each subdirectory in ``hw/`` adds one sourceset to the ``hw_arch`` dictionary,
+  for example::
+
+    arm_ss = ss.source_set()
+    arm_ss.add(files('boot.c'), fdt)
+    ...
+    hw_arch += {'arm': arm_ss}
+
+  The sourceset is only used for system emulators.
+
+  Each subdirectory in ``target/`` instead should add one sourceset to each
+  of the ``target_arch`` and ``target_softmmu_arch`` dictionaries, which are
+  used respectively for all emulators and for system emulators only.  For
+  example::
+
+    arm_ss = ss.source_set()
+    arm_softmmu_ss = ss.source_set()
+    ...
+    target_arch += {'arm': arm_ss}
+    target_softmmu_arch += {'arm': arm_softmmu_ss}
+
+Module sourcesets:
+  There are two dictionaries for modules: ``modules`` is used for
+  target-independent modules and ``target_modules`` is used for
+  target-dependent modules.  When modules are disabled the ``module``
+  source sets are added to ``softmmu_ss`` and the ``target_modules``
+  source sets are added to ``specific_ss``.
+
+  Both dictionaries are nested.  One dictionary is created per
+  subdirectory, and these per-subdirectory dictionaries are added to
+  the toplevel dictionaries.  For example::
+
+    hw_display_modules = {}
+    qxl_ss = ss.source_set()
+    ...
+    hw_display_modules += { 'qxl': qxl_ss }
+    modules += { 'hw-display': hw_display_modules }
+
+Utility sourcesets:
+  All binaries link with a static library ``libqemuutil.a``.  This library
+  is built from several sourcesets; most of them, however, host generated
+  code, and the only two of general interest are ``util_ss`` and ``stub_ss``.
+
+  The separation between these two is purely for documentation purposes.
+  ``util_ss`` contains generic utility files.  Even though this code is only
+  linked in some binaries, sometimes it requires hooks only in some of
+  these and depends on other functions that are not fully implemented by
+  all QEMU binaries.  ``stub_ss`` links dummy stubs that will only be linked
+  into the binary if the real implementation is not present.  In a way,
+  the stubs can be thought of as a portable implementation of the weak
+  symbols concept.
+
+
+The following files concur in the definition of which files are linked
+into each emulator:
+
+``default-configs/devices/*.mak``
+  The files under ``default-configs/devices/`` control the boards and devices
+  that are built into each QEMU system emulation target.
They merely contain
+  a list of config variable definitions such as::
+
+    include arm-softmmu.mak
+    CONFIG_XLNX_ZYNQMP_ARM=y
+    CONFIG_XLNX_VERSAL=y
+
+``*/Kconfig``
+  These files are processed together with ``default-configs/devices/*.mak`` and
+  describe the dependencies between various features, subsystems and
+  device models.  They are described in :ref:`kconfig`.
+
+``default-configs/targets/*.mak``
+  These files mostly define symbols that appear in the ``*-config-target.h``
+  file for each emulator [#cfgtarget]_.  However, the ``TARGET_ARCH``
+  and ``TARGET_BASE_ARCH`` will also be used to select the ``hw/`` and
+  ``target/`` subdirectories that are compiled into each target.
+
+.. [#cfgtarget] This header is included by ``qemu/osdep.h`` when
+                compiling files from the target-specific sourcesets.
+
+These files rarely need changing unless you are adding a completely
+new target, or enabling new devices or hardware for a particular
+system/userspace emulation target.
+
+
+Adding checks
+-------------
+
+New checks should be added to Meson.  Compiler checks can be as simple as
+the following::
+
+  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
+
+A more complex task such as adding a new dependency usually
+comprises the following steps:
+
+  - Add a Meson build option to meson_options.txt.
+
+  - Add code to perform the actual feature check.
+
+  - Add code to include the feature status in ``config-host.h``.
+
+  - Add code to print out the feature status in the configure summary
+    upon completion.
+
+Taking the probe for SDL2_Image as an example, we have the following
+in ``meson_options.txt``::
+
+  option('sdl_image', type : 'feature', value : 'auto',
+         description: 'SDL Image support for icons')
+
+Unless the option was given a non-``auto`` value (on the configure
+command line), the detection code must run only if the
+dependency will be used::
+
+  sdl_image = not_found
+  if not get_option('sdl_image').auto() or have_system
+    sdl_image = dependency('SDL2_image', required: get_option('sdl_image'),
+                           method: 'pkg-config',
+                           static: enable_static)
+  endif
+
+This avoids warnings on static builds of user-mode emulators, for example.
+Most of the libraries used by system-mode emulators are not available for
+static linking.
+
+The other supporting code is generally simple::
+
+  # Create config-host.h (if applicable)
+  config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
+
+  # Summary
+  summary_info += {'SDL image support': sdl_image.found()}
+
+For the configure script to parse the new option, the
+``scripts/meson-buildoptions.sh`` file must be up-to-date; ``make
+update-buildoptions`` (or just ``make``) will take care of updating it.
+
+
+Support scripts
+---------------
+
+Meson has a special convention for invoking Python scripts: if their
+first line is ``#! /usr/bin/env python3`` and the file is *not* executable,
+find_program() arranges to invoke the script under the same Python
+interpreter that was used to invoke Meson.  This is the most common
+and preferred way to invoke support scripts from Meson build files,
+because it automatically uses the value of configure's --python= option.
+
+In case the script is not written in Python, use a ``#! /usr/bin/env ...``
+line and make the script executable.
+
+Scripts written in Python, where it is desirable to make the script
+executable (for example for test scripts that developers may want to
+invoke from the command line, such as tests/qapi-schema/test-qapi.py),
+should be invoked through the ``python`` variable in meson.build.  For
+example::
+
+  test('QAPI schema regression tests', python,
+       args: files('test-qapi.py'),
+       env: test_env, suite: ['qapi-schema', 'qapi-frontend'])
+
+This is needed to obey the --python= option passed to the configure
+script, which may point to something other than the first python3
+binary on the path.
+
+
+Stage 3: makefiles
+==================
+
+The use of GNU make is required with the QEMU build system.
+
+The output of Meson is a build.ninja file, which is used with the Ninja
+build system.  QEMU uses a different approach, where Makefile rules are
+synthesized from the build.ninja file.  The main Makefile includes these
+rules and wraps them so that e.g. submodules are built before QEMU.
+The resulting build system is largely non-recursive in nature, in
+contrast to common practices seen with automake.
+
+Tests are also run by the Makefile with the traditional ``make check``
+phony target, while benchmarks are run with ``make bench``.  Meson test
+suites such as ``unit`` can be run with ``make check-unit`` too.  It is also
+possible to run tests defined in meson.build with ``meson test``.
+
+Useful make targets
+-------------------
+
+``help``
+  Print a help message for the most common build targets.
+
+``print-VAR``
+  Print the value of the variable VAR.  Useful for debugging the build
+  system.
+
+Important files for the build system
+====================================
+
+Statically defined files
+------------------------
+
+The following key files are statically defined in the source tree, with
+the rules needed to build QEMU.  Their behaviour is influenced by a
+number of dynamically created files listed later.
+
+``Makefile``
+  The main entry point used when invoking make to build all the components
+  of QEMU.  The default 'all' target will naturally result in the build of
+  every component.  Makefile takes care of recursively building submodules
+  directly via a non-recursive set of rules.
+
+``*/meson.build``
+  The meson.build file in the root directory is the main entry point for the
+  Meson build system, and it coordinates the configuration and build of all
+  executables.  Build rules for various subdirectories are included in
+  other meson.build files spread throughout the QEMU source tree.
+
+``tests/Makefile.include``
+  Rules for external test harnesses.  These include the TCG tests,
+  ``qemu-iotests`` and the Avocado-based integration tests.
+
+``tests/docker/Makefile.include``
+  Rules for Docker tests.  Like tests/Makefile, this file is included
+  directly by the top level Makefile; anything defined in this file will
+  influence the entire build system.
+
+``tests/vm/Makefile.include``
+  Rules for VM-based tests.  Like tests/Makefile, this file is included
+  directly by the top level Makefile; anything defined in this file will
+  influence the entire build system.
+
+Dynamically created files
+-------------------------
+
+The following files are generated dynamically by configure in order to
+control the behaviour of the statically defined makefiles.  This avoids
+the need for QEMU makefiles to go through any pre-processing as seen
+with autotools, where Makefile.am generates Makefile.in which generates
+Makefile.
+
+Built by configure:
+
+``config-host.mak``
+  When configure has determined the characteristics of the build host it
+  will write a long list of variables to the config-host.mak file.  This
+  provides the various install directories, compiler / linker flags and a
+  variety of ``CONFIG_*`` variables related to optionally enabled features.
+  This is imported by the top level Makefile and meson.build in order to
+  tailor the build output.
+
+  config-host.mak is also used as a dependency checking mechanism.  If make
+  sees that the modification timestamp on configure is newer than that on
+  config-host.mak, then configure will be re-run.
+
+  The variables defined here are those which are applicable to all QEMU
+  build outputs.  Variables which are potentially different for each
+  emulator target are defined by the next file...
+
+
+Built by Meson:
+
+``${TARGET-NAME}-config-devices.mak``
+  TARGET-NAME is again the name of a system or userspace emulator.  The
+  config-devices.mak file is automatically generated by make using the
+  scripts/make_device_config.sh program, feeding it the
+  default-configs/$TARGET-NAME file as input.
+
+``config-host.h``, ``$TARGET_NAME-config-target.h``, ``$TARGET_NAME-config-devices.h``
+  These files are used by source code to determine what features are
+  enabled.  They are generated from the contents of the corresponding
+  ``*.mak`` files using Meson's ``configure_file()`` function.
+
+``build.ninja``
+  The build rules.
+
+
+Built by Makefile:
+
+``Makefile.ninja``
+  A Makefile include that bridges to ninja for the actual build.  The
+  Makefile is mostly a list of targets that Meson included in build.ninja.
+
+``Makefile.mtest``
+  The Makefile definitions that let "make check" run tests defined in
+  meson.build.  The rules are produced from Meson's JSON description of
+  tests (obtained with "meson introspect --tests") through the script
+  scripts/mtest2make.py.
diff --git a/docs/devel/ci-definitions.rst.inc b/docs/devel/ci-definitions.rst.inc
new file mode 100644
index 000000000..6d5c6fd9f
--- /dev/null
+++ b/docs/devel/ci-definitions.rst.inc
@@ -0,0 +1,121 @@
+Definition of terms
+===================
+
+This section defines the terms used in this document and correlates them with
+what is currently used on QEMU.
+
+Automated tests
+---------------
+
+An automated test is written with a test framework, using its generic test
+functions/classes.  The test framework can run the tests and report their
+success or failure [1]_.
+
+An automated test has essentially three parts:
+
+1. The test initialization of the parameters, where the expected parameters,
+   like inputs and expected results, are set up;
+2. The call to the code that should be tested;
+3. An assertion, comparing the result from the previous call with the expected
+   result set during the initialization of the parameters.  If the result
+   matches the expected result, the test has been successful; otherwise, it has
+   failed.
+
+Unit testing
+------------
+
+A unit test is responsible for exercising individual software components as a
+unit, like interfaces, data structures, and functionality, uncovering errors
+within the boundaries of a component.  The verification effort is in the
+smallest software unit and focuses on the internal processing logic and data
+structures.  A test case of unit tests should be designed to uncover errors due
+to erroneous computations, incorrect comparisons, or improper control flow [2]_.
+
+On QEMU, unit testing is represented by the 'check-unit' target from 'make'.
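+
+As an illustration of the three parts above, here is a minimal sketch of a
+unit test written with the GLib testing framework used by QEMU's unit tests
+(the ``my_add`` function and the test path are hypothetical):
+
+.. code-block:: c
+
+    #include <glib.h>
+
+    /* hypothetical function under test */
+    static int my_add(int a, int b)
+    {
+        return a + b;
+    }
+
+    static void test_my_add(void)
+    {
+        /* 1. initialization: inputs and expected result */
+        int a = 2, b = 3, expected = 5;
+
+        /* 2. call to the code being tested */
+        int result = my_add(a, b);
+
+        /* 3. assertion comparing the result with the expected value */
+        g_assert_cmpint(result, ==, expected);
+    }
+
+    int main(int argc, char **argv)
+    {
+        g_test_init(&argc, &argv, NULL);
+        g_test_add_func("/example/my-add", test_my_add);
+        return g_test_run();
+    }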
+
+Functional testing
+------------------
+
+A functional test focuses on the functional requirements of the software.
+Deriving sets of input conditions, the functional tests should fully exercise
+all the functional requirements for a program.  Functional testing is
+complementary to other testing techniques, attempting to find errors like
+incorrect or missing functions, interface errors, behavior errors, and
+initialization and termination errors [3]_.
+
+On QEMU, functional testing is represented by the 'check-qtest' target from
+'make'.
+
+System testing
+--------------
+
+System tests ensure all application elements mesh properly while the overall
+functionality and performance are achieved [4]_.  Some or all system components
+are integrated to create a complete system to be tested as a whole.  System
+testing ensures that components are compatible, interact correctly, and
+transfer the right data at the right time across their interfaces.  As system
+testing focuses on interactions, use case-based testing is a practical approach
+to system testing [5]_.  Note that, in some cases, system testing may require
+interaction with third-party software, like operating system images, databases,
+networks, and so on.
+
+On QEMU, system testing is represented by the 'check-avocado' target from
+'make'.
+
+Flaky tests
+-----------
+
+A flaky test is defined as a test that exhibits both a passing and a failing
+result with the same code on different runs.  Some usual reasons for an
+intermittent/flaky test are async wait, concurrency, and test order dependency
+[6]_.
+
+Gating
+------
+
+A gate restricts the movement of code from one stage to another in a
+test/deployment pipeline.  The move is granted with approval.  The approval
+can be a manual intervention or a set of tests succeeding [7]_.
+
+On QEMU, the gating process happens during the pull request.  The approval is
+done by the project leader running their own set of tests.  The pull request
+gets merged when the tests succeed.
+
+Continuous Integration (CI)
+---------------------------
+
+Continuous integration (CI) requires the builds of the entire application and
+the execution of a comprehensive set of automated tests every time there is a
+need to commit any set of changes [8]_.  The automated tests can be composed of
+the unit, functional, system, and other tests.
+
+Key points about continuous integration (CI) [9]_:
+
+1. System tests may depend on external software (operating system images,
+   firmware, database, network).
+2. It may take a long time to build and test.  It may be impractical to build
+   the system being developed several times per day.
+3. If the development platform is different from the target platform, it may
+   not be possible to run system tests in the developer’s private workspace.
+   There may be differences in hardware, operating system, or installed
+   software.  Therefore, more time is required for testing the system.
+
+References
+----------
+
+.. [1] Sommerville, Ian (2016). Software Engineering. p. 233.
+.. [2] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering,
+       A Practitioner’s Approach. p. 48, 376, 378, 381.
+.. [3] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering,
+       A Practitioner’s Approach. p. 388.
+.. [4] Pressman, Roger S. & Maxim, Bruce R. (2020). Software Engineering,
+       A Practitioner’s Approach. p. 377.
+.. [5] Sommerville, Ian (2016). Software Engineering. p. 59, 232, 240.
+.. [6] Luo, Qingzhou, et al. An empirical analysis of flaky tests.
+       Proceedings of the 22nd ACM SIGSOFT International Symposium on
+       Foundations of Software Engineering. 2014.
+.. [7] Humble, Jez & Farley, David (2010). Continuous Delivery:
+       Reliable Software Releases Through Build, Test, and Deployment, p. 122.
+.. [8] Humble, Jez & Farley, David (2010). Continuous Delivery:
+       Reliable Software Releases Through Build, Test, and Deployment, p. 55.
+.. [9] Sommerville, Ian (2016). Software Engineering. p. 743.
diff --git a/docs/devel/ci-jobs.rst.inc b/docs/devel/ci-jobs.rst.inc
new file mode 100644
index 000000000..db3f571d5
--- /dev/null
+++ b/docs/devel/ci-jobs.rst.inc
@@ -0,0 +1,58 @@
+Custom CI/CD variables
+======================
+
+QEMU CI pipelines can be tuned by setting some CI environment variables.
+
+Set variable globally in the user's CI namespace
+------------------------------------------------
+
+Variables can be set globally in the user's CI namespace setting.
+
+For further information about how to set these variables, please refer to::
+
+  https://docs.gitlab.com/ee/ci/variables/#add-a-cicd-variable-to-a-project
+
+Set variable manually when pushing a branch or tag to the user's repository
+----------------------------------------------------------------------------
+
+Variables can be set manually when pushing a branch or tag, using
+git-push command line arguments.
+
+Example setting the QEMU_CI_EXAMPLE_VAR variable:
+
+.. code::
+
+   git push -o ci.variable="QEMU_CI_EXAMPLE_VAR=value" myrepo mybranch
+
+For further information about how to set these variables, please refer to::
+
+  https://docs.gitlab.com/ee/user/project/push_options.html#push-options-for-gitlab-cicd
+
+Here is a list of the most used variables:
+
+QEMU_CI_AVOCADO_TESTING
+~~~~~~~~~~~~~~~~~~~~~~~
+By default, tests using the Avocado framework are not run automatically in
+the pipelines (because multiple artifacts have to be downloaded, and if
+these artifacts are not already cached, downloading them makes the jobs
+reach the timeout limit).  Set this variable to have the tests using the
+Avocado framework run automatically.
+
+AARCH64_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an aarch64 host that can be used as a gitlab-CI
+runner, you can set this variable to enable the tests that require this
+kind of host.  The runner should be tagged with "aarch64".
+
+S390X_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an IBM Z host that can be used as a gitlab-CI
+runner, you can set this variable to enable the tests that require this
+kind of host.  The runner should be tagged with "s390x".
+
+CENTOS_STREAM_8_x86_64_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to a CentOS Stream 8 x86_64 host that can be
+used as a gitlab-CI runner, you can set this variable to enable the
+tests that require this kind of host.  The runner should be tagged with
+both "centos_stream_8" and "x86_64".
diff --git a/docs/devel/ci-runners.rst.inc b/docs/devel/ci-runners.rst.inc
new file mode 100644
index 000000000..7817001fb
--- /dev/null
+++ b/docs/devel/ci-runners.rst.inc
@@ -0,0 +1,117 @@
+Jobs on Custom Runners
+======================
+
+Besides the jobs run under the various CI systems listed before, there
+are a number of additional jobs that will run before an actual merge.
+These use the same GitLab CI service/framework already used for all
+other GitLab based CI jobs, but rely on additional systems, not the
+ones provided by GitLab as "shared runners".
+
+The architecture of GitLab's CI service allows different machines to
+be set up with GitLab's "agent", called gitlab-runner, which will take
+care of running jobs created by events such as a push to a branch.
+Here, the combination of a machine, properly configured with GitLab's
+gitlab-runner, is called a "custom runner".
+
+The GitLab CI job definitions for the custom runners are located under::
+
+  .gitlab-ci.d/custom-runners.yml
+
+Custom runners entail custom machines.  To see a list of the machines
+currently deployed in the QEMU GitLab CI and their maintainers, please
+refer to the QEMU `wiki <https://wiki.qemu.org/AdminContacts>`__.
+
+Machine Setup Howto
+-------------------
+
+For all Linux based systems, the setup can be mostly automated by the
+execution of two Ansible playbooks.  Create an ``inventory`` file
+under ``scripts/ci/setup``, such as this::
+
+  fully.qualified.domain
+  other.machine.hostname
+
+You may need to set some variables in the inventory file itself.  One
+very common need is to tell Ansible to use a Python 3 interpreter on
+those hosts.  This would look like::
+
+  fully.qualified.domain ansible_python_interpreter=/usr/bin/python3
+  other.machine.hostname ansible_python_interpreter=/usr/bin/python3
+
+Build environment
+~~~~~~~~~~~~~~~~~
+
+The ``scripts/ci/setup/build-environment.yml`` Ansible playbook will
+set up machines with the environment needed to perform builds and run
+QEMU tests.  This playbook consists of the installation of various
+required packages (and a general package update while at it).  It
+currently covers a number of different Linux distributions, but it can
+be expanded to cover other systems.
+
+The minimum required version of Ansible successfully tested with this
+playbook is 2.8.0 (a version check is embedded within the playbook
+itself).  To run the playbook, execute::
+
+  cd scripts/ci/setup
+  ansible-playbook -i inventory build-environment.yml
+
+Please note that most of the tasks in the playbook require superuser
+privileges, such as those from the ``root`` account or those obtained
+by ``sudo``.  If necessary, please refer to ``ansible-playbook``
+options such as ``--become``, ``--become-method``, ``--become-user``
+and ``--ask-become-pass``.
+
+gitlab-runner setup and registration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The gitlab-runner agent needs to be installed on each machine that
+will run jobs.  The association between a machine and a GitLab project
+happens with a registration token.  To find the registration token for
+your repository/project, navigate on GitLab's web UI to:
+
+ * Settings (the gears-like icon at the bottom of the left hand side
+   vertical toolbar), then
+ * CI/CD, then
+ * Runners, and click on the "Expand" button, then
+ * Under "Set up a specific Runner manually", look for the value under
+   "And this registration token:"
+
+Copy the ``scripts/ci/setup/vars.yml.template`` file to
+``scripts/ci/setup/vars.yml``.  Then, set the
+``gitlab_runner_registration_token`` variable to the value obtained
+earlier.
+
+To run the playbook, execute::
+
+  cd scripts/ci/setup
+  ansible-playbook -i inventory gitlab-runner.yml
+
+Following the registration, it's necessary to configure the runner tags,
+and optionally other configurations, on the GitLab UI.
Navigate to: + + * Settings (the gears like icon), then + * CI/CD, then + * Runners, and click on the "Expand" button, then + * "Runners activated for this project", then + * Click on the "Edit" icon (next to the "Lock" Icon) + +Tags are very important as they are used to route specific jobs to +specific types of runners, so it's a good idea to double check that +the automatically created tags are consistent with the OS and +architecture. For instance, an Ubuntu 20.04 aarch64 system should +have tags set as:: + + ubuntu_20.04,aarch64 + +Because the job definition at ``.gitlab-ci.d/custom-runners.yml`` +would contain:: + + ubuntu-20.04-aarch64-all: + tags: + - ubuntu_20.04 + - aarch64 + +It's also recommended to: + + * increase the "Maximum job timeout" to something like ``2h`` + * give it a better Description diff --git a/docs/devel/ci.rst b/docs/devel/ci.rst new file mode 100644 index 000000000..d10661009 --- /dev/null +++ b/docs/devel/ci.rst @@ -0,0 +1,13 @@ +== +CI +== + +QEMU has configurations enabled for a number of different CI services. +The most up to date information about them and their status can be +found at:: + + https://wiki.qemu.org/Testing/CI + +.. include:: ci-definitions.rst.inc +.. include:: ci-jobs.rst.inc +.. include:: ci-runners.rst.inc diff --git a/docs/devel/clocks.rst b/docs/devel/clocks.rst new file mode 100644 index 000000000..675fbeb6a --- /dev/null +++ b/docs/devel/clocks.rst @@ -0,0 +1,528 @@ +Modelling a clock tree in QEMU +============================== + +What are clocks? +---------------- + +Clocks are QOM objects developed for the purpose of modelling the +distribution of clocks in QEMU. + +They allow us to model the clock distribution of a platform and detect +configuration errors in the clock tree such as badly configured PLL, clock +source selection or disabled clock. + +The object is *Clock* and its QOM name is ``clock`` (in C code, the macro +``TYPE_CLOCK``). + +Clocks are typically used with devices where they are used to model inputs +and outputs. They are created in a similar way to GPIOs. Inputs and outputs +of different devices can be connected together. + +In these cases a Clock object is a child of a Device object, but this +is not a requirement. Clocks can be independent of devices. For +example it is possible to create a clock outside of any device to +model the main clock source of a machine. + +Here is an example of clocks:: + + +---------+ +----------------------+ +--------------+ + | Clock 1 | | Device B | | Device C | + | | | +-------+ +-------+ | | +-------+ | + | |>>-+-->>|Clock 2| |Clock 3|>>--->>|Clock 6| | + +---------+ | | | (in) | | (out) | | | | (in) | | + | | +-------+ +-------+ | | +-------+ | + | | +-------+ | +--------------+ + | | |Clock 4|>> + | | | (out) | | +--------------+ + | | +-------+ | | Device D | + | | +-------+ | | +-------+ | + | | |Clock 5|>>--->>|Clock 7| | + | | | (out) | | | | (in) | | + | | +-------+ | | +-------+ | + | +----------------------+ | | + | | +-------+ | + +----------------------------->>|Clock 8| | + | | (in) | | + | +-------+ | + +--------------+ + +Clocks are defined in the ``include/hw/clock.h`` header and device +related functions are defined in the ``include/hw/qdev-clock.h`` +header. + +The clock state +--------------- + +The state of a clock is its period; it is stored as an integer +representing it in units of 2 :sup:`-32` ns. The special value of 0 is used to +represent the clock being inactive or gated. 
The clocks do not model
+the signal itself (pin toggling) or other properties such as the duty
+cycle.
+
+All clocks contain this state: outputs as well as inputs.  This allows
+the current period of a clock to be fetched at any time.  When a clock
+is updated, the value is immediately propagated to all connected
+clocks in the tree.
+
+To ease interaction with clocks, helpers with a unit suffix are defined for
+every clock state setter or getter.  The suffixes are:
+
+- ``_ns`` for handling periods in nanoseconds
+- ``_hz`` for handling frequencies in hertz
+
+The 0 period value is converted to 0 in hertz and vice versa.  0 always means
+that the clock is disabled.
+
+Adding a new clock
+------------------
+
+Adding clocks to a device must be done during the init method of the Device
+instance.
+
+To add an input clock to a device, the function ``qdev_init_clock_in()``
+must be used.  It takes the name, a callback, an opaque parameter
+for the callback and a mask of events when the callback should be
+called (this will be explained in a following section).
+Output is simpler; only the name is required.  Typically::
+
+  qdev_init_clock_in(DEVICE(dev), "clk_in", clk_in_callback, dev, ClockUpdate);
+  qdev_init_clock_out(DEVICE(dev), "clk_out");
+
+Both functions return the created Clock pointer, which should be saved in the
+device's state structure for further use.
+
+These objects will be automatically deleted by the QOM reference mechanism.
+
+Note that it is possible to create a static array describing clock inputs and
+outputs.  The function ``qdev_init_clocks()`` must be called with the array as
+parameter to initialize the clocks: it has the same behaviour as calling
+``qdev_init_clock_in/out()`` for each clock in the array.  To ease the array
+construction, some macros are defined in ``include/hw/qdev-clock.h``.
+As an example, the following adds 2 clocks to a device: one input and one
+output.
+
+.. code-block:: c
+
+    /* device structure containing pointers to the clock objects */
+    typedef struct MyDeviceState {
+        DeviceState parent_obj;
+        Clock *clk_in;
+        Clock *clk_out;
+    } MyDeviceState;
+
+    /*
+     * callback for the input clock (see "Callback on input clock
+     * change" section below for more information).
+     */
+    static void clk_in_callback(void *opaque, ClockEvent event);
+
+    /*
+     * static array describing clocks:
+     * + a clock input named "clk_in", whose pointer is stored in
+     *   the clk_in field of a MyDeviceState structure with callback
+     *   clk_in_callback.
+     * + a clock output named "clk_out" whose pointer is stored in
+     *   the clk_out field of a MyDeviceState structure.
+     */
+    static const ClockPortInitArray mydev_clocks = {
+        QDEV_CLOCK_IN(MyDeviceState, clk_in, clk_in_callback, ClockUpdate),
+        QDEV_CLOCK_OUT(MyDeviceState, clk_out),
+        QDEV_CLOCK_END
+    };
+
+    /* device initialization function */
+    static void mydev_init(Object *obj)
+    {
+        /* cast to MyDeviceState */
+        MyDeviceState *mydev = MYDEVICE(obj);
+        /* create and fill the pointer fields in the MyDeviceState */
+        qdev_init_clocks(mydev, mydev_clocks);
+        [...]
+    }
+
+An alternative way to create a clock is to simply call
+``object_new(TYPE_CLOCK)``.  In that case the clock will neither be an
+input nor an output of a device.  After the whole QOM hierarchy of the
+clock has been set, ``clock_setup_canonical_path()`` should be called.
+
+At creation, the period of the clock is 0: the clock is disabled.  You can
+change it using ``clock_set_ns()`` or ``clock_set_hz()``.
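+
+For instance, here is a minimal sketch of a machine creating such a
+fixed-frequency clock source with ``clock_new()`` (the ``machine`` parent
+object, the "SYSCLK" name and the 24 MHz value are only examples):
+
+.. code-block:: c
+
+    /* create a clock outside of any device and make it tick at 24 MHz */
+    Clock *sysclk = clock_new(OBJECT(machine), "SYSCLK");
+    clock_set_hz(sysclk, 24 * 1000 * 1000);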
+
+Note that if you are creating a clock with a fixed period which will never
+change (for example the main clock source of a board), then you'll have
+nothing else to do.  This value will be propagated to other clocks when
+connecting the clocks together and devices will fetch the right value during
+the first reset.
+
+Clock callbacks
+---------------
+
+You can give a clock a callback function in several ways:
+
+ * by passing it as an argument to ``qdev_init_clock_in()``
+ * as an argument to the ``QDEV_CLOCK_IN()`` macro initializing an
+   array to be passed to ``qdev_init_clocks()``
+ * by directly calling the ``clock_set_callback()`` function
+
+The callback function must be of this type:
+
+.. code-block:: c
+
+   typedef void ClockCallback(void *opaque, ClockEvent event);
+
+The ``opaque`` argument is the pointer passed to ``qdev_init_clock_in()``
+or ``clock_set_callback()``; for ``qdev_init_clocks()`` it is the
+``dev`` device pointer.
+
+The ``event`` argument specifies why the callback has been called.
+When you register the callback you specify a mask of ClockEvent values
+that you are interested in.  The callback will only be called for those
+events.
+
+The events currently supported are:
+
+ * ``ClockPreUpdate`` : called when the input clock's period is about to
+   update.  This is useful if the device needs to do some action for
+   which it needs to know the old value of the clock period.  During
+   this callback, Clock API functions like ``clock_get()`` or
+   ``clock_ticks_to_ns()`` will use the old period.
+ * ``ClockUpdate`` : called after the input clock's period has changed.
+   During this callback, Clock API functions like ``clock_ticks_to_ns()``
+   will use the new period.
+
+Note that a clock only has one callback: it is not possible to register
+different functions for different events.  You must register a single
+callback which listens for all of the events you are interested in,
+and use the ``event`` argument to identify which event has happened.
+
+Retrieving clocks from a device
+-------------------------------
+
+``qdev_get_clock_in()`` and ``qdev_get_clock_out()`` are available to
+get the clock inputs or outputs of a device.  For example:
+
+.. code-block:: c
+
+   Clock *clk = qdev_get_clock_in(DEVICE(mydev), "clk_in");
+
+or:
+
+.. code-block:: c
+
+   Clock *clk = qdev_get_clock_out(DEVICE(mydev), "clk_out");
+
+Connecting two clocks together
+------------------------------
+
+To connect two clocks together, use the ``clock_set_source()`` function.
+Given two clocks ``clk1`` and ``clk2``, ``clock_set_source(clk2, clk1);``
+configures ``clk2`` to follow the ``clk1`` period changes.  Every time ``clk1``
+is updated, ``clk2`` will be updated too.
+
+When connecting clocks between devices, prefer using the
+``qdev_connect_clock_in()`` function to set the source of an input
+device clock.  For example, to connect the input clock ``clk2`` of
+``devB`` to the output clock ``clk1`` of ``devA``, do:
+
+.. code-block:: c
+
+    qdev_connect_clock_in(devB, "clk2", qdev_get_clock_out(devA, "clk1"))
+
+We used ``qdev_get_clock_out()`` above, but any clock can drive an
+input clock, even another input clock.  The following diagram shows
+some examples of connections.  Note also that a clock can drive several
+other clocks.
+ +:: + + +------------+ +--------------------------------------------------+ + | Device A | | Device B | + | | | +---------------------+ | + | | | | Device C | | + | +-------+ | | +-------+ | +-------+ +-------+ | +-------+ | + | |Clock 1|>>-->>|Clock 2|>>+-->>|Clock 3| |Clock 5|>>>>|Clock 6|>> + | | (out) | | | | (in) | | | | (in) | | (out) | | | (out) | | + | +-------+ | | +-------+ | | +-------+ +-------+ | +-------+ | + +------------+ | | +---------------------+ | + | | | + | | +--------------+ | + | | | Device D | | + | | | +-------+ | | + | +-->>|Clock 4| | | + | | | (in) | | | + | | +-------+ | | + | +--------------+ | + +--------------------------------------------------+ + +In the above example, when *Clock 1* is updated by *Device A*, three +clocks get the new clock period value: *Clock 2*, *Clock 3* and *Clock 4*. + +It is not possible to disconnect a clock or to change the clock connection +after it is connected. + +Clock multiplier and divider settings +------------------------------------- + +By default, when clocks are connected together, the child +clocks run with the same period as their source (parent) clock. +The Clock API supports a built-in period multiplier/divider +mechanism so you can configure a clock to make its children +run at a different period from its own. If you call the +``clock_set_mul_div()`` function you can specify the clock's +multiplier and divider values. The children of that clock +will all run with a period of ``parent_period * multiplier / divider``. +For instance, if the clock has a frequency of 8MHz and you set its +multiplier to 2 and its divider to 3, the child clocks will run +at 12MHz. + +You can change the multiplier and divider of a clock at runtime, +so you can use this to model clock controller devices which +have guest-programmable frequency multipliers or dividers. + +Note that ``clock_set_mul_div()`` does not automatically call +``clock_propagate()``. If you make a runtime change to the +multiplier or divider you must call clock_propagate() yourself. + +Unconnected input clocks +------------------------ + +A newly created input clock is disabled (period of 0). This means the +clock will be considered as disabled until the period is updated. If +the clock remains unconnected it will always keep its initial value +of 0. If this is not the desired behaviour, ``clock_set()``, +``clock_set_ns()`` or ``clock_set_hz()`` should be called on the Clock +object during device instance init. For example: + +.. code-block:: c + + clk = qdev_init_clock_in(DEVICE(dev), "clk-in", clk_in_callback, + dev, ClockUpdate); + /* set initial value to 10ns / 100MHz */ + clock_set_ns(clk, 10); + +To enforce that the clock is wired up by the board code, you can +call ``clock_has_source()`` in your device's realize method: + +.. code-block:: c + + if (!clock_has_source(s->clk)) { + error_setg(errp, "MyDevice: clk input must be connected"); + return; + } + +Note that this only checks that the clock has been wired up; it is +still possible that the output clock connected to it is disabled +or has not yet been configured, in which case the period will be +zero. You should use the clock callback to find out when the clock +period changes. + +Fetching clock frequency/period +------------------------------- + +To get the current state of a clock, use the functions ``clock_get()`` +or ``clock_get_hz()``. + +``clock_get()`` returns the period of the clock in its fully precise +internal representation, as an unsigned 64-bit integer in units of +2^-32 nanoseconds. 
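+As a small illustration of what that representation means (our own
+arithmetic, not a Clock API function):
+
+.. code-block:: c
+
+    uint64_t raw = clock_get(clk);  /* period, in units of 2^-32 ns */
+    uint64_t ns  = raw >> 32;       /* whole nanoseconds, truncated */
+    /* a 100 MHz clock has a 10 ns period, so raw would be 10ull << 32 */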
+For many purposes ``clock_ticks_to_ns()`` will be more convenient;
+see the section below on expiry deadlines.
+
+``clock_get_hz()`` returns the frequency of the clock, rounded down to
+the nearest integer. This implies some inaccuracy due to the rounding,
+so be cautious about using it in calculations.
+
+It is also possible to register a callback on clock frequency changes.
+Here is an example, which assumes that ``clock_callback`` has been
+specified as the callback for the ``ClockUpdate`` event:
+
+.. code-block:: c
+
+    void clock_callback(void *opaque, ClockEvent event) {
+        MyDeviceState *s = (MyDeviceState *) opaque;
+        /*
+         * 'opaque' is the argument passed to qdev_init_clock_in();
+         * usually this will be the device state pointer.
+         */
+
+        /* do something with the new period */
+        fprintf(stdout, "device new period is %" PRIu64 " * 2^-32 ns\n",
+                clock_get(s->my_clk_input));
+    }
+
+If you are only interested in the frequency for displaying it to
+humans (for instance in debugging), use ``clock_display_freq()``,
+which returns a prettified string representation, e.g. "33.3 MHz".
+The caller must free the string with ``g_free()`` after use.
+
+Calculating expiry deadlines
+----------------------------
+
+A commonly required operation for a clock is to calculate how long
+it will take for the clock to tick N times; this can then be used
+to set a timer expiry deadline. Use the function ``clock_ticks_to_ns()``,
+which takes an unsigned 64-bit count of ticks and returns the length
+of time in nanoseconds required for the clock to tick that many times.
+
+It is important not to try to calculate expiry deadlines using a
+shortcut like multiplying a "period of clock in nanoseconds" value
+by the tick count, because clocks can have periods which are not a
+whole number of nanoseconds, and the accumulated error in the
+multiplication can be significant.
+
+For a clock with a very long period and a large number of ticks,
+the result of this function could in theory be too large to fit in
+a 64-bit value. To avoid overflow in this case, ``clock_ticks_to_ns()``
+saturates the result to ``INT64_MAX`` (because this is the largest valid
+input to the QEMUTimer APIs). Since ``INT64_MAX`` nanoseconds is almost
+300 years, anything with an expiry later than that is in the "will
+never happen" category. Callers of ``clock_ticks_to_ns()`` should
+therefore generally not special-case the possibility of a saturated
+result but just allow the timer to be set to that far-future value.
+(If you are performing further calculations on the returned value
+rather than simply passing it to a QEMUTimer function like
+``timer_mod_ns()`` then you should be careful to avoid overflow
+in those calculations, of course.)
+
+Obtaining tick counts
+---------------------
+
+For calculations where you need to know the number of ticks in
+a given duration, use ``clock_ns_to_ticks()``. This function handles
+possible non-whole-number-of-nanoseconds periods and avoids
+potential rounding errors. It will return 0 if the clock is stopped
+(i.e. it has period zero). If the inputs imply a tick count that
+overflows a 64-bit value (a very long duration for a clock with a
+very short period) the output value is truncated, so effectively
+the 64-bit output wraps around.
+
+Changing a clock period
+-----------------------
+
+A device can change its outputs using the ``clock_update()``,
+``clock_update_ns()`` or ``clock_update_hz()`` functions. Any of these
+will trigger updates on every connected input.
+ +For example, let's say that we have an output clock *clkout* and we +have a pointer to it in the device state because we did the following +in init phase: + +.. code-block:: c + + dev->clkout = qdev_init_clock_out(DEVICE(dev), "clkout"); + +Then at any time (apart from the cases listed below), it is possible to +change the clock value by doing: + +.. code-block:: c + + clock_update_hz(dev->clkout, 1000 * 1000 * 1000); /* 1GHz */ + +Because updating a clock may trigger any side effects through +connected clocks and their callbacks, this operation must be done +while holding the qemu io lock. + +For the same reason, one can update clocks only when it is allowed to have +side effects on other objects. In consequence, it is forbidden: + +* during migration, +* and in the enter phase of reset. + +Note that calling ``clock_update[_ns|_hz]()`` is equivalent to calling +``clock_set[_ns|_hz]()`` (with the same arguments) then +``clock_propagate()`` on the clock. Thus, setting the clock value can +be separated from triggering the side-effects. This is often required +to factorize code to handle reset and migration in devices. + +Aliasing clocks +--------------- + +Sometimes, one needs to forward, or inherit, a clock from another +device. Typically, when doing device composition, a device might +expose a sub-device's clock without interfering with it. The function +``qdev_alias_clock()`` can be used to achieve this behaviour. Note +that it is possible to expose the clock under a different name. +``qdev_alias_clock()`` works for both input and output clocks. + +For example, if device B is a child of device A, +``device_a_instance_init()`` may do something like this: + +.. code-block:: c + + void device_a_instance_init(Object *obj) + { + AState *A = DEVICE_A(obj); + BState *B; + /* create object B as child of A */ + [...] + qdev_alias_clock(B, "clk", A, "b_clk"); + /* + * Now A has a clock "b_clk" which is an alias to + * the clock "clk" of its child B. + */ + } + +This function does not return any clock object. The new clock has the +same direction (input or output) as the original one. This function +only adds a link to the existing clock. In the above example, object B +remains the only object allowed to use the clock and device A must not +try to change the clock period or set a callback to the clock. This +diagram describes the example with an input clock:: + + +--------------------------+ + | Device A | + | +--------------+ | + | | Device B | | + | | +-------+ | | + >>"b_clk">>>| "clk" | | | + | (in) | | (in) | | | + | | +-------+ | | + | +--------------+ | + +--------------------------+ + +Migration +--------- + +Clock state is not migrated automatically. Every device must handle its +clock migration. Alias clocks must not be migrated. + +To ensure clock states are restored correctly during migration, there +are two solutions. + +Clock states can be migrated by adding an entry into the device +vmstate description. You should use the ``VMSTATE_CLOCK`` macro for this. +This is typically used to migrate an input clock state. For example: + +.. code-block:: c + + MyDeviceState { + DeviceState parent_obj; + [...] /* some fields */ + Clock *clk; + }; + + VMStateDescription my_device_vmstate = { + .name = "my_device", + .fields = (VMStateField[]) { + [...], /* other migrated fields */ + VMSTATE_CLOCK(clk, MyDeviceState), + VMSTATE_END_OF_LIST() + } + }; + +The second solution is to restore the clock state using information already +at our disposal. 
This can be used to restore output clock states using the
+device state. The functions ``clock_set[_ns|_hz]()`` can be used during the
+``post_load()`` migration callback.
+
+When adding clock support to an existing device, if you care about
+migration compatibility you will need to be careful, as simply adding
+a ``VMSTATE_CLOCK()`` line will break compatibility. Instead, you can
+put the ``VMSTATE_CLOCK()`` line into a vmstate subsection with a
+suitable ``needed`` function, and use ``clock_set()`` in a
+``pre_load()`` function to set the default value that will be used if
+the source virtual machine in the migration does not send the clock
+state.
+
+Care should be taken not to use ``clock_update[_ns|_hz]()`` or
+``clock_propagate()`` at any point during migration, because doing so
+would trigger side effects on other devices, which may be in an
+unknown state.
diff --git a/docs/devel/code-of-conduct.rst b/docs/devel/code-of-conduct.rst
new file mode 100644
index 000000000..195444d1b
--- /dev/null
+++ b/docs/devel/code-of-conduct.rst
@@ -0,0 +1,60 @@
+Code of Conduct
+===============
+
+The QEMU community is made up of a mixture of professionals and
+volunteers from all over the world. Diversity is one of our strengths,
+but it can also lead to communication issues and unhappiness.
+To that end, we have a few ground rules that we ask people to adhere to.
+
+* Be welcoming. We are committed to making participation in this project
+  a harassment-free experience for everyone, regardless of level of
+  experience, gender, gender identity and expression, sexual orientation,
+  disability, personal appearance, body size, race, ethnicity, age, religion,
+  or nationality.
+
+* Be respectful. Not all of us will agree all the time. Disagreements, both
+  social and technical, happen all the time and the QEMU community is no
+  exception. When we disagree, we try to understand why. It is important that
+  we resolve disagreements and differing views constructively. Members of the
+  QEMU community should be respectful when dealing with other contributors as
+  well as with people outside the QEMU community and with users of QEMU.
+
+Harassment and other exclusionary behavior are not acceptable. A community
+where people feel uncomfortable or threatened is neither welcoming nor
+respectful. Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery
+
+* Personal attacks
+
+* Trolling or insulting/derogatory comments
+
+* Public or private harassment
+
+* Publishing others' private information, such as physical or electronic
+  addresses, without explicit permission
+
+This isn't an exhaustive list of things that you can't do. Rather, take
+it in the spirit in which it's intended: a guide to make it easier to
+be excellent to each other.
+
+This code of conduct applies to all spaces managed by the QEMU project.
+This includes IRC, the mailing lists, the issue tracker, community
+events, and any other forums created by the project team which the
+community uses for communication. This code of conduct also applies
+outside these spaces, when an individual acts as a representative or a
+member of the project or its community.
+
+By adopting this code of conduct, project maintainers commit themselves
+to fairly and consistently applying these principles to every aspect of
+managing this project. If you believe someone is violating the code of
+conduct, please read the :ref:`conflict-resolution` document for
+information about how to proceed.
+ +Sources +------- + +This document is based on the `Fedora Code of Conduct +<http://web.archive.org/web/20210429132536/https://docs.fedoraproject.org/en-US/project/code-of-conduct/>`__ +(as of April 2021) and the `Contributor Covenant version 1.3.0 +<https://www.contributor-covenant.org/version/1/3/0/code-of-conduct/>`__. diff --git a/docs/devel/conflict-resolution.rst b/docs/devel/conflict-resolution.rst new file mode 100644 index 000000000..bb25f6186 --- /dev/null +++ b/docs/devel/conflict-resolution.rst @@ -0,0 +1,80 @@ +.. _conflict-resolution: + +Conflict Resolution Policy +========================== + +Conflicts in the community can take many forms, from someone having a +bad day and using harsh and hurtful language on the mailing list to more +serious code of conduct violations (including sexist/racist statements +or threats of violence), and everything in between. + +For the vast majority of issues, we aim to empower individuals to first +resolve conflicts themselves, asking for help when needed, and only +after that fails to escalate further. This approach gives people more +control over the outcome of their dispute. + +How we resolve conflicts +------------------------ + +If you are experiencing conflict, please consider first addressing the +perceived conflict directly with other involved parties, preferably through +a real-time medium such as IRC. You could also try to get a third-party (e.g. +a mutual friend, and/or someone with background on the issue, but not +involved in the conflict) to intercede or mediate. + +If this fails or if you do not feel comfortable proceeding this way, or +if the problem requires immediate escalation, report the issue to the QEMU +leadership committee by sending an email to qemu@sfconservancy.org, providing +references to the misconduct. +For very urgent topics, you can also inform one or more members through IRC. +The up-to-date list of members is `available on the QEMU wiki +<https://wiki.qemu.org/Conservancy>`__. + +Your report will be treated confidentially by the leadership committee and +not be published without your agreement. The QEMU leadership committee will +then do its best to review the incident in a timely manner, and will either +seek further information, or will make a determination on next steps. + +Remedies +-------- + +Escalating an issue to the QEMU leadership committee may result in actions +impacting one or more involved parties. In the event the leadership +committee has to intervene, here are some of the ways they might respond: + +1. Take no action. For example, if the leadership committee determines + the complaint has not been substantiated or is being made in bad faith, + or if it is deemed to be outside its purview. + +2. A private reprimand, explaining the consequences of continued behavior, + to one or more involved individuals. + +3. A private reprimand and request for a private or public apology + +4. A public reprimand and request for a public apology + +5. A public reprimand plus a mandatory cooling off period. The cooling + off period may require, for example, one or more of the following: + abstaining from maintainer duties; not interacting with people involved, + including unsolicited interaction with those enforcing the guidelines + and interaction on social media; being denied participation to in-person + events. The cooling off period is voluntary but may escalate to a + temporary ban in order to enforce it. + +6. 
A temporary or permanent ban from some or all current and future QEMU
+   spaces (mailing lists, IRC, wiki, etc.), possibly including in-person
+   events.
+
+In the event of severe harassment, the leadership committee may advise that
+the matter be escalated to the relevant local law enforcement agency. It
+is however not the role of the leadership committee to initiate contact
+with law enforcement on behalf of any of the community members involved
+in an incident.
+
+Sources
+-------
+
+This document was developed based on the `Drupal Conflict Resolution
+Policy and Process <https://www.drupal.org/conflict-resolution>`__
+and the `Mozilla Consequence Ladder
+<https://github.com/mozilla/diversity/blob/master/code-of-conduct-enforcement/consequence-ladder.md>`__.
diff --git a/docs/devel/control-flow-integrity.rst b/docs/devel/control-flow-integrity.rst
new file mode 100644
index 000000000..e6b73a4fe
--- /dev/null
+++ b/docs/devel/control-flow-integrity.rst
@@ -0,0 +1,137 @@
+============================
+Control-Flow Integrity (CFI)
+============================
+
+This document describes the current control-flow integrity (CFI) mechanism in
+QEMU: how it can be enabled, its benefits and deficiencies, and how it
+affects new and existing code in QEMU.
+
+Basics
+------
+
+CFI is a hardening technique focusing on guaranteeing that indirect
+function calls have not been altered by an attacker.
+The type used in QEMU is a forward-edge control-flow integrity that ensures
+that function calls performed through function pointers always call a
+"compatible" function. A compatible function is a function with the same
+signature as the function pointer declared in the source code.
+
+This type of CFI is entirely compiler-based and relies on the compiler knowing
+the signature of every function and every function pointer used in the code.
+As of now, the only compiler that provides support for CFI is Clang.
+
+CFI is best used on production binaries, to protect against unknown attack
+vectors.
+
+In case of a CFI violation (i.e. a call to a non-compatible function) QEMU
+will terminate abruptly, to stop the possible attack.
+
+Building with CFI
+-----------------
+
+NOTE: CFI requires the use of link-time optimization. Therefore, when CFI is
+selected, LTO will be automatically enabled.
+
+To build with CFI, the minimum requirement is Clang 6+. If you
+are planning to also enable fuzzing, then Clang 11+ is needed (more on this
+later).
+
+Given the use of LTO, a version of AR that supports LLVM IR is required.
+The easiest way of doing this is by selecting the AR provided by LLVM::
+
+   AR=llvm-ar-9 CC=clang-9 CXX=clang++-9 /path/to/configure --enable-cfi
+
+CFI is enabled on every binary produced.
+
+If desired, an additional flag to increase the verbosity of the output in case
+of a CFI violation is offered (``--enable-debug-cfi``).
+
+Using QEMU built with CFI
+-------------------------
+
+A binary with CFI will work exactly like a standard binary. In case of a CFI
+violation, the binary will terminate with an illegal instruction signal.
+
+Incompatible code with CFI
+--------------------------
+
+As mentioned above, CFI is entirely compiler-based and therefore relies on
+compile-time knowledge of the code. This means that, while generally supported
+for most code, some specific use patterns can break CFI compatibility and
+create false positives.
The two main patterns that can cause issues are:
+
+* Just-in-time compiled code: since such code is created at runtime, the jump
+  to the buffer containing JIT code will fail.
+
+* Libraries loaded dynamically, e.g. with dlopen/dlsym, since the library was
+  not known at compile time.
+
+Current areas of QEMU that are not entirely compatible with CFI are:
+
+1. TCG, since the idea of TCG is to pre-compile groups of instructions at
+   runtime to speed up interpretation, quite similarly to a JIT compiler
+
+2. TCI, where the interpreter has to interpret the generic *call* operation
+
+3. Plugins, since a plugin is implemented as an external library
+
+4. Modules, since they are implemented as an external library
+
+5. Directly calling signal handlers from the QEMU source code, since the
+   signal handler may have been provided by an external library or even plugged
+   at runtime.
+
+Disabling CFI for a specific function
+-------------------------------------
+
+If you are working on a function that performs a call in one of the
+incompatible ways described above, you can selectively disable CFI checks
+for that function by using the decorator ``QEMU_DISABLE_CFI`` at the function
+definition, and adding an explanation of why the function is not compatible
+with CFI. An example of the use of ``QEMU_DISABLE_CFI`` is provided here::
+
+    /*
+     * Disable CFI checks.
+     * TCG creates binary blobs at runtime, with the transformed code.
+     * A TB is a blob of binary code, created at runtime and called with an
+     * indirect function call. Since such a function did not exist at compile
+     * time, the CFI runtime has no way to verify its signature and would fail.
+     * TCG is not considered a security-sensitive part of QEMU so this does not
+     * affect the impact of CFI in environments with high security requirements
+     */
+    QEMU_DISABLE_CFI
+    static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
+
+NOTE: CFI needs to be disabled at the **caller** function (i.e. a compatible
+CFI function that calls a non-compatible one), since the check is performed
+when the function call is performed.
+
+CFI and fuzzing
+---------------
+
+There is generally no advantage to using CFI and fuzzing together, because
+they target different environments (production for CFI, debug for fuzzing).
+
+CFI could be used in conjunction with fuzzing to identify a broader set of
+bugs that do not immediately end in a segmentation fault or a triggered
+assertion. However, other sanitizers such as the address and undefined
+behavior sanitizers can identify such bugs in a more precise way than CFI.
+
+There is, however, an interesting use case for using CFI in conjunction with
+fuzzing: making sure that CFI does not trigger any false positives in
+remote-but-possible parts of the code.
+
+CFI can be enabled with fuzzing, but with some caveats:
+
+1. Fuzzing relies on the linker performing function wrapping at link-time.
+   The standard BFD linker does not support function wrapping when LTO is
+   also enabled. The workaround is to use LLVM's lld linker.
+
+2. Fuzzing also relies on a custom linker script, which is only supported by
+   lld with version 11+.
+
+In other words, to compile with fuzzing and CFI, clang 11+ is required, and
+lld needs to be used as the linker::
+
+    AR=llvm-ar-11 CC=clang-11 CXX=clang++-11 /path/to/configure --enable-cfi \
+      --enable-fuzzing --extra-ldflags="-fuse-ld=lld"
+
+and then, compile the fuzzers as usual.
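+
+Finally, to make the failure mode concrete, here is a deliberately broken,
+hypothetical snippet (our own example, not QEMU code): the function is
+called through a pointer whose type does not match the function's actual
+signature, which is exactly the kind of indirect call CFI aborts on::
+
+    typedef int (*op_fn)(int);
+
+    static long bad_op(long x)      /* signature differs from op_fn */
+    {
+        return x + 1;
+    }
+
+    int run(void)
+    {
+        op_fn f = (op_fn)bad_op;    /* the cast silences the compiler... */
+        return f(41);               /* ...but CFI aborts here at runtime */
+    }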
diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst new file mode 100644 index 000000000..49ea50c2a --- /dev/null +++ b/docs/devel/decodetree.rst @@ -0,0 +1,237 @@ +======================== +Decodetree Specification +======================== + +A *decodetree* is built from instruction *patterns*. A pattern may +represent a single architectural instruction or a group of same, depending +on what is convenient for further processing. + +Each pattern has both *fixedbits* and *fixedmask*, the combination of which +describes the condition under which the pattern is matched:: + + (insn & fixedmask) == fixedbits + +Each pattern may have *fields*, which are extracted from the insn and +passed along to the translator. Examples of such are registers, +immediates, and sub-opcodes. + +In support of patterns, one may declare *fields*, *argument sets*, and +*formats*, each of which may be re-used to simplify further definitions. + +Fields +====== + +Syntax:: + + field_def := '%' identifier ( unnamed_field )* ( !function=identifier )? + unnamed_field := number ':' ( 's' ) number + +For *unnamed_field*, the first number is the least-significant bit position +of the field and the second number is the length of the field. If the 's' is +present, the field is considered signed. If multiple ``unnamed_fields`` are +present, they are concatenated. In this way one can define disjoint fields. + +If ``!function`` is specified, the concatenated result is passed through the +named function, taking and returning an integral value. + +One may use ``!function`` with zero ``unnamed_fields``. This case is called +a *parameter*, and the named function is only passed the ``DisasContext`` +and returns an integral value extracted from there. + +A field with no ``unnamed_fields`` and no ``!function`` is in error. + +Field examples: + ++---------------------------+---------------------------------------------+ +| Input | Generated code | ++===========================+=============================================+ +| %disp 0:s16 | sextract(i, 0, 16) | ++---------------------------+---------------------------------------------+ +| %imm9 16:6 10:3 | extract(i, 16, 6) << 3 | extract(i, 10, 3) | ++---------------------------+---------------------------------------------+ +| %disp12 0:s1 1:1 2:10 | sextract(i, 0, 1) << 11 | | +| | extract(i, 1, 1) << 10 | | +| | extract(i, 2, 10) | ++---------------------------+---------------------------------------------+ +| %shimm8 5:s8 13:1 | expand_shimm8(sextract(i, 5, 8) << 1 | | +| !function=expand_shimm8 | extract(i, 13, 1)) | ++---------------------------+---------------------------------------------+ + +Argument Sets +============= + +Syntax:: + + args_def := '&' identifier ( args_elt )+ ( !extern )? + args_elt := identifier (':' identifier)? + +Each *args_elt* defines an argument within the argument set. +If the form of the *args_elt* contains a colon, the first +identifier is the argument name and the second identifier is +the argument type. If the colon is missing, the argument +type will be ``int``. + +Each argument set will be rendered as a C structure "arg_$name" +with each of the fields being one of the member arguments. + +If ``!extern`` is specified, the backing structure is assumed +to have been already declared, typically via a second decoder. + +Argument sets are useful when one wants to define helper functions +for the translator functions that can perform operations on a common +set of arguments. 
This can ensure, for instance, that the ``AND``
+pattern and the ``OR`` pattern put their operands into the same named
+structure, so that a common ``gen_logic_insn`` may be able to handle
+the operations common between the two.
+
+Argument set examples::
+
+  &reg3       ra rb rc
+  &loadstore  reg base offset
+  &longldst   reg base offset:int64_t
+
+
+Formats
+=======
+
+Syntax::
+
+  fmt_def      := '@' identifier ( fmt_elt )+
+  fmt_elt      := fixedbit_elt | field_elt | field_ref | args_ref
+  fixedbit_elt := [01.-]+
+  field_elt    := identifier ':' 's'? number
+  field_ref    := '%' identifier | identifier '=' '%' identifier
+  args_ref     := '&' identifier
+
+Defining a format is a handy way to avoid replicating groups of fields
+across many instruction patterns.
+
+A *fixedbit_elt* describes a contiguous sequence of bits that must
+be 1, 0, or don't care. The difference between '.' and '-'
+is that '.' means that the bit will be covered with a field or a
+final 0 or 1 from the pattern, and '-' means that the bit is really
+ignored by the cpu and will not be specified.
+
+A *field_elt* describes a simple field only given a width; the position of
+the field is implied by its position with respect to other *fixedbit_elt*
+and *field_elt*.
+
+If any *fixedbit_elt* or *field_elt* appear, then all bits must be defined.
+Padding with a *fixedbit_elt* of all '.' is an easy way to accomplish that.
+
+A *field_ref* incorporates a field by reference. This is the only way to
+add a complex field to a format. A field may be renamed in the process
+via assignment to another identifier. This is intended to allow the
+same argument set to be used with disjoint named fields.
+
+A single *args_ref* may specify an argument set to use for the format.
+The set of fields in the format must be a subset of the arguments in
+the argument set. If an argument set is not specified, one will be
+inferred from the set of fields.
+
+It is recommended, but not required, that all *field_ref* and *args_ref*
+appear at the end of the line, not interleaving with *fixedbit_elt* or
+*field_elt*.
+
+Format examples::
+
+  @opr  ...... ra:5 rb:5 ... 0 ....... rc:5
+  @opi  ...... ra:5 lit:8    1 ....... rc:5
+
+Patterns
+========
+
+Syntax::
+
+  pat_def   := identifier ( pat_elt )+
+  pat_elt   := fixedbit_elt | field_elt | field_ref | args_ref | fmt_ref | const_elt
+  fmt_ref   := '@' identifier
+  const_elt := identifier '=' number
+
+The *fixedbit_elt* and *field_elt* specifiers are unchanged from formats.
+A pattern that does not specify a named format will have one inferred
+from a referenced argument set (if present) and the set of fields.
+
+A *const_elt* allows an argument to be set to a constant value. This may
+come in handy when fields overlap between patterns and one has to
+include the values in the *fixedbit_elt* instead.
+
+The decoder will call a translator function for each pattern matched.
+
+Pattern examples::
+
+  addl_r   010000 ..... ..... .... 0000000 ..... @opr
+  addl_i   010000 ..... ..... .... 0000000 ..... @opi
+
+which will, in part, invoke::
+
+  trans_addl_r(ctx, &arg_opr, insn)
+
+and::
+
+  trans_addl_i(ctx, &arg_opi, insn)
+
+Pattern Groups
+==============
+
+Syntax::
+
+  group            := overlap_group | no_overlap_group
+  overlap_group    := '{' ( pat_def | group )+ '}'
+  no_overlap_group := '[' ( pat_def | group )+ ']'
+
+A *group* begins with a lone open-brace or open-bracket, with all
+subsequent lines indented two spaces, and ending with a lone
+close-brace or close-bracket.
Groups may be nested, increasing the
+required indentation of the lines within the nested group to two
+spaces per nesting level.
+
+Patterns within overlap groups are allowed to overlap. Conflicts are
+resolved by selecting the patterns in order. If all of the fixedbits
+for a pattern match, its translate function will be called. If the
+translate function returns false, then subsequent patterns within the
+group will be matched.
+
+Patterns within no-overlap groups are not allowed to overlap, just
+the same as ungrouped patterns. Thus no-overlap groups are intended
+to be nested inside overlap groups.
+
+The following example from PA-RISC shows specialization of the *or*
+instruction::
+
+  {
+    {
+      nop   000010 ----- ----- 0000 001001 0 00000
+      copy  000010 00000 r1:5  0000 001001 0 rt:5
+    }
+    or     000010 rt2:5 r1:5  cf:4 001001 0 rt:5
+  }
+
+When the *cf* field is zero, the instruction has no side effects,
+and may be specialized. When the *rt* field is zero, the output
+is discarded and so the instruction has no effect. When the *rt2*
+field is zero, the operation is ``reg[r1] | 0`` and so encodes
+the canonical register copy operation.
+
+The output from the generator might look like::
+
+  switch (insn & 0xfc000fe0) {
+  case 0x08000240:
+    /* 000010.. ........ ....0010 010..... */
+    if ((insn & 0x0000f000) == 0x00000000) {
+        /* 000010.. ........ 00000010 010..... */
+        if ((insn & 0x0000001f) == 0x00000000) {
+            /* 000010.. ........ 00000010 01000000 */
+            extract_decode_Fmt_0(&u.f_decode0, insn);
+            if (trans_nop(ctx, &u.f_decode0)) return true;
+        }
+        if ((insn & 0x03e00000) == 0x00000000) {
+            /* 00001000 000..... 00000010 010..... */
+            extract_decode_Fmt_1(&u.f_decode1, insn);
+            if (trans_copy(ctx, &u.f_decode1)) return true;
+        }
+    }
+    extract_decode_Fmt_2(&u.f_decode2, insn);
+    if (trans_or(ctx, &u.f_decode2)) return true;
+    return false;
+  }
diff --git a/docs/devel/ebpf_rss.rst b/docs/devel/ebpf_rss.rst
new file mode 100644
index 000000000..4a68682b3
--- /dev/null
+++ b/docs/devel/ebpf_rss.rst
@@ -0,0 +1,125 @@
+===========================
+eBPF RSS virtio-net support
+===========================
+
+RSS (Receive Side Scaling) is used to distribute network packets to guest
+virtqueues by calculating a packet hash. Each queue is then usually processed
+by a specific guest CPU core.
+
+For now there are 2 RSS implementations in qemu:
+
+- 'in-qemu' RSS (functions if qemu receives network packets, i.e. vhost=off)
+- eBPF RSS (can also function with vhost=on)
+
+eBPF support (CONFIG_EBPF) is enabled by the 'configure' script.
+To enable eBPF RSS support use './configure --enable-bpf'.
+
+If a steering BPF program is not set for the kernel's TUN module, TUN uses
+automatic selection of the rx virtqueue, based on a lookup table built from
+the calculated symmetric hash of transmitted packets.
+If a steering BPF program is set for TUN, the BPF code calculates the hash of
+the packet header and returns the number of the virtqueue in which to place
+the packet.
+
+Simplified decision formula:
+
+.. code:: C
+
+    queue_index = indirection_table[hash(<packet data>)%<indirection_table size>]
+
+
+The hash cannot (or should not) be calculated for all packets.
+
+Note: currently, eBPF RSS does not support hash reporting.
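+
+The decision formula above can be expanded into a plain C sketch
+(hypothetical helper names such as ``hashable()`` and ``toeplitz_hash()``;
+the real eBPF program in tools/ebpf/rss.bpf.c is more involved):
+
+.. code:: C
+
+    /* pick the destination virtqueue for one packet */
+    static uint32_t select_queue(const struct EBPFRSSConfig *cfg,
+                                 const uint16_t *indirection_table,
+                                 const uint8_t *pkt, size_t len)
+    {
+        uint32_t hash;
+
+        if (!cfg->redirect || !hashable(pkt, len)) {
+            /* e.g. ARP packets cannot be hashed */
+            return cfg->default_queue;
+        }
+        hash = toeplitz_hash(pkt, len); /* uses the 40-byte Toeplitz key */
+        return indirection_table[hash % cfg->indirections_len];
+    }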
+
+eBPF RSS is turned on by different combinations of vhost-net, virtio-net and
+tap configurations:
+
+- eBPF is used:
+
+      tap,vhost=off & virtio-net-pci,rss=on,hash=off
+
+- eBPF is used:
+
+      tap,vhost=on & virtio-net-pci,rss=on,hash=off
+
+- 'in-qemu' RSS is used:
+
+      tap,vhost=off & virtio-net-pci,rss=on,hash=on
+
+- eBPF is used, hash population feature is not reported to the guest:
+
+      tap,vhost=on & virtio-net-pci,rss=on,hash=on
+
+If CONFIG_EBPF is not set then only 'in-qemu' RSS is supported.
+'in-qemu' RSS is also used as a fallback if the eBPF program failed to load
+or could not be set for TUN.
+
+RSS eBPF program
+----------------
+
+The RSS program is located in ebpf/rss.bpf.skeleton.h, generated by bpftool,
+so the program is part of the qemu binary.
+Initially, the eBPF program was compiled by clang; its source code is located
+at tools/ebpf/rss.bpf.c.
+Prerequisites to recompile the eBPF program (regenerate
+ebpf/rss.bpf.skeleton.h):
+
+    llvm, clang, kernel source tree, bpftool
+    Adjust Makefile.ebpf to reflect the location of the kernel source tree
+
+    $ cd tools/ebpf
+    $ make -f Makefile.ebpf
+
+The current eBPF RSS implementation uses 'bounded loops' with 'backward jump
+instructions', which are present in recent kernels. Overall, eBPF RSS works
+on kernels 5.8+.
+
+eBPF RSS implementation
+-----------------------
+
+The eBPF RSS loading functionality is located in ebpf/ebpf_rss.c and
+ebpf/ebpf_rss.h.
+
+The ``struct EBPFRSSContext`` structure holds the libbpf context and 4 file
+descriptors:
+
+- ctx - pointer to the libbpf context.
+- program_fd - file descriptor of the eBPF RSS program.
+- map_configuration - file descriptor of the 'configuration' map. This map contains one element of 'struct EBPFRSSConfig'. This configuration determines eBPF program behavior.
+- map_toeplitz_key - file descriptor of the 'Toeplitz key' map. It has one element: the 40-byte key prepared for the hashing algorithm.
+- map_indirections_table - file descriptor of the indirection table map: 128 elements of queue indexes.
+
+``struct EBPFRSSConfig`` fields:
+
+- redirect - "boolean" value: should the hash be calculated? If false, ``default_queue`` is used as the final decision.
+- populate_hash - for now, not used. eBPF RSS doesn't support hash reporting.
+- hash_types - binary mask of different hash types. See ``VIRTIO_NET_RSS_HASH_TYPE_*`` defines. If the hash should not be calculated for a packet, ``default_queue`` is used.
+- indirections_len - length of the indirections table, maximum 128.
+- default_queue - the queue index used for packets that shouldn't be hashed. For some packets the hash can't be calculated (e.g. ARP).
+
+Functions:
+
+- ``ebpf_rss_init()`` - sets ctx to NULL, which indicates that EBPFRSSContext is not loaded.
+- ``ebpf_rss_load()`` - creates 3 maps and loads the eBPF program from rss.bpf.skeleton.h. Returns 'true' on success. After that, program_fd can be used to set steering for TAP.
+- ``ebpf_rss_set_all()`` - sets values for the eBPF maps. The ``indirections_table`` length is in EBPFRSSConfig. ``toeplitz_key`` is a VIRTIO_NET_RSS_MAX_KEY_SIZE (40 byte) array.
+- ``ebpf_rss_unload()`` - closes all file descriptors and sets ctx to NULL.
+
+Simplified eBPF RSS workflow:
+
+..
code:: C
+
+    struct EBPFRSSConfig config;
+    config.redirect = 1;
+    config.hash_types = VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | VIRTIO_NET_RSS_HASH_TYPE_TCPv4;
+    config.indirections_len = VIRTIO_NET_RSS_MAX_TABLE_LEN;
+    config.default_queue = 0;
+
+    uint16_t table[VIRTIO_NET_RSS_MAX_TABLE_LEN] = {...};
+    uint8_t key[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {...};
+
+    struct EBPFRSSContext ctx;
+    ebpf_rss_init(&ctx);
+    ebpf_rss_load(&ctx);
+    ebpf_rss_set_all(&ctx, &config, table, key);
+    if (net_client->info->set_steering_ebpf != NULL) {
+        net_client->info->set_steering_ebpf(net_client, ctx.program_fd);
+    }
+    ...
+    ebpf_rss_unload(&ctx);
+
+
+NetClientState SetSteeringEBPF()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For now, the ``set_steering_ebpf()`` method is supported by the Linux TAP
+NetClientState. The method requires an eBPF program file descriptor as an
+argument.
diff --git a/docs/devel/fuzzing.rst b/docs/devel/fuzzing.rst
new file mode 100644
index 000000000..784ecb99e
--- /dev/null
+++ b/docs/devel/fuzzing.rst
@@ -0,0 +1,322 @@
+========
+Fuzzing
+========
+
+This document describes the virtual-device fuzzing infrastructure in QEMU and
+how to use it to implement additional fuzzers.
+
+Basics
+------
+
+Fuzzing operates by passing inputs to an entry point/target function. The
+fuzzer tracks the code coverage triggered by the input. Based on these
+findings, the fuzzer mutates the input and repeats the fuzzing.
+
+To fuzz QEMU, we rely on libfuzzer. Unlike other fuzzers such as AFL, libfuzzer
+is an *in-process* fuzzer. For the developer, this means that it is their
+responsibility to ensure that state is reset between fuzzing-runs.
+
+Building the fuzzers
+--------------------
+
+*NOTE*: If possible, build a 32-bit binary. When forking, the 32-bit fuzzer is
+much faster, since the page-map has a smaller size. This is due to the fact that
+AddressSanitizer maps ~20TB of memory, as part of its detection. This results
+in a large page-map, and a much slower ``fork()``.
+
+To build the fuzzers, install a recent version of clang and configure with it
+(substitute the clang binaries with the version you installed). Here,
+``--enable-sanitizers`` is optional, but it allows us to reliably detect bugs
+such as out-of-bounds accesses, use-after-frees, double-frees etc.::
+
+    CC=clang-8 CXX=clang++-8 /path/to/configure --enable-fuzzing \
+                                                --enable-sanitizers
+
+Fuzz targets are built similarly to system targets::
+
+    make qemu-fuzz-i386
+
+This builds ``./qemu-fuzz-i386``.
+
+The first option to this command is ``--fuzz-target=FUZZ_NAME``.
+To list all of the available fuzzers, run ``qemu-fuzz-i386`` with no
+arguments.
+
+For example::
+
+    ./qemu-fuzz-i386 --fuzz-target=virtio-scsi-fuzz
+
+Internally, libfuzzer parses all arguments that do not begin with ``"--"``.
+Information about these is available by passing ``-help=1``.
+
+Now the only thing left to do is wait for the fuzzer to trigger potential
+crashes.
+
+Useful libFuzzer flags
+----------------------
+
+As mentioned above, libFuzzer accepts some arguments. Passing ``-help=1`` will
+list the available arguments. In particular, these arguments might be helpful:
+
+* ``CORPUS_DIR/`` : Specify a directory as the last argument to libFuzzer.
+  libFuzzer stores each "interesting" input in this corpus directory. The next
+  time you run libFuzzer, it will read all of the inputs from the corpus, and
+  continue fuzzing from there. You can also specify multiple directories.
+  libFuzzer loads existing inputs from all specified directories, but will
+  only write new ones to the first one specified.
+
+* ``-max_len=4096`` : specify the maximum byte-length of the inputs libFuzzer
+  will generate.
+
+* ``-close_fd_mask={1,2,3}`` : close stdout, stderr, or both. Useful for
+  targets that trigger many debug/error messages, or create output on the
+  serial console.
+
+* ``-jobs=4 -workers=4`` : These arguments configure libFuzzer to run 4 fuzzers in
+  parallel (4 fuzzing jobs in 4 worker processes). Alternatively, with only
+  ``-jobs=N``, libFuzzer automatically spawns a number of workers less than or equal
+  to half the available CPU cores. Replace 4 with a number appropriate for your
+  machine. Make sure to specify a ``CORPUS_DIR``, which will allow the parallel
+  fuzzers to share information about the interesting inputs they find.
+
+* ``-use_value_profile=1`` : For each comparison operation, libFuzzer computes
+  ``(caller_pc&4095) | (popcnt(Arg1 ^ Arg2) << 12)`` and places this in the
+  coverage table. Useful for targets with "magic" constants. If Arg1 came from
+  the fuzzer's input and Arg2 is a magic constant, then each time the Hamming
+  distance between Arg1 and Arg2 decreases, libFuzzer adds the input to the
+  corpus.
+
+* ``-shrink=1`` : Tries to make elements of the corpus "smaller". Might lead to
+  better coverage performance, depending on the target.
+
+Note that libFuzzer's exact behavior will depend on the version of
+clang and libFuzzer used to build the device fuzzers.
+
+Generating Coverage Reports
+---------------------------
+
+Code coverage is a crucial metric for evaluating a fuzzer's performance.
+libFuzzer's output provides a "cov: " column that provides a total number of
+unique blocks/edges covered. To examine coverage on a line-by-line basis we
+can use Clang coverage:
+
+ 1. Configure libFuzzer to store a corpus of all interesting inputs (see
+    CORPUS_DIR above)
+ 2. ``./configure`` the QEMU build with ::
+
+      --enable-fuzzing \
+      --extra-cflags="-fprofile-instr-generate -fcoverage-mapping"
+
+ 3. Re-run the fuzzer. Specify $CORPUS_DIR/* as an argument, telling libfuzzer
+    to execute all of the inputs in $CORPUS_DIR and exit. Once the process
+    exits, you should find a file, "default.profraw", in the working directory.
+ 4. Execute these commands to generate a detailed HTML coverage-report::
+
+      llvm-profdata merge -output=default.profdata default.profraw
+      llvm-cov show ./path/to/qemu-fuzz-i386 -instr-profile=default.profdata \
+      --format html -output-dir=/path/to/output/report
+
+Adding a new fuzzer
+-------------------
+
+Coverage over virtual devices can be improved by adding additional fuzzers.
+Fuzzers are kept in ``tests/qtest/fuzz/`` and should be added to
+``tests/qtest/fuzz/meson.build``
+
+Fuzzers can rely on both qtest and libqos to communicate with virtual devices.
+
+1. Create a new source file. For example ``tests/qtest/fuzz/foo-device-fuzz.c``.
+
+2. Write the fuzzing code using the libqtest/libqos API. See existing fuzzers
+   for reference.
+
+3. Add the fuzzer to ``tests/qtest/fuzz/meson.build``.
+
+Fuzzers can be more-or-less thought of as special qtest programs which can
+modify the qtest commands and/or qtest command arguments based on inputs
+provided by libfuzzer. Libfuzzer passes a byte array and length. Commonly the
+fuzzer loops over the byte-array interpreting it as a list of qtest commands,
+addresses, or values.
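+
+As a loose sketch of that pattern (our own illustration: the device name,
+the ``FOO_MMIO_BASE``/``FOO_MMIO_SIZE`` constants, and the surrounding
+FuzzTarget registration are assumptions; see the real targets in
+``tests/qtest/fuzz/`` for working examples)::
+
+    /* interpret the fuzz input as a list of (offset, value) MMIO writes */
+    static void foo_device_fuzz(QTestState *s,
+                                const unsigned char *data, size_t size)
+    {
+        while (size >= 8) {
+            uint32_t offset, value;
+
+            memcpy(&offset, data, 4);
+            memcpy(&value, data + 4, 4);
+            qtest_writel(s, FOO_MMIO_BASE + (offset % FOO_MMIO_SIZE), value);
+            data += 8;
+            size -= 8;
+        }
+    }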
+
+The Generic Fuzzer
+------------------
+
+Writing a fuzz target can be a lot of effort (especially if a device driver
+has not been built out within libqos). Many devices can be fuzzed to some
+degree, without any device-specific code, using the generic-fuzz target.
+
+The generic-fuzz target is capable of fuzzing devices over their PIO, MMIO,
+and DMA input-spaces. To apply generic-fuzz to a device, we need to define,
+at minimum, two environment variables:
+
+* ``QEMU_FUZZ_ARGS=`` is the set of QEMU arguments used to configure a machine, with
+  the device attached. For example, if we want to fuzz the virtio-net device
+  attached to a pc-i440fx machine, we can specify::
+
+    QEMU_FUZZ_ARGS="-M pc -nodefaults -netdev user,id=user0 \
+    -device virtio-net,netdev=user0"
+
+* ``QEMU_FUZZ_OBJECTS=`` is a set of space-delimited strings used to identify
+  the MemoryRegions that will be fuzzed. These strings are compared against
+  MemoryRegion names and MemoryRegion owner names, to decide whether each
+  MemoryRegion should be fuzzed. These strings support globbing. For the
+  virtio-net example, we could use one of ::
+
+    QEMU_FUZZ_OBJECTS='virtio-net'
+    QEMU_FUZZ_OBJECTS='virtio*'
+    QEMU_FUZZ_OBJECTS='virtio* pcspk' # Fuzz the virtio devices and the speaker
+    QEMU_FUZZ_OBJECTS='*' # Fuzz the whole machine
+
+The ``"info mtree"`` and ``"info qom-tree"`` monitor commands can be especially
+useful for identifying the ``MemoryRegion`` and ``Object`` names used for
+matching.
+
+As a general rule of thumb, the more ``MemoryRegions``/Devices we match, the
+greater the input-space, and the smaller the probability of finding crashing
+inputs for individual devices. As such, it is usually a good idea to limit the
+fuzzer to only a few ``MemoryRegions``.
+
+To ensure that these env variables have been configured correctly, we can use::
+
+    ./qemu-fuzz-i386 --fuzz-target=generic-fuzz -runs=0
+
+The output should contain a complete list of matched MemoryRegions.
+
+OSS-Fuzz
+--------
+
+QEMU is continuously fuzzed on `OSS-Fuzz
+<https://github.com/google/oss-fuzz>`_. By default, the OSS-Fuzz build
+will try to fuzz every fuzz-target. Since the generic-fuzz target
+requires additional information provided in environment variables, we
+pre-define some generic-fuzz configs in
+``tests/qtest/fuzz/generic_fuzz_configs.h``. Each config must specify:
+
+- ``.name``: To identify the fuzzer config
+
+- ``.args`` OR ``.argfunc``: A string or pointer to a function returning a
+  string. These strings are used to specify the ``QEMU_FUZZ_ARGS``
+  environment variable. ``argfunc`` is useful when the config relies on e.g.
+  a dynamically created temp directory, or a free tcp/udp port.
+
+- ``.objects``: A string that specifies the ``QEMU_FUZZ_OBJECTS`` environment
+  variable.
+
+To fuzz additional devices/device configuration on OSS-Fuzz, send patches for
+either a new device-specific fuzzer or a new generic-fuzz config.
+
+Build details:
+
+- The Dockerfile that sets up the environment for building QEMU's
+  fuzzers on OSS-Fuzz can be found in the `OSS-Fuzz repository
+  <https://github.com/google/oss-fuzz/blob/master/projects/qemu/Dockerfile>`__
+
+- The script responsible for building the fuzzers can be found in the
+  QEMU source tree at ``scripts/oss-fuzz/build.sh``
+
+Building Crash Reproducers
+--------------------------
+
+When we find a crash, we should try to create an independent reproducer that
+can be used on a non-fuzzer build of QEMU.
This filters out any potential
+false positives, and improves the debugging experience for developers.
+Here are the steps for building a reproducer for a crash found by the
+generic-fuzz target.
+
+- Ensure the crash reproduces::
+
+    qemu-fuzz-i386 --fuzz-target... ./crash-...
+
+- Gather the QTest output for the crash::
+
+    QEMU_FUZZ_TIMEOUT=0 QTEST_LOG=1 FUZZ_SERIALIZE_QTEST=1 \
+    qemu-fuzz-i386 --fuzz-target... ./crash-... &> /tmp/trace
+
+- Reorder and clean-up the resulting trace::
+
+    scripts/oss-fuzz/reorder_fuzzer_qtest_trace.py /tmp/trace > /tmp/reproducer
+
+- Get the arguments needed to start qemu, and provide a path to qemu::
+
+    less /tmp/trace # The args should be logged at the top of this file
+    export QEMU_ARGS="-machine ..."
+    export QEMU_PATH="path/to/qemu-system"
+
+- Ensure the crash reproduces in qemu-system::
+
+    $QEMU_PATH $QEMU_ARGS -qtest stdio < /tmp/reproducer
+
+- From the crash output, obtain some string that identifies the crash. This
+  can be a line in the stack-trace, for example::
+
+    export CRASH_TOKEN="hw/usb/hcd-xhci.c:1865"
+
+- Minimize the reproducer::
+
+    scripts/oss-fuzz/minimize_qtest_trace.py -M1 -M2 \
+    /tmp/reproducer /tmp/reproducer-minimized
+
+- Confirm that the minimized reproducer still crashes::
+
+    $QEMU_PATH $QEMU_ARGS -qtest stdio < /tmp/reproducer-minimized
+
+- Create a one-liner reproducer that can be sent over email::
+
+    ./scripts/oss-fuzz/output_reproducer.py -bash /tmp/reproducer-minimized
+
+- Output the C source code for a test case that will reproduce the bug::
+
+    ./scripts/oss-fuzz/output_reproducer.py -owner "John Smith <john@smith.com>" \
+    -name "test_function_name" /tmp/reproducer-minimized
+
+- Report the bug and send a patch with the C reproducer upstream.
+
+Implementation Details / Fuzzer Lifecycle
+-----------------------------------------
+
+The fuzzer has two entrypoints that libfuzzer calls. libfuzzer provides its
+own ``main()``, which performs some setup, and calls the entrypoints:
+
+``LLVMFuzzerInitialize``: called prior to fuzzing. Used to initialize all of
+the necessary state.
+
+``LLVMFuzzerTestOneInput``: called for each fuzzing run. Processes the input
+and resets the state at the end of each run.
+
+In more detail:
+
+``LLVMFuzzerInitialize`` parses the arguments to the fuzzer (must start with
+two dashes, so they are ignored by libfuzzer ``main()``). Currently, the
+arguments select the fuzz target. Then, the qtest client is initialized. If
+the target requires qos, qgraph is set up and the QOM/LIBQOS modules are
+initialized. Then the QGraph is walked and the QEMU cmd_line is determined
+and saved.
+
+After this, ``vl.c:qemu_main`` is called to set up the guest. There are
+target-specific hooks that can be called before and after qemu_main, for
+additional setup (e.g. PCI setup, or VM snapshotting).
+
+``LLVMFuzzerTestOneInput``: Uses qtest/qos functions to act based on the fuzz
+input. It is also responsible for manually calling ``main_loop_wait`` to
+ensure that bottom halves are executed and that any cleanup required before
+the next input is performed.
+
+Since the same process is reused for many fuzzing runs, QEMU state needs to
+be reset at the end of each run. There are currently two implemented
+options for resetting state:
+
+- Reboot the guest between runs.
+
+  - *Pros*: Straightforward and fast for simple fuzz targets.
+
+  - *Cons*: Depending on the device, does not reset all device state.
If the + device requires some initialization prior to being ready for fuzzing (common + for QOS-based targets), this initialization needs to be done after each + reboot. + + - *Example target*: ``i440fx-qtest-reboot-fuzz`` + +- Run each test case in a separate forked process and copy the coverage + information back to the parent. This is fairly similar to AFL's "deferred" + fork-server mode [3] + + - *Pros*: Relatively fast. Devices only need to be initialized once. No need to + do slow reboots or vmloads. + + - *Cons*: Not officially supported by libfuzzer. Does not work well for + devices that rely on dedicated threads. + + - *Example target*: ``virtio-net-fork-fuzz`` diff --git a/docs/devel/index.rst b/docs/devel/index.rst new file mode 100644 index 000000000..afd937535 --- /dev/null +++ b/docs/devel/index.rst @@ -0,0 +1,50 @@ +--------------------- +Developer Information +--------------------- + +This section of the manual documents various parts of the internals of QEMU. +You only need to read it if you are interested in reading or +modifying QEMU's source code. + +.. toctree:: + :maxdepth: 2 + :includehidden: + + code-of-conduct + conflict-resolution + build-system + style + kconfig + testing + fuzzing + control-flow-integrity + loads-stores + memory + migration + atomics + stable-process + ci + qtest + decodetree + secure-coding-practices + tcg + tcg-icount + tracing + multi-thread-tcg + tcg-plugins + bitops + ui + reset + s390-dasd-ipl + clocks + qom + modules + block-coroutine-wrapper + multi-process + ebpf_rss + vfio-migration + qapi-code-gen + writing-monitor-commands + trivial-patches + submitting-a-patch + submitting-a-pull-request diff --git a/docs/devel/kconfig.rst b/docs/devel/kconfig.rst new file mode 100644 index 000000000..a1cdbec75 --- /dev/null +++ b/docs/devel/kconfig.rst @@ -0,0 +1,307 @@ +.. _kconfig: + +================ +QEMU and Kconfig +================ + +QEMU is a very versatile emulator; it can be built for a variety of +targets, where each target can emulate various boards and at the same +time different targets can share large amounts of code. For example, +a POWER and an x86 board can run the same code to emulate a PCI network +card, even though the boards use different PCI host bridges, and they +can run the same code to emulate a SCSI disk while using different +SCSI adapters. Arm, s390 and x86 boards can all present a virtio-blk +disk to their guests, but with three different virtio guest interfaces. + +Each QEMU target enables a subset of the boards, devices and buses that +are included in QEMU's source code. As a result, each QEMU executable +only links a small subset of the files that form QEMU's source code; +anything that is not needed to support a particular target is culled. + +QEMU uses a simple domain-specific language to describe the dependencies +between components. This is useful for two reasons: + +* new targets and boards can be added without knowing in detail the + architecture of the hardware emulation subsystems. Boards only have + to list the components they need, and the compiled executable will + include all the required dependencies and all the devices that the + user can add to that board; + +* users can easily build reduced versions of QEMU that support only a subset + of boards or devices. For example, by default most targets will include + all emulated PCI devices that QEMU supports, but the build process is + configurable and it is easy to drop unnecessary (or otherwise unwanted) + code to make a leaner binary. 
+ +This domain-specific language is based on the Kconfig language that +originated in the Linux kernel, though it was heavily simplified and +the handling of dependencies is stricter in QEMU. + +Unlike Linux, there is no user interface to edit the configuration, which +is instead specified in per-target files under the ``default-configs/`` +directory of the QEMU source tree. This is because, unlike Linux, +configuration and dependencies can be treated as a black box when building +QEMU; the default configuration that QEMU ships with should be okay in +almost all cases. + +The Kconfig language +-------------------- + +Kconfig defines configurable components in files named ``hw/*/Kconfig``. +Note that configurable components are _not_ visible in C code as preprocessor +symbols; they are only visible in the Makefile. Each configurable component +defines a Makefile variable whose name starts with ``CONFIG_``. + +All elements have boolean (true/false) type; truth is written as ``y``, while +falsehood is written ``n``. They are defined in a Kconfig +stanza like the following:: + + config ARM_VIRT + bool + imply PCI_DEVICES + imply VFIO_AMD_XGBE + imply VFIO_XGMAC + select A15MPCORE + select ACPI + select ARM_SMMUV3 + +The ``config`` keyword introduces a new configuration element. In the example +above, Makefiles will have access to a variable named ``CONFIG_ARM_VIRT``, +with value ``y`` or ``n`` (respectively for boolean true and false). + +Boolean expressions can be used within the language, whenever ``<expr>`` +is written in the remainder of this section. The ``&&``, ``||`` and +``!`` operators respectively denote conjunction (AND), disjunction (OR) +and negation (NOT). + +The ``bool`` data type declaration is optional, but it is suggested to +include it for clarity and future-proofing. After ``bool`` the following +directives can be included: + +**dependencies**: ``depends on <expr>`` + + This defines a dependency for this configurable element. Dependencies + evaluate an expression and force the value of the variable to false + if the expression is false. + +**reverse dependencies**: ``select <symbol> [if <expr>]`` + + While ``depends on`` can force a symbol to false, reverse dependencies can + be used to force another symbol to true. In the following example, + ``CONFIG_BAZ`` will be true whenever ``CONFIG_FOO`` is true:: + + config FOO + select BAZ + + The optional expression will prevent ``select`` from having any effect + unless it is true. + + Note that unlike Linux's Kconfig implementation, QEMU will detect + contradictions between ``depends on`` and ``select`` statements and prevent + you from building such a configuration. + +**default value**: ``default <value> [if <expr>]`` + + Default values are assigned to the config symbol if no other value was + set by the user via ``default-configs/*.mak`` files, and only if + ``select`` or ``depends on`` directives do not force the value to true + or false respectively. ``<value>`` can be ``y`` or ``n``; it cannot + be an arbitrary Boolean expression. However, a condition for applying + the default value can be added with ``if``. + + A configuration element can have any number of default values (usually, + if more than one default is present, they will have different + conditions). If multiple default values satisfy their condition, + only the first defined one is active. + +**reverse default** (weak reverse dependency): ``imply <symbol> [if <expr>]`` + + This is similar to ``select`` as it applies a lower limit of ``y`` + to another symbol. 
However, the lower limit is only a default + and the "implied" symbol's value may still be set to ``n`` from a + ``default-configs/*.mak`` files. The following two examples are + equivalent:: + + config FOO + bool + imply BAZ + + config BAZ + bool + default y if FOO + + The next section explains where to use ``imply`` or ``default y``. + +Guidelines for writing Kconfig files +------------------------------------ + +Configurable elements in QEMU fall under five broad groups. Each group +declares its dependencies in different ways: + +**subsystems**, of which **buses** are a special case + + Example:: + + config SCSI + bool + + Subsystems always default to false (they have no ``default`` directive) + and are never visible in ``default-configs/*.mak`` files. It's + up to other symbols to ``select`` whatever subsystems they require. + + They sometimes have ``select`` directives to bring in other required + subsystems or buses. For example, ``AUX`` (the DisplayPort auxiliary + channel "bus") selects ``I2C`` because it can act as an I2C master too. + +**devices** + + Example:: + + config MEGASAS_SCSI_PCI + bool + default y if PCI_DEVICES + depends on PCI + select SCSI + + Devices are the most complex of the five. They can have a variety + of directives that cooperate so that a default configuration includes + all the devices that can be accessed from QEMU. + + Devices *depend on* the bus that they lie on, for example a PCI + device would specify ``depends on PCI``. An MMIO device will likely + have no ``depends on`` directive. Devices also *select* the buses + that the device provides, for example a SCSI adapter would specify + ``select SCSI``. Finally, devices are usually ``default y`` if and + only if they have at least one ``depends on``; the default could be + conditional on a device group. + + Devices also select any optional subsystem that they use; for example + a video card might specify ``select EDID`` if it needs to build EDID + information and publish it to the guest. + +**device groups** + + Example:: + + config PCI_DEVICES + bool + + Device groups provide a convenient mechanism to enable/disable many + devices in one go. This is useful when a set of devices is likely to + be enabled/disabled by several targets. Device groups usually need + no directive and are not used in the Makefile either; they only appear + as conditions for ``default y`` directives. + + QEMU currently has two device groups, ``PCI_DEVICES`` and + ``TEST_DEVICES``. PCI devices usually have a ``default y if + PCI_DEVICES`` directive rather than just ``default y``. This lets + some boards (notably s390) easily support a subset of PCI devices, + for example only VFIO (passthrough) and virtio-pci devices. + ``TEST_DEVICES`` instead is used for devices that are rarely used on + production virtual machines, but provide useful hooks to test QEMU + or KVM. + +**boards** + + Example:: + + config SUN4M + bool + imply TCX + imply CG3 + select CS4231 + select ECCMEMCTL + select EMPTY_SLOT + select ESCC + select ESP + select FDC + select SLAVIO + select LANCE + select M48T59 + select STP2000 + + Boards specify their constituent devices using ``imply`` and ``select`` + directives. A device should be listed under ``select`` if the board + cannot be started at all without it. It should be listed under + ``imply`` if (depending on the QEMU command line) the board may or + may not be started without it. Boards also default to false; they are + enabled by the ``default-configs/*.mak`` for the target they apply to. 
+ +**internal elements** + + Example:: + + config ECCMEMCTL + bool + select ECC + + Internal elements group code that is useful in several boards or + devices. They are usually enabled with ``select`` and in turn select + other elements; they are never visible in ``default-configs/*.mak`` + files, and often not even in the Makefile. + +Writing and modifying default configurations +-------------------------------------------- + +In addition to the Kconfig files under hw/, each target also includes +a file called ``default-configs/TARGETNAME-softmmu.mak``. These files +initialize some Kconfig variables to non-default values and provide the +starting point to turn on devices and subsystems. + +A file in ``default-configs/`` looks like the following example:: + + # Default configuration for alpha-softmmu + + # Uncomment the following lines to disable these optional devices: + # + #CONFIG_PCI_DEVICES=n + #CONFIG_TEST_DEVICES=n + + # Boards: + # + CONFIG_DP264=y + +The first part, consisting of commented-out ``=n`` assignments, tells +the user which devices or device groups are implied by the boards. +The second part, consisting of ``=y`` assignments, tells the user which +boards are supported by the target. The user will typically modify +the default configuration by uncommenting lines in the first group, +or commenting out lines in the second group. + +It is also possible to run QEMU's configure script with the +``--without-default-devices`` option. When this is done, everything defaults +to ``n`` unless it is ``select``ed or explicitly switched on in the +``.mak`` files. In other words, ``default`` and ``imply`` directives +are disabled. When QEMU is built with this option, the user will probably +want to change some lines in the first group, for example like this:: + + CONFIG_PCI_DEVICES=y + #CONFIG_TEST_DEVICES=n + +and/or pick a subset of the devices in those device groups. Right now +there is no single place that lists all the optional devices for +``CONFIG_PCI_DEVICES`` and ``CONFIG_TEST_DEVICES``. In the future, +we expect that ``.mak`` files will be automatically generated, so that +they will include all these symbols and some help text on what they do. + +``Kconfig.host`` +---------------- + +In some special cases, a configurable element depends on host features +that are detected by QEMU's configure or ``meson.build`` scripts; for +example some devices depend on the availability of KVM or on the presence +of a library on the host. + +These symbols should be listed in ``Kconfig.host`` like this:: + + config TPM + bool + +and also listed as follows in the top-level meson.build's host_kconfig +variable:: + + host_kconfig = \ + ('CONFIG_TPM' in config_host ? ['CONFIG_TPM=y'] : []) + \ + ('CONFIG_SPICE' in config_host ? ['CONFIG_SPICE=y'] : []) + \ + (have_ivshmem ? ['CONFIG_IVSHMEM=y'] : []) + \ + ... diff --git a/docs/devel/loads-stores.rst b/docs/devel/loads-stores.rst new file mode 100644 index 000000000..8f0035c82 --- /dev/null +++ b/docs/devel/loads-stores.rst @@ -0,0 +1,558 @@ +.. + Copyright (c) 2017 Linaro Limited + Written by Peter Maydell + +=================== +Load and Store APIs +=================== + +QEMU internally has multiple families of functions for performing +loads and stores. This document attempts to enumerate them all +and indicate when to use them. It does not provide detailed +documentation of each API -- for that you should look at the +documentation comments in the relevant header files. 
+
+
+``ld*_p and st*_p``
+~~~~~~~~~~~~~~~~~~~
+
+These functions operate on a host pointer, and should be used
+when you already have a pointer into host memory (corresponding
+to guest ram or a local buffer). They deal with doing accesses
+with the desired endianness and with correctly handling
+potentially unaligned pointer values.
+
+Function names follow the pattern:
+
+load: ``ld{sign}{size}_{endian}_p(ptr)``
+
+store: ``st{size}_{endian}_p(ptr, val)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``he`` : host endian
+ - ``be`` : big endian
+ - ``le`` : little endian
+
+The ``_{endian}`` infix is omitted for target-endian accesses.
+
+The target endian accessors are only available to source
+files which are built per-target.
+
+There are also functions which take the size as an argument:
+
+load: ``ldn{endian}_p(ptr, sz)``
+
+which performs an unsigned load of ``sz`` bytes from ``ptr``
+as an ``{endian}`` order value and returns it in a uint64_t.
+
+store: ``stn{endian}_p(ptr, sz, val)``
+
+which stores ``val`` to ``ptr`` as an ``{endian}`` order value
+of size ``sz`` bytes.
+
+
+Regexes for git grep
+ - ``\<ld[us]\?[bwlq]\(_[hbl]e\)\?_p\>``
+ - ``\<st[bwlq]\(_[hbl]e\)\?_p\>``
+ - ``\<ldn_\([hbl]e\)\?_p\>``
+ - ``\<stn_\([hbl]e\)\?_p\>``
+
+``cpu_{ld,st}*_mmu``
+~~~~~~~~~~~~~~~~~~~~
+
+These functions operate on a guest virtual address, plus a context
+known as a "mmu index" which controls how that virtual address is
+translated, plus a ``MemOp`` which contains alignment requirements
+among other things. The ``MemOp`` and mmu index are combined into
+a single argument of type ``MemOpIdx``.
+
+The meaning of the indexes is target specific, but specifying a
+particular index might be necessary if, for instance, the helper
+requires an "always as non-privileged" access rather than the
+default access for the current state of the guest CPU.
+
+These functions may cause a guest CPU exception to be taken
+(e.g. for an alignment fault or MMU fault) which will result in
+guest CPU state being updated and control longjmp'ing out of the
+function call. They should therefore only be used in code that is
+implementing emulation of the guest CPU.
+
+The ``retaddr`` parameter is used to control unwinding of the
+guest CPU state in case of a guest CPU exception. This is passed
+to ``cpu_restore_state()``. Therefore the value should either be 0,
+to indicate that the guest CPU state is already synchronized, or
+the result of ``GETPC()`` from the top level ``HELPER(foo)``
+function, which is a return address into the generated code [#gpc]_.
+
+.. [#gpc] Note that ``GETPC()`` should be used with great care: calling
+          it in other functions that are *not* the top level
+          ``HELPER(foo)`` will cause unexpected behavior. Instead, the
+          value of ``GETPC()`` should be read from the helper and passed
+          if needed to the functions that the helper calls.
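+
+For illustration, here is a minimal sketch of a top level helper issuing
+an aligned, target-endian 64-bit load with these functions (the helper
+name is hypothetical, and ``MO_TEQ | MO_ALIGN`` is just one plausible
+choice of ``MemOp`` flags)::
+
+    uint64_t HELPER(my_ldq)(CPUArchState *env, target_ulong addr)
+    {
+        /* Combine the MemOp and the mmu index into one MemOpIdx. */
+        MemOpIdx oi = make_memop_idx(MO_TEQ | MO_ALIGN,
+                                     cpu_mmu_index(env, false));
+
+        /* GETPC() is valid here because this is the top level helper. */
+        return cpu_ldq_mmu(env, addr, oi, GETPC());
+    }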
+ +Function names follow the pattern: + +load: ``cpu_ld{size}{end}_mmu(env, ptr, oi, retaddr)`` + +store: ``cpu_st{size}{end}_mmu(env, ptr, val, oi, retaddr)`` + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep: + - ``\<cpu_ld[bwlq](_[bl]e)\?_mmu\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_mmu\>`` + + +``cpu_{ld,st}*_mmuidx_ra`` +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions work like the ``cpu_{ld,st}_mmu`` functions except +that the ``mmuidx`` parameter is not combined with a ``MemOp``, +and therefore there is no required alignment supplied or enforced. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}{end}_mmuidx_ra(env, ptr, mmuidx, retaddr)`` + +store: ``cpu_st{size}{end}_mmuidx_ra(env, ptr, val, mmuidx, retaddr)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep: + - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_mmuidx_ra\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_mmuidx_ra\>`` + +``cpu_{ld,st}*_data_ra`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions work like the ``cpu_{ld,st}_mmuidx_ra`` functions +except that the ``mmuidx`` parameter is taken from the current mode +of the guest CPU, as determined by ``cpu_mmu_index(env, false)``. + +These are generally the preferred way to do accesses by guest +virtual address from helper functions, unless the access should +be performed with a context other than the default, or alignment +should be enforced for the access. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}{end}_data_ra(env, ptr, ra)`` + +store: ``cpu_st{size}{end}_data_ra(env, ptr, val, ra)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep: + - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data_ra\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_data_ra\>`` + +``cpu_{ld,st}*_data`` +~~~~~~~~~~~~~~~~~~~~~ + +These functions work like the ``cpu_{ld,st}_data_ra`` functions +except that the ``retaddr`` parameter is 0, and thus does not +unwind guest CPU state. + +This means they must only be used from helper functions where the +translator has saved all necessary CPU state. These functions are +the right choice for calls made from hooks like the CPU ``do_interrupt`` +hook or when you know for certain that the translator had to save all +the CPU state anyway. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}{end}_data(env, ptr)`` + +store: ``cpu_st{size}{end}_data(env, ptr, val)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``end`` + - (empty) : for target endian, or 8 bit sizes + - ``_be`` : big endian + - ``_le`` : little endian + +Regexes for git grep + - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data\>`` + - ``\<cpu_st[bwlq](_[bl]e)\?_data\+\>`` + +``cpu_ld*_code`` +~~~~~~~~~~~~~~~~ + +These functions perform a read for instruction execution. 
The ``mmuidx`` +parameter is taken from the current mode of the guest CPU, as determined +by ``cpu_mmu_index(env, true)``. The ``retaddr`` parameter is 0, and +thus does not unwind guest CPU state, because CPU state is always +synchronized while translating instructions. Any guest CPU exception +that is raised will indicate an instruction execution fault rather than +a data read fault. + +In general these functions should not be used directly during translation. +There are wrapper functions that are to be used which also take care of +plugins for tracing. + +Function names follow the pattern: + +load: ``cpu_ld{sign}{size}_code(env, ptr)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +Regexes for git grep: + - ``\<cpu_ld[us]\?[bwlq]_code\>`` + +``translator_ld*`` +~~~~~~~~~~~~~~~~~~ + +These functions are a wrapper for ``cpu_ld*_code`` which also perform +any actions required by any tracing plugins. They are only to be +called during the translator callback ``translate_insn``. + +There is a set of functions ending in ``_swap`` which, if the parameter +is true, returns the value in the endianness that is the reverse of +the guest native endianness, as determined by ``TARGET_WORDS_BIGENDIAN``. + +Function names follow the pattern: + +load: ``translator_ld{sign}{size}(env, ptr)`` + +swap: ``translator_ld{sign}{size}_swap(env, ptr, swap)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +Regexes for git grep + - ``\<translator_ld[us]\?[bwlq]\(_swap\)\?\>`` + +``helper_*_{ld,st}*_mmu`` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions are intended primarily to be called by the code +generated by the TCG backend. They may also be called by target +CPU helper function code. Like the ``cpu_{ld,st}_mmuidx_ra`` functions +they perform accesses by guest virtual address, with a given ``mmuidx``. + +These functions specify an ``opindex`` parameter which encodes +(among other things) the mmu index to use for the access. This parameter +should be created by calling ``make_memop_idx()``. + +The ``retaddr`` parameter should be the result of GETPC() called directly +from the top level HELPER(foo) function (or 0 if no guest CPU state +unwinding is required). + +**TODO** The names of these functions are a bit odd for historical +reasons because they were originally expected to be called only from +within generated code. We should rename them to bring them more in +line with the other memory access functions. The explicit endianness +is the only feature they have beyond ``*_mmuidx_ra``. + +load: ``helper_{endian}_ld{sign}{size}_mmu(env, addr, opindex, retaddr)`` + +store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + - ``s`` : signed + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + - ``ret`` : target endianness + +Regexes for git grep + - ``\<helper_\(le\|be\|ret\)_ld[us]\?[bwlq]_mmu\>`` + - ``\<helper_\(le\|be\|ret\)_st[bwlq]_mmu\>`` + +``address_space_*`` +~~~~~~~~~~~~~~~~~~~ + +These functions are the primary ones to use when emulating CPU +or device memory accesses. 
They take an AddressSpace, which is the +way QEMU defines the view of memory that a device or CPU has. +(They generally correspond to being the "master" end of a hardware bus +or bus fabric.) + +Each CPU has an AddressSpace. Some kinds of CPU have more than +one AddressSpace (for instance Arm guest CPUs have an AddressSpace +for the Secure world and one for NonSecure if they implement TrustZone). +Devices which can do DMA-type operations should generally have an +AddressSpace. There is also a "system address space" which typically +has all the devices and memory that all CPUs can see. (Some older +device models use the "system address space" rather than properly +modelling that they have an AddressSpace of their own.) + +Functions are provided for doing byte-buffer reads and writes, +and also for doing one-data-item loads and stores. + +In all cases the caller provides a MemTxAttrs to specify bus +transaction attributes, and can check whether the memory transaction +succeeded using a MemTxResult return code. + +``address_space_read(address_space, addr, attrs, buf, len)`` + +``address_space_write(address_space, addr, attrs, buf, len)`` + +``address_space_rw(address_space, addr, attrs, buf, len, is_write)`` + +``address_space_ld{sign}{size}_{endian}(address_space, addr, attrs, txresult)`` + +``address_space_st{size}_{endian}(address_space, addr, val, attrs, txresult)`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + +(No signed load operations are provided.) + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + +The ``_{endian}`` suffix is omitted for byte accesses. + +Regexes for git grep + - ``\<address_space_\(read\|write\|rw\)\>`` + - ``\<address_space_ldu\?[bwql]\(_[lb]e\)\?\>`` + - ``\<address_space_st[bwql]\(_[lb]e\)\?\>`` + +``address_space_write_rom`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This function performs a write by physical address like +``address_space_write``, except that if the write is to a ROM then +the ROM contents will be modified, even though a write by the guest +CPU to the ROM would be ignored. This is used for non-guest writes +like writes from the gdb debug stub or initial loading of ROM contents. + +Note that portions of the write which attempt to write data to a +device will be silently ignored -- only real RAM and ROM will +be written to. + +Regexes for git grep + - ``address_space_write_rom`` + +``{ld,st}*_phys`` +~~~~~~~~~~~~~~~~~ + +These are functions which are identical to +``address_space_{ld,st}*``, except that they always pass +``MEMTXATTRS_UNSPECIFIED`` for the transaction attributes, and ignore +whether the transaction succeeded or failed. + +The fact that they ignore whether the transaction succeeded means +they should not be used in new code, unless you know for certain +that your code will only be used in a context where the CPU or +device doing the access has no way to report such an error. + +``load: ld{sign}{size}_{endian}_phys`` + +``store: st{size}_{endian}_phys`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + +(No signed load operations are provided.) + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + +The ``_{endian}_`` infix is omitted for byte accesses. 
+ +Regexes for git grep + - ``\<ldu\?[bwlq]\(_[bl]e\)\?_phys\>`` + - ``\<st[bwlq]\(_[bl]e\)\?_phys\>`` + +``cpu_physical_memory_*`` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +These are convenience functions which are identical to +``address_space_*`` but operate specifically on the system address space, +always pass a ``MEMTXATTRS_UNSPECIFIED`` set of memory attributes and +ignore whether the memory transaction succeeded or failed. +For new code they are better avoided: + +* there is likely to be behaviour you need to model correctly for a + failed read or write operation +* a device should usually perform operations on its own AddressSpace + rather than using the system address space + +``cpu_physical_memory_read`` + +``cpu_physical_memory_write`` + +``cpu_physical_memory_rw`` + +Regexes for git grep + - ``\<cpu_physical_memory_\(read\|write\|rw\)\>`` + +``cpu_memory_rw_debug`` +~~~~~~~~~~~~~~~~~~~~~~~ + +Access CPU memory by virtual address for debug purposes. + +This function is intended for use by the GDB stub and similar code. +It takes a virtual address, converts it to a physical address via +an MMU lookup using the current settings of the specified CPU, +and then performs the access (using ``address_space_rw`` for +reads or ``cpu_physical_memory_write_rom`` for writes). +This means that if the access is a write to a ROM then this +function will modify the contents (whereas a normal guest CPU access +would ignore the write attempt). + +``cpu_memory_rw_debug`` + +``dma_memory_*`` +~~~~~~~~~~~~~~~~ + +These behave like ``address_space_*``, except that they perform a DMA +barrier operation first. + +**TODO**: We should provide guidance on when you need the DMA +barrier operation and when it's OK to use ``address_space_*``, and +make sure our existing code is doing things correctly. + +``dma_memory_read`` + +``dma_memory_write`` + +``dma_memory_rw`` + +Regexes for git grep + - ``\<dma_memory_\(read\|write\|rw\)\>`` + - ``\<ldu\?[bwlq]\(_[bl]e\)\?_dma\>`` + - ``\<st[bwlq]\(_[bl]e\)\?_dma\>`` + +``pci_dma_*`` and ``{ld,st}*_pci_dma`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These functions are specifically for PCI device models which need to +perform accesses where the PCI device is a bus master. You pass them a +``PCIDevice *`` and they will do ``dma_memory_*`` operations on the +correct address space for that device. + +``pci_dma_read`` + +``pci_dma_write`` + +``pci_dma_rw`` + +``load: ld{sign}{size}_{endian}_pci_dma`` + +``store: st{size}_{endian}_pci_dma`` + +``sign`` + - (empty) : for 32 or 64 bit sizes + - ``u`` : unsigned + +(No signed load operations are provided.) + +``size`` + - ``b`` : 8 bits + - ``w`` : 16 bits + - ``l`` : 32 bits + - ``q`` : 64 bits + +``endian`` + - ``le`` : little endian + - ``be`` : big endian + +The ``_{endian}_`` infix is omitted for byte accesses. + +Regexes for git grep + - ``\<pci_dma_\(read\|write\|rw\)\>`` + - ``\<ldu\?[bwlq]\(_[bl]e\)\?_pci_dma\>`` + - ``\<st[bwlq]\(_[bl]e\)\?_pci_dma\>`` diff --git a/docs/devel/lockcnt.txt b/docs/devel/lockcnt.txt new file mode 100644 index 000000000..a3fb3bc5d --- /dev/null +++ b/docs/devel/lockcnt.txt @@ -0,0 +1,277 @@ +DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt) +=================================================== + +QEMU often uses reference counts to track data structures that are being +accessed and should not be freed. 
For example, a loop that invokes
+callbacks like this is not safe:
+
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->revents & G_IO_OUT) {
+            ioh->fd_write(ioh->opaque);
+        }
+    }
+
+QLIST_FOREACH_SAFE protects against deletion of the current node (ioh)
+by stashing away its "next" pointer. However, ioh->fd_write could
+actually delete the next node from the list. The simplest way to
+avoid this is to mark the node as deleted, and remove it from the
+list in the above loop:
+
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            QLIST_REMOVE(ioh, next);
+            g_free(ioh);
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+
+If however this loop must also be reentrant, i.e. it is possible that
+ioh->fd_write invokes the loop again, some kind of counting is needed:
+
+    walking_handlers++;
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            if (walking_handlers == 1) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+    walking_handlers--;
+
+One may think of using the RCU primitives, rcu_read_lock() and
+rcu_read_unlock(); effectively, the RCU nesting count would take
+the place of the walking_handlers global variable. Indeed,
+reference counting and RCU have similar purposes, but their usage in
+general is complementary:
+
+- reference counting is fine-grained and limited to a single data
+  structure; RCU delays reclamation of *all* RCU-protected data
+  structures;
+
+- reference counting works even in the presence of code that keeps
+  a reference for a long time; RCU critical sections in principle
+  should be kept short;
+
+- reference counting is often applied to code that is not thread-safe
+  but is reentrant; in fact, usage of reference counting in QEMU predates
+  the introduction of threads by many years. RCU is generally used to
+  protect readers from other threads freeing memory after concurrent
+  modifications to a data structure.
+
+- reclaiming data can be done by a separate thread in the case of RCU;
+  this can improve performance, but also delay reclamation undesirably.
+  With reference counting, reclamation is deterministic.
+
+This file documents QemuLockCnt, an abstraction for using reference
+counting in code that has to be both thread-safe and reentrant.
+
+
+QemuLockCnt concepts
+--------------------
+
+A QemuLockCnt comprises both a counter and a mutex; it has primitives
+to increment and decrement the counter, and to take and release the
+mutex. The counter notes how many visits to the data structures are
+taking place (the visits could be from different threads, or there could
+be multiple reentrant visits from the same thread). The basic rules
+governing the counter/mutex pair then are the following:
+
+- Data protected by the QemuLockCnt must not be freed unless the
+  counter is zero and the mutex is taken.
+
+- A new visit cannot be started while the counter is zero and the
+  mutex is taken.
+
+Most of the time, the mutex protects all writes to the data structure,
+not just frees, though there could be cases where this is not necessary.
+
+Reads, instead, can be done without taking the mutex, as long as the
+readers and writers use the same macros that are used for RCU, for
+example qatomic_rcu_read, qatomic_rcu_set, QLIST_FOREACH_RCU, etc. This is
+because the reads are done outside a lock and a set or QLIST_INSERT_HEAD
+can happen concurrently with the read. The RCU API ensures that the
+processor and the compiler see all required memory barriers.
+
+This could be implemented simply by protecting the counter with the
+mutex, for example:
+
+    // (1)
+    qemu_mutex_lock(&walking_handlers_mutex);
+    walking_handlers++;
+    qemu_mutex_unlock(&walking_handlers_mutex);
+
+    ...
+
+    // (2)
+    qemu_mutex_lock(&walking_handlers_mutex);
+    if (--walking_handlers == 0) {
+        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+            if (ioh->deleted) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        }
+    }
+    qemu_mutex_unlock(&walking_handlers_mutex);
+
+Here, no frees can happen in the code represented by the ellipsis.
+If another thread is executing critical section (2), that part of
+the code cannot be entered, because the thread will not be able
+to increment the walking_handlers variable. And of course
+during the visit any other thread will see a nonzero value for
+walking_handlers, as in the single-threaded code.
+
+Note that it is possible for multiple concurrent accesses to delay
+the cleanup arbitrarily; in other words, for the walking_handlers
+counter to never become zero. For this reason, this technique is
+more easily applicable if concurrent access to the structure is rare.
+
+However, critical sections are easy to forget since you have to do
+them for each modification of the counter. QemuLockCnt ensures that
+all modifications of the counter take the lock appropriately, and it
+can also be more efficient in two ways:
+
+- it avoids taking the lock for many operations (for example
+  incrementing the counter while it is non-zero);
+
+- on some platforms, one can implement QemuLockCnt to hold the counter
+  and the mutex in a single word, making the fast path no more expensive
+  than simply managing a counter using atomic operations (see
+  docs/devel/atomics.rst). This can be very helpful if concurrent access to
+  the data structure is expected to be rare.
+
+
+Using the same mutex for frees and writes can still incur some small
+inefficiencies; for example, a visit can never start if the counter is
+zero and the mutex is taken---even if the mutex is taken by a write,
+which in principle need not block a visit of the data structure.
+However, these are usually not a problem if any of the following
+assumptions are valid:
+
+- concurrent access is possible but rare
+
+- writes are rare
+
+- writes are frequent, but this kind of write (e.g. appending to a
+  list) has a very small critical section.
+
+For example, QEMU uses QemuLockCnt to manage an AioContext's list of
+bottom halves and file descriptor handlers. Modifications to the list
+of file descriptor handlers are rare. Creation of a new bottom half is
+frequent and can happen on a fast path; however: 1) it is almost never
+concurrent with a visit to the list of bottom halves; 2) it only has
+three instructions in the critical path, two assignments and a smp_wmb().
+
+
+QemuLockCnt API
+---------------
+
+The QemuLockCnt API is described in include/qemu/thread.h.
+
+
+QemuLockCnt usage
+-----------------
+
+This section explains the typical usage patterns for QemuLockCnt functions.
+
+Setting a variable to a non-NULL value can be done between
+qemu_lockcnt_lock and qemu_lockcnt_unlock:
+
+    qemu_lockcnt_lock(&xyz_lockcnt);
+    if (!xyz) {
+        new_xyz = g_new(XYZ, 1);
+        ...
+ qatomic_rcu_set(&xyz, new_xyz); + } + qemu_lockcnt_unlock(&xyz_lockcnt); + +Accessing the value can be done between qemu_lockcnt_inc and +qemu_lockcnt_dec: + + qemu_lockcnt_inc(&xyz_lockcnt); + if (xyz) { + XYZ *p = qatomic_rcu_read(&xyz); + ... + /* Accesses can now be done through "p". */ + } + qemu_lockcnt_dec(&xyz_lockcnt); + +Freeing the object can similarly use qemu_lockcnt_lock and +qemu_lockcnt_unlock, but you also need to ensure that the count +is zero (i.e. there is no concurrent visit). Because qemu_lockcnt_inc +takes the QemuLockCnt's lock, the count cannot become non-zero while +the object is being freed. Freeing an object looks like this: + + qemu_lockcnt_lock(&xyz_lockcnt); + if (!qemu_lockcnt_count(&xyz_lockcnt)) { + g_free(xyz); + xyz = NULL; + } + qemu_lockcnt_unlock(&xyz_lockcnt); + +If an object has to be freed right after a visit, you can combine +the decrement, the locking and the check on count as follows: + + qemu_lockcnt_inc(&xyz_lockcnt); + if (xyz) { + XYZ *p = qatomic_rcu_read(&xyz); + ... + /* Accesses can now be done through "p". */ + } + if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) { + g_free(xyz); + xyz = NULL; + qemu_lockcnt_unlock(&xyz_lockcnt); + } + +QemuLockCnt can also be used to access a list as follows: + + qemu_lockcnt_inc(&io_handlers_lockcnt); + QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + + if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) { + QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + } + } + qemu_lockcnt_unlock(&io_handlers_lockcnt); + } + +Again, the RCU primitives are used because new items can be added to the +list during the walk. QLIST_FOREACH_RCU ensures that the processor and +the compiler see the appropriate memory barriers. + +An alternative pattern uses qemu_lockcnt_dec_if_lock: + + qemu_lockcnt_inc(&io_handlers_lockcnt); + QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) { + if (ioh->deleted) { + if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) { + QLIST_REMOVE(ioh, next); + g_free(ioh); + qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt); + } + } else { + if (ioh->revents & G_IO_OUT) { + ioh->fd_write(ioh->opaque); + } + } + } + qemu_lockcnt_dec(&io_handlers_lockcnt); + +Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock, +because there is no special task to do if the count goes from 1 to 0. diff --git a/docs/devel/memory.rst b/docs/devel/memory.rst new file mode 100644 index 000000000..5dc8a1268 --- /dev/null +++ b/docs/devel/memory.rst @@ -0,0 +1,368 @@ +============== +The memory API +============== + +The memory API models the memory and I/O buses and controllers of a QEMU +machine. It attempts to allow modelling of: + +- ordinary RAM +- memory-mapped I/O (MMIO) +- memory controllers that can dynamically reroute physical memory regions + to different destinations + +The memory model provides support for + +- tracking RAM changes by the guest +- setting up coalesced memory for kvm +- setting up ioeventfd regions for kvm + +Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks +(leaves) are RAM and MMIO regions, while other nodes represent +buses, memory controllers, and memory regions that have been rerouted. + +In addition to MemoryRegion objects, the memory API provides AddressSpace +objects for every root and possibly for intermediate MemoryRegions too. +These represent memory as seen from the CPU or a device's viewpoint. 
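+
+For orientation, a minimal sketch (the names are hypothetical) of a
+device creating a root region and an AddressSpace that gives the
+device its own view of memory might look like this::
+
+    MemoryRegion root;
+    AddressSpace as;
+
+    /* An empty root container spanning the whole 64-bit range. */
+    memory_region_init(&root, owner, "mydev-root", UINT64_MAX);
+    address_space_init(&as, &root, "mydev-dma");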
+
+Types of regions
+----------------
+
+There are multiple types of memory regions (all represented by a single C type
+MemoryRegion):
+
+- RAM: a RAM region is simply a range of host memory that can be made available
+  to the guest.
+  You typically initialize these with memory_region_init_ram(). Some special
+  purposes require the variants memory_region_init_resizeable_ram(),
+  memory_region_init_ram_from_file(), or memory_region_init_ram_ptr().
+
+- MMIO: a range of guest memory that is implemented by host callbacks;
+  each read or write causes a callback to be called on the host.
+  You initialize these with memory_region_init_io(), passing it a
+  MemoryRegionOps structure describing the callbacks.
+
+- ROM: a ROM memory region works like RAM for reads (directly accessing
+  a region of host memory), and forbids writes. You initialize these with
+  memory_region_init_rom().
+
+- ROM device: a ROM device memory region works like RAM for reads
+  (directly accessing a region of host memory), but like MMIO for
+  writes (invoking a callback). You initialize these with
+  memory_region_init_rom_device().
+
+- IOMMU region: an IOMMU region translates addresses of accesses made to it
+  and forwards them to some other target memory region. As the name suggests,
+  these are only needed for modelling an IOMMU, not for simple devices.
+  You initialize these with memory_region_init_iommu().
+
+- container: a container simply includes other memory regions, each at
+  a different offset. Containers are useful for grouping several regions
+  into one unit. For example, a PCI BAR may be composed of a RAM region
+  and an MMIO region.
+
+  A container's subregions are usually non-overlapping. In some cases it is
+  useful to have overlapping regions; for example a memory controller that
+  can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
+  that does not prevent cards from claiming overlapping BARs.
+
+  You initialize a pure container with memory_region_init().
+
+- alias: a subsection of another region. Aliases allow a region to be
+  split apart into discontiguous regions. Examples of uses are memory banks
+  used when the guest address space is smaller than the amount of RAM
+  addressed, or a memory controller that splits main memory to expose a "PCI
+  hole". Aliases may point to any type of region, including other aliases,
+  but an alias may not point back to itself, directly or indirectly.
+  You initialize these with memory_region_init_alias().
+
+- reservation region: a reservation region is primarily for debugging.
+  It claims I/O space that is not supposed to be handled by QEMU itself.
+  The typical use is to track parts of the address space which will be
+  handled by the host kernel when KVM is enabled. You initialize these
+  by passing a NULL callback parameter to memory_region_init_io().
+
+It is valid to add subregions to a region which is not a pure container
+(that is, to an MMIO, RAM or ROM region). This means that the region
+will act like a container, except that any addresses within the container's
+region which are not claimed by any subregion are handled by the
+container itself (i.e. by its MMIO callbacks or RAM backing). However
+it is generally possible to achieve the same effect with a pure container
+one of whose subregions is a low priority "background" region covering
+the whole address range; this is often clearer and is preferred.
+Subregions cannot be added to an alias region.
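+
+As a concrete illustration, here is a minimal sketch of a hypothetical
+device that builds a container holding an MMIO subregion and a RAM
+subregion (the device type, callbacks and sizes are invented for the
+example)::
+
+    static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
+    {
+        return 0; /* real code would decode addr into a register value */
+    }
+
+    static void mydev_write(void *opaque, hwaddr addr,
+                            uint64_t val, unsigned size)
+    {
+        /* register writes would be decoded here */
+    }
+
+    static const MemoryRegionOps mydev_ops = {
+        .read = mydev_read,
+        .write = mydev_write,
+        .endianness = DEVICE_NATIVE_ENDIAN,
+    };
+
+    static void mydev_init_regions(MyDevState *s, Object *owner, Error **errp)
+    {
+        /* A 64 KiB container: 4 KiB of registers followed by a buffer. */
+        memory_region_init(&s->container, owner, "mydev", 0x10000);
+        memory_region_init_io(&s->regs, owner, &mydev_ops, s,
+                              "mydev-regs", 0x1000);
+        memory_region_init_ram(&s->buf, owner, "mydev-buf", 0xf000, errp);
+        memory_region_add_subregion(&s->container, 0x0000, &s->regs);
+        memory_region_add_subregion(&s->container, 0x1000, &s->buf);
+    }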
+ +Migration +--------- + +Where the memory region is backed by host memory (RAM, ROM and +ROM device memory region types), this host memory needs to be +copied to the destination on migration. These APIs which allocate +the host memory for you will also register the memory so it is +migrated: + +- memory_region_init_ram() +- memory_region_init_rom() +- memory_region_init_rom_device() + +For most devices and boards this is the correct thing. If you +have a special case where you need to manage the migration of +the backing memory yourself, you can call the functions: + +- memory_region_init_ram_nomigrate() +- memory_region_init_rom_nomigrate() +- memory_region_init_rom_device_nomigrate() + +which only initialize the MemoryRegion and leave handling +migration to the caller. + +The functions: + +- memory_region_init_resizeable_ram() +- memory_region_init_ram_from_file() +- memory_region_init_ram_from_fd() +- memory_region_init_ram_ptr() +- memory_region_init_ram_device_ptr() + +are for special cases only, and so they do not automatically +register the backing memory for migration; the caller must +manage migration if necessary. + +Region names +------------ + +Regions are assigned names by the constructor. For most regions these are +only used for debugging purposes, but RAM regions also use the name to identify +live migration sections. This means that RAM region names need to have ABI +stability. + +Region lifecycle +---------------- + +A region is created by one of the memory_region_init*() functions and +attached to an object, which acts as its owner or parent. QEMU ensures +that the owner object remains alive as long as the region is visible to +the guest, or as long as the region is in use by a virtual CPU or another +device. For example, the owner object will not die between an +address_space_map operation and the corresponding address_space_unmap. + +After creation, a region can be added to an address space or a +container with memory_region_add_subregion(), and removed using +memory_region_del_subregion(). + +Various region attributes (read-only, dirty logging, coalesced mmio, +ioeventfd) can be changed during the region lifecycle. They take effect +as soon as the region is made visible. This can be immediately, later, +or never. + +Destruction of a memory region happens automatically when the owner +object dies. + +If however the memory region is part of a dynamically allocated data +structure, you should call object_unparent() to destroy the memory region +before the data structure is freed. For an example see VFIOMSIXInfo +and VFIOQuirk in hw/vfio/pci.c. + +You must not destroy a memory region as long as it may be in use by a +device or CPU. In order to do this, as a general rule do not create or +destroy memory regions dynamically during a device's lifetime, and only +call object_unparent() in the memory region owner's instance_finalize +callback. The dynamically allocated data structure that contains the +memory region then should obviously be freed in the instance_finalize +callback as well. + +If you break this rule, the following situation can happen: + +- the memory region's owner had a reference taken via memory_region_ref + (for example by address_space_map) + +- the region is unparented, and has no owner anymore + +- when address_space_unmap is called, the reference to the memory region's + owner is leaked. + + +There is an exception to the above rule: it is okay to call +object_unparent at any time for an alias or a container region. 
It is
+therefore also okay to create or destroy alias and container regions
+dynamically during a device's lifetime.
+
+This exceptional usage is valid because aliases and containers only help
+QEMU build the guest's memory map; they are never accessed directly.
+memory_region_ref and memory_region_unref are never called on aliases
+or containers, and the above situation then cannot happen. Exploiting
+this exception is rarely necessary, and therefore it is discouraged,
+but nevertheless it is used in a few places.
+
+For regions that "have no owner" (NULL is passed at creation time), the
+machine object is actually used as the owner. Since instance_finalize is
+never called for the machine object, you must never call object_unparent
+on regions that have no owner, unless they are aliases or containers.
+
+
+Overlapping regions and priority
+--------------------------------
+Usually, regions may not overlap each other; a memory address decodes into
+exactly one target. In some cases it is useful to allow regions to overlap,
+and sometimes to control which of the overlapping regions is visible to the
+guest. This is done with memory_region_add_subregion_overlap(), which
+allows the region to overlap any other region in the same container, and
+specifies a priority that allows the core to decide which of two regions at
+the same address is visible (highest wins).
+Priority values are signed, and the default value is zero. This means that
+you can use memory_region_add_subregion_overlap() both to specify a region
+that must sit 'above' any others (with a positive priority) and also a
+background region that sits 'below' others (with a negative priority).
+
+If the higher priority region in an overlap is a container or alias, then
+the lower priority region will appear in any "holes" that the higher priority
+region has left by not mapping subregions to that area of its address range.
+(This applies recursively -- if the subregions are themselves containers or
+aliases that leave holes then the lower priority region will appear in these
+holes too.)
+
+For example, suppose we have a container A of size 0x8000 with two subregions
+B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
+an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
+of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
+offset 0x2000. As a diagram::
+
+        0      1000   2000   3000   4000   5000   6000   7000   8000
+        |------|------|------|------|------|------|------|------|
+  A:    [                                                      ]
+  C:    [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
+  B:                  [                          ]
+  D:                  [DDDDD]
+  E:                                [EEEEE]
+
+The regions that will be seen within this address range then are::
+
+  [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]
+
+Since B has higher priority than C, its subregions appear in the flat map
+even where they overlap with C. In ranges where B has not mapped anything
+C's region appears.
+
+If B had provided its own MMIO operations (i.e. it was not a pure container)
+then these would be used for any addresses in its range not handled by
+D or E, and the result would be::
+
+  [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]
+
+Priority values are local to a container, because the priorities of two
+regions are only compared when they are both children of the same container.
+This means that the device in charge of the container (typically modelling
+a bus or a memory controller) can use them to manage the interaction of
+its child regions without any side effects on other parts of the system.
+In the example above, the priorities of D and E are unimportant because +they do not overlap each other. It is the relative priority of B and C +that causes D and E to appear on top of C: D and E's priorities are never +compared against the priority of C. + +Visibility +---------- +The memory core uses the following rules to select a memory region when the +guest accesses an address: + +- all direct subregions of the root region are matched against the address, in + descending priority order + + - if the address lies outside the region offset/size, the subregion is + discarded + - if the subregion is a leaf (RAM or MMIO), the search terminates, returning + this leaf region + - if the subregion is a container, the same algorithm is used within the + subregion (after the address is adjusted by the subregion offset) + - if the subregion is an alias, the search is continued at the alias target + (after the address is adjusted by the subregion offset and alias offset) + - if a recursive search within a container or alias subregion does not + find a match (because of a "hole" in the container's coverage of its + address range), then if this is a container with its own MMIO or RAM + backing the search terminates, returning the container itself. Otherwise + we continue with the next subregion in priority order + +- if none of the subregions match the address then the search terminates + with no match found + +Example memory map +------------------ + +:: + + system_memory: container@0-2^48-1 + | + +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff) + | + +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff) + | + +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff) + | (prio 1) + | + +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff) + + pci (0-2^32-1) + | + +--- vga-area: container@0xa0000-0xbffff + | | + | +--- alias@0x00000-0x7fff ---> #vram (0x010000-0x017fff) + | | + | +--- alias@0x08000-0xffff ---> #vram (0x020000-0x027fff) + | + +---- vram: ram@0xe1000000-0xe1ffffff + | + +---- vga-mmio: mmio@0xe2000000-0xe200ffff + + ram: ram@0x00000000-0xffffffff + +This is a (simplified) PC memory map. The 4GB RAM block is mapped into the +system address space via two aliases: "lomem" is a 1:1 mapping of the first +3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the +so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with +4GB of memory. + +The memory controller diverts addresses in the range 640K-768K to the PCI +address space. This is modelled using the "vga-window" alias, mapped at a +higher priority so it obscures the RAM at the same addresses. The vga window +can be removed by programming the memory controller; this is modelled by +removing the alias and exposing the RAM underneath. + +The pci address space is not a direct child of the system address space, since +we only want parts of it to be visible (we accomplish this using aliases). +It has two subregions: vga-area models the legacy vga window and is occupied +by two 32K memory banks pointing at two sections of the framebuffer. +In addition the vram is mapped as a BAR at address e1000000, and an additional +BAR containing MMIO registers is mapped after it. + +Note that if the guest maps a BAR outside the PCI hole, it would not be +visible as the pci-hole alias clips it to a 0.5GB range. + +MMIO Operations +--------------- + +MMIO regions are provided with ->read() and ->write() callbacks, +which are sufficient for most devices. 
Some devices change behaviour
+based on the attributes used for the memory transaction, or need
+to be able to respond that the access should provoke a bus error
+rather than completing successfully; those devices can use the
+->read_with_attrs() and ->write_with_attrs() callbacks instead.
+
+In addition various constraints can be supplied to control how these
+callbacks are called:
+
+- .valid.min_access_size, .valid.max_access_size define the access sizes
+  (in bytes) which the device accepts; accesses outside this range will
+  have device and bus specific behaviour (ignored, or machine check).
+- .valid.unaligned specifies that the *device being modelled* supports
+  unaligned accesses; if false, unaligned accesses will invoke the
+  appropriate bus or CPU specific behaviour.
+- .impl.min_access_size, .impl.max_access_size define the access sizes
+  (in bytes) supported by the *implementation*; other access sizes will be
+  emulated using the ones available. For example a 4-byte write will be
+  emulated using four 1-byte writes, if .impl.max_access_size = 1.
+- .impl.unaligned specifies that the *implementation* supports unaligned
+  accesses; if false, unaligned accesses will be emulated by two aligned
+  accesses.
+
+API Reference
+-------------
+
+.. kernel-doc:: include/exec/memory.h
diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
new file mode 100644
index 000000000..240125348
--- /dev/null
+++ b/docs/devel/migration.rst
@@ -0,0 +1,883 @@
+=========
+Migration
+=========
+
+QEMU has code to load/save the state of the guest that it is running.
+These are two complementary operations. Saving the state just does
+that: it saves the state of each device that the guest is running.
+Restoring a guest is just the opposite operation: we need to load the
+state of each device.
+
+For this to work, QEMU has to be launched with the same arguments both
+times. I.e. it can only restore the state into a guest that has the
+same devices as the one whose state was saved (this last requirement
+can be relaxed a bit, but for now we can consider that the
+configuration has to be exactly the same).
+
+Once we are able to save/restore a guest, a new functionality is
+requested: migration. This means that QEMU is able to start on one
+machine and be "migrated" to another machine, i.e. be moved to
+another machine.
+
+Next was the "live migration" functionality. This is important
+because some guests run with a lot of state (especially RAM), and it
+can take a while to move all state from one machine to another. Live
+migration allows the guest to continue running while the state is
+transferred; the guest only has to be stopped while the last part of
+the state is transferred. Typically the time that the guest is
+unresponsive during live migration is in the low hundreds of
+milliseconds (though this depends on a lot of things).
+
+Transports
+==========
+
+The migration stream is normally just a byte stream that can be passed
+over any transport.
+
+- tcp migration: do the migration using tcp sockets
+- unix migration: do the migration using unix sockets
+- exec migration: do the migration using the stdin/stdout of a process.
+- fd migration: do the migration using a file descriptor that is
+  passed to QEMU. QEMU doesn't care how this file descriptor is opened.
+
+In addition, support is included for migration using RDMA, where the
+hardware takes care of transporting the pages, and the load on the CPU
+is much lower.
While the +internals of RDMA migration are a bit different, this isn't really visible +outside the RAM migration code. + +All these migration protocols use the same infrastructure to +save/restore state devices. This infrastructure is shared with the +savevm/loadvm functionality. + +Debugging +========= + +The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``. + +Example usage: + +.. code-block:: shell + + $ qemu-system-x86_64 -display none -monitor stdio + (qemu) migrate "exec:cat > mig" + (qemu) q + $ ./scripts/analyze-migration.py -f mig + { + "ram (3)": { + "section sizes": { + "pc.ram": "0x0000000008000000", + ... + +See also ``analyze-migration.py -h`` help for more options. + +Common infrastructure +===================== + +The files, sockets or fd's that carry the migration stream are abstracted by +the ``QEMUFile`` type (see ``migration/qemu-file.h``). In most cases this +is connected to a subtype of ``QIOChannel`` (see ``io/``). + + +Saving the state of one device +============================== + +For most devices, the state is saved in a single call to the migration +infrastructure; these are *non-iterative* devices. The data for these +devices is sent at the end of precopy migration, when the CPUs are paused. +There are also *iterative* devices, which contain a very large amount of +data (e.g. RAM or large tables). See the iterative device section below. + +General advice for device developers +------------------------------------ + +- The migration state saved should reflect the device being modelled rather + than the way your implementation works. That way if you change the implementation + later the migration stream will stay compatible. That model may include + internal state that's not directly visible in a register. + +- When saving a migration stream the device code may walk and check + the state of the device. These checks might fail in various ways (e.g. + discovering internal state is corrupt or that the guest has done something bad). + Consider carefully before asserting/aborting at this point, since the + normal response from users is that *migration broke their VM* since it had + apparently been running fine until then. In these error cases, the device + should log a message indicating the cause of error, and should consider + putting the device into an error state, allowing the rest of the VM to + continue execution. + +- The migration might happen at an inconvenient point, + e.g. right in the middle of the guest reprogramming the device, during + guest reboot or shutdown or while the device is waiting for external IO. + It's strongly preferred that migrations do not fail in this situation, + since in the cloud environment migrations might happen automatically to + VMs that the administrator doesn't directly control. + +- If you do need to fail a migration, ensure that sufficient information + is logged to identify what went wrong. + +- The destination should treat an incoming migration stream as hostile + (which we do to varying degrees in the existing code). Check that offsets + into buffers and the like can't cause overruns. Fail the incoming migration + in the case of a corrupted stream like this. + +- Take care with internal device state or behaviour that might become + migration version dependent. For example, the order of PCI capabilities + is required to stay constant across migration. 
Another example would
+be that a special case handled by subsections (see below) might become
+much more common if a default behaviour is changed.
+
+- The state of the source should not be changed or destroyed by the
+  outgoing migration. Migrations timing out or being failed by
+  higher levels of management, or failures of the destination host are
+  not unusual, and in that case the VM is restarted on the source.
+  Note that the management layer can validly revert the migration
+  even though the QEMU level of migration has succeeded as long as it
+  does it before starting execution on the destination.
+
+- Buses and devices should be able to explicitly specify addresses when
+  instantiated, and management tools should use those. For example,
+  when hot adding USB devices it's important to specify the ports
+  and addresses, since implicit ordering based on the command line order
+  may be different on the destination. This can result in the
+  device state being loaded into the wrong device.
+
+VMState
+-------
+
+Most device data can be described using the ``VMSTATE`` macros (mostly defined
+in ``include/migration/vmstate.h``).
+
+An example (from hw/input/pckbd.c)
+
+.. code:: c
+
+  static const VMStateDescription vmstate_kbd = {
+      .name = "pckbd",
+      .version_id = 3,
+      .minimum_version_id = 3,
+      .fields = (VMStateField[]) {
+          VMSTATE_UINT8(write_cmd, KBDState),
+          VMSTATE_UINT8(status, KBDState),
+          VMSTATE_UINT8(mode, KBDState),
+          VMSTATE_UINT8(pending, KBDState),
+          VMSTATE_END_OF_LIST()
+      }
+  };
+
+We are declaring the state with name "pckbd". The ``version_id`` is 3,
+and there are four uint8_t fields in the KBDState structure. We
+registered this with:
+
+.. code:: c
+
+    vmstate_register(NULL, 0, &vmstate_kbd, s);
+
+For devices that are ``qdev`` based, we can register the device in the class
+init function:
+
+.. code:: c
+
+    dc->vmsd = &vmstate_kbd_isa;
+
+The VMState macros take care of ensuring that the device data section
+is formatted portably (normally big endian) and make some compile time checks
+against the types of the fields in the structures.
+
+VMState macros can include other VMStateDescriptions to store substructures
+(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
+arrays (``VMSTATE_VARRAY_``). Various other macros exist for special
+cases.
+
+Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
+ends up with a 4 byte bigendian representation on the wire; in the future
+it might be possible to use a more structured format.
+
+Legacy way
+----------
+
+This way is going to disappear as soon as all current users are ported to
+VMState, although converting existing code can be tricky, and thus 'soon'
+is relative.
+
+Each device has to register two functions, one to save the state and
+another to load the state back.
+
+.. code:: c
+
+  int register_savevm_live(const char *idstr,
+                           int instance_id,
+                           int version_id,
+                           SaveVMHandlers *ops,
+                           void *opaque);
+
+Two functions in the ``ops`` structure are the ``save_state``
+and ``load_state`` functions. Notice that ``load_state`` receives a
+version_id parameter to know what state format it is receiving.
+``save_state`` doesn't have a version_id parameter because it always
+uses the latest version.
+
+Note that because the VMState macros still save the data in a raw
+format, in many cases it's possible to replace legacy code
+with a carefully constructed VMState description that matches the
+byte layout of the existing code.
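+
+For illustration, a minimal sketch of the legacy interface for a
+hypothetical device with a single register (the device name, state
+structure and field are invented for the example):
+
+.. code:: c
+
+  static void mydev_save(QEMUFile *f, void *opaque)
+  {
+      MyDevState *s = opaque;
+
+      /* Always writes the latest (here: only) version of the format. */
+      qemu_put_be32(f, s->some_reg);
+  }
+
+  static int mydev_load(QEMUFile *f, void *opaque, int version_id)
+  {
+      MyDevState *s = opaque;
+
+      if (version_id != 1) {
+          return -EINVAL; /* unknown state format */
+      }
+      s->some_reg = qemu_get_be32(f);
+      return 0;
+  }
+
+  static SaveVMHandlers mydev_handlers = {
+      .save_state = mydev_save,
+      .load_state = mydev_load,
+  };
+
+  /* register_savevm_live("mydev", 0, 1, &mydev_handlers, s); */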
+
+Changing migration data structures
+----------------------------------
+
+When we migrate a device, we save/load the state as a series
+of fields.  Sometimes, due to bugs or new functionality, we need to
+change the state to store more/different information.  Changing the migration
+state saved for a device can break migration compatibility unless
+care is taken to use the appropriate techniques.  In general QEMU tries
+to maintain forward migration compatibility (i.e. migrating from
+QEMU n->n+1) and there are users who benefit from backward compatibility
+as well.
+
+Subsections
+-----------
+
+The most common structure change is adding new data, e.g. when adding
+a newer form of device, or adding state that you previously
+forgot to migrate.  This is best solved using a subsection.
+
+A subsection is "like" a device vmstate, but with one particularity: it
+has a Boolean function that tells whether its values need to be sent
+or not.  If this function returns false, the subsection is not sent.
+Subsections have a unique name, which is looked up on the receiving
+side.
+
+On the receiving side, if we find a subsection for a device that we
+don't understand, we just fail the migration.  If we understand all
+the subsections, then we load the state successfully.  There's no check
+that a subsection is loaded, so a newer QEMU that knows about a subsection
+can (with care) load a stream from an older QEMU that didn't send
+the subsection.
+
+If the new data is only needed in a rare case, then the subsection
+can be made conditional on that case and migration to older QEMUs will
+still succeed in most cases.  This is OK for data that's
+critical, but in some use cases it's preferred that the migration
+should succeed even with the data missing.  To support this the
+subsection can be connected to a device property and from there
+to a versioned machine type.
+
+The 'pre_load' and 'post_load' functions on subsections are only
+called if the subsection is loaded.
+
+One important note is that the outer post_load() function is called "after"
+loading all subsections, because a newer subsection could change the same
+value that it uses.  A flag, and the combination of outer pre_load and
+post_load can be used to detect whether a subsection was loaded, and to
+fall back on default behaviour when the subsection isn't present.
+
+Example:
+
+.. code:: c
+
+   static bool ide_drive_pio_state_needed(void *opaque)
+   {
+       IDEState *s = opaque;
+
+       return ((s->status & DRQ_STAT) != 0)
+           || (s->bus->error_status & BM_STATUS_PIO_RETRY);
+   }
+
+   const VMStateDescription vmstate_ide_drive_pio_state = {
+       .name = "ide_drive/pio_state",
+       .version_id = 1,
+       .minimum_version_id = 1,
+       .pre_save = ide_drive_pio_pre_save,
+       .post_load = ide_drive_pio_post_load,
+       .needed = ide_drive_pio_state_needed,
+       .fields = (VMStateField[]) {
+           VMSTATE_INT32(req_nb_sectors, IDEState),
+           VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
+                                vmstate_info_uint8, uint8_t),
+           VMSTATE_INT32(cur_io_buffer_offset, IDEState),
+           VMSTATE_INT32(cur_io_buffer_len, IDEState),
+           VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
+           VMSTATE_INT32(elementary_transfer_size, IDEState),
+           VMSTATE_INT32(packet_transfer_size, IDEState),
+           VMSTATE_END_OF_LIST()
+       }
+   };
+
+   const VMStateDescription vmstate_ide_drive = {
+       .name = "ide_drive",
+       .version_id = 3,
+       .minimum_version_id = 0,
+       .post_load = ide_drive_post_load,
+       .fields = (VMStateField[]) {
+           .... several fields ....
+           VMSTATE_END_OF_LIST()
+       },
+       .subsections = (const VMStateDescription*[]) {
+           &vmstate_ide_drive_pio_state,
+           NULL
+       }
+   };
+
+Here we have a subsection for the pio state.  We only need to
+save/send this state when we are in the middle of a pio operation
+(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
+not enabled, the values in those fields are garbage and don't need to
+be sent.
+
+Connecting subsections to properties
+------------------------------------
+
+Using a condition function that checks a 'property' to determine whether
+to send a subsection allows backward migration compatibility when
+new subsections are added, especially when combined with versioned
+machine types.
+
+For example:
+
+   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
+      default it to true.
+   b) Add an entry to the ``hw_compat_`` for the previous version that sets
+      the property to false.
+   c) Add a static bool support_foo function that tests the property.
+   d) Add a subsection with a .needed set to the support_foo function.
+   e) (potentially) Add an outer pre_load that sets up a default value
+      for 'foo' to be used if the subsection isn't loaded.
+
+Now that subsection will not be generated when using an older
+machine type and the migration stream will be accepted by older
+QEMU versions.
+
+Not sending existing elements
+-----------------------------
+
+Sometimes members of the VMState are no longer needed:
+
+  - removing them will break migration compatibility
+
+  - making them version dependent and bumping the version will break backward
+    migration compatibility.
+
+Adding a dummy field into the migration stream is normally the best way to preserve
+compatibility.
+
+If the field really does need to be removed then:
+
+  a) Add a new property/compatibility/function in the same way as for
+     subsections above.
+  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:
+
+   ``VMSTATE_UINT32(foo, barstruct)``
+
+   becomes
+
+   ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``
+
+   Sometime in the future, when we no longer care about the ancient versions,
+   these can be killed off.
+   Note that for backward compatibility it's important to fill in the structure with
+   data that the destination will understand.
+
+Any difference in the predicates on the source and destination will end up
+with different fields being enabled and data being loaded into the wrong
+fields; for this reason conditional fields like this are very fragile.
+
+Versions
+--------
+
+Version numbers are intended for major incompatible changes to the
+migration of a device, and using them breaks backward-migration
+compatibility; in general most changes can be made by adding Subsections
+(see above) or _TEST macros (see above) which won't break compatibility.
+
+Each version is associated with a series of fields saved.  ``save_state`` always
+saves the state as the newest version.  But ``load_state`` is sometimes able to
+load state from an older version.
+
+You can see that there are several version fields:
+
+- ``version_id``: the maximum version_id supported by VMState for that device.
+- ``minimum_version_id``: the minimum version_id that VMState is able to understand
+  for that device.
+- ``minimum_version_id_old``: For devices that have not yet been ported to VMState,
+  we can assign a function that knows how to read this old state. This field is
+  ignored if there is no ``load_state_old`` handler.
+
+VMState is able to read versions from minimum_version_id to version_id.
+The function ``load_state_old()`` (if present) is able to
+load state from minimum_version_id_old to minimum_version_id.  This
+function is deprecated and will be removed when no more users are left.
+
+There are *_V* forms of many ``VMSTATE_`` macros that only load a field
+for certain versions, e.g.
+
+.. code:: c
+
+   VMSTATE_UINT16_V(ip_id, Slirp, 2),
+
+only loads that field for versions 2 and newer.
+
+Saving state will always create a section with the 'version_id' value
+and thus can't be loaded by any older QEMU.
+
+Massaging functions
+-------------------
+
+Sometimes it is not enough to save the state directly
+from one structure; we first need to fill in the correct values.  One
+example is when we are using KVM.  Before saving the CPU state, we
+need to ask KVM to copy the state it is using to QEMU.  The opposite
+applies when we are loading the state: we need a way to tell KVM to
+load the state for the CPU that we have just loaded from the QEMUFile.
+
+The functions to do that are inside a vmstate definition, and are called:
+
+- ``int (*pre_load)(void *opaque);``
+
+  This function is called before we load the state of one device.
+
+- ``int (*post_load)(void *opaque, int version_id);``
+
+  This function is called after we load the state of one device.
+
+- ``int (*pre_save)(void *opaque);``
+
+  This function is called before we save the state of one device.
+
+- ``int (*post_save)(void *opaque);``
+
+  This function is called after we save the state of one device
+  (even upon failure, unless the call to pre_save returned an error).
+
+Example: You can look at hpet.c, which uses the first three functions
+to massage the state that is transferred.
+
+The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
+data doesn't match the stored device data well; it allows an
+intermediate temporary structure to be populated with migration
+data and then transferred to the main structure.
+
+If you use memory API functions that update memory layout outside
+initialization (i.e., in response to a guest action), this is a strong
+indication that you need to call these functions in a ``post_load`` callback.
+Examples of such memory API functions are:
+
+  - memory_region_add_subregion()
+  - memory_region_del_subregion()
+  - memory_region_set_readonly()
+  - memory_region_set_nonvolatile()
+  - memory_region_set_enabled()
+  - memory_region_set_address()
+  - memory_region_set_alias_offset()
+
+Iterative device migration
+--------------------------
+
+Some devices, such as RAM, block storage or certain platform devices,
+have such large amounts of data that the CPUs would be paused for too
+long if it were all sent in one section.  For these devices an
+*iterative* approach is taken.
+
+The iterative devices generally don't use VMState macros
+(although it may be possible in some cases) and instead use
+qemu_put_*/qemu_get_* macros to read/write data to the stream.  Specialist
+versions exist for high bandwidth IO.
+
+
+An iterative device must provide:
+
+  - A ``save_setup`` function that initialises the data structures and
+    transmits a first section containing information on the device.  In the
+    case of RAM this transmits a list of RAMBlocks and sizes.
+
+  - A ``load_setup`` function that initialises the data structures on the
+    destination.
+
+  - A ``save_live_pending`` function that is called repeatedly and must
+    indicate how much more data the device has left to save.
The core + migration code will use this to determine when to pause the CPUs + and complete the migration. + + - A ``save_live_iterate`` function (called after ``save_live_pending`` + when there is significant data still to be sent). It should send + a chunk of data until the point that stream bandwidth limits tell it + to stop. Each call generates one section. + + - A ``save_live_complete_precopy`` function that must transmit the + last section for the device containing any remaining data. + + - A ``load_state`` function used to load sections generated by + any of the save functions that generate sections. + + - ``cleanup`` functions for both save and load that are called + at the end of migration. + +Note that the contents of the sections for iterative migration tend +to be open-coded by the devices; care should be taken in parsing +the results and structuring the stream to make them easy to validate. + +Device ordering +--------------- + +There are cases in which the ordering of device loading matters; for +example in some systems where a device may assert an interrupt during loading, +if the interrupt controller is loaded later then it might lose the state. + +Some ordering is implicitly provided by the order in which the machine +definition creates devices, however this is somewhat fragile. + +The ``MigrationPriority`` enum provides a means of explicitly enforcing +ordering. Numerically higher priorities are loaded earlier. +The priority is set by setting the ``priority`` field of the top level +``VMStateDescription`` for the device. + +Stream structure +================ + +The stream tries to be word and endian agnostic, allowing migration between hosts +of different characteristics running the same VM. + + - Header + + - Magic + - Version + - VM configuration section + + - Machine type + - Target page bits + - List of sections + Each section contains a device, or one iteration of a device save. + + - section type + - section id + - ID string (First section of each device) + - instance id (First section of each device) + - version id (First section of each device) + - <device data> + - Footer mark + - EOF mark + - VM Description structure + Consisting of a JSON description of the contents for analysis only + +The ``device data`` in each section consists of the data produced +by the code described above. For non-iterative devices they have a single +section; iterative devices have an initial and last section and a set +of parts in between. +Note that there is very little checking by the common code of the integrity +of the ``device data`` contents, that's up to the devices themselves. +The ``footer mark`` provides a little bit of protection for the case where +the receiving side reads more or less data than expected. + +The ``ID string`` is normally unique, having been formed from a bus name +and device address, PCI devices and storage devices hung off PCI controllers +fit this pattern well. Some devices are fixed single instances (e.g. "pc-ram"). +Others (especially either older devices or system devices which for +some reason don't have a bus concept) make use of the ``instance id`` +for otherwise identically named devices. + +Return path +----------- + +Only a unidirectional stream is required for normal migration, however a +``return path`` can be created when bidirectional communication is desired. +This is primarily used by postcopy, but is also used to return a success +flag to the source at the end of migration. 
+
+``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
+path.
+
+  Source side
+
+     Forward path - written by migration thread
+     Return path  - opened by main thread, read by return-path thread
+
+  Destination side
+
+     Forward path - read by main thread
+     Return path  - opened by main thread, written by main thread AND postcopy
+     thread (protected by rp_mutex)
+
+Postcopy
+========
+
+'Postcopy' migration is a way to deal with migrations that refuse to converge
+(or take too long to converge). Its plus side is that there is an upper bound
+on the amount of migration traffic and the time the migration takes; the
+downside is that during the postcopy phase, a failure of *either* side or of
+the network connection causes the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+Enabling postcopy
+-----------------
+
+To enable postcopy, issue this command on the monitor (both source and
+destination) prior to the start of migration:
+
+``migrate_set_capability postcopy-ram on``
+
+The normal commands are then used to start a migration, which is still
+started in precopy mode.  Issuing:
+
+``migrate_start_postcopy``
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on.  Issuing it after the end of a migration is harmless.
+
+Blocktime is a postcopy live migration metric, intended to show how
+long a vCPU was in a state of interruptible sleep due to page faults.
+That metric is calculated both as an overlapped value across all vCPUs,
+and separately for each vCPU.  These values are calculated on the
+destination side.  To enable postcopy blocktime calculation, enter the
+following command on the destination monitor:
+
+``migrate_set_capability postcopy-blocktime on``
+
+Postcopy blocktime can be retrieved with the query-migrate QMP command:
+its postcopy-blocktime value shows the overlapped blocking time across
+all vCPUs, while postcopy-vcpu-blocktime shows a list of blocking times
+per vCPU.
+
+.. note::
+  During the postcopy phase, the bandwidth limits set using
+  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
+  the destination is waiting for).
+
+Postcopy device transfer
+------------------------
+
+Loading of device data may cause the device emulation to access guest RAM,
+which may trigger faults that have to be resolved by the source. The
+migration stream therefore has to be able to respond with page data *during*
+the device load, and hence the device data has to be read from the stream
+completely before the device load begins, to free the stream up.  This is
+achieved by 'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+----------------
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+   - Command: 'postcopy listen'
+   - The device state
+
+     A series of sections, identical to the precopy stream's device state
+     stream, containing everything except postcopiable devices (i.e.
RAM) + - Command: 'postcopy run' + +The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the +contents are formatted in the same way as the main migration stream. + +During postcopy the source scans the list of dirty pages and sends them +to the destination without being requested (in much the same way as precopy), +however when a page request is received from the destination, the dirty page +scanning restarts from the requested location. This causes requested pages +to be sent quickly, and also causes pages directly after the requested page +to be sent quickly in the hope that those pages are likely to be used +by the destination soon. + +Destination behaviour +--------------------- + +Initially the destination looks the same as precopy, with a single thread +reading the migration stream; the 'postcopy advise' and 'discard' commands +are processed to change the way RAM is managed, but don't affect the stream +processing. + +:: + + ------------------------------------------------------------------------------ + 1 2 3 4 5 6 7 + main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN ) + thread | | + | (page request) + | \___ + v \ + listen thread: --- page -- page -- page -- page -- page -- + + a b c + ------------------------------------------------------------------------------ + +- On receipt of ``CMD_PACKAGED`` (1) + + All the data associated with the package - the ( ... ) section in the diagram - + is read into memory, and the main thread recurses into qemu_loadvm_state_main + to process the contents of the package (2) which contains commands (3,6) and + devices (4...) + +- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package) + + a new thread (a) is started that takes over servicing the migration stream, + while the main thread carries on loading the package. It loads normal + background page data (b) but if during a device load a fault happens (5) + the returned page (c) is loaded by the listen thread allowing the main + threads device load to carry on. + +- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6) + + letting the destination CPUs start running. At the end of the + ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and + is no longer used by migration, while the listen thread carries on servicing + page data until the end of migration. + +Postcopy states +--------------- + +Postcopy moves through a series of states (see postcopy_state) from +ADVISE->DISCARD->LISTEN->RUNNING->END + + - Advise + + Set at the start of migration if postcopy is enabled, even + if it hasn't had the start command; here the destination + checks that its OS has the support needed for postcopy, and performs + setup to ensure the RAM mappings are suitable for later postcopy. + The destination will fail early in migration at this point if the + required OS support is not present. + (Triggered by reception of POSTCOPY_ADVISE command) + + - Discard + + Entered on receipt of the first 'discard' command; prior to + the first Discard being performed, hugepages are switched off + (using madvise) to ensure that no new huge pages are created + during the postcopy phase, and to cause any huge pages that + have discards on them to be broken. + + - Listen + + The first command in the package, POSTCOPY_LISTEN, switches + the destination state to Listen, and starts a new thread + (the 'listen thread') which takes over the job of receiving + pages off the migration stream, while the main thread carries + on processing the blob. 
With this thread able to process page
+reception, the destination now 'sensitises' the RAM to detect
+any access to missing pages (on Linux using the 'userfault'
+system).
+
+  - Running
+
+    POSTCOPY_RUN causes the destination to synchronise all
+    state and start the CPUs and IO devices running.  The main
+    thread now finishes processing the migration package and
+    carries on as it would for normal precopy migration
+    (although it can't do the cleanup it would do as it
+    finishes a normal migration).
+
+  - End
+
+    The listen thread can now quit and perform the cleanup of migration
+    state; the migration is now complete.
+
+Source side page maps
+---------------------
+
+The source side keeps two bitmaps during postcopy: the 'migration bitmap'
+and the 'unsent map'.  The 'migration bitmap' is basically the same as in
+the precopy case, and holds a bit to indicate that a page is 'dirty' -
+i.e. needs sending.  During the precopy phase this is updated as the CPU
+dirties pages; however, during postcopy the CPUs are stopped and nothing
+should dirty anything any more.
+
+The 'unsent map' is used for the transition to postcopy. It is a bitmap that
+has a bit cleared whenever a page is sent to the destination; however, during
+the transition to postcopy mode it is combined with the migration bitmap
+to form a set of pages that:
+
+   a) Have been sent but then redirtied (which must be discarded)
+   b) Have not yet been sent - which also must be discarded to cause any
+      transparent huge pages built during precopy to be broken.
+
+Note that the contents of the unsent map are sacrificed during the calculation
+of the discard set and thus aren't valid once in postcopy.  The migration
+bitmap is still valid and is used to ensure that no page is sent more than
+once.  Any request for a page that has already been sent is ignored.
+Duplicate requests such as this can happen as a page is sent at about the
+same time the destination accesses it.
+
+Postcopy with hugepages
+-----------------------
+
+Postcopy now works with hugetlbfs backed memory:
+
+  a) The Linux kernel on the destination must support userfault on hugepages.
+  b) The huge-page configuration on the source and destination VMs must be
+     identical; i.e. RAMBlocks on both sides must use the same page size.
+  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
+     RAM if it doesn't have enough hugepages, triggering (b) to fail.
+     Using ``-mem-prealloc`` enforces the allocation using hugepages.
+  d) Care should be taken with the size of hugepage used; postcopy with 2MB
+     hugepages works well, however 1GB hugepages are likely to be problematic
+     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
+     and until the full page is transferred the destination thread is blocked.
+
+Postcopy with shared memory
+---------------------------
+
+Postcopy migration with shared memory needs explicit support from the other
+processes that share memory and from QEMU. There are restrictions on the types
+of shared memory that userfault can support.
+
+The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
+(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
+for hugetlbfs, which may be a problem in some configurations).
+
+The vhost-user code in QEMU supports clients that have postcopy support,
+and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
+to support postcopy.
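+
+For illustration, the userfault registration that such a client performs
+might look like the sketch below; this is plain Linux userfaultfd API
+usage rather than code from QEMU or any vhost-user client, and ``base``
+and ``len`` stand for one of the client's shared-memory mappings:
+
+.. code:: c
+
+   #include <fcntl.h>
+   #include <linux/userfaultfd.h>
+   #include <stdint.h>
+   #include <sys/ioctl.h>
+   #include <sys/syscall.h>
+   #include <unistd.h>
+
+   /* error handling omitted for brevity */
+   static int register_with_userfault(void *base, size_t len)
+   {
+       int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+       struct uffdio_api api = { .api = UFFD_API };
+       struct uffdio_register reg = {
+           .range = { .start = (uintptr_t)base, .len = len },
+           .mode  = UFFDIO_REGISTER_MODE_MISSING, /* fault on missing pages */
+       };
+
+       ioctl(uffd, UFFDIO_API, &api);       /* handshake with the kernel */
+       ioctl(uffd, UFFDIO_REGISTER, &reg);  /* watch this mapping */
+       return uffd;  /* passed back to QEMU over the control channel */
+   }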
+ +The client needs to open a userfaultfd and register the areas +of memory that it maps with userfault. The client must then pass the +userfaultfd back to QEMU together with a mapping table that allows +fault addresses in the clients address space to be converted back to +RAMBlock/offsets. The client's userfaultfd is added to the postcopy +fault-thread and page requests are made on behalf of the client by QEMU. +QEMU performs 'wake' operations on the client's userfaultfd to allow it +to continue after a page has arrived. + +.. note:: + There are two future improvements that would be nice: + a) Some way to make QEMU ignorant of the addresses in the clients + address space + b) Avoiding the need for QEMU to perform ufd-wake calls after the + pages have arrived + +Retro-fitting postcopy to existing clients is possible: + a) A mechanism is needed for the registration with userfault as above, + and the registration needs to be coordinated with the phases of + postcopy. In vhost-user extra messages are added to the existing + control channel. + b) Any thread that can block due to guest memory accesses must be + identified and the implication understood; for example if the + guest memory access is made while holding a lock then all other + threads waiting for that lock will also be blocked. + +Firmware +======== + +Migration migrates the copies of RAM and ROM, and thus when running +on the destination it includes the firmware from the source. Even after +resetting a VM, the old firmware is used. Only once QEMU has been restarted +is the new firmware in use. + +- Changes in firmware size can cause changes in the required RAMBlock size + to hold the firmware and thus migration can fail. In practice it's best + to pad firmware images to convenient powers of 2 with plenty of space + for growth. + +- Care should be taken with device emulation code so that newer + emulation code can work with older firmware to allow forward migration. + +- Care should be taken with newer firmware so that backward migration + to older systems with older device emulation code will work. + +In some cases it may be best to tie specific firmware versions to specific +versioned machine types to cut down on the combinations that will need +support. This is also useful when newer versions of firmware outgrow +the padding. + diff --git a/docs/devel/modules.rst b/docs/devel/modules.rst new file mode 100644 index 000000000..8e999c4fa --- /dev/null +++ b/docs/devel/modules.rst @@ -0,0 +1,5 @@ +============ +QEMU modules +============ + +.. kernel-doc:: include/qemu/module.h diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst new file mode 100644 index 000000000..e4801751f --- /dev/null +++ b/docs/devel/multi-process.rst @@ -0,0 +1,968 @@ +Multi-process QEMU +=================== + +.. note:: + + This is the design document for multi-process QEMU. It does not + necessarily reflect the status of the current implementation, which + may lack features or be considerably different from what is described + in this document. This document is still useful as a description of + the goals and general direction of this feature. + + Please refer to the following wiki for latest details: + https://wiki.qemu.org/Features/MultiProcessQEMU + +QEMU is often used as the hypervisor for virtual machines running in the +Oracle cloud. 
Since one of the advantages of cloud computing is the
+ability to run many VMs from different tenants in the same cloud
+infrastructure, a guest that compromised its hypervisor could
+potentially use the hypervisor's access privileges to access data it is
+not authorized for.
+
+QEMU can be susceptible to security attacks because it is a large,
+monolithic program that provides many features to the VMs it services.
+Many of these features can be configured out of QEMU, but even a
+reduced-configuration QEMU has a large amount of code a guest can
+potentially attack. Separating QEMU helps reduce the attack surface by
+limiting each component in the system to accessing only the resources
+it needs to perform its job.
+
+QEMU services
+-------------
+
+QEMU can be broadly described as providing three main services. One is a
+VM control point, where VMs can be created, migrated, re-configured, and
+destroyed. A second is to emulate the CPU instructions within the VM,
+often accelerated by HW virtualization features such as Intel's VT
+extensions. Finally, it provides IO services to the VM by emulating HW
+IO devices, such as disk and network devices.
+
+A multi-process QEMU
+~~~~~~~~~~~~~~~~~~~~
+
+A multi-process QEMU involves separating QEMU services into separate
+host processes. Each of these processes can be given only the privileges
+it needs to provide its service, e.g., a disk service could be given
+access only to the disk images it provides, and not be allowed to
+access other files, or any network devices. An attacker who compromised
+this service would not be able to use this exploit to access files or
+devices beyond what the disk service was given access to.
+
+A QEMU control process would remain, but in multi-process mode it would
+have no direct interfaces to the VM. During VM execution, it would still
+provide the user interface to hot-plug devices or live migrate the VM.
+
+A first step in creating a multi-process QEMU is to separate IO services
+from the main QEMU program, which would continue to provide CPU
+emulation; i.e., the control process would also be the CPU emulation
+process. In a later phase, CPU emulation could be separated from the
+control process.
+
+Separating IO services
+----------------------
+
+Separating IO services into individual host processes is a good place to
+begin for a couple of reasons. One is that the sheer number of IO devices
+QEMU can emulate provides a large surface of interfaces which could
+potentially be exploited, and which, indeed, has been a source of exploits
+in the past. Another is that the modular nature of QEMU device emulation
+code provides interface points where the QEMU functions that perform
+device emulation can be separated from the QEMU functions that manage the
+emulation of guest CPU instructions. The devices emulated in the separate
+process are referred to as remote devices.
+
+QEMU device emulation
+~~~~~~~~~~~~~~~~~~~~~
+
+QEMU uses an object oriented SW architecture for device emulation code.
+Configured objects are all compiled into the QEMU binary, then objects
+are instantiated by name when used by the guest VM. For example, the
+code to emulate a device named "foo" is always present in QEMU, but its
+instantiation code is only run when the device is included in the target
+VM.
(e.g., via the QEMU command line as *-device foo*) + +The object model is hierarchical, so device emulation code names its +parent object (such as "pci-device" for a PCI device) and QEMU will +instantiate a parent object before calling the device's instantiation +code. + +Current separation models +~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to separate the device emulation code from the CPU emulation +code, the device object code must run in a different process. There are +a couple of existing QEMU features that can run emulation code +separately from the main QEMU process. These are examined below. + +vhost user model +^^^^^^^^^^^^^^^^ + +Virtio guest device drivers can be connected to vhost user applications +in order to perform their IO operations. This model uses special virtio +device drivers in the guest and vhost user device objects in QEMU, but +once the QEMU vhost user code has configured the vhost user application, +mission-mode IO is performed by the application. The vhost user +application is a daemon process that can be contacted via a known UNIX +domain socket. + +vhost socket +'''''''''''' + +As mentioned above, one of the tasks of the vhost device object within +QEMU is to contact the vhost application and send it configuration +information about this device instance. As part of the configuration +process, the application can also be sent other file descriptors over +the socket, which then can be used by the vhost user application in +various ways, some of which are described below. + +vhost MMIO store acceleration +''''''''''''''''''''''''''''' + +VMs are often run using HW virtualization features via the KVM kernel +driver. This driver allows QEMU to accelerate the emulation of guest CPU +instructions by running the guest in a virtual HW mode. When the guest +executes instructions that cannot be executed by virtual HW mode, +execution returns to the KVM driver so it can inform QEMU to emulate the +instructions in SW. + +One of the events that can cause a return to QEMU is when a guest device +driver accesses an IO location. QEMU then dispatches the memory +operation to the corresponding QEMU device object. In the case of a +vhost user device, the memory operation would need to be sent over a +socket to the vhost application. This path is accelerated by the QEMU +virtio code by setting up an eventfd file descriptor that the vhost +application can directly receive MMIO store notifications from the KVM +driver, instead of needing them to be sent to the QEMU process first. + +vhost interrupt acceleration +'''''''''''''''''''''''''''' + +Another optimization used by the vhost application is the ability to +directly inject interrupts into the VM via the KVM driver, again, +bypassing the need to send the interrupt back to the QEMU process first. +The QEMU virtio setup code configures the KVM driver with an eventfd +that triggers the device interrupt in the guest when the eventfd is +written. This irqfd file descriptor is then passed to the vhost user +application program. + +vhost access to guest memory +'''''''''''''''''''''''''''' + +The vhost application is also allowed to directly access guest memory, +instead of needing to send the data as messages to QEMU. This is also +done with file descriptors sent to the vhost user application by QEMU. +These descriptors can be passed to ``mmap()`` by the vhost application +to map the guest address space into the vhost application. 
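+
+For example, mapping one such region might look like this sketch; the
+parameters are illustrative, standing for the values carried by the
+vhost-user memory-table message:
+
+.. code:: c
+
+   #include <sys/mman.h>
+
+   /* fd, size and offset come from the memory-table message */
+   static void *map_guest_region(int fd, size_t size, off_t offset)
+   {
+       return mmap(NULL, size, PROT_READ | PROT_WRITE,
+                   MAP_SHARED, fd, offset);
+   }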
+
+IOMMUs introduce another level of complexity, since the address given to
+the guest virtio device to DMA to or from is not a guest physical
+address. This case is handled by having vhost code within QEMU register
+as a listener for IOMMU mapping changes. The vhost application maintains
+a cache of IOMMU translations: sending translation requests back to
+QEMU on cache misses, and in turn receiving flush requests from QEMU
+when mappings are purged.
+
+applicability to device separation
+''''''''''''''''''''''''''''''''''
+
+Much of the vhost model can be re-used by separated device emulation. In
+particular, the ideas of using a socket between QEMU and the device
+emulation application, using a file descriptor to inject interrupts into
+the VM via KVM, and allowing the application to ``mmap()`` the guest
+should be re-used.
+
+There are, however, some notable differences between how a vhost
+application works and the needs of separated device emulation. The most
+basic is that vhost uses custom virtio device drivers which always
+trigger IO with MMIO stores. A separated device emulation model must
+work with existing IO device models and guest device drivers. MMIO loads
+break vhost store acceleration since they are synchronous - guest
+progress cannot continue until the load has been emulated. By contrast,
+stores are asynchronous: the guest can continue after the store event
+has been sent to the vhost application.
+
+Another difference is that in the vhost user model, a single daemon can
+support multiple QEMU instances. This is contrary to the security regime
+desired, in which the emulation application should only be allowed to
+access the files or devices the VM it's running on behalf of can access.
+
+qemu-io model
+^^^^^^^^^^^^^
+
+``qemu-io`` is a test harness used to test changes to the QEMU block backend
+object code (e.g., the code that implements disk images for disk driver
+emulation). ``qemu-io`` is not a device emulation application per se, but it
+does compile the QEMU block objects into a separate binary from the main
+QEMU one. This could be useful for disk device emulation, since its
+emulation applications will need to include the QEMU block objects.
+
+New separation model based on proxy objects
+-------------------------------------------
+
+A different model based on proxy objects in the QEMU program
+communicating with remote emulation programs could provide separation
+while minimizing the changes needed to the device emulation code. The
+rest of this section is a discussion of how a proxy object model would
+work.
+
+Remote emulation processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The remote emulation process will run the QEMU object hierarchy without
+modification. The device emulation objects will also be based on the
+QEMU code, because for anything but the simplest device, it would not be
+tractable to re-implement both the object model and the many device
+backends that QEMU has.
+
+The processes will communicate with the QEMU process over UNIX domain
+sockets. The processes can be executed either as standalone processes,
+or be executed by QEMU. In both cases, the host backends the emulation
+processes will provide are specified on their command lines, as they would
+be for QEMU. For example:
+
+::
+
+    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
+    -blockdev driver=qcow2,node-name=drive0,file=file0
+
+would indicate process *disk-proc* uses a qcow2 emulated disk named
+*file0* as its backend.
+
+Emulation processes may emulate more than one guest controller. A common
+configuration might be to put all controllers of the same device class
+(e.g., disk, network, etc.) in a single process, so that all backends of
+the same type can be managed by a single QMP monitor.
+
+communication with QEMU
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The first argument to the remote emulation process will be a Unix domain
+socket that connects with the Proxy object. This is a required argument.
+
+::
+
+    disk-proc <socket number> <backend list>
+
+remote process QMP monitor
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Remote emulation processes can be monitored via QMP, similar to QEMU
+itself. The QMP monitor socket is specified in the same way as for a
+QEMU process:
+
+::
+
+    disk-proc -qmp unix:/tmp/disk-mon,server
+
+can be monitored over the UNIX socket path */tmp/disk-mon*.
+
+QEMU command line
+~~~~~~~~~~~~~~~~~
+
+Each remote device emulated in a remote process on the host is
+represented as a *-device* of type *pci-proxy-dev*. A socket
+sub-option to this option specifies the Unix socket that connects
+to the remote process. An *id* sub-option is required, and it should
+be the same id as used in the remote process.
+
+::
+
+    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3
+
+can be used to add a device emulated in a remote process.
+
+
+QEMU management of remote processes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU is not aware of the type of the remote PCI device. It is
+a pass-through device as far as QEMU is concerned.
+
+communication with emulation process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+primary channel
+'''''''''''''''
+
+The primary channel (referred to as com in the code) is used to bootstrap
+the remote process. It is also used to pass on device-agnostic commands
+like reset.
+
+per-device channels
+'''''''''''''''''''
+
+Each remote device communicates with QEMU using a dedicated communication
+channel. The proxy object sets up this channel using the primary
+channel during its initialization.
+
+QEMU device proxy objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU has an object model based on sub-classes inherited from the
+"object" super-class. The sub-classes that are of interest here are the
+"device" and "bus" sub-classes whose child sub-classes make up the
+device tree of a QEMU emulated system.
+
+The proxy object model will use device proxy objects to replace the
+device emulation code within the QEMU process. These objects will live
+in the same place in the object and bus hierarchies as the objects they
+replace; i.e., the proxy object for an LSI SCSI controller will be a
+sub-class of the "pci-device" class, and will have the same PCI bus
+parent and the same SCSI bus child objects as the LSI controller object
+it replaces.
+
+It is worth noting that the same proxy object is used to mediate with
+all types of remote PCI devices.
+
+object initialization
+^^^^^^^^^^^^^^^^^^^^^
+
+The Proxy device objects are initialized in the exact same manner in
+which any other QEMU device would be initialized.
+
+In addition, the Proxy objects perform the following two tasks:
+
+- Parse the "socket" sub-option and connect to the remote process
+  using this channel
+- Use the "id" sub-option to connect to the emulated device on the
+  separate process
+
+class\_init
+'''''''''''
+
+The ``class_init()`` method of a proxy object will, in general, behave
+similarly to the object it replaces, including setting any static
+properties and methods needed by the proxy.
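+
+For illustration, a proxy's class\_init might look like the sketch below;
+``pci_proxy_dev_realize`` and ``proxy_properties`` are hypothetical names,
+not part of the design:
+
+.. code:: c
+
+   static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
+   {
+       DeviceClass *dc = DEVICE_CLASS(klass);
+       PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+       k->realize = pci_proxy_dev_realize;           /* hypothetical */
+       dc->desc = "PCI device proxy";
+       /* static properties: the "socket" and "id" sub-options */
+       device_class_set_props(dc, proxy_properties);
+   }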
+
+instance\_init / realize
+''''''''''''''''''''''''
+
+The ``instance_init()`` and ``realize()`` functions would only need to
+perform tasks related to being a proxy, such as registering its own
+MMIO handlers, or creating a child bus that other proxy devices can be
+attached to later.
+
+Other tasks will be device-specific. For example, PCI device objects
+will initialize the PCI config space in order to make a valid PCI device
+tree within the QEMU process.
+
+address space registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most devices are driven by guest device driver accesses to IO addresses
+or ports. The QEMU device emulation code uses QEMU's memory region
+function calls (such as ``memory_region_init_io()``) to add callback
+functions that QEMU will invoke when the guest accesses the device's
+areas of the IO address space. When a guest driver does access the
+device, the VM will exit HW virtualization mode and return to QEMU,
+which will then look up and execute the corresponding callback function.
+
+A proxy object would need to mirror the memory region calls the actual
+device emulator would perform in its initialization code, but with its
+own callbacks. When invoked by QEMU as a result of a guest IO operation,
+they will forward the operation to the device emulation process.
+
+PCI config space
+^^^^^^^^^^^^^^^^
+
+PCI devices also have a configuration space that can be accessed by the
+guest driver. Guest accesses to this space are not handled by the device
+emulation object, but by its PCI parent object. Much of this space is
+read-only, but certain registers (especially BAR and MSI-related ones)
+need to be propagated to the emulation process.
+
+PCI parent proxy
+''''''''''''''''
+
+One way to propagate guest PCI config accesses is to create a
+"pci-device-proxy" class that can serve as the parent of a PCI device
+proxy object. This class's parent would be "pci-device" and it would
+override the PCI parent's ``config_read()`` and ``config_write()``
+methods with ones that forward these operations to the emulation
+program.
+
+interrupt receipt
+^^^^^^^^^^^^^^^^^
+
+A proxy for a device that generates interrupts will need to create a
+socket to receive interrupt indications from the emulation process. An
+incoming interrupt indication would then be sent up to its bus parent to
+be injected into the guest. For example, a PCI device object may use
+``pci_set_irq()``.
+
+live migration
+^^^^^^^^^^^^^^
+
+The proxy will register to save and restore any *vmstate* it needs over
+a live migration event. The device proxy does not need to manage the
+remote device's *vmstate*; that will be handled by the remote process
+proxy (see below).
+
+QEMU remote device operation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generic device operations, such as DMA, will be performed by the remote
+process proxy by sending messages to the remote process.
+
+DMA operations
+^^^^^^^^^^^^^^
+
+DMA operations would be handled much as vhost applications handle them.
+One of the initial messages sent to the emulation process is a guest
+memory table. Each entry in this table consists of a file descriptor and
+size that the emulation process can ``mmap()`` to directly access guest
+memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
+must be backed by file descriptors, such as when QEMU is given the
+*-mem-path* command line option.
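+
+A hypothetical layout for one guest memory table entry is sketched below;
+the actual message format would be defined by the implementation, and the
+file descriptor itself would be passed alongside via SCM_RIGHTS:
+
+.. code:: c
+
+   #include <stdint.h>
+
+   typedef struct MemTableEntry {
+       uint64_t gpa;    /* guest physical address of the region */
+       uint64_t size;   /* size of the region in bytes */
+       uint64_t offset; /* offset of the region within the fd */
+   } MemTableEntry;     /* one fd per entry, passed via SCM_RIGHTS */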
+ +IOMMU operations +^^^^^^^^^^^^^^^^ + +When the emulated system includes an IOMMU, the remote process proxy in +QEMU will need to create a socket for IOMMU requests from the emulation +process. It will handle those requests with an +``address_space_get_iotlb_entry()`` call. In order to handle IOMMU +unmaps, the remote process proxy will also register as a listener on the +device's DMA address space. When an IOMMU memory region is created +within the DMA address space, an IOMMU notifier for unmaps will be added +to the memory region that will forward unmaps to the emulation process +over the IOMMU socket. + +device hot-plug via QMP +^^^^^^^^^^^^^^^^^^^^^^^ + +An QMP "device\_add" command can add a device emulated by a remote +process. It will also have "rid" option to the command, just as the +*-device* command line option does. The remote process may either be one +started at QEMU startup, or be one added by the "add-process" QMP +command described above. In either case, the remote process proxy will +forward the new device's JSON description to the corresponding emulation +process. + +live migration +^^^^^^^^^^^^^^ + +The remote process proxy will also register for live migration +notifications with ``vmstate_register()``. When called to save state, +the proxy will send the remote process a secondary socket file +descriptor to save the remote process's device *vmstate* over. The +incoming byte stream length and data will be saved as the proxy's +*vmstate*. When the proxy is resumed on its new host, this *vmstate* +will be extracted, and a secondary socket file descriptor will be sent +to the new remote process through which it receives the *vmstate* in +order to restore the devices there. + +device emulation in remote process +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The parts of QEMU that the emulation program will need include the +object model; the memory emulation objects; the device emulation objects +of the targeted device, and any dependent devices; and, the device's +backends. It will also need code to setup the machine environment, +handle requests from the QEMU process, and route machine-level requests +(such as interrupts or IOMMU mappings) back to the QEMU process. + +initialization +^^^^^^^^^^^^^^ + +The process initialization sequence will follow the same sequence +followed by QEMU. It will first initialize the backend objects, then +device emulation objects. The JSON descriptions sent by the QEMU process +will drive which objects need to be created. + +- address spaces + +Before the device objects are created, the initial address spaces and +memory regions must be configured with ``memory_map_init()``. This +creates a RAM memory region object (*system\_memory*) and an IO memory +region object (*system\_io*). + +- RAM + +RAM memory region creation will follow how ``pc_memory_init()`` creates +them, but must use ``memory_region_init_ram_from_fd()`` instead of +``memory_region_allocate_system_memory()``. The file descriptors needed +will be supplied by the guest memory table from above. Those RAM regions +would then be added to the *system\_memory* memory region with +``memory_region_add_subregion()``. + +- PCI + +IO initialization will be driven by the JSON descriptions sent from the +QEMU process. For a PCI device, a PCI bus will need to be created with +``pci_root_bus_new()``, and a PCI memory region will need to be created +and added to the *system\_memory* memory region with +``memory_region_add_subregion_overlap()``. 
The overlap version is +required for architectures where PCI memory overlaps with RAM memory. + +MMIO handling +^^^^^^^^^^^^^ + +The device emulation objects will use ``memory_region_init_io()`` to +install their MMIO handlers, and ``pci_register_bar()`` to associate +those handlers with a PCI BAR, as they do within QEMU currently. + +In order to use ``address_space_rw()`` in the emulation process to +handle MMIO requests from QEMU, the PCI physical addresses must be the +same in the QEMU process and the device emulation process. In order to +accomplish that, guest BAR programming must also be forwarded from QEMU +to the emulation process. + +interrupt injection +^^^^^^^^^^^^^^^^^^^ + +When device emulation wants to inject an interrupt into the VM, the +request climbs the device's bus object hierarchy until the point where a +bus object knows how to signal the interrupt to the guest. The details +depend on the type of interrupt being raised. + +- PCI pin interrupts + +On x86 systems, there is an emulated IOAPIC object attached to the root +PCI bus object, and the root PCI object forwards interrupt requests to +it. The IOAPIC object, in turn, calls the KVM driver to inject the +corresponding interrupt into the VM. The simplest way to handle this in +an emulation process would be to setup the root PCI bus driver (via +``pci_bus_irqs()``) to send a interrupt request back to the QEMU +process, and have the device proxy object reflect it up the PCI tree +there. + +- PCI MSI/X interrupts + +PCI MSI/X interrupts are implemented in HW as DMA writes to a +CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives +these DMA writes, then calls into the KVM driver to inject the interrupt +into the VM. A simple emulation process implementation would be to send +the MSI DMA address from QEMU as a message at initialization, then +install an address space handler at that address which forwards the MSI +message back to QEMU. + +DMA operations +^^^^^^^^^^^^^^ + +When a emulation object wants to DMA into or out of guest memory, it +first must use dma\_memory\_map() to convert the DMA address to a local +virtual address. The emulation process memory region objects setup above +will be used to translate the DMA address to a local virtual address the +device emulation code can access. + +IOMMU +^^^^^ + +When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory +regions to translate the DMA address to a guest physical address before +that physical address can be translated to a local virtual address. The +emulation process will need similar functionality. + +- IOTLB cache + +The emulation process will maintain a cache of recent IOMMU translations +(the IOTLB). When the translate() callback of an IOMMU memory region is +invoked, the IOTLB cache will be searched for an entry that will map the +DMA address to a guest PA. On a cache miss, a message will be sent back +to QEMU requesting the corresponding translation entry, which be both be +used to return a guest address and be added to the cache. + +- IOTLB purge + +The IOMMU emulation will also need to act on unmap requests from QEMU. +These happen when the guest IOMMU driver purges an entry from the +guest's translation table. + +live migration +^^^^^^^^^^^^^^ + +When a remote process receives a live migration indication from QEMU, it +will set up a channel using the received file descriptor with +``qio_channel_socket_new_fd()``. 
This channel will be used to create a
+*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
+the process's device state back to QEMU. This method will be reversed on
+restore - the channel will be passed to ``qemu_loadvm_state()`` to
+restore the device state.
+
+Accelerating device emulation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The messages that are required to be sent between QEMU and the emulation
+process can add considerable latency to IO operations. The optimizations
+described below attempt to ameliorate this effect by allowing the
+emulation process to communicate directly with the kernel KVM driver.
+The KVM file descriptors created would be passed to the emulation process
+via initialization messages, much as is done for the guest memory table.
+
+MMIO acceleration
+^^^^^^^^^^^^^^^^^
+
+Vhost user applications can receive guest virtio driver stores directly
+from KVM. The issue with the eventfd mechanism used by vhost user is
+that it does not pass any data with the event indication, so it cannot
+handle guest loads or guest stores that carry store data. This concept
+could, however, be expanded to cover more cases.
+
+The expanded idea would require a new type of KVM device:
+*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
+descriptor that QEMU can use for configuration, and a slave descriptor
+that the emulation process can use to receive MMIO notifications. QEMU
+would create both descriptors using the KVM driver, and pass the slave
+descriptor to the emulation process via an initialization message.
+
+data structures
+^^^^^^^^^^^^^^^
+
+- guest physical range
+
+The guest physical range structure describes the address range that a
+device will respond to. It includes the base and length of the range, as
+well as which bus the range resides on (e.g., on an x86 machine, it can
+specify whether the range refers to memory or IO addresses).
+
+A device can have multiple physical address ranges it responds to (e.g.,
+a PCI device can have multiple BARs), so the structure will also include
+an enumerated identifier to specify which of the device's ranges is
+being referred to.
+
++--------+----------------------------+
+| Name   | Description                |
++========+============================+
+| addr   | range base address         |
++--------+----------------------------+
+| len    | range length               |
++--------+----------------------------+
+| bus    | addr type (memory or IO)   |
++--------+----------------------------+
+| id     | range ID (e.g., PCI BAR)   |
++--------+----------------------------+
+
+- MMIO request structure
+
+This structure describes an MMIO operation. It includes which guest
+physical range the MMIO was within, the offset within that range, the
+MMIO type (e.g., load or store), and its length and data. It also
+includes a sequence number that can be used to reply to the MMIO, and
+the CPU that issued the MMIO.
+
++----------+------------------------+
+| Name     | Description            |
++==========+========================+
+| rid      | range MMIO is within   |
++----------+------------------------+
+| offset   | offset within *rid*    |
++----------+------------------------+
+| type     | e.g., load or store    |
++----------+------------------------+
+| len      | MMIO length            |
++----------+------------------------+
+| data     | store data             |
++----------+------------------------+
+| seq      | sequence ID            |
++----------+------------------------+
+| cpu      | issuing CPU            |
++----------+------------------------+
+
+- MMIO request queues
+
+MMIO request queues are FIFO arrays of MMIO request structures.
+There are two queues: the pending queue is for MMIOs that haven't been read
+by the emulation program, and the sent queue is for MMIOs that haven't been
+acknowledged. The main use of the second queue is to validate MMIO
+replies from the emulation program.
+
+- scoreboard
+
+Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
+MMIOs may be waiting to be consumed by an emulation program and multiple
+threads may be waiting for MMIO replies. The scoreboard would contain a
+wait queue and sequence number for the per-CPU threads, allowing them to
+be individually woken when the MMIO reply is received from the emulation
+program. It also tracks the number of posted MMIO stores to the device
+that haven't been replied to, in order to satisfy the PCI constraint
+that a load to a device will not complete until all previous stores to
+that device have been completed.
+
+- device shadow memory
+
+Some MMIO loads do not have device side-effects. These MMIOs can be
+completed without sending an MMIO request to the emulation program if the
+emulation program shares a shadow image of the device's memory image
+with the KVM driver.
+
+The emulation program will ask the KVM driver to allocate memory for the
+shadow image, and will then use ``mmap()`` to directly access it. The
+emulation program can control KVM access to the shadow image by sending
+KVM an access map telling it which areas of the image have no
+side-effects (and can be completed immediately), and which require an
+MMIO request to the emulation program. The access map can also inform
+the KVM driver which access sizes are allowed to the image.
+
+master descriptor
+^^^^^^^^^^^^^^^^^
+
+The master descriptor is used by QEMU to configure the new KVM device.
+The descriptor would be returned by the KVM driver when QEMU issues a
+*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.
+
+KVM\_DEV\_TYPE\_USER device ops
+'''''''''''''''''''''''''''''''
+
+The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
+``kvm_register_device_ops()`` call when the KVM system is initialized by
+``kvm_init()``. These device ops are called by the KVM driver when QEMU
+executes certain ``ioctl()`` operations on its KVM file descriptor. They
+include:
+
+- create
+
+This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
+``ioctl()`` on its per-VM file descriptor. It will allocate and
+initialize a KVM user device specific data structure, and assign the
+*kvm\_device* private field to it.
+
+- ioctl
+
+This routine is invoked when QEMU issues an ``ioctl()`` on the master
+descriptor. The ``ioctl()`` commands supported are defined by the KVM
+device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:
+
+*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
+be passed to the device emulation program. Only one slave can be created
+by each master descriptor. The file operations performed by this
+descriptor are described below.
+
+The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
+address range that the slave descriptor will receive MMIO notifications
+for. The range is specified by a guest physical range structure
+argument. For buses that assign addresses to devices dynamically, this
+command can be executed while the guest is running, as is the case
+when a guest changes a device's PCI BAR registers.
+
+*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
+register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
+performs an MMIO operation within the range.
+- destroy
+
+This routine is called when the VM instance is destroyed. It will need
+to destroy the slave descriptor and free any memory allocated by the
+driver, as well as the *kvm\_device* structure itself.
+
+slave descriptor
+^^^^^^^^^^^^^^^^
+
+The slave descriptor will have its own file operations vector, which
+responds to system calls on the descriptor performed by the device
+emulation program.
+
+- read
+
+A read returns any pending MMIO requests from the KVM driver as MMIO
+request structures. Multiple structures can be returned if there are
+multiple MMIO operations pending. The MMIO requests are moved from the
+pending queue to the sent queue, and if there are threads waiting for
+space in the pending queue to add new MMIO operations, they will be
+woken here.
+
+- write
+
+A write also consists of a set of MMIO requests. They are compared to
+the MMIO requests in the sent queue. Matches are removed from the sent
+queue, and any threads waiting for the reply are woken. If a store is
+removed, then the number of posted stores in the per-CPU scoreboard is
+decremented. When the number is zero, and a non-side-effect load was
+waiting for posted stores to complete, the load is continued.
+
+- ioctl
+
+There are several ``ioctl()`` commands that can be performed on the
+slave descriptor.
+
+A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
+allocate memory for the shadow image. This memory can later be
+``mmap()``\ ed by the emulation process to share the emulation's view of
+device memory with the KVM driver.
+
+A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
+shadow image. It will send the KVM driver a shadow control map, which
+specifies which areas of the image can complete guest loads without
+sending the load request to the emulation program. It will also specify
+the size of load operations that are allowed.
+
+- poll
+
+An emulation program will use the ``poll()`` call with a *POLLIN* flag
+to determine if there are MMIO requests waiting to be read. The call
+will return when the pending MMIO request queue is not empty.
+
+- mmap
+
+This call allows the emulation program to directly access the shadow
+image allocated by the KVM driver. As device emulation updates device
+memory, changes with no side-effects will be reflected in the shadow,
+and the KVM driver can satisfy guest loads from the shadow image without
+needing to wait for the emulation program.
+
+kvm\_io\_device ops
+^^^^^^^^^^^^^^^^^^^
+
+Each KVM per-CPU thread can handle MMIO operations on behalf of the
+guest VM. KVM will use the MMIO's guest physical address to search for a
+matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
+driver instead of exiting back to QEMU. If a match is found, the
+corresponding callback will be invoked.
+
+- read
+
+This callback is invoked when the guest performs a load to the device.
+Loads with side-effects must be handled synchronously, with the KVM
+driver putting the QEMU thread to sleep waiting for the emulation
+process reply before re-starting the guest. Loads that do not have
+side-effects may be optimized by satisfying them from the shadow image,
+if there are no outstanding stores to the device by this CPU. PCI memory
+ordering demands that a load cannot complete before all older stores to
+the same device have been completed.
+
+- write
+
+Stores can be handled asynchronously unless the pending MMIO request
+queue is full. In this case, the QEMU thread must sleep waiting for
+space in the queue. Stores will increment the number of posted stores in
+the per-CPU scoreboard, in order to implement the PCI ordering
+constraint above.
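+From the emulation program's point of view, the slave-descriptor
+operations described above combine into a simple service loop. The
+sketch below assumes the hypothetical ``struct kvm_user_mmio`` from
+earlier and a device-specific ``handle_mmio()`` dispatch function::
+
+    #include <poll.h>
+    #include <unistd.h>
+
+    void mmio_service_loop(int slave_fd)
+    {
+        struct kvm_user_mmio req[16];
+        struct pollfd pfd = { .fd = slave_fd, .events = POLLIN };
+        ssize_t n;
+        int i;
+
+        for (;;) {
+            /* wait until the pending queue is not empty */
+            poll(&pfd, 1, -1);
+
+            /* moves the requests from the pending to the sent queue */
+            n = read(slave_fd, req, sizeof(req));
+            if (n <= 0) {
+                break;
+            }
+
+            for (i = 0; i < n / sizeof(req[0]); i++) {
+                handle_mmio(&req[i]);   /* fills req[i].data for loads */
+            }
+
+            /* replies are matched against the sent queue by seq */
+            write(slave_fd, req, n);
+        }
+    }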
+interrupt acceleration
+^^^^^^^^^^^^^^^^^^^^^^
+
+This performance optimization would work much like a vhost user
+application does, where the QEMU process sets up *eventfds* that cause
+the device's corresponding interrupt to be triggered by the KVM driver.
+These irq file descriptors are sent to the emulation process at
+initialization, and are used when the emulation code raises a device
+interrupt.
+
+intx acceleration
+'''''''''''''''''
+
+Traditional PCI pin interrupts are level based, so, in addition to an
+irq file descriptor, a re-sampling file descriptor needs to be sent to
+the emulation program. This second file descriptor allows multiple
+devices sharing an irq to be notified when the interrupt has been
+acknowledged by the guest, so they can re-trigger the interrupt if their
+device has not de-asserted its interrupt.
+
+intx irq descriptor
+"""""""""""""""""""
+
+The irq descriptors are created by the proxy object using
+``event_notifier_init()`` to create the irq and re-sampling *eventfds*,
+and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
+The interrupt route can be found with
+``pci_device_route_intx_to_irq()``.
+
+intx routing changes
+""""""""""""""""""""
+
+Intx routing can be changed when the guest programs the APIC the device
+pin is connected to. The proxy object in QEMU will use
+``pci_device_set_intx_routing_notifier()`` to be informed of any guest
+changes to the route. This handler will broadly follow the VFIO
+interrupt logic to change the route: de-assigning the existing irq
+descriptor from its route, then assigning it the new route (see
+``vfio_intx_update()``).
+
+MSI/X acceleration
+''''''''''''''''''
+
+MSI/X interrupts are sent as DMA transactions to the host. The interrupt
+data contains a vector that is programmed by the guest. A device may
+have multiple MSI interrupts associated with it, so multiple irq
+descriptors may need to be sent to the emulation program.
+
+MSI/X irq descriptor
+""""""""""""""""""""
+
+This case will also follow the VFIO example. For each MSI/X interrupt,
+an *eventfd* is created, a virtual interrupt is allocated by
+``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
+the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
+
+MSI/X config space changes
+""""""""""""""""""""""""""
+
+The guest may dynamically update several MSI-related tables in the
+device's PCI config space. These include per-MSI interrupt enables and
+vector data. Additionally, MSI-X tables exist in device memory space,
+not config space. Much like the BAR case above, the proxy object must
+look at guest config space programming to keep the MSI interrupt state
+consistent between QEMU and the emulation program.
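+A sketch of the per-vector setup follows. The function names are the
+ones mentioned above and exist in QEMU, but their exact signatures have
+changed across QEMU versions, so treat this as illustrative rather than
+as the actual proxy implementation::
+
+    /* allocate a virtual interrupt for one MSI/X vector and bind its
+     * eventfd to it; the eventfd is then sent to the emulation process,
+     * which writes to it to raise the interrupt
+     * (signatures approximate) */
+    static void proxy_add_msix_vector(PCIDevice *dev, EventNotifier *n,
+                                      unsigned int vector)
+    {
+        int virq;
+
+        event_notifier_init(n, 0);
+        virq = kvm_irqchip_add_msi_route(kvm_state, vector, dev);
+        kvm_irqchip_add_irqfd_notifier(kvm_state, n, NULL, virq);
+    }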
+
+--------------
+
+Disaggregated CPU emulation
+---------------------------
+
+After IO services have been disaggregated, a second phase would be to
+separate a process to handle CPU instruction emulation from the main
+QEMU control function. There are no object separation points for this
+code, so the first task would be to create one.
+
+Host access controls
+--------------------
+
+Separating QEMU relies on the host OS's access restriction mechanisms to
+enforce that the differing processes can only access the objects they
+are entitled to. There are a couple of types of mechanisms usually
+provided by general purpose OSs.
+
+Discretionary access control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Discretionary access control allows each user to control who can access
+their files. In Linux, this type of control is usually too coarse for
+QEMU separation, since it only provides three separate access controls:
+one for the same user ID, the second for user IDs with the same group
+ID, and the third for all other user IDs. Each device instance would
+need a separate user ID to provide access control, which is likely to be
+unwieldy for dynamically created VMs.
+
+Mandatory access control
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mandatory access control allows the OS to add an additional,
+OS-controlled set of controls on top of discretionary access control.
+It also adds other attributes to processes and files such as types,
+roles, and categories, and can establish rules for how processes and
+files can interact.
+
+Type enforcement
+^^^^^^^^^^^^^^^^
+
+Type enforcement assigns a *type* attribute to processes and files, and
+allows rules to be written on what operations a process with a given
+type can perform on a file with a given type. QEMU separation could take
+advantage of type enforcement by running the emulation processes with
+different types, both from the main QEMU process, and from the emulation
+processes of different classes of devices.
+
+For example, guest disk images and disk emulation processes could have
+types separate from the main QEMU process and non-disk emulation
+processes, and the type rules could prevent processes other than disk
+emulation ones from accessing guest disk images. Similarly, network
+emulation processes can have a type separate from the main QEMU process
+and non-network emulation processes, and only that type can access the
+host tun/tap device used to provide guest networking.
+
+Category enforcement
+^^^^^^^^^^^^^^^^^^^^
+
+Category enforcement assigns a set of numbers within a given range to
+the process or file. The process is granted access to the file if the
+process's set is a superset of the file's set. This enforcement can be
+used to separate multiple instances of devices in the same class.
+
+For example, if there are multiple disk devices provided to a guest,
+each device emulation process could be provisioned with a separate
+category. The different device emulation processes would not be able to
+access each other's backing disk images.
+
+Alternatively, categories could be used in lieu of the type enforcement
+scheme described above. In this scenario, different categories would be
+used to prevent device emulation processes in different classes from
+accessing resources assigned to other classes.
diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
new file mode 100644
index 000000000..c9541a7b2
--- /dev/null
+++ b/docs/devel/multi-thread-tcg.rst
@@ -0,0 +1,373 @@
+..
+  Copyright (c) 2015-2020 Linaro Ltd.
+
+  This work is licensed under the terms of the GNU GPL, version 2 or
+  later. See the COPYING file in the top-level directory.
+
+==================
+Multi-threaded TCG
+==================
+
+This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
+system-mode emulation.
User-mode emulation has always mirrored the
+thread structure of the translated executable, although some of the
+changes done for MTTCG system emulation have improved the stability of
+linux-user emulation.
+
+The original system-mode TCG implementation was single threaded and
+dealt with multiple CPUs with simple round-robin scheduling. This
+simplified a lot of things but became increasingly limited as systems
+being emulated gained additional cores and per-core performance gains
+for host systems started to level off.
+
+vCPU Scheduling
+===============
+
+We introduce a new running mode where each vCPU will run on its own
+user-space thread. This is enabled by default for all FE/BE
+combinations where the host memory model is able to accommodate the
+guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
+guest has had the required work done to support this safely
+(TARGET_SUPPORTS_MTTCG).
+
+System emulation will fall back to the original round robin approach
+if:
+
+* forced by --accel tcg,thread=single
+* enabling --icount mode
+* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
+
+In the general case of running translated code there should be no
+inter-vCPU dependencies and all vCPUs should be able to run at full
+speed. Synchronisation will only be required while accessing internal
+shared data structures or when the emulated architecture requires a
+coherent representation of the emulated machine state.
+
+Shared Data Structures
+======================
+
+Main Run Loop
+-------------
+
+Even when there is no code being generated there are a number of
+structures associated with the hot-path through the main run-loop.
+These are associated with looking up the next translation block to
+execute. These include:
+
+    tb_jmp_cache (per-vCPU, cache of recent jumps)
+    tb_ctx.htable (global hash table, phys address->tb lookup)
+
+As TB linking only occurs when blocks are in the same page, this code
+is critical to performance as looking up the next TB to execute is the
+most common reason to exit the generated code.
+
+DESIGN REQUIREMENT: Make access to lookup structures safe with
+multiple reader/writer threads. Minimise any lock contention to do it.
+
+The hot-path avoids using locks where possible. The tb_jmp_cache is
+updated with atomic accesses to ensure consistent results. The
+fall-back QHT-based hash table is also designed for lockless lookups.
+Locks are only taken when code generation is required or
+TranslationBlocks have their block-to-block jumps patched.
+
+Global TCG State
+----------------
+
+User-mode emulation
+~~~~~~~~~~~~~~~~~~~
+
+We need to protect the entire code generation cycle including any post
+generation patching of the translated code. This also implies a shared
+translation buffer which contains code running on all cores. Any
+execution path that comes to the main run loop will need to hold a
+mutex for code generation. This also includes times when we need to
+flush code or entries from any shared lookups/caches. Structures held
+on a per-vCPU basis won't need locking unless other vCPUs will need to
+modify them.
+
+DESIGN REQUIREMENT: Add locking around all code generation and TB
+patching.
+
+(Current solution)
+
+Code generation is serialised with mmap_lock().
+
+!User-mode emulation
+~~~~~~~~~~~~~~~~~~~~
+
+Each vCPU has its own TCG context and associated TCG region, thereby
+requiring no locking during translation.
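+Illustrating the user-mode rule above, any path that generates a new
+TranslationBlock follows this pattern (a sketch; the real call sites
+live in accel/tcg and the argument list of ``tb_gen_code()`` varies by
+QEMU version)::
+
+    /* hold the mmap lock across generation and patching so that no
+     * other thread can write to the shared translation buffer */
+    mmap_lock();
+    tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
+    mmap_unlock();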
+
+Translation Blocks
+------------------
+
+Currently the whole system shares a single code generation buffer
+which, when full, will force a flush of all translations and start from
+scratch again. Some operations also force a full flush of translations
+including:
+
+  - debugging operations (breakpoint insertion/removal)
+  - some CPU helper functions
+  - linux-user spawning its first thread
+
+This is done with the async_safe_run_on_cpu() mechanism to ensure all
+vCPUs are quiescent when changes are being made to shared global
+structures.
+
+More granular translation invalidation events are typically due
+to a change of the state of a physical page:
+
+  - code modification (self modify code, patching code)
+  - page changes (new page mapping in linux-user mode)
+
+While setting the invalid flag in a TranslationBlock will stop it
+being used when looked up in the hot-path, there are a number of other
+book-keeping structures that need to be safely cleared.
+
+Any TranslationBlocks which have been patched to jump directly to the
+now invalid blocks need the jump patches reversed so they will return
+to the C code.
+
+There are a number of look-up caches that need to be properly updated
+including:
+
+  - the jump lookup cache
+  - the physical-to-tb lookup hash table
+  - the global page table
+
+The global page table (l1_map) provides a multi-level look-up for
+PageDesc structures, which contain pointers to the start of a linked
+list of all Translation Blocks in that page (see page_next).
+
+Both the jump patching and the page cache involve linked lists that
+the invalidated TranslationBlock needs to be removed from.
+
+DESIGN REQUIREMENT: Safely handle invalidation of TBs
+  - safely patch/revert direct jumps
+  - remove central PageDesc lookup entries
+  - ensure lookup caches/hashes are safely updated
+
+(Current solution)
+
+The direct jumps themselves are updated atomically by the TCG
+tb_set_jmp_target() code. Modification to the linked lists that allow
+searching for linked pages are done under the protection of tb->jmp_lock,
+where tb is the destination block of a jump. Each origin block keeps a
+pointer to its destinations so that the appropriate lock can be acquired before
+iterating over a jump list.
+
+The global page table is a lockless radix tree; cmpxchg is used
+to atomically insert new elements.
+
+The lookup caches are updated atomically and the lookup hash uses QHT
+which is designed for concurrent safe lookup.
+
+Parallel code generation is supported. QHT is used at insertion time
+as the synchronization point across threads, thereby ensuring that we only
+keep track of a single TranslationBlock for each guest code block.
+
+Memory maps and TLBs
+--------------------
+
+The memory handling code is fairly critical to the speed of memory
+access in the emulated system. The SoftMMU code is designed so the
+hot-path can be handled entirely within translated code. This is
+handled with a per-vCPU TLB structure which once populated will allow
+a series of accesses to the page to occur without exiting the
+translated code. It is possible to set flags in the TLB address which
+will ensure the slow-path is taken for each access.
This can be done
+to support:
+
+ - Memory regions (dividing up access to PIO, MMIO and RAM)
+ - Dirty page tracking (for code gen, SMC detection, migration and display)
+ - Virtual TLB (for translating guest address->real address)
+
+When the TLB tables are updated by a vCPU thread other than their own
+we need to ensure it is done in a safe way so no inconsistent state is
+seen by the vCPU thread.
+
+Some operations require updating a number of vCPUs' TLBs at the same
+time in a synchronised manner.
+
+DESIGN REQUIREMENTS:
+
+  - TLB Flush All/Page
+    - can be across-vCPUs
+    - cross vCPU TLB flush may need other vCPU brought to halt
+    - change may need to be visible to the calling vCPU immediately
+  - TLB Flag Update
+    - usually cross-vCPU
+    - want change to be visible as soon as possible
+  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+    - This is a per-vCPU table - by definition can't race
+    - updated by its own thread when the slow-path is forced
+
+(Current solution)
+
+We have updated cputlb.c to defer cross-vCPU operations with
+async_run_on_cpu(), which ensures each vCPU sees a coherent state when
+it next runs its work (in a few instructions time).
+
+A new set of operations (tlb_flush_*_all_cpus) takes an additional
+flag which, when set, will force synchronisation by setting the source
+vCPU's work as "safe work" and exiting the cpu run loop. This ensures
+that by the time execution restarts all flush operations have
+completed.
+
+TLB flag updates are all done atomically and are also protected by the
+corresponding page lock.
+
+(Known limitation)
+
+Not really a limitation but the wait mechanism is overly strict for
+some architectures which only need flushes completed by a barrier
+instruction. This could be a future optimisation.
+
+Emulated hardware state
+-----------------------
+
+Currently, thanks to KVM work, any access to IO memory is automatically
+protected by the global iothread mutex, also known as the BQL (Big
+QEMU Lock). Any IO region that doesn't use the global mutex is expected
+to do its own locking.
+
+However IO memory isn't the only way emulated hardware state can be
+modified. Some architectures have model specific registers that
+trigger hardware emulation features. Generally any translation helper
+that needs to update more than a single vCPU's state should take the
+BQL.
+
+As the BQL, or global iothread mutex, is shared across the system we
+push the use of the lock as far down into the TCG code as possible to
+minimise contention.
+
+(Current solution)
+
+MMIO access automatically serialises hardware emulation by way of the
+BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
+and also defer the reset/startup of vCPUs to the vCPU context by way
+of async_run_on_cpu().
+
+Updates to interrupt state are also protected by the BQL as they can
+often be cross vCPU.
+
+Memory Consistency
+==================
+
+Between emulated guests and host systems there are a range of memory
+consistency models. Even emulating weakly ordered systems on strongly
+ordered hosts needs to ensure things like store-after-load re-ordering
+can be prevented when the guest wants to.
+
+Memory Barriers
+---------------
+
+Barriers (sometimes known as fences) provide a mechanism for software
+to enforce a particular ordering of memory operations from the point
+of view of external observers (e.g. another processor core). They can
+apply to all memory operations or just to loads or stores.
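+The classic use is a flag-and-payload handover between two threads,
+which the primitives from QEMU's ``qemu/atomic.h`` express like this
+(a minimal sketch, not taken from the QEMU source)::
+
+    /* producer */
+    payload = compute();
+    smp_wmb();                  /* payload must be visible before flag */
+    qatomic_set(&flag, 1);
+
+    /* consumer */
+    while (!qatomic_read(&flag)) {
+        ;                       /* wait for the flag */
+    }
+    smp_rmb();                  /* flag must be read before payload */
+    use(payload);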
+
+The Linux kernel has an excellent `write-up
+<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
+on the various forms of memory barrier and the guarantees they can
+provide.
+
+Barriers are often wrapped around synchronisation primitives to
+provide explicit memory ordering semantics. However they can be used
+by themselves to provide safe lockless access by ensuring, for
+example, that a change to a signal flag will only be visible once the
+changes to the payload are.
+
+DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
+
+This would enforce a strong load/store ordering so all loads/stores
+complete at the memory barrier. On single-core non-SMP strongly
+ordered backends this could become a NOP.
+
+Aside from explicit standalone memory barrier instructions there are
+also implicit memory ordering semantics which come with each guest
+memory access instruction. For example all x86 load/stores come with
+fairly strong guarantees of sequential consistency whereas Arm has
+special variants of load/store instructions that imply acquire/release
+semantics.
+
+In the case of a strongly ordered guest architecture being emulated on
+a weakly ordered host, the scope for a heavy performance impact is
+quite high.
+
+DESIGN REQUIREMENTS: Be efficient with use of memory barriers
+  - host systems with stronger implied guarantees can skip some barriers
+  - merge consecutive barriers to the strongest one
+
+(Current solution)
+
+The system currently has a tcg_gen_mb() which will add memory barrier
+operations if code generation is being done in a parallel context. The
+tcg_optimize() function attempts to merge barriers up to their
+strongest form before any load/store operations. The solution was
+originally developed and tested for linux-user based systems. All
+backends have been converted to emit fences when required. So far the
+following front-ends have been updated to emit fences when required:
+
+  - target-i386
+  - target-arm
+  - target-aarch64
+  - target-alpha
+  - target-mips
+
+Memory Control and Maintenance
+------------------------------
+
+This includes a class of instructions for controlling system cache
+behaviour. While QEMU doesn't model cache behaviour, these instructions
+are often seen when code modification has taken place to ensure the
+changes take effect.
+
+Synchronisation Primitives
+--------------------------
+
+There are two broad types of synchronisation primitives found in
+modern ISAs: atomic instructions and exclusive regions.
+
+The first type offers a simple atomic instruction which will guarantee
+that some sort of test and conditional store will be truly atomic
+w.r.t. other cores sharing access to the memory. The classic example
+is the x86 cmpxchg instruction.
+
+The second type offers a pair of load/store instructions which offer a
+guarantee that a region of memory has not been touched between the
+load and store instructions. An example of this is Arm's ldrex/strex
+pair where the strex instruction will return a flag indicating a
+successful store only if no other CPU has accessed the memory region
+since the ldrex.
+
+Traditionally TCG has generated a series of operations that work
+because they are within the context of a single translation block so
+will have completed before another CPU is scheduled. However with
+the ability to have multiple threads running to emulate multiple CPUs
+we will need to explicitly expose these semantics.
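+As an illustration, a guest load-exclusive/store-conditional pair can
+be emulated with a host compare-and-swap, which is essentially what the
+atomic helpers described below do (a sketch, not the actual helper
+code)::
+
+    /* ldrex: load and remember the value seen */
+    uint32_t emu_ldrex(uint32_t *addr, uint32_t *seen)
+    {
+        *seen = qatomic_read(addr);
+        return *seen;
+    }
+
+    /* strex: returns 0 on success, like the real instruction; only
+     * succeeds if the location still holds the value seen by ldrex,
+     * which is where the ABA problem mentioned below creeps in */
+    int emu_strex(uint32_t *addr, uint32_t seen, uint32_t newval)
+    {
+        return qatomic_cmpxchg(addr, seen, newval) == seen ? 0 : 1;
+    }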
+
+DESIGN REQUIREMENTS:
+  - Support classic atomic instructions
+  - Support load/store exclusive (or load link/store conditional) pairs
+  - Generic enough infrastructure to support all guest architectures
+
+CURRENT OPEN QUESTIONS:
+  - How problematic is the ABA problem in general?
+
+(Current solution)
+
+The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
+can be used directly or combined to emulate other instructions like
+Arm's ldrex/strex instructions. While they are susceptible to the ABA
+problem, so far common guests have not implemented patterns where
+this may be a problem - typically presenting a locking ABI which
+assumes cmpxchg-like semantics.
+
+The code also includes a fall-back for cases where multi-threaded TCG
+ops can't work (e.g. guest atomic width > host atomic width). In this
+case an EXCP_ATOMIC exit occurs and the instruction is emulated with
+an exclusive lock which ensures all emulation is serialised.
+
+While the atomic helpers look good enough for now there may be a need
+to look at solutions that can more closely model the guest
+architecture's semantics.
diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt
new file mode 100644
index 000000000..aeb997bed
--- /dev/null
+++ b/docs/devel/multiple-iothreads.txt
@@ -0,0 +1,138 @@
+Copyright (c) 2014-2017 Red Hat Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+
+This document explains the IOThread feature and how to write code that runs
+outside the QEMU global mutex.
+
+The main loop and IOThreads
+---------------------------
+QEMU is an event-driven program that can do several things at once using an
+event loop. The VNC server and the QMP monitor are both processed from the
+same event loop, which monitors their file descriptors until they become
+readable and then invokes a callback.
+
+The default event loop is called the main loop (see main-loop.c). It is
+possible to create additional event loop threads using -object
+iothread,id=my-iothread.
+
+Side note: The main loop and IOThread are both event loops but their code is
+not shared completely. Sometimes it is useful to remember that although they
+are conceptually similar they are currently not interchangeable.
+
+Why IOThreads are useful
+------------------------
+IOThreads allow the user to control the placement of work. The main loop is a
+scalability bottleneck on hosts with many CPUs. Work can be spread across
+several IOThreads instead of just one main loop. When set up correctly this
+can improve I/O latency and reduce jitter seen by the guest.
+
+The main loop is also deeply associated with the QEMU global mutex, which is a
+scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
+global mutex to serialize execution of QEMU code. This mutex is necessary
+because a lot of QEMU's code historically was not thread-safe.
+
+The fact that all I/O processing is done in a single main loop and that the
+QEMU global mutex is contended by all vCPU threads and the main loop explains
+why it is desirable to place work into IOThreads.
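+For example, a virtio-blk device can be tied to a dedicated IOThread like
+this (an illustrative command line; the iothread property is also available
+on other devices such as virtio-scsi):
+
+  qemu-system-x86_64 ... \
+      -object iothread,id=iothread0 \
+      -drive if=none,id=drive0,file=disk.img \
+      -device virtio-blk-pci,drive=drive0,iothread=iothread0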
+ +The experimental virtio-blk data-plane implementation has been benchmarked and +shows these effects: +ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf + +How to program for IOThreads +---------------------------- +The main difference between legacy code and new code that can run in an +IOThread is dealing explicitly with the event loop object, AioContext +(see include/block/aio.h). Code that only works in the main loop +implicitly uses the main loop's AioContext. Code that supports running +in IOThreads must be aware of its AioContext. + +AioContext supports the following services: + * File descriptor monitoring (read/write/error on POSIX hosts) + * Event notifiers (inter-thread signalling) + * Timers + * Bottom Halves (BH) deferred callbacks + +There are several old APIs that use the main loop AioContext: + * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor + * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier + * LEGACY timer_new_ms() - create a timer + * LEGACY qemu_bh_new() - create a BH + * LEGACY qemu_aio_wait() - run an event loop iteration + +Since they implicitly work on the main loop they cannot be used in code that +runs in an IOThread. They might cause a crash or deadlock if called from an +IOThread since the QEMU global mutex is not held. + +Instead, use the AioContext functions directly (see include/block/aio.h): + * aio_set_fd_handler() - monitor a file descriptor + * aio_set_event_notifier() - monitor an event notifier + * aio_timer_new() - create a timer + * aio_bh_new() - create a BH + * aio_poll() - run an event loop iteration + +The AioContext can be obtained from the IOThread using +iothread_get_aio_context() or for the main loop using qemu_get_aio_context(). +Code that takes an AioContext argument works both in IOThreads or the main +loop, depending on which AioContext instance the caller passes in. + +How to synchronize with an IOThread +----------------------------------- +AioContext is not thread-safe so some rules must be followed when using file +descriptors, event notifiers, timers, or BHs across threads: + +1. AioContext functions can always be called safely. They handle their +own locking internally. + +2. Other threads wishing to access the AioContext must use +aio_context_acquire()/aio_context_release() for mutual exclusion. Once the +context is acquired no other thread can access it or run event loop iterations +in this AioContext. + +Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls. +Do not use nesting anymore, it is incompatible with the BDRV_POLL_WHILE() macro +used in the block layer and can lead to hangs. + +There is currently no lock ordering rule if a thread needs to acquire multiple +AioContexts simultaneously. Therefore, it is only safe for code holding the +QEMU global mutex to acquire other AioContexts. + +Side note: the best way to schedule a function call across threads is to call +aio_bh_schedule_oneshot(). No acquire/release or locking is needed. + +AioContext and the block layer +------------------------------ +The AioContext originates from the QEMU block layer, even though nowadays +AioContext is a generic event loop that can be used by any QEMU subsystem. + +The block layer has support for AioContext integrated. Each BlockDriverState +is associated with an AioContext using bdrv_try_set_aio_context() and +bdrv_get_aio_context(). This allows block layer code to process I/O inside the +right AioContext. 
Other subsystems may wish to follow a similar approach.
+
+Block layer code must therefore expect to run in an IOThread and avoid using
+old APIs that implicitly use the main loop. See "How to program for
+IOThreads" above for information on how to do that.
+
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically needs to ensure that past
+requests from the guest are completed. When a block device is running
+in an IOThread, the IOThread can also process requests from the guest
+(via ioeventfd). To achieve both objectives, wrap the code between
+bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
+section". The functions must be called between aio_context_acquire()
+and aio_context_release(). You can freely release and re-acquire the
+AioContext within a drained section.
+
+Long-running jobs (usually in the form of coroutines) are best scheduled in
+the BlockDriverState's AioContext to avoid the need to acquire/release around
+each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier,
+or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
+can be used to get a notification whenever bdrv_try_set_aio_context() moves a
+BlockDriverState to a different AioContext.
diff --git a/docs/devel/qapi-code-gen.rst b/docs/devel/qapi-code-gen.rst
new file mode 100644
index 000000000..a3b547308
--- /dev/null
+++ b/docs/devel/qapi-code-gen.rst
@@ -0,0 +1,1932 @@
+==================================
+How to use the QAPI code generator
+==================================
+
+..
+    Copyright IBM Corp. 2011
+    Copyright (C) 2012-2016 Red Hat, Inc.
+
+    This work is licensed under the terms of the GNU GPL, version 2 or
+    later. See the COPYING file in the top-level directory.
+
+
+Introduction
+============
+
+QAPI is a native C API within QEMU which provides management-level
+functionality to internal and external users. For external
+users/processes, this interface is made available by a JSON-based wire
+format for the QEMU Monitor Protocol (QMP) for controlling qemu, as
+well as the QEMU Guest Agent (QGA) for communicating with the guest.
+The remainder of this document uses "Client JSON Protocol" when
+referring to the wire contents of a QMP or QGA connection.
+
+To map between Client JSON Protocol interfaces and the native C API,
+we generate C code from a QAPI schema. This document describes the
+QAPI schema language, and how it gets mapped to the Client JSON
+Protocol and to C. It additionally provides guidance on maintaining
+Client JSON Protocol compatibility.
+
+
+The QAPI schema language
+========================
+
+The QAPI schema defines the Client JSON Protocol's commands and
+events, as well as types used by them. Forward references are
+allowed.
+
+It is permissible for the schema to contain additional types not used
+by any commands or events, for the side effect of generated C code
+used internally.
+
+There are several kinds of types: simple types (a number of built-in
+types, such as ``int`` and ``str``; as well as enumerations), arrays,
+complex types (structs and two flavors of unions), and alternate types
+(a choice between other types).
+
+
+Schema syntax
+-------------
+
+Syntax is loosely based on `JSON <http://www.ietf.org/rfc/rfc8259.txt>`_.
+
+Differences:
+
+* Comments: start with a hash character (``#``) that is not part of a
+  string, and extend to the end of the line.
+
+* Strings are enclosed in ``'single quotes'``, not ``"double quotes"``.
+
+* Strings are restricted to printable ASCII, and escape sequences to
+  just ``\\``.
+
+* Numbers and ``null`` are not supported.
+
+A second layer of syntax defines the sequences of JSON texts that are
+a correctly structured QAPI schema. We provide a grammar for this
+syntax in an EBNF-like notation:
+
+* Production rules look like ``non-terminal = expression``
+* Concatenation: expression ``A B`` matches expression ``A``, then ``B``
+* Alternation: expression ``A | B`` matches expression ``A`` or ``B``
+* Repetition: expression ``A...`` matches zero or more occurrences of
+  expression ``A``
+* Repetition: expression ``A, ...`` matches zero or more occurrences of
+  expression ``A`` separated by ``,``
+* Grouping: expression ``( A )`` matches expression ``A``
+* JSON's structural characters are terminals: ``{ } [ ] : ,``
+* JSON's literal names are terminals: ``false true``
+* String literals enclosed in ``'single quotes'`` are terminal, and match
+  this JSON string, with a leading ``*`` stripped off
+* When a JSON object member's name starts with ``*``, the member is
+  optional.
+* The symbol ``STRING`` is a terminal, and matches any JSON string
+* The symbol ``BOOL`` is a terminal, and matches JSON ``false`` or ``true``
+* ALL-CAPS words other than ``STRING`` are non-terminals
+
+The order of members within JSON objects does not matter unless
+explicitly noted.
+
+A QAPI schema consists of a series of top-level expressions::
+
+    SCHEMA = TOP-LEVEL-EXPR...
+
+The top-level expressions are all JSON objects. Code and
+documentation is generated in schema definition order. Code order
+should not matter.
+
+A top-level expression is either a directive or a definition::
+
+    TOP-LEVEL-EXPR = DIRECTIVE | DEFINITION
+
+There are two kinds of directives and six kinds of definitions::
+
+    DIRECTIVE = INCLUDE | PRAGMA
+    DEFINITION = ENUM | STRUCT | UNION | ALTERNATE | COMMAND | EVENT
+
+These are discussed in detail below.
+
+
+Built-in Types
+--------------
+
+The following types are predefined, and map to C as follows:
+
+  ============= ============== ============================================
+  Schema        C              JSON
+  ============= ============== ============================================
+  ``str``       ``char *``     any JSON string, UTF-8
+  ``number``    ``double``     any JSON number
+  ``int``       ``int64_t``    a JSON number without fractional part
+                               that fits into the C integer type
+  ``int8``      ``int8_t``     likewise
+  ``int16``     ``int16_t``    likewise
+  ``int32``     ``int32_t``    likewise
+  ``int64``     ``int64_t``    likewise
+  ``uint8``     ``uint8_t``    likewise
+  ``uint16``    ``uint16_t``   likewise
+  ``uint32``    ``uint32_t``   likewise
+  ``uint64``    ``uint64_t``   likewise
+  ``size``      ``uint64_t``   like ``uint64_t``, except
+                               ``StringInputVisitor`` accepts size suffixes
+  ``bool``      ``bool``       JSON ``true`` or ``false``
+  ``null``      ``QNull *``    JSON ``null``
+  ``any``       ``QObject *``  any JSON value
+  ``QType``     ``QType``      JSON string matching enum ``QType`` values
+  ============= ============== ============================================
+
+
+Include directives
+------------------
+
+Syntax::
+
+    INCLUDE = { 'include': STRING }
+
+The QAPI schema definitions can be modularized using the 'include' directive::
+
+    { 'include': 'path/to/file.json' }
+
+The directive is evaluated recursively, and include paths are relative
+to the file using the directive. Multiple includes of the same file
+are idempotent.
+
+As a matter of style, it is a good idea to have all files be
+self-contained, but at the moment, nothing prevents an included file
+from making a forward reference to a type that is only introduced by
+an outer file. The parser may be made stricter in the future to
+prevent incomplete include files.
+
+.. _pragma:
+
+Pragma directives
+-----------------
+
+Syntax::
+
+    PRAGMA = { 'pragma': {
+               '*doc-required': BOOL,
+               '*command-name-exceptions': [ STRING, ... ],
+               '*command-returns-exceptions': [ STRING, ... ],
+               '*member-name-exceptions': [ STRING, ... ] } }
+
+The pragma directive lets you control optional generator behavior.
+
+Pragma's scope is currently the complete schema. Setting the same
+pragma to different values in parts of the schema doesn't work.
+
+Pragma 'doc-required' takes a boolean value. If true, documentation
+is required. Default is false.
+
+Pragma 'command-name-exceptions' takes a list of commands whose names
+may contain ``"_"`` instead of ``"-"``. Default is none.
+
+Pragma 'command-returns-exceptions' takes a list of commands that may
+violate the rules on permitted return types. Default is none.
+
+Pragma 'member-name-exceptions' takes a list of types whose member
+names may contain uppercase letters, and ``"_"`` instead of ``"-"``.
+Default is none.
+
+.. _ENUM-VALUE:
+
+Enumeration types
+-----------------
+
+Syntax::
+
+    ENUM = { 'enum': STRING,
+             'data': [ ENUM-VALUE, ... ],
+             '*prefix': STRING,
+             '*if': COND,
+             '*features': FEATURES }
+    ENUM-VALUE = STRING
+               | { 'name': STRING,
+                   '*if': COND,
+                   '*features': FEATURES }
+
+Member 'enum' names the enum type.
+
+Each member of the 'data' array defines a value of the enumeration
+type. The form STRING is shorthand for :code:`{ 'name': STRING }`. The
+'name' values must be distinct.
+
+Example::
+
+    { 'enum': 'MyEnum', 'data': [ 'value1', 'value2', 'value3' ] }
+
+Nothing prevents an empty enumeration, although it is probably not
+useful.
+
+On the wire, an enumeration type's value is represented by its
+(string) name. In C, it's represented by an enumeration constant.
+These are of the form PREFIX_NAME, where PREFIX is derived from the +enumeration type's name, and NAME from the value's name. For the +example above, the generator maps 'MyEnum' to MY_ENUM and 'value1' to +VALUE1, resulting in the enumeration constant MY_ENUM_VALUE1. The +optional 'prefix' member overrides PREFIX. + +The generated C enumeration constants have values 0, 1, ..., N-1 (in +QAPI schema order), where N is the number of values. There is an +additional enumeration constant PREFIX__MAX with value N. + +Do not use string or an integer type when an enumeration type can do +the job satisfactorily. + +The optional 'if' member specifies a conditional. See `Configuring the +schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. + + +.. _TYPE-REF: + +Type references and array types +------------------------------- + +Syntax:: + + TYPE-REF = STRING | ARRAY-TYPE + ARRAY-TYPE = [ STRING ] + +A string denotes the type named by the string. + +A one-element array containing a string denotes an array of the type +named by the string. Example: ``['int']`` denotes an array of ``int``. + + +Struct types +------------ + +Syntax:: + + STRUCT = { 'struct': STRING, + 'data': MEMBERS, + '*base': STRING, + '*if': COND, + '*features': FEATURES } + MEMBERS = { MEMBER, ... } + MEMBER = STRING : TYPE-REF + | STRING : { 'type': TYPE-REF, + '*if': COND, + '*features': FEATURES } + +Member 'struct' names the struct type. + +Each MEMBER of the 'data' object defines a member of the struct type. + +.. _MEMBERS: + +The MEMBER's STRING name consists of an optional ``*`` prefix and the +struct member name. If ``*`` is present, the member is optional. + +The MEMBER's value defines its properties, in particular its type. +The form TYPE-REF_ is shorthand for :code:`{ 'type': TYPE-REF }`. + +Example:: + + { 'struct': 'MyType', + 'data': { 'member1': 'str', 'member2': ['int'], '*member3': 'str' } } + +A struct type corresponds to a struct in C, and an object in JSON. +The C struct's members are generated in QAPI schema order. + +The optional 'base' member names a struct type whose members are to be +included in this type. They go first in the C struct. + +Example:: + + { 'struct': 'BlockdevOptionsGenericFormat', + 'data': { 'file': 'str' } } + { 'struct': 'BlockdevOptionsGenericCOWFormat', + 'base': 'BlockdevOptionsGenericFormat', + 'data': { '*backing': 'str' } } + +An example BlockdevOptionsGenericCOWFormat object on the wire could use +both members like this:: + + { "file": "/some/place/my-image", + "backing": "/some/place/my-backing-file" } + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. + + +Union types +----------- + +Syntax:: + + UNION = { 'union': STRING, + 'base': ( MEMBERS | STRING ), + 'discriminator': STRING, + 'data': BRANCHES, + '*if': COND, + '*features': FEATURES } + BRANCHES = { BRANCH, ... } + BRANCH = STRING : TYPE-REF + | STRING : { 'type': TYPE-REF, '*if': COND } + +Member 'union' names the union type. + +The 'base' member defines the common members. If it is a MEMBERS_ +object, it defines common members just like a struct type's 'data' +member defines struct type members. If it is a STRING, it names a +struct type whose members are the common members. + +Member 'discriminator' must name a non-optional enum-typed member of +the base struct. 
That member's value selects a branch by its name.
+If no such branch exists, an empty branch is assumed.
+
+Each BRANCH of the 'data' object defines a branch of the union. A
+union must have at least one branch.
+
+The BRANCH's STRING name is the branch name. It must be a value of
+the discriminator enum type.
+
+The BRANCH's value defines the branch's properties, in particular its
+type. The type must be a struct type. The form TYPE-REF_ is shorthand
+for :code:`{ 'type': TYPE-REF }`.
+
+In the Client JSON Protocol, a union is represented by an object with
+the common members (from the base type) and the selected branch's
+members. The two sets of member names must be disjoint.
+
+Example::
+
+    { 'enum': 'BlockdevDriver', 'data': [ 'file', 'qcow2' ] }
+    { 'union': 'BlockdevOptions',
+      'base': { 'driver': 'BlockdevDriver', '*read-only': 'bool' },
+      'discriminator': 'driver',
+      'data': { 'file': 'BlockdevOptionsFile',
+                'qcow2': 'BlockdevOptionsQcow2' } }
+
+Resulting in these JSON objects::
+
+    { "driver": "file", "read-only": true,
+      "filename": "/some/place/my-image" }
+    { "driver": "qcow2", "read-only": false,
+      "backing": "/some/place/my-image", "lazy-refcounts": true }
+
+The order of branches need not match the order of the enum values.
+The branches need not cover all possible enum values. In the
+resulting generated C data types, a union is represented as a struct
+with the base members in QAPI schema order, and then a union of
+structures for each branch of the struct.
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Alternate types
+---------------
+
+Syntax::
+
+    ALTERNATE = { 'alternate': STRING,
+                  'data': ALTERNATIVES,
+                  '*if': COND,
+                  '*features': FEATURES }
+    ALTERNATIVES = { ALTERNATIVE, ... }
+    ALTERNATIVE = STRING : STRING
+                | STRING : { 'type': STRING, '*if': COND }
+
+Member 'alternate' names the alternate type.
+
+Each ALTERNATIVE of the 'data' object defines a branch of the
+alternate. An alternate must have at least one branch.
+
+The ALTERNATIVE's STRING name is the branch name.
+
+The ALTERNATIVE's value defines the branch's properties, in particular
+its type. The form STRING is shorthand for :code:`{ 'type': STRING }`.
+
+Example::
+
+    { 'alternate': 'BlockdevRef',
+      'data': { 'definition': 'BlockdevOptions',
+                'reference': 'str' } }
+
+An alternate type is like a union type, except there is no
+discriminator on the wire. Instead, the branch to use is inferred
+from the value. An alternate can only express a choice between types
+represented differently on the wire.
+
+If a branch is typed as the 'bool' built-in, the alternate accepts
+true and false; if it is typed as any of the various numeric
+built-ins, it accepts a JSON number; if it is typed as a 'str'
+built-in or named enum type, it accepts a JSON string; if it is typed
+as the 'null' built-in, it accepts JSON null; and if it is typed as a
+complex type (struct or union), it accepts a JSON object.
+
+The example alternate declaration above allows using both of the
+following example objects::
+
+    { "file": "my_existing_block_device_id" }
+    { "file": { "driver": "file",
+                "read-only": false,
+                "filename": "/tmp/mydisk.qcow2" } }
+
+The optional 'if' member specifies a conditional. See `Configuring
+the schema`_ below for more on this.
+
+The optional 'features' member specifies features. See Features_
+below for more on this.
+
+
+Commands
+--------
+
+Syntax::
+
+    COMMAND = { 'command': STRING,
+                (
+                '*data': ( MEMBERS | STRING ),
+                |
+                'data': STRING,
+                'boxed': true,
+                )
+                '*returns': TYPE-REF,
+                '*success-response': false,
+                '*gen': false,
+                '*allow-oob': true,
+                '*allow-preconfig': true,
+                '*coroutine': true,
+                '*if': COND,
+                '*features': FEATURES }
+
+Member 'command' names the command.
+
+Member 'data' defines the arguments. It defaults to an empty MEMBERS_
+object.
+
+If 'data' is a MEMBERS_ object, then MEMBERS defines arguments just
+like a struct type's 'data' defines struct type members.
+
+If 'data' is a STRING, then STRING names a complex type whose members
+are the arguments. A union type requires ``'boxed': true``.
+
+Member 'returns' defines the command's return type. It defaults to an
+empty struct type. It must normally be a complex type or an array of
+a complex type. To return anything else, the command must be listed
+in pragma 'command-returns-exceptions'. If you do this, extending
+the command to return additional information will be harder. Use of
+the pragma for new commands is strongly discouraged.
+
+A command's error responses are not specified in the QAPI schema.
+Error conditions should be documented in comments.
+
+In the Client JSON Protocol, the value of the "execute" or "exec-oob"
+member is the command name. The value of the "arguments" member then
+has to conform to the arguments, and the value of the success
+response's "return" member will conform to the return type.
+
+Some example commands::
+
+    { 'command': 'my-first-command',
+      'data': { 'arg1': 'str', '*arg2': 'str' } }
+    { 'struct': 'MyType', 'data': { '*value': 'str' } }
+    { 'command': 'my-second-command',
+      'returns': [ 'MyType' ] }
+
+which would validate this Client JSON Protocol transaction::
+
+    => { "execute": "my-first-command",
+         "arguments": { "arg1": "hello" } }
+    <= { "return": { } }
+    => { "execute": "my-second-command" }
+    <= { "return": [ { "value": "one" }, { } ] }
+
+The generator emits a prototype for the C function implementing the
+command. The function itself needs to be written by hand. See
+section `Code generated for commands`_ for examples.
+
+The function returns the return type. When member 'boxed' is absent,
+it takes the command arguments as arguments one by one, in QAPI schema
+order. Else it takes them wrapped in the C struct generated for the
+complex argument type. It takes an additional ``Error **`` argument in
+either case.
+
+The generator also emits a marshalling function that extracts
+arguments for the user's function out of an input QDict, calls the
+user's function, and if it succeeded, builds an output QObject from
+its return value. This is for use by the QMP monitor core.
+
+In rare cases, QAPI cannot express a type-safe representation of a
+corresponding Client JSON Protocol command. You then have to suppress
+generation of a marshalling function by including a member 'gen' with
+boolean value false, and instead write your own function. For
+example::
+
+    { 'command': 'netdev_add',
+      'data': {'type': 'str', 'id': 'str'},
+      'gen': false }
+
+Please try to avoid adding new commands that rely on this, and instead
+use type-safe unions.
+
+Normally, the QAPI schema is used to describe synchronous exchanges,
+where a response is expected. But in some cases, the action of a
+command is expected to change state in a way that a successful
+response is not possible (although the command will still return an
+error object on failure).
When a successful reply is not possible, +the command definition includes the optional member 'success-response' +with boolean value false. So far, only QGA makes use of this member. + +Member 'allow-oob' declares whether the command supports out-of-band +(OOB) execution. It defaults to false. For example:: + + { 'command': 'migrate_recover', + 'data': { 'uri': 'str' }, 'allow-oob': true } + +See qmp-spec.txt for out-of-band execution syntax and semantics. + +Commands supporting out-of-band execution can still be executed +in-band. + +When a command is executed in-band, its handler runs in the main +thread with the BQL held. + +When a command is executed out-of-band, its handler runs in a +dedicated monitor I/O thread with the BQL *not* held. + +An OOB-capable command handler must satisfy the following conditions: + +- It terminates quickly. +- It does not invoke system calls that may block. +- It does not access guest RAM that may block when userfaultfd is + enabled for postcopy live migration. +- It takes only "fast" locks, i.e. all critical sections protected by + any lock it takes also satisfy the conditions for OOB command + handler code. + +The restrictions on locking limit access to shared state. Such access +requires synchronization, but OOB commands can't take the BQL or any +other "slow" lock. + +When in doubt, do not implement OOB execution support. + +Member 'allow-preconfig' declares whether the command is available +before the machine is built. It defaults to false. For example:: + + { 'enum': 'QMPCapability', + 'data': [ 'oob' ] } + { 'command': 'qmp_capabilities', + 'data': { '*enable': [ 'QMPCapability' ] }, + 'allow-preconfig': true } + +QMP is available before the machine is built only when QEMU was +started with --preconfig. + +Member 'coroutine' tells the QMP dispatcher whether the command handler +is safe to be run in a coroutine. It defaults to false. If it is true, +the command handler is called from coroutine context and may yield while +waiting for an external event (such as I/O completion) in order to avoid +blocking the guest and other background operations. + +Coroutine safety can be hard to prove, similar to thread safety. Common +pitfalls are: + +- The global mutex isn't held across ``qemu_coroutine_yield()``, so + operations that used to assume that they execute atomically may have + to be more careful to protect against changes in the global state. + +- Nested event loops (``AIO_WAIT_WHILE()`` etc.) are problematic in + coroutine context and can easily lead to deadlocks. They should be + replaced by yielding and reentering the coroutine when the condition + becomes false. + +Since the command handler may assume coroutine context, any callers +other than the QMP dispatcher must also call it in coroutine context. +In particular, HMP commands calling such a QMP command handler must be +marked ``.coroutine = true`` in hmp-commands.hx. + +It is an error to specify both ``'coroutine': true`` and ``'allow-oob': true`` +for a command. We don't currently have a use case for both together and +without a use case, it's not entirely clear what the semantics should +be. + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. 
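+To make the hand-written side concrete: for ``my-first-command`` from
+the example earlier, the implementation would look roughly like the
+following, where ``do_something()`` stands in for the command's real
+work. This is a sketch; the exact generated prototype (for instance,
+whether an optional pointer argument carries a ``has_`` flag) depends
+on the QEMU version. ::
+
+    void qmp_my_first_command(const char *arg1,
+                              bool has_arg2, const char *arg2,
+                              Error **errp)
+    {
+        if (!has_arg2) {
+            arg2 = "default";       /* illustrative default */
+        }
+        if (!do_something(arg1, arg2)) {
+            error_setg(errp, "my-first-command failed");
+        }
+    }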
+ + +Events +------ + +Syntax:: + + EVENT = { 'event': STRING, + ( + '*data': ( MEMBERS | STRING ), + | + 'data': STRING, + 'boxed': true, + ) + '*if': COND, + '*features': FEATURES } + +Member 'event' names the event. This is the event name used in the +Client JSON Protocol. + +Member 'data' defines the event-specific data. It defaults to an +empty MEMBERS object. + +If 'data' is a MEMBERS object, then MEMBERS defines event-specific +data just like a struct type's 'data' defines struct type members. + +If 'data' is a STRING, then STRING names a complex type whose members +are the event-specific data. A union type requires ``'boxed': true``. + +An example event is:: + + { 'event': 'EVENT_C', + 'data': { '*a': 'int', 'b': 'str' } } + +Resulting in this JSON object:: + + { "event": "EVENT_C", + "data": { "b": "test string" }, + "timestamp": { "seconds": 1267020223, "microseconds": 435656 } } + +The generator emits a function to send the event. When member 'boxed' +is absent, it takes event-specific data one by one, in QAPI schema +order. Else it takes them wrapped in the C struct generated for the +complex type. See section `Code generated for events`_ for examples. + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +The optional 'features' member specifies features. See Features_ +below for more on this. + + +.. _FEATURE: + +Features +-------- + +Syntax:: + + FEATURES = [ FEATURE, ... ] + FEATURE = STRING + | { 'name': STRING, '*if': COND } + +Sometimes, the behaviour of QEMU changes compatibly, but without a +change in the QMP syntax (usually by allowing values or operations +that previously resulted in an error). QMP clients may still need to +know whether the extension is available. + +For this purpose, a list of features can be specified for a command or +struct type. Each list member can either be ``{ 'name': STRING, '*if': +COND }``, or STRING, which is shorthand for ``{ 'name': STRING }``. + +The optional 'if' member specifies a conditional. See `Configuring +the schema`_ below for more on this. + +Example:: + + { 'struct': 'TestType', + 'data': { 'number': 'int' }, + 'features': [ 'allow-negative-numbers' ] } + +The feature strings are exposed to clients in introspection, as +explained in section `Client JSON Protocol introspection`_. + +Intended use is to have each feature string signal that this build of +QEMU shows a certain behaviour. + + +Special features +~~~~~~~~~~~~~~~~ + +Feature "deprecated" marks a command, event, enum value, or struct +member as deprecated. It is not supported elsewhere so far. +Interfaces so marked may be withdrawn in future releases in accordance +with QEMU's deprecation policy. + +Feature "unstable" marks a command, event, enum value, or struct +member as unstable. It is not supported elsewhere so far. Interfaces +so marked may be withdrawn or changed incompatibly in future releases. + + +Naming rules and reserved names +------------------------------- + +All names must begin with a letter, and contain only ASCII letters, +digits, hyphen, and underscore. There are two exceptions: enum values +may start with a digit, and names that are downstream extensions (see +section `Downstream extensions`_) start with underscore. + +Names beginning with ``q_`` are reserved for the generator, which uses +them for munging QMP names that resemble C keywords or other +problematic strings. For example, a member named ``default`` in qapi +becomes ``q_default`` in the generated C code. 
+
+Types, commands, and events share a common namespace.  Therefore,
+generally speaking, type definitions should always use CamelCase for
+user-defined type names, while built-in types are lowercase.
+
+Type names ending with ``Kind`` or ``List`` are reserved for the
+generator, which uses them for implicit union enums and array types,
+respectively.
+
+Command names, and member names within a type, should be all lower
+case with words separated by a hyphen.  However, some existing older
+commands and complex types use underscore; when extending them,
+consistency is preferred over blindly avoiding underscore.
+
+Event names should be ALL_CAPS with words separated by underscore.
+
+Member name ``u`` and names starting with ``has-`` or ``has_`` are reserved
+for the generator, which uses them for unions and for tracking
+optional members.
+
+Names beginning with ``x-`` used to signify "experimental".  This
+convention has been replaced by special feature "unstable".
+
+Pragmas ``command-name-exceptions`` and ``member-name-exceptions`` let
+you violate naming rules.  Use for new code is strongly discouraged.  See
+`Pragma directives`_ for details.
+
+
+Downstream extensions
+---------------------
+
+QAPI schema names that are externally visible, say in the Client JSON
+Protocol, need to be managed with care.  Names starting with a
+downstream prefix of the form __RFQDN_ are reserved for the downstream
+who controls the valid, reverse fully qualified domain name RFQDN.
+RFQDN may only contain ASCII letters, digits, hyphen and period.
+
+Example: Red Hat, Inc. controls redhat.com, and may therefore add a
+downstream command ``__com.redhat_drive-mirror``.
+
+
+Configuring the schema
+----------------------
+
+Syntax::
+
+    COND = STRING
+         | { 'all': [ COND, ... ] }
+         | { 'any': [ COND, ... ] }
+         | { 'not': COND }
+
+All definitions take an optional 'if' member.  Its value must be a
+string, or an object with a single member 'all', 'any' or 'not'.
+
+The C code generated for the definition will then be guarded by an #if
+preprocessing directive with an operand generated from that condition:
+
+ * STRING will generate defined(STRING)
+ * { 'all': [COND, ...] } will generate (COND && ...)
+ * { 'any': [COND, ...] } will generate (COND || ...)
+ * { 'not': COND } will generate !COND
+
+Example: a conditional struct ::
+
+ { 'struct': 'IfStruct', 'data': { 'foo': 'int' },
+   'if': { 'all': [ 'CONFIG_FOO', 'HAVE_BAR' ] } }
+
+gets its generated code guarded like this::
+
+ #if defined(CONFIG_FOO) && defined(HAVE_BAR)
+ ... generated code ...
+ #endif /* defined(HAVE_BAR) && defined(CONFIG_FOO) */
+
+Individual members of complex types, command arguments, and
+event-specific data can also be made conditional.  This requires the
+longhand form of MEMBER.
+
+Example: a struct type with unconditional member 'foo' and conditional
+member 'bar' ::
+
+ { 'struct': 'IfStruct',
+   'data': { 'foo': 'int',
+             'bar': { 'type': 'int', 'if': 'IFCOND'} } }
+
+A union's discriminator may not be conditional.
+
+Likewise, individual enumeration values can be conditional.  This
+requires the longhand form of ENUM-VALUE_.
+
+Example: an enum type with unconditional value 'foo' and conditional
+value 'bar' ::
+
+ { 'enum': 'IfEnum',
+   'data': [ 'foo',
+             { 'name' : 'bar', 'if': 'IFCOND' } ] }
+
+Likewise, features can be conditional.  This requires the longhand
+form of FEATURE_.
+ +Example: a struct with conditional feature 'allow-negative-numbers' :: + + { 'struct': 'TestType', + 'data': { 'number': 'int' }, + 'features': [ { 'name': 'allow-negative-numbers', + 'if': 'IFCOND' } ] } + +Please note that you are responsible to ensure that the C code will +compile with an arbitrary combination of conditions, since the +generator is unable to check it at this point. + +The conditions apply to introspection as well, i.e. introspection +shows a conditional entity only when the condition is satisfied in +this particular build. + + +Documentation comments +---------------------- + +A multi-line comment that starts and ends with a ``##`` line is a +documentation comment. + +If the documentation comment starts like :: + + ## + # @SYMBOL: + +it documents the definition of SYMBOL, else it's free-form +documentation. + +See below for more on `Definition documentation`_. + +Free-form documentation may be used to provide additional text and +structuring content. + + +Headings and subheadings +~~~~~~~~~~~~~~~~~~~~~~~~ + +A free-form documentation comment containing a line which starts with +some ``=`` symbols and then a space defines a section heading:: + + ## + # = This is a top level heading + # + # This is a free-form comment which will go under the + # top level heading. + ## + + ## + # == This is a second level heading + ## + +A heading line must be the first line of the documentation +comment block. + +Section headings must always be correctly nested, so you can only +define a third-level heading inside a second-level heading, and so on. + + +Documentation markup +~~~~~~~~~~~~~~~~~~~~ + +Documentation comments can use most rST markup. In particular, +a ``::`` literal block can be used for examples:: + + # :: + # + # Text of the example, may span + # multiple lines + +``*`` starts an itemized list:: + + # * First item, may span + # multiple lines + # * Second item + +You can also use ``-`` instead of ``*``. + +A decimal number followed by ``.`` starts a numbered list:: + + # 1. First item, may span + # multiple lines + # 2. Second item + +The actual number doesn't matter. + +Lists of either kind must be preceded and followed by a blank line. +If a list item's text spans multiple lines, then the second and +subsequent lines must be correctly indented to line up with the +first character of the first line. + +The usual ****strong****, *\*emphasized\** and ````literal```` markup +should be used. If you need a single literal ``*``, you will need to +backslash-escape it. As an extension beyond the usual rST syntax, you +can also use ``@foo`` to reference a name in the schema; this is rendered +the same way as ````foo````. + +Example:: + + ## + # Some text foo with **bold** and *emphasis* + # 1. with a list + # 2. like that + # + # And some code: + # + # :: + # + # $ echo foo + # -> do this + # <- get that + ## + + +Definition documentation +~~~~~~~~~~~~~~~~~~~~~~~~ + +Definition documentation, if present, must immediately precede the +definition it documents. + +When documentation is required (see pragma_ 'doc-required'), every +definition must have documentation. + +Definition documentation starts with a line naming the definition, +followed by an optional overview, a description of each argument (for +commands and events), member (for structs and unions), branch (for +alternates), or value (for enums), a description of each feature (if +any), and finally optional tagged sections. + +The description of an argument or feature 'name' starts with +'\@name:'. 
The description text can start on the line following the
+'\@name:', in which case it must not be indented at all.  It can also
+start on the same line as the '\@name:'.  In this case, if it spans
+multiple lines, then the second and subsequent lines must be indented
+to line up with the first character of the first line of the
+description::
+
+ # @argone:
+ #   This is a two line description
+ #   in the first style.
+ #
+ # @argtwo: This is a two line description
+ #          in the second style.
+
+The number of spaces between the ':' and the text is not significant.
+
+.. admonition:: FIXME
+
+   The parser accepts these things in almost any order.
+
+.. admonition:: FIXME
+
+   union branches should be described, too.
+
+Extensions added after the definition was first released carry a
+'(since x.y.z)' comment.
+
+The feature descriptions must be preceded by a line "Features:", like
+this::
+
+  # Features:
+  # @feature: Description text
+
+A tagged section starts with one of the following words:
+"Note:"/"Notes:", "Since:", "Example"/"Examples", "Returns:", "TODO:".
+The section ends with the start of a new section.
+
+The text of a section can start on a new line, in which case it must
+not be indented at all.  It can also start on the same line as the
+'Note:', 'Returns:', etc. tag.  In this case, if it spans multiple
+lines, then the second and subsequent lines must be indented to match
+the first, in the same way as multiline argument descriptions.
+
+A 'Since: x.y.z' tagged section lists the release that introduced the
+definition.
+
+An 'Example' or 'Examples' section is automatically rendered entirely
+as literal fixed-width text.  In other sections, the text is
+formatted, and rST markup can be used.
+
+For example::
+
+ ##
+ # @BlockStats:
+ #
+ # Statistics of a virtual block device or a block backing device.
+ #
+ # @device: If the stats are for a virtual block device, the name
+ #          corresponding to the virtual block device.
+ #
+ # @node-name: The node name of the device. (since 2.3)
+ #
+ # ... more members ...
+ #
+ # Since: 0.14.0
+ ##
+ { 'struct': 'BlockStats',
+   'data': {'*device': 'str', '*node-name': 'str',
+            ... more members ... } }
+
+ ##
+ # @query-blockstats:
+ #
+ # Query the @BlockStats for all virtual block devices.
+ #
+ # @query-nodes: If true, the command will query all the
+ #               block nodes ... explain, explain ...  (since 2.3)
+ #
+ # Returns: A list of @BlockStats for each virtual block device.
+ #
+ # Since: 0.14.0
+ #
+ # Example:
+ #
+ # -> { "execute": "query-blockstats" }
+ # <- {
+ #      ... lots of output ...
+ #    }
+ #
+ ##
+ { 'command': 'query-blockstats',
+   'data': { '*query-nodes': 'bool' },
+   'returns': ['BlockStats'] }
+
+
+Client JSON Protocol introspection
+==================================
+
+Clients of a Client JSON Protocol commonly need to figure out what
+exactly the server (QEMU) supports.
+
+For this purpose, QMP provides introspection via command
+query-qmp-schema.  QGA currently doesn't support introspection.
+
+While Client JSON Protocol wire compatibility should be maintained
+between qemu versions, we cannot make the same guarantees for
+introspection stability.  For example, one version of qemu may provide
+a non-variant optional member of a struct, and a later version rework
+the member to instead be non-optional and associated with a variant.
+Likewise, one version of qemu may list a member with open-ended type +'str', and a later version could convert it to a finite set of strings +via an enum type; or a member may be converted from a specific type to +an alternate that represents a choice between the original type and +something else. + +query-qmp-schema returns a JSON array of SchemaInfo objects. These +objects together describe the wire ABI, as defined in the QAPI schema. +There is no specified order to the SchemaInfo objects returned; a +client must search for a particular name throughout the entire array +to learn more about that name, but is at least guaranteed that there +will be no collisions between type, command, and event names. + +However, the SchemaInfo can't reflect all the rules and restrictions +that apply to QMP. It's interface introspection (figuring out what's +there), not interface specification. The specification is in the QAPI +schema. To understand how QMP is to be used, you need to study the +QAPI schema. + +Like any other command, query-qmp-schema is itself defined in the QAPI +schema, along with the SchemaInfo type. This text attempts to give an +overview how things work. For details you need to consult the QAPI +schema. + +SchemaInfo objects have common members "name", "meta-type", +"features", and additional variant members depending on the value of +meta-type. + +Each SchemaInfo object describes a wire ABI entity of a certain +meta-type: a command, event or one of several kinds of type. + +SchemaInfo for commands and events have the same name as in the QAPI +schema. + +Command and event names are part of the wire ABI, but type names are +not. Therefore, the SchemaInfo for types have auto-generated +meaningless names. For readability, the examples in this section use +meaningful type names instead. + +Optional member "features" exposes the entity's feature strings as a +JSON array of strings. + +To examine a type, start with a command or event using it, then follow +references by name. + +QAPI schema definitions not reachable that way are omitted. + +The SchemaInfo for a command has meta-type "command", and variant +members "arg-type", "ret-type" and "allow-oob". On the wire, the +"arguments" member of a client's "execute" command must conform to the +object type named by "arg-type". The "return" member that the server +passes in a success response conforms to the type named by "ret-type". +When "allow-oob" is true, it means the command supports out-of-band +execution. It defaults to false. + +If the command takes no arguments, "arg-type" names an object type +without members. Likewise, if the command returns nothing, "ret-type" +names an object type without members. + +Example: the SchemaInfo for command query-qmp-schema :: + + { "name": "query-qmp-schema", "meta-type": "command", + "arg-type": "q_empty", "ret-type": "SchemaInfoList" } + + Type "q_empty" is an automatic object type without members, and type + "SchemaInfoList" is the array of SchemaInfo type. + +The SchemaInfo for an event has meta-type "event", and variant member +"arg-type". On the wire, a "data" member that the server passes in an +event conforms to the object type named by "arg-type". + +If the event carries no additional information, "arg-type" names an +object type without members. The event may not have a data member on +the wire then. + +Each command or event defined with 'data' as MEMBERS object in the +QAPI schema implicitly defines an object type. 
+
+Example: the SchemaInfo for EVENT_C from section Events_ ::
+
+    { "name": "EVENT_C", "meta-type": "event",
+      "arg-type": "q_obj-EVENT_C-arg" }
+
+    Type "q_obj-EVENT_C-arg" is an implicitly defined object type with
+    the two members from the event's definition.
+
+The SchemaInfo for struct and union types has meta-type "object".
+
+The SchemaInfo for a struct type has variant member "members".
+
+The SchemaInfo for a union type additionally has variant members "tag"
+and "variants".
+
+"members" is a JSON array describing the object's common members, if
+any.  Each element is a JSON object with members "name" (the member's
+name), "type" (the name of its type), "features" (a JSON array of
+feature strings), and "default".  The latter two are optional.  The
+member is optional if "default" is present.  Currently, "default" can
+only have value null.  Other values are reserved for future
+extensions.  The "members" array is in no particular order; clients
+must search the entire object when learning whether a particular
+member is supported.
+
+Example: the SchemaInfo for MyType from section `Struct types`_ ::
+
+    { "name": "MyType", "meta-type": "object",
+      "members": [
+          { "name": "member1", "type": "str" },
+          { "name": "member2", "type": "int" },
+          { "name": "member3", "type": "str", "default": null } ] }
+
+"features" exposes the type's feature strings as a JSON array of
+strings.
+
+Example: the SchemaInfo for TestType from section Features_::
+
+    { "name": "TestType", "meta-type": "object",
+      "members": [
+          { "name": "number", "type": "int" } ],
+      "features": ["allow-negative-numbers"] }
+
+"tag" is the name of the common member serving as type tag.
+"variants" is a JSON array describing the object's variant members.
+Each element is a JSON object with members "case" (the value of type
+tag this element applies to) and "type" (the name of an object type
+that provides the variant members for this type tag value).  The
+"variants" array is in no particular order, and is not guaranteed to
+list cases in the same order as the corresponding "tag" enum type.
+
+Example: the SchemaInfo for union BlockdevOptions from section
+`Union types`_ ::
+
+    { "name": "BlockdevOptions", "meta-type": "object",
+      "members": [
+          { "name": "driver", "type": "BlockdevDriver" },
+          { "name": "read-only", "type": "bool", "default": null } ],
+      "tag": "driver",
+      "variants": [
+          { "case": "file", "type": "BlockdevOptionsFile" },
+          { "case": "qcow2", "type": "BlockdevOptionsQcow2" } ] }
+
+Note that base types are "flattened": their members are included in the
+"members" array.
+
+The SchemaInfo for an alternate type has meta-type "alternate", and
+variant member "members".  "members" is a JSON array.  Each element is
+a JSON object with member "type", which names a type.  Values of the
+alternate type conform to exactly one of its member types.  There is
+no guarantee on the order in which "members" will be listed.
+
+Example: the SchemaInfo for BlockdevRef from section `Alternate types`_ ::
+
+    { "name": "BlockdevRef", "meta-type": "alternate",
+      "members": [
+          { "type": "BlockdevOptions" },
+          { "type": "str" } ] }
+
+The SchemaInfo for an array type has meta-type "array", and variant
+member "element-type", which names the array's element type.  Array
+types are implicitly defined.  For convenience, the array's name may
+resemble the element type; however, clients should examine member
+"element-type" instead of making assumptions based on parsing member
+"name".
+ +Example: the SchemaInfo for ['str'] :: + + { "name": "[str]", "meta-type": "array", + "element-type": "str" } + +The SchemaInfo for an enumeration type has meta-type "enum" and +variant member "members". + +"members" is a JSON array describing the enumeration values. Each +element is a JSON object with member "name" (the member's name), and +optionally "features" (a JSON array of feature strings). The +"members" array is in no particular order; clients must search the +entire array when learning whether a particular value is supported. + +Example: the SchemaInfo for MyEnum from section `Enumeration types`_ :: + + { "name": "MyEnum", "meta-type": "enum", + "members": [ + { "name": "value1" }, + { "name": "value2" }, + { "name": "value3" } + ] } + +The SchemaInfo for a built-in type has the same name as the type in +the QAPI schema (see section `Built-in Types`_), with one exception +detailed below. It has variant member "json-type" that shows how +values of this type are encoded on the wire. + +Example: the SchemaInfo for str :: + + { "name": "str", "meta-type": "builtin", "json-type": "string" } + +The QAPI schema supports a number of integer types that only differ in +how they map to C. They are identical as far as SchemaInfo is +concerned. Therefore, they get all mapped to a single type "int" in +SchemaInfo. + +As explained above, type names are not part of the wire ABI. Not even +the names of built-in types. Clients should examine member +"json-type" instead of hard-coding names of built-in types. + + +Compatibility considerations +============================ + +Maintaining backward compatibility at the Client JSON Protocol level +while evolving the schema requires some care. This section is about +syntactic compatibility, which is necessary, but not sufficient, for +actual compatibility. + +Clients send commands with argument data, and receive command +responses with return data and events with event data. + +Adding opt-in functionality to the send direction is backwards +compatible: adding commands, optional arguments, enumeration values, +union and alternate branches; turning an argument type into an +alternate of that type; making mandatory arguments optional. Clients +oblivious of the new functionality continue to work. + +Incompatible changes include removing commands, command arguments, +enumeration values, union and alternate branches, adding mandatory +command arguments, and making optional arguments mandatory. + +The specified behavior of an absent optional argument should remain +the same. With proper documentation, this policy still allows some +flexibility; for example, when an optional 'buffer-size' argument is +specified to default to a sensible buffer size, the actual default +value can still be changed. The specified default behavior is not the +exact size of the buffer, only that the default size is sensible. + +Adding functionality to the receive direction is generally backwards +compatible: adding events, adding return and event data members. +Clients are expected to ignore the ones they don't know. + +Removing "unreachable" stuff like events that can't be triggered +anymore, optional return or event data members that can't be sent +anymore, and return or event data member (enumeration) values that +can't be sent anymore makes no difference to clients, except for +introspection. The latter can conceivably confuse clients, so tread +carefully. + +Incompatible changes include removing return and event data members. 
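+
+To make the direction rules concrete, here is a sketch of a
+compatible evolution (the command and argument names are invented for
+the example).  The command::
+
+ { 'command': 'frob',
+   'data': { 'target': 'str' } }
+
+can compatibly grow an optional argument::
+
+ { 'command': 'frob',
+   'data': { 'target': 'str', '*force': 'bool' } }
+
+Old clients never send 'force' and keep working; making 'force'
+mandatory later, however, would be an incompatible change.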
+
+Any change to a command definition's 'data' or one of the types used
+there (recursively) needs to consider send direction compatibility.
+
+Any change to a command definition's 'returns', an event definition's
+'data', or one of the types used there (recursively) needs to consider
+receive direction compatibility.
+
+Any change to types used in both contexts needs to consider both.
+
+Enumeration type values and complex and alternate type members may be
+reordered freely.  For enumerations and alternate types, this doesn't
+affect the wire encoding.  For complex types, this might make the
+implementation emit JSON object members in a different order, which
+the Client JSON Protocol permits.
+
+Since type names are not visible in the Client JSON Protocol, types
+may be freely renamed.  Even certain refactorings are invisible, such
+as splitting members from one type into a common base type.
+
+
+Code generation
+===============
+
+The QAPI code generator qapi-gen.py generates code and documentation
+from the schema.  Together with the core QAPI libraries, this code
+provides everything required to take JSON commands read in by a Client
+JSON Protocol server, unmarshal the arguments into the underlying C
+types, call into the corresponding C function, map the response back
+to a Client JSON Protocol response to be returned to the user, and
+introspect the commands.
+
+As an example, we'll use the following schema, which describes a
+single complex user-defined type, along with a command which takes a
+list of that type as a parameter, and returns a single element of that
+type.  The user is responsible for writing the implementation of
+qmp_my_command(); everything else is produced by the generator. ::
+
+    $ cat example-schema.json
+    { 'struct': 'UserDefOne',
+      'data': { 'integer': 'int', '*string': 'str' } }
+
+    { 'command': 'my-command',
+      'data': { 'arg1': ['UserDefOne'] },
+      'returns': 'UserDefOne' }
+
+    { 'event': 'MY_EVENT' }
+
+We run qapi-gen.py like this::
+
+    $ python scripts/qapi-gen.py --output-dir="qapi-generated" \
+    --prefix="example-" example-schema.json
+
+For a more thorough look at generated code, the testsuite includes
+tests/qapi-schema/qapi-schema-test.json that covers more examples of
+what the generator will accept, and compiles the resulting C code as
+part of 'make check-unit'.
+
+
+Code generated for QAPI types
+-----------------------------
+
+The following files are created:
+
+ ``$(prefix)qapi-types.h``
+     C types corresponding to types defined in the schema
+
+ ``$(prefix)qapi-types.c``
+     Cleanup functions for the above C types
+
+The $(prefix) is an optional parameter used as a namespace to keep the
+generated code from one schema/code-generation separated from others, so
+code can be generated/used from multiple schemas without clobbering
+previously created code.
+
+Example::
+
+    $ cat qapi-generated/example-qapi-types.h
+    [Uninteresting stuff omitted...]
+ + #ifndef EXAMPLE_QAPI_TYPES_H + #define EXAMPLE_QAPI_TYPES_H + + #include "qapi/qapi-builtin-types.h" + + typedef struct UserDefOne UserDefOne; + + typedef struct UserDefOneList UserDefOneList; + + typedef struct q_obj_my_command_arg q_obj_my_command_arg; + + struct UserDefOne { + int64_t integer; + bool has_string; + char *string; + }; + + void qapi_free_UserDefOne(UserDefOne *obj); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(UserDefOne, qapi_free_UserDefOne) + + struct UserDefOneList { + UserDefOneList *next; + UserDefOne *value; + }; + + void qapi_free_UserDefOneList(UserDefOneList *obj); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(UserDefOneList, qapi_free_UserDefOneList) + + struct q_obj_my_command_arg { + UserDefOneList *arg1; + }; + + #endif /* EXAMPLE_QAPI_TYPES_H */ + $ cat qapi-generated/example-qapi-types.c + [Uninteresting stuff omitted...] + + void qapi_free_UserDefOne(UserDefOne *obj) + { + Visitor *v; + + if (!obj) { + return; + } + + v = qapi_dealloc_visitor_new(); + visit_type_UserDefOne(v, NULL, &obj, NULL); + visit_free(v); + } + + void qapi_free_UserDefOneList(UserDefOneList *obj) + { + Visitor *v; + + if (!obj) { + return; + } + + v = qapi_dealloc_visitor_new(); + visit_type_UserDefOneList(v, NULL, &obj, NULL); + visit_free(v); + } + + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into :: + + SUBDIR/$(prefix)qapi-types-SUBMODULE.h + SUBDIR/$(prefix)qapi-types-SUBMODULE.c + +If qapi-gen.py is run with option --builtins, additional files are +created: + + ``qapi-builtin-types.h`` + C types corresponding to built-in types + + ``qapi-builtin-types.c`` + Cleanup functions for the above C types + + +Code generated for visiting QAPI types +-------------------------------------- + +These are the visitor functions used to walk through and convert +between a native QAPI C data structure and some other format (such as +QObject); the generated functions are named visit_type_FOO() and +visit_type_FOO_members(). + +The following files are generated: + + ``$(prefix)qapi-visit.c`` + Visitor function for a particular C type, used to automagically + convert QObjects into the corresponding C type and vice-versa, as + well as for deallocating memory for an existing C type + + ``$(prefix)qapi-visit.h`` + Declarations for previously mentioned visitor functions + +Example:: + + $ cat qapi-generated/example-qapi-visit.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_VISIT_H + #define EXAMPLE_QAPI_VISIT_H + + #include "qapi/qapi-builtin-visit.h" + #include "example-qapi-types.h" + + + bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp); + + bool visit_type_UserDefOne(Visitor *v, const char *name, + UserDefOne **obj, Error **errp); + + bool visit_type_UserDefOneList(Visitor *v, const char *name, + UserDefOneList **obj, Error **errp); + + bool visit_type_q_obj_my_command_arg_members(Visitor *v, q_obj_my_command_arg *obj, Error **errp); + + #endif /* EXAMPLE_QAPI_VISIT_H */ + $ cat qapi-generated/example-qapi-visit.c + [Uninteresting stuff omitted...] 
+ + bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp) + { + if (!visit_type_int(v, "integer", &obj->integer, errp)) { + return false; + } + if (visit_optional(v, "string", &obj->has_string)) { + if (!visit_type_str(v, "string", &obj->string, errp)) { + return false; + } + } + return true; + } + + bool visit_type_UserDefOne(Visitor *v, const char *name, + UserDefOne **obj, Error **errp) + { + bool ok = false; + + if (!visit_start_struct(v, name, (void **)obj, sizeof(UserDefOne), errp)) { + return false; + } + if (!*obj) { + /* incomplete */ + assert(visit_is_dealloc(v)); + ok = true; + goto out_obj; + } + if (!visit_type_UserDefOne_members(v, *obj, errp)) { + goto out_obj; + } + ok = visit_check_struct(v, errp); + out_obj: + visit_end_struct(v, (void **)obj); + if (!ok && visit_is_input(v)) { + qapi_free_UserDefOne(*obj); + *obj = NULL; + } + return ok; + } + + bool visit_type_UserDefOneList(Visitor *v, const char *name, + UserDefOneList **obj, Error **errp) + { + bool ok = false; + UserDefOneList *tail; + size_t size = sizeof(**obj); + + if (!visit_start_list(v, name, (GenericList **)obj, size, errp)) { + return false; + } + + for (tail = *obj; tail; + tail = (UserDefOneList *)visit_next_list(v, (GenericList *)tail, size)) { + if (!visit_type_UserDefOne(v, NULL, &tail->value, errp)) { + goto out_obj; + } + } + + ok = visit_check_list(v, errp); + out_obj: + visit_end_list(v, (void **)obj); + if (!ok && visit_is_input(v)) { + qapi_free_UserDefOneList(*obj); + *obj = NULL; + } + return ok; + } + + bool visit_type_q_obj_my_command_arg_members(Visitor *v, q_obj_my_command_arg *obj, Error **errp) + { + if (!visit_type_UserDefOneList(v, "arg1", &obj->arg1, errp)) { + return false; + } + return true; + } + + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into :: + + SUBDIR/$(prefix)qapi-visit-SUBMODULE.h + SUBDIR/$(prefix)qapi-visit-SUBMODULE.c + +If qapi-gen.py is run with option --builtins, additional files are +created: + + ``qapi-builtin-visit.h`` + Visitor functions for built-in types + + ``qapi-builtin-visit.c`` + Declarations for these visitor functions + + +Code generated for commands +--------------------------- + +These are the marshaling/dispatch functions for the commands defined +in the schema. The generated code provides qmp_marshal_COMMAND(), and +declares qmp_COMMAND() that the user must implement. + +The following files are generated: + + ``$(prefix)qapi-commands.c`` + Command marshal/dispatch functions for each QMP command defined in + the schema + + ``$(prefix)qapi-commands.h`` + Function prototypes for the QMP commands specified in the schema + + ``$(prefix)qapi-init-commands.h`` + Command initialization prototype + + ``$(prefix)qapi-init-commands.c`` + Command initialization code + +Example:: + + $ cat qapi-generated/example-qapi-commands.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_COMMANDS_H + #define EXAMPLE_QAPI_COMMANDS_H + + #include "example-qapi-types.h" + + UserDefOne *qmp_my_command(UserDefOneList *arg1, Error **errp); + void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp); + + #endif /* EXAMPLE_QAPI_COMMANDS_H */ + $ cat qapi-generated/example-qapi-commands.c + [Uninteresting stuff omitted...] 
+ + + static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in, + QObject **ret_out, Error **errp) + { + Visitor *v; + + v = qobject_output_visitor_new_qmp(ret_out); + if (visit_type_UserDefOne(v, "unused", &ret_in, errp)) { + visit_complete(v, ret_out); + } + visit_free(v); + v = qapi_dealloc_visitor_new(); + visit_type_UserDefOne(v, "unused", &ret_in, NULL); + visit_free(v); + } + + void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp) + { + Error *err = NULL; + bool ok = false; + Visitor *v; + UserDefOne *retval; + q_obj_my_command_arg arg = {0}; + + v = qobject_input_visitor_new_qmp(QOBJECT(args)); + if (!visit_start_struct(v, NULL, NULL, 0, errp)) { + goto out; + } + if (visit_type_q_obj_my_command_arg_members(v, &arg, errp)) { + ok = visit_check_struct(v, errp); + } + visit_end_struct(v, NULL); + if (!ok) { + goto out; + } + + retval = qmp_my_command(arg.arg1, &err); + error_propagate(errp, err); + if (err) { + goto out; + } + + qmp_marshal_output_UserDefOne(retval, ret, errp); + + out: + visit_free(v); + v = qapi_dealloc_visitor_new(); + visit_start_struct(v, NULL, NULL, 0, NULL); + visit_type_q_obj_my_command_arg_members(v, &arg, NULL); + visit_end_struct(v, NULL); + visit_free(v); + } + + [Uninteresting stuff omitted...] + $ cat qapi-generated/example-qapi-init-commands.h + [Uninteresting stuff omitted...] + #ifndef EXAMPLE_QAPI_INIT_COMMANDS_H + #define EXAMPLE_QAPI_INIT_COMMANDS_H + + #include "qapi/qmp/dispatch.h" + + void example_qmp_init_marshal(QmpCommandList *cmds); + + #endif /* EXAMPLE_QAPI_INIT_COMMANDS_H */ + $ cat qapi-generated/example-qapi-init-commands.c + [Uninteresting stuff omitted...] + void example_qmp_init_marshal(QmpCommandList *cmds) + { + QTAILQ_INIT(cmds); + + qmp_register_command(cmds, "my-command", + qmp_marshal_my_command, QCO_NO_OPTIONS); + } + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into:: + + SUBDIR/$(prefix)qapi-commands-SUBMODULE.h + SUBDIR/$(prefix)qapi-commands-SUBMODULE.c + + +Code generated for events +------------------------- + +This is the code related to events defined in the schema, providing +qapi_event_send_EVENT(). + +The following files are created: + + ``$(prefix)qapi-events.h`` + Function prototypes for each event type + + ``$(prefix)qapi-events.c`` + Implementation of functions to send an event + + ``$(prefix)qapi-emit-events.h`` + Enumeration of all event names, and common event code declarations + + ``$(prefix)qapi-emit-events.c`` + Common event code definitions + +Example:: + + $ cat qapi-generated/example-qapi-events.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_EVENTS_H + #define EXAMPLE_QAPI_EVENTS_H + + #include "qapi/util.h" + #include "example-qapi-types.h" + + void qapi_event_send_my_event(void); + + #endif /* EXAMPLE_QAPI_EVENTS_H */ + $ cat qapi-generated/example-qapi-events.c + [Uninteresting stuff omitted...] + + void qapi_event_send_my_event(void) + { + QDict *qmp; + + qmp = qmp_event_build_dict("MY_EVENT"); + + example_qapi_event_emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp); + + qobject_unref(qmp); + } + + [Uninteresting stuff omitted...] + $ cat qapi-generated/example-qapi-emit-events.h + [Uninteresting stuff omitted...] 
+ + #ifndef EXAMPLE_QAPI_EMIT_EVENTS_H + #define EXAMPLE_QAPI_EMIT_EVENTS_H + + #include "qapi/util.h" + + typedef enum example_QAPIEvent { + EXAMPLE_QAPI_EVENT_MY_EVENT, + EXAMPLE_QAPI_EVENT__MAX, + } example_QAPIEvent; + + #define example_QAPIEvent_str(val) \ + qapi_enum_lookup(&example_QAPIEvent_lookup, (val)) + + extern const QEnumLookup example_QAPIEvent_lookup; + + void example_qapi_event_emit(example_QAPIEvent event, QDict *qdict); + + #endif /* EXAMPLE_QAPI_EMIT_EVENTS_H */ + $ cat qapi-generated/example-qapi-emit-events.c + [Uninteresting stuff omitted...] + + const QEnumLookup example_QAPIEvent_lookup = { + .array = (const char *const[]) { + [EXAMPLE_QAPI_EVENT_MY_EVENT] = "MY_EVENT", + }, + .size = EXAMPLE_QAPI_EVENT__MAX + }; + + [Uninteresting stuff omitted...] + +For a modular QAPI schema (see section `Include directives`_), code for +each sub-module SUBDIR/SUBMODULE.json is actually generated into :: + + SUBDIR/$(prefix)qapi-events-SUBMODULE.h + SUBDIR/$(prefix)qapi-events-SUBMODULE.c + + +Code generated for introspection +-------------------------------- + +The following files are created: + + ``$(prefix)qapi-introspect.c`` + Defines a string holding a JSON description of the schema + + ``$(prefix)qapi-introspect.h`` + Declares the above string + +Example:: + + $ cat qapi-generated/example-qapi-introspect.h + [Uninteresting stuff omitted...] + + #ifndef EXAMPLE_QAPI_INTROSPECT_H + #define EXAMPLE_QAPI_INTROSPECT_H + + #include "qapi/qmp/qlit.h" + + extern const QLitObject example_qmp_schema_qlit; + + #endif /* EXAMPLE_QAPI_INTROSPECT_H */ + $ cat qapi-generated/example-qapi-introspect.c + [Uninteresting stuff omitted...] + + const QLitObject example_qmp_schema_qlit = QLIT_QLIST(((QLitObject[]) { + QLIT_QDICT(((QLitDictEntry[]) { + { "arg-type", QLIT_QSTR("0"), }, + { "meta-type", QLIT_QSTR("command"), }, + { "name", QLIT_QSTR("my-command"), }, + { "ret-type", QLIT_QSTR("1"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "arg-type", QLIT_QSTR("2"), }, + { "meta-type", QLIT_QSTR("event"), }, + { "name", QLIT_QSTR("MY_EVENT"), }, + {} + })), + /* "0" = q_obj_my-command-arg */ + QLIT_QDICT(((QLitDictEntry[]) { + { "members", QLIT_QLIST(((QLitObject[]) { + QLIT_QDICT(((QLitDictEntry[]) { + { "name", QLIT_QSTR("arg1"), }, + { "type", QLIT_QSTR("[1]"), }, + {} + })), + {} + })), }, + { "meta-type", QLIT_QSTR("object"), }, + { "name", QLIT_QSTR("0"), }, + {} + })), + /* "1" = UserDefOne */ + QLIT_QDICT(((QLitDictEntry[]) { + { "members", QLIT_QLIST(((QLitObject[]) { + QLIT_QDICT(((QLitDictEntry[]) { + { "name", QLIT_QSTR("integer"), }, + { "type", QLIT_QSTR("int"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "default", QLIT_QNULL, }, + { "name", QLIT_QSTR("string"), }, + { "type", QLIT_QSTR("str"), }, + {} + })), + {} + })), }, + { "meta-type", QLIT_QSTR("object"), }, + { "name", QLIT_QSTR("1"), }, + {} + })), + /* "2" = q_empty */ + QLIT_QDICT(((QLitDictEntry[]) { + { "members", QLIT_QLIST(((QLitObject[]) { + {} + })), }, + { "meta-type", QLIT_QSTR("object"), }, + { "name", QLIT_QSTR("2"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "element-type", QLIT_QSTR("1"), }, + { "meta-type", QLIT_QSTR("array"), }, + { "name", QLIT_QSTR("[1]"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "json-type", QLIT_QSTR("int"), }, + { "meta-type", QLIT_QSTR("builtin"), }, + { "name", QLIT_QSTR("int"), }, + {} + })), + QLIT_QDICT(((QLitDictEntry[]) { + { "json-type", QLIT_QSTR("string"), }, + { "meta-type", QLIT_QSTR("builtin"), }, + { "name", QLIT_QSTR("str"), }, + 
{}
+    })),
+    {}
+}));
+
+[Uninteresting stuff omitted...]
diff --git a/docs/devel/qgraph.rst b/docs/devel/qgraph.rst
new file mode 100644
index 000000000..43342d9d6
--- /dev/null
+++ b/docs/devel/qgraph.rst
@@ -0,0 +1,628 @@
+.. _qgraph:
+
+Qtest Driver Framework
+======================
+
+In order to test a specific driver, plain libqos tests need to
+take care of booting QEMU with the right machine and devices.
+This makes each test "hardcoded" for a specific configuration, reducing
+the possible coverage that it can reach.
+
+For example, the sdhci device is supported on both x86_64 and ARM boards,
+therefore a generic sdhci test should test all machines and drivers that
+support that device.
+Using only libqos APIs, the test has to manually cover all the
+setups and build the correct command line.
+
+This also introduces backward compatibility issues: if a device/driver command
+line name is changed, all tests that use it will no longer work
+properly and need to be adjusted.
+
+The aim of qgraph is to create a graph of drivers, machines and tests such that
+a test aimed at a certain driver does not have to care about booting the
+right QEMU machine, picking the right device, building the command line
+and so on. Instead, it only defines what type of device it is testing
+(an interface, in qgraph terms) and the framework takes care of
+covering all supported types of devices and machine architectures.
+
+Following the above example, an interface would be ``sdhci``,
+so the sdhci-test only has to care about linking its qgraph node with
+that interface. In this way, if the command line of an sdhci driver
+is changed, only the respective qgraph driver node has to be adjusted.
+
+QGraph concepts
+---------------
+
+The graph is composed of nodes that represent machines, drivers and tests,
+and edges that define the relationships between them (``CONSUMES``, ``PRODUCES``, and
+``CONTAINS``).
+
+Nodes
+~~~~~
+
+A node can be of four types:
+
+- **QNODE_MACHINE**:   for example ``arm/raspi2b``
+- **QNODE_DRIVER**:    for example ``generic-sdhci``
+- **QNODE_INTERFACE**: for example ``sdhci`` (interface for all ``-sdhci``
+  drivers).
+  An interface is not explicitly created; it is automatically
+  instantiated when a node consumes or produces it.
+  An interface is simply a struct that abstracts the various drivers
+  for the same type of device.  It offers an API to the nodes that
+  use it (the "consume" relation in qgraph terms), implemented by the
+  drivers that back it (the "produce" relation in qgraph terms).
+- **QNODE_TEST**:      for example ``sdhci-test``. A test consumes an interface
+  and tests the functions provided by it.
+
+Notes for the nodes:
+
+- QNODE_MACHINE:  each machine struct must have a ``QGuestAllocator`` and
+  implement ``get_driver()`` to return the allocator mapped to the interface
+  "memory".  The function can also return ``NULL`` if the allocator
+  is not set.
+- QNODE_DRIVER:   driver names must be unique, and machines and nodes
+  planned to be "consumed" by other nodes must match QEMU
+  driver names, otherwise they won't be discovered
+
+Edges
+~~~~~
+
+An edge relation between two nodes (drivers or machines) ``X`` and ``Y`` can be:
+
+- ``X CONSUMES Y``: ``Y`` can be plugged into ``X``
+- ``X PRODUCES Y``: ``X`` provides the interface ``Y``
+- ``X CONTAINS Y``: ``Y`` is a component of ``X``
+
+Execution steps
+~~~~~~~~~~~~~~~
+
+The basic framework steps are the following:
+
+- All nodes and edges are created in their respective
+  machine/driver/test files
+- The framework starts QEMU and asks for a list of available devices
+  and machines (note that only machines and "consumed" nodes are mapped
+  1:1 with QEMU devices)
+- The framework walks the graph starting from the available machines and
+  performs a Depth First Search for tests
+- Once a test is found, the path is walked again, all drivers are
+  allocated accordingly, and the final interface is passed to the test
+- The test is executed
+- Unused objects are cleaned up and the path discovery is continued
+
+Depending on the QEMU binary used, only some drivers/machines will be
+available, and only tests that are reached by them will be executed.
+
+Command line
+~~~~~~~~~~~~
+
+The command line is built from node names and optional arguments
+passed by the user when building the edges.
+
+There are three types of command line arguments:
+
+- ``in node``: created from the node name. For example, machines will
+  have ``-M <machine>`` in their command line, while devices
+  ``-device <device>``. This is done automatically by the framework.
+- ``after node``: added as an additional argument after the node name.
+  This argument is added optionally when creating edges, by setting the
+  parameters ``after_cmd_line`` and ``extra_edge_opts`` in
+  ``QOSGraphEdgeOptions``. The framework automatically adds a comma
+  before ``extra_edge_opts``, because it is going to add attributes
+  after the destination node pointed to by the edge containing these
+  options, and automatically adds a space before ``after_cmd_line``,
+  because it adds an additional device, not an attribute.
+- ``before node``: added as an additional argument before the node name.
+  This argument is added optionally when creating edges, by setting the
+  parameter ``before_cmd_line`` in ``QOSGraphEdgeOptions``. This
+  attribute is going to add attributes before the destination node
+  pointed to by the edge containing these options. It is helpful for
+  commands that are not node-representable, such as ``-fsdev`` or
+  ``-netdev``.
+
+While the command line arguments added by edges are always used, not all
+node names are used in every path walk: this is because the contained or
+produced ones are already added by QEMU, so only nodes that "consume" will
+be used to build the command line. Also, nodes that have ``{ "abstract" : true }``
+as a QMP attribute will lose their command line, since they are not proper
+devices to be added in QEMU.
+
+Example::
+
+    QOSGraphEdgeOptions opts = {
+        .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
+                           "file.read-zeroes=on,format=raw",
+        .after_cmd_line = "-device scsi-hd,bus=vs0.0,drive=drv0",
+        .extra_device_opts = "id=vs0",
+    };
+
+    qos_node_create_driver("virtio-scsi-device",
+                            virtio_scsi_device_create);
+    qos_node_consumes("virtio-scsi-device", "virtio-bus", &opts);
+
+This will produce the following command line:
+``-drive id=drv0,if=none,file=null-co://,file.read-zeroes=on,format=raw -device virtio-scsi-device,id=vs0 -device scsi-hd,bus=vs0.0,drive=drv0``
+
+Troubleshooting unavailable tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If there is no path from an available machine to a test then that test will be
+unavailable and won't execute. This can happen if a test or driver did not set
+up its qgraph node correctly. It can also happen if the necessary machine type
+or device is missing from the QEMU binary, for example because it was compiled
+out.
+
+It is possible to troubleshoot unavailable tests by running::
+
+  $ QTEST_QEMU_BINARY=build/qemu-system-x86_64 build/tests/qtest/qos-test --verbose
+  # ALL QGRAPH EDGES: {
+  #   src='virtio-net'
+  #      |-> dest='virtio-net-tests/vhost-user/multiqueue' type=2 (node=0x559142109e30)
+  #      |-> dest='virtio-net-tests/vhost-user/migrate' type=2 (node=0x559142109d00)
+  #   src='virtio-net-pci'
+  #      |-> dest='virtio-net' type=1 (node=0x55914210d740)
+  #   src='pci-bus'
+  #      |-> dest='virtio-net-pci' type=2 (node=0x55914210d880)
+  #   src='pci-bus-pc'
+  #      |-> dest='pci-bus' type=1 (node=0x559142103f40)
+  #   src='i440FX-pcihost'
+  #      |-> dest='pci-bus-pc' type=0 (node=0x55914210ac70)
+  #   src='x86_64/pc'
+  #      |-> dest='i440FX-pcihost' type=0 (node=0x5591421117f0)
+  #   src=''
+  #      |-> dest='x86_64/pc' type=0 (node=0x559142111600)
+  #      |-> dest='arm/raspi2b' type=0 (node=0x559142110740)
+  ...
+  # }
+  # ALL QGRAPH NODES: {
+  #   name='virtio-net-tests/announce-self' type=3 cmd_line='(null)' [available]
+  #   name='arm/raspi2b' type=0 cmd_line='-M raspi2b ' [UNAVAILABLE]
+  ...
+  # }
+
+The ``virtio-net-tests/announce-self`` test is listed as "available" in the
+"ALL QGRAPH NODES" output. This means the test will execute. We can follow the
+qgraph path in the "ALL QGRAPH EDGES" output as follows: '' -> 'x86_64/pc' ->
+'i440FX-pcihost' -> 'pci-bus-pc' -> 'pci-bus' -> 'virtio-net-pci' ->
+'virtio-net'. The root of the qgraph is '' and the depth first search begins
+there.
+
+The ``arm/raspi2b`` machine node is listed as "UNAVAILABLE". Although it is
+reachable from the root via '' -> 'arm/raspi2b', the node is unavailable because
+the QEMU binary did not list it when queried by the framework. This is expected
+because we used the ``qemu-system-x86_64`` binary, which does not support ARM
+machine types.
+
+If a test is unexpectedly listed as "UNAVAILABLE", first check that the "ALL
+QGRAPH EDGES" output reports edge connectivity from the root ('') to the test.
+If there is no connectivity then the qgraph nodes were not set up correctly and
+the driver or test code is incorrect. If there is connectivity, check the
+availability of each node in the path in the "ALL QGRAPH NODES" output. The
+first unavailable node in the path is the reason why the test is unavailable.
+Typically this is because the QEMU binary lacks support for the necessary
+machine type or device.
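+
+As a convenience, since qos-test is a standard GLib test binary, you
+should also be able to restrict a run to the subgraph rooted at one
+machine by passing a test path prefix with ``-p`` (this relies on
+GLib's generic test harness options, so treat the exact flag as an
+assumption rather than a qos-test guarantee)::
+
+  $ QTEST_QEMU_BINARY=build/qemu-system-x86_64 \
+        build/tests/qtest/qos-test -p /x86_64/pc
+
+Only tests whose qgraph path starts with ``/x86_64/pc`` are then
+executed.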
+
+Creating a new driver and its interface
+---------------------------------------
+
+Here we continue the ``sdhci`` use case, with the following scenario:
+
+- ``sdhci-test`` aims to test the ``read[q,w], writeq`` functions
+  offered by the ``sdhci`` drivers.
+- The current ``sdhci`` device is supported by both ``x86_64/pc`` and ``ARM``
+  machines (in this example we focus on ``arm/raspi2b``).
+- QEMU offers two types of drivers: ``QSDHCI_MemoryMapped`` for ``ARM`` and
+  ``QSDHCI_PCI`` for ``x86_64/pc``. Both implement the
+  ``read[q,w], writeq`` functions.
+
+In order to implement such a scenario in qgraph, the test developer needs to:
+
+- Create the ``x86_64/pc`` machine node. This machine uses the
+  ``pci-bus`` architecture so it ``contains`` a PCI driver,
+  ``pci-bus-pc``. The actual path is
+
+  ``x86_64/pc --contains--> i440FX-pcihost --contains-->
+  pci-bus-pc --produces--> pci-bus``.
+
+  For the sake of this example,
+  we do not focus on the PCI interface implementation.
+- Create the ``sdhci-pci`` driver node, representing ``QSDHCI_PCI``.
+  The driver uses the PCI bus (and its API),
+  so it must ``consume`` the ``pci-bus`` generic interface (which abstracts
+  all the available PCI drivers):
+
+  ``sdhci-pci --consumes--> pci-bus``
+- Create an ``arm/raspi2b`` machine node. This machine ``contains``
+  a ``generic-sdhci`` memory mapped ``sdhci`` driver node, representing
+  ``QSDHCI_MemoryMapped``.
+
+  ``arm/raspi2b --contains--> generic-sdhci``
+- Create the ``sdhci`` interface node. This interface offers the
+  functions that are shared by all ``sdhci`` devices.
+  The interface is produced by ``sdhci-pci`` and ``generic-sdhci``,
+  the available architecture-specific drivers.
+
+  ``sdhci-pci --produces--> sdhci``
+
+  ``generic-sdhci --produces--> sdhci``
+- Create the ``sdhci-test`` test node. The test ``consumes`` the
+  ``sdhci`` interface, using its API. It doesn't need to look at
+  the supported machines or drivers.
+
+  ``sdhci-test --consumes--> sdhci``
+
+``arm/raspi2b`` machine, simplified from
+``tests/qtest/libqos/arm-raspi2-machine.c``::
+
+    #include "qgraph.h"
+
+    struct QRaspi2Machine {
+        QOSGraphObject obj;
+        QGuestAllocator alloc;
+        QSDHCI_MemoryMapped sdhci;
+    };
+
+    static void *raspi2_get_driver(void *object, const char *interface)
+    {
+        QRaspi2Machine *machine = object;
+        if (!g_strcmp0(interface, "memory")) {
+            return &machine->alloc;
+        }
+
+        fprintf(stderr, "%s not present in arm/raspi2b\n", interface);
+        g_assert_not_reached();
+    }
+
+    static QOSGraphObject *raspi2_get_device(void *obj,
+                                             const char *device)
+    {
+        QRaspi2Machine *machine = obj;
+        if (!g_strcmp0(device, "generic-sdhci")) {
+            return &machine->sdhci.obj;
+        }
+
+        fprintf(stderr, "%s not present in arm/raspi2b\n", device);
+        g_assert_not_reached();
+    }
+
+    static void *qos_create_machine_arm_raspi2(QTestState *qts)
+    {
+        QRaspi2Machine *machine = g_new0(QRaspi2Machine, 1);
+
+        alloc_init(&machine->alloc, ...);
+
+        /* Get node(s) contained inside (CONTAINS) */
+        machine->obj.get_device = raspi2_get_device;
+
+        /* Get node(s) produced (PRODUCES) */
+        machine->obj.get_driver = raspi2_get_driver;
+
+        /* free the object */
+        machine->obj.destructor = raspi2_destructor;
+        qos_init_sdhci_mm(&machine->sdhci, ...);
+        return &machine->obj;
+    }
+
+    static void raspi2_register_nodes(void)
+    {
+        /* arm/raspi2b --contains--> generic-sdhci */
+        qos_node_create_machine("arm/raspi2b",
+                                 qos_create_machine_arm_raspi2);
+        qos_node_contains("arm/raspi2b", "generic-sdhci", NULL);
+    }
+
+    libqos_init(raspi2_register_nodes);
+
+``x86_64/pc`` machine, simplified from
+``tests/qtest/libqos/x86_64_pc-machine.c``::
+
+    #include "qgraph.h"
+
+    struct i440FX_pcihost {
+        QOSGraphObject obj;
+        QPCIBusPC pci;
+    };
+
+    struct QX86PCMachine {
+        QOSGraphObject obj;
+        QGuestAllocator alloc;
+        i440FX_pcihost bridge;
+    };
+
+    /* i440FX_pcihost */
+
+    static QOSGraphObject *i440FX_host_get_device(void *obj,
+                                                  const char *device)
+    {
+        i440FX_pcihost *host = obj;
+        if (!g_strcmp0(device, "pci-bus-pc")) {
+            return &host->pci.obj;
+        }
+        fprintf(stderr, "%s not present in i440FX-pcihost\n", device);
+        g_assert_not_reached();
+    }
+
+    /* x86_64/pc machine */
+
+    static void *pc_get_driver(void *object, const char *interface)
+    {
+        QX86PCMachine *machine = object;
+        if (!g_strcmp0(interface, "memory")) {
+            return &machine->alloc;
+        }
+
+        fprintf(stderr, "%s not present in x86_64/pc\n", interface);
+        g_assert_not_reached();
+    }
+
+    static QOSGraphObject *pc_get_device(void *obj, const char *device)
+    {
+        QX86PCMachine *machine = obj;
+        if (!g_strcmp0(device, "i440FX-pcihost")) {
+            return &machine->bridge.obj;
+        }
+
+        fprintf(stderr, "%s not present in x86_64/pc\n", device);
+        g_assert_not_reached();
+    }
+
+    static void *qos_create_machine_pc(QTestState *qts)
+    {
+        QX86PCMachine *machine = g_new0(QX86PCMachine, 1);
+
+        /* Get node(s) contained inside (CONTAINS) */
+        machine->obj.get_device = pc_get_device;
+
+        /* Get node(s) produced (PRODUCES) */
+        machine->obj.get_driver = pc_get_driver;
+
+        /* free the object */
+        machine->obj.destructor = pc_destructor;
+        pc_alloc_init(&machine->alloc, qts, ALLOC_NO_FLAGS);
+
+        /* Get node(s) contained inside (CONTAINS) */
+        machine->bridge.obj.get_device = i440FX_host_get_device;
+
+        return &machine->obj;
+    }
+
+    static void pc_machine_register_nodes(void)
+    {
+        /* x86_64/pc --contains--> i440FX-pcihost --contains-->
+         * pci-bus-pc [--produces--> pci-bus (in pci.h)] */
+        qos_node_create_machine("x86_64/pc", qos_create_machine_pc);
+        qos_node_contains("x86_64/pc", "i440FX-pcihost", NULL);
+
+        /* contained drivers don't need a constructor,
+         * they will be initialized by the parent */
+        qos_node_create_driver("i440FX-pcihost", NULL);
+        qos_node_contains("i440FX-pcihost", "pci-bus-pc", NULL);
+    }
+
+    libqos_init(pc_machine_register_nodes);
+
+``sdhci`` taken from ``tests/qtest/libqos/sdhci.c``::
+
+    /* Interface node, offers the sdhci API */
+    struct QSDHCI {
+        uint16_t (*readw)(QSDHCI *s, uint32_t reg);
+        uint64_t (*readq)(QSDHCI *s, uint32_t reg);
+        void (*writeq)(QSDHCI *s, uint32_t reg, uint64_t val);
+        /* other fields */
+    };
+
+    /* Memory Mapped implementation of QSDHCI */
+    struct QSDHCI_MemoryMapped {
+        QOSGraphObject obj;
+        QSDHCI sdhci;
+        /* other driver-specific fields */
+    };
+
+    /* PCI implementation of QSDHCI */
+    struct QSDHCI_PCI {
+        QOSGraphObject obj;
+        QSDHCI sdhci;
+        /* other driver-specific fields */
+    };
+
+    /* Memory mapped implementation of QSDHCI */
+
+    static void *sdhci_mm_get_driver(void *obj, const char *interface)
+    {
+        QSDHCI_MemoryMapped *smm = obj;
+        if (!g_strcmp0(interface, "sdhci")) {
+            return &smm->sdhci;
+        }
+        fprintf(stderr, "%s not present in generic-sdhci\n", interface);
+        g_assert_not_reached();
+    }
+
+    void qos_init_sdhci_mm(QSDHCI_MemoryMapped *sdhci, QTestState *qts,
+                           uint32_t addr, QSDHCIProperties *common)
+    {
+        /* Get node(s) produced (PRODUCES) */
+        sdhci->obj.get_driver = sdhci_mm_get_driver;
+
+        /* SDHCI interface API */
+        sdhci->sdhci.readw = sdhci_mm_readw;
+        sdhci->sdhci.readq = sdhci_mm_readq;
+        sdhci->sdhci.writeq = sdhci_mm_writeq;
+        sdhci->qts = qts;
+    }
+
+    /* PCI implementation of QSDHCI */
+
+    static void *sdhci_pci_get_driver(void *object,
+                                      const char *interface)
+    {
+        QSDHCI_PCI *spci = object;
+        if (!g_strcmp0(interface, "sdhci")) {
+            return &spci->sdhci;
+        }
+
+        fprintf(stderr, "%s not present in sdhci-pci\n", interface);
+        g_assert_not_reached();
+    }
+
+    static void *sdhci_pci_create(void *pci_bus,
+                                  QGuestAllocator *alloc,
+                                  void *addr)
+    {
+        QSDHCI_PCI *spci = g_new0(QSDHCI_PCI, 1);
+        QPCIBus *bus = pci_bus;
+        uint64_t barsize;
+
+        qpci_device_init(&spci->dev, bus, addr);
+
+        /* SDHCI interface API */
+        spci->sdhci.readw = sdhci_pci_readw;
+        spci->sdhci.readq = sdhci_pci_readq;
+        spci->sdhci.writeq = sdhci_pci_writeq;
+
+        /* Get node(s) produced (PRODUCES) */
+        spci->obj.get_driver = sdhci_pci_get_driver;
+
+        spci->obj.start_hw = sdhci_pci_start_hw;
+        spci->obj.destructor = sdhci_destructor;
+        return &spci->obj;
+    }
+
+    static void qsdhci_register_nodes(void)
+    {
+        QOSGraphEdgeOptions opts = {
+            .extra_device_opts = "addr=04.0",
+        };
+
+        /* generic-sdhci */
+        /* generic-sdhci --produces--> sdhci */
+        qos_node_create_driver("generic-sdhci", NULL);
+        qos_node_produces("generic-sdhci", "sdhci");
+
+        /* sdhci-pci */
+        /* sdhci-pci --produces--> sdhci
+         * sdhci-pci --consumes--> pci-bus */
+        qos_node_create_driver("sdhci-pci", sdhci_pci_create);
+        qos_node_produces("sdhci-pci", "sdhci");
+        qos_node_consumes("sdhci-pci", "pci-bus", &opts);
+    }
+
+    libqos_init(qsdhci_register_nodes);
+
+In the above example, all possible types of relations are created::
+
+   x86_64/pc --contains--> i440FX-pcihost --contains--> pci-bus-pc
+                                                             |
+             sdhci-pci --consumes--> pci-bus <--produces-----+
+                 |
+                 +--produces--+
+                              |
+                              v
+                            sdhci
+                              ^
+                              |
+                              +--produces--+
+                                           |
+           arm/raspi2b --contains--> generic-sdhci
+
+or inverting the consumes edge in consumed_by::
+
+   x86_64/pc --contains--> i440FX-pcihost --contains--> pci-bus-pc
+                                                             |
+             sdhci-pci <--consumed by-- pci-bus <--produces--+
+                 |
+                 +--produces--+
+                              |
+                              v
+                            sdhci
+                              ^
+                              |
+                              +--produces--+
+                                           |
+           arm/raspi2b --contains--> generic-sdhci
+
+Adding a new test
+-----------------
+
+Given the above setup, adding a new test is very simple.
+``sdhci-test``, taken from ``tests/qtest/sdhci-test.c``::
+
+    static void check_capab_sdma(QSDHCI *s, bool supported)
+    {
+        uint64_t capab, capab_sdma;
+
+        capab = s->readq(s, SDHC_CAPAB);
+        capab_sdma = FIELD_EX64(capab, SDHC_CAPAB, SDMA);
+        g_assert_cmpuint(capab_sdma, ==, supported);
+    }
+
+    static void test_registers(void *obj, void *data,
+                               QGuestAllocator *alloc)
+    {
+        QSDHCI *s = obj;
+
+        /* example test */
+        check_capab_sdma(s, s->props.capab.sdma);
+    }
+
+    static void register_sdhci_test(void)
+    {
+        /* sdhci-test --consumes--> sdhci */
+        qos_add_test("registers", "sdhci", test_registers, NULL);
+    }
+
+    libqos_init(register_sdhci_test);
+
+Here a new test is created, consuming the ``sdhci`` interface node
+and creating a valid path from both machines to a test.
+The final graph will look like this::
+
+   x86_64/pc --contains--> i440FX-pcihost --contains--> pci-bus-pc
+                                                             |
+             sdhci-pci --consumes--> pci-bus <--produces-----+
+                 |
+                 +--produces--+
+                              |
+                              v
+                            sdhci <--consumes-- sdhci-test
+                              ^
+                              |
+                              +--produces--+
+                                           |
+           arm/raspi2b --contains--> generic-sdhci
+
+or inverting the consumes edge in consumed_by::
+
+   x86_64/pc --contains--> i440FX-pcihost --contains--> pci-bus-pc
+                                                             |
+             sdhci-pci <--consumed by-- pci-bus <--produces--+
+                 |
+                 +--produces--+
+                              |
+                              v
+                            sdhci --consumed by--> sdhci-test
+                              ^
+                              |
+                              +--produces--+
+                                           |
+           arm/raspi2b --contains--> generic-sdhci
+
+Assuming the binary is
+``QTEST_QEMU_BINARY=./qemu-system-x86_64``,
+a valid test path will be:
+``/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/sdhci-pci/sdhci/sdhci-test``
+
+and for the binary ``QTEST_QEMU_BINARY=./qemu-system-arm``:
+
+``/arm/raspi2b/generic-sdhci/sdhci/sdhci-test``
+
+Additional examples are also in ``test-qgraph.c``.
+
+Qgraph API reference
+--------------------
+
+.. kernel-doc:: tests/qtest/libqos/qgraph.h
diff --git a/docs/devel/qom.rst b/docs/devel/qom.rst
new file mode 100644
index 000000000..e5fe3597c
--- /dev/null
+++ b/docs/devel/qom.rst
@@ -0,0 +1,389 @@
+===========================
+The QEMU Object Model (QOM)
+===========================
+
+.. highlight:: c
+
+The QEMU Object Model provides a framework for registering user-creatable
+types and instantiating objects from those types.  QOM provides the following
+features:
+
+- System for dynamically registering types
+- Support for single-inheritance of types
+- Multiple inheritance of stateless interfaces
+
+.. code-block:: c
+   :caption: Creating a minimal type
+
+   #include "qdev.h"
+
+   #define TYPE_MY_DEVICE "my-device"
+
+   // No new virtual functions: we can reuse the typedef for the
+   // superclass.
+   typedef DeviceClass MyDeviceClass;
+   typedef struct MyDevice
+   {
+       DeviceState parent;
+
+       int reg0, reg1, reg2;
+   } MyDevice;
+
+   static const TypeInfo my_device_info = {
+       .name = TYPE_MY_DEVICE,
+       .parent = TYPE_DEVICE,
+       .instance_size = sizeof(MyDevice),
+   };
+
+   static void my_device_register_types(void)
+   {
+       type_register_static(&my_device_info);
+   }
+
+   type_init(my_device_register_types)
+
+In the above example, we create a simple type that is described by #TypeInfo.
+#TypeInfo describes information about the type including what it inherits
+from, the instance and class size, and constructor/destructor hooks.
+
+Alternatively, several static types can be registered at once using the
+helper macro DEFINE_TYPES():
+
+.. code-block:: c
+
+   static const TypeInfo device_types_info[] = {
+       {
+           .name = TYPE_MY_DEVICE_A,
+           .parent = TYPE_DEVICE,
+           .instance_size = sizeof(MyDeviceA),
+       },
+       {
+           .name = TYPE_MY_DEVICE_B,
+           .parent = TYPE_DEVICE,
+           .instance_size = sizeof(MyDeviceB),
+       },
+   };
+
+   DEFINE_TYPES(device_types_info)
+
+Every type has an #ObjectClass associated with it. #ObjectClass derivatives
+are instantiated dynamically, but there is only ever one instance for any
+given type. The #ObjectClass typically holds a table of function pointers
+for the virtual methods implemented by this type.
+
+Using object_new(), a new #Object derivative will be instantiated. You can
+cast an #Object to a subclass (or base-class) type using
+object_dynamic_cast(). You typically want to define macro wrappers around
+OBJECT_CHECK() and OBJECT_CLASS_CHECK() to make it easier to convert to a
+specific type:
+
+.. code-block:: c
+   :caption: Typecasting macros
+
+   #define MY_DEVICE_GET_CLASS(obj) \
+      OBJECT_GET_CLASS(MyDeviceClass, obj, TYPE_MY_DEVICE)
+   #define MY_DEVICE_CLASS(klass) \
+      OBJECT_CLASS_CHECK(MyDeviceClass, klass, TYPE_MY_DEVICE)
+   #define MY_DEVICE(obj) \
+      OBJECT_CHECK(MyDevice, obj, TYPE_MY_DEVICE)
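+As a usage sketch (hypothetical code, reusing the ``my-device`` type from the
+first example), these wrappers keep downcasts both checked and readable:
+
+.. code-block:: c
+
+   Object *obj = object_new(TYPE_MY_DEVICE);
+   MyDevice *dev = MY_DEVICE(obj);   /* run-time checked cast */
+
+   dev->reg0 = 0xff;
+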
+In case the ObjectClass implementation can be built as a module, a
+module_obj() line must be added to make sure QEMU loads the module
+when the object is needed.
+
+.. code-block:: c
+
+   module_obj(TYPE_MY_DEVICE);
+
+Class Initialization
+====================
+
+Before an object is initialized, the class for the object must be
+initialized. There is only one class object for all instance objects,
+and it is created lazily.
+
+Classes are initialized by first initializing any parent classes (if
+necessary). After the parent class object has been initialized, it will be
+copied into the current class object and any additional storage in the
+class object is zero filled.
+
+The effect of this is that classes automatically inherit any virtual
+function pointers that the parent class has already initialized. All
+other fields will be zero filled.
+
+Once all of the parent classes have been initialized, #TypeInfo::class_init
+is called to let the class being instantiated provide a default
+initialization for its virtual functions. Here is how the above example
+might be modified to introduce an overridden virtual function:
+
+.. code-block:: c
+   :caption: Overriding a virtual function
+
+   #include "qdev.h"
+
+   void my_device_class_init(ObjectClass *klass, void *class_data)
+   {
+       DeviceClass *dc = DEVICE_CLASS(klass);
+       dc->reset = my_device_reset;
+   }
+
+   static const TypeInfo my_device_info = {
+       .name = TYPE_MY_DEVICE,
+       .parent = TYPE_DEVICE,
+       .instance_size = sizeof(MyDevice),
+       .class_init = my_device_class_init,
+   };
+
+Introducing new virtual methods requires a class to define its own
+struct and to add a .class_size member to the #TypeInfo. Each method
+will also have a wrapper function to call it easily:
+
+.. code-block:: c
+   :caption: Defining an abstract class
+
+   #include "qdev.h"
+
+   typedef struct MyDeviceClass
+   {
+       DeviceClass parent;
+
+       void (*frobnicate) (MyDevice *obj);
+   } MyDeviceClass;
+
+   static const TypeInfo my_device_info = {
+       .name = TYPE_MY_DEVICE,
+       .parent = TYPE_DEVICE,
+       .instance_size = sizeof(MyDevice),
+       .abstract = true, // or set a default in my_device_class_init
+       .class_size = sizeof(MyDeviceClass),
+   };
+
+   void my_device_frobnicate(MyDevice *obj)
+   {
+       MyDeviceClass *klass = MY_DEVICE_GET_CLASS(obj);
+
+       klass->frobnicate(obj);
+   }
+
+Interfaces
+==========
+
+Interfaces allow a limited form of multiple inheritance. Instances are
+similar to normal types except for the fact that they are only defined by
+their classes and never carry any state. As a consequence, a pointer to
+an interface instance should always be of incomplete type in order to be
+sure it cannot be dereferenced. That is, you should define the
+'typedef struct SomethingIf SomethingIf' so that you can pass around
+``SomethingIf *si`` arguments, but not define a ``struct SomethingIf { ... }``.
+The only things you can validly do with a ``SomethingIf *`` are to pass it as
+an argument to a method on its corresponding SomethingIfClass, or to
+dynamically cast it to an object that implements the interface.
+
+Methods
+=======
+
+A *method* is a function within the namespace scope of
+a class. It usually operates on the object instance by passing it as a
+strongly-typed first argument.
+If it does not operate on an object instance, it is dubbed a
+*class method*.
+
+Methods cannot be overloaded. That is, the #ObjectClass and method name
+uniquely identify the function to be called; the signature does not vary
+except for trailing varargs.
+
+Methods are always *virtual*. Overriding a method in
+#TypeInfo.class_init of a subclass leads to any user of the class obtained
+via OBJECT_GET_CLASS() accessing the overridden function.
+The original function is not automatically invoked. It is the responsibility
+of the overriding class to determine whether and when to invoke the method
+being overridden.
+
+To invoke the method being overridden, the preferred solution is to store
+the original value in the overriding class before overriding the method.
+This corresponds to ``{super,base}.method(...)`` in Java and C#
+respectively; this frees the overriding class from hardcoding its parent
+class, which someone might choose to change at some point.
+
+.. 
code-block:: c + :caption: Overriding a virtual method + + typedef struct MyState MyState; + + typedef void (*MyDoSomething)(MyState *obj); + + typedef struct MyClass { + ObjectClass parent_class; + + MyDoSomething do_something; + } MyClass; + + static void my_do_something(MyState *obj) + { + // do something + } + + static void my_class_init(ObjectClass *oc, void *data) + { + MyClass *mc = MY_CLASS(oc); + + mc->do_something = my_do_something; + } + + static const TypeInfo my_type_info = { + .name = TYPE_MY, + .parent = TYPE_OBJECT, + .instance_size = sizeof(MyState), + .class_size = sizeof(MyClass), + .class_init = my_class_init, + }; + + typedef struct DerivedClass { + MyClass parent_class; + + MyDoSomething parent_do_something; + } DerivedClass; + + static void derived_do_something(MyState *obj) + { + DerivedClass *dc = DERIVED_GET_CLASS(obj); + + // do something here + dc->parent_do_something(obj); + // do something else here + } + + static void derived_class_init(ObjectClass *oc, void *data) + { + MyClass *mc = MY_CLASS(oc); + DerivedClass *dc = DERIVED_CLASS(oc); + + dc->parent_do_something = mc->do_something; + mc->do_something = derived_do_something; + } + + static const TypeInfo derived_type_info = { + .name = TYPE_DERIVED, + .parent = TYPE_MY, + .class_size = sizeof(DerivedClass), + .class_init = derived_class_init, + }; + +Alternatively, object_class_by_name() can be used to obtain the class and +its non-overridden methods for a specific type. This would correspond to +``MyClass::method(...)`` in C++. + +The first example of such a QOM method was #CPUClass.reset, +another example is #DeviceClass.realize. + +Standard type declaration and definition macros +=============================================== + +A lot of the code outlined above follows a standard pattern and naming +convention. To reduce the amount of boilerplate code that needs to be +written for a new type there are two sets of macros to generate the +common parts in a standard format. + +A type is declared using the OBJECT_DECLARE macro family. In types +which do not require any virtual functions in the class, the +OBJECT_DECLARE_SIMPLE_TYPE macro is suitable, and is commonly placed +in the header file: + +.. code-block:: c + :caption: Declaring a simple type + + OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, my_device, + MY_DEVICE, DEVICE) + +This is equivalent to the following: + +.. code-block:: c + :caption: Expansion from declaring a simple type + + typedef struct MyDevice MyDevice; + typedef struct MyDeviceClass MyDeviceClass; + + G_DEFINE_AUTOPTR_CLEANUP_FUNC(MyDeviceClass, object_unref) + + #define MY_DEVICE_GET_CLASS(void *obj) \ + OBJECT_GET_CLASS(MyDeviceClass, obj, TYPE_MY_DEVICE) + #define MY_DEVICE_CLASS(void *klass) \ + OBJECT_CLASS_CHECK(MyDeviceClass, klass, TYPE_MY_DEVICE) + #define MY_DEVICE(void *obj) + OBJECT_CHECK(MyDevice, obj, TYPE_MY_DEVICE) + + struct MyDeviceClass { + DeviceClass parent_class; + }; + +The 'struct MyDevice' needs to be declared separately. +If the type requires virtual functions to be declared in the class +struct, then the alternative OBJECT_DECLARE_TYPE() macro can be +used. This does the same as OBJECT_DECLARE_SIMPLE_TYPE(), but without +the 'struct MyDeviceClass' definition. + +To implement the type, the OBJECT_DEFINE macro family is available. +In the simple case the OBJECT_DEFINE_TYPE macro is suitable: + +.. code-block:: c + :caption: Defining a simple type + + OBJECT_DEFINE_TYPE(MyDevice, my_device, MY_DEVICE, DEVICE) + +This is equivalent to the following: + +.. 
code-block:: c
+   :caption: Expansion from defining a simple type
+
+   static void my_device_finalize(Object *obj);
+   static void my_device_class_init(ObjectClass *oc, void *data);
+   static void my_device_init(Object *obj);
+
+   static const TypeInfo my_device_info = {
+       .parent = TYPE_DEVICE,
+       .name = TYPE_MY_DEVICE,
+       .instance_size = sizeof(MyDevice),
+       .instance_init = my_device_init,
+       .instance_finalize = my_device_finalize,
+       .class_size = sizeof(MyDeviceClass),
+       .class_init = my_device_class_init,
+   };
+
+   static void
+   my_device_register_types(void)
+   {
+       type_register_static(&my_device_info);
+   }
+   type_init(my_device_register_types);
+
+This is sufficient to get the type registered with the type
+system, and the three standard methods now need to be implemented
+along with any other logic required for the type.
+
+If the type needs to implement one or more interfaces, then the
+OBJECT_DEFINE_TYPE_WITH_INTERFACES() macro can be used instead.
+This accepts an array of interface type names.
+
+.. code-block:: c
+   :caption: Defining a simple type implementing interfaces
+
+   OBJECT_DEFINE_TYPE_WITH_INTERFACES(MyDevice, my_device,
+                                      MY_DEVICE, DEVICE,
+                                      { TYPE_USER_CREATABLE },
+                                      { NULL })
+
+If the type is not intended to be instantiated, then the
+OBJECT_DEFINE_ABSTRACT_TYPE() macro can be used instead:
+
+.. code-block:: c
+   :caption: Defining a simple abstract type
+
+   OBJECT_DEFINE_ABSTRACT_TYPE(MyDevice, my_device,
+                               MY_DEVICE, DEVICE)
+
+
+
+API Reference
+-------------
+
+.. kernel-doc:: include/qom/object.h
diff --git a/docs/devel/qtest.rst b/docs/devel/qtest.rst
new file mode 100644
index 000000000..c3dceb6c8
--- /dev/null
+++ b/docs/devel/qtest.rst
@@ -0,0 +1,92 @@
+========================================
+QTest Device Emulation Testing Framework
+========================================
+
+.. toctree::
+   :hidden:
+
+   qgraph
+
+QTest is a device emulation testing framework. It can be very useful to test
+device models; it can also control certain aspects of QEMU (such as virtual
+clock stepping) via a special-purpose "qtest" protocol. Refer to
+:ref:`qtest-protocol` for more details of the protocol.
+
+QTest cases can be executed with
+
+.. code::
+
+   make check-qtest
+
+The QTest library is implemented by ``tests/qtest/libqtest.c`` and the API is
+defined in ``tests/qtest/libqtest.h``.
+
+Consider adding a new QTest case when you are introducing new virtual
+hardware, or extending an existing test if you are adding functionality to
+an existing virtual device.
+
+On top of libqtest, a higher level library, ``libqos``, was created to
+encapsulate common tasks of device drivers, such as memory management and
+communicating with system buses or devices. Many virtual device tests use
+libqos instead of directly calling into libqtest.
+Libqos also offers the Qgraph API to increase test coverage and to
+automate QEMU command line arguments and device setup.
+Refer to :ref:`qgraph` for the Qgraph explanation and API.
+
+Steps to add a new QTest case are:
+
+1. Create a new source file for the test. (More than one file can be added as
+   necessary.) For example, ``tests/qtest/foo-test.c``.
+
+2. Write the test code with the glib and libqtest/libqos API. See also existing
+   tests and the library headers for reference.
+
+3. Register the new test in ``tests/qtest/meson.build``. Add the test
+   executable name to an appropriate ``qtests_*`` variable. There is
+   one variable per architecture, plus ``qtests_generic`` for tests
+   that can be run for all architectures. 
For example:: + + qtests_generic = [ + ... + 'foo-test', + ... + ] + +4. If the test has more than one source file or needs to be linked with any + dependency other than ``qemuutil`` and ``qos``, list them in the ``qtests`` + dictionary. For example a test that needs to use the ``QIO`` library + will have an entry like:: + + { + ... + 'foo-test': [io], + ... + } + +Debugging a QTest failure is slightly harder than the unit test because the +tests look up QEMU program names in the environment variables, such as +``QTEST_QEMU_BINARY`` and ``QTEST_QEMU_IMG``, and also because it is not easy +to attach gdb to the QEMU process spawned from the test. But manual invoking +and using gdb on the test is still simple to do: find out the actual command +from the output of + +.. code:: + + make check-qtest V=1 + +which you can run manually. + + +.. _qtest-protocol: + +QTest Protocol +-------------- + +.. kernel-doc:: softmmu/qtest.c + :doc: QTest Protocol + + +libqtest API reference +---------------------- + +.. kernel-doc:: tests/qtest/libqos/libqtest.h diff --git a/docs/devel/rcu.txt b/docs/devel/rcu.txt new file mode 100644 index 000000000..2e6cc607a --- /dev/null +++ b/docs/devel/rcu.txt @@ -0,0 +1,406 @@ +Using RCU (Read-Copy-Update) for synchronization +================================================ + +Read-copy update (RCU) is a synchronization mechanism that is used to +protect read-mostly data structures. RCU is very efficient and scalable +on the read side (it is wait-free), and thus can make the read paths +extremely fast. + +RCU supports concurrency between a single writer and multiple readers, +thus it is not used alone. Typically, the write-side will use a lock to +serialize multiple updates, but other approaches are possible (e.g., +restricting updates to a single task). In QEMU, when a lock is used, +this will often be the "iothread mutex", also known as the "big QEMU +lock" (BQL). Also, restricting updates to a single task is done in +QEMU using the "bottom half" API. + +RCU is fundamentally a "wait-to-finish" mechanism. The read side marks +sections of code with "critical sections", and the update side will wait +for the execution of all *currently running* critical sections before +proceeding, or before asynchronously executing a callback. + +The key point here is that only the currently running critical sections +are waited for; critical sections that are started _after_ the beginning +of the wait do not extend the wait, despite running concurrently with +the updater. This is the reason why RCU is more scalable than, +for example, reader-writer locks. It is so much more scalable that +the system will have a single instance of the RCU mechanism; a single +mechanism can be used for an arbitrary number of "things", without +having to worry about things such as contention or deadlocks. + +How is this possible? The basic idea is to split updates in two phases, +"removal" and "reclamation". During removal, we ensure that subsequent +readers will not be able to get a reference to the old data. After +removal has completed, a critical section will not be able to access +the old data. Therefore, critical sections that begin after removal +do not matter; as soon as all previous critical sections have finished, +there cannot be any readers who hold references to the data structure, +and these can now be safely reclaimed (e.g., freed or unref'ed). + +Here is a picture: + + thread 1 thread 2 thread 3 + ------------------- ------------------------ ------------------- + enter RCU crit.sec. 
+ | finish removal phase + | begin wait + | | enter RCU crit.sec. + exit RCU crit.sec | | + complete wait | + begin reclamation phase | + exit RCU crit.sec. + + +Note how thread 3 is still executing its critical section when thread 2 +starts reclaiming data. This is possible, because the old version of the +data structure was not accessible at the time thread 3 began executing +that critical section. + + +RCU API +======= + +The core RCU API is small: + + void rcu_read_lock(void); + + Used by a reader to inform the reclaimer that the reader is + entering an RCU read-side critical section. + + void rcu_read_unlock(void); + + Used by a reader to inform the reclaimer that the reader is + exiting an RCU read-side critical section. Note that RCU + read-side critical sections may be nested and/or overlapping. + + void synchronize_rcu(void); + + Blocks until all pre-existing RCU read-side critical sections + on all threads have completed. This marks the end of the removal + phase and the beginning of reclamation phase. + + Note that it would be valid for another update to come while + synchronize_rcu is running. Because of this, it is better that + the updater releases any locks it may hold before calling + synchronize_rcu. If this is not possible (for example, because + the updater is protected by the BQL), you can use call_rcu. + + void call_rcu1(struct rcu_head * head, + void (*func)(struct rcu_head *head)); + + This function invokes func(head) after all pre-existing RCU + read-side critical sections on all threads have completed. This + marks the end of the removal phase, with func taking care + asynchronously of the reclamation phase. + + The foo struct needs to have an rcu_head structure added, + perhaps as follows: + + struct foo { + struct rcu_head rcu; + int a; + char b; + long c; + }; + + so that the reclaimer function can fetch the struct foo address + and free it: + + call_rcu1(&foo.rcu, foo_reclaim); + + void foo_reclaim(struct rcu_head *rp) + { + struct foo *fp = container_of(rp, struct foo, rcu); + g_free(fp); + } + + For the common case where the rcu_head member is the first of the + struct, you can use the following macro. + + void call_rcu(T *p, + void (*func)(T *p), + field-name); + void g_free_rcu(T *p, + field-name); + + call_rcu1 is typically used through these macro, in the common case + where the "struct rcu_head" is the first field in the struct. If + the callback function is g_free, in particular, g_free_rcu can be + used. In the above case, one could have written simply: + + g_free_rcu(&foo, rcu); + + typeof(*p) qatomic_rcu_read(p); + + qatomic_rcu_read() is similar to qatomic_load_acquire(), but it makes + some assumptions on the code that calls it. This allows a more + optimized implementation. + + qatomic_rcu_read assumes that whenever a single RCU critical + section reads multiple shared data, these reads are either + data-dependent or need no ordering. This is almost always the + case when using RCU, because read-side critical sections typically + navigate one or more pointers (the pointers that are changed on + every update) until reaching a data structure of interest, + and then read from there. + + RCU read-side critical sections must use qatomic_rcu_read() to + read data, unless concurrent writes are prevented by another + synchronization mechanism. + + Furthermore, RCU read-side critical sections should traverse the + data structure in a single direction, opposite to the direction + in which the updater initializes it. 
+ + void qatomic_rcu_set(p, typeof(*p) v); + + qatomic_rcu_set() is similar to qatomic_store_release(), though it also + makes assumptions on the code that calls it in order to allow a more + optimized implementation. + + In particular, qatomic_rcu_set() suffices for synchronization + with readers, if the updater never mutates a field within a + data item that is already accessible to readers. This is the + case when initializing a new copy of the RCU-protected data + structure; just ensure that initialization of *p is carried out + before qatomic_rcu_set() makes the data item visible to readers. + If this rule is observed, writes will happen in the opposite + order as reads in the RCU read-side critical sections (or if + there is just one update), and there will be no need for other + synchronization mechanism to coordinate the accesses. + +The following APIs must be used before RCU is used in a thread: + + void rcu_register_thread(void); + + Mark a thread as taking part in the RCU mechanism. Such a thread + will have to report quiescent points regularly, either manually + or through the QemuCond/QemuSemaphore/QemuEvent APIs. + + void rcu_unregister_thread(void); + + Mark a thread as not taking part anymore in the RCU mechanism. + It is not a problem if such a thread reports quiescent points, + either manually or by using the QemuCond/QemuSemaphore/QemuEvent + APIs. + +Note that these APIs are relatively heavyweight, and should _not_ be +nested. + +Convenience macros +================== + +Two macros are provided that automatically release the read lock at the +end of the scope. + + RCU_READ_LOCK_GUARD() + + Takes the lock and will release it at the end of the block it's + used in. + + WITH_RCU_READ_LOCK_GUARD() { code } + + Is used at the head of a block to protect the code within the block. + +Note that 'goto'ing out of the guarded block will also drop the lock. + +DIFFERENCES WITH LINUX +====================== + +- Waiting on a mutex is possible, though discouraged, within an RCU critical + section. This is because spinlocks are rarely (if ever) used in userspace + programming; not allowing this would prevent upgrading an RCU read-side + critical section to become an updater. + +- qatomic_rcu_read and qatomic_rcu_set replace rcu_dereference and + rcu_assign_pointer. They take a _pointer_ to the variable being accessed. + +- call_rcu is a macro that has an extra argument (the name of the first + field in the struct, which must be a struct rcu_head), and expects the + type of the callback's argument to be the type of the first argument. + call_rcu1 is the same as Linux's call_rcu. + + +RCU PATTERNS +============ + +Many patterns using read-writer locks translate directly to RCU, with +the advantages of higher scalability and deadlock immunity. + +In general, RCU can be used whenever it is possible to create a new +"version" of a data structure every time the updater runs. This may +sound like a very strict restriction, however: + +- the updater does not mean "everything that writes to a data structure", + but rather "everything that involves a reclamation step". See the + array example below + +- in some cases, creating a new version of a data structure may actually + be very cheap. For example, modifying the "next" pointer of a singly + linked list is effectively creating a new version of the list. + +Here are some frequently-used RCU idioms that are worth noting. 
+ + +RCU list processing +------------------- + +TBD (not yet used in QEMU) + + +RCU reference counting +---------------------- + +Because grace periods are not allowed to complete while there is an RCU +read-side critical section in progress, the RCU read-side primitives +may be used as a restricted reference-counting mechanism. For example, +consider the following code fragment: + + rcu_read_lock(); + p = qatomic_rcu_read(&foo); + /* do something with p. */ + rcu_read_unlock(); + +The RCU read-side critical section ensures that the value of "p" remains +valid until after the rcu_read_unlock(). In some sense, it is acquiring +a reference to p that is later released when the critical section ends. +The write side looks simply like this (with appropriate locking): + + qemu_mutex_lock(&foo_mutex); + old = foo; + qatomic_rcu_set(&foo, new); + qemu_mutex_unlock(&foo_mutex); + synchronize_rcu(); + free(old); + +If the processing cannot be done purely within the critical section, it +is possible to combine this idiom with a "real" reference count: + + rcu_read_lock(); + p = qatomic_rcu_read(&foo); + foo_ref(p); + rcu_read_unlock(); + /* do something with p. */ + foo_unref(p); + +The write side can be like this: + + qemu_mutex_lock(&foo_mutex); + old = foo; + qatomic_rcu_set(&foo, new); + qemu_mutex_unlock(&foo_mutex); + synchronize_rcu(); + foo_unref(old); + +or with call_rcu: + + qemu_mutex_lock(&foo_mutex); + old = foo; + qatomic_rcu_set(&foo, new); + qemu_mutex_unlock(&foo_mutex); + call_rcu(foo_unref, old, rcu); + +In both cases, the write side only performs removal. Reclamation +happens when the last reference to a "foo" object is dropped. +Using synchronize_rcu() is undesirably expensive, because the +last reference may be dropped on the read side. Hence you can +use call_rcu() instead: + + foo_unref(struct foo *p) { + if (qatomic_fetch_dec(&p->refcount) == 1) { + call_rcu(foo_destroy, p, rcu); + } + } + + +Note that the same idioms would be possible with reader/writer +locks: + + read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock); + p = foo; p = foo; + /* do something with p. */ foo = new; + read_unlock(&foo_rwlock); free(p); + write_mutex_unlock(&foo_rwlock); + free(p); + + ------------------------------------------------------------------ + + read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock); + p = foo; old = foo; + foo_ref(p); foo = new; + read_unlock(&foo_rwlock); foo_unref(old); + /* do something with p. */ write_mutex_unlock(&foo_rwlock); + read_lock(&foo_rwlock); + foo_unref(p); + read_unlock(&foo_rwlock); + +foo_unref could use a mechanism such as bottom halves to move deallocation +out of the write-side critical section. + + +RCU resizable arrays +-------------------- + +Resizable arrays can be used with RCU. The expensive RCU synchronization +(or call_rcu) only needs to take place when the array is resized. +The two items to take care of are: + +- ensuring that the old version of the array is available between removal + and reclamation; + +- avoiding mismatches in the read side between the array data and the + array size. + +The first problem is avoided simply by not using realloc. Instead, +each resize will allocate a new array and copy the old data into it. +The second problem would arise if the size and the data pointers were +two members of a larger struct: + + struct mystuff { + ... + int data_size; + int data_alloc; + T *data; + ... 
+    };
+
+Instead, we store the size of the array with the array itself:
+
+    struct arr {
+        int size;
+        int alloc;
+        T data[];
+    };
+    struct arr *global_array;
+
+    read side:
+        rcu_read_lock();
+        struct arr *array = qatomic_rcu_read(&global_array);
+        x = i < array->size ? array->data[i] : -1;
+        rcu_read_unlock();
+        return x;
+
+    write side (running under a lock):
+        if (global_array->size == global_array->alloc) {
+            /* Creating a new version.  */
+            new_array = g_malloc(sizeof(struct arr) +
+                                 global_array->alloc * 2 * sizeof(T));
+            new_array->size = global_array->size;
+            new_array->alloc = global_array->alloc * 2;
+            memcpy(new_array->data, global_array->data,
+                   global_array->alloc * sizeof(T));
+
+            /* Removal phase.  */
+            old_array = global_array;
+            qatomic_rcu_set(&global_array, new_array);
+            synchronize_rcu();
+
+            /* Reclamation phase.  */
+            free(old_array);
+        }
+
+
+SOURCES
+=======
+
+* Documentation/RCU/ from the Linux kernel
diff --git a/docs/devel/replay.txt b/docs/devel/replay.txt
new file mode 100644
index 000000000..e641c35ad
--- /dev/null
+++ b/docs/devel/replay.txt
@@ -0,0 +1,46 @@
+The record/replay mechanism, which can be enabled through icount mode, expects
+the virtual devices to satisfy the following requirements.
+
+The main idea behind this document is that everything that affects
+the guest state during execution in icount mode should be deterministic.
+
+Timers
+======
+
+All virtual devices should use the virtual clock for timers that change the
+guest state. The virtual clock is deterministic, therefore such timers are
+deterministic too.
+
+Virtual devices can also use the realtime clock for events that do not change
+the guest state directly. When the clock ticking should depend on VM execution
+speed, use the virtual clock with the EXTERNAL attribute. It is not
+deterministic, but its speed depends on the guest execution. This clock is
+used by the virtual devices (e.g., the slirp routing device) that lie outside
+the replayed guest.
+
+Bottom halves
+=============
+
+Bottom half callbacks that affect the guest state should be invoked through
+the replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
+Their invocations are saved in record mode and synchronized with the existing
+log in replay mode.
+
+Saving/restoring the VM state
+=============================
+
+All fields in the device state structure (including virtual timers)
+should be restored by loadvm to the same values they had before savevm.
+
+Avoid accessing other devices' state, because the order of saving/restoring
+is not defined. It means that you should not call functions like
+'update_irq' in a post_load callback. Save everything explicitly to avoid
+dependencies that may make restoring the VM state non-deterministic.
+
+Stopping the VM
+===============
+
+Stopping the guest should not interfere with its state (with the exception
+of network connections, which could be broken by remote timeouts).
+The VM can be stopped at any moment of replay by the user. Restarting the VM
+after such a stop should not break the replay with an unneeded guest state
+change.
diff --git a/docs/devel/reset.rst b/docs/devel/reset.rst
new file mode 100644
index 000000000..abea1102d
--- /dev/null
+++ b/docs/devel/reset.rst
@@ -0,0 +1,289 @@
+
+=======================================
+Reset in QEMU: the Resettable interface
+=======================================
+
+The reset of qemu objects is handled using the resettable interface declared
+in ``include/hw/resettable.h``.
+
+This interface allows objects to be grouped (on a tree basis) so that the
+whole group can be reset consistently. Each individual member object does not
+have to care about others; in particular, problems of order (which object is
+reset first) are addressed.
+
+As of now DeviceClass and BusClass implement this interface.
+
+
+Triggering reset
+----------------
+
+This section documents the APIs which "users" of a resettable object should use
+to control it. All resettable control functions must be called while holding
+the iothread lock.
+
+You can apply a reset to an object using ``resettable_assert_reset()``. You need
+to call ``resettable_release_reset()`` to release the object from reset. To
+instantly reset an object, without keeping it in reset state, just call
+``resettable_reset()``. These functions take two parameters: a pointer to the
+object to reset and a reset type.
+
+Several types of reset will be supported. For now only cold reset is defined;
+others may be added later. The Resettable interface handles reset types with an
+enum:
+
+``RESET_TYPE_COLD``
+  Cold reset is supported by every resettable object. In QEMU, it means we reset
+  to the initial state corresponding to the start of QEMU; this might differ
+  from what is a real hardware cold reset. It differs from other resets (like
+  warm or bus resets) which may keep certain parts untouched.
+
+Calling ``resettable_reset()`` is equivalent to calling
+``resettable_assert_reset()`` then ``resettable_release_reset()``. It is
+possible to interleave multiple calls to these three functions. There may
+be several reset sources/controllers of a given object. The interface handles
+everything and the different reset controllers do not need to know anything
+about each other. The object will leave the reset state only when every
+controller has ended its reset operation. This point is handled internally by
+maintaining a count of in-progress resets; it is crucial to call
+``resettable_release_reset()`` exactly once per
+``resettable_assert_reset()`` call.
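+
+For instance, a reset controller that needs to hold a device in reset for some
+period of time might do the following (a sketch; ``obj`` stands for any
+resettable object)::
+
+    resettable_assert_reset(obj, RESET_TYPE_COLD);
+    /* ... the object stays in reset state here ... */
+    resettable_release_reset(obj, RESET_TYPE_COLD);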
+
+For now migration of a device or bus in reset is not supported. Care must be
+taken not to delay ``resettable_release_reset()`` after its
+``resettable_assert_reset()`` counterpart.
+
+Note that, since resettable is an interface, the API takes a simple Object as
+parameter. Still, it is a programming error to call a resettable function on a
+non-resettable object and it will trigger a run-time assertion error. Since
+most calls to the resettable interface are done through base class functions,
+such an error is not likely to happen.
+
+For Devices and Buses, the following helper functions exist:
+
+- ``device_cold_reset()``
+- ``bus_cold_reset()``
+
+These are simple wrappers around the ``resettable_reset()`` function; they only
+cast the Device or Bus into an Object and pass the cold reset type. When
+possible, prefer to use these functions instead of ``resettable_reset()``.
+
+Device and bus functions co-exist because there can be semantic differences
+between resetting a bus and resetting the controller bridge which owns it.
+For example, consider a SCSI controller. Resetting the controller puts all
+its registers back to their reset state and also resets everything on the
+SCSI bus, whereas resetting just the SCSI bus only resets everything that's on
+it but not the controller.
+
+
+Multi-phase mechanism
+---------------------
+
+This section documents the internals of the resettable interface.
+
+The resettable interface uses a multi-phase system to relieve objects and
+machines from reset ordering problems. To that end, the reset operation of an
+object is split into three well-defined phases.
+
+When resetting several objects (for example the whole machine at simulation
+startup), all first phases of all objects are executed, then all second phases
+and then all third phases.
+
+The three phases are:
+
+1. The **enter** phase is executed when the object enters reset. It resets only
+   local state of the object; it must not do anything that has a side-effect
+   on other objects, such as raising or lowering a qemu_irq line or reading or
+   writing guest memory.
+
+2. The **hold** phase is executed for entry into reset, once every object in the
+   group which is being reset has had its *enter* phase executed. At this point
+   devices can do actions that affect other objects.
+
+3. The **exit** phase is executed when the object leaves the reset state.
+   Actions affecting other objects are permitted.
+
+As said in the previous section, the interface maintains a count of in-progress
+resets. This count is used to ensure phases are executed only when required.
+The *enter* and *hold* phases are executed only when asserting reset for the
+first time (if an object is already in reset state when calling
+``resettable_assert_reset()`` or ``resettable_reset()``, they are not
+executed).
+The *exit* phase is executed only when the last reset operation ends. Therefore
+the object does not need to care how many reset controllers it has and how
+many of them have started a reset.
+
+
+Handling reset in a resettable object
+-------------------------------------
+
+This section documents the APIs that an implementation of a resettable object
+must provide and what functions it has access to. It is intended for people
+who want to implement or convert a class which has the resettable interface;
+for example when specializing an existing device or bus.
+
+Methods to implement
+....................
+
+Three methods should be defined or left empty. Each method corresponds to a
+phase of the reset; they are named ``phases.enter()``, ``phases.hold()`` and
+``phases.exit()``. They all take the object as parameter. The *enter* method
+also takes the reset type as a second parameter.
+
+When extending an existing class, these methods may need to be extended too.
+The ``resettable_class_set_parent_phases()`` class function may be used to
+back up the parent class methods.
+
+Here follows an example implementing reset for a device which sets an IO while
+in reset.
+
+::
+
+    static void mydev_reset_enter(Object *obj, ResetType type)
+    {
+        MyDevClass *myclass = MYDEV_GET_CLASS(obj);
+        MyDevState *mydev = MYDEV(obj);
+        /* call parent class enter phase */
+        if (myclass->parent_phases.enter) {
+            myclass->parent_phases.enter(obj, type);
+        }
+        /* initialize local state only */
+        mydev->var = 0;
+    }
+
+    static void mydev_reset_hold(Object *obj)
+    {
+        MyDevClass *myclass = MYDEV_GET_CLASS(obj);
+        MyDevState *mydev = MYDEV(obj);
+        /* call parent class hold phase */
+        if (myclass->parent_phases.hold) {
+            myclass->parent_phases.hold(obj);
+        }
+        /* set an IO */
+        qemu_set_irq(mydev->irq, 1);
+    }
+
+    static void mydev_reset_exit(Object *obj)
+    {
+        MyDevClass *myclass = MYDEV_GET_CLASS(obj);
+        MyDevState *mydev = MYDEV(obj);
+        /* call parent class exit phase */
+        if (myclass->parent_phases.exit) {
+            myclass->parent_phases.exit(obj);
+        }
+        /* clear an IO */
+        qemu_set_irq(mydev->irq, 0);
+    }
+
+    typedef struct MyDevClass {
+        MyParentClass parent_class;
+        /* to store eventual parent reset methods */
+        ResettablePhases parent_phases;
+    } MyDevClass;
+
+    static void mydev_class_init(ObjectClass *class, void *data)
+    {
+        MyDevClass *myclass = MYDEV_CLASS(class);
+        ResettableClass *rc = RESETTABLE_CLASS(class);
+        resettable_class_set_parent_phases(rc,
+                                           mydev_reset_enter,
+                                           mydev_reset_hold,
+                                           mydev_reset_exit,
+                                           &myclass->parent_phases);
+    }
+
+In the above example, we override all three phases. It is possible to override
+only some of them by passing NULL instead of a function pointer to
+``resettable_class_set_parent_phases()``. For example, the following will
+only override the *enter* phase and leave *hold* and *exit* untouched::
+
+    resettable_class_set_parent_phases(rc, mydev_reset_enter,
+                                       NULL, NULL,
+                                       &myclass->parent_phases);
+
+This is equivalent to providing a trivial implementation of the hold and exit
+phases which does nothing but call the parent class's implementation of the
+phase.
+
+Polling the reset state
+.......................
+
+The Resettable interface provides the ``resettable_is_in_reset()`` function.
+This function returns true if the object parameter is currently under reset.
+
+An object is under reset from the beginning of the *enter* phase to the end of
+the *exit* phase. During all three phases, the function will return that the
+object is in reset.
+
+This function may be used if the object behavior has to be adapted
+while in reset state. For example if a device has an irq input,
+it will probably need to ignore it while in reset; then it can for
+example check the reset state at the beginning of the irq callback.
+
+Note that until migration of the reset state is supported, an object
+should not be left in reset. So apart from the object currently executing
+one of its reset phases, the only case in which this function will return
+true is when an external interaction (like changing an io) is made during the
+*hold* or *exit* phase of another object in the same reset group.
+
+Helpers ``device_is_in_reset()`` and ``bus_is_in_reset()`` are also provided
+for devices and buses and should be preferred.
+
+
+Base class handling of reset
+----------------------------
+
+This section documents parts of the reset mechanism that you only need to know
+about if you are extending it to work with a new base class other than
+DeviceClass or BusClass, or maintaining the existing code in those classes.
+Most people can ignore it.
+
+Methods to implement
+....................
+
+There are two other methods that need to exist in a class implementing the
+interface: ``get_state()`` and ``child_foreach()``.
+
+``get_state()`` is simple. *resettable* is an interface and, as a consequence,
+does not have any class state structure. But in order to factorize the code, we
+need one. This method must return a pointer to a ``ResettableState`` structure.
+The structure must be allocated by the base class; preferably it should be
+located inside the object instance structure.
+
+``child_foreach()`` is more complex. It should execute the given callback on
+every reset child of the given resettable object. All children must be
+resettable too. Additional parameters (a reset type and an opaque pointer) must
+be passed to the callback too.
+
+In ``DeviceClass`` and ``BusClass`` the ``ResettableState`` is located in the
+``DeviceState`` and ``BusState`` structures. ``child_foreach()`` is implemented
+to follow the bus hierarchy; for a bus, it calls the function on every child
+device; for a device, it calls the function on every bus child. When we reset
+the main system bus, we reset the whole machine bus tree.
+
+Changing a resettable parent
+............................
+
+One thing which should be taken care of by the base class is handling reset
+hierarchy changes.
+
+The reset hierarchy is supposed to be static and built during machine creation.
+But there are actually some exceptions. To cope with this, the resettable API
+provides ``resettable_change_parent()``. This function allows you to set,
+update or remove the parent of a resettable object after machine creation is
+done. As parameters, it takes the object being moved, the old parent if any and
+the new parent if any.
+
+This function can be used at any time when not in a reset operation. During
+a reset operation it must be used only in the *hold* phase. Using it in the
+*enter* or *exit* phase is an error.
+Also it should not be used during machine creation, although it is harmless to
+do so: the function is a no-op as long as old and new parent are NULL or not
+in reset.
+
+There are currently two cases where this function is used:
+
+1. *device hotplug*; it means a new device is introduced on a live bus.
+
+2. *hot bus change*; it means an existing live device is added, moved or
+   removed in the bus hierarchy. At the moment, it occurs only in the raspi
+   machines for changing the sdbus used by the SD card.
diff --git a/docs/devel/s390-dasd-ipl.rst b/docs/devel/s390-dasd-ipl.rst
new file mode 100644
index 000000000..2529eb5f5
--- /dev/null
+++ b/docs/devel/s390-dasd-ipl.rst
@@ -0,0 +1,138 @@
+Booting from real channel-attached devices on s390x
+===================================================
+
+s390 hardware IPL
+-----------------
+
+The s390 hardware IPL process consists of the following steps.
+
+1. A READ IPL ccw is constructed in memory location ``0x0``.
+   This ccw, by definition, reads the IPL1 record which is located on the disk
+   at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
+   so when it is complete another ccw will be fetched and executed from memory
+   location ``0x08``.
+
+2. Execute the Read IPL ccw at ``0x00``, thereby reading IPL1 data into ``0x00``.
+   IPL1 data is 24 bytes in length and consists of the following pieces of
+   information: ``[psw][read ccw][tic ccw]``. When the machine executes the
+   Read IPL ccw, the 24 bytes of IPL1 are read into memory starting at
+   location ``0x0``. 
Then the ccw program at ``0x08`` which consists of a read + ccw and a tic ccw is automatically executed because of the chain flag from + the original READ IPL ccw. The read ccw will read the IPL2 data into memory + and the TIC (Transfer In Channel) will transfer control to the channel + program contained in the IPL2 data. The TIC channel command is the + equivalent of a branch/jump/goto instruction for channel programs. + + NOTE: The ccws in IPL1 are defined by the architecture to be format 0. + +3. Execute IPL2. + The TIC ccw instruction at the end of the IPL1 channel program will begin + the execution of the IPL2 channel program. IPL2 is stage-2 of the boot + process and will contain a larger channel program than IPL1. The point of + IPL2 is to find and load either the operating system or a small program that + loads the operating system from disk. At the end of this step all or some of + the real operating system is loaded into memory and we are ready to hand + control over to the guest operating system. At this point the guest + operating system is entirely responsible for loading any more data it might + need to function. + + NOTE: The IPL2 channel program might read data into memory + location ``0x0`` thereby overwriting the IPL1 psw and channel program. This is ok + as long as the data placed in location ``0x0`` contains a psw whose instruction + address points to the guest operating system code to execute at the end of + the IPL/boot process. + + NOTE: The ccws in IPL2 are defined by the architecture to be format 0. + +4. Start executing the guest operating system. + The psw that was loaded into memory location ``0x0`` as part of the ipl process + should contain the needed flags for the operating system we have loaded. The + psw's instruction address will point to the location in memory where we want + to start executing the operating system. This psw is loaded (via LPSW + instruction) causing control to be passed to the operating system code. + +In a non-virtualized environment this process, handled entirely by the hardware, +is kicked off by the user initiating a "Load" procedure from the hardware +management console. This "Load" procedure crafts a special "Read IPL" ccw in +memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking +off the reading of IPL1 data. Since the channel program from IPL1 will be +written immediately after the special "Read IPL" ccw, the IPL1 channel program +will be executed immediately (the special read ccw has the chaining bit turned +on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel +program to be executed automatically. After this sequence completes the "Load" +procedure then loads the psw from ``0x0``. + +How this all pertains to QEMU (and the kernel) +---------------------------------------------- + +In theory we should merely have to do the following to IPL/boot a guest +operating system from a DASD device: + +1. Place a "Read IPL" ccw into memory location ``0x0`` with chaining bit on. +2. Execute channel program at ``0x0``. +3. LPSW ``0x0``. + +However, our emulation of the machine's channel program logic within the kernel +is missing one key feature that is required for this process to work: +non-prefetch of ccw data. + +When we start a channel program we pass the channel subsystem parameters via an +ORB (Operation Request Block). One of those parameters is a prefetch bit. 
If the +bit is on then the vfio-ccw kernel driver is allowed to read the entire channel +program from guest memory before it starts executing it. This means that any +channel commands that read additional channel commands will not work as expected +because the newly read commands will only exist in guest memory and NOT within +the kernel's channel subsystem memory. The kernel vfio-ccw driver currently +requires this bit to be on for all channel programs. This is a problem because +the IPL process consists of transferring control from the "Read IPL" ccw +immediately to the IPL1 channel program that was read by "Read IPL". + +Not being able to turn off prefetch will also prevent the TIC at the end of the +IPL1 channel program from transferring control to the IPL2 channel program. + +Lastly, in some cases (the zipl bootloader for example) the IPL2 program also +transfers control to another channel program segment immediately after reading +it from the disk. So we need to be able to handle this case. + +What QEMU does +-------------- + +Since we are forced to live with prefetch we cannot use the very simple IPL +procedure we defined in the preceding section. So we compensate by doing the +following. + +1. Place "Read IPL" ccw into memory location ``0x0``, but turn off chaining bit. +2. Execute "Read IPL" at ``0x0``. + + So now IPL1's psw is at ``0x0`` and IPL1's channel program is at ``0x08``. + +3. Write a custom channel program that will seek to the IPL2 record and then + execute the READ and TIC ccws from IPL1. Normally the seek is not required + because after reading the IPL1 record the disk is automatically positioned + to read the very next record which will be IPL2. But since we are not reading + both IPL1 and IPL2 as part of the same channel program we must manually set + the position. + +4. Grab the target address of the TIC instruction from the IPL1 channel program. + This address is where the IPL2 channel program starts. + + Now IPL2 is loaded into memory somewhere, and we know the address. + +5. Execute the IPL2 channel program at the address obtained in step #4. + + Because this channel program can be dynamic, we must use a special algorithm + that detects a READ immediately followed by a TIC and breaks the ccw chain + by turning off the chain bit in the READ ccw. When control is returned from + the kernel/hardware to the QEMU bios code we immediately issue another start + subchannel to execute the remaining TIC instruction. This causes the entire + channel program (starting from the TIC) and all needed data to be refetched + thereby stepping around the limitation that would otherwise prevent this + channel program from executing properly. + + Now the operating system code is loaded somewhere in guest memory and the psw + in memory location ``0x0`` will point to entry code for the guest operating + system. + +6. LPSW ``0x0`` + + LPSW transfers control to the guest operating system and we're done. diff --git a/docs/devel/secure-coding-practices.rst b/docs/devel/secure-coding-practices.rst new file mode 100644 index 000000000..0454cc527 --- /dev/null +++ b/docs/devel/secure-coding-practices.rst @@ -0,0 +1,115 @@ +======================= +Secure Coding Practices +======================= +This document covers topics that both developers and security researchers must +be aware of so that they can develop safe code and audit existing code +properly. 
+ +Reporting Security Bugs +----------------------- +For details on how to report security bugs or ask questions about potential +security bugs, see the `Security Process wiki page +<https://wiki.qemu.org/SecurityProcess>`_. + +General Secure C Coding Practices +--------------------------------- +Most CVEs (security bugs) reported against QEMU are not specific to +virtualization or emulation. They are simply C programming bugs. Therefore +it's critical to be aware of common classes of security bugs. + +There is a wide selection of resources available covering secure C coding. For +example, the `CERT C Coding Standard +<https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard>`_ +covers the most important classes of security bugs. + +Instead of describing them in detail here, only the names of the most important +classes of security bugs are mentioned: + +* Buffer overflows +* Use-after-free and double-free +* Integer overflows +* Format string vulnerabilities + +Some of these classes of bugs can be detected by analyzers. Static analysis is +performed regularly by Coverity and the most obvious of these bugs are even +reported by compilers. Dynamic analysis is possible with valgrind, tsan, and +asan. + +Input Validation +---------------- +Inputs from the guest or external sources (e.g. network, files) cannot be +trusted and may be invalid. Inputs must be checked before using them in a way +that could crash the program, expose host memory to the guest, or otherwise be +exploitable by an attacker. + +The most sensitive attack surface is device emulation. All hardware register +accesses and data read from guest memory must be validated. A typical example +is a device that contains multiple units that are selectable by the guest via +an index register:: + + typedef struct { + ProcessingUnit unit[2]; + ... + } MyDeviceState; + + static void mydev_writel(void *opaque, uint32_t addr, uint32_t val) + { + MyDeviceState *mydev = opaque; + ProcessingUnit *unit; + + switch (addr) { + case MYDEV_SELECT_UNIT: + unit = &mydev->unit[val]; <-- this input wasn't validated! + ... + } + } + +If ``val`` is not in range [0, 1] then an out-of-bounds memory access will take +place when ``unit`` is dereferenced. The code must check that ``val`` is 0 or +1 and handle the case where it is invalid. + +Unexpected Device Accesses +-------------------------- +The guest may access device registers in unusual orders or at unexpected +moments. Device emulation code must not assume that the guest follows the +typical "theory of operation" presented in driver writer manuals. The guest +may make nonsense accesses to device registers such as starting operations +before the device has been fully initialized. + +A related issue is that device emulation code must be prepared for unexpected +device register accesses while asynchronous operations are in progress. A +well-behaved guest might wait for a completion interrupt before accessing +certain device registers. Device emulation code must handle the case where the +guest overwrites registers or submits further requests before an ongoing +request completes. Unexpected accesses must not cause memory corruption or +leaks in QEMU. + +Invalid device register accesses can be reported with +``qemu_log_mask(LOG_GUEST_ERROR, ...)``. The ``-d guest_errors`` command-line +option enables these log messages. + +Live Migration +-------------- +Device state can be saved to disk image files and shared with other users. 
+Live migration code must validate inputs when loading device state so an +attacker cannot gain control by crafting invalid device states. Device state +is therefore considered untrusted even though it is typically generated by QEMU +itself. + +Guest Memory Access Races +------------------------- +Guests with multiple vCPUs may modify guest RAM while device emulation code is +running. Device emulation code must copy in descriptors and other guest RAM +structures and only process the local copy. This prevents +time-of-check-to-time-of-use (TOCTOU) race conditions that could cause QEMU to +crash when a vCPU thread modifies guest RAM while device emulation is +processing it. + +Use of null-co block drivers +---------------------------- + +The ``null-co`` block driver is designed for performance: its read accesses are +not initialized by default. In case this driver has to be used for security +research, it must be used with the ``read-zeroes=on`` option which fills read +buffers with zeroes. Security issues reported with the default +(``read-zeroes=off``) will be discarded. diff --git a/docs/devel/stable-process.rst b/docs/devel/stable-process.rst new file mode 100644 index 000000000..c21fb8664 --- /dev/null +++ b/docs/devel/stable-process.rst @@ -0,0 +1,73 @@ +.. _stable-process: + +QEMU and the stable process +=========================== + +QEMU stable releases +-------------------- + +QEMU stable releases are based upon the last released QEMU version +and marked by an additional version number, e.g. 2.10.1. Occasionally, +a four-number version is released, if a single urgent fix needs to go +on top. + +Usually, stable releases are only provided for the last major QEMU +release. For example, when QEMU 2.11.0 is released, 2.11.x or 2.11.x.y +stable releases are produced only until QEMU 2.12.0 is released, at +which point the stable process moves to producing 2.12.x/2.12.x.y releases. + +What should go into a stable release? +------------------------------------- + +Generally, the following patches are considered stable material: + +* Patches that fix severe issues, like fixes for CVEs + +* Patches that fix regressions + +If you think the patch would be important for users of the current release +(or for a distribution picking fixes), it is usually a good candidate +for stable. + + +How to get a patch into QEMU stable +----------------------------------- + +There are various ways to get a patch into stable: + +* Preferred: Make sure that the stable maintainers are on copy when you send + the patch by adding + + .. code:: + + Cc: qemu-stable@nongnu.org + + to the patch description. By default, this will send a copy of the patch + to ``qemu-stable@nongnu.org`` if you use git send-email, which is where + patches that are stable candidates are tracked by the maintainers. + +* You can also reply to a patch and put ``qemu-stable@nongnu.org`` on copy + directly in your mail client if you think a previously submitted patch + should be considered for a stable release. + +* If a maintainer judges the patch appropriate for stable later on (or you + notify them), they will add the same line to the patch, meaning that + the stable maintainers will be on copy on the maintainer's pull request. + +* If you judge an already merged patch suitable for stable, send a mail + (preferably as a reply to the most recent patch submission) to + ``qemu-stable@nongnu.org`` along with ``qemu-devel@nongnu.org`` and + appropriate other people (like the patch author or the relevant maintainer) + on copy. 
+ +Stable release process +---------------------- + +When the stable maintainers prepare a new stable release, they will prepare +a git branch with a release candidate and send the patches out to +``qemu-devel@nongnu.org`` for review. If any of your patches are included, +please verify that they look fine, especially if the maintainer had to tweak +the patch as part of back-porting things across branches. You may also +nominate other patches that you think are suitable for inclusion. After +review is complete (may involve more release candidates), a new stable release +is made available. diff --git a/docs/devel/style.rst b/docs/devel/style.rst new file mode 100644 index 000000000..9c5c0fffd --- /dev/null +++ b/docs/devel/style.rst @@ -0,0 +1,703 @@ +.. _coding-style: + +================= +QEMU Coding Style +================= + +.. contents:: Table of Contents + +Please use the script checkpatch.pl in the scripts directory to check +patches before submitting. + +Formatting and style +******************** + +Whitespace +========== + +Of course, the most important aspect in any coding style is whitespace. +Crusty old coders who have trouble spotting the glasses on their noses +can tell the difference between a tab and eight spaces from a distance +of approximately fifteen parsecs. Many a flamewar has been fought and +lost on this issue. + +QEMU indents are four spaces. Tabs are never used, except in Makefiles +where they have been irreversibly coded into the syntax. +Spaces of course are superior to tabs because: + +* You have just one way to specify whitespace, not two. Ambiguity breeds + mistakes. +* The confusion surrounding 'use tabs to indent, spaces to justify' is gone. +* Tab indents push your code to the right, making your screen seriously + unbalanced. +* Tabs will be rendered incorrectly on editors who are misconfigured not + to use tab stops of eight positions. +* Tabs are rendered badly in patches, causing off-by-one errors in almost + every line. +* It is the QEMU coding style. + +Do not leave whitespace dangling off the ends of lines. + +Multiline Indent +---------------- + +There are several places where indent is necessary: + +* if/else +* while/for +* function definition & call + +When breaking up a long line to fit within line width, we need a proper indent +for the following lines. + +In case of if/else, while/for, align the secondary lines just after the +opening parenthesis of the first. + +For example: + +.. code-block:: c + + if (a == 1 && + b == 2) { + + while (a == 1 && + b == 2) { + +In case of function, there are several variants: + +* 4 spaces indent from the beginning +* align the secondary lines just after the opening parenthesis of the first + +For example: + +.. code-block:: c + + do_something(x, y, + z); + + do_something(x, y, + z); + + do_something(x, do_another(y, + z)); + +Line width +========== + +Lines should be 80 characters; try not to make them longer. + +Sometimes it is hard to do, especially when dealing with QEMU subsystems +that use long function or symbol names. If wrapping the line at 80 columns +is obviously less readable and more awkward, prefer not to wrap it; better +to have an 85 character line than one which is awkwardly wrapped. + +Even in that case, try not to make lines much longer than 80 characters. +(The checkpatch script will warn at 100 characters, but this is intended +as a guard against obviously-overlength lines, not a target.) 
+ +Rationale: + +* Some people like to tile their 24" screens with a 6x4 matrix of 80x24 + xterms and use vi in all of them. The best way to punish them is to + let them keep doing it. +* Code and especially patches are much more readable if limited to a sane + line length. Eighty is traditional. +* The four-space indentation makes the most common excuse ("But look + at all that white space on the left!") moot. +* It is the QEMU coding style. + +Naming +====== + +Variables are lower_case_with_underscores; easy to type and read. Structured +type names are in CamelCase; harder to type, but they stand out. Enum type +names and function type names should also be in CamelCase. Scalar type +names are lower_case_with_underscores_ending_with_a_t, like the POSIX +uint64_t and family. Note that this last convention contradicts POSIX +and is therefore likely to be changed. + +Variable Naming Conventions +--------------------------- + +A number of short naming conventions exist for variables that use +common QEMU types. For example, the architecture independent CPUState +is often held as a ``cs`` pointer variable, whereas the concrete +CPUArchState is usually held in a pointer called ``env``. + +Likewise, in device emulation code the common DeviceState is usually +called ``dev``. + +Function Naming Conventions +--------------------------- + +Wrapped versions of standard library or GLib functions use a ``qemu_`` +prefix to alert readers that they are seeing a wrapped version, for +example ``qemu_strtol`` or ``qemu_mutex_lock``. Other utility functions +that are widely called from across the codebase should not have any +prefix, for example ``pstrcpy`` or bit manipulation functions such as +``find_first_bit``. + +The ``qemu_`` prefix is also used for functions that modify global +emulator state, for example ``qemu_add_vm_change_state_handler``. +However, if there is an obvious subsystem-specific prefix it should be +used instead. + +Public functions from a file or subsystem (declared in headers) tend +to have a consistent prefix to show where they came from. For example, +``tlb_`` for functions from ``cputlb.c`` or ``cpu_`` for functions +from ``cpus.c``. + +If there are two versions of a function to be called with or without a +lock held, the function that expects the lock to be already held +usually uses the suffix ``_locked``. + + +Block structure +=============== + +Every indented statement is braced; even if the block contains just one +statement. The opening brace is on the line that contains the control +flow statement that introduces the new block; the closing brace is on the +same line as the else keyword, or on a line by itself if there is no else +keyword. Example: + +.. code-block:: c + + if (a == 5) { + printf("a was 5.\n"); + } else if (a == 6) { + printf("a was 6.\n"); + } else { + printf("a was something else entirely.\n"); + } + +Note that 'else if' is considered a single statement; otherwise a long if/ +else if/else if/.../else sequence would need an indent for every else +statement. + +An exception is the opening brace for a function; for reasons of tradition +and clarity it comes on a line by itself: + +.. code-block:: c + + void a_function(void) + { + do_something(); + } + +Rationale: a consistent (except for functions...) bracing style reduces +ambiguity and avoids needless churn when lines are added or removed. +Furthermore, it is the QEMU coding style. 
+ +Declarations +============ + +Mixed declarations (interleaving statements and declarations within +blocks) are generally not allowed; declarations should be at the beginning +of blocks. + +Every now and then, an exception is made for declarations inside a +#ifdef or #ifndef block: if the code looks nicer, such declarations can +be placed at the top of the block even if there are statements above. +On the other hand, it's often best to move that #ifdef/#ifndef +block to a separate function altogether. + +Conditional statements +====================== + +When comparing a variable for (in)equality with a constant, list the +constant on the right, as in: + +.. code-block:: c + + if (a == 1) { + /* Reads like: "If a equals 1" */ + do_something(); + } + +Rationale: Yoda conditions (as in 'if (1 == a)') are awkward to read. +Besides, good compilers already warn users when '==' is mis-typed as '=', +even when the constant is on the right. + +Comment style +============= + +We use traditional C-style /``*`` ``*``/ comments and avoid // comments. + +Rationale: The // form is valid in C99, so this is purely a matter of +consistency of style. The checkpatch script will warn you about this. + +Multiline comment blocks should have a row of stars on the left, +and the initial /``*`` and terminating ``*``/ both on their own lines: + +.. code-block:: c + + /* + * like + * this + */ + +This is the same format required by the Linux kernel coding style. + +(Some of the existing comments in the codebase use the GNU Coding +Standards form which does not have stars on the left, or other +variations; avoid these when writing new comments, but don't worry +about converting to the preferred form unless you're editing that +comment anyway.) + +Rationale: Consistency, and ease of visually picking out a multiline +comment from the surrounding code. + +Language usage +************** + +Preprocessor +============ + +Variadic macros +--------------- + +For variadic macros, stick with this C99-like syntax: + +.. code-block:: c + + #define DPRINTF(fmt, ...) \ + do { printf("IRQ: " fmt, ## __VA_ARGS__); } while (0) + +Include directives +------------------ + +Order include directives as follows: + +.. code-block:: c + + #include "qemu/osdep.h" /* Always first... */ + #include <...> /* then system headers... */ + #include "..." /* and finally QEMU headers. */ + +The "qemu/osdep.h" header contains preprocessor macros that affect the behavior +of core system headers like <stdint.h>. It must be the first include so that +core system headers included by external libraries get the preprocessor macros +that QEMU depends on. + +Do not include "qemu/osdep.h" from header files since the .c file will have +already included it. + +C types +======= + +It should be common sense to use the right type, but we have collected +a few useful guidelines here. + +Scalars +------- + +If you're using "int" or "long", odds are good that there's a better type. +If a variable is counting something, it should be declared with an +unsigned type. + +If it's host memory-size related, size_t should be a good choice (use +ssize_t only if required). Guest RAM memory offsets must use ram_addr_t, +but only for RAM; it may not cover the whole guest address space. + +If it's file-size related, use off_t. +If it's file-offset related (i.e., signed), use off_t. +If it's just counting small numbers, use "unsigned int"; +(on all but oddball embedded systems, you can assume that that +type is at least four bytes wide). 
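+ +For example, under these guidelines one might end up with declarations like the following (the variable names are invented for illustration, not existing QEMU code): + +.. code-block:: c + + size_t host_buf_len; /* host memory size */ + ram_addr_t ram_offset; /* offset into guest RAM */ + off_t file_size; /* file size or offset */ + unsigned int nb_requests; /* small counter */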
+ +In the event that you require a specific width, use a standard type +like int32_t, uint32_t, uint64_t, etc. The specific types are +mandatory for VMState fields. + +Don't use Linux kernel internal types like u32, __u32 or __le32. + +Use hwaddr for guest physical addresses except pcibus_t +for PCI addresses. In addition, ram_addr_t is a QEMU internal address +space that maps guest RAM physical addresses into an intermediate +address space that can map to host virtual address spaces. Generally +speaking, the size of guest memory can always fit into ram_addr_t but +it would not be correct to store an actual guest physical address in a +ram_addr_t. + +For CPU virtual addresses there are several possible types. +vaddr is the best type to use to hold a CPU virtual address in +target-independent code. It is guaranteed to be large enough to hold a +virtual address for any target, and it does not change size from target +to target. It is always unsigned. +target_ulong is a type the size of a virtual address on the CPU; this means +it may be 32 or 64 bits depending on which target is being built. It should +therefore be used only in target-specific code, and in some +performance-critical built-per-target core code such as the TLB code. +There is also a signed version, target_long. +abi_ulong is for the ``*``-user targets, and represents a type the size of +'void ``*``' in that target's ABI. (This may not be the same as the size of a +full CPU virtual address in the case of target ABIs which use 32 bit pointers +on 64 bit CPUs, like sparc32plus.) Definitions of structures that must match +the target's ABI must use this type for anything that on the target is defined +to be an 'unsigned long' or a pointer type. +There is also a signed version, abi_long. + +Of course, take all of the above with a grain of salt. If you're about +to use some system interface that requires a type like size_t, pid_t or +off_t, use matching types for any corresponding variables. + +Also, if you try to use e.g., "unsigned int" as a type, and that +conflicts with the signedness of a related variable, sometimes +it's best just to use the *wrong* type, if "pulling the thread" +and fixing all related variables would be too invasive. + +Finally, while using descriptive types is important, be careful not to +go overboard. If whatever you're doing causes warnings, or requires +casts, then reconsider or ask for help. + +Pointers +-------- + +Ensure that all of your pointers are "const-correct". +Unless a pointer is used to modify the pointed-to storage, +give it the "const" attribute. That way, the reader knows +up-front that this is a read-only pointer. Perhaps more +importantly, if we're diligent about this, when you see a non-const +pointer, you're guaranteed that it is used to modify the storage +it points to, or it is aliased to another pointer that is. + +Typedefs +-------- + +Typedefs are used to eliminate the redundant 'struct' keyword, since type +names have a different style than other identifiers ("CamelCase" versus +"snake_case"). Each named struct type should have a CamelCase name and a +corresponding typedef. + +Since certain C compilers choke on duplicated typedefs, you should avoid +them and declare a typedef only in one header file. For common types, +you can use "include/qemu/typedefs.h" for example. 
However, as a matter +of convenience, it is also perfectly fine to use forward struct +definitions instead of typedefs in headers and function prototypes; this +avoids problems with duplicated typedefs and reduces the need to include +headers from other headers. + +Reserved namespaces in C and POSIX +---------------------------------- + +Underscore capital, double underscore, and underscore 't' suffixes should be +avoided. + +Low level memory management +=========================== + +Use of the ``malloc/free/realloc/calloc/valloc/memalign/posix_memalign`` +APIs is not allowed in the QEMU codebase. Instead of these routines, +use the GLib memory allocation routines +``g_malloc/g_malloc0/g_new/g_new0/g_realloc/g_free`` +or QEMU's ``qemu_memalign/qemu_blockalign/qemu_vfree`` APIs. + +Please note that ``g_malloc`` will exit on allocation failure, so +there is no need to test for failure (as you would have to with +``malloc``). Generally using ``g_malloc`` on start-up is fine as the +result of a failure to allocate memory is going to be a fatal exit +anyway. There may be some start-up cases where failing is unreasonable +(for example speculatively loading a large debug symbol table). + +Care should be taken to avoid introducing places where the guest could +trigger an exit by causing a large allocation. For small allocations, +of the order of 4k, a failure to allocate is likely indicative of an +overloaded host and allowing ``g_malloc`` to ``exit`` is a reasonable +approach. However, for larger allocations where we could realistically +fall back to a smaller one if need be, we should use functions like +``g_try_new`` and check the result. For example this is a valid approach +for a time/space trade-off like ``tlb_mmu_resize_locked`` in the +SoftMMU TLB code. + +If the lifetime of the allocation is within the function and there are +multiple exit paths you can also improve the readability of the code +by using ``g_autofree`` and related annotations. See :ref:`autofree-ref` +for more details. + +Calling ``g_malloc`` with a zero size is valid and will return NULL. + +Prefer ``g_new(T, n)`` instead of ``g_malloc(sizeof(T) * n)`` for the following +reasons: + +* It catches multiplication overflowing size_t; +* It returns T ``*`` instead of void ``*``, letting compiler catch more type errors. + +Declarations like + +.. code-block:: c + + T *v = g_malloc(sizeof(*v)) + +are acceptable, though. + +Memory allocated by ``qemu_memalign`` or ``qemu_blockalign`` must be freed with +``qemu_vfree``, since breaking this will cause problems on Win32. + +String manipulation +=================== + +Do not use the strncpy function. As mentioned in the man page, it does *not* +guarantee a NULL-terminated buffer, which makes it extremely dangerous to use. +It also zeros trailing destination bytes out to the specified length. Instead, +use this similar function when possible, but note its different signature: + +.. code-block:: c + + void pstrcpy(char *dest, int dest_buf_size, const char *src) + +Don't use strcat because it can't check for buffer overflows, but: + +.. code-block:: c + + char *pstrcat(char *buf, int buf_size, const char *s) + +The same limitation exists with sprintf and vsprintf, so use snprintf and +vsnprintf. + +QEMU provides other useful string functions: + +.. 
code-block:: c + + int strstart(const char *str, const char *val, const char **ptr) + int stristart(const char *str, const char *val, const char **ptr) + int qemu_strnlen(const char *s, int max_len) + +There are also replacement character processing macros for isxyz and toxyz, +so instead of e.g. isalnum you should use qemu_isalnum. + +Because of the memory management rules, you must use g_strdup/g_strndup +instead of plain strdup/strndup. + +Printf-style functions +====================== + +Whenever you add a new printf-style function, i.e., one with a format +string argument and following "..." in its prototype, be sure to use +gcc's printf attribute directive in the prototype. + +This makes it so gcc's -Wformat and -Wformat-security options can do +their jobs and cross-check format strings with the number and types +of arguments. + +C standard, implementation defined and undefined behaviors +========================================================== + +C code in QEMU should be written to the C99 language specification. A copy +of the final version of the C99 standard with corrigenda TC1, TC2, and TC3 +included, formatted as a draft, can be downloaded from: + + `<http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf>`_ + +The C language specification defines regions of undefined behavior and +implementation defined behavior (to give compiler authors enough leeway to +produce better code). In general, code in QEMU should follow the language +specification and avoid both undefined and implementation defined +constructs. ("It works fine on the gcc I tested it with" is not a valid +argument...) However there are a few areas where we allow ourselves to +assume certain behaviors because in practice all the platforms we care about +behave in the same way and writing strictly conformant code would be +painful. These are: + +* you may assume that integers use a 2s complement representation +* you may assume that right shift of a signed integer duplicates + the sign bit (i.e. it is an arithmetic shift, not a logical shift) + +In addition, QEMU assumes that the compiler does not use the latitude +given in C99 and C11 to treat aspects of signed '<<' as undefined, as +documented in the GNU Compiler Collection manual starting at version 4.0. + +.. _autofree-ref: + +Automatic memory deallocation +============================= + +QEMU has a mandatory dependency on either the GCC or Clang compiler. As +such it has the freedom to make use of a C language extension for +automatically running a cleanup function when a stack variable goes +out of scope. This can be used to simplify function cleanup paths, +often allowing many goto jumps to be eliminated, through automatic +freeing of memory. + +The GLib2 library provides a number of functions/macros for enabling +automatic cleanup: + + `<https://developer.gnome.org/glib/stable/glib-Miscellaneous-Macros.html>`_ + +Most notably: + +* g_autofree - will invoke g_free() on the variable going out of scope + +* g_autoptr - for structs / objects, will invoke the cleanup func created + by a previous use of G_DEFINE_AUTOPTR_CLEANUP_FUNC. This is + supported for most GLib data types and GObjects + +For example, instead of + +.. code-block:: c + + int somefunc(void) { + int ret = -1; + char *foo = g_strdup_printf("foo%s", "wibble"); + GList *bar = ..... + + if (eek) { + goto cleanup; + } + + ret = 0; + + cleanup: + g_free(foo); + g_list_free(bar); + return ret; + } + +Using g_autofree/g_autoptr enables the code to be written as: + +.. 
code-block:: c + + int somefunc(void) { + g_autofree char *foo = g_strdup_printf("foo%s", "wibble"); + g_autoptr (GList) bar = ..... + + if (eek) { + return -1; + } + + return 0; + } + +While this generally results in simpler, less leak-prone code, there +are still some caveats to beware of: + +* Variables declared with g_auto* MUST always be initialized, + otherwise the cleanup function will use uninitialized stack memory + +* If a variable declared with g_auto* holds a value which must + live beyond the life of the function, that value must be saved + and the original variable NULL'd out. This can be simpler using + g_steal_pointer: + + +.. code-block:: c + + char *somefunc(void) { + g_autofree char *foo = g_strdup_printf("foo%s", "wibble"); + g_autoptr (GList) bar = ..... + + if (eek) { + return NULL; + } + + return g_steal_pointer(&foo); + } + + +QEMU Specific Idioms +******************** + +Error handling and reporting +============================ + +Reporting errors to the human user +---------------------------------- + +Do not use printf(), fprintf() or monitor_printf(). Instead, use +error_report() or error_vreport() from error-report.h. This ensures the +error is reported in the right place (current monitor or stderr), and in +a uniform format. + +Use error_printf() & friends to print additional information. + +error_report() prints the current location. In certain common cases +like command line parsing, the current location is tracked +automatically. To manipulate it manually, use the loc_``*``() functions +from error-report.h. + +Propagating errors +------------------ + +An error can't always be reported to the user right where it's detected, +but often needs to be propagated up the call chain to a place that can +handle it. This can be done in various ways. + +The most flexible one is Error objects. See error.h for usage +information. + +Use the simplest suitable method to communicate success / failure to +callers. Stick to common methods: non-negative on success / -1 on +error, non-negative / -errno, non-null / null, or Error objects. + +Example: when a function returns a non-null pointer on success, and it +can fail only in one way (as far as the caller is concerned), returning +null on failure is just fine, and certainly simpler and a lot easier on +the eyes than propagating an Error object through an Error ``*````*`` parameter. + +Example: when a function's callers need to report details on failure +only the function really knows, use Error ``*````*``, and set suitable errors. + +Do not report an error to the user when you're also returning an error +for somebody else to handle. Leave the reporting to the place that +consumes the error returned. + +Handling errors +--------------- + +Calling exit() is fine when handling configuration errors during +startup. It's problematic during normal operation. In particular, +monitor commands should never exit(). + +Do not call exit() or abort() to handle an error that can be triggered +by the guest (e.g., some unimplemented corner case in guest code +translation or device emulation). Guests should not be able to +terminate QEMU. + +Note that &error_fatal is just another way to exit(1), and &error_abort +is just another way to abort(). + + +trace-events style +================== + +0x prefix +--------- + +In trace-events files, use a '0x' prefix to specify hex numbers, as in: + +.. 
code-block:: c + + some_trace(unsigned x, uint64_t y) "x 0x%x y 0x" PRIx64 + +An exception is made for groups of numbers that are hexadecimal by +convention and separated by the symbols '.', '/', ':', or ' ' (such as +PCI bus id): + +.. code-block:: c + + another_trace(int cssid, int ssid, int dev_num) "bus id: %x.%x.%04x" + +However, you can use '0x' for such groups if you want. In any case, be sure +that it is obvious that the numbers are in hex, e.g.: + +.. code-block:: c + + data_dump(uint8_t c1, uint8_t c2, uint8_t c3) "bytes (in hex): %02x %02x %02x" + +Rationale: hex numbers are hard to read in logs when there is no 0x prefix, +especially when (occasionally) the representation doesn't contain any letters +and especially in one line with other decimal numbers. Number groups are allowed +to omit '0x' because notations like %x.%x.%x are in wide use for such things, not +only in QEMU. Also, dumping raw data bytes with '0x' is less readable. + +'#' printf flag +--------------- + +Do not use printf flag '#', like '%#x'. + +Rationale: there are two ways to add a '0x' prefix to a printed number: '0x%...' +and '%#...'. For consistency, only one of them should be used. Arguments for +'0x%' are: + +* it is more popular +* '%#' omits the 0x for the value 0, which makes output inconsistent diff --git a/docs/devel/submitting-a-patch.rst b/docs/devel/submitting-a-patch.rst new file mode 100644 index 000000000..e51259eb9 --- /dev/null +++ b/docs/devel/submitting-a-patch.rst @@ -0,0 +1,562 @@ +.. _submitting-a-patch: + +Submitting a Patch +================== + +QEMU welcomes contributions of code (either fixing bugs or adding new +functionality). However, we get a lot of patches, and so we have some +guidelines about submitting patches. If you follow these, you'll help +make our task of code review easier and your patch is likely to be +committed faster. + +This page seems very long, so if you are only trying to post a quick +one-shot fix, the bare minimum we ask is that: + +- You **must** provide a Signed-off-by: line (this is a hard + requirement because it's how you say "I'm legally okay to contribute + this and happy for it to go into QEMU", modeled after the `Linux kernel + <http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__ + policy.) ``git commit -s`` or ``git format-patch -s`` will add one. +- All contributions to QEMU must be **sent as patches** to the + qemu-devel `mailing list <MailingLists>`__. Patch contributions + should not be posted on the bug tracker, posted on forums, or + externally hosted and linked to. (We have other mailing lists too, + but all patches must go to qemu-devel, possibly with a Cc: to another + list.) ``git send-email`` (`step-by-step setup + guide <https://git-send-email.io/>`__ and `hints and + tips <https://elixir.bootlin.com/linux/latest/source/Documentation/process/email-clients.rst>`__) + works best for delivering the patch without mangling it, but + attachments can be used as a last resort on a first-time submission. +- You must read replies to your message, and be willing to act on them. + Note, however, that maintainers are often willing to manually fix up + first-time contributions, since there is a learning curve involved in + making an ideal patch submission. A minimal example workflow is + sketched just below. 
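+ +As a concrete sketch of that bare minimum, a one-shot fix could be posted like this (the subject and file name are purely illustrative):: + + git commit -s -m "subsystem: fix off-by-one in foo" + git format-patch -1 + git send-email --to=qemu-devel@nongnu.org 0001-subsystem-fix-off-by-one-in-foo.patch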
+ +You do not have to subscribe to post (list policy is to reply-to-all to +preserve CCs and keep non-subscribers in the loop on the threads they +start), although you may find it easier as a subscriber to pick up good +ideas from other posts. If you do subscribe, be prepared for a high +volume of email, often over one thousand messages in a week. The list is +moderated; first-time posts from an email address (whether or not you +subscribed) may be subject to some delay while waiting for a moderator +to whitelist your address. + +The larger your contribution is, or if you plan on becoming a long-term +contributor, then the more important the rest of this page becomes. +Reading the table of contents below should already give you an idea of +the basic requirements. Use the table of contents as a reference, and +read the parts that you have doubts about. + +.. contents:: Table of Contents + +.. _writing_your_patches: + +Writing your Patches +-------------------- + +.. _use_the_qemu_coding_style: + +Use the QEMU coding style +~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can run *scripts/checkpatch.pl <patchfile>* before submitting to +check that you are in compliance with our coding standards. Be aware +that ``checkpatch.pl`` is not infallible, though, especially where C +preprocessor macros are involved; use some common sense too. See also: + +- :ref:`coding-style` +- `Automate a checkpatch run on + commit <https://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html>`__ + +.. _base_patches_against_current_git_master: + +Base patches against current git master +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There's no point submitting a patch which is based on a released version +of QEMU because development will have moved on from then and it probably +won't even apply to master. We only apply selected bugfixes to release +branches and then only as backports once the code has gone into master. + +It is also okay to base patches on top of other on-going work that is +not yet part of the git master branch. To aid continuous integration +tools, such as `patchew <http://patchew.org/QEMU/>`__, you should `add a +tag <https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg01288.html>`__ +line ``Based-on: $MESSAGE_ID`` to your cover letter to make the series +dependency obvious. + +.. _split_up_long_patches: + +Split up long patches +~~~~~~~~~~~~~~~~~~~~~ + +Split up longer patches into a patch series of logical code changes. +Each change should compile and execute successfully. For instance, don't +add a file to the makefile in patch one and then add the file itself in +patch two. (This rule is here so that people can later use tools like +`git bisect <http://git-scm.com/docs/git-bisect>`__ without hitting +points in the commit history where QEMU doesn't work for reasons +unrelated to the bug they're chasing.) Put documentation first, not +last, so that someone reading the series can do a clean-room evaluation +of the documentation, then validate that the code matched the +documentation. A commit message that mentions "Also, ..." is often a +good candidate for splitting into multiple patches. For more thoughts on +properly splitting patches and writing good commit messages, see `this +advice from +OpenStack <https://wiki.openstack.org/wiki/GitCommitMessages>`__. + +.. 
_make_code_motion_patches_easy_to_review: + +Make code motion patches easy to review +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a series requires large blocks of code motion, there are tricks for +making the refactoring easier to review. Split up the series so that +semantic changes (or even function renames) are done in a separate patch +from the raw code motion. Use a one-time setup of ``git config +diff.renames true;`` ``git config diff.algorithm patience`` (refer to +`git-config <http://git-scm.com/docs/git-config>`__). The 'diff.renames' +property ensures file rename patches will be given in a more compact +representation that focuses only on the differences across the file +rename, instead of showing the entire old file as a deletion and the new +file as an insertion. Meanwhile, the 'diff.algorithm' property reduces +churn when extracting a non-contiguous subset of one file into a new +file, where all extracted parts occur in the same order both before and +after the patch, by avoiding treating unrelated ``}`` lines in the +original file as separating hunks of changes. + +Ideally, a code motion patch can be reviewed by doing:: + + git format-patch --stdout -1 > patch; + diff -u <(sed -n 's/^-//p' patch) <(sed -n 's/^\+//p' patch) + +to focus on the few changes that weren't wholesale code motion. + +.. _dont_include_irrelevant_changes: + +Don't include irrelevant changes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In particular, don't include formatting, coding style or whitespace +changes to bits of code that would otherwise not be touched by the +patch. (It's OK to fix coding style issues in the immediate area (few +lines) of the lines you're changing.) If you think a section of code +really does need a reindent or other large-scale style fix, submit this +as a separate patch which makes no semantic changes; don't put it in the +same patch as your bug fix. + +For smaller patches in less frequently changed areas of QEMU, consider +using the :ref:`trivial-patches` process. + +.. _write_a_meaningful_commit_message: + +Write a meaningful commit message +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Commit messages should be meaningful and should stand on their own as a +historical record of why the changes you applied were necessary or +useful. + +QEMU follows the usual standard for git commit messages: the first line +(which becomes the email subject line) is "subsystem: single line +summary of change". Whether the "single line summary of change" starts +with a capital is a matter of taste, but we prefer that the summary does +not end in a dot. Look at ``git shortlog -30`` for an idea of sample +subject lines. Then there is a blank line and a more detailed +description of the patch, another blank line, and your Signed-off-by: +line. Please do not use lines that are longer than 76 characters in your +commit message (so that the text still shows up nicely with "git show" +in an 80-column terminal window). + +The body of the commit message is a good place to document why your +change is important. Don't include comments like "This is a suggestion +for fixing this bug" (they can go below the ``---`` line in the email so +they don't go into the final commit message). Make sure the body of the +commit message can be read in isolation even if the reader's mailer +displays the subject line some distance apart (that is, a body that +starts with "... so that" as a continuation of the subject line is +harder to follow). 
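+ +As an entirely illustrative example, a commit message following these conventions might look like:: + + subsys: fix overflow in frobnicate buffer handling + + Explain here why the change is needed and why it is done this way, + not just what it does, with lines wrapped at 76 characters. + + Signed-off-by: Your Name <you@example.com>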
+ +If your patch fixes a commit that is already in the repository, please +add an additional line with "Fixes: <at-least-12-digits-of-SHA-commit-id> +("Fixed commit subject")" below the patch description / before your +"Signed-off-by:" line in the commit message. + +If your patch fixes a bug in the gitlab bug tracker, please add a line +with "Resolves: <URL-of-the-bug>" to the commit message, too. Gitlab can +close bugs automatically once commits with the "Resolves:" keyword get +merged into the master branch of the project. And if your patch addresses +a bug in another public bug tracker, you can also use a line with +"Buglink: <URL-of-the-bug>" for reference here, too. + +Example:: + + Fixes: 14055ce53c2d ("s390x/tcg: avoid overflows in time2tod/tod2time") + Resolves: https://gitlab.com/qemu-project/qemu/-/issues/42 + Buglink: https://bugs.launchpad.net/qemu/+bug/1804323 + +Some other tags that are used in commit messages include "Message-Id:", +"Tested-by:", "Acked-by:", "Reported-by:", "Suggested-by:". See ``git +log`` for these keywords for example usage. + +.. _test_your_patches: + +Test your patches +~~~~~~~~~~~~~~~~~ + +Although QEMU has `continuous integration +services <Testing#Continuous_Integration>`__ that attempt to test +patches submitted to the list, it still saves everyone time if you have +already tested that your patch compiles and works. Because QEMU is such +a large project, it's okay to use configure arguments to limit what is +built for faster turnaround during your development time; but it is +still wise to also check that your patches work with a full build before +submitting a series, especially if your changes might have an unintended +effect on other areas of the code you don't normally experiment with. +See `Testing <Testing>`__ for more details on what tests are available. +Also, it is a wise idea to include a testsuite addition as part of your +patches - either to ensure that future changes won't regress your new +feature, or to add a test which exposes the bug that the rest of your +series fixes. Keeping separate commits for the test and the fix allows +reviewers to rebase the test to occur first to prove it catches the +problem, then again to place it last in the series so that bisection +doesn't land on a known-broken state. + +.. _submitting_your_patches: + +Submitting your Patches +----------------------- + +.. _if_you_cannot_send_patch_emails: + +If you cannot send patch emails +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In rare cases it may not be possible to send properly formatted patch +emails. You can use `sourcehut <https://sourcehut.org/>`__ to send your +patches to the QEMU mailing list by following these steps: + +#. Register or sign in to your account +#. Add your SSH public key in `meta \| + keys <https://meta.sr.ht/keys>`__. +#. Publish your git branch using **git push git@git.sr.ht:~USERNAME/qemu + HEAD** +#. Send your patches to the QEMU mailing list using the web-based + ``git-send-email`` UI at https://git.sr.ht/~USERNAME/qemu/send-email + +`This video +<https://spacepub.space/videos/watch/ad258d23-0ac6-488c-83fc-2bacf578de3a>`__ +shows the web-based ``git-send-email`` workflow. Documentation is +available `here +<https://man.sr.ht/git.sr.ht/#sending-patches-upstream>`__. + +.. _cc_the_relevant_maintainer: + +CC the relevant maintainer +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Send patches both to the mailing list and CC the maintainer(s) of the +files you are modifying. Look in the MAINTAINERS file to find out who +that is. 
Also try using scripts/get_maintainer.pl from the repository +to find out who the most common committers are for the files you touched. + +Example:: + + ~/src/qemu/scripts/get_maintainer.pl -f hw/ide/core.c + +In fact, you can automate this via a one-time setup of ``git config +sendemail.cccmd 'scripts/get_maintainer.pl --nogit-fallback'`` (Refer to +`git-config <http://git-scm.com/docs/git-config>`__.) + +.. _do_not_send_as_an_attachment: + +Do not send as an attachment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Send patches inline so they are easy to reply to with review comments. +Do not put patches in attachments. + +.. _use_git_format_patch: + +Use ``git format-patch`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +Use the right diff format. +`git format-patch <http://git-scm.com/docs/git-format-patch>`__ will +produce patch emails in the right format (check the documentation to +find out how to drive it). You can then edit the cover letter before +using ``git send-email`` to mail the files to the mailing list. (We +recommend `git send-email <http://git-scm.com/docs/git-send-email>`__ +because mail clients often mangle patches by wrapping long lines or +messing up whitespace. Some distributions do not include send-email in a +default install of git; you may need to download additional packages, +such as 'git-email' on Fedora-based systems.) Patch series need a cover +letter, with shallow threading (all patches in the series are +in-reply-to the cover letter, but not to each other); single unrelated +patches do not need a cover letter (but if you do send a cover letter, +use ``--numbered`` so the cover and the patch have distinct subject lines). +Patches are easier to find if they start a new top-level thread, rather +than being buried in-reply-to another existing thread. + +.. _avoid_posting_large_binary_blob: + +Avoid posting large binary blobs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you added binaries to the repository, consider producing the patch +emails using ``git format-patch --no-binary`` and include a link to a +git repository to fetch the original commit. + +.. _patch_emails_must_include_a_signed_off_by_line: + +Patch emails must include a ``Signed-off-by:`` line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For more information see `SubmittingPatches 1.12 +<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__. +This is vital or we will not be able to apply your patch! Please use +your real name to sign a patch (not an alias or acronym). + +If you wrote the patch, make sure your "From:" and "Signed-off-by:" +lines use the same spelling. It's okay if you subscribe or contribute to +the list via more than one address, but using multiple addresses in one +commit just confuses things. If someone else wrote the patch, git will +include a "From:" line in the body of the email (different from your +envelope From:) that will give credit to the correct author; but again, +that author's Signed-off-by: line is mandatory, with the same spelling. + +.. _include_a_meaningful_cover_letter: + +Include a meaningful cover letter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is a requirement for any series with multiple patches (as it aids +continuous integration), but optional for an isolated patch. The cover +letter explains the overall goal of such a series, and also provides a +convenient 0/N email for others to reply to the series as a whole. 
A +one-time setup of ``git config format.coverletter auto`` (refer to +`git-config <http://git-scm.com/docs/git-config>`__) will generate the +cover letter as needed. + +When reviewers don't know your goal at the start of their review, they +may object to early changes that don't make sense until the end of the +series, because they do not have enough context yet at that point of +their review. A series where the goal is unclear also risks a higher +number of review-fix cycles because the reviewers haven't bought into +the idea yet. If the cover letter can explain these points to the +reviewer, the process will be smoother and patches will get merged +faster. Make sure your cover letter includes a diffstat of changes made +over the entire series; potential reviewers know what files they are +interested in, and they need an easy way to determine if your series +touches them. + +.. _use_the_rfc_tag_if_needed: + +Use the RFC tag if needed +~~~~~~~~~~~~~~~~~~~~~~~~~ + +For example, "[PATCH RFC v2]". ``git format-patch --subject-prefix=RFC`` +can help. + +"RFC" means "Request For Comments" and is a statement that you don't +intend for your patchset to be applied to master, but would like some +review on it anyway. Reasons for doing this include: + +- the patch depends on some pending kernel changes which haven't yet + been accepted, so the QEMU patch series is blocked until that + dependency has been dealt with, but is worth reviewing anyway +- the patch set is not finished yet (perhaps it doesn't cover all use + cases or work with all targets) but you want early review of a major + API change or design structure before continuing + +In general, since it's asking other people to do review work on a +patchset that the submitter themselves is saying shouldn't be applied, +it's best to: + +- use it sparingly +- in the cover letter, be clear about why a patch is an RFC, what areas + of the patchset you're looking for review on, and why reviewers + should care + +.. _consider_whether_your_patch_is_applicable_for_stable: + +Consider whether your patch is applicable for stable +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If your patch fixes a severe issue or a regression, it may be applicable +for stable. In that case, consider adding ``Cc: qemu-stable@nongnu.org`` +to your patch to notify the stable maintainers. + +For more details on how QEMU's stable process works, refer to the +:ref:`stable-process` page. + +.. _participating_in_code_review: + +Participating in Code Review +---------------------------- + +All patches submitted to the QEMU project go through a code review +process before they are accepted. Some areas of code that are well +maintained may review patches quickly, while lesser-loved areas of code +may have a longer delay. + +.. _stay_around_to_fix_problems_raised_in_code_review: + +Stay around to fix problems raised in code review +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Not many patches get into QEMU straight away -- it is quite common that +developers will identify bugs, or suggest a cleaner approach, or even +just point out code style issues or commit message typos. You'll need to +respond to these, and then send a second version of your patches with +the issues fixed. This takes a little time and effort on your part, but +if you don't do it then your changes will never get into QEMU. 
It's also +just polite -- it is quite disheartening for a developer to spend time +reviewing your code and suggesting improvements, only to find that +you're not going to do anything further and it was all wasted effort. + +When replying to comments on your patches, **reply to all and not just +the sender** -- keeping discussion on the mailing list means everybody +can follow it. + +.. _pay_attention_to_review_comments: + +Pay attention to review comments +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Someone took their time to review your work, and it pays to respect that +effort; repeatedly submitting a series without addressing all comments +from the previous round tends to alienate reviewers and stall your +patch. Reviewers aren't always perfect, so it is okay if you want to +argue that your code was correct in the first place instead of blindly +doing everything the reviewer asked. On the other hand, if someone +pointed out a potential issue during review, then even if your code +turns out to be correct, it's probably a sign that you should improve +your commit message and/or comments in the code explaining why the code +is correct. + +If you fix issues that are raised during review, **resend the entire +patch series**, not just the one patch that was changed. This allows +maintainers to easily apply the fixed series without having to manually +identify which patches are relevant. Send the new version as a complete +fresh email or series of emails -- don't try to make it a followup to +version 1. (This helps automatic patch email handling tools distinguish +between v1 and v2 emails.) + +.. _when_resending_patches_add_a_version_tag: + +When resending patches add a version tag +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All patches beyond the first version should include a version tag -- for +example, "[PATCH v2]". This means people can easily identify whether +they're looking at the most recent version. (The first version of a +patch need not say "v1", just [PATCH] is sufficient.) For patch series, +the version applies to the whole series -- even if you only change one +patch, you resend the entire series and mark it as "v2". Don't try to +track versions of different patches in the series separately. `git +format-patch <http://git-scm.com/docs/git-format-patch>`__ and `git +send-email <http://git-scm.com/docs/git-send-email>`__ both understand +the ``-v2`` option to make this easier. Send each new revision as a new +top-level thread, rather than burying it in-reply-to an earlier +revision, as many reviewers are not looking inside deep threads for new +patches. + +.. _include_version_history_in_patchset_revisions: + +Include version history in patchset revisions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For later versions of patches, include a summary of changes from +previous versions, but not in the commit message itself. In an email +formatted as a git patch, the commit message is the part above the ``---`` +line, and this will go into the git changelog when the patch is +committed. This part should be a self-contained description of what this +version of the patch does, written to make sense to anybody who comes +back to look at this commit in git in six months' time. The part below +the ``---`` line and above the patch proper (git format-patch puts the +diffstat here) is a good place to put remarks for people reading the +patch email, and this is where the "changes since previous version" +summary belongs. 
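+ +For example, the top of a v2 patch email might be laid out like this (abridged and purely illustrative):: + + [PATCH v2] subsys: fix frobnication of the widget + + Commit message that will make sense on its own in six months' time. + + Signed-off-by: Your Name <you@example.com> + --- + v2: renamed the helper and fixed the error path + + hw/subsys/widget.c | 4 ++-- + 1 file changed, 2 insertions(+), 2 deletions(-)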
The `git-publish +<https://github.com/stefanha/git-publish>`__ script can help with +tracking a good summary across versions. Also, the `git-backport-diff +<https://github.com/codyprime/git-scripts>`__ script can help focus +reviewers on what changed between revisions. + +.. _tips_and_tricks: + +Tips and Tricks +--------------- + +.. _proper_use_of_reviewed_by_tags_can_aid_review: + +Proper use of Reviewed-by: tags can aid review +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When reviewing a large series, a reviewer can reply to some of the +patches with a Reviewed-by tag, stating that they are happy with that +patch in isolation (sometimes conditional on minor cleanup, like fixing +whitespace, that doesn't affect code content). You should then update +those commit messages by hand to include the Reviewed-by tag, so that in +the next revision, reviewers can spot which patches were already clean +from the previous round. Conversely, if you significantly modify a patch +that was previously reviewed, remove the Reviewed-by tag from the +commit message, as well as listing the changes from the previous +version, to make it easier to focus a reviewer's attention on your +changes. + +.. _if_your_patch_seems_to_have_been_ignored: + +If your patch seems to have been ignored +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If your patchset has received no replies you should "ping" it after a +week or two, by sending an email as a reply-to-all to the patch mail, +including the word "ping" and ideally also a link to the page for the +patch on `patchew <https://patchew.org/QEMU/>`__ or +`lore.kernel.org <https://lore.kernel.org/qemu-devel/>`__. It's worth +double-checking for reasons why your patch might have been ignored +(forgot to CC the maintainer? annoyed people by failing to respond to +review comments on an earlier version?), but often, for less-maintained +areas of QEMU, patches do just slip through the cracks. If your ping is +also ignored, ping again after another week or so. As the submitter, you +are the person with the most motivation to get your patch applied, so +you have to be persistent. + +.. _is_my_patch_in: + +Is my patch in? +~~~~~~~~~~~~~~~ + +QEMU has some Continuous Integration machines that try to catch patch +submission problems as soon as possible. `patchew +<http://patchew.org/QEMU/>`__ includes a web interface for tracking the +status of various threads that have been posted to the list, and may +send you an automated mail if it detected a problem with your patch. + +Once your patch has had enough review on list, the maintainer for that +area of code will send notification to the list that they are including +your patch in a particular staging branch. Periodically, the maintainer +then takes care of :ref:`submitting-a-pull-request` +for aggregating topic branches into mainline QEMU. Generally, you do not +need to send a pull request unless you have contributed enough patches +to become a maintainer over a particular section of code. Maintainers +may further modify your commit, by resolving simple merge conflicts or +fixing minor typos pointed out during review, but will always add a +Signed-off-by line in addition to yours, indicating that it went through +their tree. Occasionally, the maintainer's pull request may hit more +difficult merge conflicts, where you may be requested to help rebase and +resolve the problems. 
It may take a couple of weeks between when your +patch first had a positive review to when it finally lands in qemu.git; +release cycle freezes may extend that time even longer. + +.. _return_the_favor: + +Return the favor +~~~~~~~~~~~~~~~~ + +Peer review only works if everyone chips in a bit of review time. If +everyone submitted more patches than they reviewed, we would have a +patch backlog. A good goal is to try to review at least as many patches +from others as what you submit. Don't worry if you don't know the code +base as well as a maintainer; it's perfectly fine to admit when your +review is weak because you are unfamiliar with the code. diff --git a/docs/devel/submitting-a-pull-request.rst b/docs/devel/submitting-a-pull-request.rst new file mode 100644 index 000000000..c9d1e8afd --- /dev/null +++ b/docs/devel/submitting-a-pull-request.rst @@ -0,0 +1,77 @@ +.. _submitting-a-pull-request: + +Submitting a Pull Request +========================= + +QEMU welcomes contributions of code, but we generally expect these to be +sent as simple patch emails to the mailing list (see our page on +:ref:`submitting-a-patch` +for more details). Generally only existing submaintainers of a tree +will need to submit pull requests, although occasionally for a large +patch series we might ask a submitter to send a pull request. This page +documents our recommendations on pull requests for those people. + +A good rule of thumb is not to send a pull request unless somebody asks +you to. + +**Resend the patches with the pull request** as emails which are +threaded as follow-ups to the pull request itself. The simplest way to +do this is to use ``git format-patch --cover-letter`` to create the +emails, and then edit the cover letter to include the pull request +details that ``git request-pull`` outputs. + +**Use PULL as the subject line tag** in both the cover letter and the +retransmitted patch mails (for example, by using +``--subject-prefix=PULL`` in your ``git format-patch`` command). This +helps people to filter in or out the resulting emails (especially useful +if they are only CC'd on one email out of the set). + +**Each patch must have your own Signed-off-by: line** as well as that of +the original author if the patch was not written by you. This is because +with a pull request you're now indicating that the patch has passed via +you rather than directly from the original author. + +**Don't forget to add Reviewed-by: and Acked-by: lines**. When other +people have reviewed the patches you're putting in the pull request, +make sure you've copied their signoffs across. (If you use the `patches +tool <https://github.com/stefanha/patches>`__ to add patches from email +directly to your git repo it will include the tags automatically; if +you're updating patches manually or in some other way you'll need to +edit the commit messages by hand.) + +**Don't send pull requests for code that hasn't passed review**. A pull +request says these patches are ready to go into QEMU now, so they must +have passed the standard code review processes. In particular if you've +corrected issues in one round of code review, you need to send your +fixed patch series as normal to the list; you can't put it in a pull +request until it's gone through. (Extremely trivial fixes may be OK to +just fix in passing, but if in doubt err on the side of not.) + +**Test before sending**. 
This is an obvious thing to say, but make sure +everything builds (including that it compiles at each step of the patch +series) and that "make check" passes before sending out the pull +request. As a submaintainer you're one of QEMU's lines of defense +against bad code, so double check the details. + +**All pull requests must be signed**. If your key is not already signed +by members of the QEMU community, you should make arrangements to attend +a `KeySigningParty <https://wiki.qemu.org/KeySigningParty>`__ (for +example at KVM Forum) or make alternative arrangements to have your key +signed by an attendee. Key signing requires meeting another community +member \*in person\*, so please make appropriate arrangements. By +"signed" here we mean that the pullreq email should quote a tag which is +a GPG-signed tag (as created with 'git tag -s ...'). + +**Pull requests not for master should say "not for master" and have +"PULL SUBSYSTEM whatever" in the subject tag**. If your pull request is +targeting a stable branch or some submaintainer tree, please include the +string "not for master" in the cover letter email, and make sure the +subject tag is "PULL SUBSYSTEM s390/block/whatever" rather than just +"PULL". This allows it to be automatically filtered out of the set of +pull requests that should be applied to master. + +You might be interested in the `make-pullreq +<https://git.linaro.org/people/peter.maydell/misc-scripts.git/tree/make-pullreq>`__ +script which automates some of this process for you and includes a few +sanity checks. Note that you must edit it to configure it suitably for +your local situation! diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst new file mode 100644 index 000000000..50c8e8dab --- /dev/null +++ b/docs/devel/tcg-icount.rst @@ -0,0 +1,94 @@ +.. + Copyright (c) 2020, Linaro Limited + Written by Alex Bennée + + +======================== +TCG Instruction Counting +======================== + +TCG has long supported a feature known as icount which allows for +instruction counting during execution. This should not be confused +with cycle accurate emulation - QEMU does not attempt to emulate how +long an instruction would take on real hardware. That is a job for +other more detailed (and slower) tools that simulate the rest of a +micro-architecture. + +This feature is only available for system emulation and is +incompatible with multi-threaded TCG. It can be used to better align +execution time with wall-clock time so a "slow" device doesn't run too +fast on modern hardware. It can also provide a degree of +deterministic execution and is an essential part of the record/replay +support in QEMU. + +Core Concepts +============= + +At its heart icount is simply a count of executed instructions which +is stored in the TimersState of QEMU's timer sub-system. The number of +executed instructions can then be used to calculate QEMU_CLOCK_VIRTUAL +which represents the amount of elapsed time in the system since +execution started. Depending on the icount mode this may either be a +fixed number of ns per instruction or adjusted as execution continues +to keep wall clock time and virtual time in sync. + +To be able to calculate the number of executed instructions the +translator starts by allocating a budget of instructions to be +executed. The budget of instructions is limited by how long it will be +until the next timer expires. We store this budget as part of the +vCPU's icount_decr field, which is shared with the machinery for +handling cpu_exit(). 
The whole field is checked at the start of every +translated block and will cause a return to the outer loop to deal +with whatever caused the exit. + +In the case of icount, before the flag is checked we subtract the +number of instructions the translation block would execute. If this +would cause the instruction budget to go negative we exit the main +loop and regenerate a new translation block with exactly the right +number of instructions to take the budget to 0, meaning whatever timer +was due to expire will expire exactly when we exit the main run loop. + +Dealing with MMIO +----------------- + +While we can adjust the instruction budget for known events like timer +expiry we cannot do the same for MMIO. Every load/store we execute +might potentially trigger an I/O event, at which point we will need an +up-to-date and accurate reading of the icount number. + +To deal with this case, when an I/O access is made we: + + - restore un-executed instructions to the icount budget + - re-compile a single [1]_ instruction block for the current PC + - exit the cpu loop and execute the re-compiled block + +The new block is created with the CF_LAST_IO compile flag which +ensures the final instruction translation starts with a call to +gen_io_start() so we don't enter a perpetual loop constantly +recompiling a single instruction block. For translators using the +common translator_loop this is done automatically. + +.. [1] sometimes two instructions if dealing with delay slots + +Other I/O operations +-------------------- + +MMIO isn't the only type of operation for which we might need a +correct and accurate clock. IO port instructions and accesses to +system registers are the common examples here. These instructions have +to be handled by the individual translators which have the knowledge +of which operations are I/O operations. + +When the translator is handling an instruction of this kind: + +* it must call gen_io_start() if icount is enabled, at some + point before the generation of the code which actually does + the I/O, using a code fragment similar to: + +.. code:: c + + if (tb_cflags(s->base.tb) & CF_USE_ICOUNT) { + gen_io_start(); + } + +* it must end the TB immediately after this instruction diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst new file mode 100644 index 000000000..f93ef4fe5 --- /dev/null +++ b/docs/devel/tcg-plugins.rst @@ -0,0 +1,438 @@ +.. + Copyright (C) 2017, Emilio G. Cota <cota@braap.org> + Copyright (c) 2019, Linaro Limited + Written by Emilio Cota and Alex Bennée + +QEMU TCG Plugins +================ + +QEMU TCG plugins provide a way for users to run experiments taking +advantage of the total system control emulation can have over a guest. +It provides a mechanism for plugins to subscribe to events during +translation and execution, and optionally to call back into the plugin +during these events. TCG plugins are unable to change the system state; +they can only monitor it passively. However, they can do this down to +individual instruction granularity, including potentially subscribing +to all load and store operations. + +Usage +----- + +Any QEMU binary with TCG support has plugins enabled by default. 
Earlier releases needed to be explicitly enabled with::

  configure --enable-plugins

Once built, a program can be run with multiple plugins loaded, each
with their own arguments::

  $QEMU $OTHER_QEMU_ARGS \
    -plugin tests/plugin/libhowvec.so,inline=on,count=hint \
    -plugin tests/plugin/libhotblocks.so

Arguments are plugin specific and can be used to modify their
behaviour. In this case the howvec plugin is being asked to use inline
ops to count and break down the hint instructions by type.

Writing plugins
---------------

API versioning
~~~~~~~~~~~~~~

This is a new feature for QEMU and it does allow people to develop
out-of-tree plugins that can be dynamically linked into a running QEMU
process. However the project reserves the right to change or break the
API should it need to do so. The best way to avoid this is to submit
your plugin upstream so it can be updated if/when the API changes.

All plugins need to declare a symbol which exports the plugin API
version they were built against. This can be done simply by::

  QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

The core code will refuse to load a plugin that doesn't export a
``qemu_plugin_version`` symbol or whose plugin version is outside of
QEMU's supported range of API versions.

Additionally the ``qemu_info_t`` structure which is passed to the
``qemu_plugin_install`` method of a plugin will detail the minimum and
current API versions supported by QEMU. The API version will be
incremented if new APIs are added. The minimum API version will be
incremented if existing APIs are changed or removed.

Lifetime of the query handle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each callback provides an opaque anonymous information handle which
can usually be further queried to find out information about a
translation, instruction or operation. The handles themselves are only
valid during the lifetime of the callback, so it is important that any
information that is needed is extracted during the callback and saved
by the plugin.

Plugin life cycle
~~~~~~~~~~~~~~~~~

First the plugin is loaded and the public qemu_plugin_install function
is called. The plugin will then register callbacks for various plugin
events. Generally plugins will register a handler for the *atexit*
event if they want to dump a summary of collected information once the
program/system has finished running.

When a registered event occurs the plugin callback is invoked. The
callbacks may provide additional information. In the case of a
translation event the plugin has an option to enumerate the
instructions in a block of instructions and optionally register
callbacks for some or all instructions when they are executed.

There is also a facility to add an inline event where code to
increment a counter can be directly inlined with the translation.
Currently only a simple increment is supported. This is not atomic so
can miss counts. If you want absolute precision you should use a
callback which can then ensure atomicity itself.

Finally when QEMU exits all the registered *atexit* callbacks are
invoked.

Exposure of QEMU internals
~~~~~~~~~~~~~~~~~~~~~~~~~~

The plugin architecture actively avoids leaking implementation details
about how QEMU's translation works to the plugins. While there are
concepts such as translation time and translation blocks, the details
are opaque to plugins.
The plugin is able to query select
details of instructions and system configuration only through the
exported *qemu_plugin* functions.

API
~~~

.. kernel-doc:: include/qemu/qemu-plugin.h

Internals
---------

Locking
~~~~~~~

We have to ensure we cannot deadlock, particularly under MTTCG. For
this we acquire a lock when called from plugin code. We also keep the
list of callbacks under RCU so that we do not have to hold the lock
when calling the callbacks. This is also for performance, since some
callbacks (e.g. memory access callbacks) might be called very
frequently.

* A consequence of this is that we keep our own list of CPUs, so that
  we do not have to worry about locking order wrt cpu_list_lock.
* Use a recursive lock, since we can get registration calls from
  callbacks.

As a result, registering/unregistering callbacks is "slow", since it
takes a lock. But this is very infrequent; we want performance when
calling (or not calling) callbacks, not when registering them. Using
RCU is great for this.

We support the uninstallation of a plugin at any time (e.g. from
plugin callbacks). This allows plugins to remove themselves if they no
longer want to instrument the code. This operation is asynchronous,
which means callbacks may still occur after the uninstall operation is
requested. The plugin isn't completely uninstalled until the safe work
has executed while all vCPUs are quiescent.

Example Plugins
---------------

There are a number of plugins included with QEMU and you are
encouraged to contribute your own plugins upstream. There is a
``contrib/plugins`` directory where they can go.

- tests/plugins

These are some basic plugins that are used to test and exercise the
API during the ``make check-tcg`` target.

- contrib/plugins/hotblocks.c

The hotblocks plugin allows you to examine where the hot paths of
execution are in your program. Once the program has finished you will
get a sorted list of blocks reporting the starting PC, translation
count, number of instructions and execution count. This will work best
with linux-user execution as system emulation tends to generate
re-translations as blocks from different programs get swapped in and
out of system memory.

If your program is single-threaded you can use the ``inline`` option for
slightly faster (but not thread safe) counters.

Example::

  ./aarch64-linux-user/qemu-aarch64 \
    -plugin contrib/plugins/libhotblocks.so -d plugin \
    ./tests/tcg/aarch64-linux-user/sha1
  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
  collected 903 entries in the hash table
    pc, tcount, icount, ecount
    0x0000000041ed10, 1, 5, 66087
    0x000000004002b0, 1, 4, 66087
  ...

- contrib/plugins/hotpages.c

Similar to hotblocks but this time tracks memory accesses::

  ./aarch64-linux-user/qemu-aarch64 \
    -plugin contrib/plugins/libhotpages.so -d plugin \
    ./tests/tcg/aarch64-linux-user/sha1
  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
  Addr, RCPUs, Reads, WCPUs, Writes
  0x000055007fe000, 0x0001, 31747952, 0x0001, 8835161
  0x000055007ff000, 0x0001, 29001054, 0x0001, 8780625
  0x00005500800000, 0x0001, 687465, 0x0001, 335857
  0x0000000048b000, 0x0001, 130594, 0x0001, 355
  0x0000000048a000, 0x0001, 1826, 0x0001, 11

The hotpages plugin can be configured using the following arguments:

* sortby=reads|writes|address

  Log the data sorted by either the number of reads, the number of
  writes, or memory address.
(Default: entries are sorted by the sum of reads and writes) + + * io=on + + Track IO addresses. Only relevant to full system emulation. (Default: off) + + * pagesize=N + + The page size used. (Default: N = 4096) + +- contrib/plugins/howvec.c + +This is an instruction classifier so can be used to count different +types of instructions. It has a number of options to refine which get +counted. You can give a value to the ``count`` argument for a class of +instructions to break it down fully, so for example to see all the system +registers accesses:: + + ./aarch64-softmmu/qemu-system-aarch64 $(QEMU_ARGS) \ + -append "root=/dev/sda2 systemd.unit=benchmark.service" \ + -smp 4 -plugin ./contrib/plugins/libhowvec.so,count=sreg -d plugin + +which will lead to a sorted list after the class breakdown:: + + Instruction Classes: + Class: UDEF not counted + Class: SVE (68 hits) + Class: PCrel addr (47789483 hits) + Class: Add/Sub (imm) (192817388 hits) + Class: Logical (imm) (93852565 hits) + Class: Move Wide (imm) (76398116 hits) + Class: Bitfield (44706084 hits) + Class: Extract (5499257 hits) + Class: Cond Branch (imm) (147202932 hits) + Class: Exception Gen (193581 hits) + Class: NOP not counted + Class: Hints (6652291 hits) + Class: Barriers (8001661 hits) + Class: PSTATE (1801695 hits) + Class: System Insn (6385349 hits) + Class: System Reg counted individually + Class: Branch (reg) (69497127 hits) + Class: Branch (imm) (84393665 hits) + Class: Cmp & Branch (110929659 hits) + Class: Tst & Branch (44681442 hits) + Class: AdvSimd ldstmult (736 hits) + Class: ldst excl (9098783 hits) + Class: Load Reg (lit) (87189424 hits) + Class: ldst noalloc pair (3264433 hits) + Class: ldst pair (412526434 hits) + Class: ldst reg (imm) (314734576 hits) + Class: Loads & Stores (2117774 hits) + Class: Data Proc Reg (223519077 hits) + Class: Scalar FP (31657954 hits) + Individual Instructions: + Instr: mrs x0, sp_el0 (2682661 hits) (op=0xd5384100/ System Reg) + Instr: mrs x1, tpidr_el2 (1789339 hits) (op=0xd53cd041/ System Reg) + Instr: mrs x2, tpidr_el2 (1513494 hits) (op=0xd53cd042/ System Reg) + Instr: mrs x0, tpidr_el2 (1490823 hits) (op=0xd53cd040/ System Reg) + Instr: mrs x1, sp_el0 (933793 hits) (op=0xd5384101/ System Reg) + Instr: mrs x2, sp_el0 (699516 hits) (op=0xd5384102/ System Reg) + Instr: mrs x4, tpidr_el2 (528437 hits) (op=0xd53cd044/ System Reg) + Instr: mrs x30, ttbr1_el1 (480776 hits) (op=0xd538203e/ System Reg) + Instr: msr ttbr1_el1, x30 (480713 hits) (op=0xd518203e/ System Reg) + Instr: msr vbar_el1, x30 (480671 hits) (op=0xd518c01e/ System Reg) + ... + +To find the argument shorthand for the class you need to examine the +source code of the plugin at the moment, specifically the ``*opt`` +argument in the InsnClassExecCount tables. + +- contrib/plugins/lockstep.c + +This is a debugging tool for developers who want to find out when and +where execution diverges after a subtle change to TCG code generation. +It is not an exact science and results are likely to be mixed once +asynchronous events are introduced. While the use of -icount can +introduce determinism to the execution flow it doesn't always follow +the translation sequence will be exactly the same. Typically this is +caused by a timer firing to service the GUI causing a block to end +early. However in some cases it has proved to be useful in pointing +people at roughly where execution diverges. 
The only argument you need
for the plugin is a path for the socket the two instances will
communicate over::

  ./sparc-softmmu/qemu-system-sparc -monitor none -parallel none \
    -net none -M SS-20 -m 256 -kernel day11/zImage.elf \
    -plugin ./contrib/plugins/liblockstep.so,sockpath=lockstep-sparc.sock \
    -d plugin,nochain

which will eventually report::

  qemu-system-sparc: warning: nic lance.0 has no peer
  @ 0x000000ffd06678 vs 0x000000ffd001e0 (2/1 since last)
  @ 0x000000ffd07d9c vs 0x000000ffd06678 (3/1 since last)
  Δ insn_count @ 0x000000ffd07d9c (809900609) vs 0x000000ffd06678 (809900612)
    previously @ 0x000000ffd06678/10 (809900609 insns)
    previously @ 0x000000ffd001e0/4 (809900599 insns)
    previously @ 0x000000ffd080ac/2 (809900595 insns)
    previously @ 0x000000ffd08098/5 (809900593 insns)
    previously @ 0x000000ffd080c0/1 (809900588 insns)

- contrib/plugins/hwprofile.c

The hwprofile tool can only be used with system emulation and allows
the user to see how often each piece of hardware is accessed. It has a
number of options:

* track=read or track=write

  By default the plugin tracks both reads and writes. You can use one
  of these options to limit the tracking to just one class of accesses.

* source

  Includes a detailed breakdown of the guest PCs that made the
  accesses. Not compatible with the pattern option. Example output::

    cirrus-low-memory @ 0xfffffd00000a0000
      pc:fffffc0000005cdc, 1, 256
      pc:fffffc0000005ce8, 1, 256
      pc:fffffc0000005cec, 1, 256

* pattern

  Instead breaks down the accesses based on the offset into the HW
  region. This can be useful for seeing the most used registers of a
  device. Example output::

    pci0-conf @ 0xfffffd01fe000000
      off:00000004, 1, 1
      off:00000010, 1, 3
      off:00000014, 1, 3
      off:00000018, 1, 2
      off:0000001c, 1, 2
      off:00000020, 1, 2
      ...

- contrib/plugins/execlog.c

The execlog tool traces executed instructions together with their
memory accesses. It can be used for debugging and security analysis
purposes. Please be aware that this will generate a lot of output.

The plugin takes no arguments::

  qemu-system-arm $(QEMU_ARGS) \
    -plugin ./contrib/plugins/libexeclog.so -d plugin

which will output an execution trace following this structure::

  # vCPU, vAddr, opcode, disassembly[, load/store, memory addr, device]...
+ 0, 0xa12, 0xf8012400, "movs r4, #0" + 0, 0xa14, 0xf87f42b4, "cmp r4, r6" + 0, 0xa16, 0xd206, "bhs #0xa26" + 0, 0xa18, 0xfff94803, "ldr r0, [pc, #0xc]", load, 0x00010a28, RAM + 0, 0xa1a, 0xf989f000, "bl #0xd30" + 0, 0xd30, 0xfff9b510, "push {r4, lr}", store, 0x20003ee0, RAM, store, 0x20003ee4, RAM + 0, 0xd32, 0xf9893014, "adds r0, #0x14" + 0, 0xd34, 0xf9c8f000, "bl #0x10c8" + 0, 0x10c8, 0xfff96c43, "ldr r3, [r0, #0x44]", load, 0x200000e4, RAM + +- contrib/plugins/cache.c + +Cache modelling plugin that measures the performance of a given L1 cache +configuration, and optionally a unified L2 per-core cache when a given working +set is run:: + + qemu-x86_64 -plugin ./contrib/plugins/libcache.so \ + -d plugin -D cache.log ./tests/tcg/x86_64-linux-user/float_convs + +will report the following:: + + core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate + 0 996695 508 0.0510% 2642799 18617 0.7044% + + address, data misses, instruction + 0x424f1e (_int_malloc), 109, movq %rax, 8(%rcx) + 0x41f395 (_IO_default_xsputn), 49, movb %dl, (%rdi, %rax) + 0x42584d (ptmalloc_init.part.0), 33, movaps %xmm0, (%rax) + 0x454d48 (__tunables_init), 20, cmpb $0, (%r8) + ... + + address, fetch misses, instruction + 0x4160a0 (__vfprintf_internal), 744, movl $1, %ebx + 0x41f0a0 (_IO_setb), 744, endbr64 + 0x415882 (__vfprintf_internal), 744, movq %r12, %rdi + 0x4268a0 (__malloc), 696, andq $0xfffffffffffffff0, %rax + ... + +The plugin has a number of arguments, all of them are optional: + + * limit=N + + Print top N icache and dcache thrashing instructions along with their + address, number of misses, and its disassembly. (default: 32) + + * icachesize=N + * iblksize=B + * iassoc=A + + Instruction cache configuration arguments. They specify the cache size, block + size, and associativity of the instruction cache, respectively. + (default: N = 16384, B = 64, A = 8) + + * dcachesize=N + * dblksize=B + * dassoc=A + + Data cache configuration arguments. They specify the cache size, block size, + and associativity of the data cache, respectively. + (default: N = 16384, B = 64, A = 8) + + * evict=POLICY + + Sets the eviction policy to POLICY. Available policies are: :code:`lru`, + :code:`fifo`, and :code:`rand`. The plugin will use the specified policy for + both instruction and data caches. (default: POLICY = :code:`lru`) + + * cores=N + + Sets the number of cores for which we maintain separate icache and dcache. + (default: for linux-user, N = 1, for full system emulation: N = cores + available to guest) + + * l2=on + + Simulates a unified L2 cache (stores blocks for both instructions and data) + using the default L2 configuration (cache size = 2MB, associativity = 16-way, + block size = 64B). + + * l2cachesize=N + * l2blksize=B + * l2assoc=A + + L2 cache configuration arguments. They specify the cache size, block size, and + associativity of the L2 cache, respectively. Setting any of the L2 + configuration arguments implies ``l2=on``. + (default: N = 2097152 (2MB), B = 64, A = 16) diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst new file mode 100644 index 000000000..a65fb7b1c --- /dev/null +++ b/docs/devel/tcg.rst @@ -0,0 +1,190 @@ +==================== +Translator Internals +==================== + +QEMU is a dynamic translator. When it first encounters a piece of code, +it converts it to the host instruction set. Usually dynamic translators +are very complicated and highly CPU dependent. 
QEMU uses some tricks
which make it relatively simple and portable while achieving good
performance.

QEMU's dynamic translation backend is called TCG, for "Tiny Code
Generator". For more information, please take a look at ``tcg/README``.

The following sections outline some notable features and implementation
details of QEMU's dynamic translator.

CPU state optimisations
-----------------------

The target CPUs have many internal states which change the way they
evaluate instructions. In order to achieve a good speed, the
translation phase assumes that some state information of the virtual
CPU cannot change within a translation block. The state is recorded in
the Translation Block (TB). If the state changes (e.g. privilege
level), a new TB will be generated and the previous TB won't be used
anymore until the state matches the state recorded in the previous TB.
The same idea can be applied to other aspects of the CPU state. For
example, on x86, if the SS, DS and ES segments have a zero base, then
the translator does not even generate an addition for the segment base.

Direct block chaining
---------------------

After each translated basic block is executed, QEMU uses the simulated
Program Counter (PC) and other CPU state information (such as the CS
segment base value) to find the next basic block.

In its simplest, least optimized form, this is done by exiting from the
current TB, going through the TB epilogue, and then back to the
main loop. That’s where QEMU looks for the next TB to execute,
translating it from the guest architecture if it isn’t already available
in memory. Then QEMU proceeds to execute this next TB, starting at the
prologue and then moving on to the translated instructions.

Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
callback to be re-evaluated before executing additional instructions.
It is mandatory to exit this way after any CPU state changes that may
unmask interrupts.

In order to accelerate the cases where the TB for the new
simulated PC is already available, QEMU has mechanisms that allow
multiple TBs to be chained directly, without having to go back to the
main loop as described above. These mechanisms are:

``lookup_and_goto_ptr``
^^^^^^^^^^^^^^^^^^^^^^^

Calling ``tcg_gen_lookup_and_goto_ptr()`` will emit a call to
``helper_lookup_tb_ptr``. This helper will look for an existing TB that
matches the current CPU state. If the destination TB is available, its
code address is returned; otherwise the address of the JIT epilogue is
returned. The call to the helper is always followed by the tcg
``goto_ptr`` opcode, which branches to the returned address. In this
way, we either branch to the next TB or return to the main loop.

``goto_tb + exit_tb``
^^^^^^^^^^^^^^^^^^^^^

The translation code usually implements branching by performing the
following steps:

1. Call ``tcg_gen_goto_tb()`` passing a jump slot index (either 0 or 1)
   as a parameter.

2. Emit TCG instructions to update the CPU state with any information
   that has been assumed constant and is required by the main loop to
   correctly locate and execute the next TB. For most guests, this is
   just the PC of the branch destination, but others may store additional
   data. The information updated in this step must be inferable from both
   ``cpu_get_tb_cpu_state()`` and ``cpu_restore_state()``.

3. Call ``tcg_gen_exit_tb()`` passing the address of the current TB and
   the jump slot index again.
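As a concrete illustration, a target's branch helper typically wraps
these steps in something like the sketch below; ``use_goto_tb()`` and
``gen_set_pc()`` stand in for target-specific helpers and are
illustrative rather than actual common code:

.. code:: c

   static void gen_jump(DisasContext *s, int slot, target_ulong dest)
   {
       if (use_goto_tb(s, dest)) {
           tcg_gen_goto_tb(slot);               /* step 1 */
           gen_set_pc(s, dest);                 /* step 2 */
           tcg_gen_exit_tb(s->base.tb, slot);   /* step 3 */
       } else {
           /* indirect or cross-page branch: fall back to the
            * lookup_and_goto_ptr mechanism described above */
           gen_set_pc(s, dest);
           tcg_gen_lookup_and_goto_ptr();
       }
   }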
+ +Step 1, ``tcg_gen_goto_tb()``, will emit a ``goto_tb`` TCG +instruction that later on gets translated to a jump to an address +associated with the specified jump slot. Initially, this is the address +of step 2's instructions, which update the CPU state information. Step 3, +``tcg_gen_exit_tb()``, exits from the current TB returning a tagged +pointer composed of the last executed TB’s address and the jump slot +index. + +The first time this whole sequence is executed, step 1 simply jumps +to step 2. Then the CPU state information gets updated and we exit from +the current TB. As a result, the behavior is very similar to the less +optimized form described earlier in this section. + +Next, the main loop looks for the next TB to execute using the +current CPU state information (creating the TB if it wasn’t already +available) and, before starting to execute the new TB’s instructions, +patches the previously executed TB by associating one of its jump +slots (the one specified in the call to ``tcg_gen_exit_tb()``) with the +address of the new TB. + +The next time this previous TB is executed and we get to that same +``goto_tb`` step, it will already be patched (assuming the destination TB +is still in memory) and will jump directly to the first instruction of +the destination TB, without going back to the main loop. + +For the ``goto_tb + exit_tb`` mechanism to be used, the following +conditions need to be satisfied: + +* The change in CPU state must be constant, e.g., a direct branch and + not an indirect branch. + +* The direct branch cannot cross a page boundary. Memory mappings + may change, causing the code at the destination address to change. + +Note that, on step 3 (``tcg_gen_exit_tb()``), in addition to the +jump slot index, the address of the TB just executed is also returned. +This address corresponds to the TB that will be patched; it may be +different than the one that was directly executed from the main loop +if the latter had already been chained to other TBs. + +Self-modifying code and translated code invalidation +---------------------------------------------------- + +Self-modifying code is a special challenge in x86 emulation because no +instruction cache invalidation is signaled by the application when code +is modified. + +User-mode emulation marks a host page as write-protected (if it is +not already read-only) every time translated code is generated for a +basic block. Then, if a write access is done to the page, Linux raises +a SEGV signal. QEMU then invalidates all the translated code in the page +and enables write accesses to the page. For system emulation, write +protection is achieved through the software MMU. + +Correct translated code invalidation is done efficiently by maintaining +a linked list of every translated block contained in a given page. Other +linked lists are also maintained to undo direct block chaining. + +On RISC targets, correctly written software uses memory barriers and +cache flushes, so some of the protection above would not be +necessary. However, QEMU still requires that the generated code always +matches the target instructions in memory in order to handle +exceptions correctly. + +Exception support +----------------- + +longjmp() is used when an exception such as division by zero is +encountered. + +The host SIGSEGV and SIGBUS signal handlers are used to get invalid +memory accesses. 
QEMU keeps a map from host program counter to +target program counter, and looks up where the exception happened +based on the host program counter at the exception point. + +On some targets, some bits of the virtual CPU's state are not flushed to the +memory until the end of the translation block. This is done for internal +emulation state that is rarely accessed directly by the program and/or changes +very often throughout the execution of a translation block---this includes +condition codes on x86, delay slots on SPARC, conditional execution on +Arm, and so on. This state is stored for each target instruction, and +looked up on exceptions. + +MMU emulation +------------- + +For system emulation QEMU uses a software MMU. In that mode, the MMU +virtual to physical address translation is done at every memory +access. + +QEMU uses an address translation cache (TLB) to speed up the translation. +In order to avoid flushing the translated code each time the MMU +mappings change, all caches in QEMU are physically indexed. This +means that each basic block is indexed with its physical address. + +In order to avoid invalidating the basic block chain when MMU mappings +change, chaining is only performed when the destination of the jump +shares a page with the basic block that is performing the jump. + +The MMU can also distinguish RAM and ROM memory areas from MMIO memory +areas. Access is faster for RAM and ROM because the translation cache also +hosts the offset between guest address and host memory. Accessing MMIO +memory areas instead calls out to C code for device emulation. +Finally, the MMU helps tracking dirty pages and pages pointed to by +translation blocks. + diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst new file mode 100644 index 000000000..755343c7d --- /dev/null +++ b/docs/devel/testing.rst @@ -0,0 +1,1309 @@ +Testing in QEMU +=============== + +This document describes the testing infrastructure in QEMU. + +Testing with "make check" +------------------------- + +The "make check" testing family includes most of the C based tests in QEMU. For +a quick help, run ``make check-help`` from the source tree. + +The usual way to run these tests is: + +.. code:: + + make check + +which includes QAPI schema tests, unit tests, QTests and some iotests. +Different sub-types of "make check" tests will be explained below. + +Before running tests, it is best to build QEMU programs first. Some tests +expect the executables to exist and will fail with obscure messages if they +cannot find them. + +Unit tests +~~~~~~~~~~ + +Unit tests, which can be invoked with ``make check-unit``, are simple C tests +that typically link to individual QEMU object files and exercise them by +calling exported functions. + +If you are writing new code in QEMU, consider adding a unit test, especially +for utility modules that are relatively stateless or have few dependencies. To +add a new unit test: + +1. Create a new source file. For example, ``tests/unit/foo-test.c``. + +2. Write the test. Normally you would include the header file which exports + the module API, then verify the interface behaves as expected from your + test. The test code should be organized with the glib testing framework. + Copying and modifying an existing test is usually a good idea. + +3. Add the test to ``tests/unit/meson.build``. The unit tests are listed in a + dictionary called ``tests``. The values are any additional sources and + dependencies to be linked with the test. 
   For a simple test whose source is in ``tests/unit/foo-test.c``, it is
   enough to add an entry like::

     {
       ...
       'foo-test': [],
       ...
     }

Since unit tests don't require environment variables, the simplest way to debug
a unit test failure is often directly invoking it or even running it under
``gdb``. However there can still be differences in behavior between ``make``
invocations and your manual run, due to the ``$MALLOC_PERTURB_`` environment
variable (which affects memory reclamation and catches invalid pointers better)
and gtester options. If necessary, you can run

.. code::

  make check-unit V=1

and copy the actual command line which executes the unit test, then run
it from the command line.

QTest
~~~~~

QTest is a device emulation testing framework. It can be very useful to test
device models; it can also control certain aspects of QEMU (such as virtual
clock stepping), via a special purpose "qtest" protocol. Refer to
:doc:`qtest` for more details.

QTest cases can be executed with

.. code::

  make check-qtest

QAPI schema tests
~~~~~~~~~~~~~~~~~

The QAPI schema tests validate the QAPI parser used by QMP, by feeding
predefined input to the parser and comparing the result with the reference
output.

The input/output data is managed under the ``tests/qapi-schema`` directory.
Each test case includes four files that have a common base name:

* ``${casename}.json`` - the file contains the JSON input for feeding the
  parser
* ``${casename}.out`` - the file contains the expected stdout from the parser
* ``${casename}.err`` - the file contains the expected stderr from the parser
* ``${casename}.exit`` - the expected error code

Consider adding a new QAPI schema test when you are making a change to the
QAPI parser (either fixing a bug or extending/modifying the syntax). To do
this:

1. Add four files for the new case as explained above. For example:

   ``$EDITOR tests/qapi-schema/foo.{json,out,err,exit}``.

2. Add the new test in ``tests/Makefile.include``. For example:

   ``qapi-schema += foo.json``

check-block
~~~~~~~~~~~

``make check-block`` runs a subset of the block layer iotests (the tests that
are in the "auto" group).
See the "QEMU iotests" section below for more information.

QEMU iotests
------------

QEMU iotests, under the directory ``tests/qemu-iotests``, is the testing
framework widely used to test block layer related features. It is higher
level than "make check" tests and 99% of the code is written in bash or
Python scripts. The success criterion is golden output comparison, and the
test files are named with numbers.

To run iotests, make sure QEMU is built successfully, then switch to the
``tests/qemu-iotests`` directory under the build directory, and run
``./check`` with desired arguments from there.

By default, the "raw" format and the "file" protocol are used; all tests
will be executed, except the unsupported ones. You can override the format
and protocol with arguments:

.. code::

  # test with qcow2 format
  ./check -qcow2
  # or test a different protocol
  ./check -nbd

It's also possible to list test numbers explicitly:

.. code::

  # run selected cases with qcow2 format
  ./check -qcow2 001 030 153

Cache mode can be selected with the "-c" option, which may help reveal bugs
that are specific to certain cache modes.

More options are supported by the ``./check`` script, run ``./check -h`` for
help.
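These options can be combined; for example, an illustrative invocation
(test number and cache mode chosen here purely as an example) that runs
a single case with the qcow2 format and the ``none`` cache mode:

.. code::

  # from the build tree
  cd tests/qemu-iotests
  ./check -qcow2 -c none 030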
Writing a new test case
~~~~~~~~~~~~~~~~~~~~~~~

Consider writing a test case when you are making any changes to the block
layer. An iotest case is usually the choice for that. There are already many
test cases, so it is possible that extending one of them may achieve the goal
and save the boilerplate of creating one. (Unfortunately, there isn't a 100%
reliable way to find a related one out of hundreds of tests. One approach is
using ``git grep``.)

Usually an iotest case consists of two files. One is an executable that
produces output to stdout and stderr, the other is the expected reference
output. They are given the same number in file names, e.g. test script
``055`` and reference output ``055.out``.

In rare cases, when outputs differ between cache mode ``none`` and others, a
``.out.nocache`` file is added. In other cases, when outputs differ between
image formats, more than one ``.out`` file is created, ending with the
respective format names, e.g. ``178.out.qcow2`` and ``178.out.raw``.

There isn't a hard rule about how to write a test script, but a new test is
usually a (copy and) modification of an existing case. There are a few
commonly used ways to create a test:

* A Bash script. It will make use of several environment variables related
  to the testing procedure, and could source a group of ``common.*`` libraries
  for some common helper routines.

* A Python unittest script. Import ``iotests`` and create a subclass of
  ``iotests.QMPTestCase``, then call the ``iotests.main`` method. The downside
  of this approach is that the output is too scarce, and the script is
  considered harder to debug.

* A simple Python script without using the unittest module. This could also
  import ``iotests`` for launching QEMU and other utilities, but it doesn't
  inherit from ``iotests.QMPTestCase`` and therefore doesn't use the Python
  unittest execution. This is a combination of 1 and 2.

Pick the language per your preference since both Bash and Python have
comparable library support for invoking and interacting with QEMU programs. If
you opt for Python, it is strongly recommended to write Python 3 compatible
code.

Both Python and Bash frameworks in iotests provide helpers to manage test
images. They can be used to create and clean up images under the test
directory. If no I/O or any protocol specific feature is needed, it is often
more convenient to use the pseudo block driver, ``null-co://``, as the test
image, which doesn't require image creation or cleaning up. Avoid system-wide
devices or files whenever possible, such as ``/dev/null`` or ``/dev/zero``.
Otherwise, image locking implications have to be considered. For example,
another application on the host may have locked the file, possibly leading to
a test failure. If using such devices is explicitly desired, consider adding
the ``locking=off`` option to disable image locking.

Debugging a test case
~~~~~~~~~~~~~~~~~~~~~

The following options to the ``check`` script can be useful when debugging
a failing test:

* ``-gdb`` wraps every QEMU invocation in a ``gdbserver``, which waits for a
  connection from a gdb client. The options given to ``gdbserver`` (e.g. the
  address on which to listen for connections) are taken from the
  ``$GDB_OPTIONS`` environment variable. By default (if ``$GDB_OPTIONS`` is
  empty), it listens on ``localhost:12345``.
  It is possible to connect to it for example with
  ``gdb -iex "target remote $addr"``, where ``$addr`` is the address
  ``gdbserver`` listens on.
  If the ``-gdb`` option is not used, ``$GDB_OPTIONS`` is ignored,
  regardless of whether it is set or not.

* ``-valgrind`` attaches a valgrind instance to QEMU. If it detects
  warnings, it will print and save the log in
  ``$TEST_DIR/<valgrind_pid>.valgrind``.
  The final command line will be ``valgrind --log-file=$TEST_DIR/
  <valgrind_pid>.valgrind --error-exitcode=99 $QEMU ...``

* ``-d`` (debug) just increases the logging verbosity, showing
  for example the QMP commands and answers.

* ``-p`` (print) redirects QEMU’s stdout and stderr to the test output,
  instead of saving it into a log file in
  ``$TEST_DIR/qemu-machine-<random_string>``.

Test case groups
~~~~~~~~~~~~~~~~

Tests may belong to one or more test groups, which are defined in the form
of a comment in the test source file. By convention, test groups are listed
in the second line of the test file, after the "#!/..." line, like this:

.. code::

  #!/usr/bin/env python3
  # group: auto quick
  #
  ...

Another way of defining groups is to create the
``tests/qemu-iotests/group.local`` file. This should be used only for
downstream (this file should never appear upstream). This file may be used
for defining some downstream test groups or for temporarily disabling tests,
like this:

.. code::

  # groups for some company downstream process
  #
  # ci - tests to run on build
  # down - our downstream tests, not for upstream
  #
  # Format of each line is:
  # TEST_NAME TEST_GROUP [TEST_GROUP ]...

  013 ci
  210 disabled
  215 disabled
  our-ugly-workaround-test down ci

Note that the following group names have a special meaning:

- quick: Tests in this group should finish within a few seconds.

- auto: Tests in this group are used during "make check" and should be
  runnable in any case. That means they should run with every QEMU binary
  (also non-x86), with every QEMU configuration (i.e. must not fail if
  an optional feature is not compiled in - but reporting a "skip" is ok),
  work at least with the qcow2 file format, work with all kinds of host
  filesystems and users (e.g. "nobody" or "root") and must not take too
  much memory and disk space (since CI pipelines tend to fail otherwise).

- disabled: Tests in this group are disabled and ignored by check.

.. _container-ref:

Container based tests
---------------------

Introduction
~~~~~~~~~~~~

The container testing framework in QEMU utilizes public images to
build and test QEMU in predefined and widely accessible Linux
environments. This makes it possible to expand the test coverage
across distros, toolchain flavors and library versions. The support
was originally written for Docker although we also support Podman as
an alternative container runtime. Although many of the target
names and scripts are prefixed with "docker", the system will
automatically run on whichever is configured.

The container images are also used to augment the generation of tests
for testing TCG. See :ref:`checktcg-ref` for more details.

Docker Prerequisites
~~~~~~~~~~~~~~~~~~~~

Install "docker" with the system package manager and start the Docker service
on your development machine, then make sure you have the privilege to run
Docker commands. Typically it means setting up a passwordless ``sudo docker``
command or logging in as root. For example:

.. code::

  $ sudo yum install docker
  $ # or `apt-get install docker` for Ubuntu, etc.
+ $ sudo systemctl start docker + $ sudo docker ps + +The last command should print an empty table, to verify the system is ready. + +An alternative method to set up permissions is by adding the current user to +"docker" group and making the docker daemon socket file (by default +``/var/run/docker.sock``) accessible to the group: + +.. code:: + + $ sudo groupadd docker + $ sudo usermod $USER -a -G docker + $ sudo chown :docker /var/run/docker.sock + +Note that any one of above configurations makes it possible for the user to +exploit the whole host with Docker bind mounting or other privileged +operations. So only do it on development machines. + +Podman Prerequisites +~~~~~~~~~~~~~~~~~~~~ + +Install "podman" with the system package manager. + +.. code:: + + $ sudo dnf install podman + $ podman ps + +The last command should print an empty table, to verify the system is ready. + +Quickstart +~~~~~~~~~~ + +From source tree, type ``make docker-help`` to see the help. Testing +can be started without configuring or building QEMU (``configure`` and +``make`` are done in the container, with parameters defined by the +make target): + +.. code:: + + make docker-test-build@centos8 + +This will create a container instance using the ``centos8`` image (the image +is downloaded and initialized automatically), in which the ``test-build`` job +is executed. + +Registry +~~~~~~~~ + +The QEMU project has a container registry hosted by GitLab at +``registry.gitlab.com/qemu-project/qemu`` which will automatically be +used to pull in pre-built layers. This avoids unnecessary strain on +the distro archives created by multiple developers running the same +container build steps over and over again. This can be overridden +locally by using the ``NOCACHE`` build option: + +.. code:: + + make docker-image-debian10 NOCACHE=1 + +Images +~~~~~~ + +Along with many other images, the ``centos8`` image is defined in a Dockerfile +in ``tests/docker/dockerfiles/``, called ``centos8.docker``. ``make docker-help`` +command will list all the available images. + +To add a new image, simply create a new ``.docker`` file under the +``tests/docker/dockerfiles/`` directory. + +A ``.pre`` script can be added beside the ``.docker`` file, which will be +executed before building the image under the build context directory. This is +mainly used to do necessary host side setup. One such setup is ``binfmt_misc``, +for example, to make qemu-user powered cross build containers work. + +Tests +~~~~~ + +Different tests are added to cover various configurations to build and test +QEMU. Docker tests are the executables under ``tests/docker`` named +``test-*``. They are typically shell scripts and are built on top of a shell +library, ``tests/docker/common.rc``, which provides helpers to find the QEMU +source and build it. + +The full list of tests is printed in the ``make docker-help`` help. + +Debugging a Docker test failure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When CI tasks, maintainers or yourself report a Docker test failure, follow the +below steps to debug it: + +1. Locally reproduce the failure with the reported command line. E.g. run + ``make docker-test-mingw@fedora J=8``. +2. Add "V=1" to the command line, try again, to see the verbose output. +3. Further add "DEBUG=1" to the command line. This will pause in a shell prompt + in the container right before testing starts. You could either manually + build QEMU and run tests from there, or press Ctrl-D to let the Docker + testing continue. +4. 
If you press Ctrl-D, the same building and testing procedure will begin, and + will hopefully run into the error again. After that, you will be dropped to + the prompt for debug. + +Options +~~~~~~~ + +Various options can be used to affect how Docker tests are done. The full +list is in the ``make docker`` help text. The frequently used ones are: + +* ``V=1``: the same as in top level ``make``. It will be propagated to the + container and enable verbose output. +* ``J=$N``: the number of parallel tasks in make commands in the container, + similar to the ``-j $N`` option in top level ``make``. (The ``-j`` option in + top level ``make`` will not be propagated into the container.) +* ``DEBUG=1``: enables debug. See the previous "Debugging a Docker test + failure" section. + +Thread Sanitizer +---------------- + +Thread Sanitizer (TSan) is a tool which can detect data races. QEMU supports +building and testing with this tool. + +For more information on TSan: + +https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual + +Thread Sanitizer in Docker +~~~~~~~~~~~~~~~~~~~~~~~~~~ +TSan is currently supported in the ubuntu2004 docker. + +The test-tsan test will build using TSan and then run make check. + +.. code:: + + make docker-test-tsan@ubuntu2004 + +TSan warnings under docker are placed in files located at build/tsan/. + +We recommend using DEBUG=1 to allow launching the test from inside the docker, +and to allow review of the warnings generated by TSan. + +Building and Testing with TSan +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to build and test with TSan, with a few additional steps. +These steps are normally done automatically in the docker. + +There is a one time patch needed in clang-9 or clang-10 at this time: + +.. code:: + + sed -i 's/^const/static const/g' \ + /usr/lib/llvm-10/lib/clang/10.0.0/include/sanitizer/tsan_interface.h + +To configure the build for TSan: + +.. code:: + + ../configure --enable-tsan --cc=clang-10 --cxx=clang++-10 \ + --disable-werror --extra-cflags="-O0" + +The runtime behavior of TSAN is controlled by the TSAN_OPTIONS environment +variable. + +More information on the TSAN_OPTIONS can be found here: + +https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags + +For example: + +.. code:: + + export TSAN_OPTIONS=suppressions=<path to qemu>/tests/tsan/suppressions.tsan \ + detect_deadlocks=false history_size=7 exitcode=0 \ + log_path=<build path>/tsan/tsan_warning + +The above exitcode=0 has TSan continue without error if any warnings are found. +This allows for running the test and then checking the warnings afterwards. +If you want TSan to stop and exit with error on warnings, use exitcode=66. + +TSan Suppressions +~~~~~~~~~~~~~~~~~ +Keep in mind that for any data race warning, although there might be a data race +detected by TSan, there might be no actual bug here. TSan provides several +different mechanisms for suppressing warnings. In general it is recommended +to fix the code if possible to eliminate the data race rather than suppress +the warning. + +A few important files for suppressing warnings are: + +tests/tsan/suppressions.tsan - Has TSan warnings we wish to suppress at runtime. +The comment on each suppression will typically indicate why we are +suppressing it. More information on the file format can be found here: + +https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions + +tests/tsan/blacklist.tsan - Has TSan warnings we wish to disable +at compile time for test or debug. 
+Add flags to configure to enable: + +"--extra-cflags=-fsanitize-blacklist=<src path>/tests/tsan/blacklist.tsan" + +More information on the file format can be found here under "Blacklist Format": + +https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags + +TSan Annotations +~~~~~~~~~~~~~~~~ +include/qemu/tsan.h defines annotations. See this file for more descriptions +of the annotations themselves. Annotations can be used to suppress +TSan warnings or give TSan more information so that it can detect proper +relationships between accesses of data. + +Annotation examples can be found here: + +https://github.com/llvm/llvm-project/tree/master/compiler-rt/test/tsan/ + +Good files to start with are: annotate_happens_before.cpp and ignore_race.cpp + +The full set of annotations can be found here: + +https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp + +VM testing +---------- + +This test suite contains scripts that bootstrap various guest images that have +necessary packages to build QEMU. The basic usage is documented in ``Makefile`` +help which is displayed with ``make vm-help``. + +Quickstart +~~~~~~~~~~ + +Run ``make vm-help`` to list available make targets. Invoke a specific make +command to run build test in an image. For example, ``make vm-build-freebsd`` +will build the source tree in the FreeBSD image. The command can be executed +from either the source tree or the build dir; if the former, ``./configure`` is +not needed. The command will then generate the test image in ``./tests/vm/`` +under the working directory. + +Note: images created by the scripts accept a well-known RSA key pair for SSH +access, so they SHOULD NOT be exposed to external interfaces if you are +concerned about attackers taking control of the guest and potentially +exploiting a QEMU security bug to compromise the host. + +QEMU binaries +~~~~~~~~~~~~~ + +By default, ``qemu-system-x86_64`` is searched in $PATH to run the guest. If +there isn't one, or if it is older than 2.10, the test won't work. In this case, +provide the QEMU binary in env var: ``QEMU=/path/to/qemu-2.10+``. + +Likewise the path to ``qemu-img`` can be set in QEMU_IMG environment variable. + +Make jobs +~~~~~~~~~ + +The ``-j$X`` option in the make command line is not propagated into the VM, +specify ``J=$X`` to control the make jobs in the guest. + +Debugging +~~~~~~~~~ + +Add ``DEBUG=1`` and/or ``V=1`` to the make command to allow interactive +debugging and verbose output. If this is not enough, see the next section. +``V=1`` will be propagated down into the make jobs in the guest. + +Manual invocation +~~~~~~~~~~~~~~~~~ + +Each guest script is an executable script with the same command line options. +For example to work with the netbsd guest, use ``$QEMU_SRC/tests/vm/netbsd``: + +.. code:: + + $ cd $QEMU_SRC/tests/vm + + # To bootstrap the image + $ ./netbsd --build-image --image /var/tmp/netbsd.img + <...> + + # To run an arbitrary command in guest (the output will not be echoed unless + # --debug is added) + $ ./netbsd --debug --image /var/tmp/netbsd.img uname -a + + # To build QEMU in guest + $ ./netbsd --debug --image /var/tmp/netbsd.img --build-qemu $QEMU_SRC + + # To get to an interactive shell + $ ./netbsd --interactive --image /var/tmp/netbsd.img sh + +Adding new guests +~~~~~~~~~~~~~~~~~ + +Please look at existing guest scripts for how to add new guests. 
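For orientation, the overall shape of such a script is sketched below;
all the names in it are illustrative (based on the description that
follows), not copied from a real guest script:

.. code:: python

   #!/usr/bin/env python3
   # Hypothetical guest script sketch; see the real scripts under
   # tests/vm for working examples.
   import sys
   import basevm

   class ExampleVM(basevm.BaseVM):
       name = "example"
       arch = "x86_64"
       # Shell script run inside the guest; the QEMU source tree is
       # presented as a tarball on a raw virtio-blk device.
       BUILD_SCRIPT = """
           set -e
           cd $(mktemp -d)
           tar -xf /dev/vdb
           ./configure {configure_opts}
           make --output-sync -j{jobs}
           make check {verbose}
       """

       def build_image(self, img):
           # Download and cache a template image, then set up users,
           # SSH keys and build dependencies (elided here).
           cimg = self._download_with_cache(
               "https://example.org/cloud-image.qcow2")  # illustrative URL
           ...

   if __name__ == "__main__":
       sys.exit(basevm.main(ExampleVM))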
+ +Most importantly, create a subclass of BaseVM and implement ``build_image()`` +method and define ``BUILD_SCRIPT``, then finally call ``basevm.main()`` from +the script's ``main()``. + +* Usually in ``build_image()``, a template image is downloaded from a + predefined URL. ``BaseVM._download_with_cache()`` takes care of the cache and + the checksum, so consider using it. + +* Once the image is downloaded, users, SSH server and QEMU build deps should + be set up: + + - Root password set to ``BaseVM.ROOT_PASS`` + - User ``BaseVM.GUEST_USER`` is created, and password set to + ``BaseVM.GUEST_PASS`` + - SSH service is enabled and started on boot, + ``$QEMU_SRC/tests/keys/id_rsa.pub`` is added to ssh's ``authorized_keys`` + file of both root and the normal user + - DHCP client service is enabled and started on boot, so that it can + automatically configure the virtio-net-pci NIC and communicate with QEMU + user net (10.0.2.2) + - Necessary packages are installed to untar the source tarball and build + QEMU + +* Write a proper ``BUILD_SCRIPT`` template, which should be a shell script that + untars a raw virtio-blk block device, which is the tarball data blob of the + QEMU source tree, then configure/build it. Running "make check" is also + recommended. + +Image fuzzer testing +-------------------- + +An image fuzzer was added to exercise format drivers. Currently only qcow2 is +supported. To start the fuzzer, run + +.. code:: + + tests/image-fuzzer/runner.py -c '[["qemu-img", "info", "$test_img"]]' /tmp/test qcow2 + +Alternatively, some command different from ``qemu-img info`` can be tested, by +changing the ``-c`` option. + +Integration tests using the Avocado Framework +--------------------------------------------- + +The ``tests/avocado`` directory hosts integration tests. They're usually +higher level tests, and may interact with external resources and with +various guest operating systems. + +These tests are written using the Avocado Testing Framework (which must +be installed separately) in conjunction with a the ``avocado_qemu.Test`` +class, implemented at ``tests/avocado/avocado_qemu``. + +Tests based on ``avocado_qemu.Test`` can easily: + + * Customize the command line arguments given to the convenience + ``self.vm`` attribute (a QEMUMachine instance) + + * Interact with the QEMU monitor, send QMP commands and check + their results + + * Interact with the guest OS, using the convenience console device + (which may be useful to assert the effectiveness and correctness of + command line arguments or QMP commands) + + * Interact with external data files that accompany the test itself + (see ``self.get_data()``) + + * Download (and cache) remote data files, such as firmware and kernel + images + + * Have access to a library of guest OS images (by means of the + ``avocado.utils.vmimage`` library) + + * Make use of various other test related utilities available at the + test class itself and at the utility library: + + - http://avocado-framework.readthedocs.io/en/latest/api/test/avocado.html#avocado.Test + - http://avocado-framework.readthedocs.io/en/latest/api/utils/avocado.utils.html + +Running tests +~~~~~~~~~~~~~ + +You can run the avocado tests simply by executing: + +.. code:: + + make check-avocado + +This involves the automatic creation of Python virtual environment +within the build tree (at ``tests/venv``) which will have all the +right dependencies, and will save tests results also within the +build tree (at ``tests/results``). 
+ +Note: the build environment must be using a Python 3 stack, and have +the ``venv`` and ``pip`` packages installed. If necessary, make sure +``configure`` is called with ``--python=`` and that those modules are +available. On Debian and Ubuntu based systems, depending on the +specific version, they may be on packages named ``python3-venv`` and +``python3-pip``. + +It is also possible to run tests based on tags using the +``make check-avocado`` command and the ``AVOCADO_TAGS`` environment +variable: + +.. code:: + + make check-avocado AVOCADO_TAGS=quick + +Note that tags separated with commas have an AND behavior, while tags +separated by spaces have an OR behavior. For more information on Avocado +tags, see: + + https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/tags.html + +To run a single test file, a couple of them, or a test within a file +using the ``make check-avocado`` command, set the ``AVOCADO_TESTS`` +environment variable with the test files or test names. To run all +tests from a single file, use: + + .. code:: + + make check-avocado AVOCADO_TESTS=$FILEPATH + +The same is valid to run tests from multiple test files: + + .. code:: + + make check-avocado AVOCADO_TESTS='$FILEPATH1 $FILEPATH2' + +To run a single test within a file, use: + + .. code:: + + make check-avocado AVOCADO_TESTS=$FILEPATH:$TESTCLASS.$TESTNAME + +The same is valid to run single tests from multiple test files: + + .. code:: + + make check-avocado AVOCADO_TESTS='$FILEPATH1:$TESTCLASS1.$TESTNAME1 $FILEPATH2:$TESTCLASS2.$TESTNAME2' + +The scripts installed inside the virtual environment may be used +without an "activation". For instance, the Avocado test runner +may be invoked by running: + + .. code:: + + tests/venv/bin/avocado run $OPTION1 $OPTION2 tests/avocado/ + +Note that if ``make check-avocado`` was not executed before, it is +possible to create the Python virtual environment with the dependencies +needed running: + + .. code:: + + make check-venv + +It is also possible to run tests from a single file or a single test within +a test file. To run tests from a single file within the build tree, use: + + .. code:: + + tests/venv/bin/avocado run tests/avocado/$TESTFILE + +To run a single test within a test file, use: + + .. code:: + + tests/venv/bin/avocado run tests/avocado/$TESTFILE:$TESTCLASS.$TESTNAME + +Valid test names are visible in the output from any previous execution +of Avocado or ``make check-avocado``, and can also be queried using: + + .. code:: + + tests/venv/bin/avocado list tests/avocado + +Manual Installation +~~~~~~~~~~~~~~~~~~~ + +To manually install Avocado and its dependencies, run: + +.. code:: + + pip install --user avocado-framework + +Alternatively, follow the instructions on this link: + + https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/installing.html + +Overview +~~~~~~~~ + +The ``tests/avocado/avocado_qemu`` directory provides the +``avocado_qemu`` Python module, containing the ``avocado_qemu.Test`` +class. Here's a simple usage example: + +.. code:: + + from avocado_qemu import QemuSystemTest + + + class Version(QemuSystemTest): + """ + :avocado: tags=quick + """ + def test_qmp_human_info_version(self): + self.vm.launch() + res = self.vm.command('human-monitor-command', + command_line='info version') + self.assertRegexpMatches(res, r'^(\d+\.\d+\.\d)') + +To execute your test, run: + +.. 
code:: + + avocado run version.py + +Tests may be classified according to a convention by using docstring +directives such as ``:avocado: tags=TAG1,TAG2``. To run all tests +in the current directory, tagged as "quick", run: + +.. code:: + + avocado run -t quick . + +The ``avocado_qemu.Test`` base test class +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``avocado_qemu.Test`` class has a number of characteristics that +are worth being mentioned right away. + +First of all, it attempts to give each test a ready to use QEMUMachine +instance, available at ``self.vm``. Because many tests will tweak the +QEMU command line, launching the QEMUMachine (by using ``self.vm.launch()``) +is left to the test writer. + +The base test class has also support for tests with more than one +QEMUMachine. The way to get machines is through the ``self.get_vm()`` +method which will return a QEMUMachine instance. The ``self.get_vm()`` +method accepts arguments that will be passed to the QEMUMachine creation +and also an optional ``name`` attribute so you can identify a specific +machine and get it more than once through the tests methods. A simple +and hypothetical example follows: + +.. code:: + + from avocado_qemu import QemuSystemTest + + + class MultipleMachines(QemuSystemTest): + def test_multiple_machines(self): + first_machine = self.get_vm() + second_machine = self.get_vm() + self.get_vm(name='third_machine').launch() + + first_machine.launch() + second_machine.launch() + + first_res = first_machine.command( + 'human-monitor-command', + command_line='info version') + + second_res = second_machine.command( + 'human-monitor-command', + command_line='info version') + + third_res = self.get_vm(name='third_machine').command( + 'human-monitor-command', + command_line='info version') + + self.assertEquals(first_res, second_res, third_res) + +At test "tear down", ``avocado_qemu.Test`` handles all the QEMUMachines +shutdown. + +The ``avocado_qemu.LinuxTest`` base test class +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``avocado_qemu.LinuxTest`` is further specialization of the +``avocado_qemu.Test`` class, so it contains all the characteristics of +the later plus some extra features. + +First of all, this base class is intended for tests that need to +interact with a fully booted and operational Linux guest. At this +time, it uses a Fedora 31 guest image. The most basic example looks +like this: + +.. code:: + + from avocado_qemu import LinuxTest + + + class SomeTest(LinuxTest): + + def test(self): + self.launch_and_wait() + self.ssh_command('some_command_to_be_run_in_the_guest') + +Please refer to tests that use ``avocado_qemu.LinuxTest`` under +``tests/avocado`` for more examples. + +QEMUMachine +~~~~~~~~~~~ + +The QEMUMachine API is already widely used in the Python iotests, +device-crash-test and other Python scripts. It's a wrapper around the +execution of a QEMU binary, giving its users: + + * the ability to set command line arguments to be given to the QEMU + binary + + * a ready to use QMP connection and interface, which can be used to + send commands and inspect its results, as well as asynchronous + events + + * convenience methods to set commonly used command line arguments in + a more succinct and intuitive way + +QEMU binary selection +^^^^^^^^^^^^^^^^^^^^^ + +The QEMU binary used for the ``self.vm`` QEMUMachine instance will +primarily depend on the value of the ``qemu_bin`` parameter. 

QEMU binary selection
^^^^^^^^^^^^^^^^^^^^^

The QEMU binary used for the ``self.vm`` QEMUMachine instance will
primarily depend on the value of the ``qemu_bin`` parameter. If it's
not explicitly set, its default value will be the result of a dynamic
probe in the same source tree. A suitable binary will be one that
targets the architecture matching the host machine.

Based on this description, test writers will usually rely on one of
the following approaches:

1) Set ``qemu_bin``, and use the given binary

2) Do not set ``qemu_bin``, and use a QEMU binary named like
   "qemu-system-${arch}", either in the current
   working directory, or in the current source tree.

The resulting ``qemu_bin`` value will be preserved in the
``avocado_qemu.Test`` as an attribute with the same name.

Attribute reference
~~~~~~~~~~~~~~~~~~~

Test
^^^^

Besides the attributes and methods that are part of the base
``avocado.Test`` class, the following attributes are available on any
``avocado_qemu.Test`` instance.

vm
''

A QEMUMachine instance, initially configured according to the given
``qemu_bin`` parameter.

arch
''''

The architecture can be used on different levels of the stack, e.g. by
the framework or by the test itself. At the framework level, it will
currently influence the selection of a QEMU binary (when one is not
explicitly given).

Tests are also free to use this attribute value, for their own needs.
A test may, for instance, use the same value when selecting the
architecture of a kernel or disk image to boot a VM with.

The ``arch`` attribute will be set to the test parameter of the same
name. If one is not given explicitly, it will either be set to
``None``, or, if the test is tagged with one (and only one)
``:avocado: tags=arch:VALUE`` tag, it will be set to ``VALUE``.

cpu
'''

The cpu model that will be set on all QEMUMachine instances created
by the test.

The ``cpu`` attribute will be set to the test parameter of the same
name. If one is not given explicitly, it will either be set to
``None``, or, if the test is tagged with one (and only one)
``:avocado: tags=cpu:VALUE`` tag, it will be set to ``VALUE``.

machine
'''''''

The machine type that will be set on all QEMUMachine instances created
by the test.

The ``machine`` attribute will be set to the test parameter of the same
name. If one is not given explicitly, it will either be set to
``None``, or, if the test is tagged with one (and only one)
``:avocado: tags=machine:VALUE`` tag, it will be set to ``VALUE``.

qemu_bin
''''''''

The preserved value of the ``qemu_bin`` parameter or the result of the
dynamic probe for a QEMU binary in the current working directory or
source tree.
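
Putting the three tag-based attributes together, a hypothetical test
(the class name, tag values and extra arguments below are made up for
illustration) could look like this:

.. code::

   from avocado_qemu import QemuSystemTest


   class AArch64VirtBoot(QemuSystemTest):
       """
       :avocado: tags=arch:aarch64
       :avocado: tags=machine:virt
       :avocado: tags=cpu:cortex-a57
       """
       def test_boot(self):
           # arch, machine and cpu are picked up from the tags above and
           # applied to QEMUMachine instances such as self.vm
           self.vm.add_args('-nographic')
           self.vm.launch()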

LinuxTest
^^^^^^^^^

Besides the attributes present on the ``avocado_qemu.Test`` base
class, the ``avocado_qemu.LinuxTest`` adds the following attributes:

distro
''''''

The name of the Linux distribution used as the guest image for the
test. The name should match the **Provider** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_version
''''''''''''''

The version of the Linux distribution used as the guest image for the
test. The value should match the **Version** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_checksum
'''''''''''''''

The sha256 hash of the guest image file used for the test.

If this value is not set in the code or by a test parameter (with the
same name), no validation on the integrity of the image will be
performed.

Parameter reference
~~~~~~~~~~~~~~~~~~~

To understand how Avocado parameters are accessed by tests, and how
they can be passed to tests, please refer to::

  https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#accessing-test-parameters

Parameter values can be easily seen in the log files, and will look
like the following:

.. code::

  PARAMS (key=qemu_bin, path=*, default=./qemu-system-x86_64) => './qemu-system-x86_64'

Test
^^^^

arch
''''

The architecture that will influence the selection of a QEMU binary
(when one is not explicitly given).

Tests are also free to use this parameter value, for their own needs.
A test may, for instance, use the same value when selecting the
architecture of a kernel or disk image to boot a VM with.

This parameter has a direct relation with the ``arch`` attribute. If
not given, it will default to None.

cpu
'''

The cpu model that will be set on all QEMUMachine instances created
by the test.

machine
'''''''

The machine type that will be set on all QEMUMachine instances created
by the test.

qemu_bin
''''''''

The exact QEMU binary to be used by QEMUMachine.

LinuxTest
^^^^^^^^^

Besides the parameters present on the ``avocado_qemu.Test`` base
class, the ``avocado_qemu.LinuxTest`` adds the following parameters:

distro
''''''

The name of the Linux distribution used as the guest image for the
test. The name should match the **Provider** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_version
''''''''''''''

The version of the Linux distribution used as the guest image for the
test. The value should match the **Version** column on the list
of images supported by the avocado.utils.vmimage library:

https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images

distro_checksum
'''''''''''''''

The sha256 hash of the guest image file used for the test.

If this value is not set in the code or by this parameter, no
validation on the integrity of the image will be performed.

Skipping tests
~~~~~~~~~~~~~~

The Avocado framework provides Python decorators that make it easy to
skip tests under certain conditions, for example when a binary is
missing from the test system, or when the running environment is a CI
system. For further information about those decorators, please refer
to::

  https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#skipping-tests

While the conditions for skipping tests are often specific to each test,
there are recurring scenarios identified by the QEMU developers, and the
use of environment variables has become the standard way to
enable/disable such tests.

Here is a list of the most used variables:

AVOCADO_ALLOW_LARGE_STORAGE
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Tests which are going to fetch or produce assets considered *large* are
not going to run unless ``AVOCADO_ALLOW_LARGE_STORAGE=1`` is exported in
the environment.

The definition of *large* is a bit arbitrary here, but it usually means
an asset which occupies at least 1GB on disk when uncompressed.
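
A minimal sketch of how a test typically honors this variable follows;
the class and test method names are hypothetical, but the decorator
pattern matches what tests under ``tests/avocado`` do:

.. code::

   import os

   from avocado import skipUnless
   from avocado_qemu import QemuSystemTest


   class BigImageBoot(QemuSystemTest):

       @skipUnless(os.getenv('AVOCADO_ALLOW_LARGE_STORAGE'), 'storage limited')
       def test_boot_big_image(self):
           # a multi-gigabyte asset would be fetched and used here
           self.vm.launch()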

AVOCADO_ALLOW_UNTRUSTED_CODE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are tests which will boot a kernel image or firmware that can be
considered not safe to run on the developer's workstation, thus they are
skipped by default. The definition of *not safe* is also arbitrary, but
usually it means a blob whose source or build process isn't publicly
available.

You should export ``AVOCADO_ALLOW_UNTRUSTED_CODE=1`` in the environment
in order to allow tests which make use of such assets.

AVOCADO_TIMEOUT_EXPECTED
^^^^^^^^^^^^^^^^^^^^^^^^
The Avocado framework has a timeout mechanism which interrupts tests to
keep the test suite from getting stuck. The timeout value can be set via
a test parameter or a property defined in the test class; for further
details, see::

  https://avocado-framework.readthedocs.io/en/latest/guides/writer/chapters/writing.html#setting-a-test-timeout

Even though the timeout can be set by the test developer, there are some
tests that may not have a well-defined time limit under certain
conditions. For example, tests that take longer to execute when QEMU is
compiled with debug flags. Therefore, the ``AVOCADO_TIMEOUT_EXPECTED``
variable has been used to determine whether those tests should run or
not.

GITLAB_CI
^^^^^^^^^
A number of tests are flagged to not run on the GitLab CI, usually
because they proved to be flaky or because there are constraints on the
CI environment which would make them fail. If you encounter a similar
situation, then use that variable as shown in the code snippet below to
skip the test:

.. code::

   @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
   def test(self):
       do_something()

Uninstalling Avocado
~~~~~~~~~~~~~~~~~~~~

If you've followed the manual installation instructions above, you can
easily uninstall Avocado. Start by listing the packages you have
installed::

  pip list --user

And remove any package you want with::

  pip uninstall <package_name>

If you've used ``make check-avocado``, the Python virtual environment where
Avocado is installed will be cleaned up as part of ``make check-clean``.

.. _checktcg-ref:

Testing with "make check-tcg"
-----------------------------

The check-tcg tests are intended for simple smoke tests of both
linux-user and softmmu TCG functionality. However, to build test
programs for guest targets you need to have cross compilers available.
If your distribution supports cross compilers, you can do something as
simple as::

  apt install gcc-aarch64-linux-gnu

The configure script will automatically pick up their presence.
Sometimes compilers have slightly odd names, so you may need to point
``configure`` at them with the appropriate option for the architecture
in question, for example::

  $(configure) --cross-cc-aarch64=aarch64-cc

There is also a ``--cross-cc-flags-ARCH`` flag in case additional
compiler flags are needed to build for a given target.

If you have the ability to run containers as the current user, the
build system will automatically use them where no system compiler is
available. For architectures where we also support building QEMU, we
will generally use the same container to build tests. However, there are
a number of additional containers defined that have a minimal
cross-build environment that is only suitable for building test cases.
Sometimes we may use a bleeding edge distribution for compiler features
needed for test cases that aren't yet in the LTS distros we support for
QEMU itself.

See :ref:`container-ref` for more details.

Running a subset of tests
~~~~~~~~~~~~~~~~~~~~~~~~~

You can build the tests for one architecture::

  make build-tcg-tests-$TARGET

And run with::

  make run-tcg-tests-$TARGET

Adding ``V=1`` to the invocation will show the details of how to
invoke QEMU for the test, which is useful for debugging tests.

TCG test dependencies
~~~~~~~~~~~~~~~~~~~~~

The TCG tests are deliberately very light on dependencies and are
either totally bare with minimal gcc lib support (for softmmu tests)
or just glibc (for linux-user tests). This is because getting a cross
compiler to work with additional libraries can be challenging.

Other TCG Tests
---------------

There are a number of out-of-tree test suites that are used for more
extensive testing of processor features.

KVM Unit Tests
~~~~~~~~~~~~~~

The KVM unit tests are designed to run as a Guest OS under KVM but
there is no reason why they can't exercise the TCG as well. The suite
provides a minimal OS kernel with hooks for enabling the MMU as well
as reporting test results via a special device::

  https://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git

Linux Test Project
~~~~~~~~~~~~~~~~~~

The LTP is focused on exercising the syscall interface of a Linux
kernel. It checks that syscalls behave as documented and strives to
exercise as many corner cases as possible. It is a useful test suite
to run to exercise QEMU's linux-user code::

  https://linux-test-project.github.io/

GCC gcov support
----------------

``gcov`` is a GCC tool to analyze the testing coverage by
instrumenting the tested code. To use it, configure QEMU with the
``--enable-gcov`` option and build. Then run the tests as usual.

If you want to gather coverage information on a single test, the ``make
clean-gcda`` target can be used to delete any existing coverage
information before running it.

You can generate an HTML coverage report by executing ``make
coverage-html`` which will create
``meson-logs/coveragereport/index.html``.

Further analysis can be conducted by running the ``gcov`` command
directly on the various .gcda output files. Please read the ``gcov``
documentation for more information.
diff --git a/docs/devel/tracing.rst b/docs/devel/tracing.rst
new file mode 100644
index 000000000..ba8395489
--- /dev/null
+++ b/docs/devel/tracing.rst
@@ -0,0 +1,498 @@
=======
Tracing
=======

Introduction
============

This document describes the tracing infrastructure in QEMU and how to use it
for debugging, profiling, and observing execution.

Quickstart
==========

Enable tracing of ``memory_region_ops_read`` and ``memory_region_ops_write``
events::

  $ qemu --trace "memory_region_ops_*" ...
  ...
  719585@1608130130.441188:memory_region_ops_read cpu 0 mr 0x562fdfbb3820 addr 0x3cc value 0x67 size 1
  719585@1608130130.441190:memory_region_ops_write cpu 0 mr 0x562fdfbd2f00 addr 0x3d4 value 0x70e size 2

This output comes from the "log" trace backend that is enabled by default when
``./configure --enable-trace-backends=BACKENDS`` was not explicitly specified.

Multiple patterns can be specified by repeating the ``--trace`` option::

  $ qemu --trace "kvm_*" --trace "virtio_*" ...

When patterns are used frequently it is more convenient to store them in a
file to avoid long command-line options::

  $ echo "memory_region_ops_*" >/tmp/events
  $ echo "kvm_*" >>/tmp/events
  $ qemu --trace events=/tmp/events ...
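
On builds where the log backend is enabled, a pattern can typically also
be passed as a ``trace:`` item of QEMU's ``-d`` logging option; treat
this as an assumption to verify against your build (``qemu -d help``
lists the supported items)::

  $ qemu -d trace:memory_region_ops_* ...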
+ +Trace events +============ + +Sub-directory setup +------------------- + +Each directory in the source tree can declare a set of trace events in a local +"trace-events" file. All directories which contain "trace-events" files must be +listed in the "trace_events_subdirs" variable in the top level meson.build +file. During build, the "trace-events" file in each listed subdirectory will be +processed by the "tracetool" script to generate code for the trace events. + +The individual "trace-events" files are merged into a "trace-events-all" file, +which is also installed into "/usr/share/qemu" with the name "trace-events". +This merged file is to be used by the "simpletrace.py" script to later analyse +traces in the simpletrace data format. + +The following files are automatically generated in <builddir>/trace/ during the +build: + + - trace-<subdir>.c - the trace event state declarations + - trace-<subdir>.h - the trace event enums and probe functions + - trace-dtrace-<subdir>.h - DTrace event probe specification + - trace-dtrace-<subdir>.dtrace - DTrace event probe helper declaration + - trace-dtrace-<subdir>.o - binary DTrace provider (generated by dtrace) + - trace-ust-<subdir>.h - UST event probe helper declarations + +Here <subdir> is the sub-directory path with '/' replaced by '_'. For example, +"accel/kvm" becomes "accel_kvm" and the final filename for "trace-<subdir>.c" +becomes "trace-accel_kvm.c". + +Source files in the source tree do not directly include generated files in +"<builddir>/trace/". Instead they #include the local "trace.h" file, without +any sub-directory path prefix. eg io/channel-buffer.c would do:: + + #include "trace.h" + +The "io/trace.h" file must be created manually with an #include of the +corresponding "trace/trace-<subdir>.h" file that will be generated in the +builddir:: + + $ echo '#include "trace/trace-io.h"' >io/trace.h + +While it is possible to include a trace.h file from outside a source file's own +sub-directory, this is discouraged in general. It is strongly preferred that +all events be declared directly in the sub-directory that uses them. The only +exception is where there are some shared trace events defined in the top level +directory trace-events file. The top level directory generates trace files +with a filename prefix of "trace/trace-root" instead of just "trace". This is +to avoid ambiguity between a trace.h in the current directory, vs the top level +directory. + +Using trace events +------------------ + +Trace events are invoked directly from source code like this:: + + #include "trace.h" /* needed for trace event prototype */ + + void *qemu_vmalloc(size_t size) + { + void *ptr; + size_t align = QEMU_VMALLOC_ALIGN; + + if (size < align) { + align = getpagesize(); + } + ptr = qemu_memalign(align, size); + trace_qemu_vmalloc(size, ptr); + return ptr; + } + +Declaring trace events +---------------------- + +The "tracetool" script produces the trace.h header file which is included by +every source file that uses trace events. Since many source files include +trace.h, it uses a minimum of types and other header files included to keep the +namespace clean and compile times and dependencies down. + +Trace events should use types as follows: + + * Use stdint.h types for fixed-size types. Most offsets and guest memory + addresses are best represented with uint32_t or uint64_t. 
Use fixed-size + types over primitive types whose size may change depending on the host + (32-bit versus 64-bit) so trace events don't truncate values or break + the build. + + * Use void * for pointers to structs or for arrays. The trace.h header + cannot include all user-defined struct declarations and it is therefore + necessary to use void * for pointers to structs. + + * For everything else, use primitive scalar types (char, int, long) with the + appropriate signedness. + + * Avoid floating point types (float and double) because SystemTap does not + support them. In most cases it is possible to round to an integer type + instead. This may require scaling the value first by multiplying it by 1000 + or the like when digits after the decimal point need to be preserved. + +Format strings should reflect the types defined in the trace event. Take +special care to use PRId64 and PRIu64 for int64_t and uint64_t types, +respectively. This ensures portability between 32- and 64-bit platforms. +Format strings must not end with a newline character. It is the responsibility +of backends to adapt line ending for proper logging. + +Each event declaration will start with the event name, then its arguments, +finally a format string for pretty-printing. For example:: + + qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p" + qemu_vfree(void *ptr) "ptr %p" + + +Hints for adding new trace events +--------------------------------- + +1. Trace state changes in the code. Interesting points in the code usually + involve a state change like starting, stopping, allocating, freeing. State + changes are good trace events because they can be used to understand the + execution of the system. + +2. Trace guest operations. Guest I/O accesses like reading device registers + are good trace events because they can be used to understand guest + interactions. + +3. Use correlator fields so the context of an individual line of trace output + can be understood. For example, trace the pointer returned by malloc and + used as an argument to free. This way mallocs and frees can be matched up. + Trace events with no context are not very useful. + +4. Name trace events after their function. If there are multiple trace events + in one function, append a unique distinguisher at the end of the name. + +Generic interface and monitor commands +====================================== + +You can programmatically query and control the state of trace events through a +backend-agnostic interface provided by the header "trace/control.h". + +Note that some of the backends do not provide an implementation for some parts +of this interface, in which case QEMU will just print a warning (please refer to +header "trace/control.h" to see which routines are backend-dependent). + +The state of events can also be queried and modified through monitor commands: + +* ``info trace-events`` + View available trace events and their state. State 1 means enabled, state 0 + means disabled. + +* ``trace-event NAME on|off`` + Enable/disable a given trace event or a group of events (using wildcards). + +The "--trace events=<file>" command line argument can be used to enable the +events listed in <file> from the very beginning of the program. This file must +contain one event name per line. + +If a line in the "--trace events=<file>" file begins with a '-', the trace event +will be disabled instead of enabled. This is useful when a wildcard was used +to enable an entire family of events but one noisy event needs to be disabled. 
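
For example, a hypothetical events file along these lines enables a whole
family of events while muting one noisy event (the event names are
illustrative)::

    virtio_blk_*
    -virtio_blk_rw_complete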

Wildcard matching is supported in both the monitor command "trace-event"
and the events list file. That means you can enable/disable the events
having a common prefix in a batch. For example, virtio-blk trace events
could be enabled using the following monitor command::

    trace-event virtio_blk_* on

Trace backends
==============

The "tracetool" script automates tedious trace event code generation and also
keeps the trace event declarations independent of the trace backend. The trace
events are not tightly coupled to a specific trace backend, such as LTTng or
SystemTap. Support for trace backends can be added by extending the "tracetool"
script.

The trace backends are chosen at configure time::

    ./configure --enable-trace-backends=simple,dtrace

For a list of supported trace backends, try ./configure --help or see below.
If multiple backends are enabled, the trace is sent to them all.

If no backends are explicitly selected, configure will default to the
"log" backend.

The following subsections describe the supported trace backends.

Nop
---

The "nop" backend generates empty trace event functions so that the compiler
can optimize out trace events completely. This imposes no performance
penalty.

Note that regardless of the selected trace backend, events with the "disable"
property will be generated with the "nop" backend.

Log
---

The "log" backend sends trace events directly to standard error. This
effectively turns trace events into debug printfs.

This is the simplest backend and can be used together with existing code that
uses DPRINTF().

The -msg timestamp=on|off command-line option controls whether or not to print
the tid/timestamp prefix for each trace event.

Simpletrace
-----------

The "simple" backend writes binary trace logs to a file from a thread, making
it lower overhead than the "log" backend. A Python API is available for writing
offline trace file analysis scripts. It may not be as powerful as
platform-specific or third-party trace backends but it is portable and has no
special library dependencies.

Monitor commands
~~~~~~~~~~~~~~~~

* ``trace-file on|off|flush|set <path>``
  Enable/disable/flush the trace file or set the trace file name.

Analyzing trace files
~~~~~~~~~~~~~~~~~~~~~

The "simple" backend produces binary trace files that can be formatted with the
simpletrace.py script. The script takes the "trace-events-all" file and the
binary trace::

    ./scripts/simpletrace.py trace-events-all trace-12345

You must ensure that the same "trace-events-all" file was used to build QEMU,
otherwise trace event declarations may have changed and output will not be
consistent.

Ftrace
------

The "ftrace" backend writes trace data to the ftrace marker. This
effectively sends trace events to the ftrace ring buffer, and you can
compare QEMU trace data with kernel trace data (especially from kvm.ko
when using KVM).

If you use KVM, enable kvm events in ftrace::

    # echo 1 > /sys/kernel/debug/tracing/events/kvm/enable

After running QEMU as the root user, you can get the trace::

    # cat /sys/kernel/debug/tracing/trace

Restriction: the "ftrace" backend is restricted to Linux only.

Syslog
------

The "syslog" backend sends trace events using the POSIX syslog API. The log
is opened specifying the LOG_DAEMON facility and LOG_PID option (so events
are tagged with the pid of the particular QEMU process that generated
them). All events are logged at LOG_INFO level.
+ +NOTE: syslog may squash duplicate consecutive trace events and apply rate + limiting. + +Restriction: "syslog" backend is restricted to POSIX compliant OS. + +LTTng Userspace Tracer +---------------------- + +The "ust" backend uses the LTTng Userspace Tracer library. There are no +monitor commands built into QEMU, instead UST utilities should be used to list, +enable/disable, and dump traces. + +Package lttng-tools is required for userspace tracing. You must ensure that the +current user belongs to the "tracing" group, or manually launch the +lttng-sessiond daemon for the current user prior to running any instance of +QEMU. + +While running an instrumented QEMU, LTTng should be able to list all available +events:: + + lttng list -u + +Create tracing session:: + + lttng create mysession + +Enable events:: + + lttng enable-event qemu:g_malloc -u + +Where the events can either be a comma-separated list of events, or "-a" to +enable all tracepoint events. Start and stop tracing as needed:: + + lttng start + lttng stop + +View the trace:: + + lttng view + +Destroy tracing session:: + + lttng destroy + +Babeltrace can be used at any later time to view the trace:: + + babeltrace $HOME/lttng-traces/mysession-<date>-<time> + +SystemTap +--------- + +The "dtrace" backend uses DTrace sdt probes but has only been tested with +SystemTap. When SystemTap support is detected a .stp file with wrapper probes +is generated to make use in scripts more convenient. This step can also be +performed manually after a build in order to change the binary name in the .stp +probes:: + + scripts/tracetool.py --backends=dtrace --format=stap \ + --binary path/to/qemu-binary \ + --target-type system \ + --target-name x86_64 \ + --group=all \ + trace-events-all \ + qemu.stp + +To facilitate simple usage of systemtap where there merely needs to be printf +logging of certain probes, a helper script "qemu-trace-stap" is provided. +Consult its manual page for guidance on its usage. + +Trace event properties +====================== + +Each event in the "trace-events-all" file can be prefixed with a space-separated +list of zero or more of the following event properties. + +"disable" +--------- + +If a specific trace event is going to be invoked a huge number of times, this +might have a noticeable performance impact even when the event is +programmatically disabled. + +In this case you should declare such event with the "disable" property. This +will effectively disable the event at compile time (by using the "nop" backend), +thus having no performance impact at all on regular builds (i.e., unless you +edit the "trace-events-all" file). + +In addition, there might be cases where relatively complex computations must be +performed to generate values that are only used as arguments for a trace +function. In these cases you can use 'trace_event_get_state_backends()' to +guard such computations, so they are skipped if the event has been either +compile-time disabled or run-time disabled. If the event is compile-time +disabled, this check will have no performance impact. 
+ +:: + + #include "trace.h" /* needed for trace event prototype */ + + void *qemu_vmalloc(size_t size) + { + void *ptr; + size_t align = QEMU_VMALLOC_ALIGN; + + if (size < align) { + align = getpagesize(); + } + ptr = qemu_memalign(align, size); + if (trace_event_get_state_backends(TRACE_QEMU_VMALLOC)) { + void *complex; + /* some complex computations to produce the 'complex' value */ + trace_qemu_vmalloc(size, ptr, complex); + } + return ptr; + } + +"tcg" +----- + +Guest code generated by TCG can be traced by defining an event with the "tcg" +event property. Internally, this property generates two events: +"<eventname>_trans" to trace the event at translation time, and +"<eventname>_exec" to trace the event at execution time. + +Instead of using these two events, you should instead use the function +"trace_<eventname>_tcg" during translation (TCG code generation). This function +will automatically call "trace_<eventname>_trans", and will generate the +necessary TCG code to call "trace_<eventname>_exec" during guest code execution. + +Events with the "tcg" property can be declared in the "trace-events" file with a +mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward +them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values +are not known at translation time, these are ignored by the "<eventname>_trans" +event. Because of this, the entry in the "trace-events" file needs two printing +formats (separated by a comma):: + + tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d" + +For example:: + + #include "trace-tcg.h" + + void some_disassembly_func (...) + { + uint8_t a1 = ...; + TCGv_i32 a2 = ...; + trace_foo_tcg(a1, a2); + } + +This will immediately call:: + + void trace_foo_trans(uint8_t a1); + +and will generate the TCG code to call:: + + void trace_foo(uint8_t a1, uint32_t a2); + +"vcpu" +------ + +Identifies events that trace vCPU-specific information. It implicitly adds a +"CPUState*" argument, and extends the tracing print format to show the vCPU +information. If used together with the "tcg" property, it adds a second +"TCGv_env" argument that must point to the per-target global TCG register that +points to the vCPU when guest code is executed (usually the "cpu_env" variable). + +The "tcg" and "vcpu" properties are currently only honored in the root +./trace-events file. + +The following example events:: + + foo(uint32_t a) "a=%x" + vcpu bar(uint32_t a) "a=%x" + tcg vcpu baz(uint32_t a) "a=%x", "a=%x" + +Can be used as:: + + #include "trace-tcg.h" + + CPUArchState *env; + TCGv_ptr cpu_env; + + void some_disassembly_func(...) + { + /* trace emitted at this point */ + trace_foo(0xd1); + /* trace emitted at this point */ + trace_bar(env_cpu(env), 0xd2); + /* trace emitted at this point (env) and when guest code is executed (cpu_env) */ + trace_baz_tcg(env_cpu(env), cpu_env, 0xd3); + } + +If the translating vCPU has address 0xc1 and code is later executed by vCPU +0xc2, this would be an example output:: + + // at guest code translation + foo a=0xd1 + bar cpu=0xc1 a=0xd2 + baz_trans cpu=0xc1 a=0xd3 + // at guest code execution + baz_exec cpu=0xc2 a=0xd3 diff --git a/docs/devel/trivial-patches.rst b/docs/devel/trivial-patches.rst new file mode 100644 index 000000000..9380c730f --- /dev/null +++ b/docs/devel/trivial-patches.rst @@ -0,0 +1,52 @@ +.. 
_trivial-patches: + +Trivial Patches +=============== + +Overview +-------- + +Trivial patches that change just a few lines of code sometimes languish +on the mailing list even though they require only a small amount of +review. This is often the case for patches that do not fall under an +actively maintained subsystem and therefore fall through the cracks. + +The trivial patches team take on the task of reviewing and building pull +requests for patches that: + +- Do not fall under an actively maintained subsystem. +- Are single patches or short series (max 2-4 patches). +- Only touch a few lines of code. + +**You should hint that your patch is a candidate by CCing +qemu-trivial@nongnu.org.** + +Repositories +------------ + +Since the trivial patch team rotates maintainership there is only one +active repository at a time: + +- git://github.com/vivier/qemu.git trivial-patches - `browse <https://github.com/vivier/qemu/tree/trivial-patches>`__ + +Workflow +-------- + +The trivial patches team rotates the duty of collecting trivial patches +amongst its members. A team member's job is to: + +1. Identify trivial patches on the development mailing list. +2. Review trivial patches, merge them into a git tree, and reply to state + that the patch is queued. +3. Send pull requests to the development mailing list once a week. + +A single team member can be on duty as long as they like. The suggested +time is 1 week before handing off to the next member. + +Team +---- + +If you would like to join the trivial patches team, contact Laurent +Vivier. The current team includes: + +- `Laurent Vivier <mailto:laurent@vivier.eu>`__ diff --git a/docs/devel/ui.rst b/docs/devel/ui.rst new file mode 100644 index 000000000..17fb667de --- /dev/null +++ b/docs/devel/ui.rst @@ -0,0 +1,8 @@ +================= +QEMU UI subsystem +================= + +QEMU Clipboard +-------------- + +.. kernel-doc:: include/ui/clipboard.h diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst new file mode 100644 index 000000000..9ff6163c8 --- /dev/null +++ b/docs/devel/vfio-migration.rst @@ -0,0 +1,150 @@ +===================== +VFIO device Migration +===================== + +Migration of virtual machine involves saving the state for each device that +the guest is running on source host and restoring this saved state on the +destination host. This document details how saving and restoring of VFIO +devices is done in QEMU. + +Migration of VFIO devices consists of two phases: the optional pre-copy phase, +and the stop-and-copy phase. The pre-copy phase is iterative and allows to +accommodate VFIO devices that have a large amount of data that needs to be +transferred. The iterative pre-copy phase of migration allows for the guest to +continue whilst the VFIO device state is transferred to the destination, this +helps to reduce the total downtime of the VM. VFIO devices can choose to skip +the pre-copy phase of migration by returning pending_bytes as zero during the +pre-copy phase. + +A detailed description of the UAPI for VFIO device migration can be found in +the comment for the ``vfio_device_migration_info`` structure in the header +file linux-headers/linux/vfio.h. + +VFIO implements the device hooks for the iterative approach as follows: + +* A ``save_setup`` function that sets up the migration region and sets _SAVING + flag in the VFIO device state. + +* A ``load_setup`` function that sets up the migration region on the + destination and sets _RESUMING flag in the VFIO device state. 
+ +* A ``save_live_pending`` function that reads pending_bytes from the vendor + driver, which indicates the amount of data that the vendor driver has yet to + save for the VFIO device. + +* A ``save_live_iterate`` function that reads the VFIO device's data from the + vendor driver through the migration region during iterative phase. + +* A ``save_state`` function to save the device config space if it is present. + +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the + VFIO device state and iteratively copies the remaining data for the VFIO + device until the vendor driver indicates that no data remains (pending bytes + is zero). + +* A ``load_state`` function that loads the config section and the data + sections that are generated by the save functions above + +* ``cleanup`` functions for both save and load that perform any migration + related cleanup, including unmapping the migration region + + +The VFIO migration code uses a VM state change handler to change the VFIO +device state when the VM state changes from running to not-running, and +vice versa. + +Similarly, a migration state change handler is used to trigger a transition of +the VFIO device state when certain changes of the migration state occur. For +example, the VFIO device state is transitioned back to _RUNNING in case a +migration failed or was canceled. + +System memory dirty pages tracking +---------------------------------- + +A ``log_global_start`` and ``log_global_stop`` memory listener callback informs +the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync`` +memory listener callback marks those system memory pages as dirty which are +used for DMA by the VFIO device. The dirty pages bitmap is queried per +container. All pages pinned by the vendor driver through external APIs have to +be marked as dirty during migration. When there are CPU writes, CPU dirty page +tracking can identify dirtied pages, but any page pinned by the vendor driver +can also be written by the device. There is currently no device or IOMMU +support for dirty page tracking in hardware. + +By default, dirty pages are tracked when the device is in pre-copy as well as +stop-and-copy phase. So, a page pinned by the vendor driver will be copied to +the destination in both phases. Copying dirty pages in pre-copy phase helps +QEMU to predict if it can achieve its downtime tolerances. If QEMU during +pre-copy phase keeps finding dirty pages continuously, then it understands +that even in stop-and-copy phase, it is likely to find dirty pages and can +predict the downtime accordingly. + +QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking`` +which disables querying the dirty bitmap during pre-copy phase. If it is set to +off, all dirty pages will be copied to the destination in stop-and-copy phase +only. + +System memory dirty pages tracking when vIOMMU is enabled +--------------------------------------------------------- + +With vIOMMU, an IO virtual address range can get unmapped while in pre-copy +phase of migration. In that case, the unmap ioctl returns any dirty pages in +that range and QEMU reports corresponding guest physical pages dirty. During +stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped +pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those +mapped ranges. + +Flow of state changes during Live migration +=========================================== + +Below is the flow of state change during live migration. 

The values in the brackets represent the VM state, the migration state, and
the VFIO device state, respectively.

Live migration save path
------------------------

::

                         QEMU normal running state
                         (RUNNING, _NONE, _RUNNING)
                                     |
                    migrate_init spawns migration_thread
           Migration thread then calls each device's .save_setup()
                   (RUNNING, _SETUP, _RUNNING|_SAVING)
                                     |
                   (RUNNING, _ACTIVE, _RUNNING|_SAVING)
        If device is active, get pending_bytes by .save_live_pending()
      If total pending_bytes >= threshold_size, call .save_live_iterate()
              Data of VFIO device for pre-copy phase is copied
     Iterate till total pending bytes converge and are less than threshold
                                     |
   On migration completion, vCPU stops and calls .save_live_complete_precopy
   for each active device. The VFIO device is then transitioned into _SAVING
   state (FINISH_MIGRATE, _DEVICE, _SAVING)
                                     |
      For the VFIO device, iterate in .save_live_complete_precopy until
                            pending data is 0
                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
                                     |
                  (FINISH_MIGRATE, _COMPLETED, _STOPPED)
            Migration thread schedules cleanup bottom half and exits

Live migration resume path
--------------------------

::

            Incoming migration calls .load_setup for each device
                    (RESTORE_VM, _ACTIVE, _STOPPED)
                                     |
     For each device, .load_state is called for that device section data
                    (RESTORE_VM, _ACTIVE, _RESUMING)
                                     |
   At the end, .load_cleanup is called for each device and vCPUs are started
                      (RUNNING, _NONE, _RUNNING)

Postcopy
========

Postcopy migration is currently not supported for VFIO devices.
diff --git a/docs/devel/virtio-migration.txt b/docs/devel/virtio-migration.txt
new file mode 100644
index 000000000..98a6b0ffb
--- /dev/null
+++ b/docs/devel/virtio-migration.txt
@@ -0,0 +1,108 @@
Virtio devices and migration
============================

Copyright 2015 IBM Corp.

This work is licensed under the terms of the GNU GPL, version 2 or later. See
the COPYING file in the top-level directory.

Saving and restoring the state of virtio devices is a bit of a twisty maze,
for several reasons:
- state is distributed between several parts:
  - virtio core, for common fields like features, number of queues, ...
  - virtio transport (pci, ccw, ...), for the different proxy devices and
    transport specific state (msix vectors, indicators, ...)
  - virtio device (net, blk, ...), for the different device types and their
    state (mac address, request queue, ...)
- most fields are saved via the stream interface; subsequently, subsections
  have been added to make cross-version migration possible

This file attempts to document the current procedure and point out some
caveats.
+ + +Save state procedure +==================== + +virtio core virtio transport virtio device +----------- ---------------- ------------- + + save() function registered + via VMState wrapper on + device class +virtio_save() <---------- + ------> save_config() + - save proxy device + - save transport-specific + device fields +- save common device + fields +- save common virtqueue + fields + ------> save_queue() + - save transport-specific + virtqueue fields + ------> save_device() + - save device-specific + fields +- save subsections + - device endianness, + if changed from + default endianness + - 64 bit features, if + any high feature bit + is set + - virtio-1 virtqueue + fields, if VERSION_1 + is set + + +Load state procedure +==================== + +virtio core virtio transport virtio device +----------- ---------------- ------------- + + load() function registered + via VMState wrapper on + device class +virtio_load() <---------- + ------> load_config() + - load proxy device + - load transport-specific + device fields +- load common device + fields +- load common virtqueue + fields + ------> load_queue() + - load transport-specific + virtqueue fields +- notify guest + ------> load_device() + - load device-specific + fields +- load subsections + - device endianness + - 64 bit features + - virtio-1 virtqueue + fields +- sanitize endianness +- sanitize features +- virtqueue index sanity + check + - feature-dependent setup + + +Implications of this setup +========================== + +Devices need to be careful in their state processing during load: The +load_device() procedure is invoked by the core before subsections have +been loaded. Any code that depends on information transmitted in subsections +therefore has to be invoked in the device's load() function _after_ +virtio_load() returned (like e.g. code depending on features). + +Any extension of the state being migrated should be done in subsections +added to the core for compatibility reasons. If transport or device specific +state is added, core needs to invoke a callback from the new subsection. diff --git a/docs/devel/writing-monitor-commands.rst b/docs/devel/writing-monitor-commands.rst new file mode 100644 index 000000000..1693822f8 --- /dev/null +++ b/docs/devel/writing-monitor-commands.rst @@ -0,0 +1,751 @@ +How to write monitor commands +============================= + +This document is a step-by-step guide on how to write new QMP commands using +the QAPI framework and HMP commands. + +This document doesn't discuss QMP protocol level details, nor does it dive +into the QAPI framework implementation. + +For an in-depth introduction to the QAPI framework, please refer to +docs/devel/qapi-code-gen.txt. For documentation about the QMP protocol, +start with docs/interop/qmp-intro.txt. + +New commands may be implemented in QMP only. New HMP commands should be +implemented on top of QMP. The typical HMP command wraps around an +equivalent QMP command, but HMP convenience commands built from QMP +building blocks are also fine. The long term goal is to make all +existing HMP commands conform to this, to fully isolate HMP from the +internals of QEMU. Refer to the `Writing a debugging aid returning +unstructured text`_ section for further guidance on commands that +would have traditionally been HMP only. + +Overview +-------- + +Generally speaking, the following steps should be taken in order to write a +new QMP command. + +1. Define the command and any types it needs in the appropriate QAPI + schema module. + +2. 
Write the QMP command itself, which is a regular C function. Preferably,
   the command should be exported by some QEMU subsystem. But it can also
   be added to the monitor/qmp-cmds.c file.

3. At this point the command can be tested under the QMP protocol.

4. Write the HMP command equivalent. This is not required, and should only
   be done if it makes sense to have the functionality in HMP. The HMP
   command is implemented in terms of the QMP command.

The following sections will demonstrate each of the steps above. We will
start very simple and get more complex as we progress.


Testing
-------

For all the examples in the next sections, the test setup is the same and is
shown here.

First, QEMU should be started like this::

    # qemu-system-TARGET [...] \
      -chardev socket,id=qmp,port=4444,host=localhost,server=on \
      -mon chardev=qmp,mode=control,pretty=on

Then, in a different terminal::

    $ telnet localhost 4444
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    {
        "QMP": {
            "version": {
                "qemu": {
                    "micro": 50,
                    "minor": 15,
                    "major": 0
                },
                "package": ""
            },
            "capabilities": [
            ]
        }
    }

The above output is the QMP server saying you're connected. The server is
actually in capabilities negotiation mode. To enter command mode, type::

    { "execute": "qmp_capabilities" }

Then the server should respond::

    {
        "return": {
        }
    }

This is QMP's way of saying "the latest command executed OK and didn't return
any data". Now you're ready to enter the QMP example commands as explained in
the following sections.


Writing a simple command: hello-world
-------------------------------------

That's the simplest QMP command that can be written. Usually, this kind of
command carries some meaningful action in QEMU, but here it will just print
"Hello, world" to the standard output.

Our command will be called "hello-world". It takes no arguments, nor does it
return any data.

The first step is defining the command in the appropriate QAPI schema
module. We pick module qapi/misc.json, and add the following line at
the bottom::

    { 'command': 'hello-world' }

The "command" keyword defines a new QMP command. It's a JSON object. All
schema entries are JSON objects. The line above will instruct the QAPI to
generate any prototypes and the necessary code to marshal and unmarshal
protocol data.

The next step is to write the "hello-world" implementation. As explained
earlier, it's preferable for commands to live in QEMU subsystems. But
"hello-world" doesn't pertain to any, so we put its implementation in
monitor/qmp-cmds.c::

    void qmp_hello_world(Error **errp)
    {
        printf("Hello, world!\n");
    }

There are a few things to be noticed:

1. QMP command implementation functions must be prefixed with "qmp\_"
2. qmp_hello_world() returns void; this is consistent with the fact that the
   command doesn't return any data
3. It takes an "Error \*\*" argument. This is required. Later we will see how
   to return errors and take additional arguments. The Error argument should
   not be touched if the command doesn't return errors
4. We won't add the function's prototype. That's automatically done by the
   QAPI (a sketch of the generated declaration is shown below)
5. Printing to the terminal is discouraged for QMP commands, we do it here
   because it's the easiest way to demonstrate a QMP command

You're done.
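
For reference, the generated declaration will look roughly like the
following sketch; the exact header name depends on the schema module the
command was added to (qapi/misc.json in this example)::

    /* from the generated qapi/qapi-commands-misc.h (sketch) */
    void qmp_hello_world(Error **errp);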
Now build qemu, run it as suggested in the "Testing" section,
and then type the following QMP command::

    { "execute": "hello-world" }

Then check the terminal running qemu and look for the "Hello, world" string.
If you don't see it, then something went wrong.


Arguments
~~~~~~~~~

Let's add an argument called "message" to our "hello-world" command. The new
argument will contain the string to be printed to stdout. It's an optional
argument; if it's not present, we print our default "Hello, world" string.

The first change we have to do is to modify the command specification in the
schema file to the following::

    { 'command': 'hello-world', 'data': { '*message': 'str' } }

Notice the new 'data' member in the schema. It's a JSON object in which each
element is an argument to the command in question. Also notice the asterisk;
it's used to mark the argument as optional (that means that you shouldn't use
it for mandatory arguments). Finally, 'str' is the argument's type, which
stands for "string". The QAPI also supports integers, booleans, enumerations
and user defined types.

Now, let's update our C implementation in monitor/qmp-cmds.c::

    void qmp_hello_world(bool has_message, const char *message, Error **errp)
    {
        if (has_message) {
            printf("%s\n", message);
        } else {
            printf("Hello, world\n");
        }
    }

There are two important details to be noticed:

1. All optional arguments are accompanied by a 'has\_' boolean, which is
   true if the optional argument is present and false otherwise
2. The C implementation signature must follow the schema's argument ordering,
   which is defined by the "data" member

Time to test our new version of the "hello-world" command. Build qemu, run it
as described in the "Testing" section and then send two commands::

    { "execute": "hello-world" }
    {
        "return": {
        }
    }

    { "execute": "hello-world", "arguments": { "message": "We love qemu" } }
    {
        "return": {
        }
    }

You should see "Hello, world" and "We love qemu" in the terminal running
qemu; if you don't see these strings, then something went wrong.


Errors
~~~~~~

QMP commands should use the error interface exported by the error.h header
file. Basically, most errors are set by calling the error_setg() function.

Let's say we don't accept the string "message" to contain the word "love". If
it does contain it, we want the "hello-world" command to return an error::

    void qmp_hello_world(bool has_message, const char *message, Error **errp)
    {
        if (has_message) {
            if (strstr(message, "love")) {
                error_setg(errp, "the word 'love' is not allowed");
                return;
            }
            printf("%s\n", message);
        } else {
            printf("Hello, world\n");
        }
    }

The first argument to the error_setg() function is the Error pointer
to pointer, which is passed to all QMP functions. The next argument is a
human description of the error; this is a free-form printf-like string.

Let's test the example above. Build qemu, run it as defined in the "Testing"
section, and then issue the following command::

    { "execute": "hello-world", "arguments": { "message": "all you need is love" } }

The QMP server's response should be::

    {
        "error": {
            "class": "GenericError",
            "desc": "the word 'love' is not allowed"
        }
    }

Note that error_setg() produces a "GenericError" class. In general,
all QMP errors should have that error class. There are two exceptions
to this rule:

 1.
To support a management application's need to recognize a specific + error for special handling + + 2. Backward compatibility + +If the failure you want to report falls into one of the two cases above, +use error_set() with a second argument of an ErrorClass value. + + +Command Documentation +~~~~~~~~~~~~~~~~~~~~~ + +There's only one step missing to make "hello-world"'s implementation complete, +and that's its documentation in the schema file. + +There are many examples of such documentation in the schema file already, but +here goes "hello-world"'s new entry for qapi/misc.json:: + + ## + # @hello-world: + # + # Print a client provided string to the standard output stream. + # + # @message: string to be printed + # + # Returns: Nothing on success. + # + # Notes: if @message is not provided, the "Hello, world" string will + # be printed instead + # + # Since: <next qemu stable release, eg. 1.0> + ## + { 'command': 'hello-world', 'data': { '*message': 'str' } } + +Please, note that the "Returns" clause is optional if a command doesn't return +any data nor any errors. + + +Implementing the HMP command +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Now that the QMP command is in place, we can also make it available in the human +monitor (HMP). + +With the introduction of the QAPI, HMP commands make QMP calls. Most of the +time HMP commands are simple wrappers. All HMP commands implementation exist in +the monitor/hmp-cmds.c file. + +Here's the implementation of the "hello-world" HMP command:: + + void hmp_hello_world(Monitor *mon, const QDict *qdict) + { + const char *message = qdict_get_try_str(qdict, "message"); + Error *err = NULL; + + qmp_hello_world(!!message, message, &err); + if (hmp_handle_error(mon, err)) { + return; + } + } + +Also, you have to add the function's prototype to the hmp.h file. + +There are three important points to be noticed: + +1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The + former is the monitor object. The latter is how the monitor passes + arguments entered by the user to the command implementation +2. hmp_hello_world() performs error checking. In this example we just call + hmp_handle_error() which prints a message to the user, but we could do + more, like taking different actions depending on the error + qmp_hello_world() returns +3. The "err" variable must be initialized to NULL before performing the + QMP call + +There's one last step to actually make the command available to monitor users, +we should add it to the hmp-commands.hx file:: + + { + .name = "hello-world", + .args_type = "message:s?", + .params = "hello-world [message]", + .help = "Print message to the standard output", + .cmd = hmp_hello_world, + }, + +:: + + STEXI + @item hello_world @var{message} + @findex hello_world + Print message to the standard output + ETEXI + +To test this you have to open a user monitor and issue the "hello-world" +command. It might be instructive to check the command's documentation with +HMP's "help" command. + +Please, check the "-monitor" command-line option to know how to open a user +monitor. + + +Writing more complex commands +----------------------------- + +A QMP command is capable of returning any data the QAPI supports like integers, +strings, booleans, enumerations and user defined types. + +In this section we will focus on user defined types. Please, check the QAPI +documentation for information about the other types. 
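
For contrast with the user defined types covered below, returning one of
the simple built-in types needs no extra definitions; a hypothetical
command returning a plain string would be declared as follows (note that
QEMU's QAPI checks may restrict new commands to returning object types,
so treat this purely as an illustration of the schema syntax)::

    { 'command': 'query-greeting', 'returns': 'str' }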


Modelling data in QAPI
~~~~~~~~~~~~~~~~~~~~~~

For a QMP command to be considered stable and supported long term, there
is a requirement that its returned data should be explicitly modelled
using fine-grained QAPI types. As a general guide, a caller of the QMP
command should never need to parse individual returned data fields. If
a field appears to need parsing, then it should be split into separate
fields corresponding to each distinct data item. This should be the
common case for any new QMP command that is intended to be used by
machines, as opposed to exclusively human operators.

Some QMP commands, however, are only intended as ad hoc debugging aids
for human operators. While they may return large amounts of formatted
data, it is not expected that machines will need to parse the result.
The overhead of defining a fine grained QAPI type for the data may not
be justified by the potential benefit. In such cases, it is permitted
to have a command return a simple string that contains formatted data;
however, it is mandatory for the command to use the 'x-' name prefix.
This indicates that the command is not guaranteed to be stable long
term, is liable to change in the future, and is not following QAPI
design best practices. An example where this approach is taken is the
QMP command "x-query-registers". This returns a formatted dump of the
architecture specific CPU state. The way the data is formatted varies
across QEMU targets, is liable to change over time, and is only
intended to be consumed as an opaque string by machines. Refer to the
`Writing a debugging aid returning unstructured text`_ section for
an illustration.

User Defined Types
~~~~~~~~~~~~~~~~~~

FIXME This example needs to be redone after commit 6d32717

For this example we will write the query-alarm-clock command, which returns
information about QEMU's timer alarm. For more information about it, please
check the "-clock" command-line option.

We want to return two pieces of information. The first one is the alarm
clock's name. The second one is when the next alarm will fire. The former
information is returned as a string, the latter is an integer in nanoseconds
(which is not very useful in practice, as the timer has probably already
fired when the information reaches the client).

The best way to return that data is to create a new QAPI type, as shown
below::

    ##
    # @QemuAlarmClock
    #
    # QEMU alarm clock information.
    #
    # @clock-name: The alarm clock method's name.
    #
    # @next-deadline: The time (in nanoseconds) the next alarm will fire.
    #
    # Since: 1.0
    ##
    { 'type': 'QemuAlarmClock',
      'data': { 'clock-name': 'str', '*next-deadline': 'int' } }

The "type" keyword defines a new QAPI type. Its "data" member contains the
type's members. In this example our members are "clock-name" and the
optional "next-deadline".

Now let's define the query-alarm-clock command::

    ##
    # @query-alarm-clock
    #
    # Return information about QEMU's alarm clock.
    #
    # Returns a @QemuAlarmClock instance describing the alarm clock method
    # being currently used by QEMU (this is usually set by the '-clock'
    # command-line option).
    #
    # Since: 1.0
    ##
    { 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }

Notice the "returns" keyword. As its name suggests, it's used to define the
data returned by a command.
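
Behind the scenes, the QAPI code generator turns that definition into a C
struct roughly like the sketch below; note how the optional member is
paired with a 'has\_' flag, which the implementation must set::

    /* generated C type for QemuAlarmClock (sketch) */
    typedef struct QemuAlarmClock {
        char *clock_name;
        bool has_next_deadline;
        int64_t next_deadline;
    } QemuAlarmClock;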
+ +It's time to implement the qmp_query_alarm_clock() function, you can put it +in the qemu-timer.c file:: + + QemuAlarmClock *qmp_query_alarm_clock(Error **errp) + { + QemuAlarmClock *clock; + int64_t deadline; + + clock = g_malloc0(sizeof(*clock)); + + deadline = qemu_next_alarm_deadline(); + if (deadline > 0) { + clock->has_next_deadline = true; + clock->next_deadline = deadline; + } + clock->clock_name = g_strdup(alarm_timer->name); + + return clock; + } + +There are a number of things to be noticed: + +1. The QemuAlarmClock type is automatically generated by the QAPI framework, + its members correspond to the type's specification in the schema file +2. As specified in the schema file, the function returns a QemuAlarmClock + instance and takes no arguments (besides the "errp" one, which is mandatory + for all QMP functions) +3. The "clock" variable (which will point to our QAPI type instance) is + allocated by the regular g_malloc0() function. Note that we chose to + initialize the memory to zero. This is recommended for all QAPI types, as + it helps avoiding bad surprises (specially with booleans) +4. Remember that "next_deadline" is optional? All optional members have a + 'has_TYPE_NAME' member that should be properly set by the implementation, + as shown above +5. Even static strings, such as "alarm_timer->name", should be dynamically + allocated by the implementation. This is so because the QAPI also generates + a function to free its types and it cannot distinguish between dynamically + or statically allocated strings +6. You have to include "qapi/qapi-commands-misc.h" in qemu-timer.c + +Time to test the new command. Build qemu, run it as described in the "Testing" +section and try this:: + + { "execute": "query-alarm-clock" } + { + "return": { + "next-deadline": 2368219, + "clock-name": "dynticks" + } + } + + +The HMP command +~~~~~~~~~~~~~~~ + +Here's the HMP counterpart of the query-alarm-clock command:: + + void hmp_info_alarm_clock(Monitor *mon) + { + QemuAlarmClock *clock; + Error *err = NULL; + + clock = qmp_query_alarm_clock(&err); + if (hmp_handle_error(mon, err)) { + return; + } + + monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name); + if (clock->has_next_deadline) { + monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n", + clock->next_deadline); + } + + qapi_free_QemuAlarmClock(clock); + } + +It's important to notice that hmp_info_alarm_clock() calls +qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock(). +For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME() +function and that's what you have to use to free the types you define and +qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section). +If the QMP call returns a string, then you should g_free() to free it. + +Also note that hmp_info_alarm_clock() performs error handling. That's not +strictly required if you're sure the QMP function doesn't return errors, but +it's good practice to always check for errors. + +Another important detail is that HMP's "info" commands don't go into the +hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined +in the monitor/misc.c file. The entry for the "info alarmclock" follows:: + + { + .name = "alarmclock", + .args_type = "", + .params = "", + .help = "show information about the alarm clock", + .cmd = hmp_info_alarm_clock, + }, + +To test this, run qemu and type "info alarmclock" in the user monitor. 
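+
+An illustrative session might look like this (the exact values will differ)::
+
+ (qemu) info alarmclock
+ Alarm clock method in use: 'dynticks'
+ Next alarm will fire in 2368219 nanoseconds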
+
+
+Returning Lists
+~~~~~~~~~~~~~~~
+
+For this example, we're going to return all available methods for the timer
+alarm, which is pretty much what the command-line option "-clock ?" does,
+except that we're also going to indicate which method is in use.
+
+The first step is to define a new type::
+
+ ##
+ # @TimerAlarmMethod
+ #
+ # Timer alarm method information.
+ #
+ # @method-name: The method's name.
+ #
+ # @current: true if this alarm method is currently in use, false otherwise
+ #
+ # Since: 1.0
+ ##
+ { 'type': 'TimerAlarmMethod',
+   'data': { 'method-name': 'str', 'current': 'bool' } }
+
+The command will be called "query-alarm-methods"; here is its schema
+specification::
+
+ ##
+ # @query-alarm-methods
+ #
+ # Returns information about available alarm methods.
+ #
+ # Returns: a list of @TimerAlarmMethod for each method
+ #
+ # Since: 1.0
+ ##
+ { 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
+
+Notice the syntax for returning lists, "'returns': ['TimerAlarmMethod']"; this
+should be read as "returns a list of TimerAlarmMethod instances".
+
+The C implementation follows::
+
+ TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
+ {
+     TimerAlarmMethodList *method_list = NULL;
+     const struct qemu_alarm_timer *p;
+     bool current = true;
+
+     for (p = alarm_timers; p->name; p++) {
+         TimerAlarmMethod *value = g_malloc0(sizeof(*value));
+         value->method_name = g_strdup(p->name);
+         value->current = current;
+         QAPI_LIST_PREPEND(method_list, value);
+         current = false;
+     }
+
+     return method_list;
+ }
+
+The most important difference from the previous examples is the
+TimerAlarmMethodList type, which is automatically generated by the QAPI from
+the TimerAlarmMethod type.
+
+Each list node is represented by a TimerAlarmMethodList instance. The
+QAPI_LIST_PREPEND() macro allocates a new node and prepends it to the list.
+We also have to allocate the node's contents, which is stored in its "value"
+member. In our example, the "value" member is a pointer to a TimerAlarmMethod
+instance.
+
+Notice that the "current" variable is used as "true" only in the first
+iteration of the loop. That's because the alarm timer method in use is the
+first element of the alarm_timers array. Also notice that QAPI lists are handled
+by hand and we return the head of the list.
+
+Now build qemu, run it as explained in the "Testing" section and try our new
+command::
+
+ { "execute": "query-alarm-methods" }
+ {
+    "return": [
+        {
+            "current": false,
+            "method-name": "unix"
+        },
+        {
+            "current": true,
+            "method-name": "dynticks"
+        }
+    ]
+ }
+
+The HMP counterpart is a bit more complex than previous examples because it
+has to traverse the list; it's shown below for reference::
+
+ void hmp_info_alarm_methods(Monitor *mon)
+ {
+     TimerAlarmMethodList *method_list, *method;
+     Error *err = NULL;
+
+     method_list = qmp_query_alarm_methods(&err);
+     if (hmp_handle_error(mon, err)) {
+         return;
+     }
+
+     for (method = method_list; method; method = method->next) {
+         monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
+                        method->value->method_name);
+     }
+
+     qapi_free_TimerAlarmMethodList(method_list);
+ }
+
+Writing a debugging aid returning unstructured text
+---------------------------------------------------
+
+As discussed in section `Modelling data in QAPI`_, it is required that
+commands intended for machine usage use fine-grained QAPI data types.
+The exception to this rule applies when the command is solely intended
+as a debugging aid and allows for returning unstructured text. This is
+commonly needed for query commands that report aspects of QEMU's
+internal state that are useful to human operators.
+
+In this example we will consider a simplified variant of the HMP
+command ``info roms``. Following the earlier rules, this command will
+need to live under the ``x-`` name prefix, so its QMP implementation
+will be called ``x-query-roms``. It will have no parameters and will
+return a single text string::
+
+ { 'struct': 'HumanReadableText',
+   'data': { 'human-readable-text': 'str' } }
+
+ { 'command': 'x-query-roms',
+   'returns': 'HumanReadableText' }
+
+The ``HumanReadableText`` struct is intended to be used for all
+commands under the ``x-`` name prefix that return unstructured text
+targeted at humans. It should never be used for commands outside
+the ``x-`` name prefix, as those should be using structured QAPI types.
+
+Implementing the QMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The QMP implementation will typically involve creating a ``GString``
+object and printing formatted data into it::
+
+ HumanReadableText *qmp_x_query_roms(Error **errp)
+ {
+     g_autoptr(GString) buf = g_string_new("");
+     Rom *rom;
+
+     QTAILQ_FOREACH(rom, &roms, next) {
+         g_string_append_printf(buf, "%s size=0x%06zx name=\"%s\"\n",
+                                memory_region_name(rom->mr),
+                                rom->romsize,
+                                rom->name);
+     }
+
+     return human_readable_text_from_str(buf);
+ }
+
+
+Implementing the HMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the QMP command is in place, we can also make it available in
+the human monitor (HMP) as shown in previous examples. The HMP
+implementations will all look fairly similar, as all they need do is
+invoke the QMP command and then print the resulting text or error
+message. Here's the implementation of the "info roms" HMP command::
+
+ void hmp_info_roms(Monitor *mon, const QDict *qdict)
+ {
+     Error *err = NULL;
+     g_autoptr(HumanReadableText) info = qmp_x_query_roms(&err);
+
+     if (hmp_handle_error(mon, err)) {
+         return;
+     }
+     monitor_printf(mon, "%s", info->human_readable_text);
+ }
+
+Also, you have to add the function's prototype to the hmp.h file.
+
+There's one last step to actually make the command available to
+monitor users: we should add it to the hmp-commands-info.hx file::
+
+ {
+     .name = "roms",
+     .args_type = "",
+     .params = "",
+     .help = "show roms",
+     .cmd = hmp_info_roms,
+ },
+
+The case of writing an HMP info handler that calls a no-parameter QMP query
+command is quite common. To simplify the implementation there is a general
+purpose HMP info handler for this scenario. All that is required to expose
+a no-parameter QMP query command via HMP is to declare it using the
+'.cmd_info_hrt' field to point to the QMP handler, and leave the '.cmd'
+field NULL::
+
+ {
+     .name = "roms",
+     .args_type = "",
+     .params = "",
+     .help = "show roms",
+     .cmd_info_hrt = qmp_x_query_roms,
+ },
diff --git a/docs/hyperv.txt b/docs/hyperv.txt
new file mode 100644
index 000000000..0417c183a
--- /dev/null
+++ b/docs/hyperv.txt
@@ -0,0 +1,255 @@
+Hyper-V Enlightenments
+======================
+
+
+1. Description
+===============
+In some cases, when implementing a hardware interface in software is slow, KVM
+implements its own paravirtualized interfaces. This works well for Linux as
+guest support for such features is added simultaneously with the feature itself.
+It may, however, be hard-to-impossible to add support for these interfaces to +proprietary OSes, namely, Microsoft Windows. + +KVM on x86 implements Hyper-V Enlightenments for Windows guests. These features +make Windows and Hyper-V guests think they're running on top of a Hyper-V +compatible hypervisor and use Hyper-V specific features. + + +2. Setup +========= +No Hyper-V enlightenments are enabled by default by either KVM or QEMU. In +QEMU, individual enlightenments can be enabled through CPU flags, e.g: + + qemu-system-x86_64 --enable-kvm --cpu host,hv_relaxed,hv_vpindex,hv_time, ... + +Sometimes there are dependencies between enlightenments, QEMU is supposed to +check that the supplied configuration is sane. + +When any set of the Hyper-V enlightenments is enabled, QEMU changes hypervisor +identification (CPUID 0x40000000..0x4000000A) to Hyper-V. KVM identification +and features are kept in leaves 0x40000100..0x40000101. + + +3. Existing enlightenments +=========================== + +3.1. hv-relaxed +================ +This feature tells guest OS to disable watchdog timeouts as it is running on a +hypervisor. It is known that some Windows versions will do this even when they +see 'hypervisor' CPU flag. + +3.2. hv-vapic +============== +Provides so-called VP Assist page MSR to guest allowing it to work with APIC +more efficiently. In particular, this enlightenment allows paravirtualized +(exit-less) EOI processing. + +3.3. hv-spinlocks=xxx +====================== +Enables paravirtualized spinlocks. The parameter indicates how many times +spinlock acquisition should be attempted before indicating the situation to the +hypervisor. A special value 0xffffffff indicates "never notify". + +3.4. hv-vpindex +================ +Provides HV_X64_MSR_VP_INDEX (0x40000002) MSR to the guest which has Virtual +processor index information. This enlightenment makes sense in conjunction with +hv-synic, hv-stimer and other enlightenments which require the guest to know its +Virtual Processor indices (e.g. when VP index needs to be passed in a +hypercall). + +3.5. hv-runtime +================ +Provides HV_X64_MSR_VP_RUNTIME (0x40000010) MSR to the guest. The MSR keeps the +virtual processor run time in 100ns units. This gives guest operating system an +idea of how much time was 'stolen' from it (when the virtual CPU was preempted +to perform some other work). + +3.6. hv-crash +============== +Provides HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P5 (0x40000100..0x40000105) and +HV_X64_MSR_CRASH_CTL (0x40000105) MSRs to the guest. These MSRs are written to +by the guest when it crashes, HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P5 MSRs +contain additional crash information. This information is outputted in QEMU log +and through QAPI. +Note: unlike under genuine Hyper-V, write to HV_X64_MSR_CRASH_CTL causes guest +to shutdown. This effectively blocks crash dump generation by Windows. + +3.7. hv-time +============= +Enables two Hyper-V-specific clocksources available to the guest: MSR-based +Hyper-V clocksource (HV_X64_MSR_TIME_REF_COUNT, 0x40000020) and Reference TSC +page (enabled via MSR HV_X64_MSR_REFERENCE_TSC, 0x40000021). Both clocksources +are per-guest, Reference TSC page clocksource allows for exit-less time stamp +readings. Using this enlightenment leads to significant speedup of all timestamp +related operations. + +3.8. hv-synic +============== +Enables Hyper-V Synthetic interrupt controller - an extension of a local APIC. 
+When enabled, this enlightenment provides additional communication facilities
+to the guest: SynIC messages and Events. This is a pre-requisite for
+implementing VMBus devices (not yet in QEMU). Additionally, this enlightenment
+is needed to enable Hyper-V synthetic timers. SynIC is controlled through MSRs
+HV_X64_MSR_SCONTROL..HV_X64_MSR_EOM (0x40000080..0x40000084) and
+HV_X64_MSR_SINT0..HV_X64_MSR_SINT15 (0x40000090..0x4000009F).
+
+Requires: hv-vpindex
+
+3.9. hv-stimer
+===============
+Enables Hyper-V synthetic timers. There are four synthetic timers per virtual
+CPU controlled through HV_X64_MSR_STIMER0_CONFIG..HV_X64_MSR_STIMER3_COUNT
+(0x400000B0..0x400000B7) MSRs. These timers can work either in single-shot or
+periodic mode. It is known that certain Windows versions revert to using HPET
+(or even RTC when HPET is unavailable) extensively when this enlightenment is
+not provided; this can lead to significant CPU consumption, even when the
+virtual CPU is idle.
+
+Requires: hv-vpindex, hv-synic, hv-time
+
+3.10. hv-tlbflush
+==================
+Enables the paravirtualized TLB shoot-down mechanism. On the x86 architecture,
+a remote TLB flush requires sending IPIs and waiting for other CPUs to perform
+the local TLB flush. In a virtualized environment, some virtual CPUs may not
+even be scheduled at the time of the call and may not require flushing (or,
+flushing may be postponed until the virtual CPU is scheduled). The hv-tlbflush
+enlightenment implements TLB shoot-down through the hypervisor, enabling this
+optimization.
+
+Requires: hv-vpindex
+
+3.11. hv-ipi
+=============
+Enables the paravirtualized IPI send mechanism. The
+HvCallSendSyntheticClusterIpi hypercall may target more than 64 virtual CPUs
+simultaneously; doing the same through the APIC requires more than one access
+(and thus exit to the hypervisor).
+
+Requires: hv-vpindex
+
+3.12. hv-vendor-id=xxx
+=======================
+This changes Hyper-V identification in CPUID 0x40000000.EBX-EDX from the default
+"Microsoft Hv". The parameter should be no longer than 12 characters. According
+to the specification, guests shouldn't use this information and it is unknown
+if there is a Windows version which acts differently.
+Note: hv-vendor-id is not an enlightenment and thus doesn't enable Hyper-V
+identification when specified without some other enlightenment.
+
+3.13. hv-reset
+===============
+Provides the HV_X64_MSR_RESET (0x40000003) MSR to the guest, allowing it to
+reset itself by writing to it. Even when this MSR is enabled, it is not a
+recommended way for Windows to perform system reboot and thus it may not be
+used.
+
+3.14. hv-frequencies
+=====================
+Provides HV_X64_MSR_TSC_FREQUENCY (0x40000022) and HV_X64_MSR_APIC_FREQUENCY
+(0x40000023) allowing the guest to get its TSC/APIC frequencies without doing
+measurements.
+
+3.15. hv-reenlightenment
+=========================
+This enlightenment is nested specific: it targets Hyper-V on KVM guests. When
+enabled, it provides HV_X64_MSR_REENLIGHTENMENT_CONTROL (0x40000106),
+HV_X64_MSR_TSC_EMULATION_CONTROL (0x40000107) and HV_X64_MSR_TSC_EMULATION_STATUS
+(0x40000108) MSRs allowing the guest to get notified when the TSC frequency
+changes (only happens on migration) and keep using the old frequency (through
+emulation in the hypervisor) until it is ready to switch to the new one. This,
+in conjunction with hv-frequencies, allows Hyper-V on KVM to pass a stable
+clocksource (Reference TSC page) to its own guests.
+ +Note, KVM doesn't fully support re-enlightenment notifications and doesn't +emulate TSC accesses after migration so 'tsc-frequency=' CPU option also has to +be specified to make migration succeed. The destination host has to either have +the same TSC frequency or support TSC scaling CPU feature. + +Recommended: hv-frequencies + +3.16. hv-evmcs +=============== +The enlightenment is nested specific, it targets Hyper-V on KVM guests. When +enabled, it provides Enlightened VMCS version 1 feature to the guest. The feature +implements paravirtualized protocol between L0 (KVM) and L1 (Hyper-V) +hypervisors making L2 exits to the hypervisor faster. The feature is Intel-only. +Note: some virtualization features (e.g. Posted Interrupts) are disabled when +hv-evmcs is enabled. It may make sense to measure your nested workload with and +without the feature to find out if enabling it is beneficial. + +Requires: hv-vapic + +3.17. hv-stimer-direct +======================= +Hyper-V specification allows synthetic timer operation in two modes: "classic", +when expiration event is delivered as SynIC message and "direct", when the event +is delivered via normal interrupt. It is known that nested Hyper-V can only +use synthetic timers in direct mode and thus 'hv-stimer-direct' needs to be +enabled. + +Requires: hv-vpindex, hv-synic, hv-time, hv-stimer + +3.18. hv-avic (hv-apicv) +======================= +The enlightenment allows to use Hyper-V SynIC with hardware APICv/AVIC enabled. +Normally, Hyper-V SynIC disables these hardware feature and suggests the guest +to use paravirtualized AutoEOI feature. +Note: enabling this feature on old hardware (without APICv/AVIC support) may +have negative effect on guest's performance. + +3.19. hv-no-nonarch-coresharing=on/off/auto +=========================================== +This enlightenment tells guest OS that virtual processors will never share a +physical core unless they are reported as sibling SMT threads. This information +is required by Windows and Hyper-V guests to properly mitigate SMT related CPU +vulnerabilities. +When the option is set to 'auto' QEMU will enable the feature only when KVM +reports that non-architectural coresharing is impossible, this means that +hyper-threading is not supported or completely disabled on the host. This +setting also prevents migration as SMT settings on the destination may differ. +When the option is set to 'on' QEMU will always enable the feature, regardless +of host setup. To keep guests secure, this can only be used in conjunction with +exposing correct vCPU topology and vCPU pinning. + +3.20. hv-version-id-{build,major,minor,spack,sbranch,snumber} +============================================================= +This changes Hyper-V version identification in CPUID 0x40000002.EAX-EDX from the +default (WS2016). +- hv-version-id-build sets 'Build Number' (32 bits) +- hv-version-id-major sets 'Major Version' (16 bits) +- hv-version-id-minor sets 'Minor Version' (16 bits) +- hv-version-id-spack sets 'Service Pack' (32 bits) +- hv-version-id-sbranch sets 'Service Branch' (8 bits) +- hv-version-id-snumber sets 'Service Number' (24 bits) + +Note: hv-version-id-* are not enlightenments and thus don't enable Hyper-V +identification when specified without any other enlightenments. + +4. Supplementary features +========================= + +4.1. hv-passthrough +=================== +In some cases (e.g. 
during development) it may make sense to use QEMU in +'pass-through' mode and give Windows guests all enlightenments currently +supported by KVM. This pass-through mode is enabled by "hv-passthrough" CPU +flag. +Note: "hv-passthrough" flag only enables enlightenments which are known to QEMU +(have corresponding "hv-*" flag) and copies "hv-spinlocks="/"hv-vendor-id=" +values from KVM to QEMU. "hv-passthrough" overrides all other "hv-*" settings on +the command line. Also, enabling this flag effectively prevents migration as the +list of enabled enlightenments may differ between target and destination hosts. + +4.2. hv-enforce-cpuid +===================== +By default, KVM allows the guest to use all currently supported Hyper-V +enlightenments when Hyper-V CPUID interface was exposed, regardless of if +some features were not announced in guest visible CPUIDs. 'hv-enforce-cpuid' +feature alters this behavior and only allows the guest to use exposed Hyper-V +enlightenments. + + +5. Useful links +================ +Hyper-V Top Level Functional specification and other information: +https://github.com/MicrosoftDocs/Virtualization-Documentation diff --git a/docs/igd-assign.txt b/docs/igd-assign.txt new file mode 100644 index 000000000..e17bb5078 --- /dev/null +++ b/docs/igd-assign.txt @@ -0,0 +1,133 @@ +Intel Graphics Device (IGD) assignment with vfio-pci +==================================================== + +IGD has two different modes for assignment using vfio-pci: + +1) Universal Pass-Through (UPT) mode: + + In this mode the IGD device is added as a *secondary* (ie. non-primary) + graphics device in combination with an emulated primary graphics device. + This mode *requires* guest driver support to remove the external + dependencies generally associated with IGD (see below). Those guest + drivers only support this mode for Broadwell and newer IGD, according to + Intel. Additionally, this mode by default, and as officially supported + by Intel, does not support direct video output. The intention is to use + this mode either to provide hardware acceleration to the emulated graphics + or to use this mode in combination with guest-based remote access software, + for example VNC (see below for optional output support). This mode + theoretically has no device specific handling dependencies on vfio-pci or + the VM firmware. + +2) "Legacy" mode: + + In this mode the IGD device is intended to be the primary and exclusive + graphics device in the VM[1], as such QEMU does not facilitate any sort + of remote graphics to the VM in this mode. A connected physical monitor + is the intended output device for IGD. This mode includes several + requirements and restrictions: + + * IGD must be given address 02.0 on the PCI root bus in the VM + * The host kernel must support vfio extensions for IGD (v4.6) + * vfio VGA support very likely needs to be enabled in the host kernel + * The VM firmware must support specific fw_cfg enablers for IGD + * The VM machine type must support a PCI host bridge at 00.0 (standard) + * The VM machine type must provide or allow to be created a special + ISA/LPC bridge device (vfio-pci-igd-lpc-bridge) on the root bus at + PCI address 1f.0. + * The IGD device must have a VGA ROM, either provided via the romfile + option or loaded automatically through vfio (standard). rombar=0 + will disable legacy mode support. + * Hotplug of the IGD device is not supported. + * The IGD device must be a SandyBridge or newer model device. 
+ +For either mode, depending on the host kernel, the i915 driver in the host +may generate faults and errors upon re-binding to an IGD device after it +has been assigned to a VM. It's therefore generally recommended to prevent +such driver binding unless the host driver is known to work well for this. +There are numerous ways to do this, i915 can be blacklisted on the host, +the driver_override option can be used to ensure that only vfio-pci can bind +to the device on the host[2], virsh nodedev-detach can be used to bind the +device to vfio drivers and then managed='no' set in the VM xml to prevent +re-binding to i915, etc. Also note that IGD is also typically the primary +graphics in the host and special options may be required beyond simply +blacklisting i915 or using pci-stub/vfio-pci to take ownership of IGD as a +PCI class device. Lower level drivers exist that may still claim the device. +It may therefore be necessary to use kernel boot options video=vesafb:off or +video=efifb:off (depending on host BIOS/UEFI) or these can be combined to +a catch-all, video=vesafb:off,efifb:off. Error messages such as: + + Failed to mmap 0000:00:02.0 BAR <>. Performance may be slow + +are a good indicator that such a problem exists. The host files /proc/iomem +and /proc/ioports are often useful for identifying drivers consuming ranges +of the device to cause such conflicts. + +Additionally, IGD device are known to generate small numbers of DMAR faults +when initially assigned. It is believed that this is simply the IGD attempting +to access the reserved GTT space after reset, which it no longer has access to +when accessed from userspace. So long as the DMAR faults are small in number +and most importantly, not ongoing, these are not an indication of an error. + +Additionally++, analog VGA output (as opposed to digital outputs like HDMI, +DVI, or DisplayPort) may be unsupported in some use cases. In the author's +experience, even DP to VGA adapters can be troublesome while adapters between +digital formats work well. + +Usage +===== +The intention is for IGD assignment to be transparent for users and thus for +management tools like libvirt. To make use of legacy mode, simply remove all +other graphics options and use "-nographic" and either "-vga none" or +"-nodefaults", along with adding the device using vfio-pci: + + -device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x2 + +For UPT mode, retain the default emulated graphics and simply add the vfio-pci +device making use of any other bus address other than 02.0. libvirt will +default to assigning the device a UPT compatible address while legacy mode +users will need to manually edit the XML if using a tool like virt-manager +where the VM device address is not expressly specified. + +An experimental vfio-pci option also exists to enable OpRegion, and thus +external monitor support, for UPT mode. This can be enabled by adding +"x-igd-opregion=on" to the vfio-pci device options for the IGD device. As +with legacy mode, this requires the host to support features introduced in +the v4.6 kernel. If Intel chooses to embrace this support, the option may +be made non-experimental in the future, opening it to libvirt support. + +Developer ABI +============= +Legacy mode IGD support imposes two fw_cfg requirements on the VM firmware: + +1) "etc/igd-opregion" + + This fw_cfg file exposes the OpRegion for the IGD device. 
A reserved + region should be created below 4GB (recommended 4KB alignment), sized + sufficient for the fw_cfg file size, and the content of this file copied + to it. The dword based address of this reserved memory region must also + be written to the ASLS register at offset 0xFC on the IGD device. It is + recommended that firmware should make use of this fw_cfg entry for any + PCI class VGA device with Intel vendor ID. Multiple of such devices + within a VM is undefined. + +2) "etc/igd-bdsm-size" + + This fw_cfg file contains an 8-byte, little endian integer indicating + the size of the reserved memory region required for IGD stolen memory. + Firmware must allocate a reserved memory below 4GB with required 1MB + alignment equal to this size. Additionally the base address of this + reserved region must be written to the dword BDSM register in PCI config + space of the IGD device at offset 0x5C. As this support is related to + running the IGD ROM, which has other dependencies on the device appearing + at guest address 00:02.0, it's expected that this fw_cfg file is only + relevant to a single PCI class VGA device with Intel vendor ID, appearing + at PCI bus address 00:02.0. + +Footnotes +========= +[1] Nothing precludes adding additional emulated or assigned graphics devices + as non-primary, other than the combination typically not working. I only + intend to set user expectations, others are welcome to find working + combinations or fix whatever issues prevent this from working in the common + case. +[2] # echo "vfio-pci" > /sys/bus/pci/devices/0000:00:02.0/driver_override diff --git a/docs/image-fuzzer.txt b/docs/image-fuzzer.txt new file mode 100644 index 000000000..279cc8c80 --- /dev/null +++ b/docs/image-fuzzer.txt @@ -0,0 +1,239 @@ +# Specification for the fuzz testing tool +# +# Copyright (C) 2014 Maria Kustova <maria.k@catit.be> +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. + + +Image fuzzer +============ + +Description +----------- + +The goal of the image fuzzer is to catch crashes of qemu-io/qemu-img +by providing to them randomly corrupted images. +Test images are generated from scratch and have valid inner structure with some +elements, e.g. L1/L2 tables, having random invalid values. + + +Test runner +----------- + +The test runner generates test images, executes tests utilizing generated +images, indicates their results and collects all test related artifacts (logs, +core dumps, test images, backing files). +The test means execution of all available commands under test with the same +generated test image. +By default, the test runner generates new tests and executes them until +keyboard interruption. But if a test seed is specified via the '--seed' runner +parameter, then only one test with this seed will be executed, after its finish +the runner will exit. + +The runner uses an external image fuzzer to generate test images. 
An image
+generator should be specified as a mandatory parameter of the test runner.
+For details about the interactions between the runner and fuzzers, see
+"Module interfaces".
+
+The runner activates generation of core dumps during test executions, but it
+assumes that core dumps will be generated in the current working directory.
+For comprehensive test results, please set up your test environment
+properly.
+
+Paths to binaries under test (SUTs) ``qemu-img`` and ``qemu-io`` are retrieved
+from environment variables. If the environment check fails, the runner will
+use SUTs installed in system paths.
+``qemu-img`` is required for creation of backing files, so it's mandatory to set
+the related environment variable if it's not installed in the system path.
+For details about environment variables, see qemu-iotests/check.
+
+The runner accepts a JSON array of fields expected to be fuzzed via the
+'--config' argument, e.g.
+
+  '[["feature_name_table"], ["header", "l1_table_offset"]]'
+
+Each sublist can have one or two strings defining image structure elements.
+In the latter case, a parent element should be placed in the first position,
+and a field name in the second one.
+
+The runner accepts a list of commands under test as a JSON array via
+the '--command' argument. Each command is a list containing a SUT and all its
+arguments, e.g.
+
+  runner.py -c '[["qemu-io", "$test_img", "-c", "write $off $len"]]'
+  /tmp/test ../qcow2
+
+For variable arguments, the following aliases can be used:
+ - $test_img for a fuzzed image
+ - $off for an offset in the fuzzed image
+ - $len for a data size
+
+Values for the last two aliases will be generated based on the size of the
+virtual disk of the generated image.
+If no commands are specified, the runner will execute commands from the
+default list:
+ - qemu-img check
+ - qemu-img info
+ - qemu-img convert
+ - qemu-io -c read
+ - qemu-io -c write
+ - qemu-io -c aio_read
+ - qemu-io -c aio_write
+ - qemu-io -c flush
+ - qemu-io -c discard
+ - qemu-io -c truncate
+
+
+Qcow2 image generator
+---------------------
+
+The 'qcow2' generator is a Python package providing the 'create_image' method
+as its single public API. See details in the 'Test runner/image fuzzer' chapter
+of 'Module interfaces'.
+
+Qcow2 contains two submodules: fuzz.py and layout.py.
+
+'fuzz.py' contains all fuzzing functions, one per image field. It's assumed
+that after code analysis every field will have its own constraints for its
+value. For now, only generic, potentially dangerous values are used, e.g. type
+limits for integers or unsafe symbols such as '%s' for strings. For bitmasks,
+a random number of bits is set to one. All fuzzed values are checked for
+non-equality against the current valid value of the field. In case of equality
+the value will be regenerated.
+
+'layout.py' creates a random valid image, fuzzes a random subset of the image
+fields using the 'fuzz.py' module and writes a fuzzed image to the file
+specified. If a fuzzer configuration is specified, then it has the following
+interpretation:
+
+ 1. If a list contains a parent image element only, then some random portion
+    of fields of this element will be fuzzed for every test.
+    The same behavior is applied to the entire image if no configuration is
+    used. This case is useful for test specialization.
+
+ 2. If a list contains a parent element and a field name, then the field
+    will always be fuzzed for every test. This case is useful for regression
+    testing.
+
+The generator can create header fields, header extensions, L1/L2 tables and
+refcount table and blocks.
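+
+For instance, a reproducible run that always fuzzes the qcow2 L1 table offset
+could be invoked as follows (the paths are illustrative):
+
+  runner.py --seed 12345 --config '[["header", "l1_table_offset"]]'
+  /tmp/test ../qcow2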
+ +Module interfaces +----------------- + +* Test runner/image fuzzer + +The runner calls an image generator specifying the path to a test image file, +path to a backing file and its format and a fuzzer configuration. +An image generator is expected to provide a + + 'create_image(test_img_path, backing_file_path=None, + backing_file_format=None, fuzz_config=None)' + +method that creates a test image, writes it to the specified file and returns +the size of the virtual disk. +The file should be created if it doesn't exist or overwritten otherwise. +fuzz_config has a form of a list of lists. Every sublist can have one +or two elements: first element is a name of a parent image element, second one +if exists is a name of a field in this element. +Example, + [['header', 'l1_table_offset'], + ['header', 'nb_snapshots'], + ['feature_name_table']] + +Random seed is set by the runner at every test execution for the regression +purpose, so an image generator is not recommended to modify it internally. + + +Overall fuzzer requirements +=========================== + +Input data: +---------- + + - image template (generator) + - work directory + - action vector (optional) + - seed (optional) + - SUT and its arguments (optional) + + +Fuzzer requirements: +------------------- + +1. Should be able to inject random data +2. Should be able to select a random value from the manually pregenerated + vector (boundary values, e.g. max/min cluster size) +3. Image template should describe a general structure invariant for all + test images (image format description) +4. Image template should be autonomous and other fuzzer parts should not + rely on it +5. Image template should contain reference rules (not only block+size + description) +6. Should generate the test image with the correct structure based on an image + template +7. Should accept a seed as an argument (for regression purpose) +8. Should generate a seed if it is not specified as an input parameter. +9. The same seed should generate the same image for the same action vector, + specified or generated. +10. Should accept a vector of actions as an argument (for test reproducing and + for test case specification, e.g. group of tests for header structure, + group of test for snapshots, etc) +11. Action vector should be randomly generated from the pool of available + actions, if it is not specified as an input parameter +12. Pool of actions should be defined automatically based on an image template +13. Should accept a SUT and its call parameters as an argument or select them + randomly otherwise. As far as it's expected to be rarely changed, the list + of all possible test commands can be available in the test runner + internally. +14. Should support an external cancellation of a test run +15. Seed should be logged (for regression purpose) +16. All files related to a test result should be collected: a test image, + SUT logs, fuzzer logs and crash dumps +17. Should be compatible with python version 2.4-2.7 +18. Usage of external libraries should be limited as much as possible. + + +Image formats: +------------- + +Main target image format is qcow2, but support of image templates should +provide an ability to add any other image format. + + +Effectiveness: +------------- + +The fuzzer can be controlled via template, seed and action vector; +it makes the fuzzer itself invariant to an image format and test logic. +It should be able to perform rather complex and precise tests, that can be +specified via an action vector. 
Otherwise, knowledge about the image structure
+allows the fuzzer to generate the pool of all areas that can be fuzzed,
+randomly select some of them, and so compose its own action vector.
+Also, the complexity of a template defines the complexity of the fuzzer, so
+its functionality can vary from simple model-independent fuzzing to smart
+model-based fuzzing.
+
+
+Glossary:
+--------
+
+Action vector is a sequence of structure elements retrieved from an image
+format; each of them will be fuzzed for the test image. It's a subset of
+elements of the action pool. Example: header, refcount table, etc.
+Action pool is all available elements of an image structure; it is generated
+automatically from an image template.
+Image template is a formal description of an image structure and relations
+between image blocks.
+Test image is an output image of the fuzzer defined by the current seed and
+action vector.
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 000000000..0b9ee9901
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,20 @@
+.. QEMU documentation master file, created by
+   sphinx-quickstart on Thu Jan 31 16:40:14 2019.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+================================
+Welcome to QEMU's documentation!
+================================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   about/index
+   system/index
+   user/index
+   tools/index
+   interop/index
+   specs/index
+   devel/index
diff --git a/docs/interop/barrier.rst b/docs/interop/barrier.rst
new file mode 100644
index 000000000..055f2c1ae
--- /dev/null
+++ b/docs/interop/barrier.rst
@@ -0,0 +1,426 @@
+Barrier client protocol
+=======================
+
+QEMU's ``input-barrier`` device implements the client end of
+the KVM (Keyboard-Video-Mouse) software
+`Barrier <https://github.com/debauchee/barrier>`__.
+
+This document briefly describes the protocol as we implement it.
+
+Message format
+--------------
+
+The message format between the server and client is in two parts:
+
+#. the payload length, a 32-bit integer in network byte order
+#. the payload
+
+The payload starts with a 4-byte string (without NUL) which is the
+command. The first command exchanged between the server and the client
+("Barrier") is the only command not encoded on 4 bytes.
+The remaining part of the payload is decoded according to the command.
+
+Protocol Description
+--------------------
+
+This comes from ``barrier/src/lib/barrier/protocol_types.h``.
+
+barrierCmdHello "Barrier"
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t major, int16_t minor }``
+Description:
+  Say hello to client
+
+  ``major`` = protocol major version number supported by server
+
+  ``minor`` = protocol minor version number supported by server
+
+barrierCmdHelloBack "Barrier"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  client -> server
+Parameters:
+  ``{ int16_t major, int16_t minor, char *name }``
+Description:
+  Respond to hello from server
+
+  ``major`` = protocol major version number supported by client
+
+  ``minor`` = protocol minor version number supported by client
+
+  ``name`` = client name
+
+barrierCmdDInfo "DINF"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  client -> server
+Parameters:
+  ``{ int16_t x_origin, int16_t y_origin, int16_t width, int16_t height, int16_t x, int16_t y }``
+Description:
+  The client screen must send this message in response to the
+  barrierCmdQInfo message.
It must also send this message when the + screen's resolution changes. In this case, the client screen should + ignore any barrierCmdDMouseMove messages until it receives a + barrierCmdCInfoAck in order to prevent attempts to move the mouse off + the new screen area. + +barrierCmdCNoop "CNOP" +^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + client -> server +Parameters: + None +Description: + No operation + +barrierCmdCClose "CBYE" +^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + None +Description: + Close connection + +barrierCmdCEnter "CINN" +^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + ``{ int16_t x, int16_t y, int32_t seq, int16_t modifier }`` +Description: + Enter screen. + + ``x``, ``y`` = entering screen absolute coordinates + + ``seq`` = sequence number, which is used to order messages between + screens. the secondary screen must return this number + with some messages + + ``modifier`` = modifier key mask. this will have bits set for each + toggle modifier key that is activated on entry to the + screen. the secondary screen should adjust its toggle + modifiers to reflect that state. + +barrierCmdCLeave "COUT" +^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + None +Description: + Leaving screen. the secondary screen should send clipboard data in + response to this message for those clipboards that it has grabbed + (i.e. has sent a barrierCmdCClipboard for and has not received a + barrierCmdCClipboard for with a greater sequence number) and that + were grabbed or have changed since the last leave. + +barrierCmdCClipboard "CCLP" +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + ``{ int8_t id, int32_t seq }`` +Description: + Grab clipboard. Sent by screen when some other app on that screen + grabs a clipboard. + + ``id`` = the clipboard identifier + + ``seq`` = sequence number. Client must use the sequence number passed in + the most recent barrierCmdCEnter. the server always sends 0. + +barrierCmdCScreenSaver "CSEC" +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + ``{ int8_t started }`` +Description: + Screensaver change. + + ``started`` = Screensaver on primary has started (1) or closed (0) + +barrierCmdCResetOptions "CROP" +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + None +Description: + Reset options. Client should reset all of its options to their + defaults. + +barrierCmdCInfoAck "CIAK" +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + None +Description: + Resolution change acknowledgment. Sent by server in response to a + client screen's barrierCmdDInfo. This is sent for every + barrierCmdDInfo, whether or not the server had sent a barrierCmdQInfo. + +barrierCmdCKeepAlive "CALV" +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Direction: + server -> client +Parameters: + None +Description: + Keep connection alive. Sent by the server periodically to verify + that connections are still up and running. clients must reply in + kind on receipt. if the server gets an error sending the message or + does not receive a reply within a reasonable time then the server + disconnects the client. if the client doesn't receive these (or any + message) periodically then it should disconnect from the server. the + appropriate interval is defined by an option. 
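+
+As a concrete illustration of the framing described in `Message format`_ and
+of the keep-alive rule above, here is a minimal client-side sketch. It is
+illustrative only, not QEMU's actual implementation, and all names are made
+up::
+
+    #include <stdint.h>
+    #include <string.h>
+    #include <unistd.h>      /* read(), write() */
+    #include <arpa/inet.h>   /* ntohl(), htonl() */
+
+    /* Read exactly len bytes; return 0 on success, -1 on EOF or error. */
+    static int read_full(int fd, void *buf, size_t len)
+    {
+        uint8_t *p = buf;
+
+        while (len > 0) {
+            ssize_t n = read(fd, p, len);
+            if (n <= 0) {
+                return -1;
+            }
+            p += n;
+            len -= n;
+        }
+        return 0;
+    }
+
+    /* Read one framed message and answer a "CALV" keep-alive in kind. */
+    static int handle_one_message(int fd)
+    {
+        uint32_t len;
+        char payload[512];
+
+        if (read_full(fd, &len, sizeof(len)) < 0) {
+            return -1;
+        }
+        len = ntohl(len);                     /* length is network endian */
+        if (len < 4 || len > sizeof(payload)) {
+            return -1;                        /* malformed or oversized */
+        }
+        if (read_full(fd, payload, len) < 0) {
+            return -1;
+        }
+        if (memcmp(payload, "CALV", 4) == 0) {
+            uint32_t reply_len = htonl(4);
+            write(fd, &reply_len, sizeof(reply_len));
+            write(fd, "CALV", 4);
+        }
+        return 0;
+    }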
+
+barrierCmdDKeyDown "DKDN"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t keyid, int16_t modifier [,int16_t button] }``
+Description:
+  Key pressed.
+
+  ``keyid`` = X11 key id
+
+  ``modifier`` = modifier mask
+
+  ``button`` = X11 Xkb keycode (optional)
+
+barrierCmdDKeyRepeat "DKRP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t keyid, int16_t modifier, int16_t repeat [,int16_t button] }``
+Description:
+  Key auto-repeat.
+
+  ``keyid`` = X11 key id
+
+  ``modifier`` = modifier mask
+
+  ``repeat`` = number of repeats
+
+  ``button`` = X11 Xkb keycode (optional)
+
+barrierCmdDKeyUp "DKUP"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t keyid, int16_t modifier [,int16_t button] }``
+Description:
+  Key released.
+
+  ``keyid`` = X11 key id
+
+  ``modifier`` = modifier mask
+
+  ``button`` = X11 Xkb keycode (optional)
+
+barrierCmdDMouseDown "DMDN"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int8_t button }``
+Description:
+  Mouse button pressed.
+
+  ``button`` = button id
+
+barrierCmdDMouseUp "DMUP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int8_t button }``
+Description:
+  Mouse button released.
+
+  ``button`` = button id
+
+barrierCmdDMouseMove "DMMV"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t x, int16_t y }``
+Description:
+  Absolute mouse moved.
+
+  ``x``, ``y`` = absolute screen coordinates
+
+barrierCmdDMouseRelMove "DMRM"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t x, int16_t y }``
+Description:
+  Relative mouse moved.
+
+  ``x``, ``y`` = relative screen coordinates
+
+barrierCmdDMouseWheel "DMWM"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t x, int16_t y }`` or ``{ int16_t y }``
+Description:
+  Mouse scroll. The delta should be +120 for one tick forward (away
+  from the user) or right and -120 for one tick backward (toward the
+  user) or left.
+
+  ``x`` = x delta
+
+  ``y`` = y delta
+
+barrierCmdDClipboard "DCLP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int8_t id, int32_t seq, int8_t mark, char *data }``
+Description:
+  Clipboard data.
+
+  ``id`` = clipboard id
+
+  ``seq`` = sequence number. The sequence number is 0 when sent by the
+  server. Client screens should use the sequence number from
+  the most recent barrierCmdCEnter.
+
+barrierCmdDSetOptions "DSOP"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int32_t nb, { int32_t id, int32_t val }[] }``
+Description:
+  Set options. Client should set the given option/value pairs.
+
+  ``nb`` = number of ``{ id, val }`` entries
+
+  ``id`` = option id
+
+  ``val`` = option new value
+
+barrierCmdDFileTransfer "DFTR"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int8_t mark, char *content }``
+Description:
+  Transfer file data.
+
+  * ``mark`` = 0 means the content that follows is the file size
+  * ``mark`` = 1 means the content that follows is the chunk data
+  * ``mark`` = 2 means the file transfer is finished
+
+barrierCmdDDragInfo "DDRG"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t nb, char *content }``
+Description:
+  Drag information.
+
+  ``nb`` = number of dragging objects
+
+  ``content`` = object's directory
+
+barrierCmdQInfo "QINF"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  None
+Description:
+  Query screen info
+
+  Client should reply with a barrierCmdDInfo
+
+barrierCmdEIncompatible "EICV"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  ``{ int16_t major, int16_t minor }``
+Description:
+  Incompatible version.
+
+  ``major`` = major version
+
+  ``minor`` = minor version
+
+barrierCmdEBusy "EBSY"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  None
+Description:
+  Name provided when connecting is already in use.
+
+barrierCmdEUnknown "EUNK"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  None
+Description:
+  Unknown client. Name provided when connecting is not in primary's
+  screen configuration map.
+
+barrierCmdEBad "EBAD"
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Direction:
+  server -> client
+Parameters:
+  None
+Description:
+  Protocol violation. Server should disconnect after sending this
+  message.
+
diff --git a/docs/interop/bitmaps.rst b/docs/interop/bitmaps.rst
new file mode 100644
index 000000000..1de46febd
--- /dev/null
+++ b/docs/interop/bitmaps.rst
@@ -0,0 +1,1629 @@
+..
+   Copyright 2019 John Snow <jsnow@redhat.com> and Red Hat, Inc.
+   All rights reserved.
+
+   This file is licensed via The FreeBSD Documentation License, the full
+   text of which is included at the end of this document.
+
+====================================
+Dirty Bitmaps and Incremental Backup
+====================================
+
+Dirty Bitmaps are in-memory objects that track writes to block devices. They
+can be used in conjunction with various block job operations to perform
+incremental or differential backup regimens.
+
+This document explains the conceptual mechanisms and provides up-to-date,
+complete and comprehensive documentation on the API to manipulate them.
+(Hopefully, the "why", "what", and "how".)
+
+The intended audience for this document is developers who are adding QEMU
+backup features to management applications, or power users who run and
+administer QEMU directly via QMP.
+
+.. contents::
+
+Overview
+--------
+
+Bitmaps are bit vectors where each '1' bit in the vector indicates a modified
+("dirty") segment of the corresponding block device. The size of the segment
+that is tracked is the granularity of the bitmap. If the granularity of a
+bitmap is 64K, each '1' bit means that a 64K region as a whole may have
+changed in some way, possibly by as little as one byte.
+
+Smaller granularities mean more accurate tracking of modified disk data, but
+require more computational overhead and larger bitmap sizes. Larger
+granularities mean smaller bitmap sizes, but less targeted backups.
+
+The size of a bitmap (in bytes) can be computed as such:
+    ``size`` = ceil(ceil(``image_size`` / ``granularity``) / 8)
+
+e.g. the size of a 64KiB granularity bitmap on a 2TiB image is:
+    ``size`` = ((2147483648K / 64K) / 8)
+             = 4194304B = 4MiB.
+
+QEMU uses these bitmaps when making incremental backups to know which sections
+of the file to copy out. They are not enabled by default and must be
+explicitly added in order to begin tracking writes.
+
+Bitmaps can be created at any time and can be attached to any arbitrary block
+node in the storage graph, but are most useful conceptually when attached to
+the root node that the guest's storage device model is attached to.
+ +That is to say: It's likely most useful to track the guest's writes to disk, +but you could theoretically track things like qcow2 metadata changes by +attaching the bitmap elsewhere in the storage graph. This is beyond the scope +of this document. + +QEMU supports persisting these bitmaps to disk via the qcow2 image format. +Bitmaps which are stored or loaded in this way are called "persistent", +whereas bitmaps that are not are called "transient". + +QEMU also supports the migration of both transient bitmaps (tracking any +arbitrary image format) or persistent bitmaps (qcow2) via live migration. + +Supported Image Formats +----------------------- + +QEMU supports all documented features below on the qcow2 image format. + +However, qcow2 is only strictly necessary for the persistence feature, which +writes bitmap data to disk upon close. If persistence is not required for a +specific use case, all bitmap features excepting persistence are available for +any arbitrary image format. + +For example, Dirty Bitmaps can be combined with the 'raw' image format, but +any changes to the bitmap will be discarded upon exit. + +.. warning:: Transient bitmaps will not be saved on QEMU exit! Persistent + bitmaps are available only on qcow2 images. + +Dirty Bitmap Names +------------------ + +Bitmap objects need a method to reference them in the API. All API-created and +managed bitmaps have a human-readable name chosen by the user at creation +time. + +- A bitmap's name is unique to the node, but bitmaps attached to different + nodes can share the same name. Therefore, all bitmaps are addressed via + their (node, name) pair. + +- The name of a user-created bitmap cannot be empty (""). + +- Transient bitmaps can have JSON unicode names that are effectively not + length limited. (QMP protocol may restrict messages to less than 64MiB.) + +- Persistent storage formats may impose their own requirements on bitmap names + and namespaces. Presently, only qcow2 supports persistent bitmaps. See + docs/interop/qcow2.txt for more details on restrictions. Notably: + + - qcow2 bitmap names are limited to between 1 and 1023 bytes long. + + - No two bitmaps saved to the same qcow2 file may share the same name. + +- QEMU occasionally uses bitmaps for internal use which have no name. They are + hidden from API query calls, cannot be manipulated by the external API, are + never persistent, nor ever migrated. + +Bitmap Status +------------- + +Dirty Bitmap objects can be queried with the QMP command `query-block +<qemu-qmp-ref.html#index-query_002dblock>`_, and are visible via the +`BlockDirtyInfo <qemu-qmp-ref.html#index-BlockDirtyInfo>`_ QAPI structure. + +This struct shows the name, granularity, and dirty byte count for each bitmap. +Additionally, it shows several boolean status indicators: + +- ``recording``: This bitmap is recording writes. +- ``busy``: This bitmap is in-use by an operation. +- ``persistent``: This bitmap is a persistent type. +- ``inconsistent``: This bitmap is corrupted and cannot be used. + +The ``+busy`` status prohibits you from deleting, clearing, or otherwise +modifying a bitmap, and happens when the bitmap is being used for a backup +operation or is in the process of being loaded from a migration. Many of the +commands documented below will refuse to work on such bitmaps. + +The ``+inconsistent`` status similarly prohibits almost all operations, +notably allowing only the ``block-dirty-bitmap-remove`` operation. 
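+
+As an illustration, a persistent bitmap that was not correctly written back
+would show up in ``query-block`` output with the extra flag set. This is a
+hypothetical excerpt, trimmed to the bitmap-relevant keys:
+
+.. code-block:: QMP
+
+    "dirty-bitmaps": [ {
+        "name": "bitmap0",
+        "recording": false,
+        "busy": false,
+        "persistent": true,
+        "inconsistent": true,
+        "count": 0,
+        "granularity": 65536
+    } ]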
+ +There is also a deprecated ``status`` field of type `DirtyBitmapStatus +<qemu-qmp-ref.html#index-DirtyBitmapStatus>`_. A bitmap historically had +five visible states: + + #. ``Frozen``: This bitmap is currently in-use by an operation and is + immutable. It can't be deleted, renamed, reset, etc. + + (This is now ``+busy``.) + + #. ``Disabled``: This bitmap is not recording new writes. + + (This is now ``-recording -busy``.) + + #. ``Active``: This bitmap is recording new writes. + + (This is now ``+recording -busy``.) + + #. ``Locked``: This bitmap is in-use by an operation, and is immutable. + The difference from "Frozen" was primarily implementation details. + + (This is now ``+busy``.) + + #. ``Inconsistent``: This persistent bitmap was not saved to disk + correctly, and can no longer be used. It remains in memory to serve as + an indicator of failure. + + (This is now ``+inconsistent``.) + +These states are directly replaced by the status indicators and should not be +used. The difference between ``Frozen`` and ``Locked`` is an implementation +detail and should not be relevant to external users. + +Basic QMP Usage +--------------- + +The primary interface to manipulating bitmap objects is via the QMP +interface. If you are not familiar, see docs/interop/qmp-intro.txt for a broad +overview, and `qemu-qmp-ref <qemu-qmp-ref.html>`_ for a full reference of all +QMP commands. + +Supported Commands +~~~~~~~~~~~~~~~~~~ + +There are six primary bitmap-management API commands: + +- ``block-dirty-bitmap-add`` +- ``block-dirty-bitmap-remove`` +- ``block-dirty-bitmap-clear`` +- ``block-dirty-bitmap-disable`` +- ``block-dirty-bitmap-enable`` +- ``block-dirty-bitmap-merge`` + +And one related query command: + +- ``query-block`` + +Creation: block-dirty-bitmap-add +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +`block-dirty-bitmap-add +<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dadd>`_: + +Creates a new bitmap that tracks writes to the specified node. granularity, +persistence, and recording state can be adjusted at creation time. + +.. admonition:: Example + + to create a new, actively recording persistent bitmap: + + .. code-block:: QMP + + -> { "execute": "block-dirty-bitmap-add", + "arguments": { + "node": "drive0", + "name": "bitmap0", + "persistent": true, + } + } + + <- { "return": {} } + +- This bitmap will have a default granularity that matches the cluster size of + its associated drive, if available, clamped to between [4KiB, 64KiB]. The + current default for qcow2 is 64KiB. + +.. admonition:: Example + + To create a new, disabled (``-recording``), transient bitmap that tracks + changes in 32KiB segments: + + .. code-block:: QMP + + -> { "execute": "block-dirty-bitmap-add", + "arguments": { + "node": "drive0", + "name": "bitmap1", + "granularity": 32768, + "disabled": true + } + } + + <- { "return": {} } + +Deletion: block-dirty-bitmap-remove +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +`block-dirty-bitmap-remove +<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dremove>`_: + +Deletes a bitmap. Bitmaps that are ``+busy`` cannot be removed. + +- Deleting a bitmap does not impact any other bitmaps attached to the same + node, nor does it affect any backups already created from this bitmap or + node. + +- Because bitmaps are only unique to the node to which they are attached, you + must specify the node/drive name here, too. + +- Deleting a persistent bitmap will remove it from the qcow2 file. + +.. admonition:: Example + + Remove a bitmap named ``bitmap0`` from node ``drive0``: + + .. 
Creation: block-dirty-bitmap-add
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`block-dirty-bitmap-add
<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dadd>`_:

Creates a new bitmap that tracks writes to the specified node. Granularity,
persistence, and recording state can be adjusted at creation time.

.. admonition:: Example

   To create a new, actively recording persistent bitmap:

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-add",
          "arguments": {
            "node": "drive0",
            "name": "bitmap0",
            "persistent": true
          }
        }

     <- { "return": {} }

- This bitmap will have a default granularity that matches the cluster size of
  its associated drive, if available, clamped to between [4KiB, 64KiB]. The
  current default for qcow2 is 64KiB.

.. admonition:: Example

   To create a new, disabled (``-recording``), transient bitmap that tracks
   changes in 32KiB segments:

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-add",
          "arguments": {
            "node": "drive0",
            "name": "bitmap1",
            "granularity": 32768,
            "disabled": true
          }
        }

     <- { "return": {} }

Deletion: block-dirty-bitmap-remove
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`block-dirty-bitmap-remove
<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dremove>`_:

Deletes a bitmap. Bitmaps that are ``+busy`` cannot be removed.

- Deleting a bitmap does not impact any other bitmaps attached to the same
  node, nor does it affect any backups already created from this bitmap or
  node.

- Because bitmaps are only unique to the node to which they are attached, you
  must specify the node/drive name here, too.

- Deleting a persistent bitmap will remove it from the qcow2 file.

.. admonition:: Example

   Remove a bitmap named ``bitmap0`` from node ``drive0``:

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-remove",
          "arguments": {
            "node": "drive0",
            "name": "bitmap0"
          }
        }

     <- { "return": {} }

Resetting: block-dirty-bitmap-clear
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`block-dirty-bitmap-clear
<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dclear>`_:

Clears all dirty bits from a bitmap. ``+busy`` bitmaps cannot be cleared.

- An incremental backup created from an empty bitmap will copy no data, as if
  nothing has changed.

.. admonition:: Example

   Clear all dirty bits from bitmap ``bitmap0`` on node ``drive0``:

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-clear",
          "arguments": {
            "node": "drive0",
            "name": "bitmap0"
          }
        }

     <- { "return": {} }

Enabling: block-dirty-bitmap-enable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`block-dirty-bitmap-enable
<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002denable>`_:

"Enables" a bitmap, setting the ``recording`` bit to true, causing writes to
begin being recorded. ``+busy`` bitmaps cannot be enabled.

- Bitmaps default to being enabled when created, unless configured otherwise.

- Persistent enabled bitmaps will remember their ``+recording`` status on
  load.

.. admonition:: Example

   To set ``+recording`` on bitmap ``bitmap0`` on node ``drive0``:

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-enable",
          "arguments": {
            "node": "drive0",
            "name": "bitmap0"
          }
        }

     <- { "return": {} }

Disabling: block-dirty-bitmap-disable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`block-dirty-bitmap-disable
<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002ddisable>`_:

"Disables" a bitmap, setting the ``recording`` bit to false, causing further
writes to be ignored. ``+busy`` bitmaps cannot be disabled.

.. warning::

   This is potentially dangerous: QEMU makes no effort to stop any writes if
   there are disabled bitmaps on a node, and will not mark any disabled bitmaps
   as ``+inconsistent`` if any such writes do happen. Backups made from such
   bitmaps will not be able to be used to reconstruct a coherent image.

- Disabling a bitmap may be useful for examining which sectors of a disk
  changed during a specific time period, or for explicit management of
  differential backup windows.

- Persistent disabled bitmaps will remember their ``-recording`` status on
  load.

.. admonition:: Example

   To set ``-recording`` on bitmap ``bitmap0`` on node ``drive0``:

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-disable",
          "arguments": {
            "node": "drive0",
            "name": "bitmap0"
          }
        }

     <- { "return": {} }
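To illustrate the disable/enable workflow described above, here is a sketch
that freezes a bitmap, reports how many bytes changed while it was recording,
and then resumes recording. It assumes an ``execute(command, arguments)``
callable (e.g. ``functools.partial(execute, chan)`` over the client helper
sketched earlier); the names are illustrative:

.. code:: python

    def report_changes(execute, node, name):
        """Freeze a bitmap, print its dirty byte count, then resume it."""
        execute("block-dirty-bitmap-disable", {"node": node, "name": name})
        for device in execute("query-block"):
            if device.get("device") != node:
                continue
            for info in device.get("dirty-bitmaps", []):
                if info["name"] == name:
                    print(f"{name}: {info['count']} bytes dirtied")
        execute("block-dirty-bitmap-enable", {"node": node, "name": name})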
Merging, Copying: block-dirty-bitmap-merge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`block-dirty-bitmap-merge
<qemu-qmp-ref.html#index-block_002ddirty_002dbitmap_002dmerge>`_:

Merges one or more bitmaps into a target bitmap. For any segment that is dirty
in any one source bitmap, the target bitmap will mark that segment dirty.

- Merge takes one or more bitmaps as a source and merges them together into a
  single destination, such that any segment marked as dirty in any source
  bitmap(s) will be marked dirty in the destination bitmap.

- Merge does not create the destination bitmap if it does not exist. A blank
  bitmap can be created beforehand to achieve the same effect.

- The destination is not cleared prior to merge, so subsequent merge
  operations will continue to cumulatively mark more segments as dirty.

- If the merge operation should fail, the destination bitmap is guaranteed to
  be unmodified. The operation may fail if the source or destination bitmaps
  are busy, or have different granularities.

- Bitmaps can only be merged on the same node. There is only one "node"
  argument, so all bitmaps must be attached to that same node.

- Copy can be achieved by merging from a single source to an empty
  destination.

.. admonition:: Example

   Merge the data from ``bitmap0`` into the bitmap ``new_bitmap`` on node
   ``drive0``. If ``new_bitmap`` was empty prior to this command, this
   achieves a copy.

   .. code-block:: QMP

     -> { "execute": "block-dirty-bitmap-merge",
          "arguments": {
            "node": "drive0",
            "target": "new_bitmap",
            "bitmaps": [ "bitmap0" ]
          }
        }

     <- { "return": {} }

Querying: query-block
~~~~~~~~~~~~~~~~~~~~~

`query-block
<qemu-qmp-ref.html#index-query_002dblock>`_:

Not strictly a bitmaps command, but it will return information about any
bitmaps attached to nodes serving as the root for guest devices.

- The "inconsistent" bit will not appear when it is false, appearing only when
  the value is true to indicate there is a problem.

.. admonition:: Example

   Query the block sub-system of QEMU. The following json has trimmed
   irrelevant keys from the response to highlight only the bitmap-relevant
   portions of the API. This result highlights a bitmap ``bitmap0`` attached
   to the root node of device ``drive0``.

   .. code-block:: QMP

     -> {
          "execute": "query-block",
          "arguments": {}
        }

     <- {
          "return": [ {
            "dirty-bitmaps": [ {
              "status": "active",
              "count": 0,
              "busy": false,
              "name": "bitmap0",
              "persistent": false,
              "recording": true,
              "granularity": 65536
            } ],
            "device": "drive0"
          } ]
        }

Bitmap Persistence
------------------

As outlined in `Supported Image Formats`_, QEMU can persist bitmaps to qcow2
files. Demonstrated in `Creation: block-dirty-bitmap-add`_, passing
``persistent: true`` to ``block-dirty-bitmap-add`` will persist that bitmap to
disk.

Persistent bitmaps will be automatically loaded into memory upon load, and
will be written back to disk upon close. Their usage should be mostly
transparent.

However, if QEMU does not get a chance to close the file cleanly, the bitmap
will be marked as ``+inconsistent`` at next load and considered unsafe to use
for any operation. At this point, the only valid operation on such bitmaps is
``block-dirty-bitmap-remove``.

Losing a bitmap in this way does not invalidate any existing backups that have
been made from this bitmap, but no further backups will be able to be issued
for this chain.
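A management layer can detect and clean up such casualties at startup. A
sketch, again assuming an ``execute`` callable as above; the function name is
illustrative, and the node is addressed by the ``device`` value as in the
``query-block`` example:

.. code:: python

    def drop_inconsistent_bitmaps(execute):
        """Remove every +inconsistent bitmap found via query-block;
        removal is the only operation such bitmaps still support."""
        for device in execute("query-block"):
            for info in device.get("dirty-bitmaps", []):
                if info.get("inconsistent"):   # key only present when true
                    execute("block-dirty-bitmap-remove",
                            {"node": device["device"],
                             "name": info["name"]})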
Transactions
------------

Transactions are a QMP feature that allows you to submit multiple QMP commands
at once, with the guarantee that they will all succeed or fail atomically,
together. The interaction of bitmaps and transactions is demonstrated below.

See `transaction <qemu-qmp-ref.html#index-transaction>`_ in the QMP reference
for more details.

Justification
~~~~~~~~~~~~~

Bitmaps can generally be modified at any time, but certain operations often
only make sense when paired directly with other commands. When a VM is paused,
it's easy to ensure that no guest writes occur between individual QMP
commands. When a VM is running, this is difficult to accomplish with
individual QMP commands that may allow guest writes to occur between each
command.

For example, using only individual QMP commands, we could:

#. Boot the VM in a paused state.
#. Create a full drive backup of drive0.
#. Create a new bitmap attached to drive0, confident that nothing has been
   written to drive0 in the meantime.
#. Resume execution of the VM.
#. At a later point, issue incremental backups from ``bitmap0``.

At this point, the bitmap and drive backup would be correctly in sync, and
incremental backups made from this point forward would be correctly aligned to
the full drive backup.

This is not particularly useful if we decide we want to start incremental
backups after the VM has been running for a while, for which we would want to
perform actions such as the following:

#. Boot the VM and begin execution.
#. Using a single transaction, perform the following operations:

   - Create ``bitmap0``.
   - Create a full drive backup of ``drive0``.

#. At a later point, issue incremental backups from ``bitmap0``.

.. note:: As a consideration, if ``bitmap0`` is created prior to the full
          drive backup, incremental backups can still be authored from this
          bitmap, but they will copy extra segments reflecting writes that
          occurred prior to the backup operation. Transactions allow us to
          narrow critical points in time to reduce waste, or, in the other
          direction, to ensure that no segments are omitted.

Supported Bitmap Transactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``block-dirty-bitmap-add``
- ``block-dirty-bitmap-clear``
- ``block-dirty-bitmap-enable``
- ``block-dirty-bitmap-disable``
- ``block-dirty-bitmap-merge``

The usage of these commands is identical to their respective QMP commands,
but see the sections below for concrete examples.

Incremental Backups - Push Model
--------------------------------

Incremental backups are simply partial disk images that can be combined with
other partial disk images on top of a base image to reconstruct a full backup
from the point in time at which the incremental backup was issued.

The "Push Model" here references the fact that QEMU is "pushing" the modified
blocks out to a destination. We will be using the `blockdev-backup
<qemu-qmp-ref.html#index-blockdev_002dbackup>`_ QMP command to create both
full and incremental backups.

The command is a background job, which has its own QMP API for querying and
management, documented in `Background jobs
<qemu-qmp-ref.html#Background-jobs>`_.

Example: New Incremental Backup Anchor Point
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As outlined in the Transactions - `Justification`_ section, perhaps we want to
create a new incremental backup chain attached to a drive.

This example creates a new, full backup of "drive0" and accompanies it with a
new, empty bitmap that records writes from this point in time forward.

The target can be created with the help of the `blockdev-add
<qemu-qmp-ref.html#index-blockdev_002dadd>`_ or `blockdev-create
<qemu-qmp-ref.html#index-blockdev_002dcreate>`_ commands.

.. note:: Any new writes that happen after this command is issued, even while
          the backup job runs, will be written locally and not to the backup
          destination. These writes will be recorded in the bitmap
          accordingly.

..
code-block:: QMP + + -> { + "execute": "transaction", + "arguments": { + "actions": [ + { + "type": "block-dirty-bitmap-add", + "data": { + "node": "drive0", + "name": "bitmap0" + } + }, + { + "type": "blockdev-backup", + "data": { + "device": "drive0", + "target": "target0", + "sync": "full" + } + } + ] + } + } + + <- { "return": {} } + + <- { + "timestamp": { + "seconds": 1555436945, + "microseconds": 179620 + }, + "data": { + "status": "created", + "id": "drive0" + }, + "event": "JOB_STATUS_CHANGE" + } + + ... + + <- { + "timestamp": {...}, + "data": { + "device": "drive0", + "type": "backup", + "speed": 0, + "len": 68719476736, + "offset": 68719476736 + }, + "event": "BLOCK_JOB_COMPLETED" + } + + <- { + "timestamp": {...}, + "data": { + "status": "concluded", + "id": "drive0" + }, + "event": "JOB_STATUS_CHANGE" + } + + <- { + "timestamp": {...}, + "data": { + "status": "null", + "id": "drive0" + }, + "event": "JOB_STATUS_CHANGE" + } + +A full explanation of the job transition semantics and the JOB_STATUS_CHANGE +event are beyond the scope of this document and will be omitted in all +subsequent examples; above, several more events have been omitted for brevity. + +.. note:: Subsequent examples will omit all events except BLOCK_JOB_COMPLETED + except where necessary to illustrate workflow differences. + + Omitted events and json objects will be represented by ellipses: + ``...`` + +Example: Resetting an Incremental Backup Anchor Point +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If we want to start a new backup chain with an existing bitmap, we can also +use a transaction to reset the bitmap while making a new full backup: + +.. code-block:: QMP + + -> { + "execute": "transaction", + "arguments": { + "actions": [ + { + "type": "block-dirty-bitmap-clear", + "data": { + "node": "drive0", + "name": "bitmap0" + } + }, + { + "type": "blockdev-backup", + "data": { + "device": "drive0", + "target": "target0", + "sync": "full" + } + } + ] + } + } + + <- { "return": {} } + + ... + + <- { + "timestamp": {...}, + "data": { + "device": "drive0", + "type": "backup", + "speed": 0, + "len": 68719476736, + "offset": 68719476736 + }, + "event": "BLOCK_JOB_COMPLETED" + } + + ... + +The result of this example is identical to the first, but we clear an existing +bitmap instead of adding a new one. + +.. tip:: In both of these examples, "bitmap0" is tied conceptually to the + creation of new, full backups. This relationship is not saved or + remembered by QEMU; it is up to the operator or management layer to + remember which bitmaps are associated with which backups. + +Example: First Incremental Backup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +#. Create a full backup and sync it to a dirty bitmap using any method: + + - Either of the two live backup method demonstrated above, + - Using QMP commands with the VM paused as in the `Justification`_ section, + or + - With the VM offline, manually copy the image and start the VM in a paused + state, careful to add a new bitmap before the VM begins execution. + + Whichever method is chosen, let's assume that at the end of this step: + + - The full backup is named ``drive0.full.qcow2``. + - The bitmap we created is named ``bitmap0``, attached to ``drive0``. + +#. Create a destination image for the incremental backup that utilizes the + full backup as a backing image. + + - Let's assume the new incremental image is named ``drive0.inc0.qcow2``: + + .. code:: bash + + $ qemu-img create -f qcow2 drive0.inc0.qcow2 \ + -b drive0.full.qcow2 -F qcow2 + +#. 
Add target block node: + + .. code-block:: QMP + + -> { + "execute": "blockdev-add", + "arguments": { + "node-name": "target0", + "driver": "qcow2", + "file": { + "driver": "file", + "filename": "drive0.inc0.qcow2" + } + } + } + + <- { "return": {} } + +#. Issue an incremental backup command: + + .. code-block:: QMP + + -> { + "execute": "blockdev-backup", + "arguments": { + "device": "drive0", + "bitmap": "bitmap0", + "target": "target0", + "sync": "incremental" + } + } + + <- { "return": {} } + + ... + + <- { + "timestamp": {...}, + "data": { + "device": "drive0", + "type": "backup", + "speed": 0, + "len": 68719476736, + "offset": 68719476736 + }, + "event": "BLOCK_JOB_COMPLETED" + } + + ... + +This copies any blocks modified since the full backup was created into the +``drive0.inc0.qcow2`` file. During the operation, ``bitmap0`` is marked +``+busy``. If the operation is successful, ``bitmap0`` will be cleared to +reflect the "incremental" backup regimen, which only copies out new changes +from each incremental backup. + +.. note:: Any new writes that occur after the backup operation starts do not + get copied to the destination. The backup's "point in time" is when + the backup starts, not when it ends. These writes are recorded in a + special bitmap that gets re-added to bitmap0 when the backup ends so + that the next incremental backup can copy them out. + +Example: Second Incremental Backup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +#. Create a new destination image for the incremental backup that points to + the previous one, e.g.: ``drive0.inc1.qcow2`` + + .. code:: bash + + $ qemu-img create -f qcow2 drive0.inc1.qcow2 \ + -b drive0.inc0.qcow2 -F qcow2 + +#. Add target block node: + + .. code-block:: QMP + + -> { + "execute": "blockdev-add", + "arguments": { + "node-name": "target0", + "driver": "qcow2", + "file": { + "driver": "file", + "filename": "drive0.inc1.qcow2" + } + } + } + + <- { "return": {} } + +#. Issue a new incremental backup command. The only difference here is that we + have changed the target image below. + + .. code-block:: QMP + + -> { + "execute": "blockdev-backup", + "arguments": { + "device": "drive0", + "bitmap": "bitmap0", + "target": "target0", + "sync": "incremental" + } + } + + <- { "return": {} } + + ... + + <- { + "timestamp": {...}, + "data": { + "device": "drive0", + "type": "backup", + "speed": 0, + "len": 68719476736, + "offset": 68719476736 + }, + "event": "BLOCK_JOB_COMPLETED" + } + + ... + +Because the first incremental backup from the previous example completed +successfully, ``bitmap0`` was synchronized with ``drive0.inc0.qcow2``. Here, +we use ``bitmap0`` again to create a new incremental backup that targets the +previous one, creating a chain of three images: + +.. admonition:: Diagram + + .. code:: text + + +-------------------+ +-------------------+ +-------------------+ + | drive0.full.qcow2 |<--| drive0.inc0.qcow2 |<--| drive0.inc1.qcow2 | + +-------------------+ +-------------------+ +-------------------+ + +Each new incremental backup re-synchronizes the bitmap to the latest backup +authored, allowing a user to continue to "consume" it to create new backups on +top of an existing chain. + +In the above diagram, neither drive0.inc1.qcow2 nor drive0.inc0.qcow2 are +complete images by themselves, but rely on their backing chain to reconstruct +a full image. The dependency terminates with each full backup. + +Each backup in this chain remains independent, and is unchanged by new entries +made later in the chain. 
For instance, drive0.inc0.qcow2 remains a perfectly +valid backup of the disk as it was when that backup was issued. + +Example: Incremental Push Backups without Backing Files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Backup images are best kept off-site, so we often will not have the preceding +backups in a chain available to link against. This is not a problem at backup +time; we simply do not set the backing image when creating the destination +image: + +#. Create a new destination image with no backing file set. We will need to + specify the size of the base image, because the backing file isn't + available for QEMU to use to determine it. + + .. code:: bash + + $ qemu-img create -f qcow2 drive0.inc2.qcow2 64G + + .. note:: Alternatively, you can omit ``mode: "existing"`` from the push + backup commands to have QEMU create an image without a backing + file for you, but you lose control over format options like + compatibility and preallocation presets. + +#. Add target block node: + + .. code-block:: QMP + + -> { + "execute": "blockdev-add", + "arguments": { + "node-name": "target0", + "driver": "qcow2", + "file": { + "driver": "file", + "filename": "drive0.inc2.qcow2" + } + } + } + + <- { "return": {} } + +#. Issue a new incremental backup command. Apart from the new destination + image, there is no difference from the last two examples. + + .. code-block:: QMP + + -> { + "execute": "blockdev-backup", + "arguments": { + "device": "drive0", + "bitmap": "bitmap0", + "target": "target0", + "sync": "incremental" + } + } + + <- { "return": {} } + + ... + + <- { + "timestamp": {...}, + "data": { + "device": "drive0", + "type": "backup", + "speed": 0, + "len": 68719476736, + "offset": 68719476736 + }, + "event": "BLOCK_JOB_COMPLETED" + } + + ... + +The only difference from the perspective of the user is that you will need to +set the backing image when attempting to restore the backup: + +.. code:: bash + + $ qemu-img rebase drive0.inc2.qcow2 \ + -u -b drive0.inc1.qcow2 + +This uses the "unsafe" rebase mode to simply set the backing file to a file +that isn't present. + +It is also possible to use ``--image-opts`` to specify the entire backing +chain by hand as an ephemeral property at runtime, but that is beyond the +scope of this document. + +Example: Multi-drive Incremental Backup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Assume we have a VM with two drives, "drive0" and "drive1" and we wish to back +both of them up such that the two backups represent the same crash-consistent +point in time. + +#. For each drive, create an empty image: + + .. code:: bash + + $ qemu-img create -f qcow2 drive0.full.qcow2 64G + $ qemu-img create -f qcow2 drive1.full.qcow2 64G + +#. Add target block nodes: + + .. code-block:: QMP + + -> { + "execute": "blockdev-add", + "arguments": { + "node-name": "target0", + "driver": "qcow2", + "file": { + "driver": "file", + "filename": "drive0.full.qcow2" + } + } + } + + <- { "return": {} } + + -> { + "execute": "blockdev-add", + "arguments": { + "node-name": "target1", + "driver": "qcow2", + "file": { + "driver": "file", + "filename": "drive1.full.qcow2" + } + } + } + + <- { "return": {} } + +#. Create a full (anchor) backup for each drive, with accompanying bitmaps: + + .. 
code-block:: QMP

     -> {
          "execute": "transaction",
          "arguments": {
            "actions": [
              {
                "type": "block-dirty-bitmap-add",
                "data": {
                  "node": "drive0",
                  "name": "bitmap0"
                }
              },
              {
                "type": "block-dirty-bitmap-add",
                "data": {
                  "node": "drive1",
                  "name": "bitmap0"
                }
              },
              {
                "type": "blockdev-backup",
                "data": {
                  "device": "drive0",
                  "target": "target0",
                  "sync": "full"
                }
              },
              {
                "type": "blockdev-backup",
                "data": {
                  "device": "drive1",
                  "target": "target1",
                  "sync": "full"
                }
              }
            ]
          }
        }

     <- { "return": {} }

     ...

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive0",
            "type": "backup",
            "speed": 0,
            "len": 68719476736,
            "offset": 68719476736
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

     ...

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive1",
            "type": "backup",
            "speed": 0,
            "len": 68719476736,
            "offset": 68719476736
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

     ...

#. Later, create new destination images for each of the incremental backups
   that point to their respective full backups:

   .. code:: bash

     $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
       -b drive0.full.qcow2 -F qcow2
     $ qemu-img create -f qcow2 drive1.inc0.qcow2 \
       -b drive1.full.qcow2 -F qcow2

#. Add target block nodes:

   .. code-block:: QMP

     -> {
          "execute": "blockdev-add",
          "arguments": {
            "node-name": "target0",
            "driver": "qcow2",
            "file": {
              "driver": "file",
              "filename": "drive0.inc0.qcow2"
            }
          }
        }

     <- { "return": {} }

     -> {
          "execute": "blockdev-add",
          "arguments": {
            "node-name": "target1",
            "driver": "qcow2",
            "file": {
              "driver": "file",
              "filename": "drive1.inc0.qcow2"
            }
          }
        }

     <- { "return": {} }

#. Issue a multi-drive incremental push backup transaction:

   .. code-block:: QMP

     -> {
          "execute": "transaction",
          "arguments": {
            "actions": [
              {
                "type": "blockdev-backup",
                "data": {
                  "device": "drive0",
                  "bitmap": "bitmap0",
                  "sync": "incremental",
                  "target": "target0"
                }
              },
              {
                "type": "blockdev-backup",
                "data": {
                  "device": "drive1",
                  "bitmap": "bitmap0",
                  "sync": "incremental",
                  "target": "target1"
                }
              }
            ]
          }
        }

     <- { "return": {} }

     ...

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive0",
            "type": "backup",
            "speed": 0,
            "len": 68719476736,
            "offset": 68719476736
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

     ...

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive1",
            "type": "backup",
            "speed": 0,
            "len": 68719476736,
            "offset": 68719476736
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

     ...

Push Backup Errors & Recovery
-----------------------------

In the event of an error that occurs after a push backup job is successfully
launched, either by an individual QMP command or a QMP transaction, the user
will receive a ``BLOCK_JOB_COMPLETED`` event with a failure message,
accompanied by a ``BLOCK_JOB_ERROR`` event.

In the case of a job being cancelled, the user will receive a
``BLOCK_JOB_CANCELLED`` event instead of a pair of COMPLETED and ERROR
events.

In either failure case, the bitmap used for the failed operation is not
cleared. It will contain all of the dirty bits it did at the start of the
operation, plus any new bits that got marked during the operation.

Effectively, the "point in time" that a bitmap is recording differences
against is kept at the issuance of the last successful incremental backup,
instead of being moved forward to the start of this now-failed backup.
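Management software typically learns of these outcomes by watching the event
stream. A minimal sketch, reusing the QMP channel from the client helper
sketched earlier (the function name is illustrative):

.. code:: python

    import json

    def wait_for_job(chan, device):
        """Block until the backup job on `device` finishes. Returns None
        on success, "cancelled" on cancellation, or the error string."""
        while True:
            msg = json.loads(chan.readline())
            data = msg.get("data", {})
            if data.get("device") != device:
                continue                     # unrelated message or event
            if msg.get("event") == "BLOCK_JOB_CANCELLED":
                return "cancelled"
            if msg.get("event") == "BLOCK_JOB_COMPLETED":
                return data.get("error")     # key is absent on success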
Once the underlying problem is addressed (e.g. more storage space is allocated
on the destination), the incremental backup command can be retried with the
same bitmap.

Example: Individual Failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Incremental push backup jobs that fail individually behave simply as
described above. This example demonstrates the single-job failure case:

#. Create a target image:

   .. code:: bash

     $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
       -b drive0.full.qcow2 -F qcow2

#. Add target block node:

   .. code-block:: QMP

     -> {
          "execute": "blockdev-add",
          "arguments": {
            "node-name": "target0",
            "driver": "qcow2",
            "file": {
              "driver": "file",
              "filename": "drive0.inc0.qcow2"
            }
          }
        }

     <- { "return": {} }

#. Attempt to create an incremental backup via QMP:

   .. code-block:: QMP

     -> {
          "execute": "blockdev-backup",
          "arguments": {
            "device": "drive0",
            "bitmap": "bitmap0",
            "target": "target0",
            "sync": "incremental"
          }
        }

     <- { "return": {} }

#. Receive a pair of events indicating failure:

   .. code-block:: QMP

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive0",
            "action": "report",
            "operation": "write"
          },
          "event": "BLOCK_JOB_ERROR"
        }

     <- {
          "timestamp": {...},
          "data": {
            "speed": 0,
            "offset": 0,
            "len": 67108864,
            "error": "No space left on device",
            "device": "drive0",
            "type": "backup"
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

#. Remove the target node:

   .. code-block:: QMP

     -> {
          "execute": "blockdev-del",
          "arguments": {
            "node-name": "target0"
          }
        }

     <- { "return": {} }

#. Delete the failed image, and re-create it:

   .. code:: bash

     $ rm drive0.inc0.qcow2
     $ qemu-img create -f qcow2 drive0.inc0.qcow2 \
       -b drive0.full.qcow2 -F qcow2

#. Add the target block node again:

   .. code-block:: QMP

     -> {
          "execute": "blockdev-add",
          "arguments": {
            "node-name": "target0",
            "driver": "qcow2",
            "file": {
              "driver": "file",
              "filename": "drive0.inc0.qcow2"
            }
          }
        }

     <- { "return": {} }

#. Retry the command after fixing the underlying problem, such as
   freeing up space on the backup volume:

   .. code-block:: QMP

     -> {
          "execute": "blockdev-backup",
          "arguments": {
            "device": "drive0",
            "bitmap": "bitmap0",
            "target": "target0",
            "sync": "incremental"
          }
        }

     <- { "return": {} }

#. Receive confirmation that the job completed successfully:

   .. code-block:: QMP

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive0",
            "type": "backup",
            "speed": 0,
            "len": 67108864,
            "offset": 67108864
          },
          "event": "BLOCK_JOB_COMPLETED"
        }
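The recovery sequence above lends itself to automation. A sketch, assuming an
``execute`` callable as before; the node name, file paths, and function name
are illustrative:

.. code:: python

    import os
    import subprocess

    def retry_incremental(execute, device, bitmap, node, filename, backing):
        """Re-run a failed incremental backup: drop the stale target
        node, recreate the destination image, re-attach it, and retry
        with the same (still dirty) bitmap."""
        execute("blockdev-del", {"node-name": node})
        os.remove(filename)
        subprocess.run(["qemu-img", "create", "-f", "qcow2", filename,
                        "-b", backing, "-F", "qcow2"], check=True)
        execute("blockdev-add", {"node-name": node, "driver": "qcow2",
                                 "file": {"driver": "file",
                                          "filename": filename}})
        execute("blockdev-backup", {"device": device, "bitmap": bitmap,
                                    "target": node, "sync": "incremental"})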
Example: Partial Transactional Failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QMP commands like `blockdev-backup
<qemu-qmp-ref.html#index-blockdev_002dbackup>`_
conceptually only start a job, and so transactions containing these commands
may succeed even if the jobs they create later fail. This might have
surprising interactions with notions of how a "transaction" ought to behave.

This distinction means that on occasion, a transaction containing such job
launching commands may appear to succeed and return success, but later
individual jobs associated with the transaction may fail. It is possible that
a management application may have to deal with a partial backup failure after
a "successful" transaction.

If multiple backup jobs are specified in a single transaction and one of those
jobs fails, it will not interact with the other backup jobs in any way by
default. The job(s) that succeeded will clear the dirty bitmap associated with
the operation, but the job(s) that failed will not. It is therefore not safe
to delete any incremental backups that were created successfully in this
scenario, even though others failed.

This example illustrates a transaction with two backup jobs, where one fails
and one succeeds:

#. Issue the transaction to start a backup of both drives.

   .. code-block:: QMP

     -> {
          "execute": "transaction",
          "arguments": {
            "actions": [
              {
                "type": "blockdev-backup",
                "data": {
                  "device": "drive0",
                  "bitmap": "bitmap0",
                  "sync": "incremental",
                  "target": "target0"
                }
              },
              {
                "type": "blockdev-backup",
                "data": {
                  "device": "drive1",
                  "bitmap": "bitmap0",
                  "sync": "incremental",
                  "target": "target1"
                }
              }
            ]
          }
        }

#. Receive notice that the transaction was accepted, and the jobs were
   launched:

   .. code-block:: QMP

     <- { "return": {} }

#. Receive notice that the first job has completed:

   .. code-block:: QMP

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive0",
            "type": "backup",
            "speed": 0,
            "len": 67108864,
            "offset": 67108864
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

#. Receive notice that the second job has failed:

   .. code-block:: QMP

     <- {
          "timestamp": {...},
          "data": {
            "device": "drive1",
            "action": "report",
            "operation": "read"
          },
          "event": "BLOCK_JOB_ERROR"
        }

     ...

     <- {
          "timestamp": {...},
          "data": {
            "speed": 0,
            "offset": 0,
            "len": 67108864,
            "error": "Input/output error",
            "device": "drive1",
            "type": "backup"
          },
          "event": "BLOCK_JOB_COMPLETED"
        }

At the conclusion of the above example, ``drive0.inc0.qcow2`` is valid and
must be kept, but ``drive1.inc0.qcow2`` is incomplete and should be
deleted. If a VM-wide incremental backup of all drives at a point-in-time is
to be made, new backups for both drives will need to be made, taking into
account that a new incremental backup for drive0 needs to be based on top of
``drive0.inc0.qcow2``.

For this example, an incremental backup for ``drive0`` was created, but not
for ``drive1``. The last VM-wide crash-consistent backup that is available in
this case is the full backup:

.. code:: text

   [drive0.full.qcow2] <-- [drive0.inc0.qcow2]
   [drive1.full.qcow2]

To repair this, issue a new incremental backup across both drives. The result
will be backup chains that resemble the following:

.. code:: text

   [drive0.full.qcow2] <-- [drive0.inc0.qcow2] <-- [drive0.inc1.qcow2]
   [drive1.full.qcow2] <-------------------------- [drive1.inc1.qcow2]

Example: Grouped Completion Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While jobs launched by transactions normally complete or fail individually,
it's possible to instruct them to complete or fail together as a group. QMP
transactions take an optional properties structure that can affect the
behavior of the transaction.

The ``completion-mode`` transaction property can be either ``individual``,
which is the default legacy behavior described above, or ``grouped``, detailed
below.

In ``grouped`` completion mode, no jobs will report success until all jobs are
ready to report success. If any job fails, all other jobs will be cancelled.

Regardless of whether a participating incremental backup job failed or was
cancelled, its associated bitmap will be held at its existing point-in-time,
as in the individual failure cases.
+ +Here's the same multi-drive backup scenario from `Example: Partial +Transactional Failures`_, but with the ``grouped`` completion-mode property +applied: + +#. Issue the multi-drive incremental backup transaction: + + .. code-block:: QMP + + -> { + "execute": "transaction", + "arguments": { + "properties": { + "completion-mode": "grouped" + }, + "actions": [ + { + "type": "blockdev-backup", + "data": { + "device": "drive0", + "bitmap": "bitmap0", + "sync": "incremental", + "target": "target0" + } + }, + { + "type": "blockdev-backup", + "data": { + "device": "drive1", + "bitmap": "bitmap0", + "sync": "incremental", + "target": "target1" + } + }] + } + } + +#. Receive notice that the Transaction was accepted, and jobs were launched: + + .. code-block:: QMP + + <- { "return": {} } + +#. Receive notification that the backup job for ``drive1`` has failed: + + .. code-block:: QMP + + <- { + "timestamp": {...}, + "data": { + "device": "drive1", + "action": "report", + "operation": "read" + }, + "event": "BLOCK_JOB_ERROR" + } + + <- { + "timestamp": {...}, + "data": { + "speed": 0, + "offset": 0, + "len": 67108864, + "error": "Input/output error", + "device": "drive1", + "type": "backup" + }, + "event": "BLOCK_JOB_COMPLETED" + } + +#. Receive notification that the job for ``drive0`` has been cancelled: + + .. code-block:: QMP + + <- { + "timestamp": {...}, + "data": { + "device": "drive0", + "type": "backup", + "speed": 0, + "len": 67108864, + "offset": 16777216 + }, + "event": "BLOCK_JOB_CANCELLED" + } + +At the conclusion of *this* example, both jobs have been aborted due to a +failure. Both destination images should be deleted and are no longer of use. + +The transaction as a whole can simply be re-issued at a later time. + +.. raw:: html + + <!-- + The FreeBSD Documentation License + + Redistribution and use in source (ReST) and 'compiled' forms (SGML, HTML, + PDF, PostScript, RTF and so forth) with or without modification, are + permitted provided that the following conditions are met: + + Redistributions of source code (ReST) must retain the above copyright notice, + this list of conditions and the following disclaimer of this file unmodified. + + Redistributions in compiled form (transformed to other DTDs, converted to + PDF, PostScript, RTF and other formats) must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS DOCUMENTATION IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS + IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE + LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENTATION, EVEN IF ADVISED OF + THE POSSIBILITY OF SUCH DAMAGE. 
-->

diff --git a/docs/interop/dbus-vmstate.rst b/docs/interop/dbus-vmstate.rst new file mode 100644 index 000000000..1d719c1c6 --- /dev/null +++ b/docs/interop/dbus-vmstate.rst @@ -0,0 +1,74 @@
=============
D-Bus VMState
=============

Introduction
============

The QEMU dbus-vmstate object aims to migrate the data of helpers
running on a QEMU D-Bus bus. (Refer to the :doc:`dbus` document for
some recommendations on D-Bus usage.)

Upon migration, QEMU will go through the queue of
``org.qemu.VMState1`` D-Bus name owners and query their ``Id``. It
must be unique among the helpers.

It will then save the arbitrary data of each Id to be transferred in
the migration stream and restored/loaded at the corresponding
destination helper.

For now, the amount of data to be transferred is arbitrarily limited
to 1 MiB. The state must be saved quickly (in a fraction of a
second). (D-Bus imposes a time limit on replies anyway, and migration
would fail if the data isn't provided quickly enough.)

The dbus-vmstate object can be configured with the expected list of
helpers by setting its ``id-list`` property to a comma-separated
``Id`` list.

Interface
=========

On object path ``/org/qemu/VMState1``, the following
``org.qemu.VMState1`` interface should be implemented:

.. code:: xml

  <interface name="org.qemu.VMState1">
    <property name="Id" type="s" access="read"/>
    <method name="Load">
      <arg type="ay" name="data" direction="in"/>
    </method>
    <method name="Save">
      <arg type="ay" name="data" direction="out"/>
    </method>
  </interface>

"Id" property
-------------

A string that identifies the helper uniquely. (Maximum 256 bytes,
including the terminating NUL byte.)

.. note::

   The helper ID namespace is a separate namespace. In particular, it is not
   related to the QEMU "id" used in -object/-device objects.

Load(in u8[] bytes) method
--------------------------

The method called on the destination with the state to restore.

The helper may be initially started in a waiting state (with
an --incoming argument, for example), and it may resume on success.

An error may be returned to the caller.

Save(out u8[] bytes) method
---------------------------

The method called on the source to get the current state to be
migrated. The helper should continue to run normally.

An error may be returned to the caller.
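For illustration, a minimal helper implementing this interface might look as
follows. This is a sketch only: it assumes the third-party ``pydbus`` binding
and, for simplicity, publishes on the session bus, whereas a real deployment
would join the VM's private bus; the ``Id`` and state contents are
placeholders.

.. code:: python

    from pydbus import SessionBus
    from gi.repository import GLib

    class VMState:
        """Toy org.qemu.VMState1 helper with an in-memory state blob."""
        dbus = """
        <node>
          <interface name="org.qemu.VMState1">
            <property name="Id" type="s" access="read"/>
            <method name="Load">
              <arg type="ay" name="data" direction="in"/>
            </method>
            <method name="Save">
              <arg type="ay" name="data" direction="out"/>
            </method>
          </interface>
        </node>
        """
        Id = "my-helper"                 # must be unique among helpers

        def __init__(self):
            self.state = b"some helper state"

        def Load(self, data):
            self.state = bytes(data)     # called on the destination

        def Save(self):
            return list(self.state)      # called on the source

    bus = SessionBus()
    bus.publish("org.qemu.VMState1", VMState())
    GLib.MainLoop().run()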
diff --git a/docs/interop/dbus.rst b/docs/interop/dbus.rst new file mode 100644 index 000000000..be596d3f4 --- /dev/null +++ b/docs/interop/dbus.rst @@ -0,0 +1,110 @@
=====
D-Bus
=====

Introduction
============

QEMU may be running with various helper processes involved:
 - vhost-user* processes (gpu, virtfs, input, etc...)
 - TPM emulation (or other devices)
 - user networking (slirp)
 - network services (DHCP/DNS, samba/ftp etc)
 - background tasks (compression, streaming etc)
 - client UI
 - admin & cli

Having several processes allows stricter security rules, as well as
greater modularity.

While QEMU itself uses QMP as its primary IPC (and Spice/VNC for remote
display), D-Bus is the de facto IPC of choice on Unix systems. The
wire format is machine friendly, good bindings exist for various
languages, and there are various tools available.

Using a bus, helper processes can discover and communicate with each
other easily, without going through QEMU. The bus topology is also
easier to understand and debug than a mesh. However, it is wise to
consider the security aspects of it.

Security
========

A QEMU D-Bus bus should be private to a single VM. Thus, only
cooperative tasks are running on the same bus to serve the VM.

D-Bus, the protocol and standard, doesn't have mechanisms to enforce
security between peers once the connection is established. Peers may
have additional mechanisms to enforce security rules, based for
example on UNIX credentials.

The daemon can control which peers can send/receive messages using
various metadata attributes; however, this alone is not generally
sufficient to make the deployment secure. The semantics of the actual
methods implemented using D-Bus are just as critical. Peers need to
carefully validate any information they receive from a peer with a
different trust level.

dbus-daemon policy
------------------

dbus-daemon can enforce various policies based on the UID/GID of the
processes that are connected to it. It is thus a good idea to run
helpers under a different UID from QEMU and set appropriate policies.

Depending on the use case, you may choose different scenarios:

 - Everything running as the same UID

   - Convenient for developers
   - Improved reliability - a crash of one part doesn't take
     out the entire VM
   - No security benefit over traditional QEMU, unless additional
     controls such as SELinux or AppArmor are applied

 - Two UIDs, one for QEMU, one for dbus & helpers

   - Moderately improved user-based security isolation

 - Many UIDs, one for QEMU, one for dbus, and one for each helper

   - Best user-based security isolation
   - Complex to manage the distinct UIDs needed for each VM

For example, to allow only the ``qemu`` user to talk to the
``org.qemu.Helper1`` service owned by the ``qemu-helper`` user, a
dbus-daemon policy may contain:

.. code:: xml

  <policy user="qemu">
     <allow send_destination="org.qemu.Helper1"/>
     <allow receive_sender="org.qemu.Helper1"/>
  </policy>

  <policy user="qemu-helper">
     <allow own="org.qemu.Helper1"/>
  </policy>


dbus-daemon can also perform SELinux checks based on the security
context of the source and the target. For example, ``virtiofs_t``
could be allowed to send a message to ``svirt_t``, but ``virtiofs_t``
wouldn't be allowed to send a message to ``virtiofs_t``.

See the dbus-daemon man page for details.

Guidelines
==========

When implementing new D-Bus interfaces, it is recommended to follow
the "D-Bus API Design Guidelines":
https://dbus.freedesktop.org/doc/dbus-api-design.html

The "org.qemu.*" prefix is reserved for services implemented &
distributed by the QEMU project.

QEMU Interfaces
===============

:doc:`dbus-vmstate`

diff --git a/docs/interop/firmware.json b/docs/interop/firmware.json new file mode 100644 index 000000000..8d8b0be03 --- /dev/null +++ b/docs/interop/firmware.json @@ -0,0 +1,593 @@
# -*- Mode: Python -*-
# vim: filetype=python
#
# Copyright (C) 2018 Red Hat, Inc.
#
# Authors:
#  Daniel P. Berrange <berrange@redhat.com>
#  Laszlo Ersek <lersek@redhat.com>
#
# This work is licensed under the terms of the GNU GPL, version 2 or
# later. See the COPYING file in the top-level directory.

##
# = Firmware
##

{ 'include' : 'machine.json' }
{ 'include' : 'block-core.json' }

##
# @FirmwareOSInterface:
#
# Lists the firmware-OS interface types provided by various firmware
# that is commonly used with QEMU virtual machines.
#
# @bios: Traditional x86 BIOS interface. For example, firmware built
#        from the SeaBIOS project usually provides this interface.
+# +# @openfirmware: The interface is defined by the (historical) IEEE +# 1275-1994 standard. Examples for firmware projects that +# provide this interface are: OpenBIOS and SLOF. +# +# @uboot: Firmware interface defined by the U-Boot project. +# +# @uefi: Firmware interface defined by the UEFI specification. For +# example, firmware built from the edk2 (EFI Development Kit II) +# project usually provides this interface. +# +# Since: 3.0 +## +{ 'enum' : 'FirmwareOSInterface', + 'data' : [ 'bios', 'openfirmware', 'uboot', 'uefi' ] } + +## +# @FirmwareDevice: +# +# Defines the device types that firmware can be mapped into. +# +# @flash: The firmware executable and its accompanying NVRAM file are to +# be mapped into a pflash chip each. +# +# @kernel: The firmware is to be loaded like a Linux kernel. This is +# similar to @memory but may imply additional processing that +# is specific to the target architecture and machine type. +# +# @memory: The firmware is to be mapped into memory. +# +# Since: 3.0 +## +{ 'enum' : 'FirmwareDevice', + 'data' : [ 'flash', 'kernel', 'memory' ] } + +## +# @FirmwareTarget: +# +# Defines the machine types that firmware may execute on. +# +# @architecture: Determines the emulation target (the QEMU system +# emulator) that can execute the firmware. +# +# @machines: Lists the machine types (known by the emulator that is +# specified through @architecture) that can execute the +# firmware. Elements of @machines are supposed to be concrete +# machine types, not aliases. Glob patterns are understood, +# which is especially useful for versioned machine types. +# (For example, the glob pattern "pc-i440fx-*" matches +# "pc-i440fx-2.12".) On the QEMU command line, "-machine +# type=..." specifies the requested machine type (but that +# option does not accept glob patterns). +# +# Since: 3.0 +## +{ 'struct' : 'FirmwareTarget', + 'data' : { 'architecture' : 'SysEmuTarget', + 'machines' : [ 'str' ] } } + +## +# @FirmwareFeature: +# +# Defines the features that firmware may support, and the platform +# requirements that firmware may present. +# +# @acpi-s3: The firmware supports S3 sleep (suspend to RAM), as defined +# in the ACPI specification. On the "pc-i440fx-*" machine +# types of the @i386 and @x86_64 emulation targets, S3 can be +# enabled with "-global PIIX4_PM.disable_s3=0" and disabled +# with "-global PIIX4_PM.disable_s3=1". On the "pc-q35-*" +# machine types of the @i386 and @x86_64 emulation targets, S3 +# can be enabled with "-global ICH9-LPC.disable_s3=0" and +# disabled with "-global ICH9-LPC.disable_s3=1". +# +# @acpi-s4: The firmware supports S4 hibernation (suspend to disk), as +# defined in the ACPI specification. On the "pc-i440fx-*" +# machine types of the @i386 and @x86_64 emulation targets, S4 +# can be enabled with "-global PIIX4_PM.disable_s4=0" and +# disabled with "-global PIIX4_PM.disable_s4=1". On the +# "pc-q35-*" machine types of the @i386 and @x86_64 emulation +# targets, S4 can be enabled with "-global +# ICH9-LPC.disable_s4=0" and disabled with "-global +# ICH9-LPC.disable_s4=1". +# +# @amd-sev: The firmware supports running under AMD Secure Encrypted +# Virtualization, as specified in the AMD64 Architecture +# Programmer's Manual. QEMU command line options related to +# this feature are documented in +# "docs/amd-memory-encryption.txt". +# +# @amd-sev-es: The firmware supports running under AMD Secure Encrypted +# Virtualization - Encrypted State, as specified in the AMD64 +# Architecture Programmer's Manual. 
QEMU command line options +# related to this feature are documented in +# "docs/amd-memory-encryption.txt". +# +# @enrolled-keys: The variable store (NVRAM) template associated with +# the firmware binary has the UEFI Secure Boot +# operational mode turned on, with certificates +# enrolled. +# +# @requires-smm: The firmware requires the platform to emulate SMM +# (System Management Mode), as defined in the AMD64 +# Architecture Programmer's Manual, and in the Intel(R)64 +# and IA-32 Architectures Software Developer's Manual. On +# the "pc-q35-*" machine types of the @i386 and @x86_64 +# emulation targets, SMM emulation can be enabled with +# "-machine smm=on". (On the "pc-q35-*" machine types of +# the @i386 emulation target, @requires-smm presents +# further CPU requirements; one combination known to work +# is "-cpu coreduo,nx=off".) If the firmware is marked as +# both @secure-boot and @requires-smm, then write +# accesses to the pflash chip (NVRAM) that holds the UEFI +# variable store must be restricted to code that executes +# in SMM, using the additional option "-global +# driver=cfi.pflash01,property=secure,value=on". +# Furthermore, a large guest-physical address space +# (comprising guest RAM, memory hotplug range, and 64-bit +# PCI MMIO aperture), and/or a high VCPU count, may +# present high SMRAM requirements from the firmware. On +# the "pc-q35-*" machine types of the @i386 and @x86_64 +# emulation targets, the SMRAM size may be increased +# above the default 16MB with the "-global +# mch.extended-tseg-mbytes=uint16" option. As a rule of +# thumb, the default 16MB size suffices for 1TB of +# guest-phys address space and a few tens of VCPUs; for +# every further TB of guest-phys address space, add 8MB +# of SMRAM. 48MB should suffice for 4TB of guest-phys +# address space and 2-3 hundred VCPUs. +# +# @secure-boot: The firmware implements the software interfaces for UEFI +# Secure Boot, as defined in the UEFI specification. Note +# that without @requires-smm, guest code running with +# kernel privileges can undermine the security of Secure +# Boot. +# +# @verbose-dynamic: When firmware log capture is enabled, the firmware +# logs a large amount of debug messages, which may +# impact boot performance. With log capture disabled, +# there is no boot performance impact. On the +# "pc-i440fx-*" and "pc-q35-*" machine types of the +# @i386 and @x86_64 emulation targets, firmware log +# capture can be enabled with the QEMU command line +# options "-chardev file,id=fwdebug,path=LOGFILEPATH +# -device isa-debugcon,iobase=0x402,chardev=fwdebug". +# @verbose-dynamic is mutually exclusive with +# @verbose-static. +# +# @verbose-static: The firmware unconditionally produces a large amount +# of debug messages, which may impact boot performance. +# This feature may typically be carried by certain UEFI +# firmware for the "virt-*" machine types of the @arm +# and @aarch64 emulation targets, where the debug +# messages are written to the first (always present) +# PL011 UART. @verbose-static is mutually exclusive +# with @verbose-dynamic. +# +# Since: 3.0 +## +{ 'enum' : 'FirmwareFeature', + 'data' : [ 'acpi-s3', 'acpi-s4', 'amd-sev', 'amd-sev-es', 'enrolled-keys', + 'requires-smm', 'secure-boot', 'verbose-dynamic', + 'verbose-static' ] } + +## +# @FirmwareFlashFile: +# +# Defines common properties that are necessary for loading a firmware +# file into a pflash chip. The corresponding QEMU command line option is +# "-drive file=@filename,format=@format". 
Note however that the +# option-argument shown here is incomplete; it is completed under +# @FirmwareMappingFlash. +# +# @filename: Specifies the filename on the host filesystem where the +# firmware file can be found. +# +# @format: Specifies the block format of the file pointed-to by +# @filename, such as @raw or @qcow2. +# +# Since: 3.0 +## +{ 'struct' : 'FirmwareFlashFile', + 'data' : { 'filename' : 'str', + 'format' : 'BlockdevDriver' } } + +## +# @FirmwareMappingFlash: +# +# Describes loading and mapping properties for the firmware executable +# and its accompanying NVRAM file, when @FirmwareDevice is @flash. +# +# @executable: Identifies the firmware executable. The firmware +# executable may be shared by multiple virtual machine +# definitions. The preferred corresponding QEMU command +# line options are +# -drive if=none,id=pflash0,readonly=on,file=@executable.@filename,format=@executable.@format +# -machine pflash0=pflash0 +# or equivalent -blockdev instead of -drive. +# With QEMU versions older than 4.0, you have to use +# -drive if=pflash,unit=0,readonly=on,file=@executable.@filename,format=@executable.@format +# +# @nvram-template: Identifies the NVRAM template compatible with +# @executable. Management software instantiates an +# individual copy -- a specific NVRAM file -- from +# @nvram-template.@filename for each new virtual +# machine definition created. @nvram-template.@filename +# itself is never mapped into virtual machines, only +# individual copies of it are. An NVRAM file is +# typically used for persistently storing the +# non-volatile UEFI variables of a virtual machine +# definition. The preferred corresponding QEMU +# command line options are +# -drive if=none,id=pflash1,readonly=off,file=FILENAME_OF_PRIVATE_NVRAM_FILE,format=@nvram-template.@format +# -machine pflash1=pflash1 +# or equivalent -blockdev instead of -drive. +# With QEMU versions older than 4.0, you have to use +# -drive if=pflash,unit=1,readonly=off,file=FILENAME_OF_PRIVATE_NVRAM_FILE,format=@nvram-template.@format +# +# Since: 3.0 +## +{ 'struct' : 'FirmwareMappingFlash', + 'data' : { 'executable' : 'FirmwareFlashFile', + 'nvram-template' : 'FirmwareFlashFile' } } + +## +# @FirmwareMappingKernel: +# +# Describes loading and mapping properties for the firmware executable, +# when @FirmwareDevice is @kernel. +# +# @filename: Identifies the firmware executable. The firmware executable +# may be shared by multiple virtual machine definitions. The +# corresponding QEMU command line option is "-kernel +# @filename". +# +# Since: 3.0 +## +{ 'struct' : 'FirmwareMappingKernel', + 'data' : { 'filename' : 'str' } } + +## +# @FirmwareMappingMemory: +# +# Describes loading and mapping properties for the firmware executable, +# when @FirmwareDevice is @memory. +# +# @filename: Identifies the firmware executable. The firmware executable +# may be shared by multiple virtual machine definitions. The +# corresponding QEMU command line option is "-bios +# @filename". +# +# Since: 3.0 +## +{ 'struct' : 'FirmwareMappingMemory', + 'data' : { 'filename' : 'str' } } + +## +# @FirmwareMapping: +# +# Provides a discriminated structure for firmware to describe its +# loading / mapping properties. +# +# @device: Selects the device type that the firmware must be mapped +# into. 
+# +# Since: 3.0 +## +{ 'union' : 'FirmwareMapping', + 'base' : { 'device' : 'FirmwareDevice' }, + 'discriminator' : 'device', + 'data' : { 'flash' : 'FirmwareMappingFlash', + 'kernel' : 'FirmwareMappingKernel', + 'memory' : 'FirmwareMappingMemory' } } + +## +# @Firmware: +# +# Describes a firmware (or a firmware use case) to management software. +# +# It is possible for multiple @Firmware elements to match the search +# criteria of management software. Applications thus need rules to pick +# one of the many matches, and users need the ability to override distro +# defaults. +# +# It is recommended to create firmware JSON files (each containing a +# single @Firmware root element) with a double-digit prefix, for example +# "50-ovmf.json", "50-seabios-256k.json", etc, so they can be sorted in +# predictable order. The firmware JSON files should be searched for in +# three directories: +# +# - /usr/share/qemu/firmware -- populated by distro-provided firmware +# packages (XDG_DATA_DIRS covers +# /usr/share by default), +# +# - /etc/qemu/firmware -- exclusively for sysadmins' local additions, +# +# - $XDG_CONFIG_HOME/qemu/firmware -- exclusively for per-user local +# additions (XDG_CONFIG_HOME +# defaults to $HOME/.config). +# +# Top-down, the list of directories goes from general to specific. +# +# Management software should build a list of files from all three +# locations, then sort the list by filename (i.e., last pathname +# component). Management software should choose the first JSON file on +# the sorted list that matches the search criteria. If a more specific +# directory has a file with same name as a less specific directory, then +# the file in the more specific directory takes effect. If the more +# specific file is zero length, it hides the less specific one. +# +# For example, if a distro ships +# +# - /usr/share/qemu/firmware/50-ovmf.json +# +# - /usr/share/qemu/firmware/50-seabios-256k.json +# +# then the sysadmin can prevent the default OVMF being used at all with +# +# $ touch /etc/qemu/firmware/50-ovmf.json +# +# The sysadmin can replace/alter the distro default OVMF with +# +# $ vim /etc/qemu/firmware/50-ovmf.json +# +# or they can provide a parallel OVMF with higher priority +# +# $ vim /etc/qemu/firmware/10-ovmf.json +# +# or they can provide a parallel OVMF with lower priority +# +# $ vim /etc/qemu/firmware/99-ovmf.json +# +# @description: Provides a human-readable description of the firmware. +# Management software may or may not display @description. +# +# @interface-types: Lists the types of interfaces that the firmware can +# expose to the guest OS. This is a non-empty, ordered +# list; entries near the beginning of @interface-types +# are considered more native to the firmware, and/or +# to have a higher quality implementation in the +# firmware, than entries near the end of +# @interface-types. +# +# @mapping: Describes the loading / mapping properties of the firmware. +# +# @targets: Collects the target architectures (QEMU system emulators) +# and their machine types that may execute the firmware. +# +# @features: Lists the features that the firmware supports, and the +# platform requirements it presents. +# +# @tags: A list of auxiliary strings associated with the firmware for +# which @description is not appropriate, due to the latter's +# possible exposure to the end-user. @tags serves development and +# debugging purposes only, and management software shall +# explicitly ignore it. 
+# +# Since: 3.0 +# +# Examples: +# +# { +# "description": "SeaBIOS", +# "interface-types": [ +# "bios" +# ], +# "mapping": { +# "device": "memory", +# "filename": "/usr/share/seabios/bios-256k.bin" +# }, +# "targets": [ +# { +# "architecture": "i386", +# "machines": [ +# "pc-i440fx-*", +# "pc-q35-*" +# ] +# }, +# { +# "architecture": "x86_64", +# "machines": [ +# "pc-i440fx-*", +# "pc-q35-*" +# ] +# } +# ], +# "features": [ +# "acpi-s3", +# "acpi-s4" +# ], +# "tags": [ +# "CONFIG_BOOTSPLASH=n", +# "CONFIG_ROM_SIZE=256", +# "CONFIG_USE_SMM=n" +# ] +# } +# +# { +# "description": "OVMF with SB+SMM, empty varstore", +# "interface-types": [ +# "uefi" +# ], +# "mapping": { +# "device": "flash", +# "executable": { +# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", +# "format": "raw" +# }, +# "nvram-template": { +# "filename": "/usr/share/OVMF/OVMF_VARS.fd", +# "format": "raw" +# } +# }, +# "targets": [ +# { +# "architecture": "x86_64", +# "machines": [ +# "pc-q35-*" +# ] +# } +# ], +# "features": [ +# "acpi-s3", +# "amd-sev", +# "requires-smm", +# "secure-boot", +# "verbose-dynamic" +# ], +# "tags": [ +# "-a IA32", +# "-a X64", +# "-p OvmfPkg/OvmfPkgIa32X64.dsc", +# "-t GCC48", +# "-b DEBUG", +# "-D SMM_REQUIRE", +# "-D SECURE_BOOT_ENABLE", +# "-D FD_SIZE_4MB" +# ] +# } +# +# { +# "description": "OVMF with SB+SMM, SB enabled, MS certs enrolled", +# "interface-types": [ +# "uefi" +# ], +# "mapping": { +# "device": "flash", +# "executable": { +# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", +# "format": "raw" +# }, +# "nvram-template": { +# "filename": "/usr/share/OVMF/OVMF_VARS.secboot.fd", +# "format": "raw" +# } +# }, +# "targets": [ +# { +# "architecture": "x86_64", +# "machines": [ +# "pc-q35-*" +# ] +# } +# ], +# "features": [ +# "acpi-s3", +# "amd-sev", +# "enrolled-keys", +# "requires-smm", +# "secure-boot", +# "verbose-dynamic" +# ], +# "tags": [ +# "-a IA32", +# "-a X64", +# "-p OvmfPkg/OvmfPkgIa32X64.dsc", +# "-t GCC48", +# "-b DEBUG", +# "-D SMM_REQUIRE", +# "-D SECURE_BOOT_ENABLE", +# "-D FD_SIZE_4MB" +# ] +# } +# +# { +# "description": "OVMF with SEV-ES support", +# "interface-types": [ +# "uefi" +# ], +# "mapping": { +# "device": "flash", +# "executable": { +# "filename": "/usr/share/OVMF/OVMF_CODE.fd", +# "format": "raw" +# }, +# "nvram-template": { +# "filename": "/usr/share/OVMF/OVMF_VARS.fd", +# "format": "raw" +# } +# }, +# "targets": [ +# { +# "architecture": "x86_64", +# "machines": [ +# "pc-q35-*" +# ] +# } +# ], +# "features": [ +# "acpi-s3", +# "amd-sev", +# "amd-sev-es", +# "verbose-dynamic" +# ], +# "tags": [ +# "-a X64", +# "-p OvmfPkg/OvmfPkgX64.dsc", +# "-t GCC48", +# "-b DEBUG", +# "-D FD_SIZE_4MB" +# ] +# } +# +# { +# "description": "UEFI firmware for ARM64 virtual machines", +# "interface-types": [ +# "uefi" +# ], +# "mapping": { +# "device": "flash", +# "executable": { +# "filename": "/usr/share/AAVMF/AAVMF_CODE.fd", +# "format": "raw" +# }, +# "nvram-template": { +# "filename": "/usr/share/AAVMF/AAVMF_VARS.fd", +# "format": "raw" +# } +# }, +# "targets": [ +# { +# "architecture": "aarch64", +# "machines": [ +# "virt-*" +# ] +# } +# ], +# "features": [ +# +# ], +# "tags": [ +# "-a AARCH64", +# "-p ArmVirtPkg/ArmVirtQemu.dsc", +# "-t GCC48", +# "-b DEBUG", +# "-D DEBUG_PRINT_ERROR_LEVEL=0x80000000" +# ] +# } +## +{ 'struct' : 'Firmware', + 'data' : { 'description' : 'str', + 'interface-types' : [ 'FirmwareOSInterface' ], + 'mapping' : 'FirmwareMapping', + 'targets' : [ 'FirmwareTarget' ], + 'features' : [ 'FirmwareFeature' ], + 'tags' : [ 'str' ] } } diff 
--git a/docs/interop/index.rst b/docs/interop/index.rst
new file mode 100644
index 000000000..47b9ed82b
--- /dev/null
+++ b/docs/interop/index.rst
@@ -0,0 +1,23 @@
+------------------------------------------------
+System Emulation Management and Interoperability
+------------------------------------------------
+
+This section of the manual contains documents and specifications that
+are useful for making QEMU interoperate with other software.
+
+.. toctree::
+   :maxdepth: 2
+
+   barrier
+   bitmaps
+   dbus
+   dbus-vmstate
+   live-block-operations
+   pr-helper
+   qemu-ga
+   qemu-ga-ref
+   qemu-qmp-ref
+   qemu-storage-daemon-qmp-ref
+   vhost-user
+   vhost-user-gpu
+   vhost-vdpa
diff --git a/docs/interop/live-block-operations.rst b/docs/interop/live-block-operations.rst
new file mode 100644
index 000000000..39e62c991
--- /dev/null
+++ b/docs/interop/live-block-operations.rst
@@ -0,0 +1,1121 @@
+..
+    Copyright (C) 2017 Red Hat Inc.
+
+    This work is licensed under the terms of the GNU GPL, version 2 or
+    later. See the COPYING file in the top-level directory.
+
+============================
+Live Block Device Operations
+============================
+
+The QEMU block layer currently (as of QEMU 2.9) supports four major
+kinds of live block device jobs -- stream, commit, mirror, and backup.
+These can be used to manipulate disk image chains to accomplish certain
+tasks, namely: live copying data from backing files into overlays;
+shortening long disk image chains by merging data from overlays into
+backing files; live synchronizing data from a disk image chain
+(including the current active disk) to another target image; and taking
+point-in-time (and incremental) backups of a block device. Below is a
+description of the said block (QMP) primitives, and some
+(non-exhaustive) examples to illustrate their use.
+
+.. note::
+    The file ``qapi/block-core.json`` in the QEMU source tree has the
+    canonical QEMU API (QAPI) schema documentation for the QMP
+    primitives discussed here.
+
+.. todo (kashyapc):: Remove the ".. contents::" directive when Sphinx
+                     is integrated.
+
+.. contents::
+
+Disk image backing chain notation
+---------------------------------
+
+A simple disk image chain. (This can be created live using QMP
+``blockdev-snapshot-sync``, or offline via ``qemu-img``)::
+
+                   (Live QEMU)
+                        |
+                        .
+                        V
+
+            [A] <----- [B]
+
+    (backing file)    (overlay)
+
+The arrow can be read as: Image [A] is the backing file of disk image
+[B]. And live QEMU is currently writing to image [B], consequently, it
+is also referred to as the "active layer".
+
+There are two kinds of terminology that are common when referring to
+files in a disk image backing chain:
+
+(1) Directional: 'base' and 'top'. Given the simple disk image chain
+    above, image [A] can be referred to as 'base', and image [B] as
+    'top'. (This terminology can be seen in the QAPI schema file,
+    block-core.json.)
+
+(2) Relational: 'backing file' and 'overlay'. Again, taking the same
+    simple disk image chain from the above, disk image [A] is referred
+    to as the backing file, and image [B] as the overlay.
+
+    Throughout this document, we will use the relational terminology.
+
+.. important::
+    The overlay files can generally be any format that supports a
+    backing file, although QCOW2 is the preferred format and the one
+    used in this document.
+
+
+Brief overview of live block QMP primitives
+-------------------------------------------
+
+The following are the four different kinds of live block operations that
+the QEMU block layer supports.
+
+(1) ``block-stream``: Live copy of data from backing files into overlay
+    files.
+
+    .. note:: Once the 'stream' operation has finished, three things to
+              note:
+
+            (a) QEMU rewrites the backing chain to remove
+                reference to the now-streamed and redundant backing
+                file;
+
+            (b) the streamed file *itself* won't be removed by QEMU,
+                and must be explicitly discarded by the user;
+
+            (c) the streamed file remains valid -- i.e. further
+                overlays can be created based on it. Refer to the
+                ``block-stream`` section further below for more
+                details.
+
+(2) ``block-commit``: Live merge of data from overlay files into backing
+    files (with the optional goal of removing the overlay file from the
+    chain). Since QEMU 2.0, this includes "active ``block-commit``"
+    (i.e. merge the current active layer into the base image).
+
+    .. note:: Once the 'commit' operation has finished, there are three
+              things to note here as well:
+
+            (a) QEMU rewrites the backing chain to remove reference
+                to now-redundant overlay images that have been
+                committed into a backing file;
+
+            (b) the committed file *itself* won't be removed by QEMU
+                -- it ought to be manually removed;
+
+            (c) however, unlike in the case of ``block-stream``, the
+                intermediate images will be rendered invalid -- i.e.
+                no more further overlays can be created based on
+                them. Refer to the ``block-commit`` section further
+                below for more details.
+
+(3) ``drive-mirror`` (and ``blockdev-mirror``): Synchronize a running
+    disk to another image.
+
+(4) ``blockdev-backup`` (and the deprecated ``drive-backup``):
+    Point-in-time (live) copy of a block device to a destination.
+
+
+.. _`Interacting with a QEMU instance`:
+
+Interacting with a QEMU instance
+--------------------------------
+
+To show some example command-line invocations, we will use the
+following invocation of QEMU, with a QMP server running over a UNIX
+socket:
+
+.. parsed-literal::
+
+  $ |qemu_system| -display none -no-user-config -nodefaults \\
+    -m 512 -blockdev \\
+    node-name=node-A,driver=qcow2,file.driver=file,file.node-name=file,file.filename=./a.qcow2 \\
+    -device virtio-blk,drive=node-A,id=virtio0 \\
+    -monitor stdio -qmp unix:/tmp/qmp-sock,server=on,wait=off
+
+The ``-blockdev`` command-line option, used above, is available from
+QEMU 2.9 onwards. In the above invocation, notice the ``node-name``
+parameter that is used to refer to the disk image a.qcow2 ('node-A') --
+this is a cleaner way to refer to a disk image (as opposed to referring
+to it by spelling out file paths). So, we will continue to assign a
+``node-name`` to each further disk image created (either via
+``blockdev-snapshot-sync``, or ``blockdev-add``) as part of the disk
+image chain, and continue to refer to the disks using their
+``node-name`` (where possible, because ``block-commit`` does not yet,
+as of QEMU 2.9, accept a ``node-name`` parameter) when performing
+various block operations.
+
+To interact with the QEMU instance launched above, we will use the
+``qmp-shell`` utility (located at: ``qemu/scripts/qmp``, as part of the
+QEMU source directory), which takes key-value pairs for QMP commands.
+Invoke it as below (which will also print out the complete raw JSON
+syntax for reference -- examples in the following sections)::
+
+    $ ./qmp-shell -v -p /tmp/qmp-sock
+    (QEMU)
+
+.. note::
+    In the event we have to repeat a certain QMP command, we will: for
+    the first occurrence of it, show the ``qmp-shell`` invocation, *and*
+    the corresponding raw JSON QMP syntax; but for subsequent
+    invocations, present just the ``qmp-shell`` syntax, and omit the
+    equivalent JSON output.
+
+
+Example disk image chain
+------------------------
+
+We will use the below disk image chain (and occasionally spelling it
+out where appropriate) when discussing various primitives::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+Where [A] is the original base image; [B] and [C] are intermediate
+overlay images; image [D] is the active layer -- i.e. live QEMU is
+writing to it. (The rule of thumb is: live QEMU will always be pointing
+to the rightmost image in a disk image chain.)
+
+The above image chain can be created by invoking
+``blockdev-snapshot-sync`` commands as follows (which shows the
+creation of overlay image [B]) using ``qmp-shell`` (our invocation
+also prints the raw JSON invocation of it)::
+
+    (QEMU) blockdev-snapshot-sync node-name=node-A snapshot-file=b.qcow2 snapshot-node-name=node-B format=qcow2
+    {
+        "execute": "blockdev-snapshot-sync",
+        "arguments": {
+            "node-name": "node-A",
+            "snapshot-file": "b.qcow2",
+            "format": "qcow2",
+            "snapshot-node-name": "node-B"
+        }
+    }
+
+Here, "node-A" is the name QEMU internally uses to refer to the base
+image [A] -- it is the backing file, based on which the overlay image,
+[B], is created.
+
+To create the rest of the overlay images, [C], and [D] (omitting the raw
+JSON output for brevity)::
+
+    (QEMU) blockdev-snapshot-sync node-name=node-B snapshot-file=c.qcow2 snapshot-node-name=node-C format=qcow2
+    (QEMU) blockdev-snapshot-sync node-name=node-C snapshot-file=d.qcow2 snapshot-node-name=node-D format=qcow2
+
+
+A note on points-in-time vs file names
+--------------------------------------
+
+In our disk image chain::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+We have *three* points in time and an active layer:
+
+- Point 1: Guest state when [B] was created is contained in file [A]
+- Point 2: Guest state when [C] was created is contained in [A] + [B]
+- Point 3: Guest state when [D] was created is contained in
+  [A] + [B] + [C]
+- Active layer: Current guest state is contained in
+  [A] + [B] + [C] + [D]
+
+Therefore, be careful with naming choices:
+
+- Naming a file after the time it is created is misleading -- the
+  guest data for that point in time is *not* contained in that file
+  (as explained earlier)
+- Rather, think of files as a *delta* from the backing file
+
+
+Live block streaming --- ``block-stream``
+-----------------------------------------
+
+The ``block-stream`` command allows you to live-copy data from backing
+files into overlay images.
+
+Given our original example disk image chain from earlier::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+The disk image chain can be shortened in one of the following different
+ways (not an exhaustive list).
+
+.. _`Case-1`:
+
+(1) Merge everything into the active layer: i.e. copy all contents from
+    the base image, [A], and overlay images, [B] and [C], into [D],
+    *while* the guest is running. The resulting chain will be a
+    standalone image, [D] -- with contents from [A], [B] and [C] merged
+    into it (where live QEMU writes go to)::
+
+        [D]
+
+.. _`Case-2`:
+
+(2) Taking the same example disk image chain mentioned earlier, merge
+    only images [B] and [C] into [D], the active layer.
+    The result is that the contents of images [B] and [C] will be
+    copied into [D], and the backing file pointer of image [D] will be
+    adjusted to point to image [A]. The resulting chain will be::
+
+        [A] <-- [D]
+
+.. _`Case-3`:
+
+(3) Intermediate streaming (available since QEMU 2.8): Starting afresh
+    with the original example disk image chain, with a total of four
+    images, it is possible to copy contents from image [B] into image
+    [C]. Once the copy is finished, image [B] can now be (optionally)
+    discarded; and the backing file pointer of image [C] will be
+    adjusted to point to [A]. I.e. after performing "intermediate
+    streaming" of [B] into [C], the resulting image chain will be
+    (where live QEMU is writing to [D])::
+
+        [A] <-- [C] <-- [D]
+
+
+QMP invocation for ``block-stream``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For `Case-1`_, to merge contents of all the backing files into the
+active layer, where 'node-D' is the current active image (by default
+``block-stream`` will flatten the entire chain); ``qmp-shell`` (and its
+corresponding JSON output)::
+
+    (QEMU) block-stream device=node-D job-id=job0
+    {
+        "execute": "block-stream",
+        "arguments": {
+            "device": "node-D",
+            "job-id": "job0"
+        }
+    }
+
+For `Case-2`_, merge contents of the images [B] and [C] into [D], where
+image [D] ends up referring to image [A] as its backing file::
+
+    (QEMU) block-stream device=node-D base-node=node-A job-id=job0
+
+And for `Case-3`_, of "intermediate streaming", merge contents of
+images [B] into [C], where [C] ends up referring to [A] as its backing
+image::
+
+    (QEMU) block-stream device=node-C base-node=node-A job-id=job0
+
+Progress of a ``block-stream`` operation can be monitored via the QMP
+command::
+
+    (QEMU) query-block-jobs
+    {
+        "execute": "query-block-jobs",
+        "arguments": {}
+    }
+
+
+Once the ``block-stream`` operation has completed, QEMU will emit an
+event, ``BLOCK_JOB_COMPLETED``. The intermediate overlays remain valid,
+and can now be (optionally) discarded, or retained to create further
+overlays based on them. Finally, ``block-stream`` jobs can be restarted
+at any time.
+
+
+Live block commit --- ``block-commit``
+--------------------------------------
+
+The ``block-commit`` command lets you merge live data from overlay
+images into backing file(s). Since QEMU 2.0, this includes "live active
+commit" (i.e. it is possible to merge the "active layer", the right-most
+image in a disk image chain where live QEMU will be writing to, into the
+base image). This is analogous to ``block-stream``, but in the opposite
+direction.
+
+Again, starting afresh with our example disk image chain, where live
+QEMU is writing to the right-most image in the chain, [D]::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+The disk image chain can be shortened in one of the following ways:
+
+.. _`block-commit_Case-1`:
+
+(1) Commit content from only image [B] into image [A]. The resulting
+    chain is the following, where image [C] is adjusted to point at [A]
+    as its new backing file::
+
+        [A] <-- [C] <-- [D]
+
+(2) Commit content from images [B] and [C] into image [A]. The
+    resulting chain, where image [D] is adjusted to point to image [A]
+    as its new backing file::
+
+        [A] <-- [D]
+
+.. _`block-commit_Case-3`:
+
+(3) Commit content from images [B], [C], and the active layer [D] into
+    image [A]. The resulting chain (in this case, a consolidated single
+    image)::
+
+        [A]
+
+(4) Commit content from only image [C] into image [B].
+    The resulting chain::
+
+        [A] <-- [B] <-- [D]
+
+(5) Commit content from image [C] and the active layer [D] into image
+    [B]. The resulting chain::
+
+        [A] <-- [B]
+
+
+QMP invocation for ``block-commit``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For :ref:`Case-1 <block-commit_Case-1>`, to merge contents only from
+image [B] into image [A], the invocation is as follows::
+
+    (QEMU) block-commit device=node-D base=a.qcow2 top=b.qcow2 job-id=job0
+    {
+        "execute": "block-commit",
+        "arguments": {
+            "device": "node-D",
+            "job-id": "job0",
+            "top": "b.qcow2",
+            "base": "a.qcow2"
+        }
+    }
+
+Once the above ``block-commit`` operation has completed, a
+``BLOCK_JOB_COMPLETED`` event will be issued, and no further action is
+required. As the end result, the backing file of image [C] is adjusted
+to point to image [A], and the original 4-image chain will end up being
+transformed to::
+
+    [A] <-- [C] <-- [D]
+
+.. note::
+    The intermediate image [B] is invalid (as in: no more further
+    overlays based on it can be created).
+
+    Reasoning: An intermediate image after a 'stream' operation still
+    represents that old point-in-time, and may be valid in that context.
+    However, an intermediate image after a 'commit' operation no longer
+    represents any point-in-time, and is invalid in any context.
+
+
+However, :ref:`Case-3 <block-commit_Case-3>` (also called: "active
+``block-commit``") is a *two-phase* operation: In the first phase, the
+content from the active overlay, along with the intermediate overlays,
+is copied into the backing file (also called the base image). In the
+second phase, the said backing file is made the current active image --
+this is done by issuing the command ``block-job-complete``. Optionally,
+the ``block-commit`` operation can be cancelled by issuing the command
+``block-job-cancel``, but be careful when doing this.
+
+Once the first phase of the ``block-commit`` operation has completed,
+the event ``BLOCK_JOB_READY`` will be emitted, signalling that the
+synchronization has finished. Now the job can be gracefully completed
+by issuing the command ``block-job-complete`` -- until such a command
+is issued, the 'commit' operation remains active.
+
+The following is the flow for :ref:`Case-3 <block-commit_Case-3>` to
+convert a disk image chain such as this::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+Into::
+
+    [A]
+
+Where content from all the subsequent overlays, [B], and [C], including
+the active layer, [D], is committed back to [A] -- which is where live
+QEMU is performing all its current writes.
+
+Start the "active ``block-commit``" operation::
+
+    (QEMU) block-commit device=node-D base=a.qcow2 top=d.qcow2 job-id=job0
+    {
+        "execute": "block-commit",
+        "arguments": {
+            "device": "node-D",
+            "job-id": "job0",
+            "top": "d.qcow2",
+            "base": "a.qcow2"
+        }
+    }
+
+
+Once the synchronization has completed, the event ``BLOCK_JOB_READY``
+will be emitted.
+
+Then, optionally query for the status of the active block operations.
+We can see the 'commit' job is now ready to be completed, as indicated
+by the line *"ready": true*::
+
+    (QEMU) query-block-jobs
+    {
+        "execute": "query-block-jobs",
+        "arguments": {}
+    }
+    {
+        "return": [
+            {
+                "busy": false,
+                "type": "commit",
+                "len": 1376256,
+                "paused": false,
+                "ready": true,
+                "io-status": "ok",
+                "offset": 1376256,
+                "device": "job0",
+                "speed": 0
+            }
+        ]
+    }
+
+Gracefully complete the 'commit' block device job::
+
+    (QEMU) block-job-complete device=job0
+    {
+        "execute": "block-job-complete",
+        "arguments": {
+            "device": "job0"
+        }
+    }
+    {
+        "return": {}
+    }
+
+Finally, once the above job is completed, an event
+``BLOCK_JOB_COMPLETED`` will be emitted.
+
+.. note::
+    The invocation for the rest of the cases (2, 4, and 5), discussed
+    in the previous section, is omitted for brevity.
+
+
+Live disk synchronization --- ``drive-mirror`` and ``blockdev-mirror``
+----------------------------------------------------------------------
+
+Synchronize a running disk image chain (all or part of it) to a target
+image.
+
+Again, given our familiar disk image chain::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+The ``drive-mirror`` command (and its newer equivalent
+``blockdev-mirror``) allows you to copy data from the entire chain into
+a single target image (which can be located on a different host), [E].
+
+.. note::
+
+    When you cancel an in-progress 'mirror' job *before* the source and
+    target are synchronized, ``block-job-cancel`` will emit the event
+    ``BLOCK_JOB_CANCELLED``. However, note that if you cancel a
+    'mirror' job *after* it has indicated (via the event
+    ``BLOCK_JOB_READY``) that the source and target have reached
+    synchronization, then the event emitted by ``block-job-cancel``
+    changes to ``BLOCK_JOB_COMPLETED``.
+
+    Besides the 'mirror' job, the "active ``block-commit``" is the only
+    other block device job that emits the event ``BLOCK_JOB_READY``.
+    The rest of the block device jobs ('stream', "non-active
+    ``block-commit``", and 'backup') end automatically.
+
+So there are two possible actions to take after a 'mirror' job has
+emitted the event ``BLOCK_JOB_READY``, indicating that the source and
+target have reached synchronization:
+
+(1) Issuing the command ``block-job-cancel`` (after it emits the event
+    ``BLOCK_JOB_COMPLETED``) will create a point-in-time (which is at
+    the time of *triggering* the cancel command) copy of the entire disk
+    image chain (or only the top-most image, depending on the ``sync``
+    mode), contained in the target image [E]. One use case for this is
+    live VM migration with non-shared storage.
+
+(2) Issuing the command ``block-job-complete`` (after it emits the event
+    ``BLOCK_JOB_COMPLETED``) will adjust the guest device (i.e. live
+    QEMU) to point to the target image, [E], causing all the new writes
+    from this point on to happen there.
+
+About synchronization modes: The synchronization mode determines
+*which* part of the disk image chain will be copied to the target.
+Currently, there are four different kinds:
+
+(1) ``full`` -- Synchronize the content of the entire disk image chain
+    to the target
+
+(2) ``top`` -- Synchronize only the contents of the top-most disk image
+    in the chain to the target
+
+(3) ``none`` -- Synchronize only the new writes from this point on.
+
+    .. note:: In the case of ``blockdev-backup`` (or the deprecated
+              ``drive-backup``), the behavior of the ``none``
+              synchronization mode is different.
+              Normally, a ``backup`` job consists of two parts:
+              Anything that is overwritten by the guest is first
+              copied out to the backup, and in the background the
+              whole image is copied from start to end. With
+              ``sync=none``, it's only the first part.
+
+(4) ``incremental`` -- Synchronize content that is described by the
+    dirty bitmap
+
+.. note::
+    Refer to the :doc:`bitmaps` document in the QEMU source
+    tree to learn about the detailed workings of the ``incremental``
+    synchronization mode.
+
+
+QMP invocation for ``drive-mirror``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To copy the contents of the entire disk image chain, from [A] all the
+way to [D], to a new target (``drive-mirror`` will create the
+destination file, if it doesn't already exist), call it [E]::
+
+    (QEMU) drive-mirror device=node-D target=e.qcow2 sync=full job-id=job0
+    {
+        "execute": "drive-mirror",
+        "arguments": {
+            "device": "node-D",
+            "job-id": "job0",
+            "target": "e.qcow2",
+            "sync": "full"
+        }
+    }
+
+The ``"sync": "full"``, from the above, means: copy the *entire* chain
+to the destination.
+
+Following the above, querying for active block jobs will show that a
+'mirror' job is "ready" to be completed (and QEMU will also emit an
+event, ``BLOCK_JOB_READY``)::
+
+    (QEMU) query-block-jobs
+    {
+        "execute": "query-block-jobs",
+        "arguments": {}
+    }
+    {
+        "return": [
+            {
+                "busy": false,
+                "type": "mirror",
+                "len": 21757952,
+                "paused": false,
+                "ready": true,
+                "io-status": "ok",
+                "offset": 21757952,
+                "device": "job0",
+                "speed": 0
+            }
+        ]
+    }
+
+And, as noted in the previous section, there are two possible actions
+at this point:
+
+(a) Create a point-in-time snapshot by ending the synchronization. The
+    point-in-time is at the time of *ending* the sync. (The result of
+    the following being: the target image, [E], will be populated with
+    content from the entire chain, [A] to [D])::
+
+        (QEMU) block-job-cancel device=job0
+        {
+            "execute": "block-job-cancel",
+            "arguments": {
+                "device": "job0"
+            }
+        }
+
+(b) Or, complete the operation and pivot the live QEMU to the target
+    copy::
+
+        (QEMU) block-job-complete device=job0
+
+In either of the above cases, if you once again run the
+``query-block-jobs`` command, there should not be any active block
+operation.
+
+Comparing 'commit' and 'mirror': In both cases, the overlay images can
+be discarded. However, with 'commit', the *existing* base image will
+be modified (by updating it with contents from overlays); while in the
+case of 'mirror', a *new* target image is populated with the data from
+the disk image chain.
+
+
+QMP invocation for live storage migration with ``drive-mirror`` + NBD
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Live storage migration (without a shared storage setup) is one of the
+most common use-cases that takes advantage of the ``drive-mirror``
+primitive and QEMU's built-in Network Block Device (NBD) server.
+Here's a quick walk-through of this setup.
+
+Given the disk image chain::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+Instead of copying content from the entire chain, synchronize *only*
+the contents of the *top*-most disk image (i.e. the active layer), [D],
+to a target, say, [TargetDisk].
+
+.. important::
+    The destination host must already have the contents of the backing
+    chain, involving images [A], [B], and [C], visible via other means
+    -- whether by ``cp``, ``rsync``, or by some storage array-specific
+    command.
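+
+.. note::
+    As an illustration only (not part of the original walkthrough): one
+    hedged way to produce such a consolidated copy of the backing chain
+    is to flatten it with ``qemu-img convert``, which reads through the
+    entire chain of [C] and writes a standalone image. The file names
+    below are assumptions that follow this document's examples::
+
+        # A minimal sketch: flatten the chain [A] <-- [B] <-- [C] into
+        # one standalone qcow2 file to be copied to the destination.
+        import subprocess
+
+        def consolidate_chain(top_overlay: str, out: str) -> None:
+            # 'qemu-img convert' follows the whole backing chain of
+            # top_overlay and writes a fully populated standalone image.
+            subprocess.run(
+                ["qemu-img", "convert", "-O", "qcow2", top_overlay, out],
+                check=True)
+
+        consolidate_chain("c.qcow2", "Contents-of-A-B-C.qcow2")
+        # Then transfer the result to the destination host, e.g. with
+        # rsync or scp.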
+
+Sometimes, this is also referred to as "shallow copy" -- because only
+the "active layer", and not the rest of the image chain, is copied to
+the destination.
+
+.. note::
+    In this example, for the sake of simplicity, we'll be using the
+    same ``localhost`` as both source and destination.
+
+As noted earlier, on the destination host the contents of the backing
+chain -- from images [A] to [C] -- are already expected to exist in
+some form (e.g. in a file called ``Contents-of-A-B-C.qcow2``). Now, on
+the destination host, let's create a target overlay image (with the
+image ``Contents-of-A-B-C.qcow2`` as its backing file), to which the
+contents of image [D] (from the source QEMU) will be mirrored::
+
+    $ qemu-img create -f qcow2 -b ./Contents-of-A-B-C.qcow2 \
+        -F qcow2 ./target-disk.qcow2
+
+And start the destination QEMU (we already have the source QEMU running
+-- discussed in the section: `Interacting with a QEMU instance`_)
+instance, with the following invocation. (As noted earlier, for
+simplicity's sake, the destination QEMU is started on the same host,
+but it could be located elsewhere):
+
+.. parsed-literal::
+
+  $ |qemu_system| -display none -no-user-config -nodefaults \\
+    -m 512 -blockdev \\
+    node-name=node-TargetDisk,driver=qcow2,file.driver=file,file.node-name=file,file.filename=./target-disk.qcow2 \\
+    -device virtio-blk,drive=node-TargetDisk,id=virtio0 \\
+    -S -monitor stdio -qmp unix:./qmp-sock2,server=on,wait=off \\
+    -incoming tcp:localhost:6666
+
+Given the disk image chain on source QEMU::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+On the destination host, it is expected that the contents of the chain
+``[A] <-- [B] <-- [C]`` are *already* present, and therefore copy *only*
+the content of image [D].
+
+(1) [On *destination* QEMU] As part of the first step, start the
+    built-in NBD server on a given host (local host, represented by
+    ``::``) and port::
+
+        (QEMU) nbd-server-start addr={"type":"inet","data":{"host":"::","port":"49153"}}
+        {
+            "execute": "nbd-server-start",
+            "arguments": {
+                "addr": {
+                    "data": {
+                        "host": "::",
+                        "port": "49153"
+                    },
+                    "type": "inet"
+                }
+            }
+        }
+
+(2) [On *destination* QEMU] And export the destination disk image using
+    QEMU's built-in NBD server::
+
+        (QEMU) nbd-server-add device=node-TargetDisk writable=true
+        {
+            "execute": "nbd-server-add",
+            "arguments": {
+                "device": "node-TargetDisk"
+            }
+        }
+
+(3) [On *source* QEMU] Then, invoke ``drive-mirror`` with
+    ``mode=existing`` (meaning: synchronize to a pre-created -- hence
+    'existing' -- file on the target host) and with the synchronization
+    mode set to 'top' (``"sync": "top"``)::
+
+        (QEMU) drive-mirror device=node-D target=nbd:localhost:49153:exportname=node-TargetDisk sync=top mode=existing job-id=job0
+        {
+            "execute": "drive-mirror",
+            "arguments": {
+                "device": "node-D",
+                "mode": "existing",
+                "job-id": "job0",
+                "target": "nbd:localhost:49153:exportname=node-TargetDisk",
+                "sync": "top"
+            }
+        }
+
+(4) [On *source* QEMU] Once ``drive-mirror`` has copied the entire
+    data, and the event ``BLOCK_JOB_READY`` is emitted, issue
+    ``block-job-cancel`` to gracefully end the synchronization, from
+    source QEMU::
+
+        (QEMU) block-job-cancel device=job0
+        {
+            "execute": "block-job-cancel",
+            "arguments": {
+                "device": "job0"
+            }
+        }
+
+(5) [On *destination* QEMU] Then, stop the NBD server::
+
+        (QEMU) nbd-server-stop
+        {
+            "execute": "nbd-server-stop",
+            "arguments": {}
+        }
+
+(6) [On *destination* QEMU] Finally, resume the guest vCPUs by issuing
+    the QMP command ``cont``::
+
+        (QEMU) cont
+        {
+            "execute": "cont",
+            "arguments": {}
+        }
+
+.. note::
+    Higher-level libraries (e.g. libvirt) automate the entire above
+    process (although note that libvirt does not allow same-host
+    migrations to localhost for other reasons).
+
+
+Notes on ``blockdev-mirror``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``blockdev-mirror`` command is equivalent in core functionality to
+``drive-mirror``, except that it operates at node-level in a BDS graph.
+
+Also: for ``blockdev-mirror``, the 'target' image needs to be
+explicitly created (using ``qemu-img``) and attached to live QEMU via
+``blockdev-add``, which assigns a name to the to-be-created target
+node.
+
+E.g. the sequence of actions to create a point-in-time backup of an
+entire disk image chain, to a target, using ``blockdev-mirror`` would
+be:
+
+(0) Create the QCOW2 overlays, to arrive at a backing chain of desired
+    depth
+
+(1) Create the target image (using ``qemu-img``), say, ``e.qcow2``
+
+(2) Attach the above created file (``e.qcow2``), at run-time, using
+    ``blockdev-add`` to QEMU
+
+(3) Perform ``blockdev-mirror`` (use ``"sync": "full"`` to copy the
+    entire chain to the target). And notice the event
+    ``BLOCK_JOB_READY``
+
+(4) Optionally, query for active block jobs, there should be a 'mirror'
+    job ready to be completed
+
+(5) Gracefully complete the 'mirror' block device job, and notice the
+    event ``BLOCK_JOB_COMPLETED``
+
+(6) Shut down the guest by issuing the QMP ``quit`` command so that
+    caches are flushed
+
+(7) Then, finally, compare the contents of the disk image chain, and
+    the target copy with ``qemu-img compare``. You should notice:
+    "Images are identical"
+
+
+QMP invocation for ``blockdev-mirror``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given the disk image chain::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+To copy the contents of the entire disk image chain, from [A] all the
+way to [D], to a new target, call it [E]. The following is the flow.
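+
+Before stepping through the individual ``qmp-shell`` commands, here is
+a hedged sketch of driving the same flow programmatically, speaking raw
+QMP over the UNIX socket from the earlier QEMU invocation. The socket
+path and node names are assumptions carried over from this document's
+examples; error handling and event tracking are deliberately minimal::
+
+    import json
+    import socket
+
+    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+    sock.connect("/tmp/qmp-sock")
+    qmp = sock.makefile("rw", buffering=1)
+
+    def execute(command, arguments=None):
+        # QEMU writes one JSON object per line; skip asynchronous
+        # events until the command's "return" (or "error") arrives.
+        qmp.write(json.dumps({"execute": command,
+                              "arguments": arguments or {}}) + "\n")
+        while True:
+            reply = json.loads(qmp.readline())
+            if "return" in reply or "error" in reply:
+                return reply
+
+    json.loads(qmp.readline())     # server greeting
+    execute("qmp_capabilities")    # leave capabilities negotiation mode
+    execute("blockdev-add", {"node-name": "node-E", "driver": "qcow2",
+                             "file": {"driver": "file",
+                                      "filename": "e.qcow2"}})
+    execute("blockdev-mirror", {"device": "node-D", "target": "node-E",
+                                "sync": "full", "job-id": "job0"})
+    # ...then wait for BLOCK_JOB_READY and issue block-job-complete,
+    # as shown step by step below.
+
+The same flow, using ``qmp-shell``, follows.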
+
+Create the overlay images, [B], [C], and [D]::
+
+    (QEMU) blockdev-snapshot-sync node-name=node-A snapshot-file=b.qcow2 snapshot-node-name=node-B format=qcow2
+    (QEMU) blockdev-snapshot-sync node-name=node-B snapshot-file=c.qcow2 snapshot-node-name=node-C format=qcow2
+    (QEMU) blockdev-snapshot-sync node-name=node-C snapshot-file=d.qcow2 snapshot-node-name=node-D format=qcow2
+
+Create the target image, [E]::
+
+    $ qemu-img create -f qcow2 e.qcow2 39M
+
+Add the above created target image to QEMU, via ``blockdev-add``::
+
+    (QEMU) blockdev-add driver=qcow2 node-name=node-E file={"driver":"file","filename":"e.qcow2"}
+    {
+        "execute": "blockdev-add",
+        "arguments": {
+            "node-name": "node-E",
+            "driver": "qcow2",
+            "file": {
+                "driver": "file",
+                "filename": "e.qcow2"
+            }
+        }
+    }
+
+Perform ``blockdev-mirror``, and notice the event ``BLOCK_JOB_READY``::
+
+    (QEMU) blockdev-mirror device=node-D target=node-E sync=full job-id=job0
+    {
+        "execute": "blockdev-mirror",
+        "arguments": {
+            "device": "node-D",
+            "job-id": "job0",
+            "target": "node-E",
+            "sync": "full"
+        }
+    }
+
+Query for active block jobs, there should be a 'mirror' job ready::
+
+    (QEMU) query-block-jobs
+    {
+        "execute": "query-block-jobs",
+        "arguments": {}
+    }
+    {
+        "return": [
+            {
+                "busy": false,
+                "type": "mirror",
+                "len": 21561344,
+                "paused": false,
+                "ready": true,
+                "io-status": "ok",
+                "offset": 21561344,
+                "device": "job0",
+                "speed": 0
+            }
+        ]
+    }
+
+Gracefully complete the block device job operation, and notice the
+event ``BLOCK_JOB_COMPLETED``::
+
+    (QEMU) block-job-complete device=job0
+    {
+        "execute": "block-job-complete",
+        "arguments": {
+            "device": "job0"
+        }
+    }
+    {
+        "return": {}
+    }
+
+Shut down the guest, by issuing the ``quit`` QMP command::
+
+    (QEMU) quit
+    {
+        "execute": "quit",
+        "arguments": {}
+    }
+
+
+Live disk backup --- ``blockdev-backup`` and the deprecated ``drive-backup``
+----------------------------------------------------------------------------
+
+The ``blockdev-backup`` (and the deprecated ``drive-backup``) command
+allows you to create a point-in-time snapshot.
+
+In this case, the point-in-time is when you *start* the
+``blockdev-backup`` (or deprecated ``drive-backup``) command.
+
+
+QMP invocation for ``drive-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Note that the ``drive-backup`` command is deprecated since QEMU 6.2 and
+will be removed in a future release.
+
+Yet again, starting afresh with our example disk image chain::
+
+    [A] <-- [B] <-- [C] <-- [D]
+
+To create a target image [E], with content populated from image [A] to
+[D], from the above chain, the following is the syntax. (If the target
+image does not exist, ``drive-backup`` will create it)::
+
+    (QEMU) drive-backup device=node-D sync=full target=e.qcow2 job-id=job0
+    {
+        "execute": "drive-backup",
+        "arguments": {
+            "device": "node-D",
+            "job-id": "job0",
+            "sync": "full",
+            "target": "e.qcow2"
+        }
+    }
+
+Once the above ``drive-backup`` has completed, a
+``BLOCK_JOB_COMPLETED`` event will be issued, indicating the live block
+device job operation has completed, and no further action is required.
+
+
+Moving from the deprecated ``drive-backup`` to newer ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``blockdev-backup`` differs from ``drive-backup`` in how you specify
+the backup target. With ``blockdev-backup`` you can't specify a
+filename as the target. Instead you use the ``node-name`` of an
+existing block node, which you may add with the ``blockdev-add`` or
+``blockdev-create`` commands.
+Correspondingly, ``blockdev-backup`` doesn't have the ``mode`` and
+``format`` arguments, which don't apply to an existing block node. See
+the following sections for details and examples.
+
+
+Notes on ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``blockdev-backup`` command operates at node-level in a Block
+Driver State (BDS) graph.
+
+E.g. the sequence of actions to create a point-in-time backup
+of an entire disk image chain, to a target, using ``blockdev-backup``
+would be:
+
+(0) Create the QCOW2 overlays, to arrive at a backing chain of desired
+    depth
+
+(1) Create the target image (using ``qemu-img``), say, ``e.qcow2``
+
+(2) Attach the above created file (``e.qcow2``), at run-time, using
+    ``blockdev-add`` to QEMU
+
+(3) Perform ``blockdev-backup`` (use ``"sync": "full"`` to copy the
+    entire chain to the target). And notice the event
+    ``BLOCK_JOB_COMPLETED``
+
+(4) Shut down the guest, by issuing the QMP ``quit`` command, so that
+    caches are flushed
+
+(5) Then, finally, compare the contents of the disk image chain, and
+    the target copy with ``qemu-img compare``. You should notice:
+    "Images are identical"
+
+The following section shows an example QMP invocation for
+``blockdev-backup``.
+
+QMP invocation for ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given a disk image chain of depth 1 where image [B] is the active
+overlay (live QEMU is writing to it)::
+
+    [A] <-- [B]
+
+The following is the procedure to copy the content from the entire
+chain to a target image (say, [E]), which has the full content from [A]
+and [B].
+
+Create the overlay [B]::
+
+    (QEMU) blockdev-snapshot-sync node-name=node-A snapshot-file=b.qcow2 snapshot-node-name=node-B format=qcow2
+    {
+        "execute": "blockdev-snapshot-sync",
+        "arguments": {
+            "node-name": "node-A",
+            "snapshot-file": "b.qcow2",
+            "format": "qcow2",
+            "snapshot-node-name": "node-B"
+        }
+    }
+
+
+Create a target image that will contain the copy::
+
+    $ qemu-img create -f qcow2 e.qcow2 39M
+
+Then add it to QEMU via ``blockdev-add``::
+
+    (QEMU) blockdev-add driver=qcow2 node-name=node-E file={"driver":"file","filename":"e.qcow2"}
+    {
+        "execute": "blockdev-add",
+        "arguments": {
+            "node-name": "node-E",
+            "driver": "qcow2",
+            "file": {
+                "driver": "file",
+                "filename": "e.qcow2"
+            }
+        }
+    }
+
+Then invoke ``blockdev-backup`` to copy the contents from the entire
+image chain, consisting of images [A] and [B], to the target image
+'e.qcow2'::
+
+    (QEMU) blockdev-backup device=node-B target=node-E sync=full job-id=job0
+    {
+        "execute": "blockdev-backup",
+        "arguments": {
+            "device": "node-B",
+            "job-id": "job0",
+            "target": "node-E",
+            "sync": "full"
+        }
+    }
+
+Once the above 'backup' operation has completed, the event
+``BLOCK_JOB_COMPLETED`` will be emitted, signalling successful
+completion.
+
+Next, query for any active block device jobs (there should be none)::
+
+    (QEMU) query-block-jobs
+    {
+        "execute": "query-block-jobs",
+        "arguments": {}
+    }
+
+Shut down the guest::
+
+    (QEMU) quit
+    {
+        "execute": "quit",
+        "arguments": {}
+    }
+    {
+        "return": {}
+    }
+
+.. note::
+    The above step is really important; if forgotten, an error, "Failed
+    to get shared "write" lock on e.qcow2", will be thrown when you do
+    ``qemu-img compare`` to verify the integrity of the disk image
+    with the backup content.
+
+
+The end result will be the image 'e.qcow2' containing a
+point-in-time backup of the disk image chain -- i.e. contents from
+images [A] and [B] at the time the ``blockdev-backup`` command was
+initiated.
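+
+For unattended runs it can help to wait for that completion event
+programmatically rather than watching ``qmp-shell`` output. A hedged
+sketch over the raw QMP socket (the socket path and node names are
+assumptions from this document's examples; a robust client would also
+watch for ``BLOCK_JOB_ERROR``)::
+
+    import json
+    import socket
+
+    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+    sock.connect("/tmp/qmp-sock")
+    qmp = sock.makefile("rw", buffering=1)
+
+    def send(command, arguments=None):
+        qmp.write(json.dumps({"execute": command,
+                              "arguments": arguments or {}}) + "\n")
+
+    json.loads(qmp.readline())   # server greeting
+    send("qmp_capabilities")
+    send("blockdev-backup", {"device": "node-B", "target": "node-E",
+                             "sync": "full", "job-id": "job0"})
+
+    # Read newline-delimited JSON objects; command returns and
+    # unrelated events are skipped until the backup job completes.
+    while True:
+        msg = json.loads(qmp.readline())
+        if msg.get("event") == "BLOCK_JOB_COMPLETED":
+            break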
+
+One way to confirm that the backup disk image contains the same content
+as the disk image chain is to compare the backup with the contents of
+the chain; you should see "Images are identical". (NB: this assumes
+QEMU was launched with the ``-S`` option, which will not start the CPUs
+at guest boot up)::
+
+    $ qemu-img compare b.qcow2 e.qcow2
+    Warning: Image size mismatch!
+    Images are identical.
+
+NOTE: The "Warning: Image size mismatch!" is expected, as we created
+the target image (e.qcow2) with 39M size.
diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
new file mode 100644
index 000000000..bdb0f2a41
--- /dev/null
+++ b/docs/interop/nbd.txt
@@ -0,0 +1,70 @@
+QEMU supports the NBD protocol, and has an internal NBD client (see
+block/nbd.c), an internal NBD server (see blockdev-nbd.c), and an
+external NBD server tool (see qemu-nbd.c). The common code is placed
+in nbd/*.
+
+The NBD protocol is specified here:
+https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md
+
+The following paragraphs describe some specific properties of the NBD
+protocol implementation in QEMU.
+
+= Metadata namespaces =
+
+QEMU supports the "base:allocation" metadata context as defined in the
+NBD protocol specification, and also defines an additional metadata
+namespace "qemu".
+
+== "qemu" namespace ==
+
+The "qemu" namespace currently contains two available metadata context
+types. The first is related to exposing the contents of a dirty
+bitmap alongside the associated disk contents. That metadata context
+is named with the following form:
+
+    qemu:dirty-bitmap:<dirty-bitmap-export-name>
+
+Each dirty-bitmap metadata context defines only one flag for extents
+in reply for NBD_CMD_BLOCK_STATUS:
+
+    bit 0: NBD_STATE_DIRTY, set when the extent is "dirty"
+
+The second is related to exposing the source of various extents within
+the image, with a single metadata context named:
+
+    qemu:allocation-depth
+
+In the allocation depth context, the entire 32-bit value represents a
+depth of which layer in a thin-provisioned backing chain provided the
+data (0 for unallocated, 1 for the active layer, 2 for the first
+backing layer, and so forth).
+
+For NBD_OPT_LIST_META_CONTEXT the following queries are supported
+in addition to the specific "qemu:allocation-depth" and
+"qemu:dirty-bitmap:<dirty-bitmap-export-name>":
+
+* "qemu:" - returns list of all available metadata contexts in the
+  namespace.
+* "qemu:dirty-bitmap:" - returns list of all available dirty-bitmap
+  metadata contexts.
+
+= Features by version =
+
+The following list documents which qemu version first implemented
+various features (both as a server exposing the feature, and as a
+client taking advantage of the feature when present), to make it
+easier to plan for cross-version interoperability. Note that in
+several cases, the initial release containing a feature may require
+additional patches from the corresponding stable branch to fix bugs in
+the operation of that feature.
+
+* 2.6: NBD_OPT_STARTTLS with TLS X.509 Certificates
+* 2.8: NBD_CMD_WRITE_ZEROES
+* 2.10: NBD_OPT_GO, NBD_INFO_BLOCK
+* 2.11: NBD_OPT_STRUCTURED_REPLY
+* 2.12: NBD_CMD_BLOCK_STATUS for "base:allocation"
+* 3.0: NBD_OPT_STARTTLS with TLS Pre-Shared Keys (PSK),
+  NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
+* 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports,
+  NBD_CMD_FLAG_FAST_ZERO
+* 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
diff --git a/docs/interop/parallels.txt b/docs/interop/parallels.txt
new file mode 100644
index 000000000..bb3fadf36
--- /dev/null
+++ b/docs/interop/parallels.txt
@@ -0,0 +1,232 @@
+= License =
+
+Copyright (c) 2015 Denis Lunev
+Copyright (c) 2015 Vladimir Sementsov-Ogievskiy
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+= Parallels Expandable Image File Format =
+
+A Parallels expandable image file consists of three consecutive parts:
+  * header
+  * BAT
+  * data area
+
+All numbers in a Parallels expandable image are stored in little-endian
+byte order.
+
+
+== Definitions ==
+
+  Sector   A 512-byte data chunk.
+
+  Cluster  A data chunk of the size specified in the image header.
+           Currently, the default size is 1MiB (2048 sectors). In
+           previous versions, cluster sizes of 63 sectors, 256 and 252
+           kilobytes were used.
+
+  BAT      Block Allocation Table, an entity that contains information
+           for guest-to-host I/O data address translation.
+
+
+== Header ==
+
+The header is placed at the start of an image and contains the following
+fields:
+
+Bytes:
+   0 - 15:  magic
+            Must contain "WithoutFreeSpace" or "WithouFreSpacExt".
+
+  16 - 19:  version
+            Must be 2.
+
+  20 - 23:  heads
+            Disk geometry parameter for guest.
+
+  24 - 27:  cylinders
+            Disk geometry parameter for guest.
+
+  28 - 31:  tracks
+            Cluster size, in sectors.
+
+  32 - 35:  nb_bat_entries
+            Disk size, in clusters (BAT size).
+
+  36 - 43:  nb_sectors
+            Disk size, in sectors.
+
+            For "WithoutFreeSpace" images:
+            Only the lowest 4 bytes are used. The highest 4 bytes must
+            be cleared in this case.
+
+            For "WithouFreSpacExt" images, there are no such
+            restrictions.
+
+  44 - 47:  in_use
+            Set to 0x746F6E59 when the image is opened by software in
+            R/W mode; set to 0x312e3276 when the image is closed.
+
+            A zero in this field means that the image was opened by an
+            old version of the software that doesn't support Format
+            Extension (see below).
+
+            Other values are not allowed.
+
+  48 - 51:  data_off
+            An offset, in sectors, from the start of the file to the
+            start of the data area.
+
+            For "WithoutFreeSpace" images:
+            - If data_off is zero, the offset is calculated as the end
+              of the BAT table plus some padding to ensure sector size
+              alignment.
+            - If data_off is non-zero, the offset should be aligned to
+              sector size. However it is recommended to align it to
+              cluster size for newly created images.
+
+            For "WithouFreSpacExt" images:
+            data_off must be non-zero and aligned to cluster size.
+
+  52 - 55:  flags
+            Miscellaneous flags.
+
+            Bit 0: Empty Image bit. If set, the image should be
+                   considered clear.
+
+            Bits 1-31: Unused.
+
+  56 - 63:  ext_off
+            Format Extension offset, an offset, in sectors, from the
+            start of the file to the start of the Format Extension
+            Cluster.
+
+            ext_off must meet the same requirements as cluster offsets
+            defined by BAT entries (see below).
+
+
+== BAT ==
+
+BAT is placed immediately after the image header. In the file, BAT is
+a contiguous array of 32-bit unsigned little-endian integers,
+(bat_entries * 4) bytes in size.
+
+Each BAT entry contains an offset from the start of the file to the
+corresponding cluster. The offset is set in clusters for
+"WithouFreSpacExt" images and in sectors for "WithoutFreeSpace" images.
+
+If a BAT entry is zero, the corresponding cluster is not allocated and
+should be considered as filled with zeroes.
+
+Cluster offsets specified by BAT entries must meet the following
+requirements:
+  - the value must not be lower than data offset (provided by
+    header.data_off or calculated as specified above),
+  - the value must be lower than the desired file size,
+  - the value must be unique among all BAT entries,
+  - the result of (cluster offset - data offset) must be aligned to
+    cluster size.
+
+
+== Data Area ==
+
+The data area is an area from the data offset (provided by
+header.data_off or calculated as specified above) to the end of the
+file. It represents a contiguous array of clusters. Most of them are
+allocated by the BAT, some may be allocated by the ext_off field in the
+header while others may be allocated by extensions. All clusters
+allocated by ext_off and extensions should meet the same requirements
+as clusters specified by BAT entries.
+
+
+== Format Extension ==
+
+The Format Extension is an area 1 cluster in size that provides
+additional format features. This cluster is addressed by the ext_off
+field in the header. The format of the Format Extension area is the
+following:
+
+   0 -  7:  magic
+            Must be 0xAB234CEF23DCEA87
+
+   8 - 23:  m_CheckSum
+            The MD5 checksum of the entire Header Extension cluster
+            except the first 24 bytes.
+
+  The above are followed by feature sections or "extensions". The last
+  extension must be "End of features" (see below).
+
+Each feature section has the following format:
+
+   0 -  7:  magic
+            The identifier of the feature:
+            0x0000000000000000 - End of features
+            0x20385FAE252CB34A - Dirty bitmap
+
+   8 - 15:  flags
+            External flags for extension:
+
+            Bit 0: NECESSARY
+                   If the software cannot load the extension (due to an
+                   unknown magic number or error), the file should not
+                   be changed. If this flag is unset and there is an
+                   error on loading the extension, said extension
+                   should be dropped.
+
+            Bit 1: TRANSIT
+                   If there is an unknown extension with this flag set,
+                   said extension should be left as is.
+
+            If neither NECESSARY nor TRANSIT are set, the extension
+            should be dropped.
+
+  16 - 19:  data_size
+            The size of the following feature data, in bytes.
+
+  20 - 23:  unused32
+            Align header to 8 bytes boundary.
+
+  variable: data (data_size bytes)
+
+  The above is followed by padding to the next 8 bytes boundary, then
+  the next extension starts.
+
+  The last extension must be "End of features" with all the fields set
+  to 0.
+
+
+=== Dirty bitmaps feature ===
+
+This feature provides a way of storing dirty bitmaps in the image. The
+fields of its data area are:
+
+   0 -  7:  size
+            The bitmap size, should be equal to disk size in sectors.
+
+   8 - 23:  id
+            An identifier for backup consistency checking.
+
+  24 - 27:  granularity
+            Bitmap granularity, in sectors. I.e., the number of sectors
+            corresponding to one bit of the bitmap. Granularity must be
+            a power of 2.
+
+  28 - 31:  l1_size
+            The number of entries in the L1 table of the bitmap.
+
+  variable: L1 offset table (l1_table), size: 8 * l1_size bytes
+
+The dirty bitmap described by this feature extension is stored in a set
+of clusters inside the Parallels image file.
+The offsets of these clusters are saved in the L1 offset table
+specified by the feature extension. Each L1 table entry is a 64-bit
+integer as described below:
+
+Given an offset in bytes into the bitmap data, the corresponding L1
+entry is
+
+    l1_table[offset / cluster_size]
+
+If an L1 table entry is 0, all bits in the corresponding cluster of the
+bitmap are assumed to be 0.
+
+If an L1 table entry is 1, all bits in the corresponding cluster of the
+bitmap are assumed to be 1.
+
+If an L1 table entry is not 0 or 1, it contains the corresponding
+cluster offset (in 512b sectors). Given an offset in bytes into the
+bitmap data the offset in bytes into the image file can be obtained as
+follows:
+
+    offset = l1_table[offset / cluster_size] * 512 + (offset % cluster_size)
diff --git a/docs/interop/pr-helper.rst b/docs/interop/pr-helper.rst
new file mode 100644
index 000000000..e926f0a6c
--- /dev/null
+++ b/docs/interop/pr-helper.rst
@@ -0,0 +1,83 @@
+..
+
+======================================
+Persistent reservation helper protocol
+======================================
+
+QEMU's SCSI passthrough devices, ``scsi-block`` and ``scsi-generic``,
+can delegate implementation of persistent reservations to an external
+(and typically privileged) program. Persistent Reservations allow
+restricting access to block devices to specific initiators in a shared
+storage setup.
+
+For a more detailed reference please refer to the SCSI Primary
+Commands standard, specifically the section on Reservations and the
+"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands.
+
+This document describes the socket protocol used between QEMU's
+``pr-manager-helper`` object and the external program.
+
+.. contents::
+
+Connection and initialization
+-----------------------------
+
+All data transmitted on the socket is big-endian.
+
+After QEMU connects to the helper program's socket, the helper starts a
+simple feature negotiation process by writing four bytes corresponding
+to the features it exposes (``supported_features``). QEMU reads them,
+then writes four bytes corresponding to the desired features of the
+helper program (``requested_features``).
+
+If a bit is 1 in ``requested_features`` and 0 in ``supported_features``,
+the corresponding feature is not supported by the helper and the connection
+is closed. On the other hand, it is acceptable for a bit to be 0 in
+``requested_features`` and 1 in ``supported_features``; in this case,
+the helper will not enable the feature.
+
+Right now no feature is defined, so the two parties always write four
+zero bytes.
+
+Command format
+--------------
+
+It is invalid to send multiple commands concurrently on the same
+socket. It is however possible to connect multiple sockets to the
+helper and send multiple commands to the helper for one or more
+file descriptors.
+
+A command consists of a request and a response. A request consists
+of a 16-byte SCSI CDB. A file descriptor must be passed to the helper
+together with the SCSI CDB using ancillary data.
+
+The CDB has the following limitations:
+
+- the command (stored in the first byte) must be one of 0x5E
+  (PERSISTENT RESERVE IN) or 0x5F (PERSISTENT RESERVE OUT).
+
+- the allocation length (stored in bytes 7-8 of the CDB for PERSISTENT
+  RESERVE IN) or parameter list length (stored in bytes 5-8 of the CDB
+  for PERSISTENT RESERVE OUT) is limited to 8 KiB.
+
+For PERSISTENT RESERVE OUT, the parameter list is sent right after the
+CDB. The length of the parameter list is taken from the CDB itself.
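+
+To make the framing concrete, here is a hedged sketch of a client
+issuing a PERSISTENT RESERVE IN (READ KEYS) request to a helper. The
+socket path and device path are assumptions for illustration; in
+practice QEMU itself plays this client role::
+
+    import os
+    import socket
+    import struct
+
+    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+    sock.connect("/run/qemu-pr-helper.sock")      # assumed socket path
+
+    # Feature negotiation: read the helper's supported_features, then
+    # answer with requested_features (no features are defined, so 0).
+    (supported,) = struct.unpack(">I", sock.recv(4, socket.MSG_WAITALL))
+    sock.sendall(struct.pack(">I", 0))
+
+    # Build the fixed 16-byte CDB: PERSISTENT RESERVE IN (0x5E), READ
+    # KEYS service action, allocation length in bytes 7-8 (big-endian).
+    cdb = bytearray(16)
+    cdb[0] = 0x5E
+    cdb[7:9] = struct.pack(">H", 8192)
+
+    # The CDB travels together with the block device's file descriptor,
+    # attached as SCM_RIGHTS ancillary data.
+    fd = os.open("/dev/sdb", os.O_RDWR)           # assumed device
+    sock.sendmsg([bytes(cdb)],
+                 [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
+                   struct.pack("i", fd))])
+
+    # Reply: 4-byte SCSI status, 4-byte payload size, 96 bytes of sense
+    # data, then the payload (see the structure described below).
+    status, size = struct.unpack(">II", sock.recv(8, socket.MSG_WAITALL))
+    sense = sock.recv(96, socket.MSG_WAITALL)
+    payload = sock.recv(size, socket.MSG_WAITALL) if size else b""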
+
+The helper's reply has the following structure:
+
+- 4 bytes for the SCSI status
+
+- 4 bytes for the payload size (nonzero only for PERSISTENT RESERVE IN
+  and only if the SCSI status is 0x00, i.e. GOOD)
+
+- 96 bytes for the SCSI sense data
+
+- if the size is nonzero, the payload follows
+
+The sense data is always sent to keep the protocol simple, even though
+it is only valid if the SCSI status is CHECK CONDITION (0x02).
+
+The payload size is always less than or equal to the allocation length
+specified in the CDB for the PERSISTENT RESERVE IN command.
+
+If the protocol is violated, the helper closes the socket.
diff --git a/docs/interop/prl-xml.txt b/docs/interop/prl-xml.txt
new file mode 100644
index 000000000..7031f8752
--- /dev/null
+++ b/docs/interop/prl-xml.txt
@@ -0,0 +1,158 @@
+= License =
+
+Copyright (c) 2015-2017, Virtuozzo, Inc.
+Authors:
+    2015 Denis Lunev <den@openvz.org>
+    2015 Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
+    2016-2017 Klim Kireev <klim.kireev@virtuozzo.com>
+    2016-2017 Edgar Kaziakhmedov <edgar.kaziakhmedov@virtuozzo.com>
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+This specification contains minimal information about the Parallels Disk
+Format, enough for QEMU to work with it properly. Nevertheless, Parallels
+Cloud Server and Parallels Desktop are able to add some unspecified nodes
+to the XML and use them, but these are for internal use and don't affect
+functionality. They also use an auxiliary XML file, "Snapshot.xml", which
+stores optional snapshot information, but it doesn't influence
+open/read/write functionality. QEMU and other software should not use
+fields not covered in this document or the Snapshot.xml file, and must
+leave them as is.
+
+= Parallels Disk Format =
+
+A Parallels disk consists of two parts: the set of snapshots and the disk
+descriptor file, which stores information about all files and snapshots.
+
+== Definitions ==
+  Snapshot        a record of the contents captured at a particular time,
+                  capable of storing current state. A snapshot has UUID
+                  and parent UUID.
+
+  Snapshot image  an overlay representing the difference between this
+                  snapshot and some earlier snapshot.
+
+  Overlay         an image storing the different sectors between two
+                  captured states.
+
+  Root image      snapshot image with no parent, the root of snapshot
+                  tree.
+
+  Storage         the backing storage for a subset of the virtual disk.
+                  When there is more than one storage in a Parallels disk
+                  then that is referred to as a split image. In this case
+                  every storage covers a specific address space area of
+                  the disk and has its particular root image. Split
+                  images are not considered here and are not supported.
+                  Each storage consists of disk parameters and a list of
+                  images. The list of images always contains a root image
+                  and may also contain overlays. The root image can be an
+                  expandable Parallels image file or plain. Overlays must
+                  be expandable.
+
+  Description     DiskDescriptor.xml stores information about disk
+  file            parameters, snapshots, storages.
+
+  Top             The overlay between actual state and some previous
+  Snapshot        snapshot. It is not a snapshot in the classical sense
+                  because it serves as the active image that the guest
+                  writes to.
+
+  Sector          a 512-byte data chunk.
+
+== Description file ==
+All information is placed in a single XML element Parallels_disk_image.
+The element has only one attribute, "Version", which must be 1.0.
+Schema of DiskDescriptor.xml:
+
+<Parallels_disk_image Version="1.0">
+    <Disk_Parameters>
+        ...
+    </Disk_Parameters>
+    <StorageData>
+        ...
+    </StorageData>
+    <Snapshots>
+        ...
+    </Snapshots>
+</Parallels_disk_image>
+
+== Disk_Parameters element ==
+The Disk_Parameters element describes the physical layout of the virtual
+disk and some general settings.
+
+The Disk_Parameters element MUST contain the following child elements:
+  * Disk_size - number of sectors in the disk,
+    desired size of the disk.
+  * Cylinders - number of the disk cylinders.
+  * Heads - number of the disk heads.
+  * Sectors - number of the disk sectors per cylinder
+    (sector size is 512 bytes)
+    Limitation: Product of the Heads, Sectors and Cylinders
+    values MUST be equal to the value of the Disk_size parameter.
+  * Padding - must be 0. Parallels Cloud Server and Parallels Desktop may
+    use padding set to 1, however this case is not covered
+    by this spec, QEMU and other software should not open
+    such disks and should not create them.
+
+== StorageData element ==
+This element of the file describes the root image and all snapshot
+images.
+
+The StorageData element consists of the Storage child element, as shown
+below:
+<StorageData>
+    <Storage>
+        ...
+    </Storage>
+</StorageData>
+
+A Storage element has the following child elements:
+  * Start - start sector of the storage; in the case of a non-split
+    storage this equals 0.
+  * End - number of the sector following the last sector; in the case of
+    a non-split storage this equals Disk_size.
+  * Blocksize - storage cluster size, number of sectors per one cluster.
+    The cluster size for each "Compressed" (see below) image in a
+    parallels disk must be equal to this field. Note: the cluster
+    size for a Parallels Expandable Image is in the 'tracks' field of
+    its header (see docs/interop/parallels.txt).
+  * Several Image child elements.
+
+Each Image element has the following child elements:
+  * GUID - image identifier, UUID in curly brackets.
+    For instance, {12345678-9abc-def1-2345-6789abcdef12}.
+    The GUID is used by the Snapshots element to reference images
+    (see below)
+  * Type - image type of the element. It can be:
+    "Plain" for raw files.
+    "Compressed" for expanding disks.
+  * File - path to the image file. The path can be relative to
+    DiskDescriptor.xml or absolute.
+
+== Snapshots element ==
+The Snapshots element describes the snapshot relations within the
+snapshot tree.
+
+The element contains the set of Shot child elements, as shown below:
+<Snapshots>
+    <TopGUID> ... </TopGUID> /* Optional child element */
+    <Shot>
+        ...
+    </Shot>
+    <Shot>
+        ...
+    </Shot>
+    ...
+</Snapshots>
+
+Each Shot element contains the following child elements:
+  * GUID - an image GUID.
+  * ParentGUID - GUID of the image of the parent snapshot.
+
+The software may traverse snapshots from child to parent using the
+<ParentGUID> field as a reference. The ParentGUID of the root snapshot
+is {00000000-0000-0000-0000-000000000000}. There should be only one
+root snapshot. The top snapshot can be identified in one of two ways:
+via the TopGUID child element of the Snapshots element, or via the
+predefined GUID {5fbaabe3-6958-40ff-92a7-860e329aab41}. If TopGUID is
+defined, the predefined GUID is interpreted as a usual GUID. All
+snapshot images (except the Top Snapshot) should be opened read-only.
+There is another predefined GUID,
+BackupID = {704718e1-2314-44c8-9087-d78ed36b0f4e}, which is used by the
+original and some third-party software for backup. QEMU and other
+software may operate with images with GUID = BackupID as usual;
+however, it is not recommended to use this GUID for new disks. The Top
+snapshot cannot have this GUID.
diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
new file mode 100644
index 000000000..f7dc304ff
--- /dev/null
+++ b/docs/interop/qcow2.txt
@@ -0,0 +1,901 @@
+== General ==
+
+A qcow2 image file is organized in units of constant size, which are
+called (host) clusters. A cluster is the unit in which all allocations
+are done, both for actual guest data and for image metadata.
+
+Likewise, the virtual disk as seen by the guest is divided into (guest)
+clusters of the same size.
+
+All numbers in qcow2 are stored in Big Endian byte order.
+
+
+== Header ==
+
+The first cluster of a qcow2 image contains the file header:
+
+    Byte  0 -  3:   magic
+                    QCOW magic string ("QFI\xfb")
+
+          4 -  7:   version
+                    Version number (valid values are 2 and 3)
+
+          8 - 15:   backing_file_offset
+                    Offset into the image file at which the backing file
+                    name is stored (NB: The string is not null
+                    terminated). 0 if the image doesn't have a backing
+                    file.
+
+                    Note: backing files are incompatible with raw
+                    external data files (auto-clear feature bit 1).
+
+         16 - 19:   backing_file_size
+                    Length of the backing file name in bytes. Must not
+                    be longer than 1023 bytes. Undefined if the image
+                    doesn't have a backing file.
+
+         20 - 23:   cluster_bits
+                    Number of bits that are used for addressing an
+                    offset within a cluster (1 << cluster_bits is the
+                    cluster size). Must not be less than 9 (i.e. 512
+                    byte clusters).
+
+                    Note: qemu as of today has an implementation limit
+                    of 2 MB as the maximum cluster size and won't be
+                    able to open images with larger cluster sizes.
+
+                    Note: if the image has Extended L2 Entries then
+                    cluster_bits must be at least 14 (i.e. 16384 byte
+                    clusters).
+
+         24 - 31:   size
+                    Virtual disk size in bytes.
+
+                    Note: qemu has an implementation limit of 32 MB as
+                    the maximum L1 table size. With a 2 MB cluster
+                    size, it is unable to populate a virtual cluster
+                    beyond 2 EB (61 bits); with a 512 byte cluster
+                    size, it is unable to populate a virtual size
+                    larger than 128 GB (37 bits). Meanwhile, L1/L2
+                    table layouts limit an image to no more than 64 PB
+                    (56 bits) of populated clusters, and an image may
+                    hit other limits first (such as a file system's
+                    maximum size).
+
+         32 - 35:   crypt_method
+                    0 for no encryption
+                    1 for AES encryption
+                    2 for LUKS encryption
+
+         36 - 39:   l1_size
+                    Number of entries in the active L1 table
+
+         40 - 47:   l1_table_offset
+                    Offset into the image file at which the active L1
+                    table starts. Must be aligned to a cluster boundary.
+
+         48 - 55:   refcount_table_offset
+                    Offset into the image file at which the refcount
+                    table starts. Must be aligned to a cluster boundary.
+
+         56 - 59:   refcount_table_clusters
+                    Number of clusters that the refcount table occupies
+
+         60 - 63:   nb_snapshots
+                    Number of snapshots contained in the image
+
+         64 - 71:   snapshots_offset
+                    Offset into the image file at which the snapshot
+                    table starts. Must be aligned to a cluster boundary.
+
+For version 2, the header is exactly 72 bytes in length, and finishes
+here. For version 3 or higher, the header length is at least 104 bytes,
+including the next fields through header_length.
+
+         72 - 79:   incompatible_features
+                    Bitmask of incompatible features.
An implementation must + fail to open an image if an unknown bit is set. + + Bit 0: Dirty bit. If this bit is set then refcounts + may be inconsistent, make sure to scan L1/L2 + tables to repair refcounts before accessing the + image. + + Bit 1: Corrupt bit. If this bit is set then any data + structure may be corrupt and the image must not + be written to (unless for regaining + consistency). + + Bit 2: External data file bit. If this bit is set, an + external data file is used. Guest clusters are + then stored in the external data file. For such + images, clusters in the external data file are + not refcounted. The offset field in the + Standard Cluster Descriptor must match the + guest offset and neither compressed clusters + nor internal snapshots are supported. + + An External Data File Name header extension may + be present if this bit is set. + + Bit 3: Compression type bit. If this bit is set, + a non-default compression is used for compressed + clusters. The compression_type field must be + present and not zero. + + Bit 4: Extended L2 Entries. If this bit is set then + L2 table entries use an extended format that + allows subcluster-based allocation. See the + Extended L2 Entries section for more details. + + Bits 5-63: Reserved (set to 0) + + 80 - 87: compatible_features + Bitmask of compatible features. An implementation can + safely ignore any unknown bits that are set. + + Bit 0: Lazy refcounts bit. If this bit is set then + lazy refcount updates can be used. This means + marking the image file dirty and postponing + refcount metadata updates. + + Bits 1-63: Reserved (set to 0) + + 88 - 95: autoclear_features + Bitmask of auto-clear features. An implementation may only + write to an image with unknown auto-clear features if it + clears the respective bits from this field first. + + Bit 0: Bitmaps extension bit + This bit indicates consistency for the bitmaps + extension data. + + It is an error if this bit is set without the + bitmaps extension present. + + If the bitmaps extension is present but this + bit is unset, the bitmaps extension data must be + considered inconsistent. + + Bit 1: Raw external data bit + If this bit is set, the external data file can + be read as a consistent standalone raw image + without looking at the qcow2 metadata. + + Setting this bit has a performance impact for + some operations on the image (e.g. writing + zeros requires writing to the data file instead + of only setting the zero flag in the L2 table + entry) and conflicts with backing files. + + This bit may only be set if the External Data + File bit (incompatible feature bit 1) is also + set. + + Bits 2-63: Reserved (set to 0) + + 96 - 99: refcount_order + Describes the width of a reference count block entry (width + in bits: refcount_bits = 1 << refcount_order). For version 2 + images, the order is always assumed to be 4 + (i.e. refcount_bits = 16). + This value may not exceed 6 (i.e. refcount_bits = 64). + + 100 - 103: header_length + Length of the header structure in bytes. For version 2 + images, the length is always assumed to be 72 bytes. + For version 3 it's at least 104 bytes and must be a multiple + of 8. + + +=== Additional fields (version 3 and higher) === + +In general, these fields are optional and may be safely ignored by the software, +as well as filled by zeros (which is equal to field absence), if software needs +to set field B, but does not care about field A which precedes B. More +formally, additional fields have the following compatibility rules: + +1. 
If the value of the additional field must not be ignored for correct
+handling of the file, it will be accompanied by a corresponding incompatible
+feature bit.
+
+2. If there are no unrecognized incompatible feature bits set, an unknown
+additional field may be safely ignored other than preserving its value when
+rewriting the image header.
+
+3. An explicit value of 0 will have the same behavior as when the field is not
+present*, if not altered by a specific incompatible bit.
+
+*. A field is considered not present when header_length is less than or equal
+to the field's offset. Also, all additional fields are not present for
+version 2.
+
+         104:  compression_type
+
+               Defines the compression method used for compressed clusters.
+               All compressed clusters in an image use the same compression
+               type.
+
+               If the incompatible bit "Compression type" is set: the field
+               must be present and non-zero (which means non-zlib
+               compression type). Otherwise, this field must not be present
+               or must be zero (which means zlib).
+
+               Available compression type values:
+                   0: zlib <https://www.zlib.net/>
+                   1: zstd <http://github.com/facebook/zstd>
+
+
+=== Header padding ===
+
+@header_length must be a multiple of 8, which means that if the end of the last
+additional field is not aligned, some padding is needed. This padding must be
+zeroed, so that if some existing (or future) additional field falls into
+the padding, it will be interpreted according to point [3.] of the previous
+paragraph, i.e. in the same manner as when this field is not present.
+
+
+=== Header extensions ===
+
+Directly after the image header, optional sections called header extensions can
+be stored. Each extension has a structure like the following:
+
+    Byte  0 -  3:   Header extension type:
+                        0x00000000 - End of the header extension area
+                        0xe2792aca - Backing file format name string
+                        0x6803f857 - Feature name table
+                        0x23852875 - Bitmaps extension
+                        0x0537be77 - Full disk encryption header pointer
+                        0x44415441 - External data file name string
+                        other      - Unknown header extension, can be safely
+                                     ignored
+
+          4 -  7:   Length of the header extension data
+
+          8 -  n:   Header extension data
+
+          n -  m:   Padding to round up the header extension size to the next
+                    multiple of 8.
+
+Unless stated otherwise, each header extension type shall appear at most once
+in the same image.
+
+If the image has a backing file then the backing file name should be stored in
+the remaining space between the end of the header extension area and the end of
+the first cluster. It is not allowed to store other data here, so that an
+implementation can safely modify the header and add extensions without harming
+data of compatible features that it doesn't support. Compatible features that
+need space for additional data can use a header extension.
+
+
+== String header extensions ==
+
+Some header extensions (such as the backing file format name and the external
+data file name) are just a single string. In this case, the header extension
+length is the string length and the string is not '\0' terminated. (The header
+extension padding can make it look like a string is '\0' terminated, but
+neither is padding always necessary nor is there a guarantee that zero bytes
+are used for padding.)
+
+
+== Feature name table ==
+
+The feature name table is an optional header extension that contains the names
+of features used by the image. It can be used by applications that don't know
+the respective feature (e.g. 
because the feature was introduced only later) to
+display a useful error message.
+
+The number of entries in the feature name table is determined by the length of
+the header extension data. Each entry looks like this:
+
+    Byte       0:   Type of feature (select feature bitmap)
+                        0: Incompatible feature
+                        1: Compatible feature
+                        2: Autoclear feature
+
+               1:   Bit number within the selected feature bitmap (valid
+                    values: 0-63)
+
+          2 - 47:   Feature name (padded with zeros, but not necessarily null
+                    terminated if it has full length)
+
+
+== Bitmaps extension ==
+
+The bitmaps extension is an optional header extension. It provides the ability
+to store bitmaps related to a virtual disk. For now, there is only one bitmap
+type: the dirty tracking bitmap, which tracks virtual disk changes from some
+point in time.
+
+The data of the extension should be considered consistent only if the
+corresponding auto-clear feature bit is set, see autoclear_features above.
+
+The fields of the bitmaps extension are:
+
+    Byte  0 -  3:   nb_bitmaps
+                    The number of bitmaps contained in the image. Must be
+                    greater than or equal to 1.
+
+                    Note: QEMU currently only supports up to 65535 bitmaps per
+                    image.
+
+          4 -  7:   Reserved, must be zero.
+
+          8 - 15:   bitmap_directory_size
+                    Size of the bitmap directory in bytes. It is the cumulative
+                    size of all (nb_bitmaps) bitmap directory entries.
+
+         16 - 23:   bitmap_directory_offset
+                    Offset into the image file at which the bitmap directory
+                    starts. Must be aligned to a cluster boundary.
+
+== Full disk encryption header pointer ==
+
+The full disk encryption header must be present if, and only if, the
+'crypt_method' header requires metadata. Currently this is only true
+of the 'LUKS' crypt method. The header extension must be absent for
+other methods.
+
+This header provides the offset at which the crypt method can store
+its additional data, as well as the length of such data.
+
+    Byte  0 -  7:   Offset into the image file at which the encryption
+                    header starts in bytes. Must be aligned to a cluster
+                    boundary.
+    Byte  8 - 15:   Length of the written encryption header in bytes.
+                    Note: the actual space allocated in the qcow2 file may
+                    be larger than this value, since it will be rounded
+                    to the nearest multiple of the cluster size. Any
+                    unused bytes in the allocated space will be initialized
+                    to 0.
+
+For the LUKS crypt method, the encryption header works as follows.
+
+The first 592 bytes of the header clusters will contain the LUKS
+partition header. This is then followed by the key material data areas.
+The size of the key material data areas is determined by the number of
+stripes in the key slot and key size. Refer to the LUKS format
+specification ('docs/on-disk-format.pdf' in the cryptsetup source
+package) for details of the LUKS partition header format.
+
+In the LUKS partition header, the "payload-offset" field will be
+calculated as normal for the LUKS spec, i.e. the size of the LUKS
+header, plus key material regions, plus padding, relative to the
+start of the LUKS header. This offset value is not required to be
+qcow2 cluster aligned. Its value is currently never used in the
+context of qcow2, since the qcow2 file format itself defines where
+the real payload offset is, but nonetheless a valid payload offset
+should always be present.
+
+In the LUKS key slots header, the "key-material-offset" is relative
+to the start of the LUKS header clusters in the qcow2 container,
+not the start of the qcow2 file.
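+
+For illustration, locating a key slot's material in the image file therefore
+combines both levels of offsets (a sketch, not part of the format; LUKS
+expresses the slot offset in 512-byte sectors, and the names used here are
+placeholders):
+
+    #include <stdint.h>
+
+    /* fde_header_offset:   bytes 0-7 of the full disk encryption header
+     *                      pointer extension (cluster aligned)
+     * key_material_offset: "key-material-offset" of the LUKS key slot, in
+     *                      512-byte sectors, relative to the start of the
+     *                      LUKS header clusters */
+    static uint64_t key_material_pos(uint64_t fde_header_offset,
+                                     uint64_t key_material_offset)
+    {
+        return fde_header_offset + key_material_offset * 512;
+    }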
+ +Logically the layout looks like + + +-----------------------------+ + | QCow2 header | + | QCow2 header extension X | + | QCow2 header extension FDE | + | QCow2 header extension ... | + | QCow2 header extension Z | + +-----------------------------+ + | ....other QCow2 tables.... | + . . + . . + +-----------------------------+ + | +-------------------------+ | + | | LUKS partition header | | + | +-------------------------+ | + | | LUKS key material 1 | | + | +-------------------------+ | + | | LUKS key material 2 | | + | +-------------------------+ | + | | LUKS key material ... | | + | +-------------------------+ | + | | LUKS key material 8 | | + | +-------------------------+ | + +-----------------------------+ + | QCow2 cluster payload | + . . + . . + . . + | | + +-----------------------------+ + +== Data encryption == + +When an encryption method is requested in the header, the image payload +data must be encrypted/decrypted on every write/read. The image headers +and metadata are never encrypted. + +The algorithms used for encryption vary depending on the method + + - AES: + + The AES cipher, in CBC mode, with 256 bit keys. + + Initialization vectors generated using plain64 method, with + the virtual disk sector as the input tweak. + + This format is no longer supported in QEMU system emulators, due + to a number of design flaws affecting its security. It is only + supported in the command line tools for the sake of back compatibility + and data liberation. + + - LUKS: + + The algorithms are specified in the LUKS header. + + Initialization vectors generated using the method specified + in the LUKS header, with the physical disk sector as the + input tweak. + +== Host cluster management == + +qcow2 manages the allocation of host clusters by maintaining a reference count +for each host cluster. A refcount of 0 means that the cluster is free, 1 means +that it is used, and >= 2 means that it is used and any write access must +perform a COW (copy on write) operation. + +The refcounts are managed in a two-level table. The first level is called +refcount table and has a variable size (which is stored in the header). The +refcount table can cover multiple clusters, however it needs to be contiguous +in the image file. + +It contains pointers to the second level structures which are called refcount +blocks and are exactly one cluster in size. + +Although a large enough refcount table can reserve clusters past 64 PB +(56 bits) (assuming the underlying protocol can even be sized that +large), note that some qcow2 metadata such as L1/L2 tables must point +to clusters prior to that point. + +Note: qemu has an implementation limit of 8 MB as the maximum refcount +table size. With a 2 MB cluster size and a default refcount_order of +4, it is unable to reference host resources beyond 2 EB (61 bits); in +the worst case, with a 512 cluster size and refcount_order of 6, it is +unable to access beyond 32 GB (35 bits). + +Given an offset into the image file, the refcount of its cluster can be +obtained as follows: + + refcount_block_entries = (cluster_size * 8 / refcount_bits) + + refcount_block_index = (offset / cluster_size) % refcount_block_entries + refcount_table_index = (offset / cluster_size) / refcount_block_entries + + refcount_block = load_cluster(refcount_table[refcount_table_index]); + return refcount_block[refcount_block_index]; + +Refcount table entry: + + Bit 0 - 8: Reserved (set to 0) + + 9 - 63: Bits 9-63 of the offset into the image file at which the + refcount block starts. 
Must be aligned to a cluster + boundary. + + If this is 0, the corresponding refcount block has not yet + been allocated. All refcounts managed by this refcount block + are 0. + +Refcount block entry (x = refcount_bits - 1): + + Bit 0 - x: Reference count of the cluster. If refcount_bits implies a + sub-byte width, note that bit 0 means the least significant + bit in this context. + + +== Cluster mapping == + +Just as for refcounts, qcow2 uses a two-level structure for the mapping of +guest clusters to host clusters. They are called L1 and L2 table. + +The L1 table has a variable size (stored in the header) and may use multiple +clusters, however it must be contiguous in the image file. L2 tables are +exactly one cluster in size. + +The L1 and L2 tables have implications on the maximum virtual file +size; for a given L1 table size, a larger cluster size is required for +the guest to have access to more space. Furthermore, a virtual +cluster must currently map to a host offset below 64 PB (56 bits) +(although this limit could be relaxed by putting reserved bits into +use). Additionally, as cluster size increases, the maximum host +offset for a compressed cluster is reduced (a 2M cluster size requires +compressed clusters to reside below 512 TB (49 bits), and this limit +cannot be relaxed without an incompatible layout change). + +Given an offset into the virtual disk, the offset into the image file can be +obtained as follows: + + l2_entries = (cluster_size / sizeof(uint64_t)) [*] + + l2_index = (offset / cluster_size) % l2_entries + l1_index = (offset / cluster_size) / l2_entries + + l2_table = load_cluster(l1_table[l1_index]); + cluster_offset = l2_table[l2_index]; + + return cluster_offset + (offset % cluster_size) + + [*] this changes if Extended L2 Entries are enabled, see next section + +L1 table entry: + + Bit 0 - 8: Reserved (set to 0) + + 9 - 55: Bits 9-55 of the offset into the image file at which the L2 + table starts. Must be aligned to a cluster boundary. If the + offset is 0, the L2 table and all clusters described by this + L2 table are unallocated. + + 56 - 62: Reserved (set to 0) + + 63: 0 for an L2 table that is unused or requires COW, 1 if its + refcount is exactly one. This information is only accurate + in the active L1 table. + +L2 table entry: + + Bit 0 - 61: Cluster descriptor + + 62: 0 for standard clusters + 1 for compressed clusters + + 63: 0 for clusters that are unused, compressed or require COW. + 1 for standard clusters whose refcount is exactly one. + This information is only accurate in L2 tables + that are reachable from the active L1 table. + + With external data files, all guest clusters have an + implicit refcount of 1 (because of the fixed host = guest + mapping for guest cluster offsets), so this bit should be 1 + for all allocated clusters. + +Standard Cluster Descriptor: + + Bit 0: If set to 1, the cluster reads as all zeros. The host + cluster offset can be used to describe a preallocation, + but it won't be used for reading data from this cluster, + nor is data read from the backing file if the cluster is + unallocated. + + With version 2 or with extended L2 entries (see the next + section), this is always 0. + + 1 - 8: Reserved (set to 0) + + 9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a + cluster boundary. If the offset is 0 and bit 63 is clear, + the cluster is unallocated. The offset may only be 0 with + bit 63 set (indicating a host cluster offset of 0) when an + external data file is used. 
+ + 56 - 61: Reserved (set to 0) + + +Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)): + + Bit 0 - x-1: Host cluster offset. This is usually _not_ aligned to a + cluster or sector boundary! If cluster_bits is + small enough that this field includes bits beyond + 55, those upper bits must be set to 0. + + x - 61: Number of additional 512-byte sectors used for the + compressed data, beyond the sector containing the offset + in the previous field. Some of these sectors may reside + in the next contiguous host cluster. + + Note that the compressed data does not necessarily occupy + all of the bytes in the final sector; rather, decompression + stops when it has produced a cluster of data. + + Another compressed cluster may map to the tail of the final + sector used by this compressed cluster. + +If a cluster is unallocated, read requests shall read the data from the backing +file (except if bit 0 in the Standard Cluster Descriptor is set). If there is +no backing file or the backing file is smaller than the image, they shall read +zeros for all parts that are not covered by the backing file. + +== Extended L2 Entries == + +An image uses Extended L2 Entries if bit 4 is set on the incompatible_features +field of the header. + +In these images standard data clusters are divided into 32 subclusters of the +same size. They are contiguous and start from the beginning of the cluster. +Subclusters can be allocated independently and the L2 entry contains information +indicating the status of each one of them. Compressed data clusters don't have +subclusters so they are treated the same as in images without this feature. + +The size of an extended L2 entry is 128 bits so the number of entries per table +is calculated using this formula: + + l2_entries = (cluster_size / (2 * sizeof(uint64_t))) + +The first 64 bits have the same format as the standard L2 table entry described +in the previous section, with the exception of bit 0 of the standard cluster +descriptor. + +The last 64 bits contain a subcluster allocation bitmap with this format: + +Subcluster Allocation Bitmap (for standard clusters): + + Bit 0 - 31: Allocation status (one bit per subcluster) + + 1: the subcluster is allocated. In this case the + host cluster offset field must contain a valid + offset. + 0: the subcluster is not allocated. In this case + read requests shall go to the backing file or + return zeros if there is no backing file data. + + Bits are assigned starting from the least significant + one (i.e. bit x is used for subcluster x). + + 32 - 63 Subcluster reads as zeros (one bit per subcluster) + + 1: the subcluster reads as zeros. In this case the + allocation status bit must be unset. The host + cluster offset field may or may not be set. + 0: no effect. + + Bits are assigned starting from the least significant + one (i.e. bit x is used for subcluster x - 32). + +Subcluster Allocation Bitmap (for compressed clusters): + + Bit 0 - 63: Reserved (set to 0) + Compressed clusters don't have subclusters, + so this field is not used. + +== Snapshots == + +qcow2 supports internal snapshots. Their basic principle of operation is to +switch the active L1 table, so that a different set of host clusters are +exposed to the guest. + +When creating a snapshot, the L1 table should be copied and the refcount of all +L2 tables and clusters reachable from this L1 table must be increased, so that +a write causes a COW and isn't visible in other snapshots. 
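+
+In the style of the earlier pseudocode, the copy-on-write preparation for a
+new snapshot might look as follows (a sketch only; copy_l1_table(),
+l1_entry_offset(), cluster_offset() and increment_refcount() are assumed
+helpers, not part of the format):
+
+    snapshot_l1_table = copy_l1_table(active_l1_table);
+
+    for (i = 0; i < l1_size; i++) {
+        if (l1_entry_offset(active_l1_table[i]) == 0)
+            continue;                      /* unallocated L2 table */
+        increment_refcount(l1_entry_offset(active_l1_table[i]));
+
+        l2_table = load_cluster(active_l1_table[i]);
+        for (j = 0; j < l2_entries; j++) {
+            if (cluster_offset(l2_table[j]) != 0)
+                increment_refcount(cluster_offset(l2_table[j]));
+        }
+
+        /* the refcount is now >= 2, so bit 63 (the "refcount exactly
+         * one" hint) must be cleared in the active table's entry */
+        active_l1_table[i] &= ~(1ULL << 63);
+    }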
+ +When loading a snapshot, bit 63 of all entries in the new active L1 table and +all L2 tables referenced by it must be reconstructed from the refcount table +as it doesn't need to be accurate in inactive L1 tables. + +A directory of all snapshots is stored in the snapshot table, a contiguous area +in the image file, whose starting offset and length are given by the header +fields snapshots_offset and nb_snapshots. The entries of the snapshot table +have variable length, depending on the length of ID, name and extra data. + +Snapshot table entry: + + Byte 0 - 7: Offset into the image file at which the L1 table for the + snapshot starts. Must be aligned to a cluster boundary. + + 8 - 11: Number of entries in the L1 table of the snapshots + + 12 - 13: Length of the unique ID string describing the snapshot + + 14 - 15: Length of the name of the snapshot + + 16 - 19: Time at which the snapshot was taken in seconds since the + Epoch + + 20 - 23: Subsecond part of the time at which the snapshot was taken + in nanoseconds + + 24 - 31: Time that the guest was running until the snapshot was + taken in nanoseconds + + 32 - 35: Size of the VM state in bytes. 0 if no VM state is saved. + If there is VM state, it starts at the first cluster + described by first L1 table entry that doesn't describe a + regular guest cluster (i.e. VM state is stored like guest + disk content, except that it is stored at offsets that are + larger than the virtual disk presented to the guest) + + 36 - 39: Size of extra data in the table entry (used for future + extensions of the format) + + variable: Extra data for future extensions. Unknown fields must be + ignored. Currently defined are (offset relative to snapshot + table entry): + + Byte 40 - 47: Size of the VM state in bytes. 0 if no VM + state is saved. If this field is present, + the 32-bit value in bytes 32-35 is ignored. + + Byte 48 - 55: Virtual disk size of the snapshot in bytes + + Byte 56 - 63: icount value which corresponds to + the record/replay instruction count + when the snapshot was taken. Set to -1 + if icount was disabled + + Version 3 images must include extra data at least up to + byte 55. + + variable: Unique ID string for the snapshot (not null terminated) + + variable: Name of the snapshot (not null terminated) + + variable: Padding to round up the snapshot table entry size to the + next multiple of 8. + + +== Bitmaps == + +As mentioned above, the bitmaps extension provides the ability to store bitmaps +related to a virtual disk. This section describes how these bitmaps are stored. + +All stored bitmaps are related to the virtual disk stored in the same image, so +each bitmap size is equal to the virtual disk size. + +Each bit of the bitmap is responsible for strictly defined range of the virtual +disk. For bit number bit_nr the corresponding range (in bytes) will be: + + [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1] + +Granularity is a property of the concrete bitmap, see below. + + +=== Bitmap directory === + +Each bitmap saved in the image is described in a bitmap directory entry. The +bitmap directory is a contiguous area in the image file, whose starting offset +and length are given by the header extension fields bitmap_directory_offset and +bitmap_directory_size. The entries of the bitmap directory have variable +length, depending on the lengths of the bitmap name and extra data. 
+
+Structure of a bitmap directory entry:
+
+    Byte  0 -  7:   bitmap_table_offset
+                    Offset into the image file at which the bitmap table
+                    (described below) for the bitmap starts. Must be aligned to
+                    a cluster boundary.
+
+          8 - 11:   bitmap_table_size
+                    Number of entries in the bitmap table of the bitmap.
+
+         12 - 15:   flags
+                    Bit
+                      0: in_use
+                         The bitmap was not saved correctly and may be
+                         inconsistent. Although the bitmap metadata is still
+                         well-formed from a qcow2 perspective, the metadata
+                         (such as the auto flag or bitmap size) or data
+                         contents may be outdated.
+
+                      1: auto
+                         The bitmap must reflect all changes of the virtual
+                         disk by any application that would write to this qcow2
+                         file (including writes, snapshot switching, etc.). The
+                         type of this bitmap must be 'dirty tracking bitmap'.
+
+                      2: extra_data_compatible
+                         This flag is meaningful when the extra data is
+                         unknown to the software (currently any extra data is
+                         unknown to QEMU).
+                         If it is set, the bitmap may be used as expected, and
+                         the extra data must be left as is.
+                         If it is not set, the bitmap must not be used, but
+                         both it and its extra data must be left as is.
+
+                    Bits 3 - 31 are reserved and must be 0.
+
+              16:   type
+                    This field describes the sort of the bitmap.
+                    Values:
+                      1: Dirty tracking bitmap
+
+                    Values 0, 2 - 255 are reserved.
+
+              17:   granularity_bits
+                    Granularity bits. Valid values: 0 - 63.
+
+                    Note: QEMU currently supports only values 9 - 31.
+
+                    Granularity is calculated as
+                        granularity = 1 << granularity_bits
+
+                    A bitmap's granularity is how many bytes of the image
+                    account for one bit of the bitmap.
+
+         18 - 19:   name_size
+                    Size of the bitmap name. Must be non-zero.
+
+                    Note: QEMU currently doesn't support values greater than
+                    1023.
+
+         20 - 23:   extra_data_size
+                    Size of type-specific extra data.
+
+                    For now, as no extra data is defined, extra_data_size is
+                    reserved and should be zero. If it is non-zero, the
+                    behavior is defined by the extra_data_compatible flag.
+
+        variable:   extra_data
+                    Extra data for the bitmap, occupying extra_data_size bytes.
+                    Extra data must never contain references to clusters or in
+                    some other way allocate additional clusters.
+
+        variable:   name
+                    The name of the bitmap (not null terminated), occupying
+                    name_size bytes. Must be unique among all bitmap names
+                    within the bitmaps extension.
+
+        variable:   Padding to round up the bitmap directory entry size to the
+                    next multiple of 8. All bytes of the padding must be zero.
+
+
+=== Bitmap table ===
+
+Each bitmap is stored using a one-level structure (as opposed to two-level
+structures like those used for refcounts and guest cluster mapping) for the
+mapping of bitmap data to host clusters. This structure is called the bitmap
+table.
+
+Each bitmap table has a variable size (stored in the bitmap directory entry)
+and may use multiple clusters, however, it must be contiguous in the image
+file.
+
+Structure of a bitmap table entry:
+
+    Bit       0:    Reserved and must be zero if bits 9 - 55 are non-zero.
+                    If bits 9 - 55 are zero:
+                      0: Cluster should be read as all zeros.
+                      1: Cluster should be read as all ones.
+
+          1 -  8:   Reserved and must be zero.
+
+          9 - 55:   Bits 9 - 55 of the host cluster offset. Must be aligned to
+                    a cluster boundary. If the offset is 0, the cluster is
+                    unallocated; in that case, bit 0 determines how this
+                    cluster should be treated during reads.
+
+         56 - 63:   Reserved and must be zero.
+
+
+=== Bitmap data ===
+
+As noted above, bitmap data is stored in separate clusters, described by the
+bitmap table.
Given an offset (in bytes) into the bitmap data, the offset into +the image file can be obtained as follows: + + image_offset(bitmap_data_offset) = + bitmap_table[bitmap_data_offset / cluster_size] + + (bitmap_data_offset % cluster_size) + +This offset is not defined if bits 9 - 55 of bitmap table entry are zero (see +above). + +Given an offset byte_nr into the virtual disk and the bitmap's granularity, the +bit offset into the image file to the corresponding bit of the bitmap can be +calculated like this: + + bit_offset(byte_nr) = + image_offset(byte_nr / granularity / 8) * 8 + + (byte_nr / granularity) % 8 + +If the size of the bitmap data is not a multiple of the cluster size then the +last cluster of the bitmap data contains some unused tail bits. These bits must +be zero. + + +=== Dirty tracking bitmaps === + +Bitmaps with 'type' field equal to one are dirty tracking bitmaps. + +When the virtual disk is in use dirty tracking bitmap may be 'enabled' or +'disabled'. While the bitmap is 'enabled', all writes to the virtual disk +should be reflected in the bitmap. A set bit in the bitmap means that the +corresponding range of the virtual disk (see above) was written to while the +bitmap was 'enabled'. An unset bit means that this range was not written to. + +The software doesn't have to sync the bitmap in the image file with its +representation in RAM after each write or metadata change. Flag 'in_use' +should be set while the bitmap is not synced. + +In the image file the 'enabled' state is reflected by the 'auto' flag. If this +flag is set, the software must consider the bitmap as 'enabled' and start +tracking virtual disk changes to this bitmap from the first write to the +virtual disk. If this flag is not set then the bitmap is disabled. diff --git a/docs/interop/qed_spec.txt b/docs/interop/qed_spec.txt new file mode 100644 index 000000000..7982e058b --- /dev/null +++ b/docs/interop/qed_spec.txt @@ -0,0 +1,138 @@ +=Specification= + +The file format looks like this: + + +----------+----------+----------+-----+ + | cluster0 | cluster1 | cluster2 | ... | + +----------+----------+----------+-----+ + +The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters. + +Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster. + +All fields are little-endian. + +==Header== + Header { + uint32_t magic; /* QED\0 */ + + uint32_t cluster_size; /* in bytes */ + uint32_t table_size; /* for L1 and L2 tables, in clusters */ + uint32_t header_size; /* in clusters */ + + uint64_t features; /* format feature bits */ + uint64_t compat_features; /* compat feature bits */ + uint64_t autoclear_features; /* self-resetting feature bits */ + + uint64_t l1_table_offset; /* in bytes */ + uint64_t image_size; /* total logical image size, in bytes */ + + /* if (features & QED_F_BACKING_FILE) */ + uint32_t backing_filename_offset; /* in bytes from start of header */ + uint32_t backing_filename_size; /* in bytes */ + } + +Field descriptions: +* ''cluster_size'' must be a power of 2 in range [2^12, 2^26]. 
+* ''table_size'' must be a power of 2 in range [1, 16].
+* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
+* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
+** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
+** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
+** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled by a newer program later.
+* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
+* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
+* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
+
+Feature bits:
+* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
+* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
+* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
+
+There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
+
+Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
+
+==Tables==
+
+Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
+
+ #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
+
+ Table {
+     uint64_t offsets[TABLE_NOFFSETS];
+ }
+
+The tables are organized as follows:
+
+                    +----------+
+                    | L1 table |
+                    +----------+
+               ,------'  |  '------.
+          +----------+   |    +----------+
+          | L2 table |  ...   | L2 table |
+          +----------+        +----------+
+      ,------'  |  '------.
+ +----------+   |    +----------+
+ |   Data   |  ...   |   Data   |
+ +----------+        +----------+
+
+A table is made up of one or more contiguous clusters. The table_size header field determines the table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
+
+The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
+ header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
+
+L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
+
+===L2 table offsets===
+* 0 - unallocated. The L2 table is not yet allocated.
+
+===Data cluster offsets===
+* 0 - unallocated. The data cluster is not yet allocated.
+* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
+
+Future format extensions may wish to store per-offset information. 
The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
+
+===Unallocated L2 tables and data clusters===
+Reads from an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
+
+Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written.
+
+===Zero data clusters===
+Zero data clusters are a space-efficient way of storing zeroed regions of the image.
+
+Reads from a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
+
+Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
+
+===Logical offset translation===
+Logical offsets are translated into cluster offsets as follows:
+
+  table_bits table_bits    cluster_bits
+  <--------> <--------> <--------------->
+ +----------+----------+-----------------+
+ | L1 index | L2 index |     byte offset |
+ +----------+----------+-----------------+
+
+       Structure of a logical offset
+
+ offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
+
+ def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
+   l2_offset = l1_table[l1_index]
+   l2_table = load_table(l2_offset)
+   cluster_offset = l2_table[l2_index] & offset_mask
+   return cluster_offset + byte_offset
+
+==Consistency checking==
+
+This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
+
+The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
+
+Consistency check includes the following invariants:
+# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
+# Offsets must be within the image file size and must be ''cluster_size'' aligned.
+# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
+
+The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
diff --git a/docs/interop/qemu-ga-ref.rst b/docs/interop/qemu-ga-ref.rst
new file mode 100644
index 000000000..032d49245
--- /dev/null
+++ b/docs/interop/qemu-ga-ref.rst
@@ -0,0 +1,7 @@
+QEMU Guest Agent Protocol Reference
+===================================
+
+.. contents::
+   :depth: 3
+
+.. 
qapi-doc:: qga/qapi-schema.json
diff --git a/docs/interop/qemu-ga.rst b/docs/interop/qemu-ga.rst
new file mode 100644
index 000000000..3063357bb
--- /dev/null
+++ b/docs/interop/qemu-ga.rst
@@ -0,0 +1,134 @@
+QEMU Guest Agent
+================
+
+Synopsis
+--------
+
+**qemu-ga** [*OPTIONS*]
+
+Description
+-----------
+
+The QEMU Guest Agent is a daemon intended to be run within virtual
+machines. It allows the hypervisor host to perform various operations
+in the guest, such as:
+
+- get information from the guest
+- set the guest's system time
+- read/write a file
+- sync and freeze the filesystems
+- suspend the guest
+- reconfigure guest local processors
+- set user's password
+- ...
+
+qemu-ga will read a system configuration file on startup (located at
+|CONFDIR|\ ``/qemu-ga.conf`` by default), then parse remaining
+configuration options on the command line. For the same key, the last
+option wins, but the lists accumulate (see below for the configuration
+file format).
+
+Options
+-------
+
+.. program:: qemu-ga
+
+.. option:: -m, --method=METHOD
+
+    Transport method: one of ``unix-listen``, ``virtio-serial``,
+    ``isa-serial``, or ``vsock-listen`` (``virtio-serial`` is the default).
+
+.. option:: -p, --path=PATH
+
+    Device/socket path (the default for virtio-serial is
+    ``/dev/virtio-ports/org.qemu.guest_agent.0``,
+    the default for isa-serial is ``/dev/ttyS0``). Socket addresses for
+    vsock-listen are written as ``<cid>:<port>``.
+
+.. option:: -l, --logfile=PATH
+
+    Set log file path (default is stderr).
+
+.. option:: -f, --pidfile=PATH
+
+    Specify pid file (default is ``/var/run/qemu-ga.pid``).
+
+.. option:: -F, --fsfreeze-hook=PATH
+
+    Enable fsfreeze hook. Accepts an optional argument that specifies the
+    script to run on freeze/thaw. The script will be called with
+    'freeze'/'thaw' arguments accordingly (default is
+    |CONFDIR|\ ``/fsfreeze-hook``). If using -F with an argument, do
+    not follow -F with a space (for example:
+    ``-F/var/run/fsfreezehook.sh``).
+
+.. option:: -t, --statedir=PATH
+
+    Specify the directory to store state information (absolute paths only,
+    default is ``/var/run``).
+
+.. option:: -v, --verbose
+
+    Log extra debugging information.
+
+.. option:: -V, --version
+
+    Print version information and exit.
+
+.. option:: -d, --daemon
+
+    Daemonize after startup (detach from terminal).
+
+.. option:: -b, --blacklist=LIST
+
+    Comma-separated list of RPCs to disable (no spaces, ``?`` to list
+    available RPCs).
+
+.. option:: -D, --dump-conf
+
+    Dump the configuration in a format compatible with ``qemu-ga.conf``
+    and exit.
+
+.. option:: -h, --help
+
+    Display this help and exit.
+
+Files
+-----
+
+The syntax of the ``qemu-ga.conf`` configuration file follows the
+Desktop Entry Specification; here is a quick summary: it consists of
+groups of key-value pairs, interspersed with comments.
+ +:: + + # qemu-ga configuration sample + [general] + daemonize = 0 + pidfile = /var/run/qemu-ga.pid + verbose = 0 + method = virtio-serial + path = /dev/virtio-ports/org.qemu.guest_agent.0 + statedir = /var/run + +The list of keys follows the command line options: + +============= =========== +Key Key type +============= =========== +daemon boolean +method string +path string +logfile string +pidfile string +fsfreeze-hook string +statedir string +verbose boolean +blacklist string list +============= =========== + +See also +-------- + +:manpage:`qemu(1)` diff --git a/docs/interop/qemu-qmp-ref.rst b/docs/interop/qemu-qmp-ref.rst new file mode 100644 index 000000000..357effd64 --- /dev/null +++ b/docs/interop/qemu-qmp-ref.rst @@ -0,0 +1,7 @@ +QEMU QMP Reference Manual +========================= + +.. contents:: + :depth: 3 + +.. qapi-doc:: qapi/qapi-schema.json diff --git a/docs/interop/qemu-storage-daemon-qmp-ref.rst b/docs/interop/qemu-storage-daemon-qmp-ref.rst new file mode 100644 index 000000000..9fed68152 --- /dev/null +++ b/docs/interop/qemu-storage-daemon-qmp-ref.rst @@ -0,0 +1,7 @@ +QEMU Storage Daemon QMP Reference Manual +======================================== + +.. contents:: + :depth: 3 + +.. qapi-doc:: storage-daemon/qapi/qapi-schema.json diff --git a/docs/interop/qmp-intro.txt b/docs/interop/qmp-intro.txt new file mode 100644 index 000000000..1c745a7af --- /dev/null +++ b/docs/interop/qmp-intro.txt @@ -0,0 +1,88 @@ + QEMU Machine Protocol + ===================== + +Introduction +------------ + +The QEMU Machine Protocol (QMP) allows applications to operate a +QEMU instance. + +QMP is JSON[1] based and features the following: + +- Lightweight, text-based, easy to parse data format +- Asynchronous messages support (ie. events) +- Capabilities Negotiation + +For detailed information on QMP's usage, please, refer to the following files: + +o qmp-spec.txt QEMU Machine Protocol current specification +o qemu-qmp-ref.html QEMU QMP commands and events (auto-generated at build-time) + +[1] https://www.json.org + +Usage +----- + +You can use the -qmp option to enable QMP. For example, the following +makes QMP available on localhost port 4444: + +$ qemu [...] -qmp tcp:localhost:4444,server=on,wait=off + +However, for more flexibility and to make use of more options, the -mon +command-line option should be used. For instance, the following example +creates one HMP instance (human monitor) on stdio and one QMP instance +on localhost port 4444: + +$ qemu [...] -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline \ + -chardev socket,id=mon1,host=localhost,port=4444,server=on,wait=off \ + -mon chardev=mon1,mode=control,pretty=on + +Please, refer to QEMU's manpage for more information. + +Simple Testing +-------------- + +To manually test QMP one can connect with telnet and issue commands by hand: + +$ telnet localhost 4444 +Trying 127.0.0.1... +Connected to localhost. +Escape character is '^]'. +{ + "QMP": { + "version": { + "qemu": { + "micro": 0, + "minor": 0, + "major": 3 + }, + "package": "v3.0.0" + }, + "capabilities": [ + "oob" + ] + } +} + +{ "execute": "qmp_capabilities" } +{ + "return": { + } +} + +{ "execute": "query-status" } +{ + "return": { + "status": "prelaunch", + "singlestep": false, + "running": false + } +} + +Please refer to docs/interop/qemu-qmp-ref.* for a complete command +reference, generated from qapi/qapi-schema.json. 
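+
+The same handshake can also be scripted. A minimal client might look like
+this (a sketch without JSON parsing or error handling, assuming a server
+started with the -qmp tcp example above; a real client should parse the
+replies instead of printing them):
+
+#include <stdio.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <sys/socket.h>
+
+int main(void)
+{
+    int fd = socket(AF_INET, SOCK_STREAM, 0);
+    struct sockaddr_in addr = {
+        .sin_family = AF_INET,
+        .sin_port = htons(4444),
+        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
+    };
+    char buf[4096];
+    ssize_t n;
+
+    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+        return 1;
+    }
+    n = read(fd, buf, sizeof(buf));            /* server greeting */
+    dprintf(fd, "{ \"execute\": \"qmp_capabilities\" }\n");
+    n = read(fd, buf, sizeof(buf));            /* { "return": {} } */
+    dprintf(fd, "{ \"execute\": \"query-status\" }\n");
+    n = read(fd, buf, sizeof(buf) - 1);
+    if (n > 0) {
+        buf[n] = 0;
+        printf("%s\n", buf);                   /* VM status reply */
+    }
+    close(fd);
+    return 0;
+}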
+
+QMP wiki page
+-------------
+
+https://wiki.qemu.org/QMP
diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
new file mode 100644
index 000000000..b0e8351d5
--- /dev/null
+++ b/docs/interop/qmp-spec.txt
@@ -0,0 +1,406 @@
+                      QEMU Machine Protocol Specification
+
+0. About This Document
+======================
+
+Copyright (C) 2009-2016 Red Hat, Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+1. Introduction
+===============
+
+This document specifies the QEMU Machine Protocol (QMP), a JSON-based
+protocol which is available for applications to operate QEMU at the
+machine level. It is also in use by the QEMU Guest Agent (QGA), which
+is available for host applications to interact with the guest
+operating system.
+
+2. Protocol Specification
+=========================
+
+This section details the protocol format. For the purpose of this
+document, "Server" is either QEMU or the QEMU Guest Agent, and
+"Client" is any application communicating with it via QMP.
+
+JSON data structures, when mentioned in this document, are always in the
+following format:
+
+    json-DATA-STRUCTURE-NAME
+
+Where DATA-STRUCTURE-NAME is any valid JSON data structure, as defined
+by the JSON standard:
+
+http://www.ietf.org/rfc/rfc8259.txt
+
+The server expects its input to be encoded in UTF-8, and sends its
+output encoded in ASCII.
+
+For convenience, json-object members mentioned in this document will
+be in a certain order. However, in real protocol usage they can be in
+ANY order, thus no particular order should be assumed. On the other
+hand, use of json-array elements presumes that preserving order is
+important unless specifically documented otherwise. Repeating a key
+within a json-object gives unpredictable results.
+
+Also for convenience, the server will accept an extension of
+'single-quoted' strings in place of the usual "double-quoted"
+json-string, and both input forms of strings understand an additional
+escape sequence of "\'" for a single quote. The server will only use
+double quoting on output.
+
+2.1 General Definitions
+-----------------------
+
+2.1.1 All interactions transmitted by the Server are json-objects, always
+      terminating with CRLF
+
+2.1.2 All json-object members are mandatory when not specified otherwise
+
+2.2 Server Greeting
+-------------------
+
+Right when connected, the Server will issue a greeting message, which signals
+that the connection has been successfully established and that the Server is
+ready for capabilities negotiation (for more information refer to section
+'4. Capabilities Negotiation').
+
+The greeting message format is:
+
+{ "QMP": { "version": json-object, "capabilities": json-array } }
+
+ Where,
+
+- The "version" member contains the Server's version information (the format
+  is the same as that of the query-version command)
+- The "capabilities" member specifies the availability of features beyond the
+  baseline specification; the order of elements in this array has no
+  particular significance.
+
+2.2.1 Capabilities
+------------------
+
+Currently supported capabilities are:
+
+- "oob": the QMP server supports "out-of-band" (OOB) command
+  execution, as described in section "2.3.1 Out-of-band execution".
+
+2.3 Issuing Commands
+--------------------
+
+The format for command execution is:
+
+{ "execute": json-string, "arguments": json-object, "id": json-value }
+
+or
+
+{ "exec-oob": json-string, "arguments": json-object, "id": json-value }
+
+ Where,
+
+- The "execute" or "exec-oob" member identifies the command to be
+  executed by the server. The latter requests out-of-band execution.
+- The "arguments" member is used to pass any arguments required for the
+  execution of the command; it is optional when no arguments are
+  required. Each command documents what contents will be considered
+  valid when handling the json-argument.
+- The "id" member is a transaction identification associated with the
+  command execution; it is optional and will be part of the response
+  if provided. The "id" member can be any json-value. A json-number
+  incremented for each successive command works fine.
+
+The actual commands are documented in the QEMU QMP reference manual
+docs/interop/qemu-qmp-ref.{7,html,info,pdf,txt}.
+
+2.3.1 Out-of-band execution
+---------------------------
+
+The server normally reads, executes and responds to one command after
+the other. The client therefore receives command responses in issue
+order.
+
+With out-of-band execution enabled via capability negotiation (section
+4.), the server reads and queues commands as they arrive. It executes
+commands from the queue one after the other. Commands executed
+out-of-band jump the queue: the command gets executed right away,
+possibly overtaking prior in-band commands. The client may therefore
+receive such a command's response before responses from prior in-band
+commands.
+
+To be able to match responses back to their commands, the client needs
+to pass "id" with out-of-band commands. Passing it with all commands
+is recommended for clients that accept capability "oob".
+
+If the client sends in-band commands faster than the server can
+execute them, the server will stop reading requests until the request
+queue length is reduced to an acceptable range.
+
+To ensure that commands to be executed out-of-band get read and executed,
+the client should have at most eight in-band commands in flight.
+
+Only a few commands support out-of-band execution. The ones that do
+have "allow-oob": true in the output of query-qmp-schema.
+
+2.4 Command Responses
+---------------------
+
+There are two possible responses which the Server will issue as the result
+of a command execution: success or error.
+
+As long as the commands were issued with a proper "id" field, the
+same "id" field will be attached in the corresponding response message
+so that requests and responses can match. Clients should drop all the
+responses that have an unknown "id" field.
+
+2.4.1 success
+-------------
+
+The format of a success response is:
+
+{ "return": json-value, "id": json-value }
+
+ Where,
+
+- The "return" member contains the data returned by the command, which
+  is defined on a per-command basis (usually a json-object or
+  json-array of json-objects, but sometimes a json-number, json-string,
+  or json-array of json-strings); it is an empty json-object if the
+  command does not return data
+- The "id" member contains the transaction identification associated
+  with the command execution if issued by the Client
+
+2.4.2 error
+-----------
+
+The format of an error response is:
+
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+
+ Where,
+
+- The "class" member contains the error class name (e.g. 
"GenericError") +- The "desc" member is a human-readable error message. Clients should + not attempt to parse this message. +- The "id" member contains the transaction identification associated with + the command execution if issued by the Client + +NOTE: Some errors can occur before the Server is able to read the "id" member, +in these cases the "id" member will not be part of the error response, even +if provided by the client. + +2.5 Asynchronous events +----------------------- + +As a result of state changes, the Server may send messages unilaterally +to the Client at any time, when not in the middle of any other +response. They are called "asynchronous events". + +The format of asynchronous events is: + +{ "event": json-string, "data": json-object, + "timestamp": { "seconds": json-number, "microseconds": json-number } } + + Where, + +- The "event" member contains the event's name +- The "data" member contains event specific data, which is defined in a + per-event basis, it is optional +- The "timestamp" member contains the exact time of when the event + occurred in the Server. It is a fixed json-object with time in + seconds and microseconds relative to the Unix Epoch (1 Jan 1970); if + there is a failure to retrieve host time, both members of the + timestamp will be set to -1. + +The actual asynchronous events are documented in the QEMU QMP +reference manual docs/interop/qemu-qmp-ref.{7,html,info,pdf,txt}. + +Some events are rate-limited to at most one per second. If additional +"similar" events arrive within one second, all but the last one are +dropped, and the last one is delayed. "Similar" normally means same +event type. + +2.6 Forcing the JSON parser into known-good state +------------------------------------------------- + +Incomplete or invalid input can leave the server's JSON parser in a +state where it can't parse additional commands. To get it back into +known-good state, the client should provoke a lexical error. + +The cleanest way to do that is sending an ASCII control character +other than '\t' (horizontal tab), '\r' (carriage return), or '\n' (new +line). + +Sadly, older versions of QEMU can fail to flag this as an error. If a +client needs to deal with them, it should send a 0xFF byte. + +2.7 QGA Synchronization +----------------------- + +When a client connects to QGA over a transport lacking proper +connection semantics such as virtio-serial, QGA may have read partial +input from a previous client. The client needs to force QGA's parser +into known-good state using the previous section's technique. +Moreover, the client may receive output a previous client didn't read. +To help with skipping that output, QGA provides the +'guest-sync-delimited' command. Refer to its documentation for +details. + + +3. QMP Examples +=============== + +This section provides some examples of real QMP usage, in all of them +"C" stands for "Client" and "S" stands for "Server". 
+
+3.1 Server greeting
+-------------------
+
+S: { "QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 3},
+     "package": "v3.0.0"}, "capabilities": ["oob"] } }
+
+3.2 Capabilities negotiation
+----------------------------
+
+C: { "execute": "qmp_capabilities", "arguments": { "enable": ["oob"] } }
+S: { "return": {}}
+
+3.3 Simple 'stop' execution
+---------------------------
+
+C: { "execute": "stop" }
+S: { "return": {} }
+
+3.4 KVM information
+-------------------
+
+C: { "execute": "query-kvm", "id": "example" }
+S: { "return": { "enabled": true, "present": true }, "id": "example"}
+
+3.5 Parsing error
+-----------------
+
+C: { "execute": }
+S: { "error": { "class": "GenericError", "desc": "Invalid JSON syntax" } }
+
+3.6 Powerdown event
+-------------------
+
+S: { "timestamp": { "seconds": 1258551470, "microseconds": 802384 },
+    "event": "POWERDOWN" }
+
+3.7 Out-of-band execution
+-------------------------
+
+C: { "exec-oob": "migrate-pause", "id": 42 }
+S: { "id": 42,
+     "error": { "class": "GenericError",
+       "desc": "migrate-pause is currently only supported during postcopy-active state" } }
+
+
+4. Capabilities Negotiation
+===========================
+
+When a Client successfully establishes a connection, the Server is in
+Capabilities Negotiation mode.
+
+In this mode only the qmp_capabilities command is allowed to run; all
+other commands will return the CommandNotFound error. Asynchronous
+messages are not delivered either.
+
+Clients should use the qmp_capabilities command to enable capabilities
+advertised in the Server's greeting (section '2.2 Server Greeting') they
+support.
+
+When the qmp_capabilities command is issued, and if it does not return an
+error, the Server enters Command mode, where capability changes take
+effect, all commands (except qmp_capabilities) are allowed and asynchronous
+messages are delivered.
+
+5. Compatibility Considerations
+===============================
+
+All protocol changes or new features which modify the protocol format in an
+incompatible way are disabled by default and will be advertised by the
+capabilities array (section '2.2 Server Greeting'). Thus, Clients can check
+that array and enable the capabilities they support.
+
+The QMP Server performs a type check on the arguments to a command. It
+generates an error if a value does not have the expected type for its
+key, or if it does not understand a key that the Client included. The
+strictness of the Server catches wrong assumptions of Clients about
+the Server's schema. Clients can assume that, when such validation
+errors occur, they will be reported before the command generated any
+side effect.
+
+However, Clients must not assume any particular:
+
+- Length of json-arrays
+- Size of json-objects; in particular, future versions of QEMU may add
+  new keys and Clients should be able to ignore them.
+- Order of json-object members or json-array elements
+- Number of errors generated by a command; that is, new errors can be added
+  to any existing command in newer versions of the Server
+
+Any command or member name beginning with "x-" is deemed experimental,
+and may be withdrawn or changed in an incompatible manner in a future
+release.
+
+Of course, the Server does guarantee to send valid JSON. But apart from
+this, a Client should be "conservative in what they send, and liberal in
+what they accept".
+
+6. Downstream extension of QMP
+==============================
+
+We recommend that downstream consumers of QEMU do *not* modify QMP. 
+Management tools should be able to support both upstream and downstream
+versions of QMP without special logic, and downstream extensions are
+inherently at odds with that.
+
+However, we recognize that it is sometimes impossible for downstreams to
+avoid modifying QMP. Both upstream and downstream need to take care to
+preserve long-term compatibility and interoperability.
+
+To help with that, QMP reserves JSON object member names beginning with
+'__' (double underscore) for downstream use ("downstream names"). This
+means upstream will never use any downstream names for its commands,
+arguments, errors, asynchronous events, and so forth.
+
+Any new names downstream wishes to add must begin with '__'. To
+ensure compatibility with other downstreams, it is strongly
+recommended that you prefix your downstream names with '__RFQDN_' where
+RFQDN is a valid, reverse fully qualified domain name which you
+control. For example, a qemu-kvm specific monitor command would be:
+
+    (qemu) __org.linux-kvm_enable_irqchip
+
+Downstream must not change the server greeting (section 2.2) other than
+to offer additional capabilities. But see below for why even that is
+discouraged.
+
+Section '5. Compatibility Considerations' applies to downstream as well
+as to upstream, obviously. It follows that downstream must behave
+exactly like upstream for any input not containing members with
+downstream names ("downstream members"), except it may add members
+with downstream names to its output.
+
+Thus, a client should not be able to distinguish downstream from
+upstream as long as it doesn't send input with downstream members, and
+properly ignores any downstream members in the output it receives.
+
+Advice on downstream modifications:
+
+1. Introducing new commands is okay. If you want to extend an existing
+   command, consider introducing a new one with the new behaviour
+   instead.
+
+2. Introducing new asynchronous messages is okay. If you want to extend
+   an existing message, consider adding a new one instead.
+
+3. Introducing new errors for use in new commands is okay. Adding new
+   errors to existing commands counts as extension, so 1. applies.
+
+4. New capabilities are strongly discouraged. Capabilities are for
+   evolving the basic protocol, and multiple diverging basic protocol
+   dialects are most undesirable.
diff --git a/docs/interop/vhost-user-gpu.rst b/docs/interop/vhost-user-gpu.rst
new file mode 100644
index 000000000..71a2c52b3
--- /dev/null
+++ b/docs/interop/vhost-user-gpu.rst
@@ -0,0 +1,243 @@
+=======================
+Vhost-user-gpu Protocol
+=======================
+
+..
+  Licence: This work is licensed under the terms of the GNU GPL,
+  version 2 or later. See the COPYING file in the top-level
+  directory.
+
+.. contents:: Table of Contents
+
+Introduction
+============
+
+The vhost-user-gpu protocol aims to share the rendering result of a
+virtio-gpu from a vhost-user slave process with a vhost-user master
+process (such as QEMU). It bears a resemblance to a display server
+protocol, if you consider QEMU as the display server and the slave as
+the client, but in a very limited way. Typically, it will work by
+setting a scanout/display configuration, before sending flush events
+for the display updates. It will also update the cursor shape and
+position.
+
+The protocol is sent over a UNIX domain stream socket, since it uses
+socket ancillary data to share opened file descriptors (DMABUF fds or
+shared memory). The socket is usually obtained via
+``VHOST_USER_GPU_SET_SOCKET``.
+
+Requests are sent by the *slave*, and the optional replies by the
+*master*.
+
+Wire format
+===========
+
+Unless specified otherwise, numbers are in the machine native byte
+order.
+
+A vhost-user-gpu message (request and reply) consists of 3 header
+fields and a payload.
+
++---------+-------+------+---------+
+| request | flags | size | payload |
++---------+-------+------+---------+
+
+Header
+------
+
+:request: ``u32``, type of the request
+
+:flags: ``u32``, 32-bit bit field:
+
+ - Bit 2 is the reply flag - needs to be set on each reply
+
+:size: ``u32``, size of the payload
+
+Payload types
+-------------
+
+Depending on the request type, **payload** can be:
+
+VhostUserGpuCursorPos
+^^^^^^^^^^^^^^^^^^^^^
+
++------------+---+---+
+| scanout-id | x | y |
++------------+---+---+
+
+:scanout-id: ``u32``, the scanout where the cursor is located
+
+:x/y: ``u32``, the cursor position
+
+VhostUserGpuCursorUpdate
+^^^^^^^^^^^^^^^^^^^^^^^^
+
++-----+-------+-------+--------+
+| pos | hot_x | hot_y | cursor |
++-----+-------+-------+--------+
+
+:pos: a ``VhostUserGpuCursorPos``, the cursor location
+
+:hot_x/hot_y: ``u32``, the cursor hot location
+
+:cursor: ``[u32; 64 * 64]``, 64x64 RGBA cursor data (PIXMAN_a8r8g8b8 format)
+
+VhostUserGpuScanout
+^^^^^^^^^^^^^^^^^^^
+
++------------+---+---+
+| scanout-id | w | h |
++------------+---+---+
+
+:scanout-id: ``u32``, the scanout configuration to set
+
+:w/h: ``u32``, the scanout width/height size
+
+VhostUserGpuUpdate
+^^^^^^^^^^^^^^^^^^
+
++------------+---+---+---+---+------+
+| scanout-id | x | y | w | h | data |
++------------+---+---+---+---+------+
+
+:scanout-id: ``u32``, the scanout content to update
+
+:x/y/w/h: ``u32``, region of the update
+
+:data: RGB data (PIXMAN_x8r8g8b8 format)
+
+VhostUserGpuDMABUFScanout
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
++------------+---+---+---+---+-----+-----+--------+-------+--------+
+| scanout-id | x | y | w | h | fdw | fdh | stride | flags | fourcc |
++------------+---+---+---+---+-----+-----+--------+-------+--------+
+
+:scanout-id: ``u32``, the scanout configuration to set
+
+:x/y: ``u32``, the location of the scanout within the DMABUF
+
+:w/h: ``u32``, the scanout width/height size
+
+:fdw/fdh/stride/flags: ``u32``, the DMABUF width/height/stride/flags
+
+:fourcc: ``i32``, the DMABUF fourcc
+
+
+C structure
+-----------
+
+In QEMU the vhost-user-gpu message is implemented with the following struct:
+
+.. code:: c
+
+  typedef struct VhostUserGpuMsg {
+      uint32_t request; /* VhostUserGpuRequest */
+      uint32_t flags;
+      uint32_t size; /* the following payload size */
+      union {
+          VhostUserGpuCursorPos cursor_pos;
+          VhostUserGpuCursorUpdate cursor_update;
+          VhostUserGpuScanout scanout;
+          VhostUserGpuUpdate update;
+          VhostUserGpuDMABUFScanout dmabuf_scanout;
+          struct virtio_gpu_resp_display_info display_info;
+          uint64_t u64;
+      } payload;
+  } QEMU_PACKED VhostUserGpuMsg;
+
+Protocol features
+-----------------
+
+None yet.
+
+As the protocol may need to evolve, new messages and communication
+changes are negotiated thanks to preliminary
+``VHOST_USER_GPU_GET_PROTOCOL_FEATURES`` and
+``VHOST_USER_GPU_SET_PROTOCOL_FEATURES`` requests.
+
+Communication
+=============
+
+Message types
+-------------
+
+``VHOST_USER_GPU_GET_PROTOCOL_FEATURES``
+  :id: 1
+  :request payload: N/A
+  :reply payload: ``u64``
+
+  Get the supported protocol features bitmask.
+
+``VHOST_USER_GPU_SET_PROTOCOL_FEATURES``
+  :id: 2
+  :request payload: ``u64``
+  :reply payload: N/A
+
+  Enable protocol features using a bitmask.
+ +``VHOST_USER_GPU_GET_DISPLAY_INFO`` + :id: 3 + :request payload: N/A + :reply payload: ``struct virtio_gpu_resp_display_info`` (from virtio specification) + + Get the preferred display configuration. + +``VHOST_USER_GPU_CURSOR_POS`` + :id: 4 + :request payload: ``VhostUserGpuCursorPos`` + :reply payload: N/A + + Set/show the cursor position. + +``VHOST_USER_GPU_CURSOR_POS_HIDE`` + :id: 5 + :request payload: ``VhostUserGpuCursorPos`` + :reply payload: N/A + + Set/hide the cursor. + +``VHOST_USER_GPU_CURSOR_UPDATE`` + :id: 6 + :request payload: ``VhostUserGpuCursorUpdate`` + :reply payload: N/A + + Update the cursor shape and location. + +``VHOST_USER_GPU_SCANOUT`` + :id: 7 + :request payload: ``VhostUserGpuScanout`` + :reply payload: N/A + + Set the scanout resolution. To disable a scanout, the dimensions + width/height are set to 0. + +``VHOST_USER_GPU_UPDATE`` + :id: 8 + :request payload: ``VhostUserGpuUpdate`` + :reply payload: N/A + + Update the scanout content. The data payload contains the graphical bits. + The display should be flushed and presented. + +``VHOST_USER_GPU_DMABUF_SCANOUT`` + :id: 9 + :request payload: ``VhostUserGpuDMABUFScanout`` + :reply payload: N/A + + Set the scanout resolution/configuration, and share a DMABUF file + descriptor for the scanout content, which is passed as ancillary + data. To disable a scanout, the dimensions width/height are set + to 0, there is no file descriptor passed. + +``VHOST_USER_GPU_DMABUF_UPDATE`` + :id: 10 + :request payload: ``VhostUserGpuUpdate`` + :reply payload: empty payload + + The display should be flushed and presented according to updated + region from ``VhostUserGpuUpdate``. + + Note: there is no data payload, since the scanout is shared thanks + to DMABUF, that must have been set previously with + ``VHOST_USER_GPU_DMABUF_SCANOUT``. diff --git a/docs/interop/vhost-user.json b/docs/interop/vhost-user.json new file mode 100644 index 000000000..b6ade9e49 --- /dev/null +++ b/docs/interop/vhost-user.json @@ -0,0 +1,267 @@ +# -*- Mode: Python -*- +# vim: filetype=python +# +# Copyright (C) 2018 Red Hat, Inc. +# +# Authors: +# Marc-André Lureau <marcandre.lureau@redhat.com> +# +# This work is licensed under the terms of the GNU GPL, version 2 or +# later. See the COPYING file in the top-level directory. + +## +# = vhost user backend discovery & capabilities +## + +## +# @VHostUserBackendType: +# +# List the various vhost user backend types. +# +# @9p: 9p virtio console +# @balloon: virtio balloon +# @block: virtio block +# @caif: virtio caif +# @console: virtio console +# @crypto: virtio crypto +# @gpu: virtio gpu +# @input: virtio input +# @net: virtio net +# @rng: virtio rng +# @rpmsg: virtio remote processor messaging +# @rproc-serial: virtio remoteproc serial link +# @scsi: virtio scsi +# @vsock: virtio vsock transport +# @fs: virtio fs (since 4.2) +# +# Since: 4.0 +## +{ + 'enum': 'VHostUserBackendType', + 'data': [ + '9p', + 'balloon', + 'block', + 'caif', + 'console', + 'crypto', + 'gpu', + 'input', + 'net', + 'rng', + 'rpmsg', + 'rproc-serial', + 'scsi', + 'vsock', + 'fs' + ] +} + +## +# @VHostUserBackendBlockFeature: +# +# List of vhost user "block" features. +# +# @read-only: The --read-only command line option is supported. +# @blk-file: The --blk-file command line option is supported. 
+# +# Since: 5.0 +## +{ + 'enum': 'VHostUserBackendBlockFeature', + 'data': [ 'read-only', 'blk-file' ] +} + +## +# @VHostUserBackendCapabilitiesBlock: +# +# Capabilities reported by vhost user "block" backends +# +# @features: list of supported features. +# +# Since: 5.0 +## +{ + 'struct': 'VHostUserBackendCapabilitiesBlock', + 'data': { + 'features': [ 'VHostUserBackendBlockFeature' ] + } +} + +## +# @VHostUserBackendInputFeature: +# +# List of vhost user "input" features. +# +# @evdev-path: The --evdev-path command line option is supported. +# @no-grab: The --no-grab command line option is supported. +# +# Since: 4.0 +## +{ + 'enum': 'VHostUserBackendInputFeature', + 'data': [ 'evdev-path', 'no-grab' ] +} + +## +# @VHostUserBackendCapabilitiesInput: +# +# Capabilities reported by vhost user "input" backends +# +# @features: list of supported features. +# +# Since: 4.0 +## +{ + 'struct': 'VHostUserBackendCapabilitiesInput', + 'data': { + 'features': [ 'VHostUserBackendInputFeature' ] + } +} + +## +# @VHostUserBackendGPUFeature: +# +# List of vhost user "gpu" features. +# +# @render-node: The --render-node command line option is supported. +# @virgl: The --virgl command line option is supported. +# +# Since: 4.0 +## +{ + 'enum': 'VHostUserBackendGPUFeature', + 'data': [ 'render-node', 'virgl' ] +} + +## +# @VHostUserBackendCapabilitiesGPU: +# +# Capabilities reported by vhost user "gpu" backends. +# +# @features: list of supported features. +# +# Since: 4.0 +## +{ + 'struct': 'VHostUserBackendCapabilitiesGPU', + 'data': { + 'features': [ 'VHostUserBackendGPUFeature' ] + } +} + +## +# @VHostUserBackendCapabilities: +# +# Capabilities reported by vhost user backends. +# +# @type: The vhost user backend type. +# +# Since: 4.0 +## +{ + 'union': 'VHostUserBackendCapabilities', + 'base': { 'type': 'VHostUserBackendType' }, + 'discriminator': 'type', + 'data': { + 'input': 'VHostUserBackendCapabilitiesInput', + 'gpu': 'VHostUserBackendCapabilitiesGPU' + } +} + +## +# @VhostUserBackend: +# +# Describes a vhost user backend to management software. +# +# It is possible for multiple @VhostUserBackend elements to match the +# search criteria of management software. Applications thus need rules +# to pick one of the many matches, and users need the ability to +# override distro defaults. +# +# It is recommended to create vhost user backend JSON files (each +# containing a single @VhostUserBackend root element) with a +# double-digit prefix, for example "50-qemu-gpu.json", +# "50-crosvm-gpu.json", etc, so they can be sorted in predictable +# order. The backend JSON files should be searched for in three +# directories: +# +# - /usr/share/qemu/vhost-user -- populated by distro-provided +# packages (XDG_DATA_DIRS covers +# /usr/share by default), +# +# - /etc/qemu/vhost-user -- exclusively for sysadmins' local additions, +# +# - $XDG_CONFIG_HOME/qemu/vhost-user -- exclusively for per-user local +# additions (XDG_CONFIG_HOME +# defaults to $HOME/.config). +# +# Top-down, the list of directories goes from general to specific. +# +# Management software should build a list of files from all three +# locations, then sort the list by filename (i.e., basename +# component). Management software should choose the first JSON file on +# the sorted list that matches the search criteria. If a more specific +# directory has a file with same name as a less specific directory, +# then the file in the more specific directory takes effect. If the +# more specific file is zero length, it hides the less specific one. 
+# +# For example, if a distro ships +# +# - /usr/share/qemu/vhost-user/50-qemu-gpu.json +# +# - /usr/share/qemu/vhost-user/50-crosvm-gpu.json +# +# then the sysadmin can prevent the default QEMU GPU being used at all with +# +# $ touch /etc/qemu/vhost-user/50-qemu-gpu.json +# +# The sysadmin can replace/alter the distro default QEMU GPU with +# +# $ vim /etc/qemu/vhost-user/50-qemu-gpu.json +# +# or they can provide a parallel QEMU GPU with higher priority +# +# $ vim /etc/qemu/vhost-user/10-qemu-gpu.json +# +# or they can provide a parallel QEMU GPU with lower priority +# +# $ vim /etc/qemu/vhost-user/99-qemu-gpu.json +# +# @type: The vhost user backend type. +# +# @description: Provides a human-readable description of the backend. +# Management software may or may not display @description. +# +# @binary: Absolute path to the backend binary. +# +# @tags: An optional list of auxiliary strings associated with the +# backend for which @description is not appropriate, due to the +# latter's possible exposure to the end-user. @tags serves +# development and debugging purposes only, and management +# software shall explicitly ignore it. +# +# Since: 4.0 +# +# Example: +# +# { +# "description": "QEMU vhost-user-gpu", +# "type": "gpu", +# "binary": "/usr/libexec/qemu/vhost-user-gpu", +# "tags": [ +# "CONFIG_OPENGL=y", +# "CONFIG_GBM=y" +# ] +# } +# +## +{ + 'struct' : 'VhostUserBackend', + 'data' : { + 'description': 'str', + 'type': 'VHostUserBackendType', + 'binary': 'str', + '*tags': [ 'str' ] + } +} diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst new file mode 100644 index 000000000..edc3ad84a --- /dev/null +++ b/docs/interop/vhost-user.rst @@ -0,0 +1,1585 @@ +.. _vhost_user_proto: + +=================== +Vhost-user Protocol +=================== + +.. + Copyright 2014 Virtual Open Systems Sarl. + Copyright 2019 Intel Corporation + Licence: This work is licensed under the terms of the GNU GPL, + version 2 or later. See the COPYING file in the top-level + directory. + +.. contents:: Table of Contents + +Introduction +============ + +This protocol is aiming to complement the ``ioctl`` interface used to +control the vhost implementation in the Linux kernel. It implements +the control plane needed to establish virtqueue sharing with a user +space process on the same host. It uses communication over a Unix +domain socket to share file descriptors in the ancillary data of the +message. + +The protocol defines 2 sides of the communication, *master* and +*slave*. *Master* is the application that shares its virtqueues, in +our case QEMU. *Slave* is the consumer of the virtqueues. + +In the current implementation QEMU is the *master*, and the *slave* is +the external process consuming the virtio queues, for example a +software Ethernet switch running in user space, such as Snabbswitch, +or a block device backend processing read & write to a virtual +disk. In order to facilitate interoperability between various backend +implementations, it is recommended to follow the :ref:`Backend program +conventions <backend_conventions>`. + +*Master* and *slave* can be either a client (i.e. connecting) or +server (listening) in the socket communication. + +Message Specification +===================== + +.. Note:: All numbers are in the machine native byte order. + +A vhost-user message consists of 3 header fields and a payload. 
+ ++---------+-------+------+---------+ +| request | flags | size | payload | ++---------+-------+------+---------+ + +Header +------ + +:request: 32-bit type of the request + +:flags: 32-bit bit field + +- Lower 2 bits are the version (currently 0x01) +- Bit 2 is the reply flag - needs to be sent on each reply from the slave +- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for + details. + +:size: 32-bit size of the payload + +Payload +------- + +Depending on the request type, **payload** can be: + +A single 64-bit integer +^^^^^^^^^^^^^^^^^^^^^^^ + ++-----+ +| u64 | ++-----+ + +:u64: a 64-bit unsigned integer + +A vring state description +^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------+-----+ +| index | num | ++-------+-----+ + +:index: a 32-bit index + +:num: a 32-bit number + +A vring address description +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------+-------+------+------------+------+-----------+-----+ +| index | flags | size | descriptor | used | available | log | ++-------+-------+------+------------+------+-----------+-----+ + +:index: a 32-bit vring index + +:flags: a 32-bit vring flags + +:descriptor: a 64-bit ring address of the vring descriptor table + +:used: a 64-bit ring address of the vring used ring + +:available: a 64-bit ring address of the vring available ring + +:log: a 64-bit guest address for logging + +Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has +been negotiated. Otherwise it is a user address. + +Memory regions description +^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------------+---------+---------+-----+---------+ +| num regions | padding | region0 | ... | region7 | ++-------------+---------+---------+-----+---------+ + +:num regions: a 32-bit number of regions + +:padding: 32-bit + +A region is: + ++---------------+------+--------------+-------------+ +| guest address | size | user address | mmap offset | ++---------------+------+--------------+-------------+ + +:guest address: a 64-bit guest address of the region + +:size: a 64-bit size + +:user address: a 64-bit user address + +:mmap offset: 64-bit offset where region starts in the mapped memory + +Single memory region description +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++---------+---------------+------+--------------+-------------+ +| padding | guest address | size | user address | mmap offset | ++---------+---------------+------+--------------+-------------+ + +:padding: 64-bit + +:guest address: a 64-bit guest address of the region + +:size: a 64-bit size + +:user address: a 64-bit user address + +:mmap offset: 64-bit offset where region starts in the mapped memory + +Log description +^^^^^^^^^^^^^^^ + ++----------+------------+ +| log size | log offset | ++----------+------------+ + +:log size: size of area used for logging + +:log offset: offset from start of supplied file descriptor where + logging starts (i.e. 
where guest address 0 would be + logged) + +An IOTLB message +^^^^^^^^^^^^^^^^ + ++------+------+--------------+-------------------+------+ +| iova | size | user address | permissions flags | type | ++------+------+--------------+-------------------+------+ + +:iova: a 64-bit I/O virtual address programmed by the guest + +:size: a 64-bit size + +:user address: a 64-bit user address + +:permissions flags: an 8-bit value: + - 0: No access + - 1: Read access + - 2: Write access + - 3: Read/Write access + +:type: an 8-bit IOTLB message type: + - 1: IOTLB miss + - 2: IOTLB update + - 3: IOTLB invalidate + - 4: IOTLB access fail + +Virtio device config space +^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++--------+------+-------+---------+ +| offset | size | flags | payload | ++--------+------+-------+---------+ + +:offset: a 32-bit offset of virtio device's configuration space + +:size: a 32-bit configuration space access size in bytes + +:flags: a 32-bit value: + - 0: Vhost master messages used for writeable fields + - 1: Vhost master messages used for live migration + +:payload: Size bytes array holding the contents of the virtio + device's configuration space + +Vring area description +^^^^^^^^^^^^^^^^^^^^^^ + ++-----+------+--------+ +| u64 | size | offset | ++-----+------+--------+ + +:u64: a 64-bit integer contains vring index and flags + +:size: a 64-bit size of this area + +:offset: a 64-bit offset of this area from the start of the + supplied file descriptor + +Inflight description +^^^^^^^^^^^^^^^^^^^^ + ++-----------+-------------+------------+------------+ +| mmap size | mmap offset | num queues | queue size | ++-----------+-------------+------------+------------+ + +:mmap size: a 64-bit size of area to track inflight I/O + +:mmap offset: a 64-bit offset of this area from the start + of the supplied file descriptor + +:num queues: a 16-bit number of virtqueues + +:queue size: a 16-bit size of virtqueues + +C structure +----------- + +In QEMU the vhost-user message is implemented with the following struct: + +.. code:: c + + typedef struct VhostUserMsg { + VhostUserRequest request; + uint32_t flags; + uint32_t size; + union { + uint64_t u64; + struct vhost_vring_state state; + struct vhost_vring_addr addr; + VhostUserMemory memory; + VhostUserLog log; + struct vhost_iotlb_msg iotlb; + VhostUserConfig config; + VhostUserVringArea area; + VhostUserInflight inflight; + }; + } QEMU_PACKED VhostUserMsg; + +Communication +============= + +The protocol for vhost-user is based on the existing implementation of +vhost for the Linux Kernel. Most messages that can be sent via the +Unix domain socket implementing vhost-user have an equivalent ioctl to +the kernel implementation. + +The communication consists of *master* sending message requests and +*slave* sending message replies. Most of the requests don't require +replies. Here is a list of the ones that do: + +* ``VHOST_USER_GET_FEATURES`` +* ``VHOST_USER_GET_PROTOCOL_FEATURES`` +* ``VHOST_USER_GET_VRING_BASE`` +* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``) +* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) + +.. seealso:: + + :ref:`REPLY_ACK <reply_ack>` + The section on ``REPLY_ACK`` protocol extension. 
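+
+A minimal sketch of how a slave might consume this framing follows. It
+assumes a blocking stream socket and ignores ancillary file
+descriptors, which require ``recvmsg()``; the names ``vu_hdr`` and
+``vu_read_msg`` are illustrative, not part of the protocol:
+
+.. code:: c
+
+  #include <stdint.h>
+  #include <unistd.h>
+
+  /* The 12-byte header described above: request, flags, size. */
+  struct vu_hdr {
+      uint32_t request;
+      uint32_t flags;
+      uint32_t size;
+  } __attribute__((packed));
+
+  /* Read one message: header first, then 'size' bytes of payload. */
+  static int vu_read_msg(int fd, struct vu_hdr *hdr,
+                         void *payload, uint32_t max)
+  {
+      if (read(fd, hdr, sizeof(*hdr)) != sizeof(*hdr)) {
+          return -1;
+      }
+      if (hdr->size > max) {
+          return -1; /* payload larger than the caller's buffer */
+      }
+      if (hdr->size && read(fd, payload, hdr->size) != (ssize_t)hdr->size) {
+          return -1;
+      }
+      return 0;
+  }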
+ +There are several messages that the master sends with file descriptors passed +in the ancillary data: + +* ``VHOST_USER_SET_MEM_TABLE`` +* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``) +* ``VHOST_USER_SET_LOG_FD`` +* ``VHOST_USER_SET_VRING_KICK`` +* ``VHOST_USER_SET_VRING_CALL`` +* ``VHOST_USER_SET_VRING_ERR`` +* ``VHOST_USER_SET_SLAVE_REQ_FD`` +* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) + +If *master* is unable to send the full message or receives a wrong +reply it will close the connection. An optional reconnection mechanism +can be implemented. + +If *slave* detects some error such as incompatible features, it may also +close the connection. This should only happen in exceptional circumstances. + +Any protocol extensions are gated by protocol feature bits, which +allows full backwards compatibility on both master and slave. As +older slaves don't support negotiating protocol features, a feature +bit was dedicated for this purpose:: + + #define VHOST_USER_F_PROTOCOL_FEATURES 30 + +Starting and stopping rings +--------------------------- + +Client must only process each ring when it is started. + +Client must only pass data between the ring and the backend, when the +ring is enabled. + +If ring is started but disabled, client must process the ring without +talking to the backend. + +For example, for a networking device, in the disabled state client +must not supply any new RX packets, but must process and discard any +TX packets. + +If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the +ring is initialized in an enabled state. + +If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is +initialized in a disabled state. Client must not pass data to/from the +backend until ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with +parameter 1, or after it has been disabled by +``VHOST_USER_SET_VRING_ENABLE`` with parameter 0. + +Each ring is initialized in a stopped state, client must not process +it until ring is started, or after it has been stopped. + +Client must start ring upon receiving a kick (that is, detecting that +file descriptor is readable) on the descriptor specified by +``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message +``VHOST_USER_VRING_KICK`` if negotiated, and stop ring upon receiving +``VHOST_USER_GET_VRING_BASE``. + +While processing the rings (whether they are enabled or not), client +must support changing some configuration aspects on the fly. + +Multiple queue support +---------------------- + +Many devices have a fixed number of virtqueues. In this case the master +already knows the number of available virtqueues without communicating with the +slave. + +Some devices do not have a fixed number of virtqueues. Instead the maximum +number of virtqueues is chosen by the slave. The number can depend on host +resource availability or slave implementation details. Such devices are called +multiple queue devices. + +Multiple queue support allows the slave to advertise the maximum number of +queues. This is treated as a protocol extension, hence the slave has to +implement protocol features first. The multiple queues feature is supported +only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set. + +The max number of queues the slave supports can be queried with message +``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested +queues is bigger than that. 
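+
+On the master side this reduces to a single feature bit test; a
+minimal sketch follows (the helper name and the clamping policy are
+assumptions, not mandated by the protocol):
+
+.. code:: c
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  #define VHOST_USER_PROTOCOL_F_MQ 0
+
+  /* May 'requested' queues be used, given the negotiated protocol
+   * features and the slave's VHOST_USER_GET_QUEUE_NUM reply? */
+  static bool vu_queues_allowed(uint64_t protocol_features,
+                                uint32_t requested, uint32_t slave_max)
+  {
+      if (!(protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_MQ))) {
+          /* Without F_MQ the queue count is fixed and known to the
+           * master a priori; GET_QUEUE_NUM must not be relied upon. */
+          return true;
+      }
+      return requested <= slave_max;
+  }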
+
+As all queues share one connection, the master uses a unique index for each
+queue in the sent message to identify a specified queue.
+
+The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
+vhost-user-net has historically automatically enabled the first queue pair.
+
+Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
+feature, even for devices with a fixed number of virtqueues, since it is simple
+to implement and offers a degree of introspection.
+
+Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
+devices with a fixed number of virtqueues. Only true multiqueue devices
+require this protocol feature.
+
+Migration
+---------
+
+During live migration, the master may need to track the modifications
+the slave makes to the memory mapped regions. The slave should mark
+the dirty pages in a log. Once it complies with this logging, it may
+declare the ``VHOST_F_LOG_ALL`` vhost feature.
+
+To start/stop logging of data/used ring writes, the master may send
+messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
+``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
+flags set to 1/0, respectively.
+
+All the modifications to memory pointed by vring "descriptor" should
+be marked. Modifications to "used" vring should be marked if
+``VHOST_VRING_F_LOG`` is part of ring's flags.
+
+Dirty pages are of size::
+
+  #define VHOST_LOG_PAGE 0x1000
+
+The log memory fd is provided in the ancillary data of
+``VHOST_USER_SET_LOG_BASE`` message when the slave has
+``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.
+
+The size of the log is supplied as part of ``VhostUserMsg`` which
+should be large enough to cover all known guest addresses. Log starts
+at the supplied offset in the supplied file descriptor. The log
+covers from address 0 to the maximum of guest regions. In pseudo-code,
+to mark page at ``addr`` as dirty::
+
+  page = addr / VHOST_LOG_PAGE
+  log[page / 8] |= 1 << page % 8
+
+Where ``addr`` is the guest physical address.
+
+Use atomic operations, as the log may be concurrently manipulated.
+
+Note that when logging modifications to the used ring (when
+``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
+be used to calculate the log offset: the write to first byte of the
+used ring is logged at this offset from log start. Also note that this
+value might be outside the legal guest physical address range
+(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
+the bit offset of the last byte of the ring must fall within the size
+supplied by ``VhostUserLog``.
+
+``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
+ancillary data; it may be used to inform the master that the log has
+been modified.
+
+Once the source has finished migration, rings will be stopped by the
+source. No further update must be done before rings are restarted.
+
+In postcopy migration the slave is started before all the memory has
+been received from the source host, and care must be taken to avoid
+accessing pages that have yet to be received. The slave opens a
+'userfault'-fd and registers the memory with it; this fd is then
+passed back over to the master. The master services requests on the
+userfaultfd for pages that are accessed and when the page is available
+it performs WAKE ioctls on the userfaultfd to wake the stalled
+slave. The slave indicates support for this via the
+``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
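+
+As a concrete rendering of the log-marking pseudo-code above, a slave
+could use C11 atomics on the mmap'd log area; this is a minimal
+sketch, and the function name is illustrative:
+
+.. code:: c
+
+  #include <stdatomic.h>
+  #include <stdint.h>
+
+  #define VHOST_LOG_PAGE 0x1000
+
+  /* Mark the page containing guest physical address 'addr' as dirty.
+   * 'log' points at the area set up via VHOST_USER_SET_LOG_BASE;
+   * atomics are required because the log is concurrently manipulated. */
+  static void vu_log_mark_dirty(_Atomic uint8_t *log, uint64_t addr)
+  {
+      uint64_t page = addr / VHOST_LOG_PAGE;
+
+      atomic_fetch_or(&log[page / 8], 1u << (page % 8));
+  }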
+
+Memory access
+-------------
+
+The master sends a list of vhost memory regions to the slave using the
+``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
+addresses: a guest address and a user address.
+
+Messages contain guest addresses and/or user addresses to reference locations
+within the shared memory. The mapping of these addresses works as follows.
+
+User addresses map to the vhost memory region containing that user address.
+
+When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:
+
+* Guest addresses map to the vhost memory region containing that guest
+  address.
+
+When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:
+
+* Guest addresses are also called I/O virtual addresses (IOVAs). They are
+  translated to user addresses via the IOTLB.
+
+* The vhost memory region guest address is not used.
+
+IOMMU support
+-------------
+
+When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
+master sends IOTLB entry updates and invalidations by sending
+``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
+vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
+has to be filled with the update message type (2), the I/O virtual
+address, the size, the user virtual address, and the permissions
+flags. Addresses and size must be within vhost memory regions set via
+the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
+``iotlb`` payload has to be filled with the invalidation message type
+(3), the I/O virtual address and the size. On success, the slave is
+expected to reply with a zero payload, non-zero otherwise.
+
+The slave relies on the slave communication channel (see :ref:`Slave
+communication <slave_communication>` section below) to send IOTLB miss
+and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
+requests to the master with a ``struct vhost_iotlb_msg`` as
+payload. For miss events, the iotlb payload has to be filled with the
+miss message type (1), the I/O virtual address and the permissions
+flags. For access failure events, the iotlb payload has to be filled
+with the access failure message type (4), the I/O virtual address and
+the permissions flags. For synchronization purposes, the slave may
+rely on the reply-ack feature, so the master may send a reply when the
+operation is completed if the reply-ack feature is negotiated and the
+slave requests a reply. For miss events, a completed operation means
+either that the master sent an update message containing the IOTLB
+entry for the requested address and permissions, or that the master
+sent nothing because the IOTLB miss message was invalid (invalid IOVA
+or permissions).
+
+The master isn't expected to take the initiative to send IOTLB update
+messages, as the slave sends IOTLB miss messages for the guest virtual
+memory areas it needs to access.
+
+.. _slave_communication:
+
+Slave communication
+-------------------
+
+An optional communication channel is provided if the slave declares
+``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
+slave to make requests to the master.
+
+The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.
+
+A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
+using this fd communication channel.
+
+If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
+negotiated, slave can send file descriptors (at most 8 descriptors in
+each message) to master via ancillary data using this fd communication
+channel.
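+
+Tying the two sections together, a slave could report an IOTLB miss on
+the slave channel roughly as follows. This is a sketch only: the
+struct mirrors the IOTLB payload described earlier, the constants
+follow this document, and in a real slave the payload would be wrapped
+in a full vhost-user message header before being written:
+
+.. code:: c
+
+  #include <stdint.h>
+  #include <string.h>
+  #include <unistd.h>
+
+  /* Mirrors the IOTLB message payload described above. */
+  struct vu_iotlb_msg {
+      uint64_t iova;
+      uint64_t size;
+      uint64_t uaddr;
+      uint8_t  perm;   /* 1: read, 2: write, 3: read/write */
+      uint8_t  type;   /* 1: miss, 2: update, 3: invalidate, 4: access fail */
+  };
+
+  /* Ask the master to fill in a translation for 'iova' (sketch only;
+   * the VHOST_USER_SLAVE_IOTLB_MSG header with flags/size is omitted). */
+  static int vu_report_iotlb_miss(int slave_fd, uint64_t iova, uint8_t perm)
+  {
+      struct vu_iotlb_msg miss;
+
+      memset(&miss, 0, sizeof(miss));
+      miss.iova = iova;
+      miss.perm = perm;
+      miss.type = 1; /* IOTLB miss */
+      return write(slave_fd, &miss, sizeof(miss)) == sizeof(miss) ? 0 : -1;
+  }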
+
+Inflight I/O tracking
+---------------------
+
+To support reconnecting after restart or crash, the slave may need to
+resubmit inflight I/Os. If the virtqueue is processed in order, we can
+easily achieve that by getting the inflight descriptors from the
+descriptor table (split virtqueue) or descriptor ring (packed
+virtqueue). However, this can't work when descriptors are processed
+out-of-order, because some entries which store the information of
+inflight descriptors in the available ring (split virtqueue) or
+descriptor ring (packed virtqueue) might be overwritten by new
+entries. To solve this problem, the slave needs to allocate an extra
+buffer to store the information of inflight descriptors and share it
+with the master so that it persists across reconnection.
+``VHOST_USER_GET_INFLIGHT_FD`` and ``VHOST_USER_SET_INFLIGHT_FD`` are
+used to transfer this buffer between master and slave. The format of
+this buffer is described below:
+
++---------------+---------------+-----+---------------+
+| queue0 region | queue1 region | ... | queueN region |
++---------------+---------------+-----+---------------+
+
+N is the number of available virtqueues. The slave can get it from the
+num queues field of ``VhostUserInflight``.
+
+For split virtqueue, queue region can be implemented as:
+
+.. code:: c
+
+  typedef struct DescStateSplit {
+      /* Indicate whether this descriptor is inflight or not.
+       * Only available for head-descriptor. */
+      uint8_t inflight;
+
+      /* Padding */
+      uint8_t padding[5];
+
+      /* Maintain a list for the last batch of used descriptors.
+       * Only available when batching is used for submitting */
+      uint16_t next;
+
+      /* Used to preserve the order of fetching available descriptors.
+       * Only available for head-descriptor. */
+      uint64_t counter;
+  } DescStateSplit;
+
+  typedef struct QueueRegionSplit {
+      /* The feature flags of this region. Now it's initialized to 0. */
+      uint64_t features;
+
+      /* The version of this region. It's 1 currently.
+       * Zero value indicates an uninitialized buffer */
+      uint16_t version;
+
+      /* The size of DescStateSplit array. It's equal to the virtqueue
+       * size. Slave could get it from queue size field of VhostUserInflight. */
+      uint16_t desc_num;
+
+      /* The head of list that track the last batch of used descriptors. */
+      uint16_t last_batch_head;
+
+      /* Store the idx value of used ring */
+      uint16_t used_idx;
+
+      /* Used to track the state of each descriptor in descriptor table */
+      DescStateSplit desc[];
+  } QueueRegionSplit;
+
+To track inflight I/O, the queue region should be processed as follows:
+
+When receiving available buffers from the driver:
+
+#. Get the next available head-descriptor index from the available
+   ring, ``i``
+
+#. Set ``desc[i].counter`` to the value of the global counter
+
+#. Increase the global counter by 1
+
+#. Set ``desc[i].inflight`` to 1
+
+When supplying used buffers to the driver:
+
+1. Get the corresponding used head-descriptor index, ``i``
+
+2. Set ``desc[i].next`` to ``last_batch_head``
+
+3. Set ``last_batch_head`` to ``i``
+
+#. Steps 1,2,3 may be performed repeatedly if batching is possible
+
+#. Increase the ``idx`` value of the used ring by the size of the batch
+
+#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the
+   batch to 0
+
+#. Set ``used_idx`` to the ``idx`` value of the used ring
+
+When reconnecting:
+
+#. If the value of ``used_idx`` does not match the ``idx`` value of
+   the used ring (which means the inflight field of ``DescStateSplit``
+   entries in the last batch may be incorrect),
+
+   a.
Subtract the value of ``used_idx`` from the ``idx`` value of + used ring to get last batch size of ``DescStateSplit`` entries + + #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch + list which starts from ``last_batch_head`` + + #. Set ``used_idx`` to the ``idx`` value of used ring + +#. Resubmit inflight ``DescStateSplit`` entries in order of their + counter value + +For packed virtqueue, queue region can be implemented as: + +.. code:: c + + typedef struct DescStatePacked { + /* Indicate whether this descriptor is inflight or not. + * Only available for head-descriptor. */ + uint8_t inflight; + + /* Padding */ + uint8_t padding; + + /* Link to the next free entry */ + uint16_t next; + + /* Link to the last entry of descriptor list. + * Only available for head-descriptor. */ + uint16_t last; + + /* The length of descriptor list. + * Only available for head-descriptor. */ + uint16_t num; + + /* Used to preserve the order of fetching available descriptors. + * Only available for head-descriptor. */ + uint64_t counter; + + /* The buffer id */ + uint16_t id; + + /* The descriptor flags */ + uint16_t flags; + + /* The buffer length */ + uint32_t len; + + /* The buffer address */ + uint64_t addr; + } DescStatePacked; + + typedef struct QueueRegionPacked { + /* The feature flags of this region. Now it's initialized to 0. */ + uint64_t features; + + /* The version of this region. It's 1 currently. + * Zero value indicates an uninitialized buffer */ + uint16_t version; + + /* The size of DescStatePacked array. It's equal to the virtqueue + * size. Slave could get it from queue size field of VhostUserInflight. */ + uint16_t desc_num; + + /* The head of free DescStatePacked entry list */ + uint16_t free_head; + + /* The old head of free DescStatePacked entry list */ + uint16_t old_free_head; + + /* The used index of descriptor ring */ + uint16_t used_idx; + + /* The old used index of descriptor ring */ + uint16_t old_used_idx; + + /* Device ring wrap counter */ + uint8_t used_wrap_counter; + + /* The old device ring wrap counter */ + uint8_t old_used_wrap_counter; + + /* Padding */ + uint8_t padding[7]; + + /* Used to track the state of each descriptor fetched from descriptor ring */ + DescStatePacked desc[]; + } QueueRegionPacked; + +To track inflight I/O, the queue region should be processed as follows: + +When receiving available buffers from the driver: + +#. Get the next available descriptor entry from descriptor ring, ``d`` + +#. If ``d`` is head descriptor, + + a. Set ``desc[old_free_head].num`` to 0 + + #. Set ``desc[old_free_head].counter`` to the value of global counter + + #. Increase global counter by 1 + + #. Set ``desc[old_free_head].inflight`` to 1 + +#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to + ``free_head`` + +#. Increase ``desc[old_free_head].num`` by 1 + +#. Set ``desc[free_head].addr``, ``desc[free_head].len``, + ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``, + ``d.len``, ``d.flags``, ``d.id`` + +#. Set ``free_head`` to ``desc[free_head].next`` + +#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head`` + +When supplying used buffers to the driver: + +1. Get corresponding used head-descriptor entry from descriptor ring, + ``d`` + +2. Get corresponding ``DescStatePacked`` entry, ``e`` + +3. Set ``desc[e.last].next`` to ``free_head`` + +4. Set ``free_head`` to the index of ``e`` + +#. Steps 1,2,3,4 may be performed repeatedly if batching is possible + +#. 
Increase ``used_idx`` by the size of the batch and update
+   ``used_wrap_counter`` if needed
+
+#. Update ``d.flags``
+
+#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
+   in the batch to 0
+
+#. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
+   to ``free_head``, ``used_idx``, ``used_wrap_counter``
+
+When reconnecting:
+
+#. If ``used_idx`` does not match ``old_used_idx`` (which means the
+   ``inflight`` field of ``DescStatePacked`` entries in the last batch
+   may be incorrect),
+
+   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``
+
+   #. Use ``old_used_wrap_counter`` to calculate the available flags
+
+   #. If ``d.flags`` is not equal to the calculated flags value (which
+      means the slave has submitted the buffer to the guest driver
+      before the crash, so it has to commit the in-progress update),
+      set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
+      to ``free_head``, ``used_idx``, ``used_wrap_counter``
+
+#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
+   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
+   (roll back any in-progress update)
+
+#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
+   the free list to 0
+
+#. Resubmit inflight ``DescStatePacked`` entries in order of their
+   counter value
+
+In-band notifications
+---------------------
+
+In some limited situations (e.g. for simulation) it is desirable to
+have the kick, call and error (if used) signals done via in-band
+messages instead of asynchronous eventfd notifications. This can be
+done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
+protocol feature.
+
+Note that because too many messages on the sockets can cause the
+sending application(s) to block, it is not advised to use this feature
+unless absolutely necessary. It is also considered an error to
+negotiate this feature without also negotiating
+``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``;
+the former is necessary for getting a message channel from the slave
+to the master, while the latter needs to be used with the in-band
+notification messages to block until they are processed, both to avoid
+blocking later and for proper processing (at least in the simulation
+use case). As it has no other way of signalling this error, the slave
+should close the connection as a response to a
+``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
+notifications feature flag without the other two.
+
+Protocol features
+-----------------
+
+.. code:: c
+
+  #define VHOST_USER_PROTOCOL_F_MQ                    0
+  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
+  #define VHOST_USER_PROTOCOL_F_RARP                  2
+  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
+  #define VHOST_USER_PROTOCOL_F_MTU                   4
+  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
+  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
+  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
+  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
+  #define VHOST_USER_PROTOCOL_F_CONFIG                9
+  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD        10
+  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
+  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
+  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
+  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
+  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
+  #define VHOST_USER_PROTOCOL_F_STATUS               16
+
+Master message types
+--------------------
+
+``VHOST_USER_GET_FEATURES``
+  :id: 1
+  :equivalent ioctl: ``VHOST_GET_FEATURES``
+  :master payload: N/A
+  :slave payload: ``u64``
+
+  Get the features bitmask from the underlying vhost implementation.
+  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
+  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
+  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
+
+``VHOST_USER_SET_FEATURES``
+  :id: 2
+  :equivalent ioctl: ``VHOST_SET_FEATURES``
+  :master payload: ``u64``
+
+  Enable features in the underlying vhost implementation using a
+  bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
+  slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
+  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
+
+``VHOST_USER_GET_PROTOCOL_FEATURES``
+  :id: 15
+  :equivalent ioctl: ``VHOST_GET_FEATURES``
+  :master payload: N/A
+  :slave payload: ``u64``
+
+  Get the protocol feature bitmask from the underlying vhost
+  implementation. Only legal if feature bit
+  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
+  ``VHOST_USER_GET_FEATURES``.
+
+.. Note::
+   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
+   support this message even before ``VHOST_USER_SET_FEATURES`` was
+   called.
+
+``VHOST_USER_SET_PROTOCOL_FEATURES``
+  :id: 16
+  :equivalent ioctl: ``VHOST_SET_FEATURES``
+  :master payload: ``u64``
+
+  Enable protocol features in the underlying vhost implementation.
+
+  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
+  ``VHOST_USER_GET_FEATURES``.
+
+.. Note::
+   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
+   this message even before ``VHOST_USER_SET_FEATURES`` was called.
+
+``VHOST_USER_SET_OWNER``
+  :id: 3
+  :equivalent ioctl: ``VHOST_SET_OWNER``
+  :master payload: N/A
+
+  Issued when a new connection is established. It sets the current
+  *master* as the owner of the session. This can be used on the *slave*
+  as a "session start" flag.
+
+``VHOST_USER_RESET_OWNER``
+  :id: 4
+  :master payload: N/A
+
+.. admonition:: Deprecated
+
+   This is no longer used. Used to be sent to request disabling all
+   rings, but some clients interpreted it to also discard connection
+   state (this interpretation would lead to bugs). It is recommended
+   that clients either ignore this message, or use it to disable all
+   rings.
+
+``VHOST_USER_SET_MEM_TABLE``
+  :id: 5
+  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
+  :master payload: memory regions description
+  :slave payload: (postcopy only) memory regions description
+
+  Sets the memory map regions on the slave so it can translate the
+  vring addresses. In the ancillary data there is an array of file
+  descriptors for each memory mapped region.
The size and ordering of + the fds matches the number and ordering of memory regions. + + When ``VHOST_USER_POSTCOPY_LISTEN`` has been received, + ``SET_MEM_TABLE`` replies with the bases of the memory mapped + regions to the master. The slave must have mmap'd the regions but + not yet accessed them and should not yet generate a userfault + event. + +.. Note:: + ``NEED_REPLY_MASK`` is not set in this case. QEMU will then + reply back to the list of mappings with an empty + ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon + reception of this message may the guest start accessing the memory + and generating faults. + +``VHOST_USER_SET_LOG_BASE`` + :id: 6 + :equivalent ioctl: ``VHOST_SET_LOG_BASE`` + :master payload: u64 + :slave payload: N/A + + Sets logging shared memory space. + + When slave has ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature, + the log memory fd is provided in the ancillary data of + ``VHOST_USER_SET_LOG_BASE`` message, the size and offset of shared + memory area provided in the message. + +``VHOST_USER_SET_LOG_FD`` + :id: 7 + :equivalent ioctl: ``VHOST_SET_LOG_FD`` + :master payload: N/A + + Sets the logging file descriptor, which is passed as ancillary data. + +``VHOST_USER_SET_VRING_NUM`` + :id: 8 + :equivalent ioctl: ``VHOST_SET_VRING_NUM`` + :master payload: vring state description + + Set the size of the queue. + +``VHOST_USER_SET_VRING_ADDR`` + :id: 9 + :equivalent ioctl: ``VHOST_SET_VRING_ADDR`` + :master payload: vring address description + :slave payload: N/A + + Sets the addresses of the different aspects of the vring. + +``VHOST_USER_SET_VRING_BASE`` + :id: 10 + :equivalent ioctl: ``VHOST_SET_VRING_BASE`` + :master payload: vring state description + + Sets the base offset in the available vring. + +``VHOST_USER_GET_VRING_BASE`` + :id: 11 + :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` + :master payload: vring state description + :slave payload: vring state description + + Get the available vring base offset. + +``VHOST_USER_SET_VRING_KICK`` + :id: 12 + :equivalent ioctl: ``VHOST_SET_VRING_KICK`` + :master payload: ``u64`` + + Set the event file descriptor for adding buffers to the vring. It is + passed in the ancillary data. + + Bits (0-7) of the payload contain the vring index. Bit 8 is the + invalid FD flag. This flag is set when there is no file descriptor + in the ancillary data. This signals that polling should be used + instead of waiting for the kick. Note that if the protocol feature + ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated + this message isn't necessary as the ring is also started on the + ``VHOST_USER_VRING_KICK`` message, it may however still be used to + set an event file descriptor (which will be preferred over the + message) or to enable polling. + +``VHOST_USER_SET_VRING_CALL`` + :id: 13 + :equivalent ioctl: ``VHOST_SET_VRING_CALL`` + :master payload: ``u64`` + + Set the event file descriptor to signal when buffers are used. It is + passed in the ancillary data. + + Bits (0-7) of the payload contain the vring index. Bit 8 is the + invalid FD flag. This flag is set when there is no file descriptor + in the ancillary data. This signals that polling will be used + instead of waiting for the call. 
Note that if the protocol features + ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and + ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message + isn't necessary as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be + used, it may however still be used to set an event file descriptor + or to enable polling. + +``VHOST_USER_SET_VRING_ERR`` + :id: 14 + :equivalent ioctl: ``VHOST_SET_VRING_ERR`` + :master payload: ``u64`` + + Set the event file descriptor to signal when error occurs. It is + passed in the ancillary data. + + Bits (0-7) of the payload contain the vring index. Bit 8 is the + invalid FD flag. This flag is set when there is no file descriptor + in the ancillary data. Note that if the protocol features + ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and + ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message + isn't necessary as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be + used, it may however still be used to set an event file descriptor + (which will be preferred over the message). + +``VHOST_USER_GET_QUEUE_NUM`` + :id: 17 + :equivalent ioctl: N/A + :master payload: N/A + :slave payload: u64 + + Query how many queues the backend supports. + + This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ`` + is set in queried protocol features by + ``VHOST_USER_GET_PROTOCOL_FEATURES``. + +``VHOST_USER_SET_VRING_ENABLE`` + :id: 18 + :equivalent ioctl: N/A + :master payload: vring state description + + Signal slave to enable or disable corresponding vring. + + This request should be sent only when + ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated. + +``VHOST_USER_SEND_RARP`` + :id: 19 + :equivalent ioctl: N/A + :master payload: ``u64`` + + Ask vhost user backend to broadcast a fake RARP to notify the migration + is terminated for guest that does not support GUEST_ANNOUNCE. + + Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is + present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit + ``VHOST_USER_PROTOCOL_F_RARP`` is present in + ``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the + payload contain the mac address of the guest to allow the vhost user + backend to construct and broadcast the fake RARP. + +``VHOST_USER_NET_SET_MTU`` + :id: 20 + :equivalent ioctl: N/A + :master payload: ``u64`` + + Set host MTU value exposed to the guest. + + This request should be sent only when ``VIRTIO_NET_F_MTU`` feature + has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES`` + is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit + ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in + ``VHOST_USER_GET_PROTOCOL_FEATURES``. + + If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must + respond with zero in case the specified MTU is valid, or non-zero + otherwise. + +``VHOST_USER_SET_SLAVE_REQ_FD`` + :id: 21 + :equivalent ioctl: N/A + :master payload: N/A + + Set the socket file descriptor for slave initiated requests. It is passed + in the ancillary data. + + This request should be sent only when + ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol + feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` bit is present in + ``VHOST_USER_GET_PROTOCOL_FEATURES``. If + ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must + respond with zero for success, non-zero otherwise. 
+
+``VHOST_USER_IOTLB_MSG``
+  :id: 22
+  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
+  :master payload: ``struct vhost_iotlb_msg``
+  :slave payload: ``u64``
+
+  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
+
+  Master sends such requests to update and invalidate entries in the
+  device IOTLB. The slave has to acknowledge the request by sending
+  zero as ``u64`` payload for success, non-zero otherwise.
+
+  This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
+  feature has been successfully negotiated.
+
+``VHOST_USER_SET_VRING_ENDIAN``
+  :id: 23
+  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
+  :master payload: vring state description
+
+  Set the endianness of a VQ for legacy devices. Little-endian is
+  indicated with state.num set to 0 and big-endian is indicated with
+  state.num set to 1. Other values are invalid.
+
+  This request should be sent only when
+  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
+  Backends that negotiated this feature should handle both
+  endiannesses and expect this message once (per VQ) during device
+  configuration (i.e. before the master starts the VQ).
+
+``VHOST_USER_GET_CONFIG``
+  :id: 24
+  :equivalent ioctl: N/A
+  :master payload: virtio device config space
+  :slave payload: virtio device config space
+
+  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
+  submitted by the vhost-user master to fetch the contents of the
+  virtio device configuration space. The vhost-user slave's payload
+  size MUST match the master's request, and the vhost-user slave uses
+  a zero-length payload to indicate an error to the vhost-user master.
+  The vhost-user master may cache the contents to avoid repeated
+  ``VHOST_USER_GET_CONFIG`` calls.
+
+``VHOST_USER_SET_CONFIG``
+  :id: 25
+  :equivalent ioctl: N/A
+  :master payload: virtio device config space
+  :slave payload: N/A
+
+  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
+  submitted by the vhost-user master when the Guest changes the virtio
+  device configuration space and can also be used for live migration
+  on the destination host. The vhost-user slave must check the flags
+  field, and slaves MUST NOT accept SET_CONFIG for read-only
+  configuration space fields unless the live migration bit is set.
+
+``VHOST_USER_CREATE_CRYPTO_SESSION``
+  :id: 26
+  :equivalent ioctl: N/A
+  :master payload: crypto session description
+  :slave payload: crypto session description
+
+  Create a session for crypto operation. The server side must return
+  the session id, 0 or positive for success, negative for failure.
+  This request should be sent only when
+  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
+  successfully negotiated. It's a required feature for crypto
+  devices.
+
+``VHOST_USER_CLOSE_CRYPTO_SESSION``
+  :id: 27
+  :equivalent ioctl: N/A
+  :master payload: ``u64``
+
+  Close a session for crypto operation which was previously
+  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
+
+  This request should be sent only when
+  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
+  successfully negotiated. It's a required feature for crypto
+  devices.
+
+``VHOST_USER_POSTCOPY_ADVISE``
+  :id: 28
+  :master payload: N/A
+  :slave payload: userfault fd
+
+  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
+  advises slave that a migration with postcopy enabled is underway;
+  the slave must open a userfaultfd for later use. Note that at this
+  stage the migration is still in precopy mode.
+
+``VHOST_USER_POSTCOPY_LISTEN``
+  :id: 29
+  :master payload: N/A
+
+  The master advises the slave that a transition to postcopy mode has
+  happened. The slave must ensure that shared memory is registered
+  with userfaultfd to cause faulting of non-present pages.
+
+  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
+  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
+
+``VHOST_USER_POSTCOPY_END``
+  :id: 30
+  :slave payload: ``u64``
+
+  The master advises that postcopy migration has now completed. The
+  slave must disable the userfaultfd. The response is an
+  acknowledgement only.
+
+  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
+  is sent at the end of the migration, after
+  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
+
+  The value returned is an error indication; 0 is success.
+
+``VHOST_USER_GET_INFLIGHT_FD``
+  :id: 31
+  :equivalent ioctl: N/A
+  :master payload: inflight description
+
+  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
+  been successfully negotiated, this message is submitted by the master
+  to get a shared buffer from the slave. The shared buffer will be used
+  by the slave to track inflight I/O. QEMU should retrieve a new one
+  when the VM is reset.
+
+``VHOST_USER_SET_INFLIGHT_FD``
+  :id: 32
+  :equivalent ioctl: N/A
+  :master payload: inflight description
+
+  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
+  been successfully negotiated, this message is submitted by the master
+  to send the shared inflight buffer back to the slave so that the slave
+  can recover inflight I/O after a crash or restart.
+
+``VHOST_USER_GPU_SET_SOCKET``
+  :id: 33
+  :equivalent ioctl: N/A
+  :master payload: N/A
+
+  Sets the GPU protocol socket file descriptor, which is passed as
+  ancillary data. The GPU protocol is used to inform the master of
+  rendering state and updates. See vhost-user-gpu.rst for details.
+
+``VHOST_USER_RESET_DEVICE``
+  :id: 34
+  :equivalent ioctl: N/A
+  :master payload: N/A
+  :slave payload: N/A
+
+  Ask the vhost user backend to disable all rings and reset all
+  internal device state to the initial state, ready to be
+  reinitialized. The backend retains ownership of the device
+  throughout the reset operation.
+
+  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
+  feature is set by the backend.
+
+``VHOST_USER_VRING_KICK``
+  :id: 35
+  :equivalent ioctl: N/A
+  :slave payload: vring state description
+  :master payload: N/A
+
+  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
+  feature has been successfully negotiated, this message may be
+  submitted by the master to indicate that a buffer was added to
+  the vring instead of signalling it using the vring's kick file
+  descriptor or having the slave rely on polling.
+
+  The state.num field is currently reserved and must be set to 0.
+
+``VHOST_USER_GET_MAX_MEM_SLOTS``
+  :id: 36
+  :equivalent ioctl: N/A
+  :slave payload: ``u64``
+
+  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+  feature has been successfully negotiated, this message is submitted
+  by the master to the slave. The slave should return the message with
+  a u64 payload containing the maximum number of memory slots for
+  QEMU to expose to the guest. The value returned by the backend
+  will be capped at the maximum number of RAM slots which can be
+  supported by the target platform.
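+
+To make the inflight messages above more concrete, the ``inflight
+description`` payload essentially names a shared memory mapping whose
+file descriptor travels in the ancillary data. A hedged C sketch of
+its shape (field names chosen for illustration, mirroring QEMU's
+internal layout rather than quoting it):
+
+.. code:: c
+
+   #include <stdint.h>
+
+   /* Sketch: the "inflight description" payload; the buffer itself
+    * is the fd passed as ancillary data, mmap'd with these values. */
+   struct vhost_user_inflight_desc {
+       uint64_t mmap_size;    /* size of the shared region to mmap */
+       uint64_t mmap_offset;  /* offset of the region within the fd */
+       uint16_t num_queues;   /* number of virtqueues tracked */
+       uint16_t queue_size;   /* size of each virtqueue */
+   };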
+
+``VHOST_USER_ADD_MEM_REG``
+  :id: 37
+  :equivalent ioctl: N/A
+  :slave payload: single memory region description
+
+  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+  feature has been successfully negotiated, this message is submitted
+  by the master to the slave. The message payload contains a memory
+  region descriptor struct, describing a region of guest memory which
+  the slave device must map in. Together with the
+  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
+  update the memory tables of the slave device.
+
+``VHOST_USER_REM_MEM_REG``
+  :id: 38
+  :equivalent ioctl: N/A
+  :slave payload: single memory region description
+
+  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+  feature has been successfully negotiated, this message is submitted
+  by the master to the slave. The message payload contains a memory
+  region descriptor struct, describing a region of guest memory which
+  the slave device must unmap. Together with the
+  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
+  update the memory tables of the slave device.
+
+``VHOST_USER_SET_STATUS``
+  :id: 39
+  :equivalent ioctl: ``VHOST_VDPA_SET_STATUS``
+  :slave payload: N/A
+  :master payload: ``u64``
+
+  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
+  successfully negotiated, this message is submitted by the master to
+  notify the backend with updated device status as defined in the Virtio
+  specification.
+
+``VHOST_USER_GET_STATUS``
+  :id: 40
+  :equivalent ioctl: ``VHOST_VDPA_GET_STATUS``
+  :slave payload: ``u64``
+  :master payload: N/A
+
+  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
+  successfully negotiated, this message is submitted by the master to
+  query the backend for its device status as defined in the Virtio
+  specification.
+
+
+Slave message types
+-------------------
+
+``VHOST_USER_SLAVE_IOTLB_MSG``
+  :id: 1
+  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
+  :slave payload: ``struct vhost_iotlb_msg``
+  :master payload: N/A
+
+  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
+  The slave sends such requests to notify of an IOTLB miss, or an IOTLB
+  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
+  negotiated, and the slave sets the ``VHOST_USER_NEED_REPLY`` flag,
+  the master must respond with zero when the operation has completed
+  successfully, or non-zero otherwise. This request should be sent only
+  when the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
+  negotiated.
+
+``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
+  :id: 2
+  :equivalent ioctl: N/A
+  :slave payload: N/A
+  :master payload: N/A
+
+  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
+  slave sends such messages to notify the master that the virtio
+  device's configuration space has changed. For host devices which
+  support this feature, the host driver can then send a
+  ``VHOST_USER_GET_CONFIG`` message to the slave to fetch the latest
+  content. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and
+  the slave sets the ``VHOST_USER_NEED_REPLY`` flag, the master must
+  respond with zero when the operation has completed successfully, or
+  non-zero otherwise.
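+
+For reference, the ``single memory region description`` payload used
+by the ``VHOST_USER_ADD_MEM_REG`` and ``VHOST_USER_REM_MEM_REG``
+messages above can be sketched as follows; this is an illustrative
+approximation of QEMU's internal layout, not a normative wire format:
+
+.. code:: c
+
+   #include <stdint.h>
+
+   /* Sketch: one memory region descriptor; the region's backing fd
+    * is passed as ancillary data alongside the message. */
+   struct vhost_user_mem_region {
+       uint64_t guest_phys_addr;  /* guest physical address */
+       uint64_t memory_size;      /* region size in bytes */
+       uint64_t userspace_addr;   /* master's (QEMU's) virtual address */
+       uint64_t mmap_offset;      /* offset into the passed fd */
+   };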
+ +``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG`` + :id: 3 + :equivalent ioctl: N/A + :slave payload: vring area description + :master payload: N/A + + Sets host notifier for a specified queue. The queue index is + contained in the ``u64`` field of the vring area description. The + host notifier is described by the file descriptor (typically it's a + VFIO device fd) which is passed as ancillary data and the size + (which is mmap size and should be the same as host page size) and + offset (which is mmap offset) carried in the vring area + description. QEMU can mmap the file descriptor based on the size and + offset to get a memory range. Registering a host notifier means + mapping this memory range to the VM as the specified queue's notify + MMIO region. Slave sends this request to tell QEMU to de-register + the existing notifier if any and register the new notifier if the + request is sent with a file descriptor. + + This request should be sent only when + ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been + successfully negotiated. + +``VHOST_USER_SLAVE_VRING_CALL`` + :id: 4 + :equivalent ioctl: N/A + :slave payload: vring state description + :master payload: N/A + + When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol + feature has been successfully negotiated, this message may be + submitted by the slave to indicate that a buffer was used from + the vring instead of signalling this using the vring's call file + descriptor or having the master relying on polling. + + The state.num field is currently reserved and must be set to 0. + +``VHOST_USER_SLAVE_VRING_ERR`` + :id: 5 + :equivalent ioctl: N/A + :slave payload: vring state description + :master payload: N/A + + When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol + feature has been successfully negotiated, this message may be + submitted by the slave to indicate that an error occurred on the + specific vring, instead of signalling the error file descriptor + set by the master via ``VHOST_USER_SET_VRING_ERR``. + + The state.num field is currently reserved and must be set to 0. + +.. _reply_ack: + +VHOST_USER_PROTOCOL_F_REPLY_ACK +------------------------------- + +The original vhost-user specification only demands replies for certain +commands. This differs from the vhost protocol implementation where +commands are sent over an ``ioctl()`` call and block until the client +has completed. + +With this protocol extension negotiated, the sender (QEMU) can set the +``need_reply`` [Bit 3] flag to any command. This indicates that the +client MUST respond with a Payload ``VhostUserMsg`` indicating success +or failure. The payload should be set to zero on success or non-zero +on failure, unless the message already has an explicit reply body. + +The response payload gives QEMU a deterministic indication of the result +of the command. Today, QEMU is expected to terminate the main vhost-user +loop upon receiving such errors. In future, qemu could be taught to be more +resilient for selective requests. + +For the message types that already solicit a reply from the client, +the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit +being set brings no behavioural change. (See the Communication_ +section for details.) + +.. _backend_conventions: + +Backend program conventions +=========================== + +vhost-user backends can provide various devices & services and may +need to be configured manually depending on the use case. 
However, it
+is a good idea to follow the conventions listed here when
+possible. Users of the backends, such as QEMU or libvirt, can then rely
+on some common behaviour, avoiding heterogeneous configuration and
+management of the backend programs and facilitating interoperability.
+
+Each backend installed on a host system should come with at least one
+JSON file that conforms to the vhost-user.json schema. Each file
+informs the management applications about the backend type and binary
+location. In addition, it defines rules for management apps for
+picking the highest priority backend when multiple match the search
+criteria (see ``@VhostUserBackend`` documentation in the schema file).
+
+If the backend is not capable of enabling a requested feature on the
+host (such as 3D acceleration with virgl), or the initialization
+failed, the backend should fail to start early and exit with a
+non-zero status. It may also print a message to stderr for further
+details.
+
+The backend program must not daemonize itself, but it may be
+daemonized by the management layer. It may also have restricted
+access to the system.
+
+File descriptors 0, 1 and 2 will exist, and have regular
+stdin/stdout/stderr usage (they may have been redirected to /dev/null
+by the management layer, or to a log handler).
+
+The backend program must end (as quickly and cleanly as possible) when
+the SIGTERM signal is received. If it does not, the management layer
+may send it SIGKILL after a few seconds.
+
+The following command line options have an expected behaviour. They
+are mandatory, unless explicitly stated otherwise:
+
+--socket-path=PATH
+
+  This option specifies the location of the vhost-user Unix domain socket.
+  It is incompatible with --fd.
+
+--fd=FDNUM
+
+  When this argument is given, the backend program is started with the
+  vhost-user socket as file descriptor FDNUM. It is incompatible with
+  --socket-path.
+
+--print-capabilities
+
+  Output to stdout the backend capabilities in JSON format, and then
+  exit successfully. Other options and arguments should be ignored, and
+  the backend program should not perform its normal function. The
+  capabilities can be reported dynamically depending on the host
+  capabilities.
+
+The JSON output is described in the ``vhost-user.json`` schema, by
+``@VHostUserBackendCapabilities``. Example:
+
+.. code:: json
+
+  {
+    "type": "foo",
+    "features": [
+      "feature-a",
+      "feature-b"
+    ]
+  }
+
+vhost-user-input
+----------------
+
+Command line options:
+
+--evdev-path=PATH
+
+  Specify the Linux input device.
+
+  (optional)
+
+--no-grab
+
+  Do not request exclusive access to the input device.
+
+  (optional)
+
+vhost-user-gpu
+--------------
+
+Command line options:
+
+--render-node=PATH
+
+  Specify the GPU DRM render node.
+
+  (optional)
+
+--virgl
+
+  Enable virgl rendering support.
+
+  (optional)
+
+vhost-user-blk
+--------------
+
+Command line options:
+
+--blk-file=PATH
+
+  Specify block device or file path.
+
+  (optional)
+
+--read-only
+
+  Enable read-only.
+
+  (optional)
diff --git a/docs/interop/vhost-vdpa.rst b/docs/interop/vhost-vdpa.rst new file mode 100644 index 000000000..0c70ba01b --- /dev/null +++ b/docs/interop/vhost-vdpa.rst @@ -0,0 +1,17 @@
+=====================
+Vhost-vdpa Protocol
+=====================
+
+Introduction
+=============
+vDPA (virtual data path acceleration) devices use a datapath that
+complies with the virtio specifications together with a vendor-specific
+control path. vDPA devices can be either physically located on the
+hardware or emulated by software.
+
+This document describes the vDPA support in QEMU.
+
+The relevant kernel commit is:
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c8cf31885f69e86be0b5b9e6677a26797365e1d
+
+TODO: More information will be added later.
diff --git a/docs/interop/vnc-ledstate-Pseudo-encoding.txt b/docs/interop/vnc-ledstate-Pseudo-encoding.txt new file mode 100644 index 000000000..0f124f68b --- /dev/null +++ b/docs/interop/vnc-ledstate-Pseudo-encoding.txt @@ -0,0 +1,50 @@
+VNC LED state Pseudo-encoding
+=============================
+
+Introduction
+------------
+
+This document describes the pseudo-encoding of LED state for RFB, the
+protocol used in VNC, as per the reference link below:
+
+http://tigervnc.svn.sourceforge.net/viewvc/tigervnc/rfbproto/rfbproto.rst?content-type=text/plain
+
+When accessing a guest console through VNC, there might be a mismatch
+between the lock-key LEDs on the computer running the VNC client
+session and the current status of the lock keys on the guest machine.
+
+To solve this problem, an LED state pseudo-encoding extension is added
+to the VNC protocol to deal with setting the LED state.
+
+Pseudo-encoding
+---------------
+
+By requesting this pseudo-encoding, the client declares to the server
+that it supports the LED state extension to the protocol.
+
+The pseudo-encoding number for LED state is defined as:
+
+======= ===============================================================
+Number  Name
+======= ===============================================================
+-261    'LED state Pseudo-encoding'
+======= ===============================================================
+
+LED state Pseudo-encoding
+--------------------------
+
+The LED state pseudo-encoding consists of 3 bits; from left to right,
+the bits represent the Caps, Num, and Scroll lock keys respectively.
+'1' indicates that the LED should be on and '0' that it should be off.
+
+Some example encodings:
+
+======= ===============================================================
+Code    Description
+======= ===============================================================
+100     CapsLock is on, NumLock and ScrollLock are off
+010     NumLock is on, CapsLock and ScrollLock are off
+111     CapsLock, NumLock and ScrollLock are on
+======= ===============================================================
diff --git a/docs/memory-hotplug.txt b/docs/memory-hotplug.txt new file mode 100644 index 000000000..6aa5e17e2 --- /dev/null +++ b/docs/memory-hotplug.txt @@ -0,0 +1,93 @@
+QEMU memory hotplug
+===================
+
+This document explains how to use the memory hotplug feature in QEMU,
+which has been present since v2.1.0.
+
+Guest support is required for memory hotplug to work.
+
+Basic RAM hotplug
+-----------------
+
+In order to be able to hotplug memory, QEMU has to be told how many
+hotpluggable memory slots to create and the maximum amount of
+memory the guest can grow to. This is done at startup time by means of
+the -m command-line option, which has the following format:
+
+ -m [size=]megs[,slots=n,maxmem=size]
+
+Where,
+
+ - "megs" is the startup RAM. It is the RAM the guest will boot with
+ - "slots" is the number of hotpluggable memory slots
+ - "maxmem" is the maximum RAM size the guest can have
+
+For example, the following command-line:
+
+ qemu [...] -m 1G,slots=3,maxmem=4G
+
+Creates a guest with 1GB of memory and three hotpluggable memory slots.
+The hotpluggable memory slots are empty when the guest is booted, so all
+memory the guest will see after boot is 1GB. The maximum memory the
+guest can reach is 4GB. This means that three additional gigabytes can be
+hotplugged by using any combination of the available memory slots.
+
+Two monitor commands are used to hotplug memory:
+
+ - "object_add": creates a memory backend object
+ - "device_add": creates a front-end pc-dimm device and inserts it
+   into the first empty slot
+
+For example, the following commands add another 1GB to the guest
+discussed earlier:
+
+  (qemu) object_add memory-backend-ram,id=mem1,size=1G
+  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
+
+Using the file backend
+----------------------
+
+Besides basic RAM hotplug, QEMU also supports using files as a memory
+backend. This is useful for using hugetlbfs in Linux, which provides
+access to bigger page sizes.
+
+For example, assuming that the host has 1GB hugepages available in
+the /mnt/hugepages-1GB directory, a 1GB hugepage could be hotplugged
+into the guest from the previous section with the following commands:
+
+  (qemu) object_add memory-backend-file,id=mem1,size=1G,mem-path=/mnt/hugepages-1GB
+  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
+
+It's also possible to start a guest with memory cold-plugged into the
+hotpluggable memory slots. This might seem counterintuitive at first,
+but this allows for a lot of flexibility when using the file backend.
+
+In the following command-line example, a 7.25GB guest is created where
+6GB comes from regular RAM, 1GB comes from a single 1GB hugepage and
+256MB comes from 2MB pages. Also, the guest has additional memory slots
+available to hotplug more memory if needed:
+
+ qemu [...] -m 6GB,slots=4,maxmem=10G \
+    -object memory-backend-file,id=mem1,size=1G,mem-path=/mnt/hugepages-1G \
+    -device pc-dimm,id=dimm1,memdev=mem1 \
+    -object memory-backend-file,id=mem2,size=256M,mem-path=/mnt/hugepages-2MB \
+    -device pc-dimm,id=dimm2,memdev=mem2
+
+
+RAM hot-unplug
+---------------
+
+In order to be able to hot unplug a pc-dimm device, QEMU has to be told
+the ids of the pc-dimm device and its memory backend object. The ids
+were assigned when the memory was hotplugged.
+
+Two monitor commands are used to hot unplug memory:
+
+ - "device_del": deletes a front-end pc-dimm device
+ - "object_del": deletes a memory backend object
+
+For example, assuming that the pc-dimm device with id "dimm1" exists,
+and its memory backend is "mem1", the following commands remove it:
+
+  (qemu) device_del dimm1
+  (qemu) object_del mem1
diff --git a/docs/meson.build b/docs/meson.build new file mode 100644 index 000000000..27c6e156f --- /dev/null +++ b/docs/meson.build @@ -0,0 +1,97 @@
+if get_option('sphinx_build') == ''
+  sphinx_build = find_program(['sphinx-build-3', 'sphinx-build'],
+                              required: get_option('docs'))
+else
+  sphinx_build = find_program(get_option('sphinx_build'),
+                              required: get_option('docs'))
+endif
+
+# Check if tools are available to build documentation.
+build_docs = false
+if sphinx_build.found()
+  SPHINX_ARGS = ['env', 'CONFDIR=' + qemu_confdir, sphinx_build, '-q']
+  # If we're making warnings fatal, apply this to Sphinx runs as well
+  if get_option('werror')
+    SPHINX_ARGS += [ '-W' ]
+  endif
+
+  # This is a bit awkward but works: create a trivial document and
+  # try to run it with our configuration file (which enforces a
+  # version requirement). This will fail if sphinx-build is too old.
+ run_command('mkdir', ['-p', tmpdir / 'sphinx']) + run_command('touch', [tmpdir / 'sphinx/index.rst']) + sphinx_build_test_out = run_command(SPHINX_ARGS + [ + '-c', meson.current_source_dir(), + '-b', 'html', tmpdir / 'sphinx', + tmpdir / 'sphinx/out']) + build_docs = (sphinx_build_test_out.returncode() == 0) + + if not build_docs + warning('@0@: @1@'.format(sphinx_build.full_path(), sphinx_build_test_out.stderr())) + if get_option('docs').enabled() + error('Install a Python 3 version of python-sphinx and the readthedoc theme') + endif + endif +endif + +if build_docs + SPHINX_ARGS += ['-Dversion=' + meson.project_version(), '-Drelease=' + config_host['PKGVERSION']] + + have_ga = have_tools and config_host.has_key('CONFIG_GUEST_AGENT') + + man_pages = { + 'qemu-ga.8': (have_ga ? 'man8' : ''), + 'qemu-ga-ref.7': (have_ga ? 'man7' : ''), + 'qemu-qmp-ref.7': 'man7', + 'qemu-storage-daemon-qmp-ref.7': (have_tools ? 'man7' : ''), + 'qemu-img.1': (have_tools ? 'man1' : ''), + 'qemu-nbd.8': (have_tools ? 'man8' : ''), + 'qemu-pr-helper.8': (have_tools ? 'man8' : ''), + 'qemu-storage-daemon.1': (have_tools ? 'man1' : ''), + 'qemu-trace-stap.1': (stap.found() ? 'man1' : ''), + 'virtfs-proxy-helper.1': (have_virtfs_proxy_helper ? 'man1' : ''), + 'virtiofsd.1': (have_virtiofsd ? 'man1' : ''), + 'qemu.1': 'man1', + 'qemu-block-drivers.7': 'man7', + 'qemu-cpu-models.7': 'man7' + } + + sphinxdocs = [] + sphinxmans = [] + + private_dir = meson.current_build_dir() / 'manual.p' + output_dir = meson.current_build_dir() / 'manual' + input_dir = meson.current_source_dir() + + this_manual = custom_target('QEMU manual', + build_by_default: build_docs, + output: 'docs.stamp', + input: files('conf.py'), + depfile: 'docs.d', + command: [SPHINX_ARGS, '-Ddepfile=@DEPFILE@', + '-Ddepfile_stamp=@OUTPUT0@', + '-b', 'html', '-d', private_dir, + input_dir, output_dir]) + sphinxdocs += this_manual + install_subdir(output_dir, install_dir: qemu_docdir, strip_directory: true) + + these_man_pages = [] + install_dirs = [] + foreach page, section : man_pages + these_man_pages += page + install_dirs += section == '' ? false : get_option('mandir') / section + endforeach + + sphinxmans += custom_target('QEMU man pages', + build_by_default: build_docs, + output: these_man_pages, + input: this_manual, + install: build_docs, + install_dir: install_dirs, + command: [SPHINX_ARGS, '-b', 'man', '-d', private_dir, + input_dir, meson.current_build_dir()]) + + alias_target('sphinxdocs', sphinxdocs) + alias_target('html', sphinxdocs) + alias_target('man', sphinxmans) +endif diff --git a/docs/multi-thread-compression.txt b/docs/multi-thread-compression.txt new file mode 100644 index 000000000..bb88c6bdf --- /dev/null +++ b/docs/multi-thread-compression.txt @@ -0,0 +1,149 @@ +Use multiple thread (de)compression in live migration +===================================================== +Copyright (C) 2015 Intel Corporation +Author: Liang Li <liang.z.li@intel.com> + +This work is licensed under the terms of the GNU GPLv2 or later. See +the COPYING file in the top-level directory. + +Contents: +========= +* Introduction +* When to use +* Performance +* Usage +* TODO + +Introduction +============ +Instead of sending the guest memory directly, this solution will +compress the RAM page before sending; after receiving, the data will +be decompressed. 
Using compression in live migration can help
+to reduce the amount of data transferred by about 60%, which is very
+useful when bandwidth is limited; the total migration time can also be
+reduced by about 70% in a typical case. In addition, VM downtime can be
+reduced by about 50%. The benefit depends on the compressibility of the
+data in the VM.
+
+The process of compression will consume additional CPU cycles, and the
+extra CPU cycles will increase the migration time. On the other hand,
+the amount of data transferred will decrease; this factor can reduce
+the total migration time. If the compression is quick
+enough, then the total migration time can be reduced, and multiple
+thread compression can be used to accelerate the compression process.
+
+The decompression speed of zlib is at least 4 times as fast as its
+compression speed. If the source and destination CPUs have equal speed,
+keeping the compression thread count 4 times the decompression
+thread count avoids wasting resources.
+
+The compression level can be used to control the compression speed and
+the compression ratio. A higher compression ratio takes more time:
+level 0 stands for no compression, level 1 stands for the best
+compression speed, and level 9 stands for the best compression ratio.
+Users can select a level number between 0 and 9.
+
+
+When to use multiple thread compression in live migration
+==========================================================
+Compressing data consumes extra CPU cycles, so avoid using this
+feature on a system whose CPU is already heavily loaded. When the network
+bandwidth is very limited and the CPU resource is adequate, use of
+multiple thread compression will be very helpful. If both the CPU and
+the network bandwidth are adequate, use of multiple thread compression
+can still help to reduce the migration time.
+
+Performance
+===========
+Test environment:
+
+CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
+Socket Count: 2
+RAM: 128G
+NIC: Intel I350 (10/100/1000Mbps)
+Host OS: CentOS 7 64-bit
+Guest OS: RHEL 6.5 64-bit
+Parameter: qemu-system-x86_64 -accel kvm -smp 4 -m 4096
+           /share/ia32e_rhel6u5.qcow -monitor stdio
+
+No additional application was running on the guest during the
+following test.
+
+
+Speed limit: 1000Gb/s
+---------------------------------------------------------------
+                    | original | compress thread: 8
+                    |   way    | decompress thread: 2
+                    |          | compression level: 1
+---------------------------------------------------------------
+total time(msec):   |  3333    | 1833
+---------------------------------------------------------------
+downtime(msec):     |  100     | 27
+---------------------------------------------------------------
+transferred ram(kB):|  363536  | 107819
+---------------------------------------------------------------
+throughput(mbps):   |  893.73  | 482.22
+---------------------------------------------------------------
+total ram(kB):      |  4211524 | 4211524
+---------------------------------------------------------------
+
+In the second test, an application runs on the guest that periodically
+writes random numbers to RAM block areas.
+
+Speed limit: 1000Gb/s
+---------------------------------------------------------------
+                    | original | compress thread: 8
+                    |   way    | decompress thread: 2
+                    |          | compression level: 1
+---------------------------------------------------------------
+total time(msec):   |  37369   | 15989
+---------------------------------------------------------------
+downtime(msec):     |  337     | 173
+---------------------------------------------------------------
+transferred ram(kB):|  4274143 | 1699824
+---------------------------------------------------------------
+throughput(mbps):   |  936.99  | 870.95
+---------------------------------------------------------------
+total ram(kB):      |  4211524 | 4211524
+---------------------------------------------------------------
+
+Usage
+=====
+1. Verify that both the source and destination QEMU are able
+to support multiple thread compression migration:
+    {qemu} info migrate_capabilities
+    {qemu} ... compress: off ...
+
+2. Activate compression on the source:
+    {qemu} migrate_set_capability compress on
+
+3. Set the compression thread count on the source:
+    {qemu} migrate_set_parameter compress_threads 12
+
+4. Set the compression level on the source:
+    {qemu} migrate_set_parameter compress_level 1
+
+5. Set the decompression thread count on the destination:
+    {qemu} migrate_set_parameter decompress_threads 3
+
+6. Start outgoing migration:
+    {qemu} migrate -d tcp:destination.host:4444
+    {qemu} info migrate
+    Capabilities: ... compress: on
+    ...
+
+The following are the default settings:
+    compress: off
+    compress_threads: 8
+    decompress_threads: 2
+    compress_level: 1 (which means best speed)
+
+So, only the first two steps are required to use multiple
+thread compression in migration. You can tune the other parameters if
+the default settings are not appropriate.
+
+TODO
+====
+Faster (de)compression methods such as LZ4 and QuickLZ could help
+to reduce the CPU consumption when doing (de)compression. With such
+faster (de)compression methods, fewer (de)compression threads
+are needed when doing the migration.
diff --git a/docs/multiseat.txt b/docs/multiseat.txt new file mode 100644 index 000000000..2b297e979 --- /dev/null +++ b/docs/multiseat.txt @@ -0,0 +1,145 @@
+
+multiseat howto (with some multihead coverage)
+==============================================
+
+host devices
+------------
+
+First you must compile qemu with a user interface supporting
+multihead/multiseat and input event routing. Right now this
+list includes sdl2, gtk (both 2+3) and vnc:
+
+  ./configure --enable-sdl
+
+or
+
+  ./configure --enable-gtk
+
+
+Next put together the qemu command line (sdl/gtk):
+
+qemu -accel kvm -usb $memory $disk $whatever \
+  -display [ sdl | gtk ] \
+  -vga std \
+  -device usb-tablet
+
+That is it for the first seat, which will use the standard vga, the
+standard ps/2 keyboard (implicitly there) and the usb-tablet. Now the
+additional switches for the second seat:
+
+  -device pci-bridge,addr=12.0,chassis_nr=2,id=head.2 \
+  -device secondary-vga,bus=head.2,addr=02.0,id=video.2 \
+  -device nec-usb-xhci,bus=head.2,addr=0f.0,id=usb.2 \
+  -device usb-kbd,bus=usb.2.0,port=1,display=video.2 \
+  -device usb-tablet,bus=usb.2.0,port=2,display=video.2
+
+This places a pci bridge in slot 12, and connects a display adapter and
+xhci (usb) controller to the bridge. Then it adds a usb keyboard and
+usb mouse, both connected to the xhci and linked to the display.
+
+The "display=video.2" property sets up the input routing.
Any input coming from +the window which belongs to the video.2 display adapter will be routed +to these input devices. + +Starting with qemu 2.4 and linux kernel 4.1 you can also use virtio +for the input devices, using this ... + + -device pci-bridge,addr=12.0,chassis_nr=2,id=head.2 \ + -device secondary-vga,bus=head.2,addr=02.0,id=video.2 \ + -device virtio-keyboard-pci,bus=head.2,addr=03.0,display=video.2 \ + -device virtio-tablet-pci,bus=head.2,addr=03.0,display=video.2 + +... instead of xhci and usb hid devices. + +host ui +------- + +The sdl2 ui will start up with two windows, one for each display +device. The gtk ui will start with a single window and each display +in a separate tab. You can either simply switch tabs to switch heads, +or use the "View / Detach tab" menu item to move one of the displays +to its own window so you can see both display devices side-by-side. + +For vnc some additional configuration on the command line is needed. +We'll create two vnc server instances, and bind the second one to the +second seat, similar to input devices: + + -display vnc=:1,id=primary \ + -display vnc=:2,id=secondary,display=video.2 + +Connecting to vnc display :1 gives you access to the first seat, and +likewise connecting to vnc display :2 shows the second seat. + +Note on spice: Spice handles multihead just fine. But it can't do +multiseat. For tablet events the event source is sent to the spice +agent. But qemu can't figure it, so it can't do input routing. +Fixing this needs a new or extended input interface between +libspice-server and qemu. For keyboard events it is even worse: The +event source isn't included in the spice protocol, so the wire +protocol must be extended to support this. + + +guest side +---------- + +You need a pretty recent linux guest. systemd with loginctl. kernel +3.14+ with CONFIG_DRM_BOCHS enabled. Fedora 20 will do. Must be +fully updated for the new kernel though, i.e. the live iso doesn't cut +it. + +Now we'll have to configure the guest. Boot and login. "lspci -vt" +should list the pci bridge with the display adapter and usb controller: + + [root@fedora ~]# lspci -vt + -[0000:00]-+-00.0 Intel Corporation 440FX - 82441FX PMC [Natoma] + [ ... ] + \-12.0-[01]--+-02.0 Device 1234:1111 + \-0f.0 NEC Corporation USB 3.0 Host Controller + +Good. Now lets tell the system that the pci bridge and all devices +below it belong to a separate seat by dropping a file into +/etc/udev/rules.d: + + [root@fedora ~]# cat /etc/udev/rules.d/70-qemu-autoseat.rules + SUBSYSTEMS=="pci", DEVPATH=="*/0000:00:12.0", TAG+="seat", ENV{ID_AUTOSEAT}="1" + +Reboot. System should come up with two seats. With loginctl you can +check the configuration: + + [root@fedora ~]# loginctl list-seats + SEAT + seat0 + seat-pci-pci-0000_00_12_0 + + 2 seats listed. + +You can use "loginctl seat-status seat-pci-pci-0000_00_12_0" to list +the devices attached to the seat. + +Background info is here: + http://www.freedesktop.org/wiki/Software/systemd/multiseat/ + + +guest side with pci-bridge-seat +------------------------------- + +QEMU version 2.4 and newer has a new pci-bridge-seat device which +can be used instead of pci-bridge. Just swap the device name in the +qemu command line above. The only difference between the two devices +is the pci id. 
We can match the pci id instead of the device path +with a nice generic rule now, which simplifies the guest +configuration: + + [root@fedora ~]# cat /etc/udev/rules.d/70-qemu-pci-bridge-seat.rules + SUBSYSTEM=="pci", ATTR{vendor}=="0x1b36", ATTR{device}=="0x000a", \ + TAG+="seat", ENV{ID_AUTOSEAT}="1" + +Patch with this rule has been submitted to upstream udev/systemd, was +accepted and should be included in the next systemd release (222). +So, if your guest has this or a newer version, multiseat will work just +fine without any manual guest configuration. + +Enjoy! + +-- +Gerd Hoffmann <kraxel@redhat.com> diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt new file mode 100644 index 000000000..fd7773dc5 --- /dev/null +++ b/docs/nvdimm.txt @@ -0,0 +1,265 @@ +QEMU Virtual NVDIMM +=================== + +This document explains the usage of virtual NVDIMM (vNVDIMM) feature +which is available since QEMU v2.6.0. + +The current QEMU only implements the persistent memory mode of vNVDIMM +device and not the block window mode. + +Basic Usage +----------- + +The storage of a vNVDIMM device in QEMU is provided by the memory +backend (i.e. memory-backend-file and memory-backend-ram). A simple +way to create a vNVDIMM device at startup time is done via the +following command line options: + + -machine pc,nvdimm=on + -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE + -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off + -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off + +Where, + + - the "nvdimm" machine option enables vNVDIMM feature. + + - "slots=$N" should be equal to or larger than the total amount of + normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here. + + - "maxmem=$MAX_SIZE" should be equal to or larger than the total size + of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be + >= $RAM_SIZE + $NVDIMM_SIZE here. + + - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH, + size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size + $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go + to the file $PATH. + + "share=on/off" controls the visibility of guest writes. If + "share=on", then guest writes will be applied to the backend + file. If another guest uses the same backend file with option + "share=on", then above writes will be visible to it as well. If + "share=off", then guest writes won't be applied to the backend + file and thus will be invisible to other guests. + + "readonly=on/off" controls whether the file $PATH is opened read-only or + read/write (default). + + - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write + virtual NVDIMM device whose storage is provided by above memory backend + device. + + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept + persistent writes. Linux guest drivers set the device to read-only when this + bit is present. Set unarmed to on when the memdev has readonly=on. + +Multiple vNVDIMM devices can be created if multiple pairs of "-object" +and "-device" are provided. + +For above command line options, if the guest OS has the proper NVDIMM +driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to +detect a NVDIMM device which is in the persistent memory mode and whose +size is $NVDIMM_SIZE. + +Note: + +1. 
Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
+   backend file size is not equal to the size given by the "size" option,
+   QEMU will truncate the backend file by ftruncate(2), which will
+   corrupt the existing data in the backend file, especially for the
+   shrink case.
+
+   QEMU v2.8.0 and later check the backend file size against the "size"
+   option. If they do not match, QEMU will report errors and abort in
+   order to avoid data corruption.
+
+2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
+   option of memory-backend-file, e.g. 4KB alignment on x86. However,
+   QEMU v2.7.0 adds an additional alignment requirement, which may
+   require a larger value than the basic one, e.g. 2MB on x86. This
+   change breaks the usage of memory-backend-file that only satisfies
+   the basic alignment.
+
+   QEMU v2.8.0 and later remove the additional alignment on non-s390x
+   architectures, so memory-backend-file setups broken by it can work
+   again.
+
+Label
+-----
+
+QEMU v2.7.0 and later implement label support for vNVDIMM devices.
+To enable labels on vNVDIMM devices, users can simply add the
+"label-size=$SZ" option to "-device nvdimm", e.g.
+
+ -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
+
+Note:
+
+1. The minimal label size is 128KB.
+
+2. QEMU v2.7.0 and later store labels at the end of backend storage.
+   If a memory backend file, which was previously used as the backend
+   of a vNVDIMM device without labels, is now used for a vNVDIMM
+   device with labels, the data in the label area at the end of the
+   file will be inaccessible to the guest. If any useful data (e.g. the
+   metadata of the file system) was stored there, the latter usage
+   may result in guest data corruption (e.g. breakage of the guest file
+   system).
+
+Hotplug
+-------
+
+QEMU v2.8.0 and later implement hotplug support for vNVDIMM
+devices. Similarly to RAM hotplug, vNVDIMM hotplug is
+accomplished by the two monitor commands "object_add" and "device_add".
+
+For example, the following commands add another 4GB vNVDIMM device to
+the guest:
+
+ (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
+ (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2
+
+Note:
+
+1. Each hotplugged vNVDIMM device consumes one memory slot. Users
+   should always ensure the memory option "-m ...,slots=N" specifies
+   a sufficient number of slots, i.e.
+     N >= number of RAM devices +
+          number of statically plugged vNVDIMM devices +
+          number of hotplugged vNVDIMM devices
+
+2. A similar requirement applies to the memory option "-m ...,maxmem=M",
+   i.e.
+     M >= size of RAM devices +
+          size of statically plugged vNVDIMM devices +
+          size of hotplugged vNVDIMM devices
+
+Alignment
+---------
+
+QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
+address to the page size (getpagesize(2)) by default. However, some
+types of backends may require an alignment different than the page
+size. In that case, QEMU v2.12.0 and later provide the 'align' option
+to memory-backend-file to allow users to specify the proper alignment.
+For device dax (e.g., /dev/dax0.0), this alignment needs to match the
+alignment requirement of the device dax. The NUM in the 'align=NUM'
+option must be larger than or equal to the 'align' of the device dax.
+We can use one of the following commands to show the 'align' of a
+device dax:
+
+ ndctl list -X
+ daxctl list -R
+
+In order to get the proper 'align' of a device dax, you need to install
+the 'libdaxctl' library.
+
+For example, if a device dax requires 2 MB alignment, we can use the
+following QEMU command line options to use it (/dev/dax0.0) as the
+backend of a vNVDIMM:
+
+ -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
+ -device nvdimm,id=nvdimm1,memdev=mem1
+
+Guest Data Persistence
+----------------------
+
+Though QEMU supports multiple types of vNVDIMM backends on Linux,
+the only backends that can guarantee guest write persistence are:
+
+A. DAX device (e.g., /dev/dax0.0), or
+B. DAX file (mounted with the dax option)
+
+When using B (a file supporting direct mapping of persistent memory)
+as a backend, write persistence is guaranteed if the host kernel has
+support for the MAP_SYNC flag in the mmap system call (available
+since Linux 4.15 and on certain distro kernels) and additionally
+both the 'pmem' and 'share' flags are set to 'on' on the backend.
+
+If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
+is not set, if the backend file does not support DAX, or if MAP_SYNC
+is not supported by the host kernel, write persistence is not
+guaranteed after a system crash. For compatibility reasons, these
+conditions are ignored if not satisfied. Currently, no way is
+provided to test for them.
+For more details, please refer to the mmap(2) man page:
+http://man7.org/linux/man-pages/man2/mmap.2.html.
+
+When using other types of backends, it's suggested to set the 'unarmed'
+option of '-device nvdimm' to 'on', which sets the unarmed flag of the
+guest NVDIMM region mapping structure. This unarmed flag indicates to
+guest software that this vNVDIMM device contains a region that cannot
+accept persistent writes. As a result, the guest Linux NVDIMM driver,
+for example, marks such a vNVDIMM device as read-only.
+
+Backend File Setup Example
+--------------------------
+
+Here are two examples showing how to set up these persistent backends
+on Linux using the tool ndctl [3].
+
+A. DAX device
+
+Use the following command to set up /dev/dax0.0 so that the entirety of
+namespace0.0 can be exposed as an emulated NVDIMM to the guest:
+
+ ndctl create-namespace -f -e namespace0.0 -m devdax
+
+The /dev/dax0.0 can then be used directly in the "mem-path" option.
+
+B. DAX file
+
+Individual files on a DAX host file system can be exposed as emulated
+NVDIMMs. First an fsdax block device is created, partitioned, and then
+mounted with the "dax" mount option:
+
+ ndctl create-namespace -f -e namespace0.0 -m fsdax
+ (partition /dev/pmem0 with name pmem0p1)
+ mount -o dax /dev/pmem0p1 /mnt
+ (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
+  in /mnt)
+
+Then the new file in /mnt can be used in the "mem-path" option.
+
+NVDIMM Persistence
+------------------
+
+ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
+which allows the platform to communicate what features it supports related to
+NVDIMM data persistence. Users can provide a persistence value to a guest via
+the optional "nvdimm-persistence" machine command line option:
+
+ -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu
+
+There are currently two valid values for this option:
+
+"mem-ctrl" - The platform supports flushing dirty data from the memory
+             controller to the NVDIMMs in the event of power loss.
+
+"cpu" - The platform supports flushing dirty data from the CPU cache to
+        the NVDIMMs in the event of power loss. This implies that the
+        platform also supports flushing dirty data through the memory
+        controller on power loss.
+ +If the vNVDIMM backend is in host persistent memory that can be accessed in +SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested to set +the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU +is built with libpmem [2] support (configured with --enable-libpmem), QEMU +will take necessary operations to guarantee the persistence of its own writes +to the vNVDIMM backend(e.g., in vNVDIMM label emulation and live migration). +If 'pmem' is 'on' while there is no libpmem support, qemu will exit and report +a "lack of libpmem support" message to ensure the persistence is available. +For example, if we want to ensure the persistence for some backend file, +use the QEMU command line: + + -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on + +References +---------- + +[1] NVM Programming Model (NPM) + Version 1.2 + https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf +[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page: + http://pmem.io/pmdk/ +[3] ndctl-create-namespace - provision or reconfigure a namespace + http://pmem.io/ndctl/ndctl-create-namespace.html diff --git a/docs/papr-pef.txt b/docs/papr-pef.txt new file mode 100644 index 000000000..72550e9bf --- /dev/null +++ b/docs/papr-pef.txt @@ -0,0 +1,30 @@ +POWER (PAPR) Protected Execution Facility (PEF) +=============================================== + +Protected Execution Facility (PEF), also known as Secure Guest support +is a feature found on IBM POWER9 and POWER10 processors. + +If a suitable firmware including an Ultravisor is installed, it adds +an extra memory protection mode to the CPU. The ultravisor manages a +pool of secure memory which cannot be accessed by the hypervisor. + +When this feature is enabled in QEMU, a guest can use ultracalls to +enter "secure mode". This transfers most of its memory to secure +memory, where it cannot be eavesdropped by a compromised hypervisor. + +Launching +--------- + +To launch a guest which will be permitted to enter PEF secure mode: + +# ${QEMU} \ + -object pef-guest,id=pef0 \ + -machine confidential-guest-support=pef0 \ + ... + +Live Migration +---------------- + +Live migration is not yet implemented for PEF guests. For +consistency, we currently prevent migration if the PEF feature is +enabled, whether or not the guest has actually entered secure mode. diff --git a/docs/pci_expander_bridge.txt b/docs/pci_expander_bridge.txt new file mode 100644 index 000000000..36750273b --- /dev/null +++ b/docs/pci_expander_bridge.txt @@ -0,0 +1,58 @@ +PCI EXPANDER BRIDGE (PXB) +========================= + +Description +=========== +PXB is a "light-weight" host bridge in the same PCI domain +as the main host bridge whose purpose is to enable +the main host bridge to support multiple PCI root buses. +It is implemented only for i440fx and can be placed only +on bus 0 (pci.0). + +As opposed to PCI-2-PCI bridge's secondary bus, PXB's bus +is a primary bus and can be associated with a NUMA node +(different from the main host bridge) allowing the guest OS +to recognize the proximity of a pass-through device to +other resources as RAM and CPUs. 
+
+Usage
+=====
+A detailed command line would be:
+
+[qemu-bin + storage options]
+-m 2G
+-object memory-backend-ram,size=1024M,policy=bind,host-nodes=0,id=ram-node0 -numa node,nodeid=0,cpus=0,memdev=ram-node0
+-object memory-backend-ram,size=1024M,policy=bind,host-nodes=1,id=ram-node1 -numa node,nodeid=1,cpus=1,memdev=ram-node1
+-device pxb,id=bridge1,bus=pci.0,numa_node=1,bus_nr=4 -netdev user,id=nd -device e1000,bus=bridge1,addr=0x4,netdev=nd
+-device pxb,id=bridge2,bus=pci.0,numa_node=0,bus_nr=8 -device e1000,bus=bridge2,addr=0x3
+-device pxb,id=bridge3,bus=pci.0,bus_nr=40 -drive if=none,id=drive0,file=[img] -device virtio-blk-pci,drive=drive0,scsi=off,bus=bridge3,addr=1
+
+Here you have:
+ - 2 NUMA nodes for the guest, 0 and 1. (Both are mapped to the same
+   NUMA node on the host, but you can and should put them in different
+   host NUMA nodes.)
+ - a pxb host bridge attached to NUMA node 1 with an e1000 behind it
+ - a pxb host bridge attached to NUMA node 0 with an e1000 behind it
+ - a pxb host bridge not attached to any NUMA node, with a hard drive
+   behind it.
+
+Limitations
+===========
+Please observe that we specified the bus "pci.0" for the second and third pxb.
+This is because, when no bus is given, another pxb could be selected by QEMU
+as the default bus; however, PXBs can be placed only under the root bus.
+
+Implementation
+==============
+The PXB is composed of:
+- HostBridge (TYPE_PXB_HOST)
+  The host bridge allows registering and querying the PXB's PCI root
+  bus in QEMU.
+- PXBDev (TYPE_PXB_DEVICE)
+  A regular PCI Device that resides on the piix host-bridge bus; its
+  bus uses the same PCI domain.
+  However, the bus behind it is exposed through ACPI as a primary PCI
+  bus and starts a new PCI hierarchy.
+  The interrupts from devices behind the PXB are routed through this
+  device the same as if it were a PCI-2-PCI bridge. The _PRT follows
+  the i440fx model.
+- PCIBridgeDev (TYPE_PCI_BRIDGE_DEV)
+  Created automatically as part of the init sequence.
+  When adding a device to the PXB it is attached to the bridge for two
+  reasons:
+  - Using the bridge enables hotplug support
+  - All the devices behind the bridge will use the bridge's IO/MEM
+    windows, compacting the PCI address space.
+
diff --git a/docs/pcie.txt b/docs/pcie.txt new file mode 100644 index 000000000..89e350207 --- /dev/null +++ b/docs/pcie.txt @@ -0,0 +1,318 @@
+PCI EXPRESS GUIDELINES
+======================
+
+1. Introduction
+================
+This document proposes best practices on how to use PCI Express (PCIe) / PCI
+devices in PCI Express based machines and explains the reasoning behind
+them.
+
+Note that the PCIe features are available only when using the 'q35'
+machine type on x86 architecture and the 'virt' machine type on AArch64.
+Other machine types do not use PCIe at this time.
+
+The following presentations accompany this document:
+ (1) Q35 overview.
+     https://wiki.qemu.org/images/4/4e/Q35.pdf
+ (2) A comparison between PCI and PCI Express technologies.
+     https://wiki.qemu.org/images/f/f6/PCIvsPCIe.pdf
+
+Note: The usage examples are not intended to replace the full
+documentation; please use QEMU help to retrieve all options.
+
+2. Device placement strategy
+============================
+QEMU does not have a clear socket-device matching mechanism
+and allows any PCI/PCI Express device to be plugged into any
+PCI/PCI Express slot.
+Plugging a PCI device into a PCI Express slot might not always work and
+is odd anyway, since it cannot be done on "bare metal".
+Plugging a PCI Express device into a PCI slot will hide the Extended +Configuration Space thus is also not recommended. + +The recommendation is to separate the PCI Express and PCI hierarchies. +PCI Express devices should be plugged only into PCI Express Root Ports and +PCI Express Downstream ports. + +2.1 Root Bus (pcie.0) +===================== +Place only the following kinds of devices directly on the Root Complex: + (1) PCI Devices (e.g. network card, graphics card, IDE controller), + not controllers. Place only legacy PCI devices on + the Root Complex. These will be considered Integrated Endpoints. + Note: Integrated Endpoints are not hot-pluggable. + + Although the PCI Express spec does not forbid PCI Express devices as + Integrated Endpoints, existing hardware mostly integrates legacy PCI + devices with the Root Complex. Guest OSes are suspected to behave + strangely when PCI Express devices are integrated + with the Root Complex. + + (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI Express + hierarchies. + + (3) PCI Express to PCI Bridge (pcie-pci-bridge), for starting legacy PCI + hierarchies. + + (4) Extra Root Complexes (pxb-pcie), if multiple PCI Express Root Buses + are needed. + + pcie.0 bus + ---------------------------------------------------------------------------- + | | | | + ----------- ------------------ ------------------- -------------- + | PCI Dev | | PCIe Root Port | | PCIe-PCI Bridge | | pxb-pcie | + ----------- ------------------ ------------------- -------------- + +2.1.1 To plug a device into pcie.0 as a Root Complex Integrated Endpoint use: + -device <dev>[,bus=pcie.0] +2.1.2 To expose a new PCI Express Root Bus use: + -device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z] + PCI Express Root Ports and PCI Express to PCI bridges can be + connected to the pcie.1 bus: + -device ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z] \ + -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1 + + +2.2 PCI Express only hierarchy +============================== +Always use PCI Express Root Ports to start PCI Express hierarchies. + +A PCI Express Root bus supports up to 32 devices. Since each +PCI Express Root Port is a function and a multi-function +device may support up to 8 functions, the maximum possible +number of PCI Express Root Ports per PCI Express Root Bus is 256. + +Prefer grouping PCI Express Root Ports into multi-function devices +to keep a simple flat hierarchy that is enough for most scenarios. +Only use PCI Express Switches (x3130-upstream, xio3130-downstream) +if there is no more room for PCI Express Root Ports. +Please see section 4. for further justifications. + +Plug only PCI Express devices into PCI Express Ports. 
+ + + pcie.0 bus + ---------------------------------------------------------------------------------- + | | | + ------------- ------------- ------------- + | Root Port | | Root Port | | Root Port | + ------------ ------------- ------------- + | -------------------------|------------------------ + ------------ | ----------------- | + | PCIe Dev | | PCI Express | Upstream Port | | + ------------ | Switch ----------------- | + | | | | + | ------------------- ------------------- | + | | Downstream Port | | Downstream Port | | + | ------------------- ------------------- | + -------------|-----------------------|------------ + ------------ + | PCIe Dev | + ------------ + +2.2.1 Plugging a PCI Express device into a PCI Express Root Port: + -device ioh3420,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z] \ + -device <dev>,bus=root_port1 +2.2.2 Using multi-function PCI Express Root Ports: + -device ioh3420,id=root_port1,multifunction=on,chassis=x,addr=z.0[,slot=y][,bus=pcie.0] \ + -device ioh3420,id=root_port2,chassis=x1,addr=z.1[,slot=y1][,bus=pcie.0] \ + -device ioh3420,id=root_port3,chassis=x2,addr=z.2[,slot=y2][,bus=pcie.0] \ +2.2.3 Plugging a PCI Express device into a Switch: + -device ioh3420,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z] \ + -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x] \ + -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1,slot=y1[,addr=z1]] \ + -device <dev>,bus=downstream_port1 + +Notes: + - (slot, chassis) pair is mandatory and must be unique for each + PCI Express Root Port. slot defaults to 0 when not specified. + - 'addr' parameter can be 0 for all the examples above. + + +2.3 PCI only hierarchy +====================== +Legacy PCI devices can be plugged into pcie.0 as Integrated Endpoints, +but, as mentioned in section 5, doing so means the legacy PCI +device in question will be incapable of hot-unplugging. +Besides that use PCI Express to PCI Bridges (pcie-pci-bridge) in +combination with PCI-PCI Bridges (pci-bridge) to start PCI hierarchies. + +Prefer flat hierarchies. For most scenarios a single PCI Express to PCI Bridge +(having 32 slots) and several PCI-PCI Bridges attached to it +(each supporting also 32 slots) will support hundreds of legacy devices. +The recommendation is to populate one PCI-PCI Bridge under the +PCI Express to PCI Bridge until is full and then plug a new PCI-PCI Bridge... + + pcie.0 bus + ---------------------------------------------- + | | + ----------- ------------------- + | PCI Dev | | PCIe-PCI Bridge | + ----------- ------------------- + | | + ------------------ ------------------ + | PCI-PCI Bridge | | PCI-PCI Bridge | + ------------------ ------------------ + | | + ----------- ----------- + | PCI Dev | | PCI Dev | + ----------- ----------- + +2.3.1 To plug a PCI device into pcie.0 as an Integrated Endpoint use: + -device <dev>[,bus=pcie.0] +2.3.2 Plugging a PCI device into a PCI-PCI Bridge: + -device pcie-pci-bridge,id=pcie_pci_bridge1[,bus=pcie.0] \ + -device pci-bridge,id=pci_bridge1,bus=pcie_pci_bridge1[,chassis_nr=x][,addr=y] \ + -device <dev>,bus=pci_bridge1[,addr=x] + Note that 'addr' cannot be 0 unless shpc=off parameter is passed to + the PCI Bridge/PCI Express to PCI Bridge. + +3. IO space issues +=================== +The PCI Express Root Ports and PCI Express Downstream ports are seen by +Firmware/Guest OS as PCI-PCI Bridges. 
+
+3. IO space issues
+===================
+The PCI Express Root Ports and PCI Express Downstream Ports are seen by
+Firmware/Guest OS as PCI-PCI Bridges. As required by the PCI spec, each
+such Port should have a 4K IO range reserved for it, even though only one
+(multifunction) device can be plugged into each Port. This results in
+poor IO space utilization.
+
+The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
+by not allocating IO space for each PCI Express Root Port / PCI Express
+Downstream Port if:
+    (1) the port is empty, or
+    (2) the device behind the port has no IO BARs.
+
+IO space is very limited: 65536 byte-wide IO ports in total, and it may
+even be fragmented by fixed IO ports owned by platform devices, resulting
+in at most 10 PCI Express Root Ports or PCI Express Downstream Ports per
+system if devices with IO BARs are used in the PCI Express hierarchy.
+Using the proposed device placement strategy solves this issue by using
+only PCI Express devices within the PCI Express hierarchy.
+
+The PCI Express spec requires that PCI Express devices work properly
+without using IO ports. The PCI hierarchy has no such limitation.
+
+
+4. Bus number issues
+======================
+Each PCI domain can have only up to 256 buses, and the QEMU PCI Express
+machines do not support multiple PCI domains even if extra Root
+Complexes (pxb-pcie) are used.
+
+Each element of the PCI Express hierarchy (Root Complexes,
+PCI Express Root Ports, PCI Express Downstream/Upstream Ports)
+uses one bus number. Since only one (multifunction) device
+can be attached to a PCI Express Root Port or PCI Express Downstream
+Port, it is advised to plan in advance for the expected number of
+devices to prevent bus number starvation.
+
+Avoiding PCI Express Switches (and thereby striving for a 'flatter' PCI
+Express hierarchy) enables the hierarchy to not spend bus numbers on
+Upstream Ports.
+
+The bus_nr properties of the pxb-pcie devices partition the 0..255 bus
+number space. All bus numbers assigned to the buses recursively behind a
+given pxb-pcie device's root bus must fit between the bus_nr property of
+that pxb-pcie device, and the lowest of the higher bus_nr properties
+that the command line sets for other pxb-pcie devices.
+
+
+5. Hot-plug
+============
+The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie
+devices) do not support hot-plug, so any devices plugged into Root
+Complexes cannot be hot-plugged/hot-unplugged:
+    (1) PCI Express Integrated Endpoints
+    (2) PCI Express Root Ports
+    (3) PCI Express to PCI Bridges
+    (4) pxb-pcie
+
+Be aware that PCI Express Downstream Ports can't be hot-plugged into
+an existing PCI Express Upstream Port.
+
+PCI devices can be hot-plugged into PCI Express to PCI and PCI-PCI
+Bridges. PCI hot-plug into a PCI-PCI bridge is ACPI-based, whereas
+hot-plug into a PCI Express to PCI bridge is SHPC-based. Both can work
+side by side with PCI Express native hot-plug.
+
+PCI Express devices can be natively hot-plugged/hot-unplugged into/from
+PCI Express Root Ports (and PCI Express Downstream Ports).
+
+5.1 Planning for hot-plug:
+    (1) PCI hierarchy
+        Leave enough PCI-PCI Bridge slots empty or add one
+        or more empty PCI-PCI Bridges to the PCI Express to PCI Bridge.
+
+        For each such PCI-PCI Bridge the Guest Firmware is expected to
+        reserve 4K IO space and a 2M MMIO range to be used for all devices
+        behind it. A dedicated PCI capability exists for this purpose;
+        see pcie_pci_bridge.txt.
+
+        Because of the hard IO limit of around 10 PCI Bridges (~ 40K space)
+        per system, don't use more than 9 PCI-PCI Bridges, leaving 4K for
+        the Integrated Endpoints. (The PCI Express hierarchy needs no IO
+        space.)
+
+    (2) PCI Express hierarchy:
+        Leave enough PCI Express Root Ports empty. Use multifunction
+        PCI Express Root Ports (up to 8 ports per pcie.0 slot)
+        on the Root Complex(es) to keep the
+        hierarchy as flat as possible, thereby saving PCI bus numbers.
+        Don't use PCI Express Switches if you don't have to; each one
+        of those uses an extra PCI bus (for its Upstream Port)
+        that could be put to better use with another Root Port or
+        Downstream Port, which may come in handy for hot-plugging another
+        device.
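+
+For example, to prepare for hot-plug one can cold-plug a couple of spare
+Root Ports up front (a sketch; IDs and chassis/slot values are arbitrary):
+          -device ioh3420,id=spare_port1,chassis=8,slot=8 \
+          -device ioh3420,id=spare_port2,chassis=9,slot=9
+A PCI Express device can later be hot-plugged into either port, as shown
+in the example below.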
+
+
+5.2 Hot-plug example:
+Using HMP: (add -monitor stdio to the QEMU command line)
+  device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id/>
+
+
+6. Device assignment
+====================
+Host devices are mostly PCI Express and should be plugged only into
+PCI Express Root Ports or PCI Express Downstream Ports.
+PCI-PCI Bridge slots can be used for legacy PCI host devices.
+
+6.1 How to detect if a device is PCI Express:
+  > lspci -s 03:00.0 -v (as root)
+
+    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
+    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
+    Flags: bus master, fast devsel, latency 0, IRQ 50
+    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
+    Capabilities: [c8] Power Management version 3
+    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
+    Capabilities: [40] Express Endpoint, MSI 00
+                       ^^^^^^^^^^^^^^^^^^^^^^^
+    Capabilities: [100] Advanced Error Reporting
+    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
+    Capabilities: [14c] Latency Tolerance Reporting
+    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014
+
+If you can see the "Express Endpoint" capability in the
+output, then the device is indeed PCI Express.
+
+
+7. Virtio devices
+=================
+Virtio devices plugged into the PCI hierarchy or as Integrated Endpoints
+will remain PCI and have transitional behaviour by default.
+Transitional virtio devices work in both IO and MMIO modes depending on
+guest support. The guest firmware will assign both IO and MMIO resources
+to transitional virtio devices.
+
+Virtio devices plugged into PCI Express ports are PCI Express devices and
+have "1.0" behavior by default, without IO support.
+In both cases the disable-legacy and disable-modern properties can be
+used to override the behaviour.
+
+Note that setting disable-legacy=off will enable legacy mode (i.e. legacy
+behavior) for PCI Express virtio devices, causing them to
+require IO space, which, given the limited available IO space, may quickly
+lead to resource exhaustion, and is therefore strongly discouraged.
+
+
+8. Conclusion
+==============
+The proposal offers a usage model that is easy to understand and follow
+and at the same time overcomes the PCI Express architecture limitations.
diff --git a/docs/pcie_pci_bridge.txt b/docs/pcie_pci_bridge.txt
new file mode 100644
index 000000000..1aa08fc5f
--- /dev/null
+++ b/docs/pcie_pci_bridge.txt
@@ -0,0 +1,114 @@
+Generic PCI Express to PCI Bridge
+=================================
+
+Description
+===========
+The PCIE-to-PCI bridge is a new way to create legacy PCI
+hierarchies on Q35 machines.
+
+Previously, the Intel DMI-to-PCI bridge was used for this purpose.
+But due to its strict limitations - no hot-plug support, and no
+cross-platform or cross-architecture support - the new generic
+PCIE-to-PCI bridge should now be used for plugging any legacy PCI device
+into a PCI Express machine.
+
+This generic PCIE-PCI bridge is a cross-platform device: it can be
+hot-plugged into an appropriate root port (this requires additional
+actions, see the 'PCIE-PCI bridge hot-plug' section),
+and it supports hot-plugging devices into the bridge itself
+(with some limitations, see below).
+
+Hot-plug of legacy PCI devices into the bridge is provided by the
+bridge's built-in Standard Hot-Plug Controller (SHPC), though it still
+has some limitations, see below.
+
+PCIE-PCI bridge hot-plug
+========================
+Guest OSes require extra effort to enable PCIE-PCI bridge hot-plug.
+The reason: at init time, a PCI Express root port that has no device
+plugged in gets no spare bus numbers reserved, so it cannot provide one
+to a bridge hot-plugged later.
+
+To solve this problem, additional buses are reserved at the firmware
+level. Currently only SeaBIOS is supported.
+The number of buses to reserve is delivered via a special Red Hat
+vendor-specific PCI capability, added to the root port into which a
+PCIE-PCI bridge is planned to be hot-plugged.
+
+Capability layout (defined in include/hw/pci/pci_bridge.h):
+
+    uint8_t id;     Standard PCI capability header field
+    uint8_t next;   Standard PCI capability header field
+    uint8_t len;    Standard PCI vendor-specific capability header field
+
+    uint8_t type;   Red Hat vendor-specific capability type
+                    List of currently existing types:
+                        RESOURCE_RESERVE = 1
+
+    uint32_t bus_res;   Minimum number of buses to reserve
+
+    uint64_t io;        IO space to reserve
+    uint32_t mem;       Non-prefetchable memory to reserve
+
+    At most one of the following two fields may be set to a value
+    different from -1:
+    uint32_t mem_pref_32; Prefetchable memory to reserve (32-bit MMIO)
+    uint64_t mem_pref_64; Prefetchable memory to reserve (64-bit MMIO)
+
+If any reservation field is -1 then this kind of reservation is not
+needed and must be ignored by firmware.
+
+At the moment this capability is used only in the QEMU generic PCIe root
+port (-device pcie-root-port). The capability construction function takes
+all reservation field values from the corresponding device properties. By
+default they are all set to -1, leaving the root port's default behavior
+unchanged.
+
+Usage
+=====
+A detailed command line would be:
+
+[qemu-bin + storage options] \
+-m 2G \
+-device pcie-root-port,bus=pcie.0,id=rp1,slot=1 \
+-device pcie-root-port,bus=pcie.0,id=rp2,slot=2 \
+-device pcie-root-port,bus=pcie.0,id=rp3,slot=3,bus-reserve=1 \
+-device pcie-pci-bridge,id=br1,bus=rp1 \
+-device pcie-pci-bridge,id=br2,bus=rp2 \
+-device e1000,bus=br1,addr=8
+
+Then, in the monitor, the following commands can be executed:
+device_add pcie-pci-bridge,id=br3,bus=rp3
+device_add e1000,bus=br2,addr=1
+device_add e1000,bus=br3,addr=1
+
+Here you have:
+ (1) Cold-plugged:
+   - Root ports: 1 QEMU generic root port with the capability mentioned above,
+                 2 QEMU generic root ports without this capability;
+   - 2 PCIE-PCI bridges plugged into 2 different root ports;
+   - e1000 plugged into the first bridge.
+ (2) Hot-plugged:
+   - a PCIE-PCI bridge, plugged into a QEMU generic root port;
+   - 2 e1000 cards, one plugged into the cold-plugged PCIE-PCI bridge,
+     another plugged into the hot-plugged bridge.
+
+Limitations
+===========
+The PCIE-PCI bridge can be hot-plugged only into a pcie-root-port that
+has a proper 'bus-reserve' property value to provide a secondary bus for
+the hot-plugged bridge.
+
+Windows 7 and older versions don't support hot-plugging devices into the
+PCIE-PCI bridge.
+To enable device hot-plug into the bridge on Linux there are three ways:
+1) Build the shpchp module with this patch:
+   http://www.spinics.net/lists/linux-pci/msg63052.html
+2) Use kernel 4.14+ where the patch mentioned above is already merged.
+3) Set the 'msi' property to off - this forces the bridge to use legacy
+   INTx, which allows the bridge to notify the OS about a hot-plug event
+   without having BUSMASTER set.
+
+Implementation
+==============
+The PCIE-PCI bridge is based on the PCI-PCI bridge, but it also provides
+PCI Express features as a PCI Express device.
+
diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
new file mode 100644
index 000000000..5c122fe81
--- /dev/null
+++ b/docs/pvrdma.txt
@@ -0,0 +1,345 @@
+Paravirtualized RDMA Device (PVRDMA)
+====================================
+
+
+1. Description
+===============
+PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
+It works with its Linux kernel driver AS IS, with no need for any special
+guest modifications.
+
+While it is compatible with the VMware device, it can also communicate
+with bare-metal RDMA-enabled machines as peers.
+
+It does not require an RDMA HCA in the host; it can work with Soft-RoCE
+(rxe).
+
+It does not require the whole guest RAM to be pinned, allowing memory
+over-commit; and, even though it is not implemented yet, migration support
+will be possible with some HW assistance.
+
+A project presentation accompanies this document:
+- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
+
+
+
+2. Setup
+========
+
+
+2.1 Guest setup
+===============
+Fedora 27+ kernels work out of the box; older distributions
+require updating the kernel to 4.14 to include the pvrdma driver.
+
+However, the libpvrdma library needed by user-level software is still
+not available as part of the distributions, so the rdma-core library
+needs to be compiled and optionally installed.
+
+Please follow the instructions at:
+  https://github.com/linux-rdma/rdma-core.git
+
+
+2.2 Host Setup
+==============
+The pvrdma backend is an ibdevice interface that can be exposed
+either by a Soft-RoCE (rxe) device on machines with no RDMA device,
+or by an HCA SRIOV function (VF/PF).
+Note that ibdevice interfaces can't be shared between pvrdma devices,
+each one requiring a separate instance (rxe or SRIOV VF).
+
+
+2.2.1 Soft-RoCE backend (rxe)
+=============================
+A stable version of rxe is required; Fedora 27+ or a Linux
+kernel 4.14+ is preferred.
+
+The rdma_rxe module is part of the Linux kernel but not loaded by default.
+Install the user-level library (librxe) following the instructions from:
+https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
+
+Associate an Ethernet interface with rxe by running:
+   rxe_cfg add eth0
+An rxe0 ibdevice interface will be created and can be used as the pvrdma
+backend.
+
+
+2.2.2 RDMA device Virtual Function backend
+==========================================
+Nothing special is required; the pvrdma device can work not only with
+Ethernet links, but also with InfiniBand links.
+All that is needed is an ibdevice with an active port; for Mellanox cards
+this will be something like mlx5_6, which can then be used as the backend.
+
+
+2.2.3 QEMU setup
+================
+Configure QEMU with the --enable-rdma flag after installing
+the required RDMA libraries.
+
+
+
+3. Usage
+========
+
+
+3.1 VM Memory settings
+======================
+Currently the device works only with memory-backed RAM,
+which must be marked as "shared":
+   -m 1G \
+   -object memory-backend-ram,id=mb1,size=1G,share \
+   -numa node,memdev=mb1 \
+
+
+3.2 MAD Multiplexer
+===================
+The MAD multiplexer is a service that exposes a MAD-like interface to VMs
+in order to overcome the limitation that only a single entity can register
+with the MAD layer to send and receive RDMA-CM MAD packets.
+
+To build rdmacm-mux run:
+# make rdmacm-mux
+
+Before running rdmacm-mux, make sure that neither the ib_cm nor the
+rdma_cm kernel module is loaded, otherwise the rdmacm-mux service will
+fail to start.
+
+The application accepts 3 command line arguments and exposes a UNIX socket
+to pass control and data to it.
+-d rdma-device-name  Name of the RDMA device to register with
+-s unix-socket-path  Path to the unix socket to listen on
+                     (default /var/run/rdmacm-mux)
+-p rdma-device-port  Port number of the RDMA device to register with
+                     (default 1)
+The final UNIX socket file name is a concatenation of the 3 arguments, so
+for example for device mlx5_0 on port 2 the socket
+/var/run/rdmacm-mux-mlx5_0-2 will be created.
+
+pvrdma requires this service.
+
+Please refer to contrib/rdmacm-mux for more details.
+
+
+3.3 Service exposed by libvirt daemon
+=====================================
+Control over the RDMA device's GID table is exercised by updating the
+device's Ethernet function addresses.
+Usually the first GID entry is determined by the MAC address, the second by
+the first IPv6 address and the third by the IPv4 address. Other entries can
+be added by adding more IP addresses. The reverse also applies:
+whenever an address is removed, the corresponding GID entry is removed.
+The process is handled by the network and RDMA stacks. Whenever an address
+is added, the ib_core driver is notified and calls the device driver's
+add_gid function, which in turn updates the device.
+To support this, the pvrdma device hooks into the create_bind and
+destroy_bind HW commands triggered by the pvrdma driver in the guest.
+
+Whenever a change is made to the pvrdma port's GID table, a special QMP
+message is sent to be processed by libvirt, which updates the address of
+the backend Ethernet device.
+
+pvrdma requires the libvirt service to be up.
+
+
+3.4 PCI devices settings
+========================
+A RoCE device exposes two functions - an Ethernet function and an RDMA
+function.
+To mirror this, the pvrdma device is composed of two PCI functions: an
+Ethernet device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI
+slot 1. The Ethernet function can be used for other Ethernet purposes
+such as IP.
+
+
+3.5 Device parameters
+=====================
+- netdev: Specifies the Ethernet device function name on the host, for
+  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
+  Ethernet device used to create it.
+- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
+- mad-chardev: The name of the MAD multiplexer char device.
+- ibport: In case of a multi-port device (such as a Mellanox HCA) this
+  specifies the port to use. If not set, port 1 will be used.
+- dev-caps-max-mr-size: The maximum size of MR.
+- dev-caps-max-qp:      Maximum number of QPs.
+- dev-caps-max-cq:      Maximum number of CQs.
+- dev-caps-max-mr:      Maximum number of MRs.
+- dev-caps-max-pd:      Maximum number of PDs.
+- dev-caps-max-ah:      Maximum number of AHs.
+
+Notes:
+- The first 3 parameters are mandatory settings; the rest have defaults.
+- The parameters prefixed by dev-caps define upper limits, but the final
+  values are adjusted according to the backend device's own limitations.
+- netdev can be extracted from ibdev's sysfs
+  (/sys/class/infiniband/<ibdev>/device/net/), as shown in the example
+  below.
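+
+For example, assuming an rxe0 ibdevice that was created on top of eth0
+(the device and interface names here are illustrative), the matching
+netdev can be read back from sysfs:
+
+   # ls /sys/class/infiniband/rxe0/device/net/
+   eth0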
+
+
+3.6 Example
+===========
+Define a bridge device with the vmxnet3 network backend:
+<interface type='bridge'>
+  <mac address='56:b4:44:e9:62:dc'/>
+  <source bridge='bridge1'/>
+  <model type='vmxnet3'/>
+  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
+</interface>
+
+Define the pvrdma device:
+<qemu:commandline>
+  <qemu:arg value='-object'/>
+  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
+  <qemu:arg value='-numa'/>
+  <qemu:arg value='node,memdev=mb1'/>
+  <qemu:arg value='-chardev'/>
+  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
+  <qemu:arg value='-device'/>
+  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
+</qemu:commandline>
+
+
+
+4. Implementation details
+=========================
+
+
+4.1 Overview
+============
+The device acts like a proxy between the guest driver and the host
+ibdevice interface.
+On the configuration path:
+ - For every hardware resource request (PD/QP/CQ/...) the pvrdma device
+   will request a resource from the backend interface, maintaining a 1-1
+   mapping between the guest and the host.
+On the data path:
+ - Every post_send/receive received from the guest will be converted into
+   a post_send/receive for the backend. The buffer data will not be
+   touched or copied, resulting in near bare-metal performance for large
+   enough buffers.
+ - Completions from the backend interface will result in completions for
+   the pvrdma device.
+
+
+4.2 PCI BARs
+============
+PCI BARs:
+	BAR 0 - MSI-X
+        MSI-X vectors:
+		(0) Command - used when execution of a command is completed.
+		(1) Async - not in use.
+		(2) Completion - used when a completion event is placed in
+		    the device's CQ ring.
+	BAR 1 - Registers
+        --------------------------------------------------------
+        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
+        --------------------------------------------------------
+		DSR - Address of driver/device shared memory used
+              for the command channel, used for passing:
+			    - General info such as driver version
+			    - Address of 'command' and 'response'
+			    - Address of async ring
+			    - Address of device's CQ ring
+			    - Device capabilities
+		CTL - Device control operations (activate, reset etc)
+		IMR - Set interrupt mask
+		REQ - Command execution register
+		ERR - Operation status
+
+	BAR 2 - UAR
+        ---------------------------------------------------------
+        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
+        ---------------------------------------------------------
+		- Offset 0 used for QP operations (send and recv)
+		- Offset 4 used for CQ operations (arm and poll)
+
+
+4.3 Major flows
+===============
+
+4.3.1 Create CQ
+===============
+    - Guest driver
+        - Allocates pages for the CQ ring
+        - Creates a page directory (pdir) to hold the CQ ring's pages
+        - Initializes the CQ ring
+        - Initializes the 'Create CQ' command object (cqe, pdir etc)
+        - Copies the command to the 'command' address
+        - Writes 0 into the REQ register
+    - Device
+        - Reads the request object from the 'command' address
+        - Allocates the CQ object and initializes the CQ ring based on pdir
+        - Creates the backend CQ
+        - Writes the operation status to the ERR register
+        - Posts a command-interrupt to the guest
+    - Guest driver
+        - Reads the HW response code from the ERR register
+
+4.3.2 Create QP
+===============
+    - Guest driver
+        - Allocates pages for the send and receive rings
+        - Creates a page directory (pdir) to hold the rings' pages
+        - Initializes the 'Create QP' command object (max_send_wr,
+          send_cq_handle, recv_cq_handle, pdir etc)
+        - Copies the object to the 'command' address
+        - Writes 0 into the REQ register
+    - Device
+        - Reads the request object from the 'command' address
+        - Allocates the QP object and initializes
+            - the send and recv rings based on pdir
+            - the send and recv ring state
+        - Creates the backend QP
+        - Writes the operation status to the ERR register
+        - Posts a command-interrupt to the guest
+    - Guest driver
+        - Reads the HW response code from the ERR register
+
+4.3.3 Post receive
+==================
+    - Guest driver
+        - Initializes a wqe and places it on the recv ring
+        - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
+    - Device
+        - Extracts the qpn from the UAR
+        - Walks through the ring and does the following for each wqe
+            - Prepares the backend CQE context to be used when
+              receiving a completion from the backend (wr_id, op_code,
+              emu_cq_num)
+            - For each sge prepares a backend sge
+            - Calls the backend's post_recv
+
+4.3.4 Process backend events
+============================
+    - Done by a dedicated thread used to process backend events;
+      at initialization it is attached to the device and creates
+      the communication channel.
+    - Thread main loop:
+        - Polls for completions
+        - Extracts the emu_cq_num, wr_id and op_code from the context
+        - Writes the CQE to the CQ ring
+        - Writes the CQ number to the device CQ
+        - Sends a completion-interrupt to the guest
+        - Deallocates the context
+        - Acks the event to the backend
+
+
+
+5. Limitations
+==============
+- The device is naturally limited by the features of the VMware device API
+  that the guest Linux driver implements.
+- The memory registration mechanism requires an mremap for every page in
+  the buffer in order to map it to a contiguous virtual address range.
+  Since this is not on the data path it should not matter much. If the
+  default max mr size is increased, be aware that memory registration can
+  take up to 0.5 seconds for 1GB of memory.
+- The device requires the target page size to be the same as the host page
+  size, otherwise it will fail to initialize.
+- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
+  attached, so it can't work with huge pages. This limitation will be
+  addressed in the future; however, since QEMU allocates guest RAM with
+  MADV_HUGEPAGE, transparent huge pages will still be used if enough of
+  them are available. QEMU will fail to initialize if the requirements are
+  not met.
+
+
+
+6. Performance
+==============
+By design the pvrdma device exits on each post-send/receive, so for small
+buffers the performance is affected; however for medium buffers it becomes
+close to bare metal, and from 1MB buffers and up it reaches bare-metal
+performance. (Tested with 2 VMs, with the pvrdma devices connected to 2
+VFs of the same device.)
+
+All the above assumes no memory registration is done on the data path.
diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
new file mode 100644
index 000000000..5f763aa6b
--- /dev/null
+++ b/docs/qcow2-cache.txt
@@ -0,0 +1,241 @@
+qcow2 L2/refcount cache configuration
+=====================================
+Copyright (C) 2015, 2018-2020 Igalia, S.L.
+Author: Alberto Garcia <berto@igalia.com>
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+Introduction
+------------
+The QEMU qcow2 driver has two caches that can improve the I/O
+performance significantly. However, setting the right cache sizes is
+not a straightforward operation.
+
+This document attempts to give an overview of the L2 and refcount
+caches, and how to configure them.
+
+Please refer to the docs/interop/qcow2.txt file for an in-depth
+technical description of the qcow2 file format.
+
+
+Clusters
+--------
+A qcow2 file is organized in units of constant size called clusters.
+
+The cluster size is configurable, but it must be a power of two and
+its value must be 512 bytes or higher. QEMU currently defaults to
+64 KB clusters, and it does not support sizes larger than 2MB.
+
+The 'qemu-img create' command supports specifying the size using the
+cluster_size option:
+
+   qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G
+
+
+The L2 tables
+-------------
+The qcow2 format uses a two-level structure to map the virtual disk as
+seen by the guest to the disk image in the host. These structures are
+called the L1 and L2 tables.
+
+There is one single L1 table per disk image. The table is small and is
+always kept in memory.
+
+There can be many L2 tables, depending on how much space has been
+allocated in the image. Each table is one cluster in size. In order to
+read or write data from the virtual disk, QEMU needs to read its
+corresponding L2 table to find out where that data is located. Since
+reading the table for each I/O operation can be expensive, QEMU keeps
+an L2 cache in memory to speed up disk access.
+
+The size of the L2 cache can be configured, and setting the right
+value can improve the I/O performance significantly.
+
+
+The refcount blocks
+-------------------
+The qcow2 format also maintains a reference count for each cluster.
+Reference counts are used for cluster allocation and internal
+snapshots. The data is stored in a two-level structure similar to the
+L1/L2 tables described above.
+
+The second-level structures are called refcount blocks. They are also
+one cluster in size, and their number likewise varies with the amount
+of allocated space.
+
+Each block contains a number of refcount entries. Their size (in bits)
+is a power of two and must not be higher than 64. It defaults to 16
+bits, but a different value can be set using the refcount_bits option:
+
+   qemu-img create -f qcow2 -o refcount_bits=8 hd.qcow2 4G
+
+QEMU keeps a refcount cache to speed up I/O much like the
+aforementioned L2 cache, and its size can also be configured.
+
+
+Choosing the right cache sizes
+------------------------------
+In order to choose the cache sizes we need to know how they relate to
+the amount of allocated space.
+
+The part of the virtual disk that can be mapped by the L2 and refcount
+caches (in bytes) is:
+
+   disk_size = l2_cache_size * cluster_size / 8
+   disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits
+
+With the default values for cluster_size (64KB) and refcount_bits
+(16), this becomes:
+
+   disk_size = l2_cache_size * 8192
+   disk_size = refcount_cache_size * 32768
+
+So in order to cover n GB of disk space with the default values we
+need:
+
+   l2_cache_size       = disk_size_GB * 131072
+   refcount_cache_size = disk_size_GB * 32768
+
+For example, 1MB of L2 cache is needed to cover every 8 GB of the virtual
+image size (given that the default cluster size is used):
+
+   8 GB / 8192 = 1 MB
+
+The refcount cache is 4 times the cluster size by default. With the default
+cluster size of 64 KB, it is 256 KB (262144 bytes). This is sufficient for
+8 GB of image size:
+
+   262144 * 32768 = 8 GB
+
+
+How to configure the cache sizes
+--------------------------------
+Cache sizes can be configured using the -drive option in the
+command-line, or the 'blockdev-add' QMP command.
+
+There are three options available, and all of them take bytes:
+
+"l2-cache-size":         maximum size of the L2 table cache
+"refcount-cache-size":   maximum size of the refcount block cache
+"cache-size":            maximum size of both caches combined
+
+There are a few things that need to be taken into account:
+
+ - Both caches must have a size that is a multiple of the cluster size
+   (or the cache entry size: see "Using smaller cache entries" below).
+
+ - The maximum L2 cache size is 32 MB by default on Linux platforms (enough
+   for full coverage of 256 GB images, with the default cluster size). This
+   value can be modified using the "l2-cache-size" option. QEMU will not use
+   more memory than needed to hold all of the image's L2 tables, regardless
+   of this max. value.
+   On non-Linux platforms the maximum value is smaller by default (8 MB);
+   this difference stems from the fact that on Linux the cache can be cleared
+   periodically if needed, using the "cache-clean-interval" option (see below).
+   The minimum L2 cache size is 2 clusters (or 2 cache entries, see below).
+
+ - The default (and minimum) refcount cache size is 4 clusters.
+
+ - If only "cache-size" is specified then QEMU will assign as much
+   memory as possible to the L2 cache before increasing the refcount
+   cache size.
+
+ - At most two of "l2-cache-size", "refcount-cache-size", and "cache-size"
+   can be set simultaneously.
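+
+As an illustration (a sketch; the file name is arbitrary), giving the L2
+cache of a 512 GB image full coverage with the default 64 KB clusters
+requires, per the formulas above, 512 * 131072 = 67108864 bytes (64 MB):
+
+   -drive file=hd.qcow2,l2-cache-size=67108864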
+
+Unlike L2 tables, refcount blocks are not used during normal I/O but
+only during allocations and internal snapshots. In most cases they are
+accessed sequentially (even during random guest I/O) so increasing the
+refcount cache size won't have any measurable effect on performance
+(this can change if you are using internal snapshots, so you may want
+to think about increasing the cache size if you use them heavily).
+
+Before QEMU 2.12 the refcount cache had a default size of 1/4 of the
+L2 cache size. This resulted in unnecessarily large caches, so now the
+refcount cache is as small as possible unless overridden by the user.
+
+
+Using smaller cache entries
+---------------------------
+The qcow2 L2 cache can store complete tables. This means that if QEMU
+needs an entry from an L2 table then the whole table is read from disk
+and is kept in the cache. If the cache is full then a complete table
+needs to be evicted first.
+
+This can be inefficient with large cluster sizes since it results in
+more disk I/O and wastes more cache memory.
+
+Since QEMU 2.12 you can change the size of the L2 cache entry and make
+it smaller than the cluster size. This can be configured using the
+"l2-cache-entry-size" parameter:
+
+   -drive file=hd.qcow2,l2-cache-size=2097152,l2-cache-entry-size=4096
+
+Since QEMU 4.0 the value of l2-cache-entry-size defaults to 4KB (or
+the cluster size if it's smaller).
+
+Some things to take into account:
+
+ - The L2 cache entry size has the same restrictions as the cluster
+   size (power of two, at least 512 bytes).
+
+ - Smaller entry sizes generally improve the cache efficiency and make
+   disk I/O faster. This is particularly true with solid state drives
+   so it's a good idea to reduce the entry size in those cases. With
+   rotating hard drives the situation is a bit more complicated so you
+   should test it first and stay with the default size if unsure.
+
+ - Try different entry sizes to see which one gives faster performance
+   in your case. The block size of the host filesystem is generally a
+   good default (usually 4096 bytes in the case of ext4, hence the
+   default).
+
+ - Only the L2 cache can be configured this way. The refcount cache
+   always uses the cluster size as the entry size.
+
+ - If the L2 cache is big enough to hold all of the image's L2 tables
+   (as explained in the "Choosing the right cache sizes" and "How to
+   configure the cache sizes" sections in this document) then none of
+   this is necessary and you can omit the "l2-cache-entry-size"
+   parameter altogether. In this case QEMU makes the entry size
+   equal to the cluster size by default.
+
+
+Reducing the memory usage
+-------------------------
+It is possible to clean unused cache entries in order to reduce the
+memory usage during periods of low I/O activity.
+
+The parameter "cache-clean-interval" defines an interval (in seconds),
+after which all the cache entries that haven't been accessed during the
+interval are removed from memory. Setting this parameter to 0 disables this
+feature.
+
+The following example removes all unused cache entries every 15 minutes:
+
+   -drive file=hd.qcow2,cache-clean-interval=900
+
+If unset, the default value for this parameter is 600 on platforms which
+support this functionality, and is 0 (disabled) on other platforms.
+
+This functionality currently relies on the MADV_DONTNEED argument for
+madvise() to actually free the memory. This is a Linux-specific feature,
+so cache-clean-interval is not supported on other systems.
+
+
+Extended L2 Entries
+-------------------
+All numbers shown in this document are valid for qcow2 images with normal
+64-bit L2 entries.
+ +Images with extended L2 entries need twice as much L2 metadata, so the L2 +cache size must be twice as large for the same disk space. + + disk_size = l2_cache_size * cluster_size / 16 + +i.e. + + l2_cache_size = disk_size * 16 / cluster_size + +Refcount blocks are not affected by this. diff --git a/docs/qdev-device-use.txt b/docs/qdev-device-use.txt new file mode 100644 index 000000000..240888933 --- /dev/null +++ b/docs/qdev-device-use.txt @@ -0,0 +1,404 @@ += How to convert to -device & friends = + +=== Specifying Bus and Address on Bus === + +In qdev, each device has a parent bus. Some devices provide one or +more buses for children. You can specify a device's parent bus with +-device parameter bus. + +A device typically has a device address on its parent bus. For buses +where this address can be configured, devices provide a bus-specific +property. Examples: + + bus property name value format + PCI addr %x.%x (dev.fn, .fn optional) + I2C address %u + SCSI scsi-id %u + IDE unit %u + HDA cad %u + virtio-serial-bus nr %u + ccid-bus slot %u + USB port %d(.%d)* (port.port...) + +Example: device i440FX-pcihost is on the root bus, and provides a PCI +bus named pci.0. To put a FOO device into its slot 4, use -device +FOO,bus=/i440FX-pcihost/pci.0,addr=4. The abbreviated form bus=pci.0 +also works as long as the bus name is unique. + +=== Block Devices === + +A QEMU block device (drive) has a host and a guest part. + +In the general case, the guest device is connected to a controller +device. For instance, the IDE controller provides two IDE buses, each +of which can have up to two devices, and each device is a guest part, +and is connected to a host part. + +Except we sometimes lump controller, bus(es) and drive device(s) all +together into a single device. For instance, the ISA floppy +controller is connected to up to two host drives. + +The old ways to define block devices define host and guest part +together. Sometimes, they can even define a controller device in +addition to the block device. + +The new way keeps the parts separate: you create the host part with +-drive, and guest device(s) with -device. + +The various old ways to define drives all boil down to the common form + + -drive if=TYPE,bus=BUS,unit=UNIT,OPTS... + +TYPE, BUS and UNIT identify the controller device, which of its buses +to use, and the drive's address on that bus. Details depend on TYPE. + +Instead of bus=BUS,unit=UNIT, you can also say index=IDX. + +In the new way, this becomes something like + + -drive if=none,id=DRIVE-ID,HOST-OPTS... + -device DEVNAME,drive=DRIVE-ID,DEV-OPTS... + +The old OPTS get split into HOST-OPTS and DEV-OPTS as follows: + +* file, format, snapshot, cache, aio, readonly, rerror, werror go into + HOST-OPTS. + +* cyls, head, secs and trans go into HOST-OPTS. Future work: they + should go into DEV-OPTS instead. + +* serial goes into DEV-OPTS, for devices supporting serial numbers. + For other devices, it goes nowhere. + +* media is special. In the old way, it selects disk vs. CD-ROM with + if=ide, if=scsi and if=xen. The new way uses DEVNAME for that. + Additionally, readonly=on goes into HOST-OPTS. + +* addr is special, see if=virtio below. + +The -device argument differs in detail for each type of drive: + +* if=ide + + -device DEVNAME,drive=DRIVE-ID,bus=IDE-BUS,unit=UNIT + + where DEVNAME is either ide-hd or ide-cd, IDE-BUS identifies an IDE + bus, normally either ide.0 or ide.1, and UNIT is either 0 or 1. + +* if=scsi + + The old way implicitly creates SCSI controllers as needed. 
The new + way makes that explicit: + + -device lsi53c895a,id=ID + + As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to + control the PCI device address. + + This SCSI controller provides a single SCSI bus, named ID.0. Put a + disk on it: + + -device DEVNAME,drive=DRIVE-ID,bus=ID.0,scsi-id=UNIT + + where DEVNAME is either scsi-hd, scsi-cd or scsi-generic. + +* if=floppy + + -device floppy,unit=UNIT,drive=DRIVE-ID + + Without any -device floppy,... you get an empty unit 0 and no unit + 1. You can use -nodefaults to suppress the default unit 0, see + "Default Devices". + +* if=virtio + + -device virtio-blk-pci,drive=DRIVE-ID,class=C,vectors=V,ioeventfd=IOEVENTFD + + This lets you control PCI device class and MSI-X vectors. + + IOEVENTFD controls whether or not ioeventfd is used for virtqueue + notify. It can be set to on (default) or off. + + As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to + control the PCI device address. This replaces option addr available + with -drive if=virtio. + +* if=pflash, if=mtd, if=sd, if=xen are not yet available with -device + +For USB devices, the old way was actually different: + + -usbdevice disk:format=FMT:FILENAME + +"Was" because "disk:" is gone since v2.12.0. + +The old way provided much less control than -drive's OPTS... The new +way fixes that: + + -device usb-storage,drive=DRIVE-ID,removable=RMB + +The removable parameter gives control over the SCSI INQUIRY removable +(RMB) bit. USB thumbdrives usually set removable=on, while USB hard +disks set removable=off. + +Bug: usb-storage pretends to be a block device, but it's really a SCSI +controller that can serve only a single device, which it creates +automatically. The automatic creation guesses what kind of guest part +to create from the host part, like -drive if=scsi. Host and guest +part are not cleanly separated. + +=== Character Devices === + +A QEMU character device has a host and a guest part. + +The old ways to define character devices define host and guest part +together. + +The new way keeps the parts separate: you create the host part with +-chardev, and the guest device with -device. + +The various old ways to define a character device are all of the +general form + + -FOO FOO-OPTS...,LEGACY-CHARDEV + +where FOO-OPTS... is specific to -FOO, and the host part +LEGACY-CHARDEV is the same everywhere. + +In the new way, this becomes + + -chardev HOST-OPTS...,id=CHR-ID + -device DEVNAME,chardev=CHR-ID,DEV-OPTS... + +The appropriate DEVNAME depends on the machine type. For type "pc": + +* -serial becomes -device isa-serial,iobase=IOADDR,irq=IRQ,index=IDX + + This lets you control I/O ports and IRQs. + +* -parallel becomes -device isa-parallel,iobase=IOADDR,irq=IRQ,index=IDX + + This lets you control I/O ports and IRQs. + +* -usbdevice braille doesn't support LEGACY-CHARDEV syntax. It always + uses "braille". With -device, this useful default is gone, so you + have to use something like + + -device usb-braille,chardev=braille -chardev braille,id=braille + +* -usbdevice serial::chardev is gone since v2.12.0. It became + -device usb-serial,chardev=dev. + +LEGACY-CHARDEV translates to -chardev HOST-OPTS... 
as follows: + +* null becomes -chardev null + +* pty, msmouse, wctablet, braille, stdio likewise + +* vc:WIDTHxHEIGHT becomes -chardev vc,width=WIDTH,height=HEIGHT + +* vc:<COLS>Cx<ROWS>C becomes -chardev vc,cols=<COLS>,rows=<ROWS> + +* con: becomes -chardev console + +* COM<NUM> becomes -chardev serial,path=COM<NUM> + +* file:FNAME becomes -chardev file,path=FNAME + +* pipe:FNAME becomes -chardev pipe,path=FNAME + +* tcp:HOST:PORT,OPTS... becomes -chardev socket,host=HOST,port=PORT,OPTS... + +* telnet:HOST:PORT,OPTS... becomes + -chardev socket,host=HOST,port=PORT,OPTS...,telnet=on + +* udp:HOST:PORT@LOCALADDR:LOCALPORT becomes + -chardev udp,host=HOST,port=PORT,localaddr=LOCALADDR,localport=LOCALPORT + +* unix:FNAME becomes -chardev socket,path=FNAME + +* /dev/parportN becomes -chardev parport,file=/dev/parportN + +* /dev/ppiN likewise + +* Any other /dev/FNAME becomes -chardev tty,path=/dev/FNAME + +* mon:LEGACY-CHARDEV is special: it multiplexes the monitor onto the + character device defined by LEGACY-CHARDEV. -chardev provides more + general multiplexing instead: you can connect up to four users to a + single host part. You need to pass mux=on to -chardev to enable + switching the input focus. + +QEMU uses LEGACY-CHARDEV syntax not just to set up guest devices, but +also in various other places such as -monitor or -net +user,guestfwd=... You can use chardev:CHR-ID in place of +LEGACY-CHARDEV to refer to a host part defined with -chardev. + +=== Network Devices === + +Host and guest part of network devices have always been separate. + +The old way to define the guest part looks like this: + + -net nic,netdev=NET-ID,macaddr=MACADDR,model=MODEL,name=ID,addr=STR,vectors=V + +Except for USB it looked like this: + + -usbdevice net:netdev=NET-ID,macaddr=MACADDR,name=ID + +"Looked" because "net:" is gone since v2.12.0. + +The new way is -device: + + -device DEVNAME,netdev=NET-ID,mac=MACADDR,DEV-OPTS... + +DEVNAME equals MODEL, except for virtio you have to name the virtio +device appropriate for the bus (virtio-net-pci for PCI), and for USB +you have to use usb-net. + +The old name=ID parameter becomes the usual id=ID with -device. + +For PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control the PCI +device address, as usual. The old -net nic provides parameter addr +for that, which is silently ignored when the NIC is not a PCI device. + +For virtio-net-pci, you can control whether or not ioeventfd is used for +virtqueue notify by setting ioeventfd= to on or off (default). + +-net nic accepts vectors=V for all models, but it's silently ignored +except for virtio-net-pci (model=virtio). With -device, only devices +that support it accept it. + +Not all devices are available with -device at this time. All PCI +devices and ne2k_isa are. + +Some PCI devices aren't available with -net nic, e.g. i82558a. + +=== Graphics Devices === + +Host and guest part of graphics devices have always been separate. + +The old way to define the guest graphics device is -vga VGA. Not all +machines support all -vga options. + +The new way is -device. The mapping from -vga argument to -device +depends on the machine type. For machine "pc", it's: + + std -device VGA + cirrus -device cirrus-vga + vmware -device vmware-svga + qxl -device qxl-vga + none -nodefaults + disables more than just VGA, see "Default Devices" + +As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control +the PCI device address. 
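+
+For example (a sketch; the address value is arbitrary), -vga std with an
+explicit PCI address becomes:
+
+    -device VGA,bus=pci.0,addr=2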
+ +-device VGA supports properties bios-offset and bios-size, but they +aren't used with machine type "pc". + +For machine "isapc", it's + + std -device isa-vga + cirrus not yet available with -device + none -nodefaults + disables more than just VGA, see "Default Devices" + +Bug: the new way doesn't work for machine types "pc" and "isapc", +because it violates obscure device initialization ordering +constraints. + +=== Audio Devices === + +Host and guest part of audio devices have always been separate. + +The old way to define guest audio devices is -soundhw C1,... + +The new way is to define each guest audio device separately with +-device. + +Map from -soundhw sound card name to -device: + + ac97 -device AC97 + cs4231a -device cs4231a,iobase=IOADDR,irq=IRQ,dma=DMA + es1370 -device ES1370 + gus -device gus,iobase=IOADDR,irq=IRQ,dma=DMA,freq=F + hda -device intel-hda,msi=MSI -device hda-duplex + sb16 -device sb16,iobase=IOADDR,irq=IRQ,dma=DMA,dma16=DMA16,version=V + adlib not yet available with -device + pcspk not yet available with -device + +For PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control the PCI +device address, as usual. + +=== USB Devices === + +The old way to define a virtual USB device is -usbdevice DRIVER:OPTS... + +The new way is -device DEVNAME,DEV-OPTS... Details depend on DRIVER: + +* ccid -device usb-ccid +* keyboard -device usb-kbd +* mouse -device usb-mouse +* tablet -device usb-tablet +* wacom-tablet -device usb-wacom-tablet +* u2f -device u2f-{emulated,passthru} +* braille See "Character Devices" + +Until v2.12.0, we additionally had + +* host:... See "Host Device Assignment" +* disk:... See "Block Devices" +* serial:... See "Character Devices" +* net:... See "Network Devices" + +=== Watchdog Devices === + +Host and guest part of watchdog devices have always been separate. + +The old way to define a guest watchdog device is -watchdog DEVNAME. +The new way is -device DEVNAME. For PCI devices, you can add +bus=PCI-BUS,addr=DEVFN to control the PCI device address, as usual. + +=== Host Device Assignment === + +QEMU supports assigning host PCI devices (qemu-kvm only at this time) +and host USB devices. PCI devices can only be assigned with -device: + + -device vfio-pci,host=ADDR,id=ID + +The old way to assign a USB host device + + -usbdevice host:auto:BUS.ADDR:VID:PRID + +was removed in v2.12.0. Any of BUS, ADDR, VID, PRID could be the +wildcard *. + +The new way is + + -device usb-host,hostbus=BUS,hostaddr=ADDR,vendorid=VID,productid=PRID + +Omitted options match anything. + +=== Default Devices === + +QEMU creates a number of devices by default, depending on the machine +type. + +-device DEVNAME... and global DEVNAME... suppress default devices for +some DEVNAMEs: + + default device suppressing DEVNAMEs + CD-ROM ide-cd, ide-hd, scsi-cd, scsi-hd + floppy floppy, isa-fdc + parallel isa-parallel + serial isa-serial + VGA VGA, cirrus-vga, isa-vga, isa-cirrus-vga, + vmware-svga, qxl-vga, virtio-vga, ati-vga, + vhost-user-vga + +The default NIC is connected to a default part created along with it. +It is *not* suppressed by configuring a NIC with -device (you may call +that a bug). -net and -netdev suppress the default NIC. + +-nodefaults suppresses all the default devices mentioned above, plus a +few other things such as default SD-Card drive and default monitor. 
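+
+=== Worked Example ===
+
+As a combined illustration of the conversions above (a sketch; the file
+name, IDs and backends are arbitrary), an old-style invocation such as
+
+    -drive file=disk.img,if=scsi,bus=0,unit=0 -serial stdio
+
+can be written the new way as
+
+    -drive if=none,id=drive0,file=disk.img \
+    -device lsi53c895a,id=scsi0 \
+    -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0 \
+    -chardev stdio,id=chr0 \
+    -device isa-serial,chardev=chr0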
diff --git a/docs/qemu-option-trace.rst.inc b/docs/qemu-option-trace.rst.inc new file mode 100644 index 000000000..d7acbe67f --- /dev/null +++ b/docs/qemu-option-trace.rst.inc @@ -0,0 +1,26 @@ + +Specify tracing options. + +``[enable=]PATTERN`` + + Immediately enable events matching *PATTERN* + (either event name or a globbing pattern). This option is only + available if QEMU has been compiled with the ``simple``, ``log`` + or ``ftrace`` tracing backend. To specify multiple events or patterns, + specify the :option:`-trace` option multiple times. + + Use :option:`-trace help` to print a list of names of trace points. + +``events=FILE`` + + Immediately enable events listed in *FILE*. + The file must contain one event name (as listed in the ``trace-events-all`` + file) per line; globbing patterns are accepted too. This option is only + available if QEMU has been compiled with the ``simple``, ``log`` or + ``ftrace`` tracing backend. + +``file=FILE`` + + Log output traces to *FILE*. + This option is only available if QEMU has been compiled with + the ``simple`` tracing backend. diff --git a/docs/qemu_logo.pdf b/docs/qemu_logo.pdf Binary files differnew file mode 100644 index 000000000..294cb7dec --- /dev/null +++ b/docs/qemu_logo.pdf diff --git a/docs/qemupciserial.inf b/docs/qemupciserial.inf new file mode 100644 index 000000000..7ca766745 --- /dev/null +++ b/docs/qemupciserial.inf @@ -0,0 +1,102 @@ +; qemupciserial.inf for QEMU, based on MSPORTS.INF + +; The driver itself is shipped with Windows (serial.sys). This is +; just an inf file to tell windows which pci id the serial pci card +; emulated by qemu has, and to apply a name tag to it which windows +; will show in the device manager. + +; Installing the driver: Go to device manager. You should find a "pci +; serial card" tagged with a yellow question mark. Open properties. +; Pick "update driver". Then "select driver manually". Pick "Ports +; (Com+Lpt)" from the list. Click "Have a disk". Select this file. +; Procedure may vary a bit depending on the windows version. + +; This file covers all options: pci-serial, pci-serial-2x, pci-serial-4x +; for both 32 and 64 bit platforms. 
+ +[Version] +Signature="$Windows NT$" +Class=MultiFunction +ClassGUID={4d36e971-e325-11ce-bfc1-08002be10318} +Provider=%QEMU% +DriverVer=12/29/2013,1.3.0 +[ControlFlags] +ExcludeFromSelect=* +[Manufacturer] +%QEMU%=QEMU,NTx86,NTAMD64 + +[QEMU.NTx86] +%QEMU-PCI_SERIAL_1_PORT%=ComPort_inst1, PCI\VEN_1B36&DEV_0002 +%QEMU-PCI_SERIAL_2_PORT%=ComPort_inst2, PCI\VEN_1B36&DEV_0003 +%QEMU-PCI_SERIAL_4_PORT%=ComPort_inst4, PCI\VEN_1B36&DEV_0004 + +[QEMU.NTAMD64] +%QEMU-PCI_SERIAL_1_PORT%=ComPort_inst1, PCI\VEN_1B36&DEV_0002 +%QEMU-PCI_SERIAL_2_PORT%=ComPort_inst2, PCI\VEN_1B36&DEV_0003 +%QEMU-PCI_SERIAL_4_PORT%=ComPort_inst4, PCI\VEN_1B36&DEV_0004 + +[ComPort_inst1] +Include=mf.inf +Needs=MFINSTALL.mf + +[ComPort_inst2] +Include=mf.inf +Needs=MFINSTALL.mf + +[ComPort_inst4] +Include=mf.inf +Needs=MFINSTALL.mf + +[ComPort_inst1.HW] +AddReg=ComPort_inst1.RegHW + +[ComPort_inst2.HW] +AddReg=ComPort_inst2.RegHW + +[ComPort_inst4.HW] +AddReg=ComPort_inst4.RegHW + +[ComPort_inst1.Services] +Include=mf.inf +Needs=MFINSTALL.mf.Services + +[ComPort_inst2.Services] +Include=mf.inf +Needs=MFINSTALL.mf.Services + +[ComPort_inst4.Services] +Include=mf.inf +Needs=MFINSTALL.mf.Services + +[ComPort_inst1.RegHW] +HKR,Child0000,HardwareID,,*PNP0501 +HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00 +HKR,Child0000,ResourceMap,1,02 + +[ComPort_inst2.RegHW] +HKR,Child0000,HardwareID,,*PNP0501 +HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00 +HKR,Child0000,ResourceMap,1,02 +HKR,Child0001,HardwareID,,*PNP0501 +HKR,Child0001,VaryingResourceMap,1,00, 08,00,00,00, 08,00,00,00 +HKR,Child0001,ResourceMap,1,02 + +[ComPort_inst4.RegHW] +HKR,Child0000,HardwareID,,*PNP0501 +HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00 +HKR,Child0000,ResourceMap,1,02 +HKR,Child0001,HardwareID,,*PNP0501 +HKR,Child0001,VaryingResourceMap,1,00, 08,00,00,00, 08,00,00,00 +HKR,Child0001,ResourceMap,1,02 +HKR,Child0002,HardwareID,,*PNP0501 +HKR,Child0002,VaryingResourceMap,1,00, 10,00,00,00, 08,00,00,00 +HKR,Child0002,ResourceMap,1,02 +HKR,Child0003,HardwareID,,*PNP0501 +HKR,Child0003,VaryingResourceMap,1,00, 18,00,00,00, 08,00,00,00 +HKR,Child0003,ResourceMap,1,02 + +[Strings] +QEMU="QEMU" +QEMU-PCI_SERIAL_1_PORT="1x QEMU PCI Serial Card" +QEMU-PCI_SERIAL_2_PORT="2x QEMU PCI Serial Card" +QEMU-PCI_SERIAL_4_PORT="4x QEMU PCI Serial Card" diff --git a/docs/rdma.txt b/docs/rdma.txt new file mode 100644 index 000000000..2b4cdea1d --- /dev/null +++ b/docs/rdma.txt @@ -0,0 +1,420 @@ +(RDMA: Remote Direct Memory Access) +RDMA Live Migration Specification, Version # 1 +============================================== +Wiki: https://wiki.qemu.org/Features/RDMALiveMigration +Github: git@github.com:hinesmr/qemu.git, 'rdma' branch + +Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com> + +An *exhaustive* paper (2010) shows additional performance details +linked on the QEMU wiki above. + +Contents: +========= +* Introduction +* Before running +* Running +* Performance +* RDMA Migration Protocol Description +* Versioning and Capabilities +* QEMUFileRDMA Interface +* Migration of VM's ram +* Error handling +* TODO + +Introduction: +============= + +RDMA helps make your migration more deterministic under heavy load because +of the significantly lower latency and higher throughput over TCP/IP. This is +because the RDMA I/O architecture reduces the number of interrupts and +data copies by bypassing the host networking stack. 
In particular, a TCP-based
+migration, under certain types of memory-bound workloads, may take a more
+unpredictable amount of time to complete if the amount of
+memory tracked during each live migration iteration round cannot keep pace
+with the rate of dirty memory produced by the workload.
+
+RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
+over Converged Ethernet) and Infiniband-based. This implementation of
+migration using RDMA is capable of using both technologies because of
+the use of the OpenFabrics OFED software stack, which abstracts out the
+programming model irrespective of the underlying hardware.
+
+Refer to openfabrics.org or your respective RDMA hardware vendor for
+an understanding of how to verify that you have the OFED software stack
+installed in your environment. You should be able to successfully link
+against the "librdmacm" and "libibverbs" libraries and development headers
+for a working build of QEMU that runs successfully using RDMA migration.
+
+BEFORE RUNNING:
+===============
+
+Use of RDMA during migration requires pinning and registering memory
+with the hardware. This means that memory must be physically resident
+before the hardware can transmit that memory to another machine.
+If this is not acceptable for your application or product, then the use
+of RDMA migration may in fact be harmful to co-located VMs or other
+software on the machine if there is not sufficient memory available to
+relocate the entire footprint of the virtual machine. If so, then the
+use of RDMA is discouraged and it is recommended to use standard TCP
+migration.
+
+Experimental: Next, decide if you want dynamic page registration or
+up-front pinning of all memory (the rdma-pin-all capability).
+For example, if you have an 8GB RAM virtual machine, but only 1GB
+is in active use, then enabling rdma-pin-all will cause all 8GB to
+be pinned and resident in memory. This capability mostly affects the
+bulk-phase round of the migration and can be enabled for extremely
+high-performance RDMA hardware using the following command:
+
+QEMU Monitor Command:
+$ migrate_set_capability rdma-pin-all on # disabled by default
+
+Performing this action will cause all 8GB to be pinned, so if that's
+not what you want, then please ignore this step altogether.
+
+On the other hand, this will also significantly speed up the bulk round
+of the migration, which can greatly reduce the "total" time of your
+migration. Example performance of this using an idle VM in the previous
+example can be found in the "Performance" section.
+
+Note: for very large virtual machines (hundreds of GBs), pinning *all*
+of the memory of your virtual machine in the kernel is very expensive and
+may extend the initial bulk iteration time by many seconds,
+thus extending the total migration time. However, this will not
+affect the determinism or predictability of your migration, as you will
+still gain from the benefits of advanced pinning with RDMA.
+
+RUNNING:
+========
+
+First, set the migration speed to match your hardware's capabilities:
+
+QEMU Monitor Command:
+$ migrate_set_parameter max_bandwidth 40g # or whatever is the MAX of your RDMA device
+
+Next, on the destination machine, add the following to the QEMU command line:
+
+qemu ..... -incoming rdma:host:port
+
+Finally, perform the actual migration on the source machine:
+
+QEMU Monitor Command:
+$ migrate -d rdma:host:port
+
+PERFORMANCE
+===========
+
+Here is a brief summary of total migration time and downtime using RDMA,
+on a 40gbps infiniband link, performing a worst-case stress test on an
+8GB RAM virtual machine:
+
+Using the following command:
+$ apt-get install stress
+$ stress --vm-bytes 7500M --vm 1 --vm-keep
+
+1. Migration throughput: 26 gigabits/second.
+2. Downtime (stop time) varies between 15 and 100 milliseconds.
+
+EFFECTS of memory registration on bulk phase round:
+
+For example, in the same 8GB RAM example, with all 8GB of memory in
+active use and the VM itself completely idle, using the same 40 gbps
+infiniband link:
+
+1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
+2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
+
+These numbers would of course scale up to whatever size virtual machine
+you have to migrate using RDMA.
+
+Enabling this feature does *not* have any measurable effect on
+migration *downtime*. This is because, without this feature, all of the
+memory will already have been registered in advance during
+the bulk round and does not need to be re-registered during the successive
+iteration rounds.
+
+RDMA Protocol Description:
+==========================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal
+protocol now, consisting of infiniband SEND messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications running on infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for the VM's ram) do not (to behave like an actual DMA).
+
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+(Memory is not released from pinning until the migration
+completes, given that RDMA migrations are very fast.)
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a control transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt)
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver does accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a
+header portion and a data portion (but together they are transmitted
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion, uint32, network byte order)
+    * Type    (what command to perform, uint32, network byte order)
+    * Repeat  (number of commands in the data portion, same type only)
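+
+Laid out on the wire, this makes for a fixed 12-byte header followed by a
+variable-length data portion (a sketch derived from the field list above):
+
+   byte:    0            4            8            12 ...
+            +------------+------------+------------+------------------
+            |   Length   |    Type    |   Repeat   |  data portion ...
+            +------------+------------+------------+------------------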
At this point, we define a control channel on top of SEND messages,
described by a formal protocol. Each SEND message has a header portion
and a data portion (but together they are transmitted as a single SEND
message).

Header:
 * Length (of the data portion, uint32, network byte order)
 * Type (what command to perform, uint32, network byte order)
 * Repeat (number of commands in the data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself,
so that the protocol stays compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol
check this field, register all requests found in the array of commands
located in the data portion, and return an equal number of results in the
response. The maximum number of repeats is hard-coded to 4096. This is a
conservative limit based on the maximum size of a SEND message along with
empirical observations on the maximum future benefit of simultaneous page
registrations.

The 'Type' field has 12 different command values:

 1. Unused
 2. Error (sent to the source when something goes wrong)
 3. Ready (control-channel is available)
 4. QEMU File (for sending non-live device state)
 5. RAM Blocks request (used right after connection setup)
 6. RAM Blocks result (used right after connection setup)
 7. Compress page (zap zero page and skip registration)
 8. Register request (dynamic chunk registration)
 9. Register result ('rkey' to be used by sender)
 10. Register finished (registration for current iteration finished)
 11. Unregister request (unpin previously registered memory)
 12. Unregister finished (confirmation that unpin completed)

A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'Repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After the ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we
   expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands back to the sender, which hold the
   rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection, as described below.)
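A C rendering of the header and command values may make the layout easier
to see. The struct and enum names below are invented for this sketch; the
authoritative definitions live in QEMU's RDMA migration code and may
differ:

    #include <stdint.h>

    /* All three fields travel in network byte order. */
    struct rdma_control_header {
        uint32_t len;    /* length of the data portion              */
        uint32_t type;   /* command to perform, see below           */
        uint32_t repeat; /* number of same-type commands in data    */
    };

    enum rdma_control_type {     /* order follows the list above    */
        CONTROL_UNUSED,
        CONTROL_ERROR,               /* something went wrong        */
        CONTROL_READY,               /* control channel available   */
        CONTROL_QEMU_FILE,           /* non-live device state       */
        CONTROL_RAM_BLOCKS_REQUEST,
        CONTROL_RAM_BLOCKS_RESULT,
        CONTROL_COMPRESS,            /* zap zero page               */
        CONTROL_REGISTER_REQUEST,    /* dynamic chunk registration  */
        CONTROL_REGISTER_RESULT,     /* carries the rkey            */
        CONTROL_REGISTER_FINISHED,
        CONTROL_UNREGISTER_REQUEST,
        CONTROL_UNREGISTER_FINISHED,
    };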
All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information
   includes a description of each RAMBlock on the server side as well as
   the virtual addresses and lengths of each RAMBlock. This is used by the
   client to determine the start and stop locations of chunks and how to
   register them dynamically before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces also call these functions (described below)
   when transmitting non-live state, such as device state, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been
   registered using chunk registration (or are not checked at all and
   unconditionally written if chunk registration is disabled). This is
   accomplished using the "Compress" command listed above. If the page
   *has* been registered, then we check the entire chunk for zero. Only
   if the entire chunk is zero do we send a compress command to zap the
   page on the other side.

Versioning and Capabilities
===========================
The current version of the protocol is version #1.

The same version applies both to protocol traffic and to capabilities
negotiation (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
 * Version (protocol version validated before send/recv occurs),
   uint32, network byte order
 * Flags (bitwise OR of each capability),
   uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is newer than ours, we only negotiate the capabilities that
the requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page
registration.

Finally, negotiation happens with the Flags field: if the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero bit for that flag, and the primary-VM will understand
that as an unavailable capability and will thus disable it on the
primary-VM side.
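As a sketch, the private data area and the destination-side half of this
negotiation could look like the following (constant and struct names are
illustrative only, not QEMU's):

    #include <stdint.h>

    #define CAP_PIN_ALL (1u << 0)  /* the only capability in version 1 */

    struct rdma_private_data {     /* fits easily in the 192 bytes     */
        uint32_t version;          /* network byte order               */
        uint32_t flags;            /* OR of capability bits, net order */
    };

    /* Destination side: keep only the capabilities we also support
     * and echo the result back; the source then disables whatever
     * bits came back as zero. */
    static uint32_t negotiate_caps(uint32_t requested)
    {
        const uint32_t supported = CAP_PIN_ALL;
        return requested & supported;
    }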
QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.

Migration of VM's RAM:
======================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, virtual
addresses and, in case dynamic page registration was disabled on
the server-side, pre-registered RDMA keys.

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: this means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.

Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode we use for
RDMA migration.

If a *single* message fails, the decision is to abort the
migration entirely, clean up all the RDMA descriptors, and
unregister all the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket were broken during a non-RDMA based migration.

TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
3. Some form of balloon-device usage tracking would also
   help alleviate some issues.
4.
Use LRU to provide more fine-grained direction of UNREGISTER + requests for unpinning memory in an overcommitted environment. +5. Expose UNREGISTER support to the user by way of workload-specific + hints about application behavior. diff --git a/docs/replay.txt b/docs/replay.txt new file mode 100644 index 000000000..5b008ca49 --- /dev/null +++ b/docs/replay.txt @@ -0,0 +1,410 @@ +Copyright (c) 2010-2015 Institute for System Programming + of the Russian Academy of Sciences. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Record/replay +------------- + +Record/replay functions are used for the deterministic replay of qemu execution. +Execution recording writes a non-deterministic events log, which can be later +used for replaying the execution anywhere and for unlimited number of times. +It also supports checkpointing for faster rewind to the specific replay moment. +Execution replaying reads the log and replays all non-deterministic events +including external input, hardware clocks, and interrupts. + +Deterministic replay has the following features: + * Deterministically replays whole system execution and all contents of + the memory, state of the hardware devices, clocks, and screen of the VM. + * Writes execution log into the file for later replaying for multiple times + on different machines. + * Supports i386, x86_64, and Arm hardware platforms. + * Performs deterministic replay of all operations with keyboard and mouse + input devices. + +Usage of the record/replay: + * First, record the execution with the following command line: + qemu-system-i386 \ + -icount shift=7,rr=record,rrfile=replay.bin \ + -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ + -device ide-hd,drive=img-blkreplay \ + -netdev user,id=net1 -device rtl8139,netdev=net1 \ + -object filter-replay,id=replay,netdev=net1 + * After recording, you can replay it by using another command line: + qemu-system-i386 \ + -icount shift=7,rr=replay,rrfile=replay.bin \ + -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ + -device ide-hd,drive=img-blkreplay \ + -netdev user,id=net1 -device rtl8139,netdev=net1 \ + -object filter-replay,id=replay,netdev=net1 + The only difference with recording is changing the rr option + from record to replay. + * Block device images are not actually changed in the recording mode, + because all of the changes are written to the temporary overlay file. + This behavior is enabled by using blkreplay driver. It should be used + for every enabled block device, as described in 'Block devices' section. + * '-net none' option should be specified when network is not used, + because QEMU adds network card by default. When network is needed, + it should be configured explicitly with replay filter, as described + in 'Network devices' section. + * Interaction with audio devices and serial ports are recorded and replayed + automatically when such devices are enabled. + +Academic papers with description of deterministic replay implementation: +http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html +http://dl.acm.org/citation.cfm?id=2786805.2803179 + +Modifications of qemu include: + * wrappers for clock and time functions to save their return values in the log + * saving different asynchronous events (e.g. 
system shutdown) into the log
 * synchronization of the bottom halves execution
 * synchronization of the threads from the thread pool
 * recording/replaying user input (mouse, keyboard, and microphone)
 * adding internal checkpoints for cpu and io synchronization
 * network filter for recording and replaying the packets
 * block driver for making the block layer deterministic
 * serial port input record and replay
 * recording of random numbers obtained from external sources

Locking and thread synchronisation
----------------------------------

Previously the synchronisation of the main thread and the vCPU thread
was ensured by holding the BQL. However the trend has been to
reduce the time the BQL is held across the system, including under TCG
system emulation. As it is important that batches of events are kept
in sequence (e.g. expiring timers and checkpoints in the main thread
while instruction checkpoints are written by the vCPU thread) we need
another lock to keep things in lock-step. This role is now handled by
the replay_mutex_lock. It used to be held only for each event being
written but now it is held for a whole execution period. This results
in a deterministic ping-pong between the two main threads.

As the BQL is now a finer-grained lock than the replay_lock, it is almost
certainly a bug, and a source of deadlocks, to take the
replay_mutex_lock while the BQL is held. This is enforced by an assert.
While the unlocks are usually in the reverse order, this is not
necessary; you can drop the replay_lock while holding the BQL, without
doing a more complicated unlock_iothread/replay_unlock/lock_iothread
sequence.

Non-deterministic events
------------------------

Our record/replay system is based on saving and replaying non-deterministic
events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
from the HDD or the memory of the VM). Saving only non-deterministic events
makes the log file smaller and the simulation faster.

The following non-deterministic data from peripheral devices is saved into
the log: mouse and keyboard input, network packets, audio controller input,
serial port input, and hardware clocks (they are non-deterministic
too, because their values are taken from the host machine). Inputs from
simulated hardware, memory of the VM, software interrupts, and execution of
instructions are not saved into the log, because they are deterministic and
can be replayed by simulating the behavior of the virtual machine starting
from its initial state.

We had to solve three tasks to implement deterministic replay: recording
non-deterministic events, replaying non-deterministic events, and checking
that there is no divergence between record and replay modes.

We changed several parts of QEMU to implement event log recording and
replaying. Device models that have non-deterministic input from external
devices were changed to write every external event into the execution log
immediately. E.g. network packets are written into the log when they arrive
at the virtual network adapter.

All non-deterministic events come from these devices. But to
replay them we need to know at which moments they occur. We specify
these moments by counting the number of instructions executed between
every pair of consecutive events (see the sketch below).
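The recording side of this idea can be sketched as follows; every name
here is hypothetical and the real logic lives in QEMU's replay sources:

    #include <stdint.h>
    #include <stdio.h>

    extern uint64_t current_icount(void);   /* hypothetical accessor */

    static uint64_t last_event_icount;

    /* Append one non-deterministic event to the log, preceded by the
     * number of instructions executed since the previous event. */
    static void record_event(FILE *log, uint8_t event_id,
                             const void *data, uint32_t len)
    {
        uint64_t now = current_icount();
        uint32_t delta = (uint32_t)(now - last_event_icount);

        last_event_icount = now;
        fwrite(&delta, sizeof(delta), 1, log);  /* instruction gap */
        fputc(event_id, log);
        fwrite(data, 1, len, log);
    }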
Instruction counting
--------------------

QEMU should work in icount mode to use the record/replay feature. icount
was designed to allow deterministic execution in the absence of external
inputs to the virtual machine. We also use icount to control the occurrence
of the non-deterministic events. The number of instructions elapsed from
the last event is written to the log while recording the execution. In
replay mode we can predict when to inject that event using the instruction
counter.

Timers
------

Timers are used to execute callbacks from different subsystems of QEMU
at specified moments of time. There are several kinds of timers:
 * Real time clock. Based on host time and used only for callbacks that
   do not change the virtual machine state. For this reason the real time
   clock and its timers do not affect deterministic replay at all.
 * Virtual clock. These timers run only during the emulation. In icount
   mode the virtual clock value is calculated using the executed
   instructions counter. That is why it is completely deterministic and
   does not have to be recorded.
 * Host clock. This clock is used by device models that simulate real time
   sources (e.g. the real time clock chip). The host clock is one of the
   sources of non-determinism. Host clock read operations should be logged
   to make the execution deterministic.
 * Virtual real time clock. This clock is similar to the real time clock
   but it is used only for increasing the virtual clock while the virtual
   machine is sleeping. Due to its nature it is also non-deterministic,
   like the host clock, and has to be logged too.

Checkpoints
-----------

Replaying the execution of a virtual machine is bound by sources of
non-determinism. These are inputs from the clock and peripheral devices,
and QEMU thread scheduling. Thread scheduling affects the processing of
events from timers, asynchronous input-output, and bottom halves.

Invocations of timers are coupled with clock reads and changes to the state
of the virtual machine. Reads produce non-deterministic data taken from the
host clock. And VM state changes should preserve their order. Their
relative order in replay mode must replicate the order of callbacks in
record mode. To preserve this order we use checkpoints. When a specific
clock is processed in record mode we save a special "checkpoint" event to
the log. Checkpoints here do not refer to virtual machine snapshots. They
are just record/replay events used for synchronization.

QEMU in replay mode would otherwise try to invoke timer processing at a
random moment of time. That's why we do not process a group of timers
until the checkpoint event is read from the log. Such an event allows
synchronizing CPU execution and timer events (see the sketch at the end
of this section).

Two other checkpoints govern the "warping" of the virtual clock.
While the virtual machine is idle, the virtual clock increments at
1 ns per *real time* nanosecond. This is done by setting up a timer
(called the warp timer) on the virtual real time clock, so that the
timer fires at the next deadline of the virtual clock; the virtual clock
is then incremented (which is called "warping" the virtual clock) as
soon as the timer fires or the CPUs need to go out of the idle state.
Two functions are used for this purpose; because these actions change
virtual machine state and must be deterministic, each of them creates a
checkpoint. icount_start_warp_timer checks if the CPUs are idle and if so
starts accounting real time to the virtual clock. icount_account_warp_timer
is called when the CPUs get an interrupt or when the warp timer fires,
and it warps the virtual clock by the amount of real time that has passed
since icount_start_warp_timer.
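The essence of a checkpoint can be sketched like this; all names are
invented for illustration and this is not QEMU's actual code:

    #include <stdbool.h>

    #define EVENT_CHECKPOINT 0x80           /* hypothetical value    */

    enum replay_mode { REPLAY_MODE_RECORD, REPLAY_MODE_PLAY };
    extern enum replay_mode replay_mode;
    extern void log_put_event(int event);
    extern bool log_next_event_is(int event);

    /* Returns true when the timer group guarded by this checkpoint
     * may be processed now. */
    static bool checkpoint_reached(int checkpoint_id)
    {
        if (replay_mode == REPLAY_MODE_RECORD) {
            log_put_event(EVENT_CHECKPOINT + checkpoint_id);
            return true;
        }
        /* In replay mode, hold the timers back until the log shows
         * that this checkpoint occurred here during recording. */
        return log_next_event_is(EVENT_CHECKPOINT + checkpoint_id);
    }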
Bottom halves
-------------

Disk I/O events are completely deterministic in our model, because
in both record and replay modes we start the virtual machine from the same
disk state. But the callbacks that the virtual disk controller uses for
reading and writing the disk may occur at different moments of time in
record and replay modes.

Read and write requests are created by the CPU thread of QEMU. Later these
requests proceed to the block layer which creates "bottom halves". Bottom
halves consist of a callback and its parameters. They are processed when
the main loop locks the global mutex. These locks are not synchronized with
the replaying process because the main loop also processes events that do
not affect the virtual machine state (like user interaction with the
monitor).

That is why we had to implement saving and replaying of bottom-half
callbacks synchronously to the CPU execution. When a callback is about to
execute, it is added to the queue in the replay module. This queue is
written to the log when its callbacks are executed. In replay mode
callbacks are not processed until the corresponding event is read from the
events log file.

Sometimes the block layer uses asynchronous callbacks for its internal
purposes (like reading or writing VM snapshots or disk image cluster
tables). In this case bottom halves are not marked as "replayable" and are
not saved into the log.

Block devices
-------------

The block devices record/replay module intercepts calls of
bdrv coroutine functions at the top of the block drivers stack.
To record and replay block operations the drive must be configured
as follows:

 -drive file=disk.qcow2,if=none,snapshot,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

The blkreplay driver should be inserted between the disk image and the
virtual disk controller. This way all disk requests can be recorded and
replayed.

All block completion operations are added to the queue in the coroutines.
The queue is flushed at checkpoints and information about processed
requests is recorded to the log. In the replay phase the queue is matched
with events read from the log. Therefore block device requests are
processed deterministically.
Snapshotting
------------

New VM snapshots may be created in replay mode. They can be used later
to recover the desired VM state. All VM states created in replay mode
are associated with a moment of time in the replay scenario.
After recovering the VM state, replay will start from that position.

The default starting snapshot name may be specified with the icount field
rrsnapshot as follows:
 -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name

This snapshot is created at the start of recording and restored at the
start of replaying. It can also be loaded while replaying to roll back
the execution.

The 'snapshot' flag of the disk image must be removed to save the snapshots
in the overlay (or the original image) instead of using the temporary
overlay:
 -drive file=disk.ovl,if=none,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

Use the QEMU monitor to create additional snapshots. The 'savevm <name>'
command creates a snapshot and 'loadvm <name>' restores it. To prevent
corruption of the original disk image, use overlay files linked to the
original images. This way all new snapshots (including the starting one)
will be saved in the overlays and the original image remains unchanged.

When you need to use snapshots with a diskless virtual machine,
it must be started with an 'orphan' qcow2 image. This image will be used
for storing VM snapshots. Here is an example of the command line for this:

  qemu-system-i386 -icount shift=3,rr=replay,rrfile=record.bin,rrsnapshot=init \
        -net none -drive file=empty.qcow2,if=none,id=rr

The empty.qcow2 drive is not connected to any virtual block device and is
used for VM snapshots only.

Network devices
---------------

Record and replay for network interactions is performed with the network
filter. Each backend must have its own instance of the replay filter as
follows:
 -netdev user,id=net1 -device rtl8139,netdev=net1
 -object filter-replay,id=replay,netdev=net1

The replay network filter is used to record and replay network packets.
While recording the virtual machine this filter puts all packets coming
from the outer world into the log. In replay mode packets from the log are
injected into the network device. All interactions with the network backend
in replay mode are disabled.

Audio devices
-------------

Audio data is recorded and replayed automatically. The command line for
recording and replaying must contain identical specifications of audio
hardware, e.g.:
 -soundhw ac97

Serial ports
------------

Serial port input is recorded and replayed automatically. The command lines
for recording and replaying must contain an identical number of ports in
record and replay modes, but their backends may differ.
E.g., '-serial stdio' in record mode, and '-serial null' in replay mode.

Reverse debugging
-----------------

Reverse debugging allows "executing" the program in the reverse direction.
The GDB remote protocol supports "reverse step" and "reverse continue"
commands. The first one steps a single instruction backwards in time,
and the second one finds the last breakpoint in the past.

Recorded executions may be used to enable reverse debugging. QEMU can't
execute the code in the backwards direction, but it can load a snapshot and
replay forward to find the desired position or breakpoint.

The following GDB commands are supported:
 - reverse-stepi (or rsi) - step one instruction backwards
 - reverse-continue (or rc) - find the last breakpoint in the past

Reverse step loads the nearest snapshot and replays the execution until
the required instruction is met.

Reverse continue may include several passes of examining the execution
between the snapshots. Each of the passes includes the following steps:
 1. loading the snapshot
 2. replaying to examine the breakpoints
 3. if a breakpoint or watchpoint was met
    - loading the snapshot again
    - replaying to the required breakpoint
 4. else
    - proceeding to step 1 with an earlier snapshot

Therefore the use of reverse debugging requires at least one snapshot
created in advance. This can be done by omitting the 'snapshot' option
for the block drives and adding 'rrsnapshot' to both record and replay
command lines.
See the "Snapshotting" section to learn more about running record/replay
and creating the snapshot in these modes.

Replay log format
-----------------

The record/replay log consists of a header and a sequence of execution
events. The header includes a 4-byte replay version id and an 8-byte
reserved field.
Version is updated every time replay log format changes to prevent +using replay log created by another build of qemu. + +The sequence of the events describes virtual machine state changes. +It includes all non-deterministic inputs of VM, synchronization marks and +instruction counts used to correctly inject inputs at replay. + +Synchronization marks (checkpoints) are used for synchronizing qemu threads +that perform operations with virtual hardware. These operations may change +system's state (e.g., change some register or generate interrupt) and +therefore should execute synchronously with CPU thread. + +Every event in the log includes 1-byte event id and optional arguments. +When argument is an array, it is stored as 4-byte array length +and corresponding number of bytes with data. +Here is the list of events that are written into the log: + + - EVENT_INSTRUCTION. Instructions executed since last event. + Argument: 4-byte number of executed instructions. + - EVENT_INTERRUPT. Used to synchronize interrupt processing. + - EVENT_EXCEPTION. Used to synchronize exception handling. + - EVENT_ASYNC. This is a group of events. They are always processed + together with checkpoints. When such an event is generated, it is + stored in the queue and processed only when checkpoint occurs. + Every such event is followed by 1-byte checkpoint id and 1-byte + async event id from the following list: + - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes + callbacks that affect virtual machine state, but normally called + asynchronously. + Argument: 8-byte operation id. + - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains + parameters of keyboard and mouse input operations + (key press/release, mouse pointer movement). + Arguments: 9-16 bytes depending of input event. + - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. + - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input + initiated by the sender. + Arguments: 1-byte character device id. + Array with bytes were read. + - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize + operations with disk and flash drives with CPU. + Argument: 8-byte operation id. + - REPLAY_ASYNC_EVENT_NET. Incoming network packet. + Arguments: 1-byte network adapter id. + 4-byte packet flags. + Array with packet bytes. + - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, + e.g., by closing the window. + - EVENT_CHAR_WRITE. Used to synchronize character output operations. + Arguments: 4-byte output function return value. + 4-byte offset in the output array. + - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, + initiated by qemu. + Argument: Array with bytes that were read. + - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, + initiated by qemu. + Argument: 4-byte error code. + - EVENT_CLOCK + clock_id. Group of events for host clock read operations. + Argument: 8-byte clock value. + - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of + CPU, internal threads, and asynchronous input events. May be followed + by one or more EVENT_ASYNC events. + - EVENT_END. Last event in the log. diff --git a/docs/specs/acpi_cpu_hotplug.rst b/docs/specs/acpi_cpu_hotplug.rst new file mode 100644 index 000000000..351057c96 --- /dev/null +++ b/docs/specs/acpi_cpu_hotplug.rst @@ -0,0 +1,235 @@ +QEMU<->ACPI BIOS CPU hotplug interface +====================================== + +QEMU supports CPU hotplug via ACPI. 
This document +describes the interface between QEMU and the ACPI BIOS. + +ACPI BIOS GPE.2 handler is dedicated for notifying OS about CPU hot-add +and hot-remove events. + + +Legacy ACPI CPU hotplug interface registers +------------------------------------------- + +CPU present bitmap for: + +- ICH9-LPC (IO port 0x0cd8-0xcf7, 1-byte access) +- PIIX-PM (IO port 0xaf00-0xaf1f, 1-byte access) +- One bit per CPU. Bit position reflects corresponding CPU APIC ID. Read-only. +- The first DWORD in bitmap is used in write mode to switch from legacy + to modern CPU hotplug interface, write 0 into it to do switch. + +QEMU sets corresponding CPU bit on hot-add event and issues SCI +with GPE.2 event set. CPU present map is read by ACPI BIOS GPE.2 handler +to notify OS about CPU hot-add events. CPU hot-remove isn't supported. + + +Modern ACPI CPU hotplug interface registers +------------------------------------------- + +Register block base address: + +- ICH9-LPC IO port 0x0cd8 +- PIIX-PM IO port 0xaf00 + +Register block size: + +- ACPI_CPU_HOTPLUG_REG_LEN = 12 + +All accesses to registers described below, imply little-endian byte order. + +Reserved registers behavior: + +- write accesses are ignored +- read accesses return all bits set to 0. + +The last stored value in 'CPU selector' must refer to a possible CPU, otherwise + +- reads from any register return 0 +- writes to any other register are ignored until valid value is stored into it + +On QEMU start, 'CPU selector' is initialized to a valid value, on reset it +keeps the current value. + +Read access behavior +^^^^^^^^^^^^^^^^^^^^ + +offset [0x0-0x3] + Command data 2: (DWORD access) + + If value last stored in 'Command field' is: + + 0: + reads as 0x0 + 3: + upper 32 bits of architecture specific CPU ID value + other values: + reserved + +offset [0x4] + CPU device status fields: (1 byte access) + + bits: + + 0: + Device is enabled and may be used by guest + 1: + Device insert event, used to distinguish device for which + no device check event to OSPM was issued. + It's valid only when bit 0 is set. + 2: + Device remove event, used to distinguish device for which + no device eject request to OSPM was issued. Firmware must + ignore this bit. + 3: + reserved and should be ignored by OSPM + 4: + if set to 1, OSPM requests firmware to perform device eject. + 5-7: + reserved and should be ignored by OSPM + +offset [0x5-0x7] + reserved + +offset [0x8] + Command data: (DWORD access) + + If value last stored in 'Command field' is one of: + + 0: + contains 'CPU selector' value of a CPU with pending event[s] + 3: + lower 32 bits of architecture specific CPU ID value + (in x86 case: APIC ID) + otherwise: + contains 0 + +Write access behavior +^^^^^^^^^^^^^^^^^^^^^ + +offset [0x0-0x3] + CPU selector: (DWORD access) + + Selects active CPU device. All following accesses to other + registers will read/store data from/to selected CPU. + Valid values: [0 .. max_cpus) + +offset [0x4] + CPU device control fields: (1 byte access) + + bits: + + 0: + reserved, OSPM must clear it before writing to register. + 1: + if set to 1 clears device insert event, set by OSPM + after it has emitted device check event for the + selected CPU device + 2: + if set to 1 clears device remove event, set by OSPM + after it has emitted device eject request for the + selected CPU device. + 3: + if set to 1 initiates device eject, set by OSPM when it + triggers CPU device removal and calls _EJ0 method or by firmware + when bit #4 is set. 
In case bit #4 were set, it's cleared as + part of device eject. + 4: + if set to 1, OSPM hands over device eject to firmware. + Firmware shall issue device eject request as described above + (bit #3) and OSPM should not touch device eject bit (#3) in case + it's asked firmware to perform CPU device eject. + 5-7: + reserved, OSPM must clear them before writing to register + +offset[0x5] + Command field: (1 byte access) + + value: + + 0: + selects a CPU device with inserting/removing events and + following reads from 'Command data' register return + selected CPU ('CPU selector' value). + If no CPU with events found, the current 'CPU selector' doesn't + change and corresponding insert/remove event flags are not modified. + + 1: + following writes to 'Command data' register set OST event + register in QEMU + 2: + following writes to 'Command data' register set OST status + register in QEMU + 3: + following reads from 'Command data' and 'Command data 2' return + architecture specific CPU ID value for currently selected CPU. + other values: + reserved + +offset [0x6-0x7] + reserved + +offset [0x8] + Command data: (DWORD access) + + If last stored 'Command field' value is: + + 1: + stores value into OST event register + 2: + stores value into OST status register, triggers + ACPI_DEVICE_OST QMP event from QEMU to external applications + with current values of OST event and status registers. + other values: + reserved + +Typical usecases +---------------- + +(x86) Detecting and enabling modern CPU hotplug interface +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +QEMU starts with legacy CPU hotplug interface enabled. Detecting and +switching to modern interface is based on the 2 legacy CPU hotplug features: + +#. Writes into CPU bitmap are ignored. +#. CPU bitmap always has bit #0 set, corresponding to boot CPU. + +Use following steps to detect and enable modern CPU hotplug interface: + +#. Store 0x0 to the 'CPU selector' register, attempting to switch to modern mode +#. Store 0x0 to the 'CPU selector' register, to ensure valid selector value +#. Store 0x0 to the 'Command field' register +#. Read the 'Command data 2' register. + If read value is 0x0, the modern interface is enabled. + Otherwise legacy or no CPU hotplug interface available + +Get a cpu with pending event +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +#. Store 0x0 to the 'CPU selector' register. +#. Store 0x0 to the 'Command field' register. +#. Read the 'CPU device status fields' register. +#. If both bit #1 and bit #2 are clear in the value read, there is no CPU + with a pending event and selected CPU remains unchanged. +#. Otherwise, read the 'Command data' register. The value read is the + selector of the CPU with the pending event (which is already selected). + +Enumerate CPUs present/non present CPUs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +#. Set the present CPU count to 0. +#. Set the iterator to 0. +#. Store 0x0 to the 'CPU selector' register, to ensure that it's in + a valid state and that access to other registers won't be ignored. +#. Store 0x0 to the 'Command field' register to make 'Command data' + register return 'CPU selector' value of selected CPU +#. Read the 'CPU device status fields' register. +#. If bit #0 is set, increment the present CPU count. +#. Increment the iterator. +#. Store the iterator to the 'CPU selector' register. +#. Read the 'Command data' register. +#. If the value read is not zero, goto 05. +#. Otherwise store 0x0 to the 'CPU selector' register, to put it + into a valid state and exit. 
+ The iterator at this point equals "max_cpus". diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst new file mode 100644 index 000000000..68f1fbe0a --- /dev/null +++ b/docs/specs/acpi_hest_ghes.rst @@ -0,0 +1,110 @@ +APEI tables generating and CPER record +====================================== + +.. + Copyright (c) 2020 HUAWEI TECHNOLOGIES CO., LTD. + + This work is licensed under the terms of the GNU GPL, version 2 or later. + See the COPYING file in the top-level directory. + +Design Details +-------------- + +:: + + etc/acpi/tables etc/hardware_errors + ==================== =============================== + + +--------------------------+ +----------------------------+ + | | HEST | +--------->| error_block_address1 |------+ + | +--------------------------+ | +----------------------------+ | + | | GHES1 | | +------->| error_block_address2 |------+-+ + | +--------------------------+ | | +----------------------------+ | | + | | ................. | | | | .............. | | | + | | error_status_address-----+-+ | -----------------------------+ | | + | | ................. | | +--->| error_block_addressN |------+-+---+ + | | read_ack_register--------+-+ | | +----------------------------+ | | | + | | read_ack_preserve | +-+---+--->| read_ack_register1 | | | | + | | read_ack_write | | | +----------------------------+ | | | + + +--------------------------+ | +-+--->| read_ack_register2 | | | | + | | GHES2 | | | | +----------------------------+ | | | + + +--------------------------+ | | | | ............. | | | | + | | ................. | | | | +----------------------------+ | | | + | | error_status_address-----+---+ | | +->| read_ack_registerN | | | | + | | ................. | | | | +----------------------------+ | | | + | | read_ack_register--------+-----+ | | |Generic Error Status Block 1|<-----+ | | + | | read_ack_preserve | | | |-+------------------------+-+ | | + | | read_ack_write | | | | | CPER | | | | + + +--------------------------| | | | | CPER | | | | + | | ............... | | | | | .... | | | | + + +--------------------------+ | | | | CPER | | | | + | | GHESN | | | |-+------------------------+-| | | + + +--------------------------+ | | |Generic Error Status Block 2|<-------+ | + | | ................. | | | |-+------------------------+-+ | + | | error_status_address-----+-------+ | | | CPER | | | + | | ................. | | | | CPER | | | + | | read_ack_register--------+---------+ | | .... | | | + | | read_ack_preserve | | | CPER | | | + | | read_ack_write | +-+------------------------+-+ | + + +--------------------------+ | .......... | | + |----------------------------+ | + |Generic Error Status Block N |<----------+ + |-+-------------------------+-+ + | | CPER | | + | | CPER | | + | | .... | | + | | CPER | | + +-+-------------------------+-+ + + +(1) QEMU generates the ACPI HEST table. This table goes in the current + "etc/acpi/tables" fw_cfg blob. Each error source has different + notification types. + +(2) A new fw_cfg blob called "etc/hardware_errors" is introduced. QEMU + also needs to populate this blob. The "etc/hardware_errors" fw_cfg blob + contains an address registers table and an Error Status Data Block table. + +(3) The address registers table contains N Error Block Address entries + and N Read Ack Register entries. The size for each entry is 8-byte. + The Error Status Data Block table contains N Error Status Data Block + entries. The size for each entry is 4096(0x1000) bytes. 
The total size + for the "etc/hardware_errors" fw_cfg blob is (N * 8 * 2 + N * 4096) bytes. + N is the number of the kinds of hardware error sources. + +(4) QEMU generates the ACPI linker/loader script for the firmware. The + firmware pre-allocates memory for "etc/acpi/tables", "etc/hardware_errors" + and copies blob contents there. + +(5) QEMU generates N ADD_POINTER commands, which patch addresses in the + "error_status_address" fields of the HEST table with a pointer to the + corresponding "address registers" in the "etc/hardware_errors" blob. + +(6) QEMU generates N ADD_POINTER commands, which patch addresses in the + "read_ack_register" fields of the HEST table with a pointer to the + corresponding "read_ack_register" within the "etc/hardware_errors" blob. + +(7) QEMU generates N ADD_POINTER commands for the firmware, which patch + addresses in the "error_block_address" fields with a pointer to the + respective "Error Status Data Block" in the "etc/hardware_errors" blob. + +(8) QEMU defines a third and write-only fw_cfg blob which is called + "etc/hardware_errors_addr". Through that blob, the firmware can send back + the guest-side allocation addresses to QEMU. The "etc/hardware_errors_addr" + blob contains a 8-byte entry. QEMU generates a single WRITE_POINTER command + for the firmware. The firmware will write back the start address of + "etc/hardware_errors" blob to the fw_cfg file "etc/hardware_errors_addr". + +(9) When QEMU gets a SIGBUS from the kernel, QEMU writes CPER into corresponding + "Error Status Data Block", guest memory, and then injects platform specific + interrupt (in case of arm/virt machine it's Synchronous External Abort) as a + notification which is necessary for notifying the guest. + +(10) This notification (in virtual hardware) will be handled by the guest + kernel, on receiving notification, guest APEI driver could read the CPER error + and take appropriate action. + +(11) kvm_arch_on_sigbus_vcpu() uses source_id as index in "etc/hardware_errors" to + find out "Error Status Data Block" entry corresponding to error source. So supported + source_id values should be assigned here and not be changed afterwards to make sure + that guest will write error into expected "Error Status Data Block" even if guest was + migrated to a newer QEMU. diff --git a/docs/specs/acpi_hw_reduced_hotplug.rst b/docs/specs/acpi_hw_reduced_hotplug.rst new file mode 100644 index 000000000..0bd3f9399 --- /dev/null +++ b/docs/specs/acpi_hw_reduced_hotplug.rst @@ -0,0 +1,71 @@ +================================================== +QEMU and ACPI BIOS Generic Event Device interface +================================================== + +The ACPI *Generic Event Device* (GED) is a HW reduced platform +specific device introduced in ACPI v6.1 that handles all platform +events, including the hotplug ones. GED is modelled as a device +in the namespace with a _HID defined to be ACPI0013. This document +describes the interface between QEMU and the ACPI BIOS. + +GED allows HW reduced platforms to handle interrupts in ACPI ASL +statements. It follows a very similar approach to the _EVT method +from GPIO events. All interrupts are listed in _CRS and the handler +is written in _EVT method. However, the QEMU implementation uses a +single interrupt for the GED device, relying on an IO memory region +to communicate the type of device affected by the interrupt. This way, +we can support up to 32 events with a unique interrupt. 
+ +**Here is an example,** + +:: + + Device (\_SB.GED) + { + Name (_HID, "ACPI0013") + Name (_UID, Zero) + Name (_CRS, ResourceTemplate () + { + Interrupt (ResourceConsumer, Edge, ActiveHigh, Exclusive, ,, ) + { + 0x00000029, + } + }) + OperationRegion (EREG, SystemMemory, 0x09080000, 0x04) + Field (EREG, DWordAcc, NoLock, WriteAsZeros) + { + ESEL, 32 + } + Method (_EVT, 1, Serialized) + { + Local0 = ESEL // ESEL = IO memory region which specifies the + // device type. + If (((Local0 & One) == One)) + { + MethodEvent1() + } + If ((Local0 & 0x2) == 0x2) + { + MethodEvent2() + } + ... + } + } + +GED IO interface (4 byte access) +-------------------------------- +**read access:** + +:: + + [0x0-0x3] Event selector bit field (32 bit) set by QEMU. + + bits: + 0: Memory hotplug event + 1: System power down event + 2: NVDIMM hotplug event + 3-31: Reserved + +**write_access:** + +Nothing is expected to be written into GED IO memory diff --git a/docs/specs/acpi_mem_hotplug.rst b/docs/specs/acpi_mem_hotplug.rst new file mode 100644 index 000000000..069819bc3 --- /dev/null +++ b/docs/specs/acpi_mem_hotplug.rst @@ -0,0 +1,128 @@ +QEMU<->ACPI BIOS memory hotplug interface +========================================= + +ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add +and hot-remove events. + +Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access) +---------------------------------------------------------------- + +Read access behavior +^^^^^^^^^^^^^^^^^^^^ + +[0x0-0x3] + Lo part of memory device phys address +[0x4-0x7] + Hi part of memory device phys address +[0x8-0xb] + Lo part of memory device size in bytes +[0xc-0xf] + Hi part of memory device size in bytes +[0x10-0x13] + Memory device proximity domain +[0x14] + Memory device status fields + + bits: + + 0: + Device is enabled and may be used by guest + 1: + Device insert event, used to distinguish device for which + no device check event to OSPM was issued. + It's valid only when bit 1 is set. + 2: + Device remove event, used to distinguish device for which + no device eject request to OSPM was issued. + 3-7: + reserved and should be ignored by OSPM + +[0x15-0x17] + reserved + +Write access behavior +^^^^^^^^^^^^^^^^^^^^^ + + +[0x0-0x3] + Memory device slot selector, selects active memory device. + All following accesses to other registers in 0xa00-0xa17 + region will read/store data from/to selected memory device. +[0x4-0x7] + OST event code reported by OSPM +[0x8-0xb] + OST status code reported by OSPM +[0xc-0x13] + reserved, writes into it are ignored +[0x14] + Memory device control fields + + bits: + + 0: + reserved, OSPM must clear it before writing to register. + Due to BUG in versions prior 2.4 that field isn't cleared + when other fields are written. Keep it reserved and don't + try to reuse it. + 1: + if set to 1 clears device insert event, set by OSPM + after it has emitted device check event for the + selected memory device + 2: + if set to 1 clears device remove event, set by OSPM + after it has emitted device eject request for the + selected memory device + 3: + if set to 1 initiates device eject, set by OSPM when it + triggers memory device removal and calls _EJ0 method + 4-7: + reserved, OSPM must clear them before writing to register + +Selecting memory device slot beyond present range has no effect on platform: + +- write accesses to memory hot-plug registers not documented above are ignored +- read accesses to memory hot-plug registers not documented above return + all bits set to 1. 
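To make the register protocol concrete, here is a hedged guest-side sketch
that probes one slot through this register block. Real consumers of this
interface are ACPI AML methods, not C code, and the names below are
invented for the sketch::

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/io.h>   /* inb()/outl(); needs ioperm()/iopl()     */

    #define MEM_HP_BASE    0xa00
    #define MEM_HP_STATUS  (MEM_HP_BASE + 0x14)

    /* Select a memory device slot and test its "enabled" bit.
     * Per the note above, selecting a slot beyond the present range
     * makes the status register read as all-ones. */
    static bool memory_slot_enabled(uint32_t slot)
    {
        outl(slot, MEM_HP_BASE);          /* slot selector, [0x0-0x3] */
        return inb(MEM_HP_STATUS) & 0x1;  /* bit 0: device enabled    */
    }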
+ +Memory hot remove process diagram +--------------------------------- + +:: + + +-------------+ +-----------------------+ +------------------+ + | 1. QEMU | | 2. QEMU | |3. QEMU | + | device_del +---->+ device unplug request +----->+Send SCI to guest,| + | | | cb | |return control to | + | | | | |management | + +-------------+ +-----------------------+ +------------------+ + + +---------------------------------------------------------------------+ + + +---------------------+ +-------------------------+ + | OSPM: | remove event | OSPM: | + | send Eject Request, | | Scan memory devices | + | clear remove event +<-------------+ for event flags | + | | | | + +---------------------+ +-------------------------+ + | + | + +---------v--------+ +-----------------------+ + | Guest OS: | success | OSPM: | + | process Ejection +----------->+ Execute _EJ0 method, | + | request | | set eject bit in flags| + +------------------+ +-----------------------+ + |failure | + v v + +------------------------+ +-----------------------+ + | OSPM: | | QEMU: | + | set OST event & status | | call device unplug cb | + | fields | | | + +------------------------+ +-----------------------+ + | | + v v + +------------------+ +-------------------+ + |QEMU: | |QEMU: | + |Send OST QMP event| |Send device deleted| + | | |QMP event | + +------------------+ | | + +-------------------+ diff --git a/docs/specs/acpi_nvdimm.rst b/docs/specs/acpi_nvdimm.rst new file mode 100644 index 000000000..ab0335253 --- /dev/null +++ b/docs/specs/acpi_nvdimm.rst @@ -0,0 +1,228 @@ +QEMU<->ACPI BIOS NVDIMM interface +================================= + +QEMU supports NVDIMM via ACPI. This document describes the basic concepts of +NVDIMM ACPI and the interface between QEMU and the ACPI BIOS. + +NVDIMM ACPI Background +---------------------- + +NVDIMM is introduced in ACPI 6.0 which defines an NVDIMM root device under +_SB scope with a _HID of "ACPI0012". For each NVDIMM present or intended +to be supported by platform, platform firmware also exposes an ACPI +Namespace Device under the root device. + +The NVDIMM child devices under the NVDIMM root device are defined with _ADR +corresponding to the NFIT device handle. The NVDIMM root device and the +NVDIMM devices can have device specific methods (_DSM) to provide additional +functions specific to a particular NVDIMM implementation. + +This is an example from ACPI 6.0, a platform contains one NVDIMM:: + + Scope (\_SB){ + Device (NVDR) // Root device + { + Name (_HID, "ACPI0012") + Method (_STA) {...} + Method (_FIT) {...} + Method (_DSM, ...) {...} + Device (NVD) + { + Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM + Method (_DSM, ...) {...} + } + } + } + +Methods supported on both NVDIMM root device and NVDIMM device +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +_DSM (Device Specific Method) + It is a control method that enables devices to provide device specific + control functions that are consumed by the device driver. + The NVDIMM DSM specification can be found at + http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf + + Arguments: + + Arg0 + A Buffer containing a UUID (16 Bytes) + Arg1 + An Integer containing the Revision ID (4 Bytes) + Arg2 + An Integer containing the Function Index (4 Bytes) + Arg3 + A package containing parameters for the function specified by the + UUID, Revision ID, and Function Index + + Return Value: + + If Function Index = 0, a Buffer containing a function index bitfield. 
+ Otherwise, the return value and type depends on the UUID, revision ID + and function index which are described in the DSM specification. + +Methods on NVDIMM ROOT Device +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +_FIT(Firmware Interface Table) + It evaluates to a buffer returning data in the format of a series of NFIT + Type Structure. + + Arguments: None + + Return Value: + A Buffer containing a list of NFIT Type structure entries. + + The detailed definition of the structure can be found at ACPI 6.0: 5.2.25 + NVDIMM Firmware Interface Table (NFIT). + +QEMU NVDIMM Implementation +-------------------------- + +QEMU uses 4 bytes IO Port starting from 0x0a18 and a RAM-based memory page +for NVDIMM ACPI. + +Memory: + QEMU uses BIOS Linker/loader feature to ask BIOS to allocate a memory + page and dynamically patch its address into an int32 object named "MEMA" + in ACPI. + + This page is RAM-based and it is used to transfer data between _DSM + method and QEMU. If ACPI has control, this pages is owned by ACPI which + writes _DSM input data to it, otherwise, it is owned by QEMU which + emulates _DSM access and writes the output data to it. + + ACPI writes _DSM Input Data (based on the offset in the page): + + [0x0 - 0x3] + 4 bytes, NVDIMM Device Handle. + + The handle is completely QEMU internal thing, the values in + range [1, 0xFFFF] indicate nvdimm device. Other values are + reserved for other purposes. + + Reserved handles: + + - 0 is reserved for nvdimm root device named NVDR. + - 0x10000 is reserved for QEMU internal DSM function called on + the root device. + + [0x4 - 0x7] + 4 bytes, Revision ID, that is the Arg1 of _DSM method. + + [0x8 - 0xB] + 4 bytes. Function Index, that is the Arg2 of _DSM method. + + [0xC - 0xFFF] + 4084 bytes, the Arg3 of _DSM method. + + QEMU writes Output Data (based on the offset in the page): + + [0x0 - 0x3] + 4 bytes, the length of result + + [0x4 - 0xFFF] + 4092 bytes, the DSM result filled by QEMU + +IO Port 0x0a18 - 0xa1b: + ACPI writes the address of the memory page allocated by BIOS to this + port then QEMU gets the control and fills the result in the memory page. + + Write Access: + + [0x0a18 - 0xa1b] + 4 bytes, the address of the memory page allocated by BIOS. + +_DSM process diagram +-------------------- + +"MEMA" indicates the address of memory page allocated by BIOS. + +:: + + +----------------------+ +-----------------------+ + | 1. OSPM | | 2. OSPM | + | save _DSM input data | | write "MEMA" to | Exit to QEMU + | to the page +----->| IO port 0x0a18 +------------+ + | indicated by "MEMA" | | | | + +----------------------+ +-----------------------+ | + | + v + +--------------------+ +-----------+ +------------------+--------+ + | 5 QEMU | | 4 QEMU | | 3. QEMU | + | write _DSM result | | emulate | | get _DSM input data from | + | to the page +<------+ _DSM +<-----+ the page indicated by the | + | | | | | value from the IO port | + +--------+-----------+ +-----------+ +---------------------------+ + | + | Enter Guest + | + v + +--------------------------+ +--------------+ + | 6 OSPM | | 7 OSPM | + | result size is returned | | _DSM return | + | by reading DSM +----->+ | + | result from the page | | | + +--------------------------+ +--------------+ + +NVDIMM hotplug +-------------- + +ACPI BIOS GPE.4 handler is dedicated for notifying OS about nvdimm device +hot-add event. 
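Viewed from C, the DSM page layout described in the "QEMU NVDIMM
Implementation" section above amounts to the following; the struct and
field names are invented for this sketch::

    #include <stdint.h>

    struct nvdimm_dsm_in {      /* filled by ACPI before the IO write */
        uint32_t handle;        /* [0x0-0x3] NVDIMM device handle     */
        uint32_t revision;      /* [0x4-0x7] _DSM Arg1                */
        uint32_t function;      /* [0x8-0xB] _DSM Arg2                */
        uint8_t  arg3[4084];    /* [0xC-0xFFF] _DSM Arg3 payload      */
    };

    struct nvdimm_dsm_out {     /* filled by QEMU after the IO write  */
        uint32_t len;           /* [0x0-0x3] length of the result     */
        uint8_t  data[4092];    /* [0x4-0xFFF] the DSM result         */
    };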
+ +QEMU internal use only _DSM functions +------------------------------------- + +Read FIT +^^^^^^^^ + +_FIT method uses _DSM method to fetch NFIT structures blob from QEMU +in 1 page sized increments which are then concatenated and returned +as _FIT method result. + +Input parameters: + +Arg0 + UUID {set to 648B9CF2-CDA1-4312-8AD9-49C4AF32BD62} +Arg1 + Revision ID (set to 1) +Arg2 + Function Index, 0x1 +Arg3 + A package containing a buffer whose layout is as follows: + + +----------+--------+--------+-------------------------------------------+ + | Field | Length | Offset | Description | + +----------+--------+--------+-------------------------------------------+ + | offset | 4 | 0 | offset in QEMU's NFIT structures blob to | + | | | | read from | + +----------+--------+--------+-------------------------------------------+ + +Output layout in the dsm memory page: + + +----------+--------+--------+-------------------------------------------+ + | Field | Length | Offset | Description | + +----------+--------+--------+-------------------------------------------+ + | length | 4 | 0 | length of entire returned data | + | | | | (including this header) | + +----------+--------+--------+-------------------------------------------+ + | | | | return status codes | + | | | | | + | | | | - 0x0 - success | + | | | | - 0x100 - error caused by NFIT update | + | status | 4 | 4 | while read by _FIT wasn't completed | + | | | | - other codes follow Chapter 3 in | + | | | | DSM Spec Rev1 | + +----------+--------+--------+-------------------------------------------+ + | fit data | Varies | 8 | contains FIT data. This field is present | + | | | | if status field is 0. | + +----------+--------+--------+-------------------------------------------+ + +The FIT offset is maintained by the OSPM itself, current offset plus +the size of the fit data returned by the function is the next offset +OSPM should read. When all FIT data has been read out, zero fit data +size is returned. + +If it returns status code 0x100, OSPM should restart to read FIT (read +from offset 0 again). diff --git a/docs/specs/acpi_pci_hotplug.rst b/docs/specs/acpi_pci_hotplug.rst new file mode 100644 index 000000000..685bc5c32 --- /dev/null +++ b/docs/specs/acpi_pci_hotplug.rst @@ -0,0 +1,48 @@ +QEMU<->ACPI BIOS PCI hotplug interface +====================================== + +QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document +describes the interface between QEMU and the ACPI BIOS. + +ACPI GPE block (IO ports 0xafe0-0xafe3, byte access) +---------------------------------------------------- + +Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject +event to ACPI BIOS, via SCI interrupt. + +PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access) +------------------------------------------------------------------------------ + +Slot injection notification pending. One bit per slot. + +Read by ACPI BIOS GPE.1 handler to notify OS of injection +events. Read-only. + +PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access) +-------------------------------------------------------------------- + +Slot removal notification pending. One bit per slot. + +Read by ACPI BIOS GPE.1 handler to notify OS of removal +events. Read-only. + +PCI device eject (IO port 0xae08-0xae0b, 4-byte access) +------------------------------------------------------- + +Write: Used by ACPI BIOS _EJ0 method to request device removal. +One bit per slot. + +Read: Hotplug features register. 
Used by platform to identify features +available. Current base feature set (no bits set): + +- Read-only "up" register @0xae00, 4-byte access, bit per slot +- Read-only "down" register @0xae04, 4-byte access, bit per slot +- Read/write "eject" register @0xae08, 4-byte access, + write: bit per slot eject, read: hotplug feature set +- Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot + +PCI removability status (IO port 0xae0c-0xae0f, 4-byte access) +-------------------------------------------------------------- + +Used by ACPI BIOS _RMV method to indicate removability status to OS. One +bit per slot. Read-only. diff --git a/docs/specs/edu.txt b/docs/specs/edu.txt new file mode 100644 index 000000000..087631080 --- /dev/null +++ b/docs/specs/edu.txt @@ -0,0 +1,115 @@ + +EDU device +========== + +Copyright (c) 2014-2015 Jiri Slaby + +This document is licensed under the GPLv2 (or later). + +This is an educational device for writing (kernel) drivers. Its original +intention was to support the Linux kernel lectures taught at the Masaryk +University. Students are given this virtual device and are expected to write a +driver with I/Os, IRQs, DMAs and such. + +The devices behaves very similar to the PCI bridge present in the COMBO6 cards +developed under the Liberouter wings. Both PCI device ID and PCI space is +inherited from that device. + +Command line switches: + -device edu[,dma_mask=mask] + + dma_mask makes the virtual device work with DMA addresses with the given + mask. For educational purposes, the device supports only 28 bits (256 MiB) + by default. Students shall set dma_mask for the device in the OS driver + properly. + +PCI specs +--------- + +PCI ID: 1234:11e8 + +PCI Region 0: + I/O memory, 1 MB in size. Users are supposed to communicate with the card + through this memory. + +MMIO area spec +-------------- + +Only size == 4 accesses are allowed for addresses < 0x80. size == 4 or +size == 8 for the rest. + +0x00 (RO) : identification (0xRRrr00edu) + RR -- major version + rr -- minor version + +0x04 (RW) : card liveness check + It is a simple value inversion (~ C operator). + +0x08 (RW) : factorial computation + The stored value is taken and factorial of it is put back here. + This happens only after factorial bit in the status register (0x20 + below) is cleared. + +0x20 (RW) : status register, bitwise OR + 0x01 -- computing factorial (RO) + 0x80 -- raise interrupt after finishing factorial computation + +0x24 (RO) : interrupt status register + It contains values which raised the interrupt (see interrupt raise + register below). + +0x60 (WO) : interrupt raise register + Raise an interrupt. The value will be put to the interrupt status + register (using bitwise OR). + +0x64 (WO) : interrupt acknowledge register + Clear an interrupt. The value will be cleared from the interrupt + status register. This needs to be done from the ISR to stop + generating interrupts. + +0x80 (RW) : DMA source address + Where to perform the DMA from. + +0x88 (RW) : DMA destination address + Where to perform the DMA to. + +0x90 (RW) : DMA transfer count + The size of the area to perform the DMA on. + +0x98 (RW) : DMA command register, bitwise OR + 0x01 -- start transfer + 0x02 -- direction (0: from RAM to EDU, 1: from EDU to RAM) + 0x04 -- raise interrupt 0x100 after finishing the DMA + +IRQ controller +-------------- +An IRQ is generated when written to the interrupt raise register. 
The value +appears in interrupt status register when the interrupt is raised and has to +be written to the interrupt acknowledge register to lower it. + +The device supports both INTx and MSI interrupt. By default, INTx is +used. Even if the driver disabled INTx and only uses MSI, it still +needs to update the acknowledge register at the end of the IRQ handler +routine. + +DMA controller +-------------- +One has to specify, source, destination, size, and start the transfer. One +4096 bytes long buffer at offset 0x40000 is available in the EDU device. I.e. +one can perform DMA to/from this space when programmed properly. + +Example of transferring a 100 byte block to and from the buffer using a given +PCI address 'addr': +addr -> DMA source address +0x40000 -> DMA destination address +100 -> DMA transfer count +1 -> DMA command register +while (DMA command register & 1) + ; + +0x40000 -> DMA source address +addr+100 -> DMA destination address +100 -> DMA transfer count +3 -> DMA command register +while (DMA command register & 1) + ; diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.txt new file mode 100644 index 000000000..3e6d586f6 --- /dev/null +++ b/docs/specs/fw_cfg.txt @@ -0,0 +1,265 @@ +QEMU Firmware Configuration (fw_cfg) Device +=========================================== + += Guest-side Hardware Interface = + +This hardware interface allows the guest to retrieve various data items +(blobs) that can influence how the firmware configures itself, or may +contain tables to be installed for the guest OS. Examples include device +boot order, ACPI and SMBIOS tables, virtual machine UUID, SMP and NUMA +information, kernel/initrd images for direct (Linux) kernel booting, etc. + +== Selector (Control) Register == + +* Write only +* Location: platform dependent (IOport or MMIO) +* Width: 16-bit +* Endianness: little-endian (if IOport), or big-endian (if MMIO) + +A write to this register sets the index of a firmware configuration +item which can subsequently be accessed via the data register. + +Setting the selector register will cause the data offset to be set +to zero. The data offset impacts which data is accessed via the data +register, and is explained below. + +Bit14 of the selector register indicates whether the configuration +setting is being written. A value of 0 means the item is only being +read, and all write access to the data port will be ignored. A value +of 1 means the item's data can be overwritten by writes to the data +register. In other words, configuration write mode is enabled when +the selector value is between 0x4000-0x7fff or 0xc000-0xffff. + +NOTE: As of QEMU v2.4, writes to the fw_cfg data register are no + longer supported, and will be ignored (treated as no-ops)! + +NOTE: As of QEMU v2.9, writes are reinstated, but only through the DMA + interface (see below). Furthermore, writeability of any specific item is + governed independently of Bit14 in the selector key value. + +Bit15 of the selector register indicates whether the configuration +setting is architecture specific. A value of 0 means the item is a +generic configuration item. A value of 1 means the item is specific +to a particular architecture. In other words, generic configuration +items are accessed with a selector value between 0x0000-0x7fff, and +architecture specific configuration items are accessed with a selector +value between 0x8000-0xffff. 
+ +== Data Register == + +* Read/Write (writes ignored as of QEMU v2.4, but see the DMA interface) +* Location: platform dependent (IOport [*] or MMIO) +* Width: 8-bit (if IOport), 8/16/32/64-bit (if MMIO) +* Endianness: string-preserving + +[*] On platforms where the data register is exposed as an IOport, its +port number will always be one greater than the port number of the +selector register. In other words, the two ports overlap, and can not +be mapped separately. + +The data register allows access to an array of bytes for each firmware +configuration data item. The specific item is selected by writing to +the selector register, as described above. + +Initially following a write to the selector register, the data offset +will be set to zero. Each successful access to the data register will +increment the data offset by the appropriate access width. + +Each firmware configuration item has a maximum length of data +associated with the item. After the data offset has passed the +end of this maximum data length, then any reads will return a data +value of 0x00, and all writes will be ignored. + +An N-byte wide read of the data register will return the next available +N bytes of the selected firmware configuration item, as a substring, in +increasing address order, similar to memcpy(). + +== Register Locations == + +=== x86, x86_64 Register Locations === + +Selector Register IOport: 0x510 +Data Register IOport: 0x511 +DMA Address IOport: 0x514 + +=== Arm Register Locations === + +Selector Register address: Base + 8 (2 bytes) +Data Register address: Base + 0 (8 bytes) +DMA Address address: Base + 16 (8 bytes) + +== ACPI Interface == + +The fw_cfg device is defined with ACPI ID "QEMU0002". Since we expect +ACPI tables to be passed into the guest through the fw_cfg device itself, +the guest-side firmware can not use ACPI to find fw_cfg. However, once the +firmware is finished setting up ACPI tables and hands control over to the +guest kernel, the latter can use the fw_cfg ACPI node for a more accurate +inventory of in-use IOport or MMIO regions. + +== Firmware Configuration Items == + +=== Signature (Key 0x0000, FW_CFG_SIGNATURE) === + +The presence of the fw_cfg selector and data registers can be verified +by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE), +and reading four bytes from the data register. If the fw_cfg device is +present, the four bytes read will contain the characters "QEMU". + +If the DMA interface is available, then reading the DMA Address +Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format). + +=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) === + +A 32-bit little-endian unsigned int, this item is used to check for enabled +features. + - Bit 0: traditional interface. Always set. + - Bit 1: DMA interface. + +=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) === + +Firmware configuration items stored at selector keys 0x0020 or higher +(FW_CFG_FILE_FIRST or higher) have an associated entry in a directory +structure, which makes it easier for guest-side firmware to identify +and retrieve them. 
The format of this file directory (from fw_cfg.h in +the QEMU source tree) is shown here, slightly annotated for clarity: + +struct FWCfgFiles { /* the entire file directory fw_cfg item */ + uint32_t count; /* number of entries, in big-endian format */ + struct FWCfgFile f[]; /* array of file entries, see below */ +}; + +struct FWCfgFile { /* an individual file entry, 64 bytes total */ + uint32_t size; /* size of referenced fw_cfg item, big-endian */ + uint16_t select; /* selector key of fw_cfg item, big-endian */ + uint16_t reserved; + char name[56]; /* fw_cfg item name, NUL-terminated ascii */ +}; + +=== All Other Data Items === + +Please consult the QEMU source for the most up-to-date and authoritative list +of selector keys and their respective items' purpose, format and writeability. + +=== Ranges === + +Theoretically, there may be up to 0x4000 generic firmware configuration +items, and up to 0x4000 architecturally specific ones. + +Selector Reg. Range Usage +--------------- ----------- +0x0000 - 0x3fff Generic (0x0000 - 0x3fff, generally RO, possibly RW through + the DMA interface in QEMU v2.9+) +0x4000 - 0x7fff Generic (0x0000 - 0x3fff, RW, ignored in QEMU v2.4+) +0x8000 - 0xbfff Arch. Specific (0x0000 - 0x3fff, generally RO, possibly RW + through the DMA interface in QEMU v2.9+) +0xc000 - 0xffff Arch. Specific (0x0000 - 0x3fff, RW, ignored in v2.4+) + +In practice, the number of allowed firmware configuration items depends on the +machine type/version. + += Guest-side DMA Interface = + +If bit 1 of the feature bitmap is set, the DMA interface is present. This does +not replace the existing fw_cfg interface, it is an add-on. This interface +can be used through the 64-bit wide address register. + +The address register is in big-endian format. The value for the register is 0 +at startup and after an operation. A write to the least significant half (at +offset 4) triggers an operation. This means that operations with 32-bit +addresses can be triggered with just one write, whereas operations with +64-bit addresses can be triggered with one 64-bit write or two 32-bit writes, +starting with the most significant half (at offset 0). + +In this register, the physical address of a FWCfgDmaAccess structure in RAM +should be written. This is the format of the FWCfgDmaAccess structure: + +typedef struct FWCfgDmaAccess { + uint32_t control; + uint32_t length; + uint64_t address; +} FWCfgDmaAccess; + +The fields of the structure are in big endian mode, and the field at the lowest +address is the "control" field. + +The "control" field has the following bits: + - Bit 0: Error + - Bit 1: Read + - Bit 2: Skip + - Bit 3: Select. The upper 16 bits are the selected index. + - Bit 4: Write + +When an operation is triggered, if the "control" field has bit 3 set, the +upper 16 bits are interpreted as an index of a firmware configuration item. +This has the same effect as writing the selector register. + +If the "control" field has bit 1 set, a read operation will be performed. +"length" bytes for the current selector and offset will be copied into the +physical RAM address specified by the "address" field. + +If the "control" field has bit 4 set (and not bit 1), a write operation will be +performed. "length" bytes will be copied from the physical RAM address +specified by the "address" field to the current selector and offset. QEMU +prevents starting or finishing the write beyond the end of the item associated +with the current selector (i.e., the item cannot be resized). 
Truncated writes +are dropped entirely. Writes to read-only items are also rejected. All of these +write errors set bit 0 (the error bit) in the "control" field. + +If the "control" field has bit 2 set (and neither bit 1 nor bit 4), a skip +operation will be performed. The offset for the current selector will be +advanced "length" bytes. + +To check the result, read the "control" field: + error bit set -> something went wrong. + all bits cleared -> transfer finished successfully. + otherwise -> transfer still in progress (doesn't happen + today due to implementation not being async, + but may in the future). + += Externally Provided Items = + +Since v2.4, "file" fw_cfg items (i.e., items with selector keys above +FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file +directory structure) may be inserted via the QEMU command line, using +the following syntax: + + -fw_cfg [name=]<item_name>,file=<path> + +Or + + -fw_cfg [name=]<item_name>,string=<string> + +Since v5.1, QEMU allows some objects to generate fw_cfg-specific content, +the content is then associated with a "file" item using the 'gen_id' option +in the command line, using the following syntax: + + -object <generator-type>,id=<generated_id>,[generator-specific-options] \ + -fw_cfg [name=]<item_name>,gen_id=<generated_id> + +See QEMU man page for more documentation. + +Using item_name with plain ASCII characters only is recommended. + +Item names beginning with "opt/" are reserved for users. QEMU will +never create entries with such names unless explicitly ordered by the +user. + +To avoid clashes among different users, it is strongly recommended +that you use names beginning with opt/RFQDN/, where RFQDN is a reverse +fully qualified domain name you control. For instance, if SeaBIOS +wanted to define additional names, the prefix "opt/org.seabios/" would +be appropriate. + +For historical reasons, "opt/ovmf/" is reserved for OVMF firmware. + +Prefix "opt/org.qemu/" is reserved for QEMU itself. + +Use of names not beginning with "opt/" is potentially dangerous and +entirely unsupported. QEMU will warn if you try. + +Use of names not beginning with "opt/" is tolerated with 'gen_id' (that +is, the warning is suppressed), but you must know exactly what you're +doing. + +All externally provided fw_cfg items are read-only to the guest. diff --git a/docs/specs/index.rst b/docs/specs/index.rst new file mode 100644 index 000000000..ecc43896b --- /dev/null +++ b/docs/specs/index.rst @@ -0,0 +1,20 @@ +---------------------------------------------- +System Emulation Guest Hardware Specifications +---------------------------------------------- + +This section of the manual contains specifications of +guest hardware that is specific to QEMU. + +.. toctree:: + :maxdepth: 2 + + ppc-xive + ppc-spapr-xive + ppc-spapr-numa + acpi_hw_reduced_hotplug + tpm + acpi_hest_ghes + acpi_cpu_hotplug + acpi_mem_hotplug + acpi_pci_hotplug + acpi_nvdimm diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt new file mode 100644 index 000000000..1beb3a01e --- /dev/null +++ b/docs/specs/ivshmem-spec.txt @@ -0,0 +1,258 @@ += Device Specification for Inter-VM shared memory device = + +The Inter-VM shared memory device (ivshmem) is designed to share a +memory region between multiple QEMU processes running different guests +and the host. In order for all guests to be able to pick up the +shared memory area, it is modeled by QEMU as a PCI device exposing +said memory to the guest as a PCI BAR. 
+ +The device can use a shared memory object on the host directly, or it +can obtain one from an ivshmem server. + +In the latter case, the device can additionally interrupt its peers, and +get interrupted by its peers. + + +== Configuring the ivshmem PCI device == + +There are two basic configurations: + +- Just shared memory: + + -device ivshmem-plain,memdev=HMB,... + + This uses host memory backend HMB. It should have option "share" + set. + +- Shared memory plus interrupts: + + -device ivshmem-doorbell,chardev=CHR,vectors=N,... + + An ivshmem server must already be running on the host. The device + connects to the server's UNIX domain socket via character device + CHR. + + Each peer gets assigned a unique ID by the server. IDs must be + between 0 and 65535. + + Interrupts are message-signaled (MSI-X). vectors=N configures the + number of vectors to use. + +For more details on ivshmem device properties, see the QEMU Emulator +user documentation. + + +== The ivshmem PCI device's guest interface == + +The device has vendor ID 1af4, device ID 1110, revision 1. Before +QEMU 2.6.0, it had revision 0. + +=== PCI BARs === + +The ivshmem PCI device has two or three BARs: + +- BAR0 holds device registers (256 Byte MMIO) +- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell) +- BAR2 maps the shared memory object + +There are two ways to use this device: + +- If you only need the shared memory part, BAR2 suffices. This way, + you have access to the shared memory in the guest and can use it as + you see fit. Memnic, for example, uses ivshmem this way from guest + user space (see http://dpdk.org/browse/memnic). + +- If you additionally need the capability for peers to interrupt each + other, you need BAR0 and BAR1. You will most likely want to write a + kernel driver to handle interrupts. Requires the device to be + configured for interrupts, obviously. + +Before QEMU 2.6.0, BAR2 can initially be invalid if the device is +configured for interrupts. It becomes safely accessible only after +the ivshmem server provided the shared memory. These devices have PCI +revision 0 rather than 1. Guest software should wait for the +IVPosition register (described below) to become non-negative before +accessing BAR2. + +Revision 0 of the device is not capable to tell guest software whether +it is configured for interrupts. + +=== PCI device registers === + +BAR 0 contains the following registers: + + Offset Size Access On reset Function + 0 4 read/write 0 Interrupt Mask + bit 0: peer interrupt (rev 0) + reserved (rev 1) + bit 1..31: reserved + 4 4 read/write 0 Interrupt Status + bit 0: peer interrupt (rev 0) + reserved (rev 1) + bit 1..31: reserved + 8 4 read-only 0 or ID IVPosition + 12 4 write-only N/A Doorbell + bit 0..15: vector + bit 16..31: peer ID + 16 240 none N/A reserved + +Software should only access the registers as specified in column +"Access". Reserved bits should be ignored on read, and preserved on +write. + +In revision 0 of the device, Interrupt Status and Mask Register +together control the legacy INTx interrupt when the device has no +MSI-X capability: INTx is asserted when the bit-wise AND of Status and +Mask is non-zero and the device has no MSI-X capability. Interrupt +Status Register bit 0 becomes 1 when an interrupt request from a peer +is received. Reading the register clears it. + +IVPosition Register: if the device is not configured for interrupts, +this is zero. Else, it is the device's ID (between 0 and 65535). 
+ +Before QEMU 2.6.0, the register may read -1 for a short while after +reset. These devices have PCI revision 0 rather than 1. + +There is no good way for software to find out whether the device is +configured for interrupts. A positive IVPosition means interrupts, +but zero could be either. + +Doorbell Register: writing this register requests to interrupt a peer. +The written value's high 16 bits are the ID of the peer to interrupt, +and its low 16 bits select an interrupt vector. + +If the device is not configured for interrupts, the write is ignored. + +If the interrupt hasn't completed setup, the write is ignored. The +device is not capable to tell guest software whether setup is +complete. Interrupts can regress to this state on migration. + +If the peer with the requested ID isn't connected, or it has fewer +interrupt vectors connected, the write is ignored. The device is not +capable to tell guest software what peers are connected, or how many +interrupt vectors are connected. + +The peer's interrupt for this vector then becomes pending. There is +no way for software to clear the pending bit, and a polling mode of +operation is therefore impossible. + +If the peer is a revision 0 device without MSI-X capability, its +Interrupt Status register is set to 1. This asserts INTx unless +masked by the Interrupt Mask register. The device is not capable to +communicate the interrupt vector to guest software then. + +With multiple MSI-X vectors, different vectors can be used to indicate +different events have occurred. The semantics of interrupt vectors +are left to the application. + + +== Interrupt infrastructure == + +When configured for interrupts, the peers share eventfd objects in +addition to shared memory. The shared resources are managed by an +ivshmem server. + +=== The ivshmem server === + +The server listens on a UNIX domain socket. + +For each new client that connects to the server, the server +- picks an ID, +- creates eventfd file descriptors for the interrupt vectors, +- sends the ID and the file descriptor for the shared memory to the + new client, +- sends connect notifications for the new client to the other clients + (these contain file descriptors for sending interrupts), +- sends connect notifications for the other clients to the new client, + and +- sends interrupt setup messages to the new client (these contain file + descriptors for receiving interrupts). + +The first client to connect to the server receives ID zero. + +When a client disconnects from the server, the server sends disconnect +notifications to the other clients. + +The next section describes the protocol in detail. + +If the server terminates without sending disconnect notifications for +its connected clients, the clients can elect to continue. They can +communicate with each other normally, but won't receive disconnect +notification on disconnect, and no new clients can connect. There is +no way for the clients to connect to a restarted server. The device +is not capable to tell guest software whether the server is still up. + +Example server code is in contrib/ivshmem-server/. Not to be used in +production. It assumes all clients use the same number of interrupt +vectors. + +A standalone client is in contrib/ivshmem-client/. It can be useful +for debugging. + +=== The ivshmem Client-Server Protocol === + +An ivshmem device configured for interrupts connects to an ivshmem +server. This section details the protocol between the two. + +The connection is one-way: the server sends messages to the client. 
+Each message consists of a single 8 byte little-endian signed number, +and may be accompanied by a file descriptor via SCM_RIGHTS. Both +client and server close the connection on error. + +Note: QEMU currently doesn't close the connection right on error, but +only when the character device is destroyed. + +On connect, the server sends the following messages in order: + +1. The protocol version number, currently zero. The client should + close the connection on receipt of versions it can't handle. + +2. The client's ID. This is unique among all clients of this server. + IDs must be between 0 and 65535, because the Doorbell register + provides only 16 bits for them. + +3. The number -1, accompanied by the file descriptor for the shared + memory. + +4. Connect notifications for existing other clients, if any. This is + a peer ID (number between 0 and 65535 other than the client's ID), + repeated N times. Each repetition is accompanied by one file + descriptor. These are for interrupting the peer with that ID using + vector 0,..,N-1, in order. If the client is configured for fewer + vectors, it closes the extra file descriptors. If it is configured + for more, the extra vectors remain unconnected. + +5. Interrupt setup. This is the client's own ID, repeated N times. + Each repetition is accompanied by one file descriptor. These are + for receiving interrupts from peers using vector 0,..,N-1, in + order. If the client is configured for fewer vectors, it closes + the extra file descriptors. If it is configured for more, the + extra vectors remain unconnected. + +From then on, the server sends these kinds of messages: + +6. Connection / disconnection notification. This is a peer ID. + + - If the number comes with a file descriptor, it's a connection + notification, exactly like in step 4. + + - Else, it's a disconnection notification for the peer with that ID. + +Known bugs: + +* The protocol changed incompatibly in QEMU 2.5. Before, messages + were native endian long, and there was no version number. + +* The protocol is poorly designed. + +=== The ivshmem Client-Client Protocol === + +An ivshmem device configured for interrupts receives eventfd file +descriptors for interrupting peers and getting interrupted by peers +from the server, as explained in the previous section. + +To interrupt a peer, the device writes the 8-byte integer 1 in native +byte order to the respective file descriptor. + +To receive an interrupt, the device reads and discards as many 8-byte +integers as it can. diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt new file mode 100644 index 000000000..5e407a6f3 --- /dev/null +++ b/docs/specs/pci-ids.txt @@ -0,0 +1,71 @@ + +PCI IDs for qemu +================ + +Red Hat, Inc. donates a part of its device ID range to qemu, to be used for +virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36. + +Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned +for your devices. + +1af4 vendor ID +-------------- + +The 1000 -> 10ff device ID range is used as follows for virtio-pci devices. +Note that this allocation separate from the virtio device IDs, which are +maintained as part of the virtio specification. 
+1af4:1000  network device (legacy)
+1af4:1001  block device (legacy)
+1af4:1002  balloon device (legacy)
+1af4:1003  console device (legacy)
+1af4:1004  SCSI host bus adapter device (legacy)
+1af4:1005  entropy generator device (legacy)
+1af4:1009  9p filesystem device (legacy)
+
+1af4:1041  network device (modern)
+1af4:1042  block device (modern)
+1af4:1043  console device (modern)
+1af4:1044  entropy generator device (modern)
+1af4:1045  balloon device (modern)
+1af4:1048  SCSI host bus adapter device (modern)
+1af4:1049  9p filesystem device (modern)
+1af4:1050  virtio gpu device (modern)
+1af4:1052  virtio input device (modern)
+
+1af4:10f0  Available for experimental usage without registration.  Must get
+   to      an official ID when the code leaves the test lab (i.e. when
+1af4:10ff  seeking upstream merge or shipping a distro/product) to avoid
+           conflicts.
+
+1af4:1100  Used as PCI Subsystem ID for existing hardware devices emulated
+           by qemu.
+
+1af4:1110  ivshmem device (shared memory, docs/specs/ivshmem-spec.txt)
+
+All other device IDs are reserved.
+
+1b36 vendor ID
+--------------
+
+The 0000 -> 00ff device ID range is used as follows for QEMU-specific
+PCI devices (other than virtio):
+
+1b36:0001  PCI-PCI bridge
+1b36:0002  PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
+1b36:0003  PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
+1b36:0004  PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
+1b36:0005  PCI test device (docs/specs/pci-testdev.txt)
+1b36:0006  PCI Rocker Ethernet switch device
+1b36:0007  PCI SD Card Host Controller Interface (SDHCI)
+1b36:0008  PCIe host bridge
+1b36:0009  PCI Expander Bridge (-device pxb)
+1b36:000a  PCI-PCI bridge (multiseat)
+1b36:000b  PCIe Expander Bridge (-device pxb-pcie)
+1b36:000d  PCI xhci usb host adapter
+1b36:000f  mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
+1b36:0010  PCIe NVMe device (-device nvme)
+1b36:0011  PCI PVPanic device (-device pvpanic-pci)
+
+All these devices are documented in docs/specs.
+
+The 0100 device ID is used for the QXL video card device.
diff --git a/docs/specs/pci-serial.txt b/docs/specs/pci-serial.txt
new file mode 100644
index 000000000..66c761f2b
--- /dev/null
+++ b/docs/specs/pci-serial.txt
@@ -0,0 +1,34 @@
+
+QEMU pci serial devices
+=======================
+
+There is one single-port variant and two multiport variants. Linux
+guests work out of the box with all cards. There is a Windows inf file
+(docs/qemupciserial.inf) to set up the single-port card in Windows
+guests.
+
+
+single-port card
+----------------
+
+Name:   pci-serial
+PCI ID: 1b36:0002
+
+PCI Region 0:
+   IO bar, 8 bytes long, with the 16550 uart mapped to it.
+   Interrupt is wired to pin A.
+
+
+multiport cards
+---------------
+
+Name:   pci-serial-2x
+PCI ID: 1b36:0003
+
+Name:   pci-serial-4x
+PCI ID: 1b36:0004
+
+PCI Region 0:
+   IO bar, with the two/four 16550 uarts mapped one after another.
+   The first is at offset 0, the second at offset 8, and so on.
+   Interrupt is wired to pin A.
diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt
new file mode 100644
index 000000000..4280a1e73
--- /dev/null
+++ b/docs/specs/pci-testdev.txt
@@ -0,0 +1,31 @@
+pci-testdev is a device used for testing low-level IO.
+
+The device implements up to three BARs: BAR0, BAR1 and BAR2.
+Each of BAR 0+1 can be memory or IO. Guests must detect
+BAR types and act accordingly.
+
+BAR 0+1 size is up to 4K bytes each.
+BAR 0+1 starts with the following header:
+
+typedef struct PCITestDevHdr {
+    uint8_t test;          /* write-only: starts the given test number */
+    uint8_t width_type;    /* read-only: type and width of access for a
+                            * given test; 1, 2, 4 for byte, word or long
+                            * write; any other value if the test is not
+                            * supported on this BAR */
+    uint8_t pad0[2];
+    uint32_t offset;       /* read-only: offset in this BAR for a given test */
+    uint32_t data;         /* read-only: data to use for a given test */
+    uint32_t count;        /* for debugging: number of writes detected */
+    uint8_t name[];        /* for debugging: 0-terminated ASCII string */
+} PCITestDevHdr;
+
+All registers are little endian.
+
+The device is expected to always implement tests 0 to N on each BAR, and to
+add new tests with higher numbers. In this way a guest can scan test numbers
+until it detects an access type that it does not support on this BAR, then
+stop.
+
+BAR2 is a 64-bit memory BAR, without backing storage. It is disabled
+by default and can be enabled using the membar=<size> property. This
+can be used to test whether guests handle PCI BARs of a specific
+(possibly quite large) size correctly.
diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt
new file mode 100644
index 000000000..93fe3da91
--- /dev/null
+++ b/docs/specs/ppc-spapr-hcalls.txt
@@ -0,0 +1,78 @@
+When used with the "pseries" machine type, qemu-system-ppc64 implements
+a set of hypervisor calls using a subset of the server "PAPR" specification
+(IBM internal at this point), which is also what IBM's proprietary hypervisor
+adheres to.
+
+The subset is selected based on the requirements of Linux as a guest.
+
+In addition to those calls, we have added our own private hypervisor
+calls which are mostly used as a private interface between the firmware
+running in the guest and QEMU.
+
+All those hypercalls start at hcall number 0xf000, which corresponds
+to an implementation-specific range in PAPR.
+
+- H_RTAS (0xf000)
+
+RTAS is a set of runtime services generally provided by the firmware
+inside the guest to the operating system. It predates the existence
+of hypervisors (it was originally an extension to Open Firmware) and
+is still used by PAPR to provide various services that aren't performance
+sensitive.
+
+We currently implement the RTAS services in QEMU itself. The actual RTAS
+"firmware" blob in the guest is a small stub of a few instructions which
+calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
+
+Arguments:
+
+  r3 : H_RTAS (0xf000)
+  r4 : Guest physical address of RTAS parameter block
+
+Returns:
+
+  H_SUCCESS   : Successfully called the RTAS function (RTAS result
+                will have been stored in the parameter block)
+  H_PARAMETER : Unknown token
+
+- H_LOGICAL_MEMOP (0xf001)
+
+When the guest runs in "real mode" (in PowerPC parlance this means
+with the MMU disabled, i.e. guest effective == guest physical), it only
+has access to a subset of memory and no IO.
+
+PAPR provides a set of hypervisor calls to perform cacheable or
+non-cacheable accesses to any guest physical addresses that the
+guest can use in order to access IO devices while in real mode.
+
+This is typically used by the firmware running in the guest.
+
+However, doing a hypercall for each access is extremely inefficient
+(even more so when running KVM) when accessing the frame buffer. In
+that case, things like scrolling become unusably slow.
+
+This hypercall allows the guest to request a "memory op" to be applied
+to memory.
The supported memory ops at this point are to copy a range +of memory (supports overlap of source and destination) and XOR which +is used by our SLOF firmware to invert the screen. + +Arguments: + + r3: H_LOGICAL_MEMOP (0xf001) + r4: Guest physical address of destination + r5: Guest physical address of source + r6: Individual element size + 0 = 1 byte + 1 = 2 bytes + 2 = 4 bytes + 3 = 8 bytes + r7: Number of elements + r8: Operation + 0 = copy + 1 = xor + +Returns: + + H_SUCCESS : Success + H_PARAMETER : Invalid argument + diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt new file mode 100644 index 000000000..d4fb2d46d --- /dev/null +++ b/docs/specs/ppc-spapr-hotplug.txt @@ -0,0 +1,409 @@ += sPAPR Dynamic Reconfiguration = + +sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration +to handle hotplugging of dynamic "physical" resources like PCI cards, or +"logical"/paravirtual resources like memory, CPUs, and "physical" +host-bridges, which are generally managed by the host/hypervisor and provided +to guests as virtualized resources. The specifics of dynamic-reconfiguration +are documented extensively in PAPR+ v2.7, Section 13.1. This document +provides a summary of that information as it applies to the implementation +within QEMU. + +== Dynamic-reconfiguration Connectors == + +To manage hotplug/unplug of these resources, a firmware abstraction known as +a Dynamic Resource Connector (DRC) is used to assign a particular dynamic +resource to the guest, and provide an interface for the guest to manage +configuration/removal of the resource associated with it. + +== Device-tree description of DRCs == + +A set of 4 Open Firmware device tree array properties are used to describe +the name/index/power-domain/type of each DRC allocated to a guest at +boot-time. There may be multiple sets of these arrays, rooted at different +paths in the device tree depending on the type of resource the DRCs manage. + +In some cases, the DRCs themselves may be provided by a dynamic resource, +such as the DRCs managing PCI slots on a hotplugged PHB. In this case the +arrays would be fetched as part of the device tree retrieval interfaces +for hotplugged resources described under "Guest->Host interface". + +The array properties are described below. Each entry/element in an array +describes the DRC identified by the element in the corresponding position +of ibm,drc-indexes: + +ibm,drc-names: + first 4-bytes: BE-encoded integer denoting the number of entries + each entry: a NULL-terminated <name> string encoded as a byte array + + <name> values for logical/virtual resources are defined in PAPR+ v2.7, + Section 13.5.2.4, and basically consist of the type of the resource + followed by a space and a numerical value that's unique across resources + of that type. + + <name> values for "physical" resources such as PCI or VIO devices are + defined as being "location codes", which are the "location labels" of + each encapsulating device, starting from the chassis down to the + individual slot for the device, concatenated by a hyphen. This provides + a mapping of resources to a physical location in a chassis for debugging + purposes. For QEMU, this mapping is less important, so we assign a + location code that conforms to naming specifications, but is simply a + location label for the slot by itself to simplify the implementation. 
+ The naming convention for location labels is documented in detail in + PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>" + for PCI/VIO device slots, where <n> is unique across all PCI/VIO + device slots. + +ibm,drc-indexes: + first 4-bytes: BE-encoded integer denoting the number of entries + each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs + in the machine + + <index> is arbitrary, but in the case of QEMU we try to maintain the + convention used to assign them to pSeries guests on pHyp: + + bit[31:28]: integer encoding of <type>, where <type> is: + 1 for CPU resource + 2 for PHB resource + 3 for VIO resource + 4 for PCI resource + 8 for Memory resource + bit[27:0]: integer encoding of <id>, where <id> is unique across + all resources of specified type + +ibm,drc-power-domains: + first 4-bytes: BE-encoded integer denoting the number of entries + each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the + power domain the resource will be assigned to. In the case of QEMU + we associated all resources with a "live insertion" domain, where the + power is assumed to be managed automatically. The integer value for + this domain is a special value of -1. + + +ibm,drc-types: + first 4-bytes: BE-encoded integer denoting the number of entries + each entry: a NULL-terminated <type> string encoded as a byte array + + <type> is assigned as follows: + "CPU" for a CPU + "PHB" for a physical host-bridge + "SLOT" for a VIO slot + "28" for a PCI slot + "MEM" for memory resource + +== Guest->Host interface to manage dynamic resources == + +Each DRC is given a globally unique DRC Index, and resources associated with +a particular DRC are configured/managed by the guest via a number of RTAS +calls which reference individual DRCs based on the DRC index. This can be +considered the guest->host interface. + +rtas-set-power-level: + arg[0]: integer identifying power domain + arg[1]: new power level for the domain, 0-100 + output[0]: status, 0 on success + output[1]: power level after command + + Set the power level for a specified power domain + +rtas-get-power-level: + arg[0]: integer identifying power domain + output[0]: status, 0 on success + output[1]: current power level + + Get the power level for a specified power domain + +rtas-set-indicator: + arg[0]: integer identifying sensor/indicator type + arg[1]: index of sensor, for DR-related sensors this is generally the + DRC index + arg[2]: desired sensor value + output[0]: status, 0 on success + + Set the state of an indicator or sensor. For the purpose of this document we + focus on the indicator/sensor types associated with a DRC. The types are: + + 9001: isolation-state, controls/indicates whether a device has been made + accessible to a guest + + supported sensor values: + 0: isolate, device is made unaccessible by guest OS + 1: unisolate, device is made available to guest OS + + 9002: dr-indicator, controls "visual" indicator associated with device + + supported sensor values: + 0: inactive, resource may be safely removed + 1: active, resource is in use and cannot be safely removed + 2: identify, used to visually identify slot for interactive hotplug + 3: action, in most cases, used in the same manner as identify + + 9003: allocation-state, generally only used for "logical" DR resources to + request the allocation/deallocation of a resource prior to acquiring + it via isolation-state->unisolate, or after releasing it via + isolation-state->isolate, respectively. 
for "physical" DR (like PCI + hotplug/unplug) the pre-allocation of the resource is implied and + this sensor is unused. + + supported sensor values: + 0: unusable, tell firmware/system the resource can be + unallocated/reclaimed and added back to the system resource pool + 1: usable, request the resource be allocated/reserved for use by + guest OS + 2: exchange, used to allocate a spare resource to use for fail-over + in certain situations. unused in QEMU + 3: recover, used to reclaim a previously allocated resource that's + not currently allocated to the guest OS. unused in QEMU + +rtas-get-sensor-state: + arg[0]: integer identifying sensor/indicator type + arg[1]: index of sensor, for DR-related sensors this is generally the + DRC index + output[0]: status, 0 on success + + Used to read an indicator or sensor value. + + For DR-related operations, the only noteworthy sensor is dr-entity-sense, + which has a type value of 9003, as allocation-state does in the case of + rtas-set-indicator. The semantics/encodings of the sensor values are distinct + however: + + supported sensor values for dr-entity-sense (9003) sensor: + 0: empty, + for physical resources: DRC/slot is empty + for logical resources: unused + 1: present, + for physical resources: DRC/slot is populated with a device/resource + for logical resources: resource has been allocated to the DRC + 2: unusable, + for physical resources: unused + for logical resources: DRC has no resource allocated to it + 3: exchange, + for physical resources: unused + for logical resources: resource available for exchange (see + allocation-state sensor semantics above) + 4: recovery, + for physical resources: unused + for logical resources: resource available for recovery (see + allocation-state sensor semantics above) + +rtas-ibm-configure-connector: + arg[0]: guest physical address of 4096-byte work area buffer + arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero + if a prior RTAS response indicated a need for additional memory + output[0]: status: + 0: completed transmittal of device-tree node + 1: instruct guest to prepare for next DT sibling node + 2: instruct guest to prepare for next DT child node + 3: instruct guest to prepare for next DT property + 4: instruct guest to ascend to parent DT node + 5: instruct guest to provide additional work-area buffer + via arg[1] + 990x: instruct guest that operation took too long and to try + again later + + Used to fetch an OF device-tree description of the resource associated with + a particular DRC. The DRC index is encoded in the first 4-bytes of the first + work area buffer. + + Work area layout, using 4-byte offsets: + wa[0]: DRC index of the DRC to fetch device-tree nodes from + wa[1]: 0 (hard-coded) + wa[2]: for next-sibling/next-child response: + wa offset of null-terminated string denoting the new node's name + for next-property response: + wa offset of null-terminated string denoting new property's name + wa[3]: for next-property response (unused otherwise): + byte-length of new property's value + wa[4]: for next-property response (unused otherwise): + new property's value, encoded as an OFDT-compatible byte array + +== hotplug/unplug events == + +For most DR operations, the hypervisor will issue host->guest add/remove events +using the EPOW/check-exception notification framework, where the host issues a +check-exception interrupt, then provides an RTAS event log via an +rtas-check-exception call issued by the guest in response. 
This framework is +documented by PAPR+ v2.7, and already use in by QEMU for generating powerdown +requests via EPOW events. + +For DR, this framework has been extended to include hotplug events, which were +previously unneeded due to direct manipulation of DR-related guest userspace +tools by host-level management such as an HMC. This level of management is not +applicable to PowerKVM, hence the reason for extending the notification +framework to support hotplug events. + +The format for these EPOW-signalled events is described below under +"hotplug/unplug event structure". Note that these events are not +formally part of the PAPR+ specification, and have been superseded by a +newer format, also described below under "hotplug/unplug event structure", +and so are now deemed a "legacy" format. The formats are similar, but the +"modern" format contains additional fields/flags, which are denoted for the +purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards. + +QEMU should assume support only for "legacy" fields/flags unless the guest +advertises support for the "modern" format via ibm,client-architecture-support +hcall by setting byte 5, bit 6 of it's ibm,architecture-vec-5 option vector +structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events, +"modern" format events are surfaced to the guest via check-exception RTAS calls, +but use a dedicated event source to signal the guest. This event source is +advertised to the guest by the addition of a "hot-plug-events" node under +"/event-sources" node of the guest's device tree using the standard format +described in LoPAPR v11, B.6.12.1. + +== hotplug/unplug event structure == + +The hotplug-specific payload in QEMU is implemented as follows (with all values +encoded in big-endian format): + +struct rtas_event_log_v6_hp { +#define SECTION_ID_HOTPLUG 0x4850 /* HP */ + struct section_header { + uint16_t section_id; /* set to SECTION_ID_HOTPLUG */ + uint16_t section_length; /* sizeof(rtas_event_log_v6_hp), + * plus the length of the DRC name + * if a DRC name identifier is + * specified for hotplug_identifier + */ + uint8_t section_version; /* version 1 */ + uint8_t section_subtype; /* unused */ + uint16_t creator_component_id; /* unused */ + } hdr; +#define RTAS_LOG_V6_HP_TYPE_CPU 1 +#define RTAS_LOG_V6_HP_TYPE_MEMORY 2 +#define RTAS_LOG_V6_HP_TYPE_SLOT 3 +#define RTAS_LOG_V6_HP_TYPE_PHB 4 +#define RTAS_LOG_V6_HP_TYPE_PCI 5 + uint8_t hotplug_type; /* type of resource/device */ +#define RTAS_LOG_V6_HP_ACTION_ADD 1 +#define RTAS_LOG_V6_HP_ACTION_REMOVE 2 + uint8_t hotplug_action; /* action (add/remove) */ +#define RTAS_LOG_V6_HP_ID_DRC_NAME 1 +#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2 +#define RTAS_LOG_V6_HP_ID_DRC_COUNT 3 +#ifdef GUEST_SUPPORTS_MODERN +#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4 +#endif + uint8_t hotplug_identifier; /* type of the resource identifier, + * which serves as the discriminator + * for the 'drc' union field below + */ +#ifdef GUEST_SUPPORTS_MODERN + uint8_t capabilities; /* capability flags, currently unused + * by QEMU + */ +#else + uint8_t reserved; +#endif + union { + uint32_t index; /* DRC index of resource to take action + * on + */ + uint32_t count; /* number of DR resources to take + * action on (guest chooses which) + */ +#ifdef GUEST_SUPPORTS_MODERN + struct { + uint32_t count; /* number of DR resources to take + * action on + */ + uint32_t index; /* DRC index of first resource to take + * action on. 
guest will take action + * on DRC index <index> through + * DRC index <index + count - 1> in + * sequential order + */ + } count_indexed; +#endif + char name[1]; /* string representing the name of the + * DRC to take action on + */ + } drc; +} QEMU_PACKED; + +== ibm,lrdr-capacity == + +ibm,lrdr-capacity is a property in the /rtas device tree node that identifies +the dynamic reconfiguration capabilities of the guest. It consists of a triple +consisting of <phys>, <size> and <maxcpus>. + + <phys>, encoded in BE format represents the maximum address in bytes and + hence the maximum memory that can be allocated to the guest. + + <size>, encoded in BE format represents the size increments in which + memory can be hot-plugged to the guest. + + <maxcpus>, a BE-encoded integer, represents the maximum number of + processors that the guest can have. + +pseries guests use this property to note the maximum allowed CPUs for the +guest. + +== ibm,dynamic-reconfiguration-memory == + +ibm,dynamic-reconfiguration-memory is a device tree node that represents +dynamically reconfigurable logical memory blocks (LMB). This node +is generated only when the guest advertises the support for it via +ibm,client-architecture-support call. Memory that is not dynamically +reconfigurable is represented by /memory nodes. The properties of this +node that are of interest to the sPAPR memory hotplug implementation +in QEMU are described here. + +ibm,lmb-size + +This 64bit integer defines the size of each dynamically reconfigurable LMB. + +ibm,associativity-lookup-arrays + +This property defines a lookup array in which the NUMA associativity +information for each LMB can be found. It is a property encoded array +that begins with an integer M, the number of associativity lists followed +by an integer N, the number of entries per associativity list and terminated +by M associativity lists each of length N integers. + +This property provides the same information as given by ibm,associativity +property in a /memory node. Each assigned LMB has an index value between +0 and M-1 which is used as an index into this table to select which +associativity list to use for the LMB. This index value for each LMB +is defined in ibm,dynamic-memory property. + +ibm,dynamic-memory + +This property describes the dynamically reconfigurable memory. It is a +property encoded array that has an integer N, the number of LMBs followed +by N LMB list entries. + +Each LMB list entry consists of the following elements: + +- Logical address of the start of the LMB encoded as a 64bit integer. This + corresponds to reg property in /memory node. +- DRC index of the LMB that corresponds to ibm,my-drc-index property + in a /memory node. +- Four bytes reserved for expansion. +- Associativity list index for the LMB that is used as an index into + ibm,associativity-lookup-arrays property described earlier. This + is used to retrieve the right associativity list to be used for this + LMB. +- A 32bit flags word. The bit at bit position 0x00000008 defines whether + the LMB is assigned to the partition as of boot time. + +ibm,dynamic-memory-v2 + +This property describes the dynamically reconfigurable memory. This is +an alternate and newer way to describe dynamically reconfigurable memory. +It is a property encoded array that has an integer N (the number of +LMB set entries) followed by N LMB set entries. There is an LMB set entry +for each sequential group of LMBs that share common attributes. 
+Each LMB set entry consists of the following elements:
+
+- Number of sequential LMBs in the entry represented by a 32bit integer.
+- Logical address of the first LMB in the set encoded as a 64bit integer.
+- DRC index of the first LMB in the set.
+- Associativity list index that is used as an index into the
+  ibm,associativity-lookup-arrays property described earlier. This
+  is used to retrieve the right associativity list to be used for all
+  the LMBs in this set.
+- A 32bit flags word that applies to all the LMBs in the set.
diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
new file mode 100644
index 000000000..ffa687dc8
--- /dev/null
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -0,0 +1,410 @@
+
+NUMA mechanics for sPAPR (pseries machines)
+============================================
+
+NUMA in sPAPR works differently from the System Locality Distance
+Information Table (SLIT) in ACPI. The logic is explained in LOPAPR
+1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
+document aims to complement that specification, providing details
+of the elements that impact how QEMU views NUMA in pseries.
+
+Associativity and ibm,associativity property
+--------------------------------------------
+
+Associativity is defined as a group of platform resources that have
+similar mean performance (or, in our context here, distance) relative to
+everything outside of the group.
+
+The format of the ibm,associativity property varies with the value of
+bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
+bit 0 equal to zero is deprecated. The current format, with bit 0 set
+to one, makes the ibm,associativity property represent the physical
+hierarchy of the platform, as one or more lists that start with the
+highest-level grouping and go down to the smallest. Considering the
+following topology:
+
+::
+
+    Mem M1 ---- Proc P1    |
+    -----------------      | Socket S1 ---|
+         chip C1           |              |
+                                          | HW module 1 (MOD1)
+    Mem M2 ---- Proc P2    |              |
+    -----------------      | Socket S2 ---|
+         chip C2           |
+
+The ibm,associativity property for the processors would be:
+
+* P1: {MOD1, S1, C1, P1}
+* P2: {MOD1, S2, C2, P2}
+
+Each allocable resource has an ibm,associativity property. The LOPAPR
+specification allows multiple lists to be present in this property,
+considering that the same resource can have multiple connections to the
+platform.
+
+Relative Performance Distance and ibm,associativity-reference-points
+--------------------------------------------------------------------
+
+The ibm,associativity-reference-points property is an array that is used
+to define the relevant performance/distance related boundaries, defining
+the NUMA levels for the platform.
+
+The definition of its elements also varies with the value of bit 0 of byte 5
+of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
+is also deprecated. With the current format, each integer of
+ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
+the first element is 1) into the ibm,associativity array. The first
+boundary is the most significant to application performance, followed by
+less significant boundaries. Allocated resources that belong to the
+same performance boundaries are expected to have relative NUMA distances
+that match the relevancy of the boundary itself. Resources that belong
+to the same first boundary have the shortest distance from each
+other.
Subsequent boundaries represents greater distances and degraded +performance. + +Using the previous example, the following setting reference points defines +three NUMA levels: + +* ibm,associativity-reference-points = {0x3, 0x2, 0x1} + +The first NUMA level (0x3) is interpreted as the third element of each +ibm,associativity array, the second level is the second element and +the third level is the first element. Let's also consider that elements +belonging to the first NUMA level have distance equal to 10 from each +other, and each NUMA level doubles the distance from the previous. This +means that the second would be 20 and the third level 40. For the P1 and +P2 processors, we would have the following NUMA levels: + +:: + + * ibm,associativity-reference-points = {0x3, 0x2, 0x1} + + * P1: associativity{MOD1, S1, C1, P1} + + First NUMA level (0x3) => associativity[2] = C1 + Second NUMA level (0x2) => associativity[1] = S1 + Third NUMA level (0x1) => associativity[0] = MOD1 + + * P2: associativity{MOD1, S2, C2, P2} + + First NUMA level (0x3) => associativity[2] = C2 + Second NUMA level (0x2) => associativity[1] = S2 + Third NUMA level (0x1) => associativity[0] = MOD1 + + P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40 + +Changing the ibm,associativity-reference-points array changes the performance +distance attributes for the same associativity arrays, as the following +example illustrates: + +:: + + * ibm,associativity-reference-points = {0x2} + + * P1: associativity{MOD1, S1, C1, P1} + + First NUMA level (0x2) => associativity[1] = S1 + + * P2: associativity{MOD1, S2, C2, P2} + + First NUMA level (0x2) => associativity[1] = S2 + + P1 and P2 does not have a common performance boundary. Since this is a one level + NUMA configuration, distance between them is one boundary above the first + level, 20. + + +In a hypothetical platform where all resources inside the same hardware module +is considered to be on the same performance boundary: + +:: + + * ibm,associativity-reference-points = {0x1} + + * P1: associativity{MOD1, S1, C1, P1} + + First NUMA level (0x1) => associativity[0] = MOD0 + + * P2: associativity{MOD1, S2, C2, P2} + + First NUMA level (0x1) => associativity[0] = MOD0 + + P1 and P2 belongs to the same first order boundary. The distance between then + is 10. + + +How the pseries Linux guest calculates NUMA distances +===================================================== + +Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is +how the distances are expressed. The SLIT table provides the NUMA distance +value between the relevant resources. LOPAPR does not provide a standard +way to calculate it. We have the ibm,associativity for each resource, which +provides a common-performance hierarchy, and the ibm,associativity-reference-points +array that tells which level of associativity is considered to be relevant +or not. + +The result is that each OS is free to implement and to interpret the distance +as it sees fit. For the pseries Linux guest, each level of NUMA duplicates +the distance of the previous level, and the maximum amount of levels is +limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the +kernel tree). 
+
+
+pseries NUMA mechanics
+======================
+
+Starting in QEMU 5.2, the pseries machine considers user input when setting
+the NUMA topology of the guest. The overall design is:
+
+* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing
+  for 4 distinct NUMA distance values based on the NUMA levels
+
+* ibm,max-associativity-domains supports multiple associativity domains in all
+  NUMA levels, granting the user flexibility
+
+* ibm,associativity for all resources varies with user input
+
+These changes are only effective for pseries-5.2 and newer machines that are
+created with more than one NUMA node (not counting NUMA nodes created by
+the machine itself, e.g. NVLink2 GPUs). The now-legacy support has been
+around for a long time, with users seeing NUMA distances 10 and 40
+(and 80 if using NVLink2 GPUs), so there is no need to disrupt the
+existing experience of those guests.
+
+To give pseries users the same NUMA tuning experience that x86 users have,
+we had to operate under the current pseries Linux kernel logic described in
+`How the pseries Linux guest calculates NUMA distances`_. The result
+is that we needed to translate NUMA distance user input to pseries
+Linux kernel input.
+
+Translating user distance to kernel distance
+--------------------------------------------
+
+User input for NUMA distance can vary from 10 to 254. We need to translate
+that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
+This is how it is done:
+
+* user distance 11 to 30 will be interpreted as 20
+* user distance 31 to 60 will be interpreted as 40
+* user distance 61 to 120 will be interpreted as 80
+* user distance 121 and beyond will be interpreted as 160
+* user distance 10 stays 10
+
+The reasoning behind this approximation is to avoid rounding any remote
+distance to the local distance (10), keeping it exclusive to the 4th NUMA
+level (which is in turn exclusive to the node_id). All other ranges were
+chosen at the developers' discretion as (somewhat) sensible interpretations
+of the user input. Any other strategy could be used here, but in the end we
+have to accept that a large range of values will be translated to the same
+NUMA topology in the guest. For example, this user input:
+
+::
+
+      0   1   2
+  0  10  31 120
+  1  31  10  30
+  2 120  30  10
+
+and this other user input:
+
+::
+
+      0   1   2
+  0  10  60  61
+  1  60  10  11
+  2  61  11  10
+
+will both be translated to the same values internally:
+
+::
+
+      0   1   2
+  0  10  40  80
+  1  40  10  20
+  2  80  20  10
+
+Users are encouraged to use only the kernel values in the NUMA definition to
+avoid being taken by surprise by what the guest actually sees in the
+topology. There are enough potential surprises that are inherent to the
+associativity domain assignment process, discussed below.
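+
+A minimal C sketch of this translation (an illustration of the rules above,
+not QEMU's actual implementation):
+
+::
+
+  /* Translate a user-supplied NUMA distance (10-254) into one of the
+   * distances the pseries Linux kernel operates on. */
+  static int user_to_kernel_distance(int d)
+  {
+      if (d <= 10) {
+          return 10;        /* local distance, exclusive to the node_id */
+      } else if (d <= 30) {
+          return 20;
+      } else if (d <= 60) {
+          return 40;
+      } else if (d <= 120) {
+          return 80;
+      }
+      return 160;           /* 121 and beyond */
+  }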
+
+
+How associativity domains are assigned
+--------------------------------------
+
+LOPAPR allows more than one associativity array (or 'string') per allocated
+resource. This would be used to represent that the resource has multiple
+connections with the board, and then the operating system, when deciding
+NUMA distancing, should consider the associativity information that provides
+the shortest distance.
+
+The spapr implementation does not support multiple associativity arrays per
+resource, and neither does the pseries Linux kernel. We'll have to represent
+the NUMA topology using one associativity per resource, which means that
+choices and compromises are going to be made.
+
+Consider the following NUMA topology entered by user input:
+
+::
+
+      0   1   2   3
+  0  10  40  20  40
+  1  40  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+All the associativity arrays are initialized with the NUMA node id in all
+associativity domains:
+
+* node 0: 0 0 0 0
+* node 1: 1 1 1 1
+* node 2: 2 2 2 2
+* node 3: 3 3 3 3
+
+
+Honoring just the relative distances of node 0 to every other node, we find the
+NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for
+each distance:
+
+* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
+* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+
+We'll copy the associativity domains of node 0 to all other nodes, based on
+the NUMA level matches. Between 0 and 1 there is a match at 0x2, so we copy
+the domains at levels 0x2 and 0x1 from node 0 to node 1. This will give us:
+
+* node 0: 0 0 0 0
+* node 1: 0 0 1 1
+
+Doing the same to node 2 and node 3, these are the associativity arrays
+after considering all matches with node 0:
+
+* node 0: 0 0 0 0
+* node 1: 0 0 1 1
+* node 2: 0 0 0 2
+* node 3: 0 0 3 3
+
+The distances related to node 0 are now accounted for. For node 1, keeping
+in mind that we don't need to revisit node 0 again, the distance from
+node 1 to 2 is 80, a match at 0x1, and the distance from 1 to 3 is 40, a
+match at 0x2. Repeating the same logic of copying all domains up to
+the NUMA level match:
+
+* node 0: 0 0 0 0
+* node 1: 1 0 1 1
+* node 2: 1 0 0 2
+* node 3: 1 0 3 3
+
+In the last step we analyze just nodes 2 and 3. The desired distance
+between 2 and 3 is 20, i.e. a match at 0x3:
+
+* node 0: 0 0 0 0
+* node 1: 1 0 1 1
+* node 2: 1 0 0 2
+* node 3: 1 0 0 3
+
+
+The kernel will read these arrays and calculate the following NUMA topology
+for the guest:
+
+::
+
+      0   1   2   3
+  0  10  40  20  20
+  1  40  10  40  40
+  2  20  40  10  20
+  3  20  40  20  10
+
+Note that this is not what the user asked for: the desired distance between
+0 and 3 is 40, but the guest will see 20. This is what the current logic and
+implementation constraints of the kernel and QEMU can provide inside the
+LOPAPR specification.
+
+Users are welcome to use this knowledge and experiment with the input to get
+the NUMA topology they want, or as close to it as possible. The important
+thing is to keep expectations up to par with what we are capable of providing
+at this moment: an approximation.
+
+Limitations of the implementation
+---------------------------------
+
+As mentioned above, the pSeries NUMA distance logic is, in fact, a way to
+approximate user choice. The Linux kernel, and PAPR itself, do not provide
+QEMU with a way to fully map user input to the actual NUMA distances the
+guest will use. This creates two notable limitations in our support:
+
+* Asymmetrical topologies aren't supported. We only support NUMA topologies
+  where the distance from node A to B is always the same as from B to A.
For example, the
+  following topology isn't supported and the pSeries guest will not boot with
+  this user input:
+
+::
+
+      0   1
+  0  10  40
+  1  20  10
+
+
+* 'non-transitive' topologies will be poorly translated to the guest. This is
+  the kind of topology where the distance from a node A to B is X, B to C is X,
+  but the distance A to C is not X. E.g.:
+
+::
+
+      0   1   2   3
+  0  10  20  20  40
+  1  20  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+  In the example above, the distance from 0 to 2 is 20, from 2 to 3 is 20, but
+  from 0 to 3 it is 40. The kernel will always match with the shortest
+  associativity domain possible, and we're attempting to retain the previously
+  established relations between the nodes. This means that a distance equal
+  to 20 between nodes 0 and 2 and the same distance 20 between nodes 2 and 3
+  will cause the distance between 0 and 3 to also be 20.
+
+
+Legacy (5.1 and older) pseries NUMA mechanics
+=============================================
+
+In short, we can summarize the NUMA distances seen in pseries Linux guests,
+using QEMU up to 5.1, as follows:
+
+* local distance, i.e. the distance of the resource to its own NUMA node: 10
+* if it's an NVLink GPU device, distance: 80
+* every other resource, distance: 40
+
+The way the pseries Linux guest calculates NUMA distances has a direct effect
+on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
+the default ibm,associativity-reference-points being used in the pseries
+machine:
+
+ibm,associativity-reference-points = {0x4, 0x4, 0x2}
+
+The first and second levels are equal, 0x4, and a third one was added in
+commit a6030d7e0b35 exclusively for NVLink GPU support. This means that,
+regardless of how the ibm,associativity properties are created in
+the device tree, the pseries Linux guest will only recognize three scenarios
+as far as NUMA distance goes:
+
+* resources that belong to the same first NUMA level: distance = 10
+* the second level is skipped since it's equal to the first
+* all resources that aren't an NVLink GPU are guaranteed to belong to the same
+  third NUMA level, with distance = 40
+* for NVLink GPUs, distance = 80 from everything else
+
+This also means that user input on the QEMU command line does not change the
+NUMA distancing inside the guest for the pseries machine.
diff --git a/docs/specs/ppc-spapr-uv-hcalls.txt b/docs/specs/ppc-spapr-uv-hcalls.txt
new file mode 100644
index 000000000..389c2740d
--- /dev/null
+++ b/docs/specs/ppc-spapr-uv-hcalls.txt
@@ -0,0 +1,76 @@
+On PPC64 systems supporting the Protected Execution Facility (PEF), system
+memory can be placed in a secured region where only an "ultravisor"
+running in firmware can provide access to it. pseries guests on such
+systems can communicate with the ultravisor (via ultracalls) to switch to a
+secure VM mode (SVM) where the guest's memory is relocated to this secured
+region, making its memory inaccessible to normal processes/guests running on
+the host.
+
+The various ultracalls/hypercalls relating to SVM mode are currently
+only documented internally, but are planned for direct inclusion into the
+public OpenPOWER version of the PAPR specification (LoPAPR/LoPAR). An internal
+ACR has been filed to reserve a hypercall number range specific to this
+use-case to avoid any future conflicts with the internally-maintained PAPR
+specification. This document summarizes some of these details as they relate
+to QEMU.
+
+== hypercalls needed by the ultravisor ==
+
+Switching to SVM mode involves a number of hcalls issued by the ultravisor
+to the hypervisor to orchestrate the movement of guest memory to secure
+memory and various other aspects of SVM mode. Numbers are assigned for these
+hcalls within the reserved range 0xEF00-0xEF80. The following documents the
+hcalls relevant to QEMU.
+
+- H_TPM_COMM (0xef10)
+
+  For the TPM_COMM_OP_EXECUTE operation:
+    Send a request to a TPM and receive a response, opening a new TPM session
+    if one has not already been opened.
+
+  For the TPM_COMM_OP_CLOSE_SESSION operation:
+    Close the existing TPM session, if any.
+
+  Arguments:
+
+    r3 : H_TPM_COMM (0xef10)
+    r4 : TPM operation, one of:
+         TPM_COMM_OP_EXECUTE (0x1)
+         TPM_COMM_OP_CLOSE_SESSION (0x2)
+    r5 : in_buffer, guest physical address of buffer containing the request
+         - Caller may use the same address for both request and response
+    r6 : in_size, size of the in buffer
+         - Must be less than or equal to 4KB
+    r7 : out_buffer, guest physical address of buffer to store the response
+         - Caller may use the same address for both request and response
+    r8 : out_size, size of the out buffer
+         - Must be at least 4KB, as this is the maximum request/response size
+           supported by most TPM implementations, including the TPM Resource
+           Manager in the Linux kernel.
+
+  Return values:
+
+    r3 : H_SUCCESS     request processed successfully
+         H_PARAMETER   invalid TPM operation
+         H_P2          in_buffer is invalid
+         H_P3          in_size is invalid
+         H_P4          out_buffer is invalid
+         H_P5          out_size is invalid
+         H_RESOURCE    problem communicating with TPM
+         H_FUNCTION    TPM access is not currently allowed/configured
+    r4 : For TPM_COMM_OP_EXECUTE, the size of the response will be stored here
+         upon success.
+
+  Use-case/notes:
+
+    SVM filesystems are encrypted using a symmetric key. This key is then
+    wrapped/encrypted using the public key of a trusted system which has the
+    private key stored in the system's TPM. An Ultravisor will use this
+    hcall to unwrap/unseal the symmetric key using the system's TPM device
+    or a TPM Resource Manager associated with the device.
+
+    The Ultravisor sets up a separate session key with the TPM in advance
+    during host system boot. All sensitive in and out values will be
+    encrypted using the session key. Though the hypervisor will see the 'in'
+    and 'out' buffers in raw form, any sensitive contents will generally be
+    encrypted using this session key.
diff --git a/docs/specs/ppc-spapr-xive.rst b/docs/specs/ppc-spapr-xive.rst
new file mode 100644
index 000000000..f47f739e0
--- /dev/null
+++ b/docs/specs/ppc-spapr-xive.rst
@@ -0,0 +1,282 @@
+XIVE for sPAPR (pseries machines)
+=================================
+
+The POWER9 processor comes with a new interrupt controller
+architecture, called XIVE ("eXternal Interrupt Virtualization
+Engine"). It supports a larger number of interrupt sources and offers
+virtualization features which enable the HW to deliver interrupts
+directly to virtual processors without hypervisor assistance.
+
+A QEMU ``pseries`` machine (which is PAPR compliant) using POWER9
+processors can run under two interrupt modes:
+
+- *Legacy Compatibility Mode*
+
+  the hypervisor provides identical interfaces and similar
+  functionality to PAPR+ Version 2.7. This is the default mode.
+
+  It is also referred to as *XICS* in QEMU.
+
+- *XIVE native exploitation mode*
+
+  the hypervisor provides new interfaces to manage the XIVE control
+  structures, and provides direct control for interrupt management
+  through MMIO pages.
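+
+For example, the interrupt mode can be chosen explicitly on the command line
+with the ``ic-mode`` machine property described below:
+
+::
+
+  qemu-system-ppc64 -machine pseries,ic-mode=dual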
+
+Which interrupt modes can be used by the machine is negotiated with
+the guest O/S during the Client Architecture Support negotiation
+sequence. The two modes are mutually exclusive.
+
+Both interrupt modes share the same IRQ number space. See below for the
+layout.
+
+CAS Negotiation
+---------------
+
+QEMU advertises the supported interrupt modes in the device tree
+property ``ibm,arch-vec-5-platform-support`` in byte 23, and the OS
+Selection for XIVE is indicated in the ``ibm,architecture-vec-5``
+property, byte 23.
+
+The interrupt modes supported by the machine depend on the CPU type
+(POWER9 is required for XIVE) but also on the machine property
+``ic-mode`` which can be set on the command line. It can take the
+following values: ``xics``, ``xive``, and ``dual``, which is the
+default mode. ``dual`` means that both XICS **and** XIVE are
+supported; if the guest OS supports XIVE, this mode will be
+selected.
+
+The chosen interrupt mode is activated after a reconfiguration done
+in a machine reset.
+
+KVM negotiation
+---------------
+
+When the guest starts under KVM, the capabilities of the host kernel
+and QEMU are also negotiated. Depending on the version of the host
+kernel, KVM may or may not advertise the XIVE capability to QEMU.
+
+Nevertheless, the available interrupt modes in the machine should not
+depend on the XIVE KVM capability of the host. On older kernels
+without XIVE KVM support, QEMU will use the emulated XIVE device as a
+fallback, and on newer kernels (>= 5.2), the KVM XIVE device.
+
+XIVE native exploitation mode is not supported for KVM nested guests,
+i.e. VMs running under an L1 hypervisor (KVM on pSeries). In that case, the
+hypervisor will not advertise the KVM capability and QEMU will use the
+emulated XIVE device, same as for older versions of KVM.
+
+As a final refinement, the user can also switch the use of the KVM
+device with the machine option ``kernel_irqchip``.
+
+
+XIVE support in KVM
+~~~~~~~~~~~~~~~~~~~
+
+For guest OSes supporting XIVE, the resulting interrupt modes on host
+kernels with XIVE KVM support are the following:
+
+============== ============= ============= ================
+ic-mode        kernel_irqchip
+-------------- ----------------------------------------------
+/              allowed       off           on
+               (default)
+============== ============= ============= ================
+dual (default) XIVE KVM      XIVE emul.    XIVE KVM
+xive           XIVE KVM      XIVE emul.    XIVE KVM
+xics           XICS KVM      XICS emul.    XICS KVM
+============== ============= ============= ================
+
+For legacy guest OSes without XIVE support, the resulting interrupt
+modes are the following:
+
+============== ============= ============= ================
+ic-mode        kernel_irqchip
+-------------- ----------------------------------------------
+/              allowed       off           on
+               (default)
+============== ============= ============= ================
+dual (default) XICS KVM      XICS emul.    XICS KVM
+xive           QEMU error(3) QEMU error(3) QEMU error(3)
+xics           XICS KVM      XICS emul.    XICS KVM
+============== ============= ============= ================
+
+(3) QEMU fails at CAS with ``Guest requested unavailable interrupt
+    mode (XICS), either don't set the ic-mode machine property or try
+    ic-mode=xics or ic-mode=dual``
+
+
+No XIVE support in KVM
+~~~~~~~~~~~~~~~~~~~~~~
+
+For guest OSes supporting XIVE, the resulting interrupt modes on host
+kernels without XIVE KVM support are the following:
+
+============== ============= ============= ================
+ic-mode        kernel_irqchip
+-------------- ----------------------------------------------
+/              allowed       off           on
+               (default)
+============== ============= ============= ================
+dual (default) XIVE emul.(1) XIVE emul.    QEMU error (2)
+xive           XIVE emul.(1) XIVE emul.    QEMU error (2)
+xics           XICS KVM      XICS emul.    XICS KVM
+============== ============= ============= ================
+
+
+(1) QEMU warns with ``warning: kernel_irqchip requested but unavailable:
+    IRQ_XIVE capability must be present for KVM``
+    In some cases (old host kernels or KVM nested guests), one may hit a
+    QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
+(2) QEMU fails with ``kernel_irqchip requested but unavailable:
+    IRQ_XIVE capability must be present for KVM``
+
+
+For legacy guest OSes without XIVE support, the resulting interrupt
+modes are the following:
+
+============== ============= ============= ================
+ic-mode        kernel_irqchip
+-------------- ----------------------------------------------
+/              allowed       off           on
+               (default)
+============== ============= ============= ================
+dual (default) QEMU error(4) XICS emul.    QEMU error(4)
+xive           QEMU error(3) QEMU error(3) QEMU error(3)
+xics           XICS KVM      XICS emul.    XICS KVM
+============== ============= ============= ================
+
+(3) QEMU fails at CAS with ``Guest requested unavailable interrupt
+    mode (XICS), either don't set the ic-mode machine property or try
+    ic-mode=xics or ic-mode=dual``
+(4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
+
+
+XIVE Device tree properties
+---------------------------
+
+The properties for the PAPR interrupt controller node when the *XIVE
+native exploitation mode* is selected should contain:
+
+- ``device_type``
+
+  value should be "power-ivpe".
+
+- ``compatible``
+
+  value should be "ibm,power-ivpe".
+
+- ``reg``
+
+  contains the base address and size of the thread interrupt
+  management areas (TIMA), for the User level and for the Guest OS
+  level. Only the Guest OS level is taken into account today.
+
+- ``ibm,xive-eq-sizes``
+
+  the size of the event queues. One cell per size supported, contains
+  log2 of size, in ascending order.
+
+- ``ibm,xive-lisn-ranges``
+
+  the IRQ interrupt number ranges assigned to the guest for the IPIs.
+
+The root node also exports:
+
+- ``ibm,plat-res-int-priorities``
+
+  contains a list of priorities that the hypervisor has reserved for
+  its own use.
+
+IRQ number space
+----------------
+
+The IRQ number space of the ``pseries`` machine is 8K wide and is the same
+for both interrupt modes. The different ranges are defined as follows:
+
+- ``0x0000 .. 0x0FFF`` 4K CPU IPIs (only used under XIVE)
+- ``0x1000 .. 0x1000`` 1 EPOW
+- ``0x1001 .. 0x1001`` 1 HOTPLUG
+- ``0x1002 .. 0x10FF`` unused
+- ``0x1100 .. 0x11FF`` 256 VIO devices
+- ``0x1200 .. 0x127F`` 32x4 LSIs for PHB devices
+- ``0x1280 .. 0x12FF`` unused
+- ``0x1300 .. 0x1FFF`` PHB MSIs (dynamically allocated)
+
+Monitoring XIVE
+---------------
+
+The state of the XIVE interrupt controller can be queried through the
+monitor command ``info pic``. The output comes in two parts.
+
+First, the state of the thread interrupt context registers is dumped
+for each CPU:
+
+::
+
+   (qemu) info pic
+   CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
+   CPU[0000]: USER    00   00  00  00    00   00  00  00   00000000
+   CPU[0000]:   OS    00   ff  00  00    ff   00  ff  ff   80000400
+   CPU[0000]: POOL    00   00  00  00    00   00  00  00   00000000
+   CPU[0000]: PHYS    00   00  00  00    00   00  00  ff   00000000
+   ...
+
+In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only
+the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM
+line which is set to the VP identifier.
+
+Then comes the routing information which aggregates the EAS and the
+END configuration:
+
+::
+
+   ...
+   LISN         PQ    EISN     CPU/PRIO EQ
+   00000000 MSI --    00000010   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00000001 MSI --    00000010   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00000002 MSI --    00000010   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
+   00000003 MSI --    00000010   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
+   00000004 MSI -Q  M 00000000
+   00000005 MSI -Q  M 00000000
+   00000006 MSI -Q  M 00000000
+   00000007 MSI -Q  M 00000000
+   00001000 MSI --    00000012   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00001001 MSI --    00000013   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00001100 MSI --    00000100   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00001101 MSI -Q  M 00000000
+   00001200 LSI -Q  M 00000000
+   00001201 LSI -Q  M 00000000
+   00001202 LSI -Q  M 00000000
+   00001203 LSI -Q  M 00000000
+   00001300 MSI --    00000102   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00001301 MSI --    00000103   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
+   00001302 MSI --    00000104   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
+
+The source information and configuration:
+
+- The ``LISN`` column outputs the interrupt number of the source in
+  range ``[ 0x0 ... 0x1FFF ]`` and its type: ``MSI`` or ``LSI``
+- The ``PQ`` column reflects the state of the PQ bits of the source:
+
+  - ``--`` source is ready to take events
+  - ``P-`` an event was sent and an EOI is PENDING
+  - ``PQ`` an event was QUEUED
+  - ``-Q`` source is OFF
+
+  an ``M`` indicates that the source is *MASKED* at the EAS level.
+
+The targeting configuration:
+
+- The ``EISN`` column is the event data that will be queued in the event
+  queue of the O/S.
+- The ``CPU/PRIO`` column is the tuple defining the CPU number and
+  priority queue serving the source.
+- The ``EQ`` column outputs:
+
+  - the current index of the event queue / the max number of entries
+  - the O/S event queue address
+  - the toggle bit
+  - the last entries that were pushed in the event queue.
diff --git a/docs/specs/ppc-xive.rst b/docs/specs/ppc-xive.rst
new file mode 100644
index 000000000..83d43f658
--- /dev/null
+++ b/docs/specs/ppc-xive.rst
@@ -0,0 +1,200 @@
+================================
+POWER9 XIVE interrupt controller
+================================
+
+The POWER9 processor comes with a new interrupt controller
+architecture, called XIVE ("eXternal Interrupt Virtualization
+Engine").
+
+Compared to the previous architecture, the main characteristics of
+XIVE are to support a larger number of interrupt sources and to
+deliver interrupts directly to virtual processors without hypervisor
+assistance. This removes the context switches required for the
+delivery process.
+
+
+XIVE architecture
+=================
+
+The XIVE IC is composed of three sub-engines, each taking care of a
+processing layer of external interrupts:
+
+- Interrupt Virtualization Source Engine (IVSE), or Source Controller
+  (SC). These are found in PCI PHBs, in the Processor Service
+  Interface (PSI) host bridge controller, but also inside the main
+  controller for the core IPIs and other sub-chips (NX, CAP, NPU) of
+  the chip/processor. They are configured to feed the IVRE with
+  events.
+- Interrupt Virtualization Routing Engine (IVRE) or Virtualization
+  Controller (VC). It handles event coalescing and performs interrupt
+  routing by matching an event source number with an Event
+  Notification Descriptor (END).
+- Interrupt Virtualization Presentation Engine (IVPE) or Presentation
+  Controller (PC). It maintains the interrupt context state of each
+  thread and handles the delivery of the external interrupt to the
+  thread.
+
+::
+
+               XIVE Interrupt Controller
+           +------------------------------------+      IPIs
+           | +---------+ +---------+ +--------+ |    +-------+
+           | |IVRE     | |Common Q | |IVPE    |----> | CORES |
+           | |     esb | |         | |        |----> |       |
+           | |     eas | |  Bridge | |   tctx |----> |       |
+           | |SC   end | |         | |    nvt | |    |       |
+ +------+  | +---------+ +----+----+ +--------+ |    +-+-+-+-+
+ | RAM  |  +------------------|-----------------+      | | |
+ |      |                     |                         | | |
+ |      |                     |                         | | |
+ |      |  +--------------------v------------------------v-v-v--+    other
+ |  <--+   Power Bus                                     +--> chips
+ | esb |  +---------+-----------------------+------------------+
+ | eas |            |                       |
+ | end |         +--|------+                |
+ | nvt |    +----+----+    |           +----+----+
+ +------+   |IVSE     |    |           |IVSE     |
+            |         |    |           |         |
+            | PQ-bits |    |           | PQ-bits |
+            |  local  |----+           |  in VC  |
+            +---------+                +---------+
+               PCIe                     NX,NPU,CAPI
+
+
+   PQ-bits: 2 bits source state machine (P:pending Q:queued)
+   esb: Event State Buffer (Array of PQ bits in an IVSE)
+   eas: Event Assignment Structure
+   end: Event Notification Descriptor
+   nvt: Notification Virtual Target
+   tctx: Thread interrupt Context registers
+
+
+
+XIVE internal tables
+--------------------
+
+Each of the sub-engines uses a set of tables to redirect interrupts
+from event sources to CPU threads.
+
+::
+
+                                          +-------+
+  User or O/S                             |  EQ   |
+      or                        +-------->|entries|
+  Hypervisor                    |         |  ..   |
+    Memory                      |         +-------+
+                                |             ^
+                                |             |
+     +-------------------------------------------------+
+     |                                                 |
+  Hypervisor  +------+     +---+--+     +---+--+    +------+
+    Memory    | ESB  |     | EAT  |     | ENDT |    | NVTT |
+   (skiboot)  +----+-+     +----+-+     +----+-+    +------+
+                ^  |         ^  |         ^  |         ^
+                |  |         |  |         |  |         |
+     +-------------------------------------------------+
+     |          |  |         |  |         |  |         |
+     |          |  |         |  |         |  |         |
+   +----|--|--------|--|--------|--|-+      +-|-----+     +------+
+   |    |  |        |  |        |  | |      | | tctx|     |Thread|
+   IPI or --+       +  v        +  v |      + v |---|  .. |----->|
+   HW events|       |  |        |  | |      |       |     |      |
+   |        |     IVRE |        | IVPE      |       |     +------+
+   +---------------------------------+      +-------+
+
+
+The IVSEs have a 2-bit state machine for each source, P for pending and Q
+for queued, which controls whether events are let through. The PQ bits are
+stored in an Event State Buffer (ESB) array and can be controlled by MMIOs.
+
+If the event is let through, the IVRE looks up the Event Assignment
+Structure (EAS) table to find the Event Notification Descriptor (END)
+configured for the source. Each Event Notification Descriptor defines
+a notification path to a CPU and an in-memory Event Queue, into which
+event data is enqueued for the O/S to pull.
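+
+A toy C model of this routing step (data layout and names are illustrative
+sketches, not QEMU's actual structures):
+
+::
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  typedef struct { bool valid; uint32_t end_idx; uint32_t eisn; } Eas;
+  typedef struct { uint32_t *queue; uint32_t size; uint32_t head; } End;
+
+  /* Route an event from source `lisn`: look up its EAS entry and
+   * enqueue the event data in the END's in-memory event queue. */
+  static void ivre_route(Eas *eat, End *endt, uint32_t lisn)
+  {
+      Eas *eas = &eat[lisn];
+
+      if (!eas->valid) {
+          return;                    /* unassigned source: event dropped */
+      }
+      End *end = &endt[eas->end_idx];
+      end->queue[end->head] = eas->eisn;        /* EQ data for the O/S */
+      end->head = (end->head + 1) % end->size;
+      /* ...the IVPE is then notified to present the interrupt */
+  }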
+
+The IVPE determines if a Notification Virtual Target (NVT) can handle
+the event by scanning the thread contexts of the VCPUs dispatched on
+the processor HW threads. It maintains the interrupt context state of
+each thread in an NVT table.
+
+XIVE thread interrupt context
+-----------------------------
+
+The XIVE presenter can generate four different exceptions to its
+HW threads:
+
+- hypervisor exception
+- O/S exception
+- Event-Based Branch (user level)
+- msgsnd (doorbell)
+
+Each exception has a state independent from the others called a Thread
+Interrupt Management context. This context is a set of registers which
+lets the thread handle priority management and interrupt
+acknowledgment, among other things. The most important ones are:
+
+- Pending Interrupt Priority Register (PIPR)
+- Interrupt Pending Buffer (IPB)
+- Current Processor Priority (CPPR)
+- Notification Source Register (NSR)
+
+TIMA
+~~~~
+
+The Thread Interrupt Management registers are accessible through a
+specific MMIO region, called the Thread Interrupt Management Area
+(TIMA): four aligned pages, each exposing a different view of the
+registers. The first page (page address ending in ``0b00``) gives access
+to the entire context and is reserved for the ring 0 view for the
+physical thread context. The second (page address ending in ``0b01``)
+is for the hypervisor, ring 1 view. The third (page address ending in
+``0b10``) is for the operating system, ring 2 view. The fourth (page
+address ending in ``0b11``) is for user level, ring 3 view.
+
+Interrupt flow from an O/S perspective
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After event data has been enqueued in the O/S Event Queue, the IVPE
+raises the bit corresponding to the priority of the pending interrupt
+in the IPB register (Interrupt Pending Buffer) to indicate that an
+event is pending in one of the 8 priority queues. The Pending
+Interrupt Priority Register (PIPR) is also updated using the IPB. This
+register represents the priority of the most favored pending
+notification.
+
+The PIPR is then compared to the Current Processor Priority
+Register (CPPR). If it is more favored (numerically less), the
+CPU interrupt line is raised and the EO bit of the Notification Source
+Register (NSR) is updated to notify the presence of an exception for
+the O/S. The O/S acknowledges the interrupt with a special load in the
+Thread Interrupt Management Area.
+
+The O/S handles the interrupt and, when done, performs an EOI using an
+MMIO operation on the ESB management page of the associated source.
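+
+A toy model of this presentation check (assuming the convention where
+priority 0 maps to the most significant IPB bit; illustrative only, not
+QEMU's implementation):
+
+::
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  /* PIPR holds the most favored (numerically lowest) pending
+   * priority recorded in the IPB. */
+  static uint8_t ipb_to_pipr(uint8_t ipb)
+  {
+      for (uint8_t prio = 0; prio < 8; prio++) {
+          if (ipb & (0x80 >> prio)) {
+              return prio;
+          }
+      }
+      return 0xFF;                  /* no event pending */
+  }
+
+  /* The CPU line is raised when PIPR is more favored than CPPR,
+   * i.e. numerically less. */
+  static bool os_exception_pending(uint8_t ipb, uint8_t cppr)
+  {
+      return ipb_to_pipr(ipb) < cppr;
+  }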
+
+Overview of the QEMU models for XIVE
+====================================
+
+The XiveSource models the IVSE in general, internal and external. It
+handles the source ESBs and the MMIO interface to control them.
+
+The XiveNotifier is a small helper interface interconnecting the
+XiveSource to the XiveRouter.
+
+The XiveRouter is an abstract model acting as a combined IVRE and
+IVPE. It routes event notifications using the EAS and END tables to
+the IVPE sub-engine which does a CAM scan to find a CPU to deliver the
+exception. Storage should be provided by the inheriting classes.
+
+XiveENDSource is a special source object. It exposes the END ESB MMIOs
+of the Event Queues which are used for coalescing event notifications
+and for escalation. It is not used in the field, only to sync the EQ
+cache in OPAL.
+
+Finally, the XiveTCTX contains the interrupt state context of a thread:
+four sets of registers, one for each exception that can be delivered
+to a CPU. These contexts are scanned by the IVPE to find a matching VP
+when a notification is triggered. It also models the Thread Interrupt
+Management Area (TIMA), which exposes the thread context registers to
+the CPU for interrupt management.
diff --git a/docs/specs/pvpanic.txt b/docs/specs/pvpanic.txt
new file mode 100644
index 000000000..8afcde11c
--- /dev/null
+++ b/docs/specs/pvpanic.txt
@@ -0,0 +1,54 @@
+PVPANIC DEVICE
+==============
+
+The pvpanic device is a simulated device through which a guest panic
+event is sent to QEMU, and a QMP event is generated. This allows
+management apps (e.g. libvirt) to be notified and respond to the event.
+
+The management app has the option of waiting for GUEST_PANICKED events,
+and/or polling for the guest-panicked RunState, to learn when the pvpanic
+device has fired a panic event.
+
+The pvpanic device can be implemented as an ISA device (using IOPORT) or as a
+PCI device.
+
+ISA Interface
+-------------
+
+pvpanic exposes a single I/O port, by default 0x505. On read, the bits
+recognized by the device are set. Software should ignore bits it doesn't
+recognize. On write, the bits not recognized by the device are ignored.
+Software should set only bits both itself and the device recognize.
+
+Bit Definition
+--------------
+bit 0: a guest panic has happened and should be processed by the host
+bit 1: a guest panic has happened and will be handled by the guest;
+       the host should record it or report it, but should not affect
+       the execution of the guest.
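+
+For example, a guest might report a host-processed panic with a single byte
+write to the port (an illustrative x86 snippet; a production driver should
+first read the port to check which bits the device recognizes):
+
+    #define PVPANIC_PORT      0x505
+    #define PVPANIC_PANICKED  (1 << 0)   /* bit 0: host should process */
+
+    static inline void outb(unsigned short port, unsigned char val)
+    {
+        __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
+    }
+
+    void notify_host_of_panic(void)
+    {
+        outb(PVPANIC_PORT, PVPANIC_PANICKED);
+    }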
+
+PCI Interface
+-------------
+
+The PCI interface is similar to the ISA interface except that it uses an MMIO
+address space provided by its BAR0, 1 byte long. Any machine with a PCI bus
+can enable a pvpanic device by adding '-device pvpanic-pci' to the command
+line.
+
+ACPI Interface
+--------------
+
+The pvpanic device is defined with ACPI ID "QEMU0001". Custom methods:
+
+RDPT: To determine whether guest panic notification is supported.
+Arguments: None
+Return: Returns a byte, with the same semantics as the I/O port
+        interface.
+
+WRPT: To send a guest panic event
+Arguments: Arg0 is a byte to be written, with the same semantics as
+           the I/O interface.
+Return: None
+
+The ACPI device will automatically refer to the right port in case it
+is modified.
diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt
new file mode 100644
index 000000000..1857b3170
--- /dev/null
+++ b/docs/specs/rocker.txt
@@ -0,0 +1,1014 @@
+Rocker Network Switch Register Programming Guide
+Copyright (c) Scott Feldman <sfeldma@gmail.com>
+Copyright (c) Neil Horman <nhorman@tuxdriver.com>
+Version 0.11, 12/29/2014
+
+LICENSE
+=======
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+SECTION 1: Introduction
+=======================
+
+Overview
+--------
+
+This document describes the hardware/software interface for the Rocker switch
+device.  The intended audience is authors of OS drivers and device emulation
+software.
+
+Notations and Conventions
+-------------------------
+
+o In register descriptions, [n:m] indicates a range from bit n to bit m,
+  inclusive.
+o Use of leading 0x indicates a hexadecimal number.
+o Use of leading 0b indicates a binary number.
+o The use of RSVD or Reserved indicates that a bit or field is reserved for
+  future use.
+o Field width is in bytes, unless otherwise noted.
+o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR)
+  clear on read.
+o TLV values in network-byte-order are designated with (N).
+
+
+SECTION 2: PCI Configuration Registers
+======================================
+
+PCI Configuration Space
+-----------------------
+
+Each switch instance registers as a PCI device with PCI configuration space:
+
+    offset   width   description           value
+    ---------------------------------------------
+    0x0      2       Vendor ID             0x1b36
+    0x2      2       Device ID             0x0006
+    0x4      4       Command/Status
+    0x8      1       Revision ID           0x01
+    0x9      3       Class code            0x2800
+    0xC      1       Cache line size
+    0xD      1       Latency timer
+    0xE      1       Header type
+    0xF      1       Built-in self test
+    0x10     4       Base address low
+    0x14     4       Base address high
+    0x18-28          Reserved
+    0x2C     2       Subsystem vendor ID   *
+    0x2E     2       Subsystem ID          *
+    0x30-38          Reserved
+    0x3C     1       Interrupt line
+    0x3D     1       Interrupt pin         0x00
+    0x3E     1       Min grant             0x00
+    0x3F     1       Max latency           0x00
+    0x40     1       TRDY timeout
+    0x41     1       Retry count
+    0x42     2       Reserved
+
+
+* Assigned by sub-system implementation
+
+SECTION 3: Memory-Mapped Register Space
+=======================================
+
+There are two memory-mapped BARs.  BAR0 maps device register space and is
+0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
+size, allowing for 256 MSI-X vectors.
+
+All registers are 4 or 8 bytes long.  It is assumed host software will access
+4 byte registers with one 4-byte access, and 8 byte registers with either two
+4-byte accesses or a single 8-byte access.  In the case of two 4-byte
+accesses, access must be lower and then upper 4-bytes, in that order.
+
+BAR0 device register space is organized as follows:
+
+    offset          description
+    ------------------------------------------------------
+    0x0000-0x000f   Bogus registers to catch misbehaving
+                    drivers.  Writes do nothing.  Reads
+                    back as 0xDEADBABE.
+    0x0010-0x00ff   Test registers
+    0x0300-0x03ff   General purpose registers
+    0x1000-0x1fff   Descriptor control
+
+Holes in register space are reserved.  Writes to reserved registers do
+nothing.  Reads from reserved registers read back as 0.
+
+No fancy stuff like write-combining is enabled on any of the registers.
+
+BAR1 MSI-X register space is organized as follows:
+
+    offset          description
+    ------------------------------------------------------
+    0x0000-0x0fff   MSI-X vector table (256 vectors total)
+    0x1000-0x1fff   MSI-X PBA table
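+
+For example, a driver might wrap the 8-byte access rule above in a helper
+(an illustrative sketch; mmio_read32 stands for whatever 32-bit MMIO read
+primitive the OS provides):
+
+    #include <stdint.h>
+
+    extern uint32_t mmio_read32(volatile void *addr);  /* assumed primitive */
+
+    /* Read a 64-bit Rocker register as two 32-bit accesses,
+     * lower 4 bytes first, as required above. */
+    static uint64_t rocker_read64(volatile uint8_t *bar0, unsigned offset)
+    {
+        uint64_t lo = mmio_read32(bar0 + offset);      /* lower half first */
+        uint64_t hi = mmio_read32(bar0 + offset + 4);  /* then upper half */
+        return lo | (hi << 32);
+    }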
+
+
+SECTION 4: Interrupts, DMA, and Endianness
+==========================================
+
+PCI Interrupts
+--------------
+
+The device supports only MSI-X interrupts.  The BAR1 memory-mapped region
+contains the MSI-X vector and PBA tables, with support for up to 256 MSI-X
+vectors.
+
+The vector assignment is:
+
+    vector      description
+    -----------------------------------------------------
+    0           Command descriptor ring completion
+    1           Event descriptor ring completion
+    2           Test operation completion
+    3           RSVD
+    4-255       Tx and Rx descriptor ring completion
+                    Tx vector is even
+                    Rx vector is odd
+
+An MSI-X vector table entry is 16 bytes:
+
+    field       offset  width   description
+    -------------------------------------------------------------
+    lower_addr  0x0     4       [31:2] message address[31:2]
+                                [1:0] Rsvd (4 byte alignment
+                                required)
+    upper_addr  0x4     4       [31:19] Rsvd
+                                [14:0] message address[46:32]
+    data        0x8     4       message data[31:0]
+    control     0xc     4       [31:1] Rsvd
+                                [0] mask (0 = enable,
+                                1 = masked)
+
+Software should install the Interrupt Service Routine (ISR) before any ports
+are enabled or any commands are issued on the command ring.
+
+DMA Operations
+--------------
+
+DMA operations are used for packet DMA to/from the CPU, command and event
+processing.  Command processing includes statistical counters and table dumps,
+table insertion/deletion, and more.  Event processing provides an async
+notification method for device-originating events.  Each DMA operation has a
+set of control registers to manage a descriptor ring.  The descriptor rings
+are allocated from contiguous host DMA-able memory and registers specify the
+ring's base address, size and current head and tail indices.  Software always
+writes the head, and hardware always writes the tail.
+
+The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
+of a descriptor.  Software will clear this bit when posting a descriptor to
+the ring, and hardware will set this bit when the descriptor is complete.
+
+Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
+Descriptor rings' base address must be 8-byte aligned.  Descriptors must be
+packed within the ring.  Each descriptor in each ring must also be aligned on
+an 8 byte boundary.  Each descriptor ring will have these registers:
+
+    DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
+    DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W)
+    DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W)
+    DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R)
+    DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W)
+    DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W)
+    DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W)
+
+where x is the descriptor ring index:
+
+    index   ring
+    --------------------
+    0       CMD
+    1       EVENT
+    2       TX (port 0)
+    3       RX (port 0)
+    4       TX (port 1)
+    5       RX (port 1)
+    .
+    .
+    .
+    124     TX (port 61)
+    125     RX (port 61)
+    126     Resv
+    127     Resv
+
+Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero.  HEAD cannot be
+written past TAIL.  To do so would wrap the ring.  An empty ring is when HEAD
+== TAIL.  A full ring is when HEAD is one position behind TAIL.  Both HEAD and
+TAIL increment and modulo wrap at the ring size.
+
+CTRL register bits:
+
+    bit     name            description
+    ------------------------------------------------------------------------
+    [0]     CTRL_RESET      Reset the descriptor ring
+    [1:31]                  Reserved
+
+All descriptor types share some common fields:
+
+    field               width   description
+    -------------------------------------------------------------------
+    DMA_DESC_BUF_ADDR   8       Phys addr of desc payload, 8-byte
+                                aligned
+    DMA_DESC_COOKIE     8       Desc cookie for completion matching,
+                                upper-most bit is reserved
+    DMA_DESC_BUF_SIZE   2       Desc payload size in bytes
+    DMA_DESC_TLV_SIZE   2       Desc payload total size in bytes
+                                used for TLVs.  Must be <=
+                                DMA_DESC_BUF_SIZE.
+    DMA_DESC_COMP_ERR   2       Completion status of associated
+                                desc payload.  High order bit is
+                                clear on new descs, toggled by
+                                hw for completed items.
+
+To support forward- and backward-compatibility, descriptor and completion
+payloads are specified in TLV format.  Fields are packed with Type=field name,
+Length=field length, and Value=field value.  Software will ignore unknown
+fields filled in by the switch.  Likewise, the switch will ignore unknown
+fields filled in by software.
+
+The descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned.
+The value within a TLV is also 8-byte aligned.  The (packed, 8 byte) TLV
+header is:
+
+    field   width   description
+    -----------------------------
+    type    4       TLV type
+    len     2       TLV value length
+    pad     2       Reserved
+
+The alignment requirements for descriptors and TLVs are to avoid unaligned
+access exceptions in software.  Note that the payload for each TLV is also
+8 byte aligned.
+
+Figure 1 shows an example descriptor buffer with two TLVs.
+
+              <------- 8 bytes ------->
+
+  8-byte    +–––––––––––+–––––+–––––+                          +–+
+  align     |   type    | len | pad |   TLV#1 hdr               |
+            +–––––––––––+–––––+–––––+   (len=22)                |
+            |                       |                           |
+            |  value                |   TLV#1 value             |
+            |                       |   (padded to 8-byte       |
+            |                 +–––––+    alignment)             |
+            |                 |/////|                           |
+  8-byte    +–––––––––––+–––––+–––––+                 DESC_BUF_SIZE
+  align     |   type    | len | pad |   TLV#2 hdr               |
+            +–––––+–––––+–––––+–––––+   (len=2)                 |
+            |value|/////////////////|   TLV#2 value             |
+            +–––––+/////////////////|                           |
+            |///////////////////////|                           |
+            |///////////////////////|                           |
+            |///////////////////////|                           |
+            |////////unused/////////|                           |
+            |////////space//////////|                           |
+            |///////////////////////|                           |
+            |///////////////////////|                           |
+            |///////////////////////|                           |
+            +–––––––––––––––––––––––+                          +–+
+
+                                fig. 1
+
+TLVs can be nested within the NEST TLV type.
+
+Interrupt credits
+^^^^^^^^^^^^^^^^^
+
+MSI-X vectors used for descriptor ring completions use a credit mechanism for
+efficient device, PCIe bus, OS and driver operations.  Each descriptor ring
+has a credit count which represents the number of outstanding descriptors to
+be processed by the driver.  As the device marks descriptors complete, the
+credit count is incremented.  As the driver processes those outstanding
+descriptors, it returns credits back to the device.  This way, the device
+knows the driver's progress and can make decisions about when to fire the
+next interrupt or not.  When the credit count is zero, and the first
+descriptors are posted for the driver, a single interrupt is fired.  Once the
+interrupt is fired, the interrupt is disabled (auto-masked*).  In response to
+the interrupt, the driver will process descriptors and PIO write a returned
+credit value for that descriptor ring.  If the driver returns all credits
+(the driver caught up with the device and there is no outstanding work), then
+the interrupt is unmasked, but not fired.  If only partial credits are
+returned, the interrupt remains masked, but the device generates an
+interrupt, signaling the driver that more outstanding work is available.
+
+(* this masking is unrelated to the MSI-X interrupt mask register)
+
+Endianness
+----------
+
+Device registers are hard-coded to little-endian (LE).  The driver should
+convert to/from host endianness to LE for device register accesses.
+
+Descriptors are LE.  Descriptor buffer TLVs will have LE type and length
+fields, but the value field can either be LE or network-byte-order, depending
+on context.  TLV values containing network packet data will be in
+network-byte order.  A TLV value containing a field or mask used to compare
+against network packet data is in network-byte order.  For example, flow
+match fields (and masks) are network-byte-order since they're matched
+directly, byte-by-byte, against network packet data.  All non-network-packet
+TLV multi-byte values will be LE.
+
+TLV values in network-byte-order are designated with (N).
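+
+As a reference, a driver could declare the 8-byte TLV header described in
+this section as a packed struct (an illustrative sketch, not a definition
+taken from this spec):
+
+    #include <stdint.h>
+
+    struct rocker_tlv_hdr {
+        uint32_t type;  /* TLV type (LE) */
+        uint16_t len;   /* TLV value length, not incl. header/padding (LE) */
+        uint16_t pad;   /* reserved */
+        /* value follows, padded to 8-byte alignment */
+    } __attribute__((packed));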
+
+
+SECTION 5: Test Registers
+=========================
+
+Rocker has several test registers to support troubleshooting register access,
+interrupt generation, and DMA operations:
+
+    TEST_REG, offset 0x0010, 32-bit (R/W)
+    TEST_REG64, offset 0x0018, 64-bit (R/W)
+    TEST_IRQ, offset 0x0020, 32-bit (R/W)
+    TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
+    TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
+    TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
+
+Reads from TEST_REG and TEST_REG64 will return a value equal to twice the
+last value written to the register.  The 32-bit and 64-bit versions are for
+testing 32-bit and 64-bit host accesses.
+
+A vector can be written to TEST_IRQ and the device will generate an interrupt
+for that vector.
+
+To test basic DMA operations, allocate a DMA-able host buffer and put the
+buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE.  Then, write
+to TEST_DMA_CTRL to manipulate the buffer contents.  TEST_DMA_CTRL operations
+are:
+
+    operation               value   description
+    -----------------------------------------------------------
+    TEST_DMA_CTRL_CLEAR     1       clear buffer
+    TEST_DMA_CTRL_FILL      2       fill buffer bytes with 0x96
+    TEST_DMA_CTRL_INVERT    4       invert bytes in buffer
+
+Various buffer addresses and sizes should be tested to verify that no address
+boundary issues exist.  In particular, buffers that start on an odd 8-byte
+boundary and/or span multiple PAGE sizes should be tested.
+
+
+SECTION 6: Ports
+================
+
+Physical and Logical Ports
+--------------------------
+
+The switch supports up to 62 physical (front-panel) ports.  Register
+PORT_PHYS_COUNT returns the actual number of physical ports available:
+
+    PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)
+
+In addition to front-panel ports, the switch supports logical ports for
+tunnels.
+
+Front-panel ports and logical tunnel ports are mapped into a single 32-bit
+port space.  A special CPU port is assigned port 0.  The front-panel ports
+are mapped to ports 1-62.  A special loopback port is assigned port 63.
+Logical tunnel ports are assigned ports 0x00010000-0x0001ffff.
+
+To summarize the port assignments:
+
+    port                    mapping
+    -------------------------------------------------------
+    0                       CPU port (for packets to/from host CPU)
+    1-62                    front-panel physical ports
+    63                      loopback port
+    64-0x0000ffff           RSVD
+    0x00010000-0x0001ffff   logical tunnel ports
+    0x00020000-0xffffffff   RSVD
+
+Physical Port Mode
+------------------
+
+Each switch front-panel port operates in a mode.  Currently, the only mode is
+OF-DPA.  OF-DPA[1] mode is based on the OpenFlow Data Plane Abstraction
+(OF-DPA) Abstract Switch Specification, Version 1.0, from Broadcom
+Corporation.  To set/get the mode for front-panel ports, see port settings,
+below.
+
+Port Settings
+-------------
+
+Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS:
+
+    PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)
+
+    Value is port bitmap.  Bits 0 and 63 always read 0.  Bits 1-62
+    read 1 for link UP and 0 for link DOWN for respective front-panel
+    ports.
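+
+Building on the rocker_read64 sketch from Section 3, a driver could check a
+port's link state like this (illustrative only):
+
+    /* pport is a front-panel port number in 1..62 */
+    static int port_link_up(volatile uint8_t *bar0, unsigned pport)
+    {
+        uint64_t status = rocker_read64(bar0, 0x0310);  /* PORT_PHYS_LINK_STATUS */
+        return (status >> pport) & 1;
+    }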
+
+Other properties for front-panel ports are available via DMA CMD descriptors:
+
+    Get PORT_SETTINGS descriptor:
+
+        field           width   description
+        ----------------------------------------------
+        PORT_SETTINGS   2       CMD_GET
+        PPORT           4       Physical port #
+
+    Get PORT_SETTINGS completion:
+
+        field           width   description
+        ----------------------------------------------
+        PPORT           4       Physical port #
+        SPEED           4       Current port interface speed, in Mbps
+        DUPLEX          1       1 = Full, 0 = Half
+        AUTONEG         1       1 = enabled, 0 = disabled
+        MACADDR         6       Port MAC address
+        MODE            1       0 = OF-DPA
+        LEARNING        1       MAC address learning on port
+                                    1 = enabled
+                                    0 = disabled
+        PHYS_NAME       <var>   Physical port name (string)
+
+    Set PORT_SETTINGS descriptor:
+
+        field           width   description
+        ----------------------------------------------
+        PORT_SETTINGS   2       CMD_SET
+        PPORT           4       Physical port #
+        SPEED           4       Port interface speed, in Mbps
+        DUPLEX          1       1 = Full, 0 = Half
+        AUTONEG         1       1 = enabled, 0 = disabled
+        MACADDR         6       Port MAC address
+        MODE            1       0 = OF-DPA
+
+Port Enable
+-----------
+
+Front-panel ports are initially disabled, which means port ingress and egress
+packets will be dropped.  To enable or disable a port, use PORT_PHYS_ENABLE:
+
+    PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)
+
+    Value is a bitmap of the first 64 ports.  Bits 0 and 63 are ignored
+    and always read as 0.  Write 1 to enable a port; write 0 to disable
+    it.  Default is 0.
+
+
+SECTION 7: Switch Control
+=========================
+
+This section covers switch-wide register settings.
+
+Control
+-------
+
+This register is used for low level control of the switch.
+
+    CONTROL: offset 0x0300, 32-bit, (W)
+
+    bit     name            description
+    ------------------------------------------------------------------------
+    [0]     CONTROL_RESET   If set, device will perform reset
+    [1:31]                  Reserved
+
+Switch ID
+---------
+
+The switch has a SWITCH_ID to be used by software to uniquely identify the
+switch:
+
+    SWITCH_ID: offset 0x0320, 64-bit, (R)
+
+    Value is opaque to switch software and no special encoding is implied.
+
+
+SECTION 8: Events
+=================
+
+Non-I/O asynchronous events from the device are notified to the host using
+the event ring.  The TLV structure for events is:
+
+    field   width   description
+    ---------------------------------------------------
+    TYPE    4       Event type, one of:
+                        1: LINK_CHANGED
+                        2: MAC_VLAN_SEEN
+    INFO    <nest>  Event info (details below)
+
+Link Changed Event
+------------------
+
+When link status changes on a physical port, this event is generated.
+
+    field       width   description
+    ---------------------------------------------------
+    INFO        <nest>
+      PPORT     4       Physical port
+      LINKUP    1       Link status:
+                            0: down
+                            1: up
+
+MAC VLAN Seen Event
+-------------------
+
+When a packet ingresses on a port and the source MAC/VLAN isn't known to the
+device, the device will generate this event.  In response to the event, the
+driver should install the MAC/VLAN for the port into the device's bridge
+table.  Once installed, the MAC/VLAN is known on the port and this event will
+no longer be generated.
+
+    field       width   description
+    ---------------------------------------------------
+    INFO        <nest>
+      PPORT     4       Physical port
+      MAC       6       MAC address
+      VLAN      2       VLAN ID
+
+
+SECTION 9: CPU Packet Processing
+================================
+
+Ingress packets directed to the host CPU for further processing are delivered
+in the DMA RX ring.  Likewise, host CPU originating packets destined to
+egress on switch ports are scheduled by software using the DMA TX ring.
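+
+The per-port Tx/Rx rings follow the fixed layout from Section 4.  A driver
+could compute the ring indices like this (illustrative; assumes the ring
+table's 0-based ports map to physical ports 1-62 in order):
+
+    /* Descriptor ring indices for a physical port's Tx/Rx rings. */
+    static unsigned tx_ring_index(unsigned pport) { return 2 + 2 * (pport - 1); }
+    static unsigned rx_ring_index(unsigned pport) { return 3 + 2 * (pport - 1); }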
+
+Tx Packet Processing
+--------------------
+
+Software schedules packets for egress on switch ports using the DMA TX ring.
+A TX descriptor buffer describes the packet location and size in host
+DMA-able memory, the destination port, and any hardware-offload functions
+(such as L3 payload checksum offload).  Software then bumps the descriptor
+head to signal hardware of new Tx work.  In response, hardware will DMA read
+Tx descriptors up to head, DMA read the descriptor buffer and packet data,
+perform offloading functions, and finally frame the packet on the wire
+(network).  Once packet processing is complete, hardware will write back
+status to the descriptor(s) to signal to software that Tx is complete and
+the software resources (e.g. skb) backing the packet can be released.
+
+Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A
+TLV is used for each packet fragment.
+
+   Tx ring
+   +–––––––––+                                    pkt frag 1
+   |         |        desc buf                    +–––––––+    +–+
+   | desc 0  +––––––> +––––––––+             +––> |       |      |
+   +–––––––––+        |  TLVs  +–––––––––––––+    +–––––––+      |
+   head+–+            |        |                  pkt frag 2     |
+   |     |            |  TLVs  +–––––––––––––+    +–––––––+      |
+   | desc 1  |        |        |             +––> |       |     pkt
+   +–––––––––+        |  TLVs  +––––––––––+       +–––––––+      |
+   |         |        +––––––––+          |       pkt frag 3     |
+   |         |                            |       +–––––––+      |
+   +–––––––––+                            +–––––> |       |      |
+   |         |                                    +–––––––+    +–+
+   |         |
+   +–––––––––+
+
+                                fig 2.
+
+The TLVs for Tx descriptor buffer are:
+
+    field           width   description
+    ---------------------------------------------------------------------
+    PPORT           4       Destination physical port #
+    TX_OFFLOAD      1       Hardware offload modes:
+                                0: no offload
+                                1: insert IP csum (ipv4 only)
+                                2: insert TCP/UDP csum
+                                3: L3 csum calc and insert
+                                   into csum offset (TX_L3_CSUM_OFF)
+                                   16-bit 1's complement csum value.
+                                   IPv4 pseudo-header and IP
+                                   already calculated by OS
+                                   and inserted.
+                                4: TSO (TCP Segmentation Offload)
+    TX_L3_CSUM_OFF  2       For L3 csum offload mode, the offset,
+                            from the beginning of the packet,
+                            of the csum field in the L3 header
+    TX_TSO_MSS      2       For TSO offload mode, the
+                            Maximum Segment Size in bytes
+    TX_TSO_HDR_LEN  2       For TSO offload mode, the
+                            length of ethernet, IP, and
+                            TCP/UDP headers, including IP
+                            and TCP options.
+    TX_FRAGS        <array> Packet fragments
+      TX_FRAG       <nest>  Packet fragment
+        TX_FRAG_ADDR  8     DMA address of packet fragment
+        TX_FRAG_LEN   2     Packet fragment length
+
+Possible status return codes in descriptor on completion are:
+
+    DESC_COMP_ERR   reason
+    --------------------------------------------------------------------
+    0               OK
+    -ROCKER_ENXIO   address or data read err on desc buf or packet
+                    fragment
+    -ROCKER_EINVAL  bad pport or TSO or csum offloading error
+    -ROCKER_ENOMEM  no memory for internal staging tx fragment
+
+Rx Packet Processing
+--------------------
+
+Packets ingressing on switch ports that are not forwarded by the switch but
+rather directed to the host CPU for further processing are delivered in the
+DMA RX ring.  Rx descriptor buffers are allocated by software and placed on
+the ring.  Hardware will fill Rx descriptor buffers with packet data, write
+the completion, and signal to software that a new packet is ready.  Since Rx
+packet size is not known a priori, the Rx descriptor buffer must be allocated
+for the worst-case packet size.  A single Rx descriptor will contain the
+entire Rx packet data in one RX_FRAG.  Other Rx TLVs describe any hardware
+offloads performed on the packet, such as checksum validation.
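+
+A sketch of the Rx refill step implied above (illustrative; the TLV type
+codes, ring helpers, and types are hypothetical driver constructs, not part
+of this spec):
+
+    #include <stdint.h>
+
+    #define RX_FRAG_ADDR     1   /* placeholder TLV type codes */
+    #define RX_FRAG_MAX_LEN  2
+
+    struct ring;
+    struct desc;
+    extern struct desc *ring_head_desc(struct ring *rx);
+    extern void tlv_put_u64(struct desc *d, int type, uint64_t v);
+    extern void tlv_put_u16(struct desc *d, int type, uint16_t v);
+    extern void ring_bump_head(struct ring *rx);   /* PIO write of HEAD */
+
+    /* Post one worst-case sized Rx buffer on the Rx ring: describe it
+     * with RX_FRAG_ADDR/RX_FRAG_MAX_LEN TLVs, then bump HEAD so the
+     * hardware may fill it and write the completion. */
+    static void rx_post_buffer(struct ring *rx, uint64_t buf_dma,
+                               uint16_t max_len)
+    {
+        struct desc *d = ring_head_desc(rx);
+
+        tlv_put_u64(d, RX_FRAG_ADDR, buf_dma);
+        tlv_put_u16(d, RX_FRAG_MAX_LEN, max_len);
+        ring_bump_head(rx);
+    }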
+
+The TLVs for Rx descriptor buffer are:
+
+    field           width   description
+    ---------------------------------------------------
+    PPORT           4       Source physical port #
+    RX_FLAGS        2       Packet parsing flags:
+                                (1 << 0): IPv4 packet
+                                (1 << 1): IPv6 packet
+                                (1 << 2): csum calculated
+                                (1 << 3): IPv4 csum good
+                                (1 << 4): IP fragment
+                                (1 << 5): TCP packet
+                                (1 << 6): UDP packet
+                                (1 << 7): TCP/UDP csum good
+                                (1 << 8): Offload forward
+    RX_CSUM         2       IP calculated checksum:
+                                IPv4: IP payload csum
+                                IPv6: header and payload csum
+                            (Only valid if the RX_FLAGS csum
+                            calculated bit is set)
+    RX_FRAG_ADDR    8       DMA address of packet fragment
+    RX_FRAG_MAX_LEN 2       Packet maximum fragment length
+    RX_FRAG_LEN     2       Actual packet fragment length after receive
+
+The Offload forward RX_FLAG indicates the device has already forwarded the
+packet so the host CPU should not also forward the packet.
+
+Possible status return codes in descriptor on completion are:
+
+    DESC_COMP_ERR       reason
+    --------------------------------------------------------------------
+    0                   OK
+    -ROCKER_ENXIO       address or data read err on desc buf
+    -ROCKER_ENOMEM      no memory for internal staging desc buf
+    -ROCKER_EMSGSIZE    Rx descriptor buffer wasn't big enough to
+                        contain packet data TLV and other TLVs.
+
+
+SECTION 10: OF-DPA Mode
+=======================
+
+OF-DPA mode allows the switch to offload flow packet processing functions to
+hardware.  An OpenFlow controller would communicate with an OpenFlow agent
+installed on the switch.  The OpenFlow agent would (directly or indirectly)
+communicate with the Rocker switch driver, which in turn would program switch
+hardware with flow functionality, as defined in OF-DPA.  The block diagram is:
+
+                +–––––––––––––––----–––+
+                |         OF           |
+                |  Remote Controller   |
+                +––––––––+––----–––––––+
+                         |
+                         |
+                +––––––––+–––––––––+
+                |       OF         |
+                |   Local Agent    |
+                +––––––––––––––––––+
+                |                  |
+                |  Rocker Driver   |
+                +––––––––––––––––––+
+                    <this spec>
+                +––––––––––––––––––+
+                |                  |
+                |  Rocker Switch   |
+                +––––––––––––––––––+
+
+To participate in flow functions, ports must be configured for OF-DPA mode
+during switch initialization.
+
+OF-DPA Flow Table Interface
+---------------------------
+
+There are commands to add, modify, delete, and get stats of flow table
+entries.  The commands are issued using the DMA CMD descriptor ring.
The following
+commands are defined:
+
+    CMD_ADD:         add an entry to flow table
+    CMD_MOD:         modify an entry in flow table
+    CMD_DEL:         delete an entry from flow table
+    CMD_GET_STATS:   get stats for flow entry
+
+TLVs for add and modify commands are:
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_CMD               2      CMD_[ADD|MOD]
+    OF_DPA_TBL               2      Flow table ID
+                                      0: ingress port
+                                      10: vlan
+                                      20: termination mac
+                                      30: unicast routing
+                                      40: multicast routing
+                                      50: bridging
+                                      60: ACL policy
+    OF_DPA_PRIORITY          4      Flow priority
+    OF_DPA_HARDTIME          4      Hard timeout for flow
+    OF_DPA_IDLETIME          4      Idle timeout for flow
+    OF_DPA_COOKIE            8      Cookie
+
+Additional TLVs based on flow table ID:
+
+Table ID 0: ingress port
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_IN_PPORT          4      ingress physical port number
+    OF_DPA_GOTO_TBL          2      goto table ID; zero to drop
+
+Table ID 10: vlan
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_IN_PPORT          4      ingress physical port number
+    OF_DPA_VLAN_ID           2 (N)  vlan ID
+    OF_DPA_VLAN_ID_MASK      2 (N)  vlan ID mask
+    OF_DPA_GOTO_TBL          2      goto table ID; zero to drop
+    OF_DPA_NEW_VLAN_ID       2 (N)  new vlan ID
+
+Table ID 20: termination mac
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_IN_PPORT          4      ingress physical port number
+    OF_DPA_IN_PPORT_MASK     4      ingress physical port number mask
+    OF_DPA_ETHERTYPE         2 (N)  must be either 0x0800 or 0x86dd
+    OF_DPA_DST_MAC           6 (N)  destination MAC
+    OF_DPA_DST_MAC_MASK      6 (N)  destination MAC mask
+    OF_DPA_VLAN_ID           2 (N)  vlan ID
+    OF_DPA_VLAN_ID_MASK      2 (N)  vlan ID mask
+    OF_DPA_GOTO_TBL          2      only acceptable values are
+                                    unicast or multicast routing
+                                    table IDs
+    OF_DPA_OUT_PPORT         2      if specified, must be
+                                    controller, set zero otherwise
+
+Table ID 30: unicast routing
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_ETHERTYPE         2 (N)  must be either 0x0800 or 0x86dd
+    OF_DPA_DST_IP            4 (N)  destination IPv4 address.
+                                    Must be unicast address
+    OF_DPA_DST_IP_MASK       4 (N)  IP mask. Must be prefix mask
+    OF_DPA_DST_IPV6          16 (N) destination IPv6 address.
+                                    Must be unicast address
+    OF_DPA_DST_IPV6_MASK     16 (N) IPv6 mask. Must be prefix mask
+    OF_DPA_GOTO_TBL          2      goto table ID; zero to drop
+    OF_DPA_GROUP_ID          4      data for GROUP action must
+                                    be an L3 Unicast group entry
+
+Table ID 40: multicast routing
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_ETHERTYPE         2 (N)  must be either 0x0800 or 0x86dd
+    OF_DPA_VLAN_ID           2 (N)  vlan ID
+    OF_DPA_SRC_IP            4 (N)  source IPv4. Optional,
+                                    can contain IPv4 address,
+                                    must be completely masked
+                                    if not used
+    OF_DPA_SRC_IP_MASK       4 (N)  IP Mask
+    OF_DPA_DST_IP            4 (N)  destination IPv4 address.
+                                    Must be multicast address
+    OF_DPA_SRC_IPV6          16 (N) source IPv6 Address. Optional.
+                                    Can contain IPv6 address,
+                                    must be completely masked
+                                    if not used
+    OF_DPA_SRC_IPV6_MASK     16 (N) IPv6 mask.
+    OF_DPA_DST_IPV6          16 (N) destination IPv6 Address.
Must
+                                    be multicast address
+    OF_DPA_GOTO_TBL          2      goto table ID; zero to drop
+    OF_DPA_GROUP_ID          4      data for GROUP action must
+                                    be an L3 multicast group entry
+
+Table ID 50: bridging
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_VLAN_ID           2 (N)  vlan ID
+    OF_DPA_TUNNEL_ID         4      tunnel ID
+    OF_DPA_DST_MAC           6 (N)  destination MAC
+    OF_DPA_DST_MAC_MASK      6 (N)  destination MAC mask
+    OF_DPA_GOTO_TBL          2      goto table ID; zero to drop
+    OF_DPA_GROUP_ID          4      data for GROUP action must
+                                    be a L2 Interface, L2
+                                    Multicast, L2 Flood,
+                                    or L2 Overlay group entry
+                                    as appropriate
+    OF_DPA_TUNNEL_LPORT      4      unicast Tenant Bridging
+                                    flows specify a tunnel
+                                    logical port ID
+    OF_DPA_OUT_PPORT         2      data for OUTPUT action,
+                                    restricted to CONTROLLER,
+                                    set to 0 otherwise
+
+Table ID 60: ACL policy
+
+    field                    width  description
+    ----------------------------------------------------
+    OF_DPA_IN_PPORT          4      ingress physical port number
+    OF_DPA_IN_PPORT_MASK     4      ingress physical port number mask
+    OF_DPA_ETHERTYPE         2 (N)  ethertype
+    OF_DPA_VLAN_ID           2 (N)  vlan ID
+    OF_DPA_VLAN_ID_MASK      2 (N)  vlan ID mask
+    OF_DPA_VLAN_PCP          2 (N)  vlan Priority Code Point
+    OF_DPA_VLAN_PCP_MASK     2 (N)  vlan Priority Code Point mask
+    OF_DPA_SRC_MAC           6 (N)  source MAC
+    OF_DPA_SRC_MAC_MASK      6 (N)  source MAC mask
+    OF_DPA_DST_MAC           6 (N)  destination MAC
+    OF_DPA_DST_MAC_MASK      6 (N)  destination MAC mask
+    OF_DPA_TUNNEL_ID         4      tunnel ID
+    OF_DPA_SRC_IP            4 (N)  source IPv4. Optional,
+                                    can contain IPv4 address,
+                                    must be completely masked
+                                    if not used
+    OF_DPA_SRC_IP_MASK       4 (N)  IP Mask
+    OF_DPA_DST_IP            4 (N)  destination IPv4 address.
+                                    Must be multicast address
+    OF_DPA_DST_IP_MASK       4 (N)  IP Mask
+    OF_DPA_SRC_IPV6          16 (N) source IPv6 Address. Optional.
+                                    Can contain IPv6 address,
+                                    must be completely masked
+                                    if not used
+    OF_DPA_SRC_IPV6_MASK     16 (N) IPv6 mask
+    OF_DPA_DST_IPV6          16 (N) destination IPv6 Address. Must
+                                    be multicast address.
+    OF_DPA_DST_IPV6_MASK     16 (N) IPv6 mask
+    OF_DPA_SRC_ARP_IP        4 (N)  source IPv4 address in the ARP
+                                    payload. Only used if ethertype
+                                    == 0x0806.
+    OF_DPA_SRC_ARP_IP_MASK   4 (N)  IP Mask
+    OF_DPA_IP_PROTO          1      IP protocol
+    OF_DPA_IP_PROTO_MASK     1      IP protocol mask
+    OF_DPA_IP_DSCP           1      DSCP
+    OF_DPA_IP_DSCP_MASK      1      DSCP mask
+    OF_DPA_IP_ECN            1      ECN
+    OF_DPA_IP_ECN_MASK       1      ECN mask
+    OF_DPA_L4_SRC_PORT       2 (N)  L4 source port, only for
+                                    TCP, UDP, or SCTP
+    OF_DPA_L4_SRC_PORT_MASK  2 (N)  L4 source port mask
+    OF_DPA_L4_DST_PORT       2 (N)  L4 destination port, only for
+                                    TCP, UDP, or SCTP
+    OF_DPA_L4_DST_PORT_MASK  2 (N)  L4 destination port mask
+    OF_DPA_ICMP_TYPE         1      ICMP type, only if IP
+                                    protocol is 1
+    OF_DPA_ICMP_TYPE_MASK    1      ICMP type mask
+    OF_DPA_ICMP_CODE         1      ICMP code
+    OF_DPA_ICMP_CODE_MASK    1      ICMP code mask
+    OF_DPA_IPV6_LABEL        4 (N)  IPv6 flow label
+    OF_DPA_IPV6_LABEL_MASK   4 (N)  IPv6 flow label mask
+    OF_DPA_GROUP_ID          4      data for GROUP action
+    OF_DPA_QUEUE_ID_ACTION   1      write the queue ID
+    OF_DPA_NEW_QUEUE_ID      1      queue ID
+    OF_DPA_VLAN_PCP_ACTION   1      write the VLAN priority
+    OF_DPA_NEW_VLAN_PCP      1      VLAN priority
+    OF_DPA_IP_DSCP_ACTION    1      write the DSCP
+    OF_DPA_NEW_IP_DSCP       1      new DSCP
+    OF_DPA_TUNNEL_LPORT      4      restrict to valid tunnel
+                                    logical port, set to 0
+                                    otherwise.
+    OF_DPA_OUT_PPORT         2      data for OUTPUT action,
+                                    restricted to CONTROLLER,
+                                    set to 0 otherwise
+    OF_DPA_CLEAR_ACTIONS     4      if 1, packets matching flow are
+                                    dropped (all other instructions
+                                    ignored)
+
+TLVs for flow delete and get stats commands are:
+
+    field                    width  description
+    ---------------------------------------------------
+    OF_DPA_CMD               2      CMD_[DEL|GET_STATS]
+    OF_DPA_COOKIE            8      Cookie
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+    field                    width  description
+    ---------------------------------------------------
+    OF_DPA_STAT_DURATION     4      Flow duration
+    OF_DPA_STAT_RX_PKTS      8      Received packets
+    OF_DPA_STAT_TX_PKTS      8      Transmit packets
+
+Possible status return codes in descriptor on completion are:
+
+    DESC_COMP_ERR     command            reason
+    --------------------------------------------------------------------
+    0                 all                OK
+    -ROCKER_EFAULT    all                head or tail index outside
+                                         of ring
+    -ROCKER_ENXIO     all                address or data read err on
+                                         desc buf
+    -ROCKER_EMSGSIZE  GET_STATS          cmd descriptor buffer wasn't
+                                         big enough to contain write-back
+                                         TLVs
+    -ROCKER_EINVAL    all                invalid parameters passed in
+    -ROCKER_EEXIST    ADD                entry already exists
+    -ROCKER_ENOSPC    ADD                no space left in flow table
+    -ROCKER_ENOENT    MOD|DEL|GET_STATS  cookie invalid
+
+Group Table Interface
+---------------------
+
+There are commands to add, modify, delete, and get stats of group table
+entries. The commands are issued using the DMA CMD descriptor ring. The
+following commands are defined:
+
+    CMD_ADD:         add an entry to group table
+    CMD_MOD:         modify an entry in group table
+    CMD_DEL:         delete an entry from group table
+    CMD_GET_STATS:   get stats for group entry
+
+TLVs for add and modify commands are:
+
+    field                    width  description
+    -----------------------------------------------------------
+    FLOW_GROUP_CMD           2      CMD_[ADD|MOD]
+    FLOW_GROUP_ID            2      Flow group ID
+    FLOW_GROUP_TYPE          1      Group type:
+                                      0: L2 interface
+                                      1: L2 rewrite
+                                      2: L3 unicast
+                                      3: L2 multicast
+                                      4: L2 flood
+                                      5: L3 interface
+                                      6: L3 multicast
+                                      7: L3 ECMP
+                                      8: L2 overlay
+    FLOW_VLAN_ID             2      Vlan ID (types 0, 3, 4, 6)
+    FLOW_L2_PORT             2      Port (type 0)
+    FLOW_INDEX               4      Index (all types but 0)
+    FLOW_OVERLAY_TYPE        1      Overlay sub-type (type 8):
+                                      0: Flood unicast tunnel
+                                      1: Flood multicast tunnel
+                                      2: Multicast unicast tunnel
+                                      3: Multicast multicast tunnel
+    FLOW_GROUP_ACTION        nest
+      FLOW_GROUP_ID          2      next group ID in chain (all
+                                    types except 0)
+      FLOW_OUT_PORT          4      egress port (types 0, 8)
+      FLOW_POP_VLAN_TAG      1      strip outer VLAN tag (type 1
+                                    only)
+      FLOW_VLAN_ID           2      (types 1, 5)
+      FLOW_SRC_MAC           6      (types 1, 2, 5)
+      FLOW_DST_MAC           6      (types 1, 2)
+
+TLVs for flow delete and get stats commands are:
+
+    field                    width  description
+    -----------------------------------------------------------
+    FLOW_GROUP_CMD           2      CMD_[DEL|GET_STATS]
+    FLOW_GROUP_ID            2      Flow group ID
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+    field                    width  description
+    ---------------------------------------------------
+    FLOW_GROUP_ID            2      Flow group ID
+    FLOW_STAT_DURATION       4      Flow duration
+    FLOW_STAT_REF_COUNT      4      Flow reference count
+    FLOW_STAT_BUCKET_COUNT   4      Flow bucket count
+
+Possible status return codes in descriptor on completion are:
+
+    DESC_COMP_ERR     command            reason
+    --------------------------------------------------------------------
+    0                 all                OK
+    -ROCKER_EFAULT    all                head or tail index outside
+                                         of ring
+    -ROCKER_ENXIO     all                address or data read err on
+                                         desc buf
+    -ROCKER_ENOSPC    GET_STATS          cmd descriptor buffer wasn't
+                                         big enough to contain
write-back
+                                         TLVs
+    -ROCKER_EINVAL    ADD|MOD            invalid parameters passed in
+    -ROCKER_EEXIST    ADD                entry already exists
+    -ROCKER_ENOSPC    ADD                no space left in flow table
+    -ROCKER_ENOENT    MOD|DEL|GET_STATS  group ID invalid
+    -ROCKER_EBUSY     DEL                group reference count non-zero
+    -ROCKER_ENODEV    ADD                next group ID doesn't exist
+
+
+
+References
+==========
+
+[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
+Version 1.0, from Broadcom Corporation, February 21, 2014.
diff --git a/docs/specs/standard-vga.txt b/docs/specs/standard-vga.txt
new file mode 100644
index 000000000..18f75f1b3
--- /dev/null
+++ b/docs/specs/standard-vga.txt
@@ -0,0 +1,81 @@
+
+QEMU Standard VGA
+=================
+
+Exists in two variants, for isa and pci.
+
+command line switches:
+    -vga std               [ picks isa for -M isapc, otherwise pci ]
+    -device VGA            [ pci variant ]
+    -device isa-vga        [ isa variant ]
+    -device secondary-vga  [ legacy-free pci variant ]
+
+
+PCI spec
+--------
+
+Applies to the pci variant only for obvious reasons.
+
+PCI ID: 1234:1111
+
+PCI Region 0:
+   Framebuffer memory, 16 MB in size (by default).
+   Size is tunable via vga_mem_mb property.
+
+PCI Region 1:
+   Reserved (so we have the option to make the framebuffer bar 64bit).
+
+PCI Region 2:
+   MMIO bar, 4096 bytes in size (qemu 1.3+)
+
+PCI ROM Region:
+   Holds the vgabios (qemu 0.14+).
+
+
+The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
+instead of PCI_CLASS_DISPLAY_VGA.
+
+
+IO ports used
+-------------
+
+Doesn't apply to the legacy-free pci variant, use the MMIO bar instead.
+
+03c0 - 03df : standard vga ports
+01ce        : bochs vbe interface index port
+01cf        : bochs vbe interface data port (x86 only)
+01d0        : bochs vbe interface data port
+
+
+Memory regions used
+-------------------
+
+0xe0000000 : Framebuffer memory, isa variant only.
+
+The pci variant used to mirror the framebuffer bar here, qemu 0.14+
+stops doing that (except when in -M pc-$old compat mode).
+
+
+MMIO area spec
+--------------
+
+Likewise applies to the pci variant only for obvious reasons.
+
+0000 - 03ff : edid data blob.
+0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
+              word access is supported, bytes are written
+              in little endian order (aka index port first),
+              so indexed registers can be updated with a
+              single mmio write (and thus only one vmexit).
+0500 - 0515 : bochs dispi interface registers, mapped flat
+              without index/data ports.  Use (index << 1)
+              as offset for (16bit) register access.
+
+0600 - 0607 : qemu extended registers.  qemu 2.2+ only.
+              The pci revision is 2 (or greater) when
+              these registers are present.  The registers
+              are 32bit.
+              0600 : qemu extended register region size, in bytes.
+              0604 : framebuffer endianness register.
+                     - 0xbebebebe indicates big endian.
+                     - 0x1e1e1e1e indicates little endian.
diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst
new file mode 100644
index 000000000..3be190343
--- /dev/null
+++ b/docs/specs/tpm.rst
@@ -0,0 +1,524 @@
+===============
+QEMU TPM Device
+===============
+
+Guest-side hardware interface
+=============================
+
+TIS interface
+-------------
+
+The QEMU TPM emulation implements a TPM TIS hardware interface
+following the Trusted Computing Group's specification "TCG PC Client
+Specific TPM Interface Specification (TIS)", Specification Version
+1.3, 21 March 2013. (see the `TIS specification`_, or a later version
+of it).
+
+The TIS interface makes a memory mapped IO region in the area
+0xfed40000-0xfed44fff available to the guest operating system.
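+
+That 0x5000-byte window covers five localities of 0x1000 bytes each. As a
+minimal C sketch of how a guest driver might compute register addresses
+within it (the register offsets are taken from the TIS specification cited
+above; the helper itself is illustrative only):
+
+.. code-block:: c
+
+   #include <stdint.h>
+
+   #define TPM_TIS_BASE      0xfed40000u
+   #define TPM_TIS_LOC_SIZE  0x1000u   /* one page per locality, 0..4 */
+
+   /* A few register offsets within a locality, per TIS 1.3. */
+   #define TPM_ACCESS        0x0000u
+   #define TPM_STS           0x0018u
+   #define TPM_DATA_FIFO     0x0024u
+
+   /* Physical address of a register in the given locality. */
+   static inline uint64_t tpm_tis_reg(unsigned int locality, uint32_t reg)
+   {
+       return TPM_TIS_BASE + (uint64_t)locality * TPM_TIS_LOC_SIZE + reg;
+   }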
+
+QEMU files related to TPM TIS interface:
+ - ``hw/tpm/tpm_tis_common.c``
+ - ``hw/tpm/tpm_tis_isa.c``
+ - ``hw/tpm/tpm_tis_sysbus.c``
+ - ``hw/tpm/tpm_tis.h``
+
+Both an ISA device and a sysbus device are available. The former is
+used with the pc and q35 machines, while the latter can be instantiated
+in the Arm virt machine.
+
+CRB interface
+-------------
+
+QEMU also implements a TPM CRB interface following the Trusted
+Computing Group's specification "TCG PC Client Platform TPM Profile
+(PTP) Specification", Family "2.0", Level 00 Revision 01.03 v22, May
+22, 2017. (see the `CRB specification`_, or a later version of it)
+
+The CRB interface makes a memory mapped IO region in the area
+0xfed40000-0xfed40fff (1 locality) available to the guest
+operating system.
+
+QEMU files related to TPM CRB interface:
+ - ``hw/tpm/tpm_crb.c``
+
+SPAPR interface
+---------------
+
+pSeries (ppc64) machines offer a tpm-spapr device model.
+
+QEMU files related to the SPAPR interface:
+ - ``hw/tpm/tpm_spapr.c``
+
+fw_cfg interface
+================
+
+The bios/firmware may read the ``"etc/tpm/config"`` fw_cfg entry for
+configuring the guest appropriately.
+
+The entry of 6 bytes has the following content, in little-endian:
+
+.. code-block:: c
+
+    #define TPM_VERSION_UNSPEC 0
+    #define TPM_VERSION_1_2 1
+    #define TPM_VERSION_2_0 2
+
+    #define TPM_PPI_VERSION_NONE 0
+    #define TPM_PPI_VERSION_1_30 1
+
+    struct FwCfgTPMConfig {
+        uint32_t tpmppi_address;   /* PPI memory location */
+        uint8_t tpm_version;       /* TPM version */
+        uint8_t tpmppi_version;    /* PPI version */
+    };
+
+ACPI interface
+==============
+
+The TPM device is defined with ACPI ID "PNP0C31". QEMU builds a SSDT
+and passes it into the guest through the fw_cfg device. The device
+description contains the base address of the TIS interface 0xfed40000
+and the size of the MMIO area (0x5000). In case a TPM2 is used by
+QEMU, a TPM2 ACPI table is also provided. The device is described to
+be used in polling mode rather than interrupt mode, primarily because
+no unused IRQ could be found.
+
+To support measurement logs written by the firmware, e.g. SeaBIOS, a
+TCPA table is implemented. This table provides a 64 KB buffer where
+the firmware can write its log into. For TPM 2, a more recent version
+of the TPM2 table provides support for measurement logs, and a TCPA
+table does not need to be created.
+
+The TCPA and TPM2 ACPI tables follow the Trusted Computing Group
+specification "TCG ACPI Specification" Family "1.2" and "2.0", Level
+00 Revision 00.37. (see the `ACPI specification`_, or a later version
+of it)
+
+ACPI PPI Interface
+------------------
+
+QEMU supports the Physical Presence Interface (PPI) for TPM 1.2 and
+TPM 2. This interface requires ACPI and firmware support. (see the
+`PPI specification`_)
+
+PPI enables a system administrator (root) to request a modification to
+the TPM upon reboot. The PPI specification defines the operation
+requests and the actions the firmware has to take. The system
+administrator passes the operation request number to the firmware
+through an ACPI interface which writes this number to a memory
+location that the firmware knows. Upon reboot, the firmware finds the
+number and sends commands to the TPM. The firmware writes the TPM
+result code and the operation request number to a memory location that
+ACPI can read from and pass the result on to the administrator.
+
+The PPI specification defines a set of mandatory and optional
+operations for the firmware to implement.
The ACPI interface also
+allows an administrator to list the supported operations. The ACPI
+code is generated by QEMU, yet the firmware needs to implement
+support on a per-operation basis, and different firmwares may support
+a different subset. Therefore, QEMU introduces the virtual memory
+device for PPI where the firmware can indicate which operations it
+supports and ACPI can enable the ones that are supported and disable
+all others. This interface lies in main memory and has the following
+layout:
+
+ +-------------+--------+--------+-------------------------------------------+
+ | Field       | Length | Offset | Description                               |
+ +=============+========+========+===========================================+
+ | ``func``    | 0x100  | 0x000  | Firmware sets values for each supported   |
+ |             |        |        | operation. See defined values below.      |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``ppin``    | 0x1    | 0x100  | SMI interrupt to use. Set by firmware.    |
+ |             |        |        | Not supported.                            |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``ppip``    | 0x4    | 0x101  | ACPI function index to pass to SMM code.  |
+ |             |        |        | Set by ACPI. Not supported.               |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``pprp``    | 0x4    | 0x105  | Result of last executed operation. Set by |
+ |             |        |        | firmware. See function index 5 for values.|
+ +-------------+--------+--------+-------------------------------------------+
+ | ``pprq``    | 0x4    | 0x109  | Operation request number to execute. See  |
+ |             |        |        | 'Physical Presence Interface Operation    |
+ |             |        |        | Summary' tables in specs. Set by ACPI.    |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``pprm``    | 0x4    | 0x10d  | Operation request optional parameter.     |
+ |             |        |        | Values depend on operation. Set by ACPI.  |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``lppr``    | 0x4    | 0x111  | Last executed operation request number.   |
+ |             |        |        | Copied from pprq field by firmware.       |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``fret``    | 0x4    | 0x115  | Result code from SMM function.            |
+ |             |        |        | Not supported.                            |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``res1``    | 0x40   | 0x119  | Reserved for future use                   |
+ +-------------+--------+--------+-------------------------------------------+
+ |``next_step``| 0x1    | 0x159  | Operation to execute after reboot by      |
+ |             |        |        | firmware. Used by firmware.               |
+ +-------------+--------+--------+-------------------------------------------+
+ | ``movv``    | 0x1    | 0x15a  | Memory overwrite variable                 |
+ +-------------+--------+--------+-------------------------------------------+
+
+The following values are supported for the ``func`` field. They
+correspond to the values used by ACPI function index 8.
+
+ +----------+-------------------------------------------------------------+
+ | Value    | Description                                                 |
+ +==========+=============================================================+
+ | 0        | Operation is not implemented.                               |
+ +----------+-------------------------------------------------------------+
+ | 1        | Operation is only accessible through firmware.              |
+ +----------+-------------------------------------------------------------+
+ | 2        | Operation is blocked for OS by firmware configuration.     |
 +----------+-------------------------------------------------------------+
+ | 3        | Operation is allowed and physically present user required. |
+ +----------+-------------------------------------------------------------+
+ | 4        | Operation is allowed and physically present user is not    |
+ |          | required.                                                   |
+ +----------+-------------------------------------------------------------+
+
+The location of the table is given by the fw_cfg ``tpmppi_address``
+field. The PPI memory region size is 0x400 (``TPM_PPI_ADDR_SIZE``) to
+leave enough room for future updates.
+
+QEMU files related to TPM ACPI tables:
+ - ``hw/i386/acpi-build.c``
+ - ``include/hw/acpi/tpm.h``
+
+TPM backend devices
+===================
+
+The TPM implementation is split into two parts, frontend and
+backend. The frontend part is the hardware interface, such as the TPM
+TIS interface described earlier, and the other part is the TPM backend
+interface. The backend interfaces implement the interaction with a TPM
+device, which may be a physical or an emulated device. The split
+between the front- and backend devices allows a frontend to be
+connected with any available backend. This enables the TIS interface
+to be used with the passthrough backend or the swtpm backend.
+
+QEMU files related to TPM backends:
+ - ``backends/tpm.c``
+ - ``include/sysemu/tpm.h``
+ - ``include/sysemu/tpm_backend.h``
+
+The QEMU TPM passthrough device
+-------------------------------
+
+In case QEMU is run on Linux as the host operating system, it is
+possible to make the hardware TPM device available to a single QEMU
+guest. In this case the user must make sure that no other program is
+using the device, e.g., /dev/tpm0, before trying to start QEMU with
+it.
+
+The passthrough driver uses the host's TPM device for sending TPM
+commands to and receiving responses from. Besides that, it accesses the
+TPM device's sysfs entry for support of command cancellation. Since
+none of the state of a hardware TPM can be migrated between hosts,
+virtual machine migration is disabled when the TPM passthrough driver
+is used.
+
+Since the host's TPM device will already be initialized by the host's
+firmware, certain commands, e.g. ``TPM_Startup()``, sent by the
+virtual firmware for device initialization, will fail. In this case
+the firmware should not use the TPM.
+
+Sharing the device with the host is generally not a recommended usage
+scenario for a TPM device. The primary reason for this is that two
+operating systems can then access the device's single set of
+resources, such as platform configuration registers
+(PCRs). Applications or kernel security subsystems, such as the Linux
+Integrity Measurement Architecture (IMA), are not expecting to share
+PCRs.
+
+QEMU files related to the TPM passthrough device:
+ - ``backends/tpm/tpm_passthrough.c``
+ - ``backends/tpm/tpm_util.c``
+ - ``include/sysemu/tpm_util.h``
+
+
+Command line to start QEMU with the TPM passthrough device using the host's
+hardware TPM ``/dev/tpm0``:
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+    -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+    -tpmdev passthrough,id=tpm0,path=/dev/tpm0 \
+    -device tpm-tis,tpmdev=tpm0 test.img
+
+
+The following commands should result in similar output inside the VM
+with a Linux kernel that either has the TPM TIS driver built-in or
+available as a module:
+
+..
code-block:: console + + # dmesg | grep -i tpm + [ 0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1) + + # dmesg | grep TCPA + [ 0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS \ + BXPCTCPA 0000001 BXPC 00000001) + + # ls -l /dev/tpm* + crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0 + + # find /sys/devices/ | grep pcrs$ | xargs cat + PCR-00: 35 4E 3B CE 23 9F 38 59 ... + ... + PCR-23: 00 00 00 00 00 00 00 00 ... + +The QEMU TPM emulator device +---------------------------- + +The TPM emulator device uses an external TPM emulator called 'swtpm' +for sending TPM commands to and receiving responses from. The swtpm +program must have been started before trying to access it through the +TPM emulator with QEMU. + +The TPM emulator implements a command channel for transferring TPM +commands and responses as well as a control channel over which control +commands can be sent. (see the `SWTPM protocol`_ specification) + +The control channel serves the purpose of resetting, initializing, and +migrating the TPM state, among other things. + +The swtpm program behaves like a hardware TPM and therefore needs to +be initialized by the firmware running inside the QEMU virtual +machine. One necessary step for initializing the device is to send +the TPM_Startup command to it. SeaBIOS, for example, has been +instrumented to initialize a TPM 1.2 or TPM 2 device using this +command. + +QEMU files related to the TPM emulator device: + - ``backends/tpm/tpm_emulator.c`` + - ``backends/tpm/tpm_util.c`` + - ``include/sysemu/tpm_util.h`` + +The following commands start the swtpm with a UnixIO control channel over +a socket interface. They do not need to be run as root. + +.. code-block:: console + + mkdir /tmp/mytpm1 + swtpm socket --tpmstate dir=/tmp/mytpm1 \ + --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \ + --log level=20 + +Command line to start QEMU with the TPM emulator device communicating +with the swtpm (x86): + +.. code-block:: console + + qemu-system-x86_64 -display sdl -accel kvm \ + -m 1024 -boot d -bios bios-256k.bin -boot menu=on \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-tis,tpmdev=tpm0 test.img + +In case a pSeries machine is emulated, use the following command line: + +.. code-block:: console + + qemu-system-ppc64 -display sdl -machine pseries,accel=kvm \ + -m 1024 -bios slof.bin -boot menu=on \ + -nodefaults -device VGA -device pci-ohci -device usb-kbd \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-spapr,tpmdev=tpm0 \ + -device spapr-vscsi,id=scsi0,reg=0x00002000 \ + -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0 \ + -drive file=test.img,format=raw,if=none,id=drive-virtio-disk0 + +In case an Arm virt machine is emulated, use the following command line: + +.. code-block:: console + + qemu-system-aarch64 -machine virt,gic-version=3,accel=kvm \ + -cpu host -m 4G \ + -nographic -no-acpi \ + -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \ + -tpmdev emulator,id=tpm0,chardev=chrtpm \ + -device tpm-tis-device,tpmdev=tpm0 \ + -device virtio-blk-pci,drive=drv0 \ + -drive format=qcow2,file=hda.qcow2,if=none,id=drv0 \ + -drive if=pflash,format=raw,file=flash0.img,readonly=on \ + -drive if=pflash,format=raw,file=flash1.img + +In case SeaBIOS is used as firmware, it should show the TPM menu item +after entering the menu with 'ESC'. + +.. code-block:: console + + Select boot device: + 1. 
DVD/CD [ata1-0: QEMU DVD-ROM ATAPI-4 DVD/CD]
+    [...]
+    5. Legacy option rom
+
+    t. TPM Configuration
+
+The following commands should result in similar output inside the VM
+with a Linux kernel that either has the TPM TIS driver built-in or
+available as a module:
+
+.. code-block:: console
+
+  # dmesg | grep -i tpm
+  [    0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1)
+
+  # dmesg | grep TCPA
+  [    0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS \
+      BXPCTCPA 0000001 BXPC 00000001)
+
+  # ls -l /dev/tpm*
+  crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
+
+  # find /sys/devices/ | grep pcrs$ | xargs cat
+  PCR-00: 35 4E 3B CE 23 9F 38 59 ...
+  ...
+  PCR-23: 00 00 00 00 00 00 00 00 ...
+
+Migration with the TPM emulator
+===============================
+
+The TPM emulator supports the following types of virtual machine
+migration:
+
+- VM save / restore (migration into a file)
+- Network migration
+- Snapshotting (migration into storage like QCOW2 or QED)
+
+The following command sequences can be used to test VM save / restore.
+
+In a 1st terminal start an instance of swtpm using the following command:
+
+.. code-block:: console
+
+  mkdir /tmp/mytpm1
+  swtpm socket --tpmstate dir=/tmp/mytpm1 \
+    --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+    --log level=20 --tpm2
+
+In a 2nd terminal start the VM:
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+    -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-tis,tpmdev=tpm0 \
+    -monitor stdio \
+    test.img
+
+Verify that the attached TPM is working as expected using applications
+inside the VM.
+
+To store the state of the VM use the following command in the QEMU
+monitor in the 2nd terminal:
+
+.. code-block:: console
+
+  (qemu) migrate "exec:cat > testvm.bin"
+  (qemu) quit
+
+At this point a file called ``testvm.bin`` should exist and the swtpm
+and QEMU processes should have ended.
+
+To test 'VM restore' you have to start the swtpm with the same
+parameters as before. If previously a TPM 2 [--tpm2] was saved, --tpm2
+must now be passed again on the command line.
+
+In the 1st terminal restart the swtpm with the same command line as
+before:
+
+.. code-block:: console
+
+  swtpm socket --tpmstate dir=/tmp/mytpm1 \
+    --ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+    --log level=20 --tpm2
+
+In the 2nd terminal restore the state of the VM using the additional
+'-incoming' option.
+
+.. code-block:: console
+
+  qemu-system-x86_64 -display sdl -accel kvm \
+    -m 1024 -boot d -bios bios-256k.bin -boot menu=on \
+    -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+    -tpmdev emulator,id=tpm0,chardev=chrtpm \
+    -device tpm-tis,tpmdev=tpm0 \
+    -incoming "exec:cat < testvm.bin" \
+    test.img
+
+Troubleshooting migration
+-------------------------
+
+There are several reasons why migration may fail. In case of problems,
+please ensure that the command lines adhere to the following rules
+and, if possible, that identical versions of QEMU and swtpm are used
+at all times.
+
+VM save and restore:
+
+ - QEMU command line parameters should be identical apart from the
+   '-incoming' option on VM restore
+
+ - swtpm command line parameters should be identical
+
+VM migration to 'localhost':
+
+ - QEMU command line parameters should be identical apart from the
+   '-incoming' option on the destination side
+
+ - swtpm command line parameters should point to two different
+   directories on the source and destination swtpm (--tpmstate dir=...)
+   (especially if different versions of libtpms were to be used on the
+   same machine).
+
+VM migration across the network:
+
+ - QEMU command line parameters should be identical apart from the
+   '-incoming' option on the destination side
+
+ - swtpm command line parameters should be identical
+
+VM Snapshotting:
+
+ - QEMU command line parameters should be identical
+
+ - swtpm command line parameters should be identical
+
+
+Besides that, migration failure reasons on the swtpm level may include
+the following:
+
+ - the versions of the swtpm on the source and destination sides are
+   incompatible
+
+ - downgrading of TPM state may not be supported
+
+ - the source and destination libtpms were compiled with different
+   compile-time options and the destination side refuses to accept the
+   state
+
+ - different migration keys are used on the source and destination side
+   and the destination side cannot decrypt the migrated state
+   (swtpm ... --migration-key ... )
+
+
+.. _TIS specification:
+   https://trustedcomputinggroup.org/pc-client-work-group-pc-client-specific-tpm-interface-specification-tis/
+
+.. _CRB specification:
+   https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
+
+
+.. _ACPI specification:
+   https://trustedcomputinggroup.org/tcg-acpi-specification/
+
+.. _PPI specification:
+   https://trustedcomputinggroup.org/resource/tcg-physical-presence-interface-specification/
+
+.. _SWTPM protocol:
+   https://github.com/stefanberger/swtpm/blob/master/man/man3/swtpm_ioctls.pod
diff --git a/docs/specs/virt-ctlr.txt b/docs/specs/virt-ctlr.txt
new file mode 100644
index 000000000..24d38084f
--- /dev/null
+++ b/docs/specs/virt-ctlr.txt
@@ -0,0 +1,26 @@
+Virtual System Controller
+=========================
+
+This device is a simple interface defined for the pure virtual machine with no
+hardware reference implementation to allow the guest kernel to send commands
+to the host hypervisor.
+
+The specification can evolve; the current state is defined below.
+
+This is an MMIO-mapped device using 256 bytes.
+
+Two 32-bit registers are defined:
+
+1- the features register (read-only, address 0x00)
+
+   This register allows the device to report features supported by the
+   controller.
+   The only feature supported for the moment is power control (0x01).
+
+2- the command register (write-only, address 0x04)
+
+   This register allows the kernel to send commands to the hypervisor.
+   The implemented commands are part of the power control feature and
+   are reset (1), halt (2) and panic (3).
+   A basic command, no-op (0), is always present and can be used to test the
+   register access. This command has no effect.
diff --git a/docs/specs/vmcoreinfo.txt b/docs/specs/vmcoreinfo.txt
new file mode 100644
index 000000000..bcbca6fe4
--- /dev/null
+++ b/docs/specs/vmcoreinfo.txt
@@ -0,0 +1,53 @@
+=================
+VMCoreInfo device
+=================
+
+The `-device vmcoreinfo` option will create a fw_cfg entry for a guest to
+store dump details.
+
+etc/vmcoreinfo
+**************
+
+A guest may use this fw_cfg entry to add information details to qemu
+dumps.
+
+The entry of 16 bytes has the following layout, in little-endian::
+
+    #define VMCOREINFO_FORMAT_NONE 0x0
+    #define VMCOREINFO_FORMAT_ELF 0x1
+
+    struct FWCfgVMCoreInfo {
+        uint16_t host_format;  /* formats host supports */
+        uint16_t guest_format; /* format guest supplies */
+        uint32_t size;         /* size of vmcoreinfo region */
+        uint64_t paddr;        /* physical address of vmcoreinfo region */
+    };
+
+Only full writes (of 16 bytes) are considered valid for further
+processing of entry values.
+
+A write of 0 in guest_format will disable further processing of
+vmcoreinfo entry values & content.
+
+You may write a guest_format that is not supported by the host, in
+which case the entry data can be ignored by qemu (but you may still
+access it through a debugger, via vmcoreinfo_realize::vmcoreinfo_state).
+
+Format & content
+****************
+
+As of qemu 2.11, only VMCOREINFO_FORMAT_ELF is supported.
+
+The entry gives location and size of an ELF note that is appended in
+qemu dumps.
+
+The note format/class must be of the target bitness and the size must
+be less than 1 MB.
+
+If the ELF note name is "VMCOREINFO", it is expected to be the Linux
+vmcoreinfo note (see Documentation/ABI/testing/sysfs-kernel-vmcoreinfo
+in Linux source). In this case, qemu dump code will read the content
+as a key=value text file, looking for the "NUMBER(phys_base)" key
+value. The value is expected to be more accurate than the
+architecture's guess of the value. This is useful for KASLR-enabled
+guests with ancient tools not handling the VMCOREINFO note.
diff --git a/docs/specs/vmgenid.txt b/docs/specs/vmgenid.txt
new file mode 100644
index 000000000..aa9f51867
--- /dev/null
+++ b/docs/specs/vmgenid.txt
@@ -0,0 +1,245 @@
+VIRTUAL MACHINE GENERATION ID
+=============================
+
+Copyright (C) 2016 Red Hat, Inc.
+Copyright (C) 2017 Skyport Systems, Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+===
+
+The VM generation ID (vmgenid) device is an emulated device which
+exposes a 128-bit, cryptographically random, integer value identifier,
+referred to as a Globally Unique Identifier, or GUID.
+
+This allows management applications (e.g. libvirt) to notify the guest
+operating system when the virtual machine is executed with a different
+configuration (e.g. snapshot execution or creation from a template). The
+guest operating system notices the change, and is then able to react as
+appropriate by marking its copies of distributed databases as dirty,
+re-initializing its random number generator, etc.
+
+
+Requirements
+------------
+
+These requirements are extracted from the "How to implement virtual machine
+generation ID support in a virtualization platform" section of the
+specification, dated August 1, 2012.
+
+The document may be found on the web at:
+  http://go.microsoft.com/fwlink/?LinkId=260709
+
+R1a. The generation ID shall live in an 8-byte aligned buffer.
+
+R1b. The buffer holding the generation ID shall be in guest RAM, ROM, or device
+     MMIO range.
+
+R1c. The buffer holding the generation ID shall be kept separate from areas
+     used by the operating system.
+
+R1d. The buffer shall not be covered by an AddressRangeMemory or
+     AddressRangeACPI entry in the E820 or UEFI memory map.
+
+R1e. The generation ID shall not live in a page frame that could be mapped with
+     caching disabled.
(In other words, regardless of whether the generation ID
+     lives in RAM, ROM or MMIO, it shall only be mapped as cacheable.)
+
+R2 to R5. [These AML requirements are isolated well enough in the Microsoft
+          specification for us to simply refer to them here.]
+
+R6. The hypervisor shall expose a _HID (hardware identifier) object in the
+    VMGenId device's scope that is unique to the hypervisor vendor.
+
+
+QEMU Implementation
+-------------------
+
+The above-mentioned specification does not dictate which ACPI descriptor table
+will contain the VM Generation ID device. Other implementations (Hyper-V and
+Xen) put it in the main descriptor table (Differentiated System Description
+Table or DSDT). For ease of debugging and implementation, we have decided to
+put it in its own Secondary System Description Table, or SSDT.
+
+The following is a dump of the contents from a running system:
+
+# iasl -p ./SSDT -d /sys/firmware/acpi/tables/SSDT
+
+Intel ACPI Component Architecture
+ASL+ Optimizing Compiler version 20150717-64
+Copyright (c) 2000 - 2015 Intel Corporation
+
+Reading ACPI table from file /sys/firmware/acpi/tables/SSDT - Length
+00000198 (0x0000C6)
+ACPI: SSDT 0x0000000000000000 0000C6 (v01 BOCHS VMGENID 00000001 BXPC
+00000001)
+Acpi table [SSDT] successfully installed and loaded
+Pass 1 parse of [SSDT]
+Pass 2 parse of [SSDT]
+Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions)
+
+Parsing completed
+Disassembly completed
+ASL Output: ./SSDT.dsl - 1631 bytes
+# cat SSDT.dsl
+/*
+ * Intel ACPI Component Architecture
+ * AML/ASL+ Disassembler version 20150717-64
+ * Copyright (c) 2000 - 2015 Intel Corporation
+ *
+ * Disassembling to symbolic ASL+ operators
+ *
+ * Disassembly of /sys/firmware/acpi/tables/SSDT, Sun Feb 5 00:19:37 2017
+ *
+ * Original Table Header:
+ *     Signature        "SSDT"
+ *     Length           0x000000CA (202)
+ *     Revision         0x01
+ *     Checksum         0x4B
+ *     OEM ID           "BOCHS "
+ *     OEM Table ID     "VMGENID"
+ *     OEM Revision     0x00000001 (1)
+ *     Compiler ID      "BXPC"
+ *     Compiler Version 0x00000001 (1)
+ */
+DefinitionBlock ("/sys/firmware/acpi/tables/SSDT.aml", "SSDT", 1, "BOCHS ",
+"VMGENID", 0x00000001)
+{
+    Name (VGIA, 0x07FFF000)
+    Scope (\_SB)
+    {
+        Device (VGEN)
+        {
+            Name (_HID, "QEMUVGID")  // _HID: Hardware ID
+            Name (_CID, "VM_Gen_Counter")  // _CID: Compatible ID
+            Name (_DDN, "VM_Gen_Counter")  // _DDN: DOS Device Name
+            Method (_STA, 0, NotSerialized)  // _STA: Status
+            {
+                Local0 = 0x0F
+                If ((VGIA == Zero))
+                {
+                    Local0 = Zero
+                }
+
+                Return (Local0)
+            }
+
+            Method (ADDR, 0, NotSerialized)
+            {
+                Local0 = Package (0x02) {}
+                Index (Local0, Zero) = (VGIA + 0x28)
+                Index (Local0, One) = Zero
+                Return (Local0)
+            }
+        }
+    }
+
+    Method (\_GPE._E05, 0, NotSerialized)  // _Exx: Edge-Triggered GPE
+    {
+        Notify (\_SB.VGEN, 0x80) // Status Change
+    }
+}
+
+
+Design Details:
+---------------
+
+Requirements R1a through R1e dictate that the memory holding the
+VM Generation ID must be allocated and owned by the guest firmware,
+in this case BIOS or UEFI. However, to be useful, QEMU must be able to
+change the contents of the memory at runtime, specifically when starting a
+backed-up or snapshotted image. In order to do this, QEMU must know the
+address that has been allocated.
+
+The mechanism chosen for this memory sharing is writeable fw_cfg blobs.
+These are data objects that are visible to both QEMU and guests, and are
+addressable as sequential files.
+
+More information about fw_cfg can be found in "docs/specs/fw_cfg.txt".
+
+Two fw_cfg blobs are used in this case:
+
+/etc/vmgenid_guid - contains the actual VM Generation ID GUID
+                  - read-only to the guest
+/etc/vmgenid_addr - contains the address of the downloaded vmgenid blob
+                  - writeable by the guest
+
+
+QEMU sends the following commands to the guest at startup:
+
+1. Allocate memory for vmgenid_guid fw_cfg blob.
+2. Write the address of vmgenid_guid into the SSDT (VGIA ACPI variable as
+   shown above in the iasl dump). Note that this change is not propagated
+   back to QEMU.
+3. Write the address of vmgenid_guid back to QEMU's copy of vmgenid_addr
+   via the fw_cfg DMA interface.
+
+After step 3, QEMU is able to update the contents of vmgenid_guid at will.
+
+Since BIOS or UEFI does not necessarily run when we wish to change the GUID,
+the value of VGIA is persisted via the VMState mechanism.
+
+As spelled out in the specification, any change to the GUID triggers an
+ACPI notification. The exact handler to use is not specified, so the vmgenid
+device uses the first unused one: \_GPE._E05.
+
+
+Endian-ness Considerations:
+---------------------------
+
+Although not specified in Microsoft's document, it is assumed that the
+device is expected to use little-endian format.
+
+All GUIDs passed in via the command line or monitor are treated as big-endian.
+GUID values displayed via the monitor are shown in big-endian format.
+
+
+GUID Storage Format:
+--------------------
+
+In order to implement an OVMF "SDT Header Probe Suppressor", the contents of
+the vmgenid_guid fw_cfg blob are not simply a 128-bit GUID. There is also
+significant padding in order to align and fill a memory page, as shown in the
+following diagram:
+
++----------------------------------+
+| SSDT with OEM Table ID = VMGENID |
++----------------------------------+
+| ...                              |       TOP OF PAGE
+| VGIA dword object ---------------|-----> +---------------------------+
+| ...                              |       | fw-allocated array for    |
+| _STA method referring to VGIA    |       | "etc/vmgenid_guid"        |
+| ...                              |       +---------------------------+
+| ADDR method referring to VGIA    |       |  0: OVMF SDT Header probe |
+| ...                              |       |     suppressor            |
++----------------------------------+       | 36: padding for 8-byte    |
+                                           |     alignment             |
+                                           | 40: GUID                  |
+                                           | 56: padding to page size  |
+                                           +---------------------------+
+                                           END OF PAGE
+
+
+Device Usage:
+-------------
+
+The device has one property, which may only be set using the command line:
+
+  guid - sets the value of the GUID. A special value "auto" instructs
+         QEMU to generate a new random GUID.
+
+For example:
+
+  QEMU -device vmgenid,guid="324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"
+  QEMU -device vmgenid,guid=auto
+
+The property may be queried via QMP/HMP:
+
+  (QEMU) query-vm-generation-id
+  {"return": {"guid": "324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"}}
+
+Setting of this parameter is intentionally left out of the QMP/HMP
+interfaces. There are no known use cases for changing the GUID once QEMU is
+running, and adding this capability would greatly increase the complexity.
diff --git a/docs/specs/vmw_pvscsi-spec.txt b/docs/specs/vmw_pvscsi-spec.txt
new file mode 100644
index 000000000..49affb2a4
--- /dev/null
+++ b/docs/specs/vmw_pvscsi-spec.txt
@@ -0,0 +1,92 @@
+General Description
+===================
+
+This document describes the VMware PVSCSI device interface specification.
+Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
+Based on the source code of the PVSCSI Linux driver from kernel 3.0.4.
+
+PVSCSI Device Interface Overview
+================================
+
+The interface is based on a memory area shared between the hypervisor and the
+VM. The memory area is obtained by the driver as a device IO memory resource
+of PVSCSI_MEM_SPACE_SIZE length.
+The shared memory consists of a registers area and a rings area.
+The registers area is used to raise hypervisor interrupts and issue device
+commands. The rings area is used to transfer data descriptors and SCSI
+commands from the VM to the hypervisor and to transfer messages produced by
+the hypervisor to the VM. The data itself is transferred via virtual
+scatter-gather DMA.
+
+PVSCSI Device Registers
+=======================
+
+The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
+The structure of the registers area is described by the PVSCSIRegOffset enum.
+There are registers to issue device commands (with optional short data),
+raise device interrupts, and control interrupt masking.
+
+PVSCSI Device Rings
+===================
+
+There are three rings in shared memory:
+
+  1. Request ring (struct PVSCSIRingReqDesc *req_ring)
+     - ring for OS to device requests
+  2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
+     - ring for device request completions
+  3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
+     - ring for messages from device.
+     This ring is optional and the guest might not configure it.
+
+There is a control area (struct PVSCSIRingsState *rings_state) used to control
+ring operation.
+
+PVSCSI Device to Host Interrupts
+================================
+
+The following interrupt types are supported by the PVSCSI device:
+
+  1. Completion interrupts (completion ring notifications):
+     PVSCSI_INTR_CMPL_0
+     PVSCSI_INTR_CMPL_1
+  2. Message interrupts (message ring notifications):
+     PVSCSI_INTR_MSG_0
+     PVSCSI_INTR_MSG_1
+
+Interrupts are controlled via the PVSCSI_REG_OFFSET_INTR_MASK register.
+A set bit means the interrupt is enabled; a cleared bit means it is disabled.
+
+The supported interrupt modes are legacy, MSI and MSI-X.
+In case of legacy interrupts, the register PVSCSI_REG_OFFSET_INTR_STATUS
+is used to check which interrupt has arrived. Interrupts are
+acknowledged when the corresponding bit is written to the interrupt
+status register.
+
+PVSCSI Device Operation Sequences
+=================================
+
+1. Startup sequence:
+   a. Issue PVSCSI_CMD_ADAPTER_RESET command;
+   aa. Windows driver reads interrupt status register here;
+   b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
+      check status and disable device messages if error returned;
+      (Omitted if device messages disabled by driver configuration)
+   c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
+      as struct PVSCSICmdDescSetupRings;
+   d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
+      rings configuration as struct PVSCSICmdDescSetupMsgRing;
+   e. Unmask completion and message (if device messages enabled) interrupts.
+
+2. Shutdown sequences
+   a. Mask interrupts;
+   b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
+   c. Issue PVSCSI_CMD_ADAPTER_RESET command.
+
+3. Send request
+   a. Fill next free request ring descriptor;
+   b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
+      or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
+
+4. Abort command
+   a. Issue PVSCSI_CMD_ABORT_CMD command;
+
+5. Request completion processing
+   a. Upon completion interrupt arrival, process the completion
+      and message (if enabled) rings, as shown in the sketch below.
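+
+As a rough C illustration of step 5, the guest driver consumes completion
+descriptors lying between the consumer and producer indices kept in the
+shared rings-state area. The structure fields below follow the Linux PVSCSI
+driver headers this document is based on, but are reduced to a subset; the
+complete_request() helper is hypothetical:
+
+    #include <stdint.h>
+
+    /* Reduced subset of the shared structures (see the PVSCSI headers). */
+    struct PVSCSIRingsState {
+        uint32_t reqProdIdx, reqConsIdx, reqNumEntriesLog2;
+        uint32_t cmpProdIdx, cmpConsIdx, cmpNumEntriesLog2;
+        /* ... message ring indices omitted ... */
+    };
+
+    struct PVSCSIRingCmpDesc {
+        uint64_t context;   /* echoed back from the request descriptor */
+        uint64_t dataLen;
+        uint32_t senseLen, hostStatus, scsiStatus;
+    };
+
+    void complete_request(uint64_t ctx, uint32_t host, uint32_t scsi);
+
+    /* Drain the completion ring; indices grow monotonically and are
+     * reduced modulo the ring size on access. */
+    void process_cmp_ring(struct PVSCSIRingsState *s,
+                          struct PVSCSIRingCmpDesc *ring)
+    {
+        uint32_t mask = (1u << s->cmpNumEntriesLog2) - 1;
+
+        while (s->cmpConsIdx != s->cmpProdIdx) {
+            struct PVSCSIRingCmpDesc *d = &ring[s->cmpConsIdx & mask];
+            complete_request(d->context, d->hostStatus, d->scsiStatus);
+            s->cmpConsIdx++;
+        }
+    }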
diff --git a/docs/sphinx-static/custom.js b/docs/sphinx-static/custom.js
new file mode 100644
index 000000000..71a860530
--- /dev/null
+++ b/docs/sphinx-static/custom.js
@@ -0,0 +1,9 @@
+document.addEventListener('keydown', (event) => {
+    // find a better way to look it up?
+    let search_input = document.getElementsByName('q')[0];
+
+    if (event.code === 'KeyS' && document.activeElement !== search_input) {
+        event.preventDefault();
+        search_input.focus();
+    }
+});
diff --git a/docs/sphinx-static/theme_overrides.css b/docs/sphinx-static/theme_overrides.css
new file mode 100644
index 000000000..c70ef9512
--- /dev/null
+++ b/docs/sphinx-static/theme_overrides.css
@@ -0,0 +1,161 @@
+/* -*- coding: utf-8; mode: css -*-
+ *
+ * Sphinx HTML theme customization: Read the Docs
+ * Based on Linux Documentation/sphinx-static/theme_overrides.css
+ */
+
+/* Improve contrast and increase size for easier reading. */
+
+body {
+    font-family: serif;
+    color: black;
+    font-size: 100%;
+}
+
+h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend {
+    font-family: sans-serif;
+}
+
+.rst-content dl:not(.docutils) dt {
+    border-top: none;
+    border-left: solid 3px #ccc;
+    background-color: #f0f0f0;
+    color: black;
+}
+
+.wy-nav-top {
+    background: #802400;
+}
+
+.wy-side-nav-search input[type="text"] {
+    border-color: #f60;
+}
+
+.wy-menu-vertical p.caption {
+    color: white;
+}
+
+.wy-menu-vertical li.current a {
+    color: #505050;
+}
+
+.wy-menu-vertical li.on a, .wy-menu-vertical li.current > a {
+    color: #303030;
+}
+
+.fa-gitlab {
+    box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2), 0 3px 10px 0 rgba(0,0,0,0.19);
+    border-radius: 5px;
+}
+
+div[class^="highlight"] pre {
+    font-family: monospace;
+    color: black;
+    font-size: 100%;
+}
+
+.wy-menu-vertical {
+    font-family: sans-serif;
+}
+
+.c {
+    font-style: normal;
+}
+
+p {
+    font-size: 100%;
+}
+
+/* Interim: Code-blocks with line nos - lines and line numbers don't line up.
+ * see: https://github.com/rtfd/sphinx_rtd_theme/issues/419
+ */
+
+div[class^="highlight"] pre {
+    line-height: normal;
+}
+.rst-content .highlight > pre {
+    line-height: normal;
+}
+
+/* Keep fields from being strangely far apart due to inherited table CSS. */
+.rst-content table.field-list th.field-name {
+    padding-top: 1px;
+    padding-bottom: 1px;
+}
+.rst-content table.field-list td.field-body {
+    padding-top: 1px;
+    padding-bottom: 1px;
+}
+
+@media screen {
+
+    /* content column
+     *
+     * RTD theme's default is 800px as max width for the content, but we have
+     * tables with tons of columns, which need the full width of the view-port.
+     */
+
+    .wy-nav-content{max-width: none; }
+
+    /* table:
+     *
+     *   - Sequences of whitespace should collapse into a single whitespace.
+ * - make the overflow auto (scrollbar if needed) + * - align caption "left" ("center" is unsuitable on vast tables) + */ + + .wy-table-responsive table td { white-space: normal; } + .wy-table-responsive { overflow: auto; } + .rst-content table.docutils caption { text-align: left; font-size: 100%; } + + /* captions: + * + * - captions should have 100% (not 85%) font size + * - hide the permalink symbol as long as link is not hovered + */ + + .toc-title { + font-size: 150%; + font-weight: bold; + } + + caption, .wy-table caption, .rst-content table.field-list caption { + font-size: 100%; + } + caption a.headerlink { opacity: 0; } + caption a.headerlink:hover { opacity: 1; } + + /* Menu selection and keystrokes */ + + span.menuselection { + color: blue; + font-family: "Courier New", Courier, monospace + } + + code.kbd, code.kbd span { + color: white; + background-color: darkblue; + font-weight: bold; + font-family: "Courier New", Courier, monospace + } + + /* fix bottom margin of lists items */ + + .rst-content .section ul li:last-child, .rst-content .section ul li p:last-child { + margin-bottom: 12px; + } + + /* inline literal: drop the borderbox, padding and red color */ + + code, .rst-content tt, .rst-content code { + color: inherit; + border: none; + padding: unset; + background: inherit; + font-size: 85%; + } + + .rst-content tt.literal,.rst-content tt.literal,.rst-content code.literal { + color: inherit; + } +} diff --git a/docs/sphinx/depfile.py b/docs/sphinx/depfile.py new file mode 100644 index 000000000..afdcbcec6 --- /dev/null +++ b/docs/sphinx/depfile.py @@ -0,0 +1,66 @@ +# coding=utf-8 +# +# QEMU depfile generation extension +# +# Copyright (c) 2020 Red Hat, Inc. +# +# This work is licensed under the terms of the GNU GPLv2 or later. +# See the COPYING file in the top-level directory. + +"""depfile is a Sphinx extension that writes a dependency file for + an external build system""" + +import os +import sphinx +import sys +from pathlib import Path + +__version__ = '1.0' + +def get_infiles(env): + for x in env.found_docs: + yield env.doc2path(x) + yield from ((os.path.join(env.srcdir, dep) + for dep in env.dependencies[x])) + for mod in sys.modules.values(): + if hasattr(mod, '__file__'): + if mod.__file__: + yield mod.__file__ + # this is perhaps going to include unused files: + for static_path in env.config.html_static_path + env.config.templates_path: + for path in Path(static_path).rglob('*'): + yield str(path) + + +def write_depfile(app, exception): + if exception: + return + + env = app.env + if not env.config.depfile: + return + + # Using a directory as the output file does not work great because + # its timestamp does not necessarily change when the contents change. + # So create a timestamp file. 
+
+    if env.config.depfile_stamp:
+        with open(env.config.depfile_stamp, 'w') as f:
+            pass
+
+    with open(env.config.depfile, 'w') as f:
+        print((env.config.depfile_stamp or app.outdir) + ": \\", file=f)
+        print(*get_infiles(env), file=f)
+        for x in get_infiles(env):
+            print(x + ":", file=f)
+
+
+def setup(app):
+    app.add_config_value('depfile', None, 'env')
+    app.add_config_value('depfile_stamp', None, 'env')
+    app.connect('build-finished', write_depfile)
+
+    return dict(
+        version = __version__,
+        parallel_read_safe = True,
+        parallel_write_safe = True
+    )
diff --git a/docs/sphinx/hxtool.py b/docs/sphinx/hxtool.py
new file mode 100644
index 000000000..fb0649a3d
--- /dev/null
+++ b/docs/sphinx/hxtool.py
@@ -0,0 +1,192 @@
+# coding=utf-8
+#
+# QEMU hxtool .hx file parsing extension
+#
+# Copyright (c) 2020 Linaro
+#
+# This work is licensed under the terms of the GNU GPLv2 or later.
+# See the COPYING file in the top-level directory.
+"""hxtool is a Sphinx extension that implements the hxtool-doc directive"""
+
+# The purpose of this extension is to read fragments of rST
+# from .hx files, and insert them all into the current document.
+# The rST fragments are delimited by SRST/ERST lines.
+# The conf.py file must set the hxtool_srctree config value to
+# the root of the QEMU source tree.
+# Each hxtool-doc:: directive takes one argument which is the
+# path of the .hx file to process, relative to the source tree.
+
+import os
+import re
+from enum import Enum
+
+from docutils import nodes
+from docutils.statemachine import ViewList
+from docutils.parsers.rst import directives, Directive
+from sphinx.errors import ExtensionError
+from sphinx.util.nodes import nested_parse_with_titles
+import sphinx
+
+# Sphinx up to 1.6 uses AutodocReporter; 1.7 and later
+# use switch_source_input. Check borrowed from kerneldoc.py.
+Use_SSI = sphinx.__version__[:3] >= '1.7'
+if Use_SSI:
+    from sphinx.util.docutils import switch_source_input
+else:
+    from sphinx.ext.autodoc import AutodocReporter
+
+__version__ = '1.0'
+
+# We parse hx files with a state machine which may be in one of two
+# states: reading the C code fragment, or inside a rST fragment.
+class HxState(Enum):
+    CTEXT = 1
+    RST = 2
+
+def serror(file, lnum, errtext):
+    """Raise an exception giving a user-friendly syntax error message"""
+    raise ExtensionError('%s line %d: syntax error: %s' % (file, lnum, errtext))
+
+def parse_directive(line):
+    """Return first word of line, if any"""
+    return re.split(r'\W', line)[0]
+
+def parse_defheading(file, lnum, line):
+    """Handle a DEFHEADING directive"""
+    # The input should be "DEFHEADING(some string)", though note that
+    # the 'some string' could be the empty string. If the string is
+    # empty we ignore the directive -- these are used only to add
+    # blank lines in the plain-text content of the --help output.
+    #
+    # Return the heading text. We strip out any trailing ':' for
+    # consistency with other headings in the rST documentation.
+    match = re.match(r'DEFHEADING\((.*?):?\)', line)
+    if match is None:
+        serror(file, lnum, "Invalid DEFHEADING line")
+    return match.group(1)
+
+def parse_archheading(file, lnum, line):
+    """Handle an ARCHHEADING directive"""
+    # The input should be "ARCHHEADING(some string, other arg)",
+    # though note that the 'some string' could be the empty string.
+    # As with DEFHEADING, empty string ARCHHEADINGs will be ignored.
+    #
+    # Return the heading text. We strip out any trailing ':' for
+    # consistency with other headings in the rST documentation.
+    match = re.match(r'ARCHHEADING\((.*?):?,.*\)', line)
+    if match is None:
+        serror(file, lnum, "Invalid ARCHHEADING line")
+    return match.group(1)
+
+class HxtoolDocDirective(Directive):
+    """Extract rST fragments from the specified .hx file"""
+    required_arguments = 1
+    optional_arguments = 1
+    option_spec = {
+        'hxfile': directives.unchanged_required
+    }
+    has_content = False
+
+    def run(self):
+        env = self.state.document.settings.env
+        hxfile = env.config.hxtool_srctree + '/' + self.arguments[0]
+
+        # Tell sphinx of the dependency
+        env.note_dependency(os.path.abspath(hxfile))
+
+        state = HxState.CTEXT
+        # We build up lines of rST in this ViewList, which we will
+        # later put into a 'section' node.
+        rstlist = ViewList()
+        current_node = None
+        node_list = []
+
+        with open(hxfile) as f:
+            lines = (l.rstrip() for l in f)
+            for lnum, line in enumerate(lines, 1):
+                directive = parse_directive(line)
+
+                if directive == 'HXCOMM':
+                    pass
+                elif directive == 'SRST':
+                    if state == HxState.RST:
+                        serror(hxfile, lnum, 'expected ERST, found SRST')
+                    else:
+                        state = HxState.RST
+                elif directive == 'ERST':
+                    if state == HxState.CTEXT:
+                        serror(hxfile, lnum, 'expected SRST, found ERST')
+                    else:
+                        state = HxState.CTEXT
+                elif directive == 'DEFHEADING' or directive == 'ARCHHEADING':
+                    if directive == 'DEFHEADING':
+                        heading = parse_defheading(hxfile, lnum, line)
+                    else:
+                        heading = parse_archheading(hxfile, lnum, line)
+                    if heading == "":
+                        continue
+                    # Put the accumulated rST into the previous node,
+                    # and then start a fresh section with this heading.
+                    if len(rstlist) > 0:
+                        if current_node is None:
+                            # We had some rST fragments before the first
+                            # DEFHEADING. We don't have a section to put
+                            # these in, so rather than magicing up a section,
+                            # make it a syntax error.
+                            serror(hxfile, lnum,
+                                   'first DEFHEADING must precede all rST text')
+                        self.do_parse(rstlist, current_node)
+                        rstlist = ViewList()
+                    if current_node is not None:
+                        node_list.append(current_node)
+                    section_id = 'hxtool-%d' % env.new_serialno('hxtool')
+                    current_node = nodes.section(ids=[section_id])
+                    current_node += nodes.title(heading, heading)
+                else:
+                    # Not a directive: put in output if we are in rST fragment
+                    if state == HxState.RST:
+                        # Sphinx counts its lines from 0
+                        rstlist.append(line, hxfile, lnum - 1)
+
+        if current_node is None:
+            # We don't have multiple sections, so just parse the rst
+            # fragments into a dummy node so we can return the children.
+            current_node = nodes.section()
+            self.do_parse(rstlist, current_node)
+            return current_node.children
+        else:
+            # Put the remaining accumulated rST into the last section, and
+            # return all the sections.
+            if len(rstlist) > 0:
+                self.do_parse(rstlist, current_node)
+            node_list.append(current_node)
+            return node_list
+
+    # This is from kerneldoc.py -- it works around an API change in
+    # Sphinx between 1.6 and 1.7. Unlike kerneldoc.py, we use
+    # sphinx.util.nodes.nested_parse_with_titles() rather than the
+    # plain self.state.nested_parse(), and so we can drop the saving
+    # of title_styles and section_level that kerneldoc.py does,
+    # because nested_parse_with_titles() does that for us.
+ def do_parse(self, result, node): + if Use_SSI: + with switch_source_input(self.state, result): + nested_parse_with_titles(self.state, result, node) + else: + save = self.state.memo.reporter + self.state.memo.reporter = AutodocReporter(result, self.state.memo.reporter) + try: + nested_parse_with_titles(self.state, result, node) + finally: + self.state.memo.reporter = save + +def setup(app): + """ Register hxtool-doc directive with Sphinx""" + app.add_config_value('hxtool_srctree', None, 'env') + app.add_directive('hxtool-doc', HxtoolDocDirective) + + return dict( + version = __version__, + parallel_read_safe = True, + parallel_write_safe = True + ) diff --git a/docs/sphinx/kerneldoc.py b/docs/sphinx/kerneldoc.py new file mode 100644 index 000000000..bf4421501 --- /dev/null +++ b/docs/sphinx/kerneldoc.py @@ -0,0 +1,177 @@ +# coding=utf-8 +# +# Copyright © 2016 Intel Corporation +# +# Permission is hereby granted, free of charge, to any person obtaining a +# copy of this software and associated documentation files (the "Software"), +# to deal in the Software without restriction, including without limitation +# the rights to use, copy, modify, merge, publish, distribute, sublicense, +# and/or sell copies of the Software, and to permit persons to whom the +# Software is furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice (including the next +# paragraph) shall be included in all copies or substantial portions of the +# Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL +# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS +# IN THE SOFTWARE. +# +# Authors: +# Jani Nikula <jani.nikula@intel.com> +# +# Please make sure this works on both python2 and python3. 
+# + +import codecs +import os +import subprocess +import sys +import re +import glob + +from docutils import nodes, statemachine +from docutils.statemachine import ViewList +from docutils.parsers.rst import directives, Directive + +# +# AutodocReporter is only good up to Sphinx 1.7 +# +import sphinx + +Use_SSI = sphinx.__version__[:3] >= '1.7' +if Use_SSI: + from sphinx.util.docutils import switch_source_input +else: + from sphinx.ext.autodoc import AutodocReporter + +import kernellog + +__version__ = '1.0' + +class KernelDocDirective(Directive): + """Extract kernel-doc comments from the specified file""" + required_argument = 1 + optional_arguments = 4 + option_spec = { + 'doc': directives.unchanged_required, + 'functions': directives.unchanged, + 'export': directives.unchanged, + 'internal': directives.unchanged, + } + has_content = False + + def run(self): + env = self.state.document.settings.env + cmd = env.config.kerneldoc_bin + ['-rst', '-enable-lineno'] + + # Pass the version string to kernel-doc, as it needs to use a different + # dialect, depending what the C domain supports for each specific + # Sphinx versions + cmd += ['-sphinx-version', sphinx.__version__] + + filename = env.config.kerneldoc_srctree + '/' + self.arguments[0] + export_file_patterns = [] + + # Tell sphinx of the dependency + env.note_dependency(os.path.abspath(filename)) + + tab_width = self.options.get('tab-width', self.state.document.settings.tab_width) + + # FIXME: make this nicer and more robust against errors + if 'export' in self.options: + cmd += ['-export'] + export_file_patterns = str(self.options.get('export')).split() + elif 'internal' in self.options: + cmd += ['-internal'] + export_file_patterns = str(self.options.get('internal')).split() + elif 'doc' in self.options: + cmd += ['-function', str(self.options.get('doc'))] + elif 'functions' in self.options: + functions = self.options.get('functions').split() + if functions: + for f in functions: + cmd += ['-function', f] + else: + cmd += ['-no-doc-sections'] + + for pattern in export_file_patterns: + for f in glob.glob(env.config.kerneldoc_srctree + '/' + pattern): + env.note_dependency(os.path.abspath(f)) + cmd += ['-export-file', f] + + cmd += [filename] + + try: + kernellog.verbose(env.app, + 'calling kernel-doc \'%s\'' % (" ".join(cmd))) + + p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + out, err = p.communicate() + + out, err = codecs.decode(out, 'utf-8'), codecs.decode(err, 'utf-8') + + if p.returncode != 0: + sys.stderr.write(err) + + kernellog.warn(env.app, + 'kernel-doc \'%s\' failed with return code %d' % (" ".join(cmd), p.returncode)) + return [nodes.error(None, nodes.paragraph(text = "kernel-doc missing"))] + elif env.config.kerneldoc_verbosity > 0: + sys.stderr.write(err) + + lines = statemachine.string2lines(out, tab_width, convert_whitespace=True) + result = ViewList() + + lineoffset = 0; + line_regex = re.compile("^#define LINENO ([0-9]+)$") + for line in lines: + match = line_regex.search(line) + if match: + # sphinx counts lines from 0 + lineoffset = int(match.group(1)) - 1 + # we must eat our comments since the upset the markup + else: + result.append(line, filename, lineoffset) + lineoffset += 1 + + node = nodes.section() + self.do_parse(result, node) + + return node.children + + except Exception as e: # pylint: disable=W0703 + kernellog.warn(env.app, 'kernel-doc \'%s\' processing failed with: %s' % + (" ".join(cmd), str(e))) + return [nodes.error(None, nodes.paragraph(text = "kernel-doc missing"))] + 
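+    # Note: with -enable-lineno, kernel-doc interleaves marker lines of
+    # the form "#define LINENO <n>" into its rST output.  The loop in
+    # run() above consumes those markers rather than emitting them (they
+    # would upset the markup), and uses them to attribute the following
+    # lines back to their location in the C source file.
+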
+ def do_parse(self, result, node): + if Use_SSI: + with switch_source_input(self.state, result): + self.state.nested_parse(result, 0, node, match_titles=1) + else: + save = self.state.memo.title_styles, self.state.memo.section_level, self.state.memo.reporter + self.state.memo.reporter = AutodocReporter(result, self.state.memo.reporter) + self.state.memo.title_styles, self.state.memo.section_level = [], 0 + try: + self.state.nested_parse(result, 0, node, match_titles=1) + finally: + self.state.memo.title_styles, self.state.memo.section_level, self.state.memo.reporter = save + + +def setup(app): + app.add_config_value('kerneldoc_bin', None, 'env') + app.add_config_value('kerneldoc_srctree', None, 'env') + app.add_config_value('kerneldoc_verbosity', 1, 'env') + + app.add_directive('kernel-doc', KernelDocDirective) + + return dict( + version = __version__, + parallel_read_safe = True, + parallel_write_safe = True + ) diff --git a/docs/sphinx/kernellog.py b/docs/sphinx/kernellog.py new file mode 100644 index 000000000..af924f51a --- /dev/null +++ b/docs/sphinx/kernellog.py @@ -0,0 +1,28 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Sphinx has deprecated its older logging interface, but the replacement +# only goes back to 1.6. So here's a wrapper layer to keep around for +# as long as we support 1.4. +# +import sphinx + +if sphinx.__version__[:3] >= '1.6': + UseLogging = True + from sphinx.util import logging + logger = logging.getLogger('kerneldoc') +else: + UseLogging = False + +def warn(app, message): + if UseLogging: + logger.warning(message) + else: + app.warn(message) + +def verbose(app, message): + if UseLogging: + logger.verbose(message) + else: + app.verbose(message) + + diff --git a/docs/sphinx/qapidoc.py b/docs/sphinx/qapidoc.py new file mode 100644 index 000000000..d791b5949 --- /dev/null +++ b/docs/sphinx/qapidoc.py @@ -0,0 +1,554 @@ +# coding=utf-8 +# +# QEMU qapidoc QAPI file parsing extension +# +# Copyright (c) 2020 Linaro +# +# This work is licensed under the terms of the GNU GPLv2 or later. +# See the COPYING file in the top-level directory. + +""" +qapidoc is a Sphinx extension that implements the qapi-doc directive + +The purpose of this extension is to read the documentation comments +in QAPI schema files, and insert them all into the current document. + +It implements one new rST directive, "qapi-doc::". +Each qapi-doc:: directive takes one argument, which is the +pathname of the schema file to process, relative to the source tree. + +The docs/conf.py file must set the qapidoc_srctree config value to +the root of the QEMU source tree. + +The Sphinx documentation on writing extensions is at: +https://www.sphinx-doc.org/en/master/development/index.html +""" + +import os +import re + +from docutils import nodes +from docutils.statemachine import ViewList +from docutils.parsers.rst import directives, Directive +from sphinx.errors import ExtensionError +from sphinx.util.nodes import nested_parse_with_titles +import sphinx +from qapi.gen import QAPISchemaVisitor +from qapi.error import QAPIError, QAPISemError +from qapi.schema import QAPISchema + + +# Sphinx up to 1.6 uses AutodocReporter; 1.7 and later +# use switch_source_input. Check borrowed from kerneldoc.py. 
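+# (Note: this is a plain string comparison of the first three characters,
+# so it assumes single-digit version components; a hypothetical '1.10'
+# would compare as '1.1' and be misclassified.  That holds for the
+# versions this check needs to distinguish.)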
+Use_SSI = sphinx.__version__[:3] >= '1.7' +if Use_SSI: + from sphinx.util.docutils import switch_source_input +else: + from sphinx.ext.autodoc import AutodocReporter + + +__version__ = '1.0' + + +# Function borrowed from pydash, which is under the MIT license +def intersperse(iterable, separator): + """Yield the members of *iterable* interspersed with *separator*.""" + iterable = iter(iterable) + yield next(iterable) + for item in iterable: + yield separator + yield item + + +class QAPISchemaGenRSTVisitor(QAPISchemaVisitor): + """A QAPI schema visitor which generates docutils/Sphinx nodes + + This class builds up a tree of docutils/Sphinx nodes corresponding + to documentation for the various QAPI objects. To use it, first + create a QAPISchemaGenRSTVisitor object, and call its + visit_begin() method. Then you can call one of the two methods + 'freeform' (to add documentation for a freeform documentation + chunk) or 'symbol' (to add documentation for a QAPI symbol). These + will cause the visitor to build up the tree of document + nodes. Once you've added all the documentation via 'freeform' and + 'symbol' method calls, you can call 'get_document_nodes' to get + the final list of document nodes (in a form suitable for returning + from a Sphinx directive's 'run' method). + """ + def __init__(self, sphinx_directive): + self._cur_doc = None + self._sphinx_directive = sphinx_directive + self._top_node = nodes.section() + self._active_headings = [self._top_node] + + def _make_dlitem(self, term, defn): + """Return a dlitem node with the specified term and definition. + + term should be a list of Text and literal nodes. + defn should be one of: + - a string, which will be handed to _parse_text_into_node + - a list of Text and literal nodes, which will be put into + a paragraph node + """ + dlitem = nodes.definition_list_item() + dlterm = nodes.term('', '', *term) + dlitem += dlterm + if defn: + dldef = nodes.definition() + if isinstance(defn, list): + dldef += nodes.paragraph('', '', *defn) + else: + self._parse_text_into_node(defn, dldef) + dlitem += dldef + return dlitem + + def _make_section(self, title): + """Return a section node with optional title""" + section = nodes.section(ids=[self._sphinx_directive.new_serialno()]) + if title: + section += nodes.title(title, title) + return section + + def _nodes_for_ifcond(self, ifcond, with_if=True): + """Return list of Text, literal nodes for the ifcond + + Return a list which gives text like ' (If: condition)'. + If with_if is False, we don't return the "(If: " and ")". + """ + + doc = ifcond.docgen() + if not doc: + return [] + doc = nodes.literal('', doc) + if not with_if: + return [doc] + + nodelist = [nodes.Text(' ('), nodes.strong('', 'If: ')] + nodelist.append(doc) + nodelist.append(nodes.Text(')')) + return nodelist + + def _nodes_for_one_member(self, member): + """Return list of Text, literal nodes for this member + + Return a list of doctree nodes which give text like + 'name: type (optional) (If: ...)' suitable for use as the + 'term' part of a definition list item. 
+ """ + term = [nodes.literal('', member.name)] + if member.type.doc_type(): + term.append(nodes.Text(': ')) + term.append(nodes.literal('', member.type.doc_type())) + if member.optional: + term.append(nodes.Text(' (optional)')) + if member.ifcond.is_present(): + term.extend(self._nodes_for_ifcond(member.ifcond)) + return term + + def _nodes_for_variant_when(self, variants, variant): + """Return list of Text, literal nodes for variant 'when' clause + + Return a list of doctree nodes which give text like + 'when tagname is variant (If: ...)' suitable for use in + the 'variants' part of a definition list. + """ + term = [nodes.Text(' when '), + nodes.literal('', variants.tag_member.name), + nodes.Text(' is '), + nodes.literal('', '"%s"' % variant.name)] + if variant.ifcond.is_present(): + term.extend(self._nodes_for_ifcond(variant.ifcond)) + return term + + def _nodes_for_members(self, doc, what, base=None, variants=None): + """Return list of doctree nodes for the table of members""" + dlnode = nodes.definition_list() + for section in doc.args.values(): + term = self._nodes_for_one_member(section.member) + # TODO drop fallbacks when undocumented members are outlawed + if section.text: + defn = section.text + elif (variants and variants.tag_member == section.member + and not section.member.type.doc_type()): + values = section.member.type.member_names() + defn = [nodes.Text('One of ')] + defn.extend(intersperse([nodes.literal('', v) for v in values], + nodes.Text(', '))) + else: + defn = [nodes.Text('Not documented')] + + dlnode += self._make_dlitem(term, defn) + + if base: + dlnode += self._make_dlitem([nodes.Text('The members of '), + nodes.literal('', base.doc_type())], + None) + + if variants: + for v in variants.variants: + if v.type.is_implicit(): + assert not v.type.base and not v.type.variants + for m in v.type.local_members: + term = self._nodes_for_one_member(m) + term.extend(self._nodes_for_variant_when(variants, v)) + dlnode += self._make_dlitem(term, None) + else: + term = [nodes.Text('The members of '), + nodes.literal('', v.type.doc_type())] + term.extend(self._nodes_for_variant_when(variants, v)) + dlnode += self._make_dlitem(term, None) + + if not dlnode.children: + return [] + + section = self._make_section(what) + section += dlnode + return [section] + + def _nodes_for_enum_values(self, doc): + """Return list of doctree nodes for the table of enum values""" + seen_item = False + dlnode = nodes.definition_list() + for section in doc.args.values(): + termtext = [nodes.literal('', section.member.name)] + if section.member.ifcond.is_present(): + termtext.extend(self._nodes_for_ifcond(section.member.ifcond)) + # TODO drop fallbacks when undocumented members are outlawed + if section.text: + defn = section.text + else: + defn = [nodes.Text('Not documented')] + + dlnode += self._make_dlitem(termtext, defn) + seen_item = True + + if not seen_item: + return [] + + section = self._make_section('Values') + section += dlnode + return [section] + + def _nodes_for_arguments(self, doc, boxed_arg_type): + """Return list of doctree nodes for the arguments section""" + if boxed_arg_type: + assert not doc.args + section = self._make_section('Arguments') + dlnode = nodes.definition_list() + dlnode += self._make_dlitem( + [nodes.Text('The members of '), + nodes.literal('', boxed_arg_type.name)], + None) + section += dlnode + return [section] + + return self._nodes_for_members(doc, 'Arguments') + + def _nodes_for_features(self, doc): + """Return list of doctree nodes for the table of features""" 
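+        # Each feature renders as a definition-list item: the feature
+        # name as a literal term, its documentation text as the
+        # definition.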
+ seen_item = False + dlnode = nodes.definition_list() + for section in doc.features.values(): + dlnode += self._make_dlitem([nodes.literal('', section.name)], + section.text) + seen_item = True + + if not seen_item: + return [] + + section = self._make_section('Features') + section += dlnode + return [section] + + def _nodes_for_example(self, exampletext): + """Return list of doctree nodes for a code example snippet""" + return [nodes.literal_block(exampletext, exampletext)] + + def _nodes_for_sections(self, doc): + """Return list of doctree nodes for additional sections""" + nodelist = [] + for section in doc.sections: + snode = self._make_section(section.name) + if section.name and section.name.startswith('Example'): + snode += self._nodes_for_example(section.text) + else: + self._parse_text_into_node(section.text, snode) + nodelist.append(snode) + return nodelist + + def _nodes_for_if_section(self, ifcond): + """Return list of doctree nodes for the "If" section""" + nodelist = [] + if ifcond.is_present(): + snode = self._make_section('If') + snode += nodes.paragraph( + '', '', *self._nodes_for_ifcond(ifcond, with_if=False) + ) + nodelist.append(snode) + return nodelist + + def _add_doc(self, typ, sections): + """Add documentation for a command/object/enum... + + We assume we're documenting the thing defined in self._cur_doc. + typ is the type of thing being added ("Command", "Object", etc) + + sections is a list of nodes for sections to add to the definition. + """ + + doc = self._cur_doc + snode = nodes.section(ids=[self._sphinx_directive.new_serialno()]) + snode += nodes.title('', '', *[nodes.literal(doc.symbol, doc.symbol), + nodes.Text(' (' + typ + ')')]) + self._parse_text_into_node(doc.body.text, snode) + for s in sections: + if s is not None: + snode += s + self._add_node_to_current_heading(snode) + + def visit_enum_type(self, name, info, ifcond, features, members, prefix): + doc = self._cur_doc + self._add_doc('Enum', + self._nodes_for_enum_values(doc) + + self._nodes_for_features(doc) + + self._nodes_for_sections(doc) + + self._nodes_for_if_section(ifcond)) + + def visit_object_type(self, name, info, ifcond, features, + base, members, variants): + doc = self._cur_doc + if base and base.is_implicit(): + base = None + self._add_doc('Object', + self._nodes_for_members(doc, 'Members', base, variants) + + self._nodes_for_features(doc) + + self._nodes_for_sections(doc) + + self._nodes_for_if_section(ifcond)) + + def visit_alternate_type(self, name, info, ifcond, features, variants): + doc = self._cur_doc + self._add_doc('Alternate', + self._nodes_for_members(doc, 'Members') + + self._nodes_for_features(doc) + + self._nodes_for_sections(doc) + + self._nodes_for_if_section(ifcond)) + + def visit_command(self, name, info, ifcond, features, arg_type, + ret_type, gen, success_response, boxed, allow_oob, + allow_preconfig, coroutine): + doc = self._cur_doc + self._add_doc('Command', + self._nodes_for_arguments(doc, + arg_type if boxed else None) + + self._nodes_for_features(doc) + + self._nodes_for_sections(doc) + + self._nodes_for_if_section(ifcond)) + + def visit_event(self, name, info, ifcond, features, arg_type, boxed): + doc = self._cur_doc + self._add_doc('Event', + self._nodes_for_arguments(doc, + arg_type if boxed else None) + + self._nodes_for_features(doc) + + self._nodes_for_sections(doc) + + self._nodes_for_if_section(ifcond)) + + def symbol(self, doc, entity): + """Add documentation for one symbol to the document tree + + This is the main entry point which causes us to add 
documentation + nodes for a symbol (which could be a 'command', 'object', 'event', + etc). We do this by calling 'visit' on the schema entity, which + will then call back into one of our visit_* methods, depending + on what kind of thing this symbol is. + """ + self._cur_doc = doc + entity.visit(self) + self._cur_doc = None + + def _start_new_heading(self, heading, level): + """Start a new heading at the specified heading level + + Create a new section whose title is 'heading' and which is placed + in the docutils node tree as a child of the most recent level-1 + heading. Subsequent document sections (commands, freeform doc chunks, + etc) will be placed as children of this new heading section. + """ + if len(self._active_headings) < level: + raise QAPISemError(self._cur_doc.info, + 'Level %d subheading found outside a ' + 'level %d heading' + % (level, level - 1)) + snode = self._make_section(heading) + self._active_headings[level - 1] += snode + self._active_headings = self._active_headings[:level] + self._active_headings.append(snode) + + def _add_node_to_current_heading(self, node): + """Add the node to whatever the current active heading is""" + self._active_headings[-1] += node + + def freeform(self, doc): + """Add a piece of 'freeform' documentation to the document tree + + A 'freeform' document chunk doesn't relate to any particular + symbol (for instance, it could be an introduction). + + If the freeform document starts with a line of the form + '= Heading text', this is a section or subsection heading, with + the heading level indicated by the number of '=' signs. + """ + + # QAPIDoc documentation says free-form documentation blocks + # must have only a body section, nothing else. + assert not doc.sections + assert not doc.args + assert not doc.features + self._cur_doc = doc + + text = doc.body.text + if re.match(r'=+ ', text): + # Section/subsection heading (if present, will always be + # the first line of the block) + (heading, _, text) = text.partition('\n') + (leader, _, heading) = heading.partition(' ') + self._start_new_heading(heading, len(leader)) + if text == '': + return + + node = self._make_section(None) + self._parse_text_into_node(text, node) + self._add_node_to_current_heading(node) + self._cur_doc = None + + def _parse_text_into_node(self, doctext, node): + """Parse a chunk of QAPI-doc-format text into the node + + The doc comment can contain most inline rST markup, including + bulleted and enumerated lists. + As an extra permitted piece of markup, @var will be turned + into ``var``. + """ + + # Handle the "@var means ``var`` case + doctext = re.sub(r'@([\w-]+)', r'``\1``', doctext) + + rstlist = ViewList() + for line in doctext.splitlines(): + # The reported line number will always be that of the start line + # of the doc comment, rather than the actual location of the error. + # Being more precise would require overhaul of the QAPIDoc class + # to track lines more exactly within all the sub-parts of the doc + # comment, as well as counting lines here. + rstlist.append(line, self._cur_doc.info.fname, + self._cur_doc.info.line) + # Append a blank line -- in some cases rST syntax errors get + # attributed to the line after one with actual text, and if there + # isn't anything in the ViewList corresponding to that then Sphinx + # 1.6's AutodocReporter will then misidentify the source/line location + # in the error message (usually attributing it to the top-level + # .rst file rather than the offending .json file). 
The extra blank + # line won't affect the rendered output. + rstlist.append("", self._cur_doc.info.fname, self._cur_doc.info.line) + self._sphinx_directive.do_parse(rstlist, node) + + def get_document_nodes(self): + """Return the list of docutils nodes which make up the document""" + return self._top_node.children + + +class QAPISchemaGenDepVisitor(QAPISchemaVisitor): + """A QAPI schema visitor which adds Sphinx dependencies each module + + This class calls the Sphinx note_dependency() function to tell Sphinx + that the generated documentation output depends on the input + schema file associated with each module in the QAPI input. + """ + def __init__(self, env, qapidir): + self._env = env + self._qapidir = qapidir + + def visit_module(self, name): + if name != "./builtin": + qapifile = self._qapidir + '/' + name + self._env.note_dependency(os.path.abspath(qapifile)) + super().visit_module(name) + + +class QAPIDocDirective(Directive): + """Extract documentation from the specified QAPI .json file""" + required_argument = 1 + optional_arguments = 1 + option_spec = { + 'qapifile': directives.unchanged_required + } + has_content = False + + def new_serialno(self): + """Return a unique new ID string suitable for use as a node's ID""" + env = self.state.document.settings.env + return 'qapidoc-%d' % env.new_serialno('qapidoc') + + def run(self): + env = self.state.document.settings.env + qapifile = env.config.qapidoc_srctree + '/' + self.arguments[0] + qapidir = os.path.dirname(qapifile) + + try: + schema = QAPISchema(qapifile) + + # First tell Sphinx about all the schema files that the + # output documentation depends on (including 'qapifile' itself) + schema.visit(QAPISchemaGenDepVisitor(env, qapidir)) + + vis = QAPISchemaGenRSTVisitor(self) + vis.visit_begin(schema) + for doc in schema.docs: + if doc.symbol: + vis.symbol(doc, schema.lookup_entity(doc.symbol)) + else: + vis.freeform(doc) + return vis.get_document_nodes() + except QAPIError as err: + # Launder QAPI parse errors into Sphinx extension errors + # so they are displayed nicely to the user + raise ExtensionError(str(err)) + + def do_parse(self, rstlist, node): + """Parse rST source lines and add them to the specified node + + Take the list of rST source lines rstlist, parse them as + rST, and add the resulting docutils nodes as children of node. + The nodes are parsed in a way that allows them to include + subheadings (titles) without confusing the rendering of + anything else. + """ + # This is from kerneldoc.py -- it works around an API change in + # Sphinx between 1.6 and 1.7. Unlike kerneldoc.py, we use + # sphinx.util.nodes.nested_parse_with_titles() rather than the + # plain self.state.nested_parse(), and so we can drop the saving + # of title_styles and section_level that kerneldoc.py does, + # because nested_parse_with_titles() does that for us. 
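+        # (This method is also called by QAPISchemaGenRSTVisitor, via
+        # self._sphinx_directive.do_parse() in _parse_text_into_node()
+        # above, for each parsed doc-comment chunk.)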
+ if Use_SSI: + with switch_source_input(self.state, rstlist): + nested_parse_with_titles(self.state, rstlist, node) + else: + save = self.state.memo.reporter + self.state.memo.reporter = AutodocReporter( + rstlist, self.state.memo.reporter) + try: + nested_parse_with_titles(self.state, rstlist, node) + finally: + self.state.memo.reporter = save + + +def setup(app): + """ Register qapi-doc directive with Sphinx""" + app.add_config_value('qapidoc_srctree', None, 'env') + app.add_directive('qapi-doc', QAPIDocDirective) + + return dict( + version=__version__, + parallel_read_safe=True, + parallel_write_safe=True + ) diff --git a/docs/sphinx/qmp_lexer.py b/docs/sphinx/qmp_lexer.py new file mode 100644 index 000000000..f7e4c0e19 --- /dev/null +++ b/docs/sphinx/qmp_lexer.py @@ -0,0 +1,43 @@ +# QEMU Monitor Protocol Lexer Extension +# +# Copyright (C) 2019, Red Hat Inc. +# +# Authors: +# Eduardo Habkost <ehabkost@redhat.com> +# John Snow <jsnow@redhat.com> +# +# This work is licensed under the terms of the GNU GPLv2 or later. +# See the COPYING file in the top-level directory. +"""qmp_lexer is a Sphinx extension that provides a QMP lexer for code blocks.""" + +from pygments.lexer import RegexLexer, DelegatingLexer +from pygments.lexers.data import JsonLexer +from pygments import token +from sphinx import errors + +class QMPExampleMarkersLexer(RegexLexer): + """ + QMPExampleMarkersLexer lexes QMP example annotations. + This lexer adds support for directionality flow and elision indicators. + """ + tokens = { + 'root': [ + (r'-> ', token.Generic.Prompt), + (r'<- ', token.Generic.Prompt), + (r' ?\.{3} ?', token.Generic.Prompt), + ] + } + +class QMPExampleLexer(DelegatingLexer): + """QMPExampleLexer lexes annotated QMP examples.""" + def __init__(self, **options): + super(QMPExampleLexer, self).__init__(JsonLexer, QMPExampleMarkersLexer, + token.Error, **options) + +def setup(sphinx): + """For use by the Sphinx extensions API.""" + try: + sphinx.require_sphinx('2.1') + sphinx.add_lexer('QMP', QMPExampleLexer) + except errors.VersionRequirementError: + sphinx.add_lexer('QMP', QMPExampleLexer()) diff --git a/docs/spice-port-fqdn.txt b/docs/spice-port-fqdn.txt new file mode 100644 index 000000000..50778952e --- /dev/null +++ b/docs/spice-port-fqdn.txt @@ -0,0 +1,19 @@ +A Spice port channel is an arbitrary communication between the Spice +server host side and the client side. + +Thanks to the associated reverse fully qualified domain name (fqdn), +a Spice client can handle the various ports appropriately. + +The following fqdn names are reserved by the QEMU project: + +org.qemu.monitor.hmp.0 + QEMU human monitor + +org.qemu.monitor.qmp.0: + QEMU control monitor + +org.qemu.console.serial.0 + QEMU virtual serial port + +org.qemu.console.debug.0 + QEMU debug console diff --git a/docs/spin/aio_notify.promela b/docs/spin/aio_notify.promela new file mode 100644 index 000000000..fccc7ee1c --- /dev/null +++ b/docs/spin/aio_notify.promela @@ -0,0 +1,93 @@ +/* + * This model describes the interaction between ctx->notify_me + * and aio_notify(). + * + * Author: Paolo Bonzini <pbonzini@redhat.com> + * + * This file is in the public domain. If you really want a license, + * the WTFPL will do. 
+ * + * To simulate it: + * spin -p docs/aio_notify.promela + * + * To verify it: + * spin -a docs/aio_notify.promela + * gcc -O2 pan.c + * ./a.out -a + * + * To verify it (with a bug planted in the model): + * spin -a -DBUG docs/aio_notify.promela + * gcc -O2 pan.c + * ./a.out -a + */ + +#define MAX 4 +#define LAST (1 << (MAX - 1)) +#define FINAL ((LAST << 1) - 1) + +bool notify_me; +bool event; + +int req; +int done; + +active proctype waiter() +{ + int fetch; + + do + :: true -> { + notify_me++; + + if +#ifndef BUG + :: (req > 0) -> skip; +#endif + :: else -> + // Wait for a nudge from the other side + do + :: event == 1 -> { event = 0; break; } + od; + fi; + + notify_me--; + + atomic { fetch = req; req = 0; } + done = done | fetch; + } + od +} + +active proctype notifier() +{ + int next = 1; + + do + :: next <= LAST -> { + // generate a request + req = req | next; + next = next << 1; + + // aio_notify + if + :: notify_me == 1 -> event = 1; + :: else -> printf("Skipped event_notifier_set\n"); skip; + fi; + + // Test both synchronous and asynchronous delivery + if + :: 1 -> do + :: req == 0 -> break; + od; + :: 1 -> skip; + fi; + } + od; +} + +never { /* [] done < FINAL */ +accept_init: + do + :: done < FINAL -> skip; + od; +} diff --git a/docs/spin/aio_notify_accept.promela b/docs/spin/aio_notify_accept.promela new file mode 100644 index 000000000..9cef2c955 --- /dev/null +++ b/docs/spin/aio_notify_accept.promela @@ -0,0 +1,152 @@ +/* + * This model describes the interaction between ctx->notified + * and ctx->notifier. + * + * Author: Paolo Bonzini <pbonzini@redhat.com> + * + * This file is in the public domain. If you really want a license, + * the WTFPL will do. + * + * To verify the buggy version: + * spin -a -DBUG1 docs/aio_notify_bug.promela + * gcc -O2 pan.c + * ./a.out -a -f + * (or -DBUG2) + * + * To verify the fixed version: + * spin -a docs/aio_notify_bug.promela + * gcc -O2 pan.c + * ./a.out -a -f + * + * Add -DCHECK_REQ to test an alternative invariant and the + * "notify_me" optimization. + */ + +int notify_me; +bool notified; +bool event; +bool req; +bool notifier_done; + +#ifdef CHECK_REQ +#define USE_NOTIFY_ME 1 +#else +#define USE_NOTIFY_ME 0 +#endif + +#ifdef BUG +#error Please define BUG1 or BUG2 instead. +#endif + +active proctype notifier() +{ + do + :: true -> { + req = 1; + if + :: !USE_NOTIFY_ME || notify_me -> +#if defined BUG1 + /* CHECK_REQ does not detect this bug! */ + notified = 1; + event = 1; +#elif defined BUG2 + if + :: !notified -> event = 1; + :: else -> skip; + fi; + notified = 1; +#else + event = 1; + notified = 1; +#endif + :: else -> skip; + fi + } + :: true -> break; + od; + notifier_done = 1; +} + +#define AIO_POLL \ + notify_me++; \ + if \ + :: !req -> { \ + if \ + :: event -> skip; \ + fi; \ + } \ + :: else -> skip; \ + fi; \ + notify_me--; \ + \ + atomic { old = notified; notified = 0; } \ + if \ + :: old -> event = 0; \ + :: else -> skip; \ + fi; \ + \ + req = 0; + +active proctype waiter() +{ + bool old; + + do + :: true -> AIO_POLL; + od; +} + +/* Same as waiter(), but disappears after a while. */ +active proctype temporary_waiter() +{ + bool old; + + do + :: true -> AIO_POLL; + :: true -> break; + od; +} + +#ifdef CHECK_REQ +never { + do + :: req -> goto accept_if_req_not_eventually_false; + :: true -> skip; + od; + +accept_if_req_not_eventually_false: + if + :: req -> goto accept_if_req_not_eventually_false; + fi; + assert(0); +} + +#else +/* There must be infinitely many transitions of event as long + * as the notifier does not exit. 
+ * + * If event stayed always true, the waiters would be busy looping. + * If event stayed always false, the waiters would be sleeping + * forever. + */ +never { + do + :: !event -> goto accept_if_event_not_eventually_true; + :: event -> goto accept_if_event_not_eventually_false; + :: true -> skip; + od; + +accept_if_event_not_eventually_true: + if + :: !event && notifier_done -> do :: true -> skip; od; + :: !event && !notifier_done -> goto accept_if_event_not_eventually_true; + fi; + assert(0); + +accept_if_event_not_eventually_false: + if + :: event -> goto accept_if_event_not_eventually_false; + fi; + assert(0); +} +#endif diff --git a/docs/spin/aio_notify_bug.promela b/docs/spin/aio_notify_bug.promela new file mode 100644 index 000000000..b3bfca1ca --- /dev/null +++ b/docs/spin/aio_notify_bug.promela @@ -0,0 +1,140 @@ +/* + * This model describes a bug in aio_notify. If ctx->notifier is + * cleared too late, a wakeup could be lost. + * + * Author: Paolo Bonzini <pbonzini@redhat.com> + * + * This file is in the public domain. If you really want a license, + * the WTFPL will do. + * + * To verify the buggy version: + * spin -a -DBUG docs/aio_notify_bug.promela + * gcc -O2 pan.c + * ./a.out -a -f + * + * To verify the fixed version: + * spin -a docs/aio_notify_bug.promela + * gcc -O2 pan.c + * ./a.out -a -f + * + * Add -DCHECK_REQ to test an alternative invariant and the + * "notify_me" optimization. + */ + +int notify_me; +bool event; +bool req; +bool notifier_done; + +#ifdef CHECK_REQ +#define USE_NOTIFY_ME 1 +#else +#define USE_NOTIFY_ME 0 +#endif + +active proctype notifier() +{ + do + :: true -> { + req = 1; + if + :: !USE_NOTIFY_ME || notify_me -> event = 1; + :: else -> skip; + fi + } + :: true -> break; + od; + notifier_done = 1; +} + +#ifdef BUG +#define AIO_POLL \ + notify_me++; \ + if \ + :: !req -> { \ + if \ + :: event -> skip; \ + fi; \ + } \ + :: else -> skip; \ + fi; \ + notify_me--; \ + \ + req = 0; \ + event = 0; +#else +#define AIO_POLL \ + notify_me++; \ + if \ + :: !req -> { \ + if \ + :: event -> skip; \ + fi; \ + } \ + :: else -> skip; \ + fi; \ + notify_me--; \ + \ + event = 0; \ + req = 0; +#endif + +active proctype waiter() +{ + do + :: true -> AIO_POLL; + od; +} + +/* Same as waiter(), but disappears after a while. */ +active proctype temporary_waiter() +{ + do + :: true -> AIO_POLL; + :: true -> break; + od; +} + +#ifdef CHECK_REQ +never { + do + :: req -> goto accept_if_req_not_eventually_false; + :: true -> skip; + od; + +accept_if_req_not_eventually_false: + if + :: req -> goto accept_if_req_not_eventually_false; + fi; + assert(0); +} + +#else +/* There must be infinitely many transitions of event as long + * as the notifier does not exit. + * + * If event stayed always true, the waiters would be busy looping. + * If event stayed always false, the waiters would be sleeping + * forever. 
+ */ +never { + do + :: !event -> goto accept_if_event_not_eventually_true; + :: event -> goto accept_if_event_not_eventually_false; + :: true -> skip; + od; + +accept_if_event_not_eventually_true: + if + :: !event && notifier_done -> do :: true -> skip; od; + :: !event && !notifier_done -> goto accept_if_event_not_eventually_true; + fi; + assert(0); + +accept_if_event_not_eventually_false: + if + :: event -> goto accept_if_event_not_eventually_false; + fi; + assert(0); +} +#endif diff --git a/docs/spin/tcg-exclusive.promela b/docs/spin/tcg-exclusive.promela new file mode 100644 index 000000000..c91cfca9f --- /dev/null +++ b/docs/spin/tcg-exclusive.promela @@ -0,0 +1,225 @@ +/* + * This model describes the implementation of exclusive sections in + * cpus-common.c (start_exclusive, end_exclusive, cpu_exec_start, + * cpu_exec_end). + * + * Author: Paolo Bonzini <pbonzini@redhat.com> + * + * This file is in the public domain. If you really want a license, + * the WTFPL will do. + * + * To verify it: + * spin -a docs/tcg-exclusive.promela + * gcc pan.c -O2 + * ./a.out -a + * + * Tunable processor macros: N_CPUS, N_EXCLUSIVE, N_CYCLES, USE_MUTEX, + * TEST_EXPENSIVE. + */ + +// Define the missing parameters for the model +#ifndef N_CPUS +#define N_CPUS 2 +#warning defaulting to 2 CPU processes +#endif + +// the expensive test is not so expensive for <= 2 CPUs +// If the mutex is used, it's also cheap (300 MB / 4 seconds) for 3 CPUs +// For 3 CPUs and the lock-free option it needs 1.5 GB of RAM +#if N_CPUS <= 2 || (N_CPUS <= 3 && defined USE_MUTEX) +#define TEST_EXPENSIVE +#endif + +#ifndef N_EXCLUSIVE +# if !defined N_CYCLES || N_CYCLES <= 1 || defined TEST_EXPENSIVE +# define N_EXCLUSIVE 2 +# warning defaulting to 2 concurrent exclusive sections +# else +# define N_EXCLUSIVE 1 +# warning defaulting to 1 concurrent exclusive sections +# endif +#endif +#ifndef N_CYCLES +# if N_EXCLUSIVE <= 1 || defined TEST_EXPENSIVE +# define N_CYCLES 2 +# warning defaulting to 2 CPU cycles +# else +# define N_CYCLES 1 +# warning defaulting to 1 CPU cycles +# endif +#endif + + +// synchronization primitives. condition variables require a +// process-local "cond_t saved;" variable. 
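+//
+// A condition variable is modelled as a counter: COND_BROADCAST bumps
+// it, and COND_WAIT releases the mutex, waits until the counter moves
+// away from the value saved beforehand, then re-takes the mutex.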
+ +#define mutex_t byte +#define MUTEX_LOCK(m) atomic { m == 0 -> m = 1 } +#define MUTEX_UNLOCK(m) m = 0 + +#define cond_t int +#define COND_WAIT(c, m) { \ + saved = c; \ + MUTEX_UNLOCK(m); \ + c != saved -> MUTEX_LOCK(m); \ + } +#define COND_BROADCAST(c) c++ + +// this is the logic from cpus-common.c + +mutex_t mutex; +cond_t exclusive_cond; +cond_t exclusive_resume; +byte pending_cpus; + +byte running[N_CPUS]; +byte has_waiter[N_CPUS]; + +#define exclusive_idle() \ + do \ + :: pending_cpus -> COND_WAIT(exclusive_resume, mutex); \ + :: else -> break; \ + od + +#define start_exclusive() \ + MUTEX_LOCK(mutex); \ + exclusive_idle(); \ + pending_cpus = 1; \ + \ + i = 0; \ + do \ + :: i < N_CPUS -> { \ + if \ + :: running[i] -> has_waiter[i] = 1; pending_cpus++; \ + :: else -> skip; \ + fi; \ + i++; \ + } \ + :: else -> break; \ + od; \ + \ + do \ + :: pending_cpus > 1 -> COND_WAIT(exclusive_cond, mutex); \ + :: else -> break; \ + od; \ + MUTEX_UNLOCK(mutex); + +#define end_exclusive() \ + MUTEX_LOCK(mutex); \ + pending_cpus = 0; \ + COND_BROADCAST(exclusive_resume); \ + MUTEX_UNLOCK(mutex); + +#ifdef USE_MUTEX +// Simple version using mutexes +#define cpu_exec_start(id) \ + MUTEX_LOCK(mutex); \ + exclusive_idle(); \ + running[id] = 1; \ + MUTEX_UNLOCK(mutex); + +#define cpu_exec_end(id) \ + MUTEX_LOCK(mutex); \ + running[id] = 0; \ + if \ + :: pending_cpus -> { \ + pending_cpus--; \ + if \ + :: pending_cpus == 1 -> COND_BROADCAST(exclusive_cond); \ + :: else -> skip; \ + fi; \ + } \ + :: else -> skip; \ + fi; \ + MUTEX_UNLOCK(mutex); +#else +// Wait-free fast path, only needs mutex when concurrent with +// an exclusive section +#define cpu_exec_start(id) \ + running[id] = 1; \ + if \ + :: pending_cpus -> { \ + MUTEX_LOCK(mutex); \ + if \ + :: !has_waiter[id] -> { \ + running[id] = 0; \ + exclusive_idle(); \ + running[id] = 1; \ + } \ + :: else -> skip; \ + fi; \ + MUTEX_UNLOCK(mutex); \ + } \ + :: else -> skip; \ + fi; + +#define cpu_exec_end(id) \ + running[id] = 0; \ + if \ + :: pending_cpus -> { \ + MUTEX_LOCK(mutex); \ + if \ + :: has_waiter[id] -> { \ + has_waiter[id] = 0; \ + pending_cpus--; \ + if \ + :: pending_cpus == 1 -> COND_BROADCAST(exclusive_cond); \ + :: else -> skip; \ + fi; \ + } \ + :: else -> skip; \ + fi; \ + MUTEX_UNLOCK(mutex); \ + } \ + :: else -> skip; \ + fi +#endif + +// Promela processes + +byte done_cpu; +byte in_cpu; +active[N_CPUS] proctype cpu() +{ + byte id = _pid % N_CPUS; + byte cycles = 0; + cond_t saved; + + do + :: cycles == N_CYCLES -> break; + :: else -> { + cycles++; + cpu_exec_start(id) + in_cpu++; + done_cpu++; + in_cpu--; + cpu_exec_end(id) + } + od; +} + +byte done_exclusive; +byte in_exclusive; +active[N_EXCLUSIVE] proctype exclusive() +{ + cond_t saved; + byte i; + + start_exclusive(); + in_exclusive = 1; + done_exclusive++; + in_exclusive = 0; + end_exclusive(); +} + +#define LIVENESS (done_cpu == N_CPUS * N_CYCLES && done_exclusive == N_EXCLUSIVE) +#define SAFETY !(in_exclusive && in_cpu) + +never { /* ! ([] SAFETY && <> [] LIVENESS) */ + do + // once the liveness property is satisfied, this is not executable + // and the never clause is not accepted + :: ! LIVENESS -> accept_liveness: skip + :: 1 -> assert(SAFETY) + od; +} diff --git a/docs/spin/win32-qemu-event.promela b/docs/spin/win32-qemu-event.promela new file mode 100644 index 000000000..c446a7155 --- /dev/null +++ b/docs/spin/win32-qemu-event.promela @@ -0,0 +1,98 @@ +/* + * This model describes the implementation of QemuEvent in + * util/qemu-thread-win32.c. 
+ * + * Author: Paolo Bonzini <pbonzini@redhat.com> + * + * This file is in the public domain. If you really want a license, + * the WTFPL will do. + * + * To verify it: + * spin -a docs/event.promela + * gcc -O2 pan.c -DSAFETY + * ./a.out + */ + +bool event; +int value; + +/* Primitives for a Win32 event */ +#define RAW_RESET event = false +#define RAW_SET event = true +#define RAW_WAIT do :: event -> break; od + +#if 0 +/* Basic sanity checking: test the Win32 event primitives */ +#define RESET RAW_RESET +#define SET RAW_SET +#define WAIT RAW_WAIT +#else +/* Full model: layer a userspace-only fast path on top of the RAW_* + * primitives. SET/RESET/WAIT have exactly the same semantics as + * RAW_SET/RAW_RESET/RAW_WAIT, but try to avoid invoking them. + */ +#define EV_SET 0 +#define EV_FREE 1 +#define EV_BUSY -1 + +int state = EV_FREE; + +int xchg_result; +#define SET if :: state != EV_SET -> \ + atomic { /* xchg_result=xchg(state, EV_SET) */ \ + xchg_result = state; \ + state = EV_SET; \ + } \ + if :: xchg_result == EV_BUSY -> RAW_SET; \ + :: else -> skip; \ + fi; \ + :: else -> skip; \ + fi + +#define RESET if :: state == EV_SET -> atomic { state = state | EV_FREE; } \ + :: else -> skip; \ + fi + +int tmp1, tmp2; +#define WAIT tmp1 = state; \ + if :: tmp1 != EV_SET -> \ + if :: tmp1 == EV_FREE -> \ + RAW_RESET; \ + atomic { /* tmp2=cas(state, EV_FREE, EV_BUSY) */ \ + tmp2 = state; \ + if :: tmp2 == EV_FREE -> state = EV_BUSY; \ + :: else -> skip; \ + fi; \ + } \ + if :: tmp2 == EV_SET -> tmp1 = EV_SET; \ + :: else -> tmp1 = EV_BUSY; \ + fi; \ + :: else -> skip; \ + fi; \ + assert(tmp1 != EV_FREE); \ + if :: tmp1 == EV_BUSY -> RAW_WAIT; \ + :: else -> skip; \ + fi; \ + :: else -> skip; \ + fi +#endif + +active proctype waiter() +{ + if + :: !value -> + RESET; + if + :: !value -> WAIT; + :: else -> skip; + fi; + :: else -> skip; + fi; + assert(value); +} + +active proctype notifier() +{ + value = true; + SET; +} diff --git a/docs/system/arm/aspeed.rst b/docs/system/arm/aspeed.rst new file mode 100644 index 000000000..cec87e374 --- /dev/null +++ b/docs/system/arm/aspeed.rst @@ -0,0 +1,109 @@ +Aspeed family boards (``*-bmc``, ``ast2500-evb``, ``ast2600-evb``) +================================================================== + +The QEMU Aspeed machines model BMCs of various OpenPOWER systems and +Aspeed evaluation boards. They are based on different releases of the +Aspeed SoC : the AST2400 integrating an ARM926EJ-S CPU (400MHz), the +AST2500 with an ARM1176JZS CPU (800MHz) and more recently the AST2600 +with dual cores ARM Cortex-A7 CPUs (1.2GHz). + +The SoC comes with RAM, Gigabit ethernet, USB, SD/MMC, USB, SPI, I2C, +etc. 
+ +AST2400 SoC based machines : + +- ``palmetto-bmc`` OpenPOWER Palmetto POWER8 BMC +- ``quanta-q71l-bmc`` OpenBMC Quanta BMC + +AST2500 SoC based machines : + +- ``ast2500-evb`` Aspeed AST2500 Evaluation board +- ``romulus-bmc`` OpenPOWER Romulus POWER9 BMC +- ``witherspoon-bmc`` OpenPOWER Witherspoon POWER9 BMC +- ``sonorapass-bmc`` OCP SonoraPass BMC +- ``swift-bmc`` OpenPOWER Swift BMC POWER9 + +AST2600 SoC based machines : + +- ``ast2600-evb`` Aspeed AST2600 Evaluation board (Cortex-A7) +- ``tacoma-bmc`` OpenPOWER Witherspoon POWER9 AST2600 BMC + +Supported devices +----------------- + + * SMP (for the AST2600 Cortex-A7) + * Interrupt Controller (VIC) + * Timer Controller + * RTC Controller + * I2C Controller + * System Control Unit (SCU) + * SRAM mapping + * X-DMA Controller (basic interface) + * Static Memory Controller (SMC or FMC) - Only SPI Flash support + * SPI Memory Controller + * USB 2.0 Controller + * SD/MMC storage controllers + * SDRAM controller (dummy interface for basic settings and training) + * Watchdog Controller + * GPIO Controller (Master only) + * UART + * Ethernet controllers + * Front LEDs (PCA9552 on I2C bus) + * LPC Peripheral Controller (a subset of subdevices are supported) + * Hash/Crypto Engine (HACE) - Hash support only. TODO: HMAC and RSA + + +Missing devices +--------------- + + * Coprocessor support + * ADC (out of tree implementation) + * PWM and Fan Controller + * Slave GPIO Controller + * Super I/O Controller + * PCI-Express 1 Controller + * Graphic Display Controller + * PECI Controller + * MCTP Controller + * Mailbox Controller + * Virtual UART + * eSPI Controller + * I3C Controller + +Boot options +------------ + +The Aspeed machines can be started using the ``-kernel`` option to +load a Linux kernel or from a firmware. Images can be downloaded from +the OpenBMC jenkins : + + https://jenkins.openbmc.org/job/ci-openbmc/lastSuccessfulBuild/distro=ubuntu,label=docker-builder + +or directly from the OpenBMC GitHub release repository : + + https://github.com/openbmc/openbmc/releases + +The image should be attached as an MTD drive. Run : + +.. code-block:: bash + + $ qemu-system-arm -M romulus-bmc -nic user \ + -drive file=obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic + +Options specific to Aspeed machines are : + + * ``execute-in-place`` which emulates the boot from the CE0 flash + device by using the FMC controller to load the instructions, and + not simply from RAM. This takes a little longer. + + * ``fmc-model`` to change the FMC Flash model. FW needs support for + the chip model to boot. + + * ``spi-model`` to change the SPI Flash model. + +For instance, to start the ``ast2500-evb`` machine with a different +FMC chip and a bigger (64M) SPI chip, use : + +.. code-block:: bash + + -M ast2500-evb,fmc-model=mx25l25635e,spi-model=mx66u51235f diff --git a/docs/system/arm/collie.rst b/docs/system/arm/collie.rst new file mode 100644 index 000000000..5cc67b6d1 --- /dev/null +++ b/docs/system/arm/collie.rst @@ -0,0 +1,16 @@ +Sharp Zaurus SL-5500 (``collie``) +================================= + +This machine is a model of the Sharp Zaurus SL-5500, which was +a 1990s PDA based on the StrongARM SA1110. 
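+
+A minimal sketch of a command line for this machine (the kernel image
+name and console argument here are illustrative placeholders, not
+files or settings shipped with QEMU):
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M collie -kernel zImage \
+      -append "console=ttySA0" -nographic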
+ +Implemented devices: + + * NOR flash + * Interrupt controller + * Timer + * RTC + * GPIO + * Peripheral Pin Controller (PPC) + * UARTs + * Synchronous Serial Ports (SSP) diff --git a/docs/system/arm/cpu-features.rst b/docs/system/arm/cpu-features.rst new file mode 100644 index 000000000..584eb1709 --- /dev/null +++ b/docs/system/arm/cpu-features.rst @@ -0,0 +1,393 @@ +Arm CPU Features +================ + +CPU features are optional features that a CPU of supporting type may +choose to implement or not. In QEMU, optional CPU features have +corresponding boolean CPU proprieties that, when enabled, indicate +that the feature is implemented, and, conversely, when disabled, +indicate that it is not implemented. An example of an Arm CPU feature +is the Performance Monitoring Unit (PMU). CPU types such as the +Cortex-A15 and the Cortex-A57, which respectively implement Arm +architecture reference manuals ARMv7-A and ARMv8-A, may both optionally +implement PMUs. For example, if a user wants to use a Cortex-A15 without +a PMU, then the ``-cpu`` parameter should contain ``pmu=off`` on the QEMU +command line, i.e. ``-cpu cortex-a15,pmu=off``. + +As not all CPU types support all optional CPU features, then whether or +not a CPU property exists depends on the CPU type. For example, CPUs +that implement the ARMv8-A architecture reference manual may optionally +support the AArch32 CPU feature, which may be enabled by disabling the +``aarch64`` CPU property. A CPU type such as the Cortex-A15, which does +not implement ARMv8-A, will not have the ``aarch64`` CPU property. + +QEMU's support may be limited for some CPU features, only partially +supporting the feature or only supporting the feature under certain +configurations. For example, the ``aarch64`` CPU feature, which, when +disabled, enables the optional AArch32 CPU feature, is only supported +when using the KVM accelerator and when running on a host CPU type that +supports the feature. While ``aarch64`` currently only works with KVM, +it could work with TCG. CPU features that are specific to KVM are +prefixed with "kvm-" and are described in "KVM VCPU Features". + +CPU Feature Probing +=================== + +Determining which CPU features are available and functional for a given +CPU type is possible with the ``query-cpu-model-expansion`` QMP command. +Below are some examples where ``scripts/qmp/qmp-shell`` (see the top comment +block in the script for usage) is used to issue the QMP commands. + +1. Determine which CPU features are available for the ``max`` CPU type + (Note, we started QEMU with qemu-system-aarch64, so ``max`` is + implementing the ARMv8-A reference manual in this case):: + + (QEMU) query-cpu-model-expansion type=full model={"name":"max"} + { "return": { + "model": { "name": "max", "props": { + "sve1664": true, "pmu": true, "sve1792": true, "sve1920": true, + "sve128": true, "aarch64": true, "sve1024": true, "sve": true, + "sve640": true, "sve768": true, "sve1408": true, "sve256": true, + "sve1152": true, "sve512": true, "sve384": true, "sve1536": true, + "sve896": true, "sve1280": true, "sve2048": true + }}}} + +We see that the ``max`` CPU type has the ``pmu``, ``aarch64``, ``sve``, and many +``sve<N>`` CPU features. We also see that all the CPU features are +enabled, as they are all ``true``. (The ``sve<N>`` CPU features are all +optional SVE vector lengths (see "SVE CPU Properties"). 
While with TCG +all SVE vector lengths can be supported, when KVM is in use it's more +likely that only a few lengths will be supported, if SVE is supported at +all.) + +(2) Let's try to disable the PMU:: + + (QEMU) query-cpu-model-expansion type=full model={"name":"max","props":{"pmu":false}} + { "return": { + "model": { "name": "max", "props": { + "sve1664": true, "pmu": false, "sve1792": true, "sve1920": true, + "sve128": true, "aarch64": true, "sve1024": true, "sve": true, + "sve640": true, "sve768": true, "sve1408": true, "sve256": true, + "sve1152": true, "sve512": true, "sve384": true, "sve1536": true, + "sve896": true, "sve1280": true, "sve2048": true + }}}} + +We see it worked, as ``pmu`` is now ``false``. + +(3) Let's try to disable ``aarch64``, which enables the AArch32 CPU feature:: + + (QEMU) query-cpu-model-expansion type=full model={"name":"max","props":{"aarch64":false}} + {"error": { + "class": "GenericError", "desc": + "'aarch64' feature cannot be disabled unless KVM is enabled and 32-bit EL1 is supported" + }} + +It looks like this feature is limited to a configuration we do not +currently have. + +(4) Let's disable ``sve`` and see what happens to all the optional SVE + vector lengths:: + + (QEMU) query-cpu-model-expansion type=full model={"name":"max","props":{"sve":false}} + { "return": { + "model": { "name": "max", "props": { + "sve1664": false, "pmu": true, "sve1792": false, "sve1920": false, + "sve128": false, "aarch64": true, "sve1024": false, "sve": false, + "sve640": false, "sve768": false, "sve1408": false, "sve256": false, + "sve1152": false, "sve512": false, "sve384": false, "sve1536": false, + "sve896": false, "sve1280": false, "sve2048": false + }}}} + +As expected they are now all ``false``. + +(5) Let's try probing CPU features for the Cortex-A15 CPU type:: + + (QEMU) query-cpu-model-expansion type=full model={"name":"cortex-a15"} + {"return": {"model": {"name": "cortex-a15", "props": {"pmu": true}}}} + +Only the ``pmu`` CPU feature is available. + +A note about CPU feature dependencies +------------------------------------- + +It's possible for features to have dependencies on other features. I.e. +it may be possible to change one feature at a time without error, but +when attempting to change all features at once an error could occur +depending on the order they are processed. It's also possible changing +all at once doesn't generate an error, because a feature's dependencies +are satisfied with other features, but the same feature cannot be changed +independently without error. For these reasons callers should always +attempt to make their desired changes all at once in order to ensure the +collection is valid. + +A note about CPU models and KVM +------------------------------- + +Named CPU models generally do not work with KVM. There are a few cases +that do work, e.g. using the named CPU model ``cortex-a57`` with KVM on a +seattle host, but mostly if KVM is enabled the ``host`` CPU type must be +used. This means the guest is provided all the same CPU features as the +host CPU type has. And, for this reason, the ``host`` CPU type should +enable all CPU features that the host has by default. Indeed it's even +a bit strange to allow disabling CPU features that the host has when using +the ``host`` CPU type, but in the absence of CPU models it's the best we can +do if we want to launch guests without all the host's CPU features enabled. + +Enabling KVM also affects the ``query-cpu-model-expansion`` QMP command. 
The +affect is not only limited to specific features, as pointed out in example +(3) of "CPU Feature Probing", but also to which CPU types may be expanded. +When KVM is enabled, only the ``max``, ``host``, and current CPU type may be +expanded. This restriction is necessary as it's not possible to know all +CPU types that may work with KVM, but it does impose a small risk of users +experiencing unexpected errors. For example on a seattle, as mentioned +above, the ``cortex-a57`` CPU type is also valid when KVM is enabled. +Therefore a user could use the ``host`` CPU type for the current type, but +then attempt to query ``cortex-a57``, however that query will fail with our +restrictions. This shouldn't be an issue though as management layers and +users have been preferring the ``host`` CPU type for use with KVM for quite +some time. Additionally, if the KVM-enabled QEMU instance running on a +seattle host is using the ``cortex-a57`` CPU type, then querying ``cortex-a57`` +will work. + +Using CPU Features +================== + +After determining which CPU features are available and supported for a +given CPU type, then they may be selectively enabled or disabled on the +QEMU command line with that CPU type:: + + $ qemu-system-aarch64 -M virt -cpu max,pmu=off,sve=on,sve128=on,sve256=on + +The example above disables the PMU and enables the first two SVE vector +lengths for the ``max`` CPU type. Note, the ``sve=on`` isn't actually +necessary, because, as we observed above with our probe of the ``max`` CPU +type, ``sve`` is already on by default. Also, based on our probe of +defaults, it would seem we need to disable many SVE vector lengths, rather +than only enabling the two we want. This isn't the case, because, as +disabling many SVE vector lengths would be quite verbose, the ``sve<N>`` CPU +properties have special semantics (see "SVE CPU Property Parsing +Semantics"). + +KVM VCPU Features +================= + +KVM VCPU features are CPU features that are specific to KVM, such as +paravirt features or features that enable CPU virtualization extensions. +The features' CPU properties are only available when KVM is enabled and +are named with the prefix "kvm-". KVM VCPU features may be probed, +enabled, and disabled in the same way as other CPU features. Below is +the list of KVM VCPU features and their descriptions. + + kvm-no-adjvtime By default kvm-no-adjvtime is disabled. This + means that by default the virtual time + adjustment is enabled (vtime is not *not* + adjusted). + + When virtual time adjustment is enabled each + time the VM transitions back to running state + the VCPU's virtual counter is updated to ensure + stopped time is not counted. This avoids time + jumps surprising guest OSes and applications, + as long as they use the virtual counter for + timekeeping. However it has the side effect of + the virtual and physical counters diverging. + All timekeeping based on the virtual counter + will appear to lag behind any timekeeping that + does not subtract VM stopped time. The guest + may resynchronize its virtual counter with + other time sources as needed. + + Enable kvm-no-adjvtime to disable virtual time + adjustment, also restoring the legacy (pre-5.0) + behavior. + + kvm-steal-time Since v5.2, kvm-steal-time is enabled by + default when KVM is enabled, the feature is + supported, and the guest is 64-bit. + + When kvm-steal-time is enabled a 64-bit guest + can account for time its CPUs were not running + due to the host not scheduling the corresponding + VCPU threads. 
The accounting statistics may
+                   influence the guest scheduler behavior and/or be
+                   exposed to the guest userspace.
+
+TCG VCPU Features
+=================
+
+TCG VCPU features are CPU features that are specific to TCG.
+Below is the list of TCG VCPU features and their descriptions.
+
+  pauth            Enable or disable ``FEAT_Pauth``, pointer
+                   authentication. By default, the feature is
+                   enabled with ``-cpu max``.
+
+  pauth-impdef     When ``FEAT_Pauth`` is enabled, either the
+                   *impdef* (Implementation Defined) algorithm
+                   is enabled or the *architected* QARMA algorithm
+                   is enabled. By default the impdef algorithm
+                   is disabled, and QARMA is enabled.
+
+                   The architected QARMA algorithm has good
+                   cryptographic properties, but can be quite slow
+                   to emulate. The impdef algorithm used by QEMU
+                   is non-cryptographic but significantly faster.
+
+SVE CPU Properties
+==================
+
+There are two types of SVE CPU properties: ``sve`` and ``sve<N>``. The first
+is used to enable or disable the entire SVE feature, just as the ``pmu``
+CPU property completely enables or disables the PMU. The second type
+is used to enable or disable specific vector lengths, where ``N`` is the
+number of bits of the length. The ``sve<N>`` CPU properties have special
+dependencies and constraints, see "SVE CPU Property Dependencies and
+Constraints" below. Additionally, since we want all supported vector lengths
+to be enabled by default, in order to avoid overly verbose command
+lines (command lines full of ``sve<N>=off``, for all ``N`` not wanted) we
+provide the parsing semantics listed in "SVE CPU Property Parsing
+Semantics".
+
+SVE CPU Property Dependencies and Constraints
+---------------------------------------------
+
+  1) At least one vector length must be enabled when ``sve`` is enabled.
+
+  2) If a vector length ``N`` is enabled, then, when KVM is enabled, all
+     smaller, host supported vector lengths must also be enabled. If
+     KVM is not enabled, then only all the smaller, power-of-two vector
+     lengths must be enabled. E.g. with KVM if the host supports all
+     vector lengths up to 512-bits (128, 256, 384, 512), then if ``sve512``
+     is enabled, the 128-bit vector length, 256-bit vector length, and
+     384-bit vector length must also be enabled. Without KVM, the 384-bit
+     vector length would not be required.
+
+  3) If KVM is enabled then only vector lengths that the host CPU type
+     supports may be enabled. If SVE is not supported by the host, then
+     no ``sve*`` properties may be enabled.
+
+SVE CPU Property Parsing Semantics
+----------------------------------
+
+  1) If SVE is disabled (``sve=off``), then which SVE vector lengths
+     are enabled or disabled is irrelevant to the guest, as the entire
+     SVE feature is disabled and that disables all vector lengths for
+     the guest. However QEMU will still track any ``sve<N>`` CPU
+     properties provided by the user. If later an ``sve=on`` is provided,
+     then the guest will get only the enabled lengths. If no ``sve=on``
+     is provided and there are explicitly enabled vector lengths, then
+     an error is generated.
+
+  2) If SVE is enabled (``sve=on``), but no ``sve<N>`` CPU properties are
+     provided, then all supported vector lengths are enabled, which when
+     KVM is not in use means including the non-power-of-two lengths, and,
+     when KVM is in use, it means all vector lengths supported by the host
+     processor.
+
+  3) If SVE is enabled, then an error is generated when attempting to
+     disable the last enabled vector length (see constraint (1) of "SVE
+     CPU Property Dependencies and Constraints").
+
+  4) If one or more vector lengths have been explicitly enabled and at
+     least one of the dependency lengths of the maximum enabled length
+     has been explicitly disabled, then an error is generated (see
+     constraint (2) of "SVE CPU Property Dependencies and Constraints").
+
+  5) When KVM is enabled, if the host does not support SVE, then an error
+     is generated when attempting to enable any ``sve*`` properties (see
+     constraint (3) of "SVE CPU Property Dependencies and Constraints").
+
+  6) When KVM is enabled, if the host does support SVE, then an error is
+     generated when attempting to enable any vector lengths not supported
+     by the host (see constraint (3) of "SVE CPU Property Dependencies and
+     Constraints").
+
+  7) If one or more ``sve<N>`` CPU properties are set ``off``, but no ``sve<N>``
+     CPU properties are set ``on``, then the specified vector lengths are
+     disabled but the default for any unspecified lengths remains enabled.
+     When KVM is not enabled, disabling a power-of-two vector length also
+     disables all vector lengths larger than the power-of-two length.
+     When KVM is enabled, then disabling any supported vector length also
+     disables all larger vector lengths (see constraint (2) of "SVE CPU
+     Property Dependencies and Constraints").
+
+  8) If one or more ``sve<N>`` CPU properties are set to ``on``, then they
+     are enabled and all unspecified lengths default to disabled, except
+     for the required lengths per constraint (2) of "SVE CPU Property
+     Dependencies and Constraints", which will even be auto-enabled if
+     they were not explicitly enabled.
+
+  9) If SVE was disabled (``sve=off``), allowing all vector lengths to be
+     explicitly disabled (i.e. avoiding the error specified in (3) of
+     "SVE CPU Property Parsing Semantics"), then if later an ``sve=on`` is
+     provided an error will be generated. To avoid this error, one must
+     enable at least one vector length prior to enabling SVE.
+
+SVE CPU Property Examples
+-------------------------
+
+  1) Disable SVE::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve=off
+
+  2) Implicitly enable all vector lengths for the ``max`` CPU type::
+
+     $ qemu-system-aarch64 -M virt -cpu max
+
+  3) When KVM is enabled, implicitly enable all host CPU supported vector
+     lengths with the ``host`` CPU type::
+
+     $ qemu-system-aarch64 -M virt,accel=kvm -cpu host
+
+  4) Only enable the 128-bit vector length::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve128=on
+
+  5) Disable the 512-bit vector length and all larger vector lengths,
+     since 512 is a power-of-two.
This results in all the smaller,
+     uninitialized lengths (128, 256, and 384) defaulting to enabled::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve512=off
+
+  6) Enable the 128-bit, 256-bit, and 512-bit vector lengths::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve128=on,sve256=on,sve512=on
+
+  7) The same as (6), but since the 128-bit and 256-bit vector
+     lengths are required for the 512-bit vector length to be enabled,
+     allow them to be auto-enabled::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve512=on
+
+  8) Do the same as (7), but by first disabling SVE and then re-enabling it::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve=off,sve512=on,sve=on
+
+  9) Force errors regarding the last vector length::
+
+     $ qemu-system-aarch64 -M virt -cpu max,sve128=off
+     $ qemu-system-aarch64 -M virt -cpu max,sve=off,sve128=off,sve=on
+
+SVE CPU Property Recommendations
+--------------------------------
+
+The examples in "SVE CPU Property Examples" exhibit many ways to select
+vector lengths which developers may find useful in order to avoid overly
+verbose command lines. However, the recommended way to select vector
+lengths is to explicitly enable each desired length. Therefore only
+examples (1), (4), and (6) exhibit recommended uses of the properties.
+
+SVE User-mode Default Vector Length Property
+--------------------------------------------
+
+For qemu-aarch64, the CPU property ``sve-default-vector-length=N`` is
+defined to mirror the Linux kernel parameter file
+``/proc/sys/abi/sve_default_vector_length``. The default length, ``N``,
+is in units of bytes and must be between 16 and 8192.
+If not specified, the default vector length is 64.
+
+If the default length is larger than the maximum vector length enabled,
+the actual vector length will be reduced. Note that the maximum vector
+length supported by QEMU is 256.
+
+If this property is set to ``-1`` then the default vector length
+is set to the maximum possible length.
diff --git a/docs/system/arm/cubieboard.rst b/docs/system/arm/cubieboard.rst
new file mode 100644
index 000000000..344ff8cef
--- /dev/null
+++ b/docs/system/arm/cubieboard.rst
@@ -0,0 +1,16 @@
+Cubietech Cubieboard (``cubieboard``)
+=====================================
+
+The ``cubieboard`` model emulates the Cubietech Cubieboard,
+which is a Cortex-A8 based single-board computer using
+the Allwinner A10 SoC.
+
+Emulated devices:
+
+- Timer
+- UART
+- RTC
+- EMAC
+- SDHCI
+- USB controller
+- SATA controller
diff --git a/docs/system/arm/digic.rst b/docs/system/arm/digic.rst
new file mode 100644
index 000000000..2b3520ff5
--- /dev/null
+++ b/docs/system/arm/digic.rst
@@ -0,0 +1,11 @@
+Canon A1100 (``canon-a1100``)
+=============================
+
+This machine is a model of the Canon PowerShot A1100 camera, which
+uses the DIGIC SoC. This model is based on reverse engineering efforts
+by the contributors to the `CHDK <http://chdk.wikia.com/>`_ and
+`Magic Lantern <http://www.magiclantern.fm/>`_ projects.
+
+The emulation is incomplete. In particular it can't be used
+to run the original camera firmware, but it can successfully run
+an experimental version of the `barebox bootloader <http://www.barebox.org/>`_.
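+
+For example (a sketch only; the firmware file name is a placeholder for
+a barebox image built for this board), the machine might be started with:
+
+.. code-block:: bash
+
+  # barebox.canon-a1100.bin is an assumed name for a board-specific barebox build
+  $ qemu-system-arm -M canon-a1100 -bios barebox.canon-a1100.bin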
diff --git a/docs/system/arm/emcraft-sf2.rst b/docs/system/arm/emcraft-sf2.rst
new file mode 100644
index 000000000..377e24872
--- /dev/null
+++ b/docs/system/arm/emcraft-sf2.rst
@@ -0,0 +1,15 @@
+Emcraft SmartFusion2 SOM kit (``emcraft-sf2``)
+==============================================
+
+The ``emcraft-sf2`` board emulates the SmartFusion2 SOM kit from
+Emcraft (M2S010). This is a System-on-Module from Emcraft Systems,
+based on the SmartFusion2 SoC FPGA from Microsemi Corporation.
+The SoC is based on a Cortex-M4 processor.
+
+Emulated devices:
+
+- System timer
+- System registers
+- SPI controller
+- UART
+- EMAC
diff --git a/docs/system/arm/emulation.rst b/docs/system/arm/emulation.rst
new file mode 100644
index 000000000..144dc491d
--- /dev/null
+++ b/docs/system/arm/emulation.rst
@@ -0,0 +1,103 @@
+A-profile CPU architecture support
+==================================
+
+QEMU's TCG emulation includes support for the Armv5, Armv6, Armv7 and
+Armv8 versions of the A-profile architecture. It also has support for
+the following architecture extensions:
+
+- FEAT_AA32BF16 (AArch32 BFloat16 instructions)
+- FEAT_AA32HPD (AArch32 hierarchical permission disables)
+- FEAT_AA32I8MM (AArch32 Int8 matrix multiplication instructions)
+- FEAT_AES (AESD and AESE instructions)
+- FEAT_BF16 (AArch64 BFloat16 instructions)
+- FEAT_BTI (Branch Target Identification)
+- FEAT_DIT (Data Independent Timing instructions)
+- FEAT_DPB (DC CVAP instruction)
+- FEAT_DotProd (Advanced SIMD dot product instructions)
+- FEAT_FCMA (Floating-point complex number instructions)
+- FEAT_FHM (Floating-point half-precision multiplication instructions)
+- FEAT_FP16 (Half-precision floating-point data processing)
+- FEAT_FRINTTS (Floating-point to integer instructions)
+- FEAT_FlagM (Flag manipulation instructions v2)
+- FEAT_FlagM2 (Enhancements to flag manipulation instructions)
+- FEAT_HPDS (Hierarchical permission disables)
+- FEAT_I8MM (AArch64 Int8 matrix multiplication instructions)
+- FEAT_JSCVT (JavaScript conversion instructions)
+- FEAT_LOR (Limited ordering regions)
+- FEAT_LRCPC (Load-acquire RCpc instructions)
+- FEAT_LRCPC2 (Load-acquire RCpc instructions v2)
+- FEAT_LSE (Large System Extensions)
+- FEAT_MTE (Memory Tagging Extension)
+- FEAT_MTE2 (Memory Tagging Extension)
+- FEAT_MTE3 (MTE Asymmetric Fault Handling)
+- FEAT_PAN (Privileged access never)
+- FEAT_PAN2 (AT S1E1R and AT S1E1W instruction variants affected by PSTATE.PAN)
+- FEAT_PAuth (Pointer authentication)
+- FEAT_PMULL (PMULL, PMULL2 instructions)
+- FEAT_PMUv3p1 (PMU Extensions v3.1)
+- FEAT_PMUv3p4 (PMU Extensions v3.4)
+- FEAT_RDM (Advanced SIMD rounding double multiply accumulate instructions)
+- FEAT_RNG (Random number generator)
+- FEAT_SB (Speculation Barrier)
+- FEAT_SEL2 (Secure EL2)
+- FEAT_SHA1 (SHA1 instructions)
+- FEAT_SHA256 (SHA256 instructions)
+- FEAT_SHA3 (Advanced SIMD SHA3 instructions)
+- FEAT_SHA512 (Advanced SIMD SHA512 instructions)
+- FEAT_SM3 (Advanced SIMD SM3 instructions)
+- FEAT_SM4 (Advanced SIMD SM4 instructions)
+- FEAT_SPECRES (Speculation restriction instructions)
+- FEAT_SSBS (Speculative Store Bypass Safe)
+- FEAT_TLBIOS (TLB invalidate instructions in Outer Shareable domain)
+- FEAT_TLBIRANGE (TLB invalidate range instructions)
+- FEAT_TTCNP (Translation table Common not private translations)
+- FEAT_TTST (Small translation tables)
+- FEAT_UAO (Unprivileged Access Override control)
+- FEAT_VHE (Virtualization Host Extensions)
+- FEAT_VMID16 (16-bit VMID)
+- FEAT_XNX (Translation table stage 2
Unprivileged Execute-never)
+- SVE (The Scalable Vector Extension)
+- SVE2 (The Scalable Vector Extension v2)
+
+For information on the specifics of these extensions, please refer
+to the `Armv8-A Arm Architecture Reference Manual
+<https://developer.arm.com/documentation/ddi0487/latest>`_.
+
+When a specific named CPU is being emulated, only those features which
+are present in hardware for that CPU are emulated. (If a feature is
+not in the list above then it is not supported, even if the real
+hardware should have it.) The ``max`` CPU enables all features.
+
+R-profile CPU architecture support
+==================================
+
+QEMU's TCG emulation support for R-profile CPUs is currently limited.
+We emulate only the Cortex-R5 and Cortex-R5F CPUs.
+
+M-profile CPU architecture support
+==================================
+
+QEMU's TCG emulation includes support for Armv6-M, Armv7-M, Armv8-M, and
+Armv8.1-M versions of the M-profile architecture. It also has support
+for the following architecture extensions:
+
+- FP (Floating-point Extension)
+- FPCXT (FPCXT access instructions)
+- HP (Half-precision floating-point instructions)
+- LOB (Low Overhead loops and Branch future)
+- M (Main Extension)
+- MPU (Memory Protection Unit Extension)
+- PXN (Privileged Execute Never)
+- RAS (Reliability, Serviceability and Availability): "minimum RAS Extension" only
+- S (Security Extension)
+- ST (System Timer Extension)
+
+For information on the specifics of these extensions, please refer
+to the `Armv8-M Arm Architecture Reference Manual
+<https://developer.arm.com/documentation/ddi0553/latest>`_.
+
+When a specific named CPU is being emulated, only those features which
+are present in hardware for that CPU are emulated. (If a feature is
+not in the list above then it is not supported, even if the real
+hardware should have it.) There is no equivalent of the ``max`` CPU for
+M-profile.
diff --git a/docs/system/arm/gumstix.rst b/docs/system/arm/gumstix.rst
new file mode 100644
index 000000000..cb373139d
--- /dev/null
+++ b/docs/system/arm/gumstix.rst
@@ -0,0 +1,21 @@
+Gumstix Connex and Verdex (``connex``, ``verdex``)
+==================================================
+
+These machines model the Gumstix Connex and Verdex boards.
+The Connex has a PXA255 CPU and the Verdex has a PXA270.
+
+Implemented devices:
+
+ * NOR flash
+ * SMC91C111 ethernet
+ * Interrupt controller
+ * DMA
+ * Timer
+ * GPIO
+ * MMC/SD card
+ * Fast infra-red communications port (FIR)
+ * LCD controller
+ * Synchronous serial ports (SPI)
+ * PCMCIA interface
+ * I2C
+ * I2S
diff --git a/docs/system/arm/highbank.rst b/docs/system/arm/highbank.rst
new file mode 100644
index 000000000..bb4965b36
--- /dev/null
+++ b/docs/system/arm/highbank.rst
@@ -0,0 +1,19 @@
+Calxeda Highbank and Midway (``highbank``, ``midway``)
+======================================================
+
+``highbank`` is a model of the Calxeda Highbank (ECX-1000) system,
+which has four Cortex-A9 cores.
+
+``midway`` is a model of the Calxeda Midway (ECX-2000) system,
+which has four Cortex-A15 cores.
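+
+As an illustrative sketch (the kernel and DTB paths are placeholders
+for images built for these boards), a kernel might be booted with:
+
+.. code-block:: bash
+
+  # zImage and highbank.dtb are assumed names for a board-specific kernel build
+  $ qemu-system-arm -M highbank -kernel zImage \
+      -dtb highbank.dtb -append "console=ttyAMA0"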
+
+Emulated devices:
+
+- L2x0 cache controller
+- SP804 dual timer
+- PL011 UART
+- PL061 GPIOs
+- PL031 RTC
+- PL022 synchronous serial port controller
+- AHCI
+- XGMAC ethernet controllers
diff --git a/docs/system/arm/imx25-pdk.rst b/docs/system/arm/imx25-pdk.rst
new file mode 100644
index 000000000..2a9711e8a
--- /dev/null
+++ b/docs/system/arm/imx25-pdk.rst
@@ -0,0 +1,19 @@
+NXP i.MX25 PDK board (``imx25-pdk``)
+====================================
+
+The ``imx25-pdk`` board emulates the NXP i.MX25 Product Development Kit
+board, which is based on an i.MX25 SoC which uses an ARM926 CPU.
+
+Emulated devices:
+
+- SD controller
+- AVIC
+- CCM
+- GPT
+- EPIT timers
+- FEC
+- RNGC
+- I2C
+- GPIO controllers
+- Watchdog timer
+- USB controllers
diff --git a/docs/system/arm/integratorcp.rst b/docs/system/arm/integratorcp.rst
new file mode 100644
index 000000000..594438008
--- /dev/null
+++ b/docs/system/arm/integratorcp.rst
@@ -0,0 +1,16 @@
+Arm Integrator/CP (``integratorcp``)
+====================================
+
+The Arm Integrator/CP board is emulated with the following devices:
+
+- ARM926E, ARM1026E, ARM946E, ARM1136 or Cortex-A8 CPU
+
+- Two PL011 UARTs
+
+- SMC 91c111 Ethernet adapter
+
+- PL110 LCD controller
+
+- PL050 KMI with PS/2 keyboard and mouse.
+
+- PL181 MultiMedia Card Interface with SD card.
diff --git a/docs/system/arm/kzm.rst b/docs/system/arm/kzm.rst
new file mode 100644
index 000000000..bb018fbdf
--- /dev/null
+++ b/docs/system/arm/kzm.rst
@@ -0,0 +1,18 @@
+Kyoto Microcomputer KZM-ARM11-01 (``kzm``)
+==========================================
+
+The ``kzm`` board emulates the Kyoto Microcomputer KZM-ARM11-01
+evaluation board, which is based on an NXP i.MX31 SoC
+which uses an ARM1136 CPU.
+
+Emulated devices:
+
+- UARTs
+- LAN9118 ethernet
+- AVIC
+- CCM
+- GPT
+- EPIT timers
+- I2C
+- GPIO controllers
+- Watchdog timer
diff --git a/docs/system/arm/mainstone.rst b/docs/system/arm/mainstone.rst
new file mode 100644
index 000000000..05310f42c
--- /dev/null
+++ b/docs/system/arm/mainstone.rst
@@ -0,0 +1,25 @@
+Intel Mainstone II board (``mainstone``)
+========================================
+
+The ``mainstone`` board emulates the Intel Mainstone II development
+board, which uses a PXA270 CPU.
+
+Emulated devices:
+
+- Flash memory
+- Keypad
+- MMC controller
+- 91C111 ethernet
+- PIC
+- Timer
+- DMA
+- GPIO
+- FIR
+- Serial
+- LCD controller
+- SSP
+- USB controller
+- RTC
+- PCMCIA
+- I2C
+- I2S
diff --git a/docs/system/arm/mps2.rst b/docs/system/arm/mps2.rst
new file mode 100644
index 000000000..8a75beb3a
--- /dev/null
+++ b/docs/system/arm/mps2.rst
@@ -0,0 +1,57 @@
+Arm MPS2 and MPS3 boards (``mps2-an385``, ``mps2-an386``, ``mps2-an500``, ``mps2-an505``, ``mps2-an511``, ``mps2-an521``, ``mps3-an524``, ``mps3-an547``)
+=========================================================================================================================================================
+
+These board models all use Arm M-profile CPUs.
+
+The Arm MPS2, MPS2+ and MPS3 dev boards are FPGA based (the 2+ has a
+bigger FPGA but is otherwise the same as the 2; the 3 has a bigger
+FPGA again, can handle 4GB of RAM and has a USB controller and QSPI flash).
+
+Since the CPU itself and most of the devices are in the FPGA, the
+details of the board as seen by the guest depend significantly on the
+FPGA image.
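+
+For example (an illustrative sketch only; the ELF file name is a
+placeholder, and the firmware must have been built for the FPGA image
+selected by the machine type), an M-profile binary might be run with:
+
+.. code-block:: bash
+
+  # firmware.elf is an assumed name for an image built for the AN385 FPGA image
+  $ qemu-system-arm -M mps2-an385 -kernel firmware.elf -nographic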
+ +QEMU models the following FPGA images: + +``mps2-an385`` + Cortex-M3 as documented in Arm Application Note AN385 +``mps2-an386`` + Cortex-M4 as documented in Arm Application Note AN386 +``mps2-an500`` + Cortex-M7 as documented in Arm Application Note AN500 +``mps2-an505`` + Cortex-M33 as documented in Arm Application Note AN505 +``mps2-an511`` + Cortex-M3 'DesignStart' as documented in Arm Application Note AN511 +``mps2-an521`` + Dual Cortex-M33 as documented in Arm Application Note AN521 +``mps3-an524`` + Dual Cortex-M33 on an MPS3, as documented in Arm Application Note AN524 +``mps3-an547`` + Cortex-M55 on an MPS3, as documented in Arm Application Note AN547 + +Differences between QEMU and real hardware: + +- AN385/AN386 remapping of low 16K of memory to either ZBT SSRAM1 or to + block RAM is unimplemented (QEMU always maps this to ZBT SSRAM1, as + if zbt_boot_ctrl is always zero) +- AN524 remapping of low memory to either BRAM or to QSPI flash is + unimplemented (QEMU always maps this to BRAM, ignoring the + SCC CFG_REG0 memory-remap bit) +- QEMU provides a LAN9118 ethernet rather than LAN9220; the only guest + visible difference is that the LAN9118 doesn't support checksum + offloading +- QEMU does not model the QSPI flash in MPS3 boards as real QSPI + flash, but only as simple ROM, so attempting to rewrite the flash + from the guest will fail +- QEMU does not model the USB controller in MPS3 boards + +Machine-specific options +"""""""""""""""""""""""" + +The following machine-specific options are supported: + +remap + Supported for ``mps3-an524`` only. + Set ``BRAM``/``QSPI`` to select the initial memory mapping. The + default is ``BRAM``. diff --git a/docs/system/arm/musca.rst b/docs/system/arm/musca.rst new file mode 100644 index 000000000..81e3dc921 --- /dev/null +++ b/docs/system/arm/musca.rst @@ -0,0 +1,31 @@ +Arm Musca boards (``musca-a``, ``musca-b1``) +============================================ + +The Arm Musca development boards are a reference implementation +of a system using the SSE-200 Subsystem for Embedded. They are +dual Cortex-M33 systems. + +QEMU provides models of the A and B1 variants of this board. + +Unimplemented devices: + +- SPI +- |I2C| +- |I2S| +- PWM +- QSPI +- Timer +- SCC +- GPIO +- eFlash +- MHU +- PVT +- SDIO +- CryptoCell + +Note that (like the real hardware) the Musca-A machine is +asymmetric: CPU 0 does not have the FPU or DSP extensions, +but CPU 1 does. Also like the real hardware, the memory maps +for the A and B1 variants differ significantly, so guest +software must be built for the right variant. + diff --git a/docs/system/arm/musicpal.rst b/docs/system/arm/musicpal.rst new file mode 100644 index 000000000..9de380edf --- /dev/null +++ b/docs/system/arm/musicpal.rst @@ -0,0 +1,19 @@ +Freecom MusicPal (``musicpal``) +=============================== + +The Freecom MusicPal internet radio emulation includes the following +elements: + +- Marvell MV88W8618 Arm core. + +- 32 MB RAM, 256 KB SRAM, 8 MB flash. 
+
+- Up to 2 16550 UARTs
+
+- MV88W8xx8 Ethernet controller
+
+- MV88W8618 audio controller, WM8750 CODEC and mixer
+
+- 128x64 display with brightness control
+
+- 2 buttons, 2 navigation wheels with button function
diff --git a/docs/system/arm/nrf.rst b/docs/system/arm/nrf.rst
new file mode 100644
index 000000000..eda87bd76
--- /dev/null
+++ b/docs/system/arm/nrf.rst
@@ -0,0 +1,51 @@
+Nordic nRF boards (``microbit``)
+================================
+
+The `Nordic nRF`_ chips are a family of ARM-based Systems-on-Chip that
+are designed to be used for low-power and short-range wireless solutions.
+
+.. _Nordic nRF: https://www.nordicsemi.com/Products
+
+The nRF51 series is the first series for short range wireless applications.
+It is superseded by the nRF52 series.
+The following machines are based on this chip:
+
+- ``microbit`` BBC micro:bit board with nRF51822 SoC
+
+There are other series such as nRF52, nRF53 and nRF91 which are currently not
+supported by QEMU.
+
+Supported devices
+-----------------
+
+ * ARM Cortex-M0 (ARMv6-M)
+ * Serial ports (UART)
+ * Clock controller
+ * Timers
+ * Random Number Generator (RNG)
+ * GPIO controller
+ * NVMC
+ * SWI
+
+Missing devices
+---------------
+
+ * Watchdog
+ * Real-Time Clock (RTC) controller
+ * TWI (i2c)
+ * SPI controller
+ * Analog to Digital Converter (ADC)
+ * Quadrature decoder
+ * Radio
+
+Boot options
+------------
+
+The micro:bit machine can be started using the ``-device`` option to load a
+firmware in `ihex format`_. Example:
+
+.. _ihex format: https://en.wikipedia.org/wiki/Intel_HEX
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M microbit -device loader,file=test.hex
diff --git a/docs/system/arm/nseries.rst b/docs/system/arm/nseries.rst
new file mode 100644
index 000000000..cd9edf5d8
--- /dev/null
+++ b/docs/system/arm/nseries.rst
@@ -0,0 +1,33 @@
+Nokia N800 and N810 tablets (``n800``, ``n810``)
+================================================
+
+Nokia N800 and N810 internet tablets (known also as RX-34 and RX-44 /
+48) emulation supports the following elements:
+
+- Texas Instruments OMAP2420 System-on-chip (ARM1136 core)
+
+- RAM and non-volatile OneNAND Flash memories
+
+- Display connected to EPSON remote framebuffer chip and OMAP on-chip
+  display controller and a LS041y3 MIPI DBI-C controller
+
+- TI TSC2301 (in N800) and TI TSC2005 (in N810) touchscreen
+  controllers driven through SPI bus
+
+- National Semiconductor LM8323-controlled qwerty keyboard driven
+  through |I2C| bus
+
+- Secure Digital card connected to OMAP MMC/SD host
+
+- Three OMAP on-chip UARTs and on-chip STI debugging console
+
+- Mentor Graphics "Inventra" dual-role USB controller embedded in a
+  TI TUSB6010 chip - only USB host mode is supported
+
+- TI TMP105 temperature sensor driven through |I2C| bus
+
+- TI TWL92230C power management companion with an RTC on
+  |I2C| bus
+
+- Nokia RETU and TAHVO multi-purpose chips with an RTC, connected
+  through CBUS
diff --git a/docs/system/arm/nuvoton.rst b/docs/system/arm/nuvoton.rst
new file mode 100644
index 000000000..adf497e67
--- /dev/null
+++ b/docs/system/arm/nuvoton.rst
@@ -0,0 +1,95 @@
+Nuvoton iBMC boards (``*-bmc``, ``npcm750-evb``, ``quanta-gsj``)
+================================================================
+
+The `Nuvoton iBMC`_ chips (NPCM7xx) are a family of ARM-based SoCs that are
+designed to be used as Baseboard Management Controllers (BMCs) in various
+servers.
They all feature one or two ARM Cortex-A9 CPU cores, as well as an
+assortment of peripherals targeted for either Enterprise or Data Center /
+Hyperscale applications. The former is a superset of the latter, so NPCM750 has
+all the peripherals of NPCM730 and more.
+
+.. _Nuvoton iBMC: https://www.nuvoton.com/products/cloud-computing/ibmc/
+
+The NPCM750 SoC has two Cortex-A9 cores and is targeted for the Enterprise
+segment. The following machines are based on this chip:
+
+- ``npcm750-evb`` Nuvoton NPCM750 Evaluation board
+
+The NPCM730 SoC has two Cortex-A9 cores and is targeted for Data Center and
+Hyperscale applications. The following machines are based on this chip:
+
+- ``quanta-gbs-bmc`` Quanta GBS server BMC
+- ``quanta-gsj`` Quanta GSJ server BMC
+- ``kudo-bmc`` Fii USA Kudo server BMC
+
+There are also two more SoCs, NPCM710 and NPCM705, which are single-core
+variants of NPCM750 and NPCM730, respectively. These are currently not
+supported by QEMU.
+
+Supported devices
+-----------------
+
+ * SMP (Dual Core Cortex-A9)
+ * Cortex-A9MPCore built-in peripherals: SCU, GIC, Global Timer, Private Timer
+   and Watchdog.
+ * SRAM, ROM and DRAM mappings
+ * System Global Control Registers (GCR)
+ * Clock and reset controller (CLK)
+ * Timer controller (TIM)
+ * Serial ports (16550-based)
+ * DDR4 memory controller (dummy interface indicating memory training is done)
+ * OTP controllers (no protection features)
+ * Flash Interface Unit (FIU; no protection features)
+ * Random Number Generator (RNG)
+ * USB host (USBH)
+ * GPIO controller
+ * Analog to Digital Converter (ADC)
+ * Pulse Width Modulation (PWM)
+ * SMBus controller (SMBF)
+ * Ethernet controller (EMC)
+ * Tachometer
+
+Missing devices
+---------------
+
+ * LPC/eSPI host-to-BMC interface, including
+
+   * Keyboard and mouse controller interface (KBCI)
+   * Keyboard Controller Style (KCS) channels
+   * BIOS POST code FIFO
+   * System Wake-up Control (SWC)
+   * Shared memory (SHM)
+   * eSPI slave interface
+
+ * Ethernet controller (GMAC)
+ * USB device (USBD)
+ * Peripheral SPI controller (PSPI)
+ * SD/MMC host
+ * PECI interface
+ * PCI and PCIe root complex and bridges
+ * VDM and MCTP support
+ * Serial I/O expansion
+ * LPC/eSPI host
+ * Coprocessor
+ * Graphics
+ * Video capture
+ * Encoding compression engine
+ * Security features
+
+Boot options
+------------
+
+The Nuvoton machines can boot from an OpenBMC firmware image, or directly into
+a kernel using the ``-kernel`` option. OpenBMC images for ``quanta-gsj`` and
+possibly others can be downloaded from the OpenPOWER Jenkins:
+
+  https://openpower.xyz/
+
+The firmware image should be attached as an MTD drive. Example:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -machine quanta-gsj -nographic \
+      -drive file=image-bmc,if=mtd,bus=0,unit=0,format=raw
+
+The default root password for test images is usually ``0penBmc``.
diff --git a/docs/system/arm/orangepi.rst b/docs/system/arm/orangepi.rst
new file mode 100644
index 000000000..83c744519
--- /dev/null
+++ b/docs/system/arm/orangepi.rst
@@ -0,0 +1,263 @@
+Orange Pi PC (``orangepi-pc``)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Xunlong Orange Pi PC is an Allwinner H3 System on Chip
+based embedded computer with mainline support in both U-Boot
+and Linux. The board comes with a Quad Core Cortex-A7 @ 1.3GHz,
+1GiB RAM, 100Mbit ethernet, USB, SD/MMC, HDMI and
+various other I/O.
+
+Supported devices
+"""""""""""""""""
+
+The Orange Pi PC machine supports the following devices:
+
+ * SMP (Quad Core Cortex-A7)
+ * Generic Interrupt Controller configuration
+ * SRAM mappings
+ * SDRAM controller
+ * Real Time Clock
+ * Timer device (re-used from Allwinner A10)
+ * UART
+ * SD/MMC storage controller
+ * EMAC ethernet
+ * USB 2.0 interfaces
+ * Clock Control Unit
+ * System Control module
+ * Security Identifier device
+
+Limitations
+"""""""""""
+
+Currently, Orange Pi PC does *not* support the following features:
+
+- Graphical output via HDMI, GPU and/or the Display Engine
+- Audio output
+- Hardware Watchdog
+
+Also see the 'unimplemented' array in the Allwinner H3 SoC module
+for a complete list of unimplemented I/O devices: ``./hw/arm/allwinner-h3.c``
+
+Boot options
+""""""""""""
+
+The Orange Pi PC machine can start using the standard -kernel functionality
+for loading a Linux kernel or ELF executable. Additionally, the Orange Pi PC
+machine can also emulate the BootROM which is present on an actual Allwinner H3
+based SoC, which loads the bootloader from an SD card, specified via the -sd argument
+to qemu-system-arm.
+
+Machine-specific options
+""""""""""""""""""""""""
+
+The following machine-specific options are supported:
+
+- allwinner-rtc.base-year=YYYY
+
+  The Allwinner RTC device is automatically created by the Orange Pi PC machine
+  and uses a default base year value which can be overridden using the 'base-year' property.
+  The base year is the actual represented year when the RTC year value is zero.
+  This option can be used in case the target operating system driver uses a different
+  base year value. The minimum value for the base year is 1900.
+
+- allwinner-sid.identifier=abcd1122-a000-b000-c000-12345678ffff
+
+  The Security Identifier value can be read by the guest.
+  For example, U-Boot uses it to determine a unique MAC address.
+
+The above machine-specific options can be specified in qemu-system-arm
+via the '-global' argument, for example:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M orangepi-pc -sd mycard.img \
+      -global allwinner-rtc.base-year=2000
+
+Running mainline Linux
+""""""""""""""""""""""
+
+Mainline Linux kernels from 4.19 up to latest master are known to work.
+To build a Linux mainline kernel that can be booted by the Orange Pi PC machine,
+simply configure the kernel using the sunxi_defconfig configuration:
+
+.. code-block:: bash
+
+  $ ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- make mrproper
+  $ ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- make sunxi_defconfig
+
+To be able to use USB storage, you need to manually enable the corresponding
+configuration item. Start the kconfig configuration tool:
+
+.. code-block:: bash
+
+  $ ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- make menuconfig
+
+Navigate to the following item, enable it and save your configuration:
+
+  Device Drivers > USB support > USB Mass Storage support
+
+Build the Linux kernel with:
+
+.. code-block:: bash
+
+  $ ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- make
+
+To boot the newly built Linux kernel in QEMU with the Orange Pi PC machine, use:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M orangepi-pc -nic user -nographic \
+      -kernel /path/to/linux/arch/arm/boot/zImage \
+      -append 'console=ttyS0,115200' \
+      -dtb /path/to/linux/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dtb
+
+Orange Pi PC images
+"""""""""""""""""""
+
+Note that the mainline kernel does not have a root filesystem.
You may provide it
+with an official Orange Pi PC image from the official website:
+
+  http://www.orangepi.org/downloadresources/
+
+Another possibility is to run an Armbian image for Orange Pi PC which
+can be downloaded from:
+
+  https://www.armbian.com/orange-pi-pc/
+
+Alternatively, you can also choose to build your own image with buildroot
+using the orangepi_pc_defconfig. Also see https://buildroot.org for more information.
+
+When using an image as an SD card, it must be resized to a power of two. This can be
+done with the ``qemu-img`` command. It is recommended to only increase the image size
+instead of shrinking it to a power of two, to avoid loss of data. For example,
+to prepare a downloaded Armbian image, first extract it and then increase
+its size to one gigabyte as follows:
+
+.. code-block:: bash
+
+  $ qemu-img resize Armbian_19.11.3_Orangepipc_bionic_current_5.3.9.img 1G
+
+You can choose to attach the selected image either as an SD card or as USB mass storage.
+For example, to boot using the Orange Pi PC Debian image on SD card, simply add the -sd
+argument and provide the proper root= kernel parameter:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M orangepi-pc -nic user -nographic \
+      -kernel /path/to/linux/arch/arm/boot/zImage \
+      -append 'console=ttyS0,115200 root=/dev/mmcblk0p2' \
+      -dtb /path/to/linux/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dtb \
+      -sd OrangePi_pc_debian_stretch_server_linux5.3.5_v1.0.img
+
+To attach the image as a USB mass storage device to the machine,
+simply append to the command:
+
+.. code-block:: bash
+
+  -drive if=none,id=stick,file=myimage.img \
+  -device usb-storage,bus=usb-bus.0,drive=stick
+
+Instead of providing a custom Linux kernel via the -kernel command you may also
+choose to let the Orange Pi PC machine load the bootloader from SD card, just like
+a real board would do using the BootROM. Simply pass the selected image via the -sd
+argument and remove the -kernel, -append, -dtb and -initrd arguments:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M orangepi-pc -nic user -nographic \
+      -sd Armbian_19.11.3_Orangepipc_buster_current_5.3.9.img
+
+Note that both the official Orange Pi PC images and Armbian images start
+a lot of userland programs via systemd. Depending on the host hardware and OS,
+they may be slow to emulate, especially due to emulating the 4 cores.
+To help reduce the performance slowdown due to emulating the 4 cores, you can
+give the following kernel parameters via U-Boot (or via -append):
+
+.. code-block:: bash
+
+  => setenv extraargs 'systemd.default_timeout_start_sec=9000 loglevel=7 nosmp console=ttyS0,115200'
+
+Running U-Boot
+""""""""""""""
+
+U-Boot mainline can be built and configured using the orangepi_pc_defconfig
+using similar commands as described above for Linux. Note that it is recommended
+for development/testing to select the following configuration setting in U-Boot:
+
+  Device Tree Control > Provider for DTB for DT Control > Embedded DTB
+
+To start U-Boot using the Orange Pi PC machine, provide the
+u-boot binary to the -kernel argument:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M orangepi-pc -nic user -nographic \
+      -kernel /path/to/uboot/u-boot -sd disk.img
+
+Use the following U-Boot commands to load and boot a Linux kernel from SD card:
+
+.. 
code-block:: bash
+
+  => setenv bootargs console=ttyS0,115200
+  => ext2load mmc 0 0x42000000 zImage
+  => ext2load mmc 0 0x43000000 sun8i-h3-orangepi-pc.dtb
+  => bootz 0x42000000 - 0x43000000
+
+Running NetBSD
+""""""""""""""
+
+The NetBSD operating system also includes support for Allwinner H3 based boards,
+including the Orange Pi PC. NetBSD 9.0 is known to work best for the Orange Pi PC
+board and provides a fully working system with serial console, networking and storage.
+For the Orange Pi PC machine, get the 'evbarm-earmv7hf' based image from:
+
+  https://cdn.netbsd.org/pub/NetBSD/NetBSD-9.0/evbarm-earmv7hf/binary/gzimg/armv7.img.gz
+
+The image requires U-Boot to be installed into it manually. Build U-Boot with
+the orangepi_pc_defconfig configuration as described in the previous section.
+Next, unzip the NetBSD image and write the U-Boot binary including SPL using:
+
+.. code-block:: bash
+
+  $ gunzip armv7.img.gz
+  $ dd if=/path/to/u-boot-sunxi-with-spl.bin of=armv7.img bs=1024 seek=8 conv=notrunc
+
+Finally, before starting the machine the SD image must be extended such
+that the size of the SD image is a power of two and that the NetBSD kernel
+will not conclude the NetBSD partition is larger than the emulated SD card:
+
+.. code-block:: bash
+
+  $ qemu-img resize armv7.img 2G
+
+Start the machine using the following command:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M orangepi-pc -nic user -nographic \
+      -sd armv7.img -global allwinner-rtc.base-year=2000
+
+At the U-Boot stage, interrupt the automatic boot process by pressing a key
+and set the following environment variables before booting:
+
+.. code-block:: bash
+
+  => setenv bootargs root=ld0a
+  => setenv kernel netbsd-GENERIC.ub
+  => setenv fdtfile dtb/sun8i-h3-orangepi-pc.dtb
+  => setenv bootcmd 'fatload mmc 0:1 ${kernel_addr_r} ${kernel}; fatload mmc 0:1 ${fdt_addr_r} ${fdtfile}; fdt addr ${fdt_addr_r}; bootm ${kernel_addr_r} - ${fdt_addr_r}'
+
+Optionally you may save the environment variables to SD card with 'saveenv'.
+To continue booting simply give the 'boot' command and NetBSD boots.
+
+Orange Pi PC integration tests
+""""""""""""""""""""""""""""""
+
+The Orange Pi PC machine has several integration tests included.
+To run the whole set of tests, build QEMU from source and simply
+provide the following command:
+
+.. 
code-block:: bash
+
+  $ AVOCADO_ALLOW_LARGE_STORAGE=yes avocado --show=app,console run \
+      -t machine:orangepi-pc tests/avocado/boot_linux_console.py
diff --git a/docs/system/arm/palm.rst b/docs/system/arm/palm.rst
new file mode 100644
index 000000000..47ff9b36d
--- /dev/null
+++ b/docs/system/arm/palm.rst
@@ -0,0 +1,23 @@
+Palm Tungsten|E PDA (``cheetah``)
+=================================
+
+The Palm Tungsten|E PDA (codename "Cheetah") emulation includes the
+following elements:
+
+- Texas Instruments OMAP310 System-on-chip (ARM925T core)
+
+- ROM and RAM memories (ROM firmware image can be loaded with
+  -option-rom)
+
+- On-chip LCD controller
+
+- On-chip Real Time Clock
+
+- TI TSC2102i touchscreen controller / analog-digital converter /
+  Audio CODEC, connected through MicroWire and |I2S| busses
+
+- GPIO-connected matrix keypad
+
+- Secure Digital card connected to OMAP MMC/SD host
+
+- Three on-chip UARTs
diff --git a/docs/system/arm/raspi.rst b/docs/system/arm/raspi.rst
new file mode 100644
index 000000000..922fe375a
--- /dev/null
+++ b/docs/system/arm/raspi.rst
@@ -0,0 +1,43 @@
+Raspberry Pi boards (``raspi0``, ``raspi1ap``, ``raspi2b``, ``raspi3ap``, ``raspi3b``)
+======================================================================================
+
+
+QEMU provides models of the following Raspberry Pi boards:
+
+``raspi0`` and ``raspi1ap``
+  ARM1176JZF-S core, 512 MiB of RAM
+``raspi2b``
+  Cortex-A7 (4 cores), 1 GiB of RAM
+``raspi3ap``
+  Cortex-A53 (4 cores), 512 MiB of RAM
+``raspi3b``
+  Cortex-A53 (4 cores), 1 GiB of RAM
+
+
+Implemented devices
+-------------------
+
+ * ARM1176JZF-S, Cortex-A7 or Cortex-A53 CPU
+ * Interrupt controller
+ * DMA controller
+ * Clock and reset controller (CPRMAN)
+ * System Timer
+ * GPIO controller
+ * Serial ports (BCM2835 AUX - 16550 based - and PL011)
+ * Random Number Generator (RNG)
+ * Frame Buffer
+ * USB host (USBH)
+ * SD/MMC host controller
+ * SoC thermal sensor
+ * USB2 host controller (DWC2 and MPHI)
+ * MailBox controller (MBOX)
+ * VideoCore firmware (property)
+
+
+Missing devices
+---------------
+
+ * Peripheral SPI controller (SPI)
+ * Analog to Digital Converter (ADC)
+ * Pulse Width Modulation (PWM)
diff --git a/docs/system/arm/realview.rst b/docs/system/arm/realview.rst
new file mode 100644
index 000000000..65f5be346
--- /dev/null
+++ b/docs/system/arm/realview.rst
@@ -0,0 +1,34 @@
+Arm Realview boards (``realview-eb``, ``realview-eb-mpcore``, ``realview-pb-a8``, ``realview-pbx-a9``)
+======================================================================================================
+
+Several variants of the Arm RealView baseboard are emulated, including
+the EB, PB-A8 and PBX-A9. Due to interactions with the bootloader, only
+certain Linux kernel configurations work out of the box on these boards.
+
+Kernels for the PB-A8 board should have CONFIG_REALVIEW_HIGH_PHYS_OFFSET
+enabled in the kernel, and expect 512M RAM. Kernels for the PBX-A9 board
+should have CONFIG_SPARSEMEM enabled, CONFIG_REALVIEW_HIGH_PHYS_OFFSET
+disabled and expect 1024M RAM.
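+
+As an illustrative sketch (the kernel image is a placeholder for one
+built with the configuration described above), a PB-A8 kernel might be
+booted with:
+
+.. code-block:: bash
+
+  # zImage is an assumed name for a kernel built as described above
+  $ qemu-system-arm -M realview-pb-a8 -m 512M \
+      -kernel zImage -append "console=ttyAMA0"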
+
+The following devices are emulated:
+
+- ARM926E, ARM1136, ARM11MPCore, Cortex-A8 or Cortex-A9 MPCore CPU
+
+- Arm AMBA Generic/Distributed Interrupt Controller
+
+- Four PL011 UARTs
+
+- SMC 91c111 or SMSC LAN9118 Ethernet adapter
+
+- PL110 LCD controller
+
+- PL050 KMI with PS/2 keyboard and mouse
+
+- PCI host bridge
+
+- PCI OHCI USB controller
+
+- LSI53C895A PCI SCSI Host Bus Adapter with hard disk and CD-ROM
+  devices
+
+- PL181 MultiMedia Card Interface with SD card.
diff --git a/docs/system/arm/sabrelite.rst b/docs/system/arm/sabrelite.rst
new file mode 100644
index 000000000..4ccb0560a
--- /dev/null
+++ b/docs/system/arm/sabrelite.rst
@@ -0,0 +1,119 @@
+Boundary Devices SABRE Lite (``sabrelite``)
+===========================================
+
+The Boundary Devices SABRE Lite i.MX6 development board is a low-cost
+development platform featuring the powerful Freescale / NXP Semiconductors
+i.MX 6 Quad Applications Processor.
+
+Supported devices
+-----------------
+
+The SABRE Lite machine supports the following devices:
+
+ * Up to 4 Cortex-A9 cores
+ * Generic Interrupt Controller
+ * 1 Clock Controller Module
+ * 1 System Reset Controller
+ * 5 UARTs
+ * 2 EPIT timers
+ * 1 GPT timer
+ * 2 Watchdog timers
+ * 1 FEC Ethernet controller
+ * 3 I2C controllers
+ * 7 GPIO controllers
+ * 4 SDHC storage controllers
+ * 4 USB 2.0 host controllers
+ * 5 ECSPI controllers
+ * 1 SST 25VF016B flash
+
+Please note that the above list is a complete superset of the devices the
+QEMU SABRE Lite machine can support. For a typical use case, a device tree
+blob that represents a real world SABRE Lite board exposes only a subset of
+these devices to the guest software.
+
+Boot options
+------------
+
+The SABRE Lite machine can start using the standard -kernel functionality
+for loading a Linux kernel, U-Boot bootloader or ELF executable.
+
+Running Linux kernel
+--------------------
+
+Linux mainline v5.10 release is tested at the time of writing. To build a Linux
+mainline kernel that can be booted by the SABRE Lite machine, simply configure
+the kernel using the imx_v6_v7_defconfig configuration:
+
+.. code-block:: bash
+
+  $ export ARCH=arm
+  $ export CROSS_COMPILE=arm-linux-gnueabihf-
+  $ make imx_v6_v7_defconfig
+  $ make
+
+To boot the newly built Linux kernel in QEMU with the SABRE Lite machine, use:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M sabrelite -smp 4 -m 1G \
+      -display none -serial null -serial stdio \
+      -kernel arch/arm/boot/zImage \
+      -dtb arch/arm/boot/dts/imx6q-sabrelite.dtb \
+      -initrd /path/to/rootfs.ext4 \
+      -append "root=/dev/ram"
+
+Running U-Boot
+--------------
+
+U-Boot mainline v2020.10 release is tested at the time of writing. To build a
+U-Boot mainline bootloader that can be booted by the SABRE Lite machine, use
+the mx6qsabrelite_defconfig with similar commands as described above for Linux:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=arm-linux-gnueabihf-
+  $ make mx6qsabrelite_defconfig
+
+Note that we need to adjust settings by running:
+
+.. code-block:: bash
+
+  $ make menuconfig
+
+then manually select the following configuration in U-Boot:
+
+  Device Tree Control > Provider of DTB for DT Control > Embedded DTB
+
+To start U-Boot using the SABRE Lite machine, provide the u-boot binary to
+the -kernel argument, along with an SD card image with rootfs:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M sabrelite -smp 4 -m 1G \
+      -display none -serial null -serial stdio \
+      -kernel u-boot
+
+The following example shows booting a Linux kernel via DHCP, using the
+rootfs on an SD card.
This requires some additional command line parameters
+for QEMU:
+
+.. code-block:: none
+
+  -nic user,tftp=/path/to/kernel/zImage \
+  -drive file=sdcard.img,id=rootfs -device sd-card,drive=rootfs
+
+The directory for the built-in TFTP server should also contain the device tree
+blob of the SABRE Lite board. The sample SD card image was populated with the
+root file system with a single partition. You may adjust the kernel "root="
+boot parameter accordingly.
+
+After U-Boot boots, type the following commands in the U-Boot command shell to
+boot the Linux kernel:
+
+.. code-block:: none
+
+  => setenv ethaddr 00:11:22:33:44:55
+  => setenv bootfile zImage
+  => dhcp
+  => tftpboot 14000000 imx6q-sabrelite.dtb
+  => setenv bootargs root=/dev/mmcblk3p1
+  => bootz 12000000 - 14000000
diff --git a/docs/system/arm/sbsa.rst b/docs/system/arm/sbsa.rst
new file mode 100644
index 000000000..b499d7e92
--- /dev/null
+++ b/docs/system/arm/sbsa.rst
@@ -0,0 +1,32 @@
+Arm Server Base System Architecture Reference board (``sbsa-ref``)
+==================================================================
+
+While the ``virt`` board is a generic board platform that doesn't match
+any real hardware the ``sbsa-ref`` board intends to look like real
+hardware. The `Server Base System Architecture
+<https://developer.arm.com/documentation/den0029/latest>`_ defines a
+minimum base line of hardware support and importantly how the firmware
+reports that to any operating system. It is a static system that
+reports a very minimal DT to the firmware for non-discoverable
+information about components affected by the QEMU command line (i.e.
+cpus and memory). As a result it must have a firmware specifically
+built to expect a certain hardware layout (as you would in a real
+machine).
+
+It is intended to be a machine for developing firmware and testing
+standards compliance with operating systems.
+
+Supported devices
+"""""""""""""""""
+
+The sbsa-ref board supports:
+
+  - A configurable number of AArch64 CPUs
+  - GIC version 3
+  - System bus AHCI controller
+  - System bus EHCI controller
+  - CDROM and hard disc on AHCI bus
+  - E1000E ethernet card on PCIe bus
+  - VGA display adaptor on PCIe bus
+  - A generic SBSA watchdog device
+
diff --git a/docs/system/arm/stellaris.rst b/docs/system/arm/stellaris.rst
new file mode 100644
index 000000000..8af4ad79c
--- /dev/null
+++ b/docs/system/arm/stellaris.rst
@@ -0,0 +1,26 @@
+Stellaris boards (``lm3s6965evb``, ``lm3s811evb``)
+==================================================
+
+The Luminary Micro Stellaris LM3S811EVB emulation includes the following
+devices:
+
+- Cortex-M3 CPU core.
+
+- 64k Flash and 8k SRAM.
+
+- Timers, UARTs, ADC and |I2C| interface.
+
+- OSRAM Pictiva 96x16 OLED with SSD0303 controller on
+  |I2C| bus.
+
+The Luminary Micro Stellaris LM3S6965EVB emulation includes the
+following devices:
+
+- Cortex-M3 CPU core.
+
+- 256k Flash and 64k SRAM.
+
+- Timers, UARTs, ADC, |I2C| and SSI interfaces.
+
+- OSRAM Pictiva 128x64 OLED with SSD0323 controller connected via
+  SSI.
diff --git a/docs/system/arm/stm32.rst b/docs/system/arm/stm32.rst
new file mode 100644
index 000000000..508b92cf8
--- /dev/null
+++ b/docs/system/arm/stm32.rst
@@ -0,0 +1,66 @@
+STMicroelectronics STM32 boards (``netduino2``, ``netduinoplus2``, ``stm32vldiscovery``)
+========================================================================================
+
+The `STM32`_ chips are a family of 32-bit ARM-based microcontrollers by
+STMicroelectronics.
+
+.. 
_STM32: https://www.st.com/en/microcontrollers-microprocessors/stm32-32-bit-arm-cortex-mcus.html
+
+The STM32F1 series is based on the ARM Cortex-M3 core. The following machines are
+based on this chip:
+
+- ``stm32vldiscovery`` STM32VLDISCOVERY board with STM32F100RBT6 microcontroller
+
+The STM32F2 series is based on the ARM Cortex-M3 core. The following machines are
+based on this chip:
+
+- ``netduino2`` Netduino 2 board with STM32F205RFT6 microcontroller
+
+The STM32F4 series is based on the ARM Cortex-M4F core. This series is pin-to-pin
+compatible with the STM32F2 series. The following machines are based on this chip:
+
+- ``netduinoplus2`` Netduino Plus 2 board with STM32F405RGT6 microcontroller
+
+There are many other STM32 series that are currently not supported by QEMU.
+
+Supported devices
+-----------------
+
+ * ARM Cortex-M3, Cortex M4F
+ * Analog to Digital Converter (ADC)
+ * EXTI interrupt
+ * Serial ports (USART)
+ * SPI controller
+ * System configuration (SYSCFG)
+ * Timer controller (TIMER)
+
+Missing devices
+---------------
+
+ * Camera interface (DCMI)
+ * Controller Area Network (CAN)
+ * Cyclic Redundancy Check (CRC) calculation unit
+ * Digital to Analog Converter (DAC)
+ * DMA controller
+ * Ethernet controller
+ * Flash Interface Unit
+ * GPIO controller
+ * I2C controller
+ * Inter-Integrated Sound (I2S) controller
+ * Power supply configuration (PWR)
+ * Random Number Generator (RNG)
+ * Real-Time Clock (RTC) controller
+ * Reset and Clock Controller (RCC)
+ * Secure Digital Input/Output (SDIO) interface
+ * USB OTG
+ * Watchdog controller (IWDG, WWDG)
+
+Boot options
+------------
+
+The STM32 machines can be started using the ``-kernel`` option to load a
+firmware. Example:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -M stm32vldiscovery -kernel firmware.bin
diff --git a/docs/system/arm/sx1.rst b/docs/system/arm/sx1.rst
new file mode 100644
index 000000000..8bce30d4b
--- /dev/null
+++ b/docs/system/arm/sx1.rst
@@ -0,0 +1,18 @@
+Siemens SX1 (``sx1``, ``sx1-v1``)
+=================================
+
+These machines provide basic emulation of the Siemens SX1 phone,
+models v1 and v2 (the default). The emulation includes the following elements:
+
+- Texas Instruments OMAP310 System-on-chip (ARM925T core)
+
+- ROM and RAM memories (ROM firmware image can be loaded with
+  -pflash): V1 has one 16MB flash and one 8MB flash,
+  V2 has one 32MB flash
+
+- On-chip LCD controller
+
+- On-chip Real Time Clock
+
+- Secure Digital card connected to OMAP MMC/SD host
+
+- Three on-chip UARTs
diff --git a/docs/system/arm/versatile.rst b/docs/system/arm/versatile.rst
new file mode 100644
index 000000000..2ae792bac
--- /dev/null
+++ b/docs/system/arm/versatile.rst
@@ -0,0 +1,63 @@
+Arm Versatile boards (``versatileab``, ``versatilepb``)
+=======================================================
+
+The Arm Versatile baseboard is emulated with the following devices:
+
+- ARM926E, ARM1136 or Cortex-A8 CPU
+
+- PL190 Vectored Interrupt Controller
+
+- Four PL011 UARTs
+
+- SMC 91c111 Ethernet adapter
+
+- PL110 LCD controller
+
+- PL050 KMI with PS/2 keyboard and mouse.
+
+- PCI host bridge. Note the emulated PCI bridge only provides access
+  to PCI memory space. It does not provide access to PCI IO space. This
+  means some devices (eg. ne2k_pci NIC) are not usable, and others (eg.
+  rtl8139 NIC) are only usable when the guest drivers use the memory
+  mapped control registers.
+
+- PCI OHCI USB controller.
+
+- LSI53C895A PCI SCSI Host Bus Adapter with hard disk and CD-ROM
+  devices.
+
+- PL181 MultiMedia Card Interface with SD card.
+
+Booting a Linux kernel
+----------------------
+
+Building a current Linux kernel with ``versatile_defconfig`` should be
+enough to get something running. Nowadays an out-of-tree build is
+recommended (and also useful if you build a lot of different targets).
+In the following example $BLD points to the build directory and $SRC
+points to the root of the Linux source tree. You can drop $SRC if you
+are running from there.
+
+.. code-block:: bash
+
+  $ make O=$BLD -C $SRC ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- versatile_defconfig
+  $ make O=$BLD -C $SRC ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf-
+
+You may want to enable some additional modules if you want to boot
+something from the SCSI interface::
+
+  CONFIG_PCI=y
+  CONFIG_PCI_VERSATILE=y
+  CONFIG_SCSI=y
+  CONFIG_SCSI_SYM53C8XX_2=y
+
+You can then boot with a command line like:
+
+.. code-block:: bash
+
+  $ qemu-system-arm -machine type=versatilepb \
+      -serial mon:stdio \
+      -drive if=scsi,driver=file,filename=debian-buster-armel-rootfs.ext4 \
+      -kernel zImage \
+      -dtb versatile-pb.dtb \
+      -append "console=ttyAMA0 ro root=/dev/sda"
diff --git a/docs/system/arm/vexpress.rst b/docs/system/arm/vexpress.rst
new file mode 100644
index 000000000..3e3839e92
--- /dev/null
+++ b/docs/system/arm/vexpress.rst
@@ -0,0 +1,88 @@
+Arm Versatile Express boards (``vexpress-a9``, ``vexpress-a15``)
+================================================================
+
+QEMU models two variants of the Arm Versatile Express development
+board family:
+
+- ``vexpress-a9`` models the combination of the Versatile Express
+  motherboard and the CoreTile Express A9x4 daughterboard
+- ``vexpress-a15`` models the combination of the Versatile Express
+  motherboard and the CoreTile Express A15x2 daughterboard
+
+Note that as this hardware does not have PCI, IDE or SCSI,
+the only available storage option is an emulated SD card.
+
+Implemented devices:
+
+- PL041 audio
+- PL181 SD controller
+- PL050 keyboard and mouse
+- PL011 UARTs
+- SP804 timers
+- I2C controller
+- PL031 RTC
+- PL111 LCD display controller
+- Flash memory
+- LAN9118 ethernet
+
+Unimplemented devices:
+
+- SP810 system control block
+- PCI-express
+- USB controller (Philips ISP1761)
+- Local DAP ROM
+- CoreSight interfaces
+- PL301 AXI interconnect
+- SCC
+- System counter
+- HDLCD controller (``vexpress-a15``)
+- SP805 watchdog
+- PL341 dynamic memory controller
+- DMA330 DMA controller
+- PL354 static memory controller
+- BP147 TrustZone Protection Controller
+- TrustZone Address Space Controller
+
+Other differences between the hardware and the QEMU model:
+
+- QEMU will default to creating one CPU unless you pass a different
+  ``-smp`` argument
+- QEMU allows the amount of RAM provided to be specified with the
+  ``-m`` argument
+- QEMU defaults to providing a CPU which does not provide either
+  TrustZone or the Virtualization Extensions: if you want these you
+  must enable them with ``-machine secure=on`` and ``-machine
+  virtualization=on``
+- QEMU provides 4 virtio-mmio virtio transports; these start at
+  address ``0x10013000`` for ``vexpress-a9`` and at ``0x1c130000`` for
+  ``vexpress-a15``, and have IRQs from 40 upwards. If a dtb is
+  provided on the command line then QEMU will edit it to include
+  suitable entries describing these transports for the guest.
+
+Booting a Linux kernel
+----------------------
+
+Building a current Linux kernel with ``multi_v7_defconfig`` should be
+enough to get something running. 
Nowadays an out-of-tree build is
+recommended (and also useful if you build a lot of different targets).
+In the following example $BLD points to the build directory and $SRC
+points to the root of the Linux source tree. You can drop $SRC if you
+are running from there.
+
+.. code-block:: bash
+
+  $ make O=$BLD -C $SRC ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- multi_v7_defconfig
+  $ make O=$BLD -C $SRC ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf-
+
+By default you will want to boot your rootfs off the SD card interface.
+Your rootfs will need to be padded to the right size. With a suitable
+DTB you could also add devices to the virtio-mmio bus.
+
+.. code-block:: bash
+
+  $ qemu-system-arm -cpu cortex-a15 -smp 4 -m 4096 \
+      -machine type=vexpress-a15 -serial mon:stdio \
+      -drive if=sd,driver=file,filename=armel-rootfs.ext4 \
+      -kernel zImage \
+      -dtb vexpress-v2p-ca15-tc1.dtb \
+      -append "console=ttyAMA0 root=/dev/mmcblk0 ro"
diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
new file mode 100644
index 000000000..850787495
--- /dev/null
+++ b/docs/system/arm/virt.rst
@@ -0,0 +1,168 @@
+'virt' generic virtual platform (``virt``)
+==========================================
+
+The ``virt`` board is a platform which does not correspond to any
+real hardware; it is designed for use in virtual machines.
+It is the recommended board type if you simply want to run
+a guest such as Linux and do not care about reproducing the
+idiosyncrasies and limitations of a particular bit of real-world
+hardware.
+
+This is a "versioned" board model, so as well as the ``virt`` machine
+type itself (which may have improvements, bugfixes and other minor
+changes between QEMU versions) a version is provided that guarantees
+to have the same behaviour as that of previous QEMU releases, so
+that VM migration will work between QEMU versions. For instance the
+``virt-5.0`` machine type will behave like the ``virt`` machine from
+the QEMU 5.0 release, and migration should work between ``virt-5.0``
+of the 5.0 release and ``virt-5.0`` of the 5.1 release. Migration
+is not guaranteed to work between different QEMU releases for
+the non-versioned ``virt`` machine type.
+
+Supported devices
+"""""""""""""""""
+
+The virt board supports:
+
+- PCI/PCIe devices
+- Flash memory
+- One PL011 UART
+- An RTC
+- The fw_cfg device that allows a guest to obtain data from QEMU
+- A PL061 GPIO controller
+- An optional SMMUv3 IOMMU
+- hotpluggable DIMMs
+- hotpluggable NVDIMMs
+- An MSI controller (GICv2M or ITS). GICv2M is selected by default along
+  with GICv2. ITS is selected by default with GICv3 (>= virt-2.7). Note
+  that ITS is not modeled in TCG mode.
+- 32 virtio-mmio transport devices
+- running guests using the KVM accelerator on aarch64 hardware
+- large amounts of RAM (at least 255GB, and more if using highmem)
+- many CPUs (up to 512 if using a GICv3 and highmem)
+- Secure-World-only devices if the CPU has TrustZone:
+
+  - A second PL011 UART
+  - A second PL061 GPIO controller, with GPIO lines for triggering
+    a system reset or system poweroff
+  - A secure flash memory
+  - 16MB of secure RAM
+
+Supported guest CPU types:
+
+- ``cortex-a7`` (32-bit)
+- ``cortex-a15`` (32-bit; the default)
+- ``cortex-a53`` (64-bit)
+- ``cortex-a57`` (64-bit)
+- ``cortex-a72`` (64-bit)
+- ``a64fx`` (64-bit)
+- ``host`` (with KVM only)
+- ``max`` (same as ``host`` for KVM; best possible emulation with TCG)
+
+Note that the default is ``cortex-a15``, so for an AArch64 guest you must
+specify a CPU type.
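+
+For example (a minimal sketch; the kernel image is a placeholder and
+storage options are omitted), an AArch64 guest could be started with:
+
+.. code-block:: bash
+
+  # Image is an assumed name for an AArch64 kernel image
+  $ qemu-system-aarch64 -M virt -cpu cortex-a57 -m 1G -nographic \
+      -kernel Image -append "console=ttyAMA0"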
+
+Graphics output is available, but unlike the x86 PC machine types
+there is no default display device enabled: you should select one from
+the Display devices section of "-device help". The recommended option
+is ``virtio-gpu-pci``; this is the only one which will work correctly
+with KVM. You may also need to ensure your guest kernel is configured
+with support for this; see below.
+
+Machine-specific options
+""""""""""""""""""""""""
+
+The following machine-specific options are supported:
+
+secure
+  Set ``on``/``off`` to enable/disable emulating a guest CPU which implements the
+  Arm Security Extensions (TrustZone). The default is ``off``.
+
+virtualization
+  Set ``on``/``off`` to enable/disable emulating a guest CPU which implements the
+  Arm Virtualization Extensions. The default is ``off``.
+
+mte
+  Set ``on``/``off`` to enable/disable emulating a guest CPU which implements the
+  Arm Memory Tagging Extensions. The default is ``off``.
+
+highmem
+  Set ``on``/``off`` to enable/disable placing devices and RAM in physical
+  address space above 32 bits. The default is ``on`` for machine types
+  later than ``virt-2.12``.
+
+gic-version
+  Specify the version of the Generic Interrupt Controller (GIC) to provide.
+  Valid values are:
+
+  ``2``
+    GICv2
+  ``3``
+    GICv3
+  ``host``
+    Use the same GIC version the host provides, when using KVM
+  ``max``
+    Use the best GIC version possible (same as host when using KVM;
+    currently the same as ``3`` for TCG, but this may change in future)
+
+its
+  Set ``on``/``off`` to enable/disable ITS instantiation. The default is ``on``
+  for machine types later than ``virt-2.7``.
+
+iommu
+  Set the IOMMU type to create for the guest. Valid values are:
+
+  ``none``
+    Don't create an IOMMU (the default)
+  ``smmuv3``
+    Create an SMMUv3
+
+ras
+  Set ``on``/``off`` to enable/disable reporting host memory errors to a guest
+  using ACPI and guest external abort exceptions. The default is ``off``.
+
+Linux guest kernel configuration
+""""""""""""""""""""""""""""""""
+
+The 'defconfig' for Linux arm and arm64 kernels should include the
+right device drivers for virtio and the PCI controller; however some older
+kernel versions, especially for 32-bit Arm, did not have everything
+enabled by default. If you're not seeing PCI devices that you expect,
+then check that your guest config has::
+
+  CONFIG_PCI=y
+  CONFIG_VIRTIO_PCI=y
+  CONFIG_PCI_HOST_GENERIC=y
+
+If you want to use the ``virtio-gpu-pci`` graphics device you will also
+need::
+
+  CONFIG_DRM=y
+  CONFIG_DRM_VIRTIO_GPU=y
+
+Hardware configuration information for bare-metal programming
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+The ``virt`` board automatically generates a device tree blob ("dtb")
+which it passes to the guest. This provides information about the
+addresses, interrupt lines and other configuration of the various devices
+in the system. Guest code can rely on and hard-code the following
+addresses:
+
+- Flash memory starts at address 0x0000_0000
+
+- RAM starts at 0x4000_0000
+
+All other information about device locations may change between
+QEMU versions, so guest code must look in the DTB.
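+
+One way to see exactly what QEMU will pass to the guest is to ask the
+board to write out its generated DTB and decompile it with ``dtc``; a
+sketch (the output filenames are arbitrary):
+
+.. code-block:: bash
+
+  $ qemu-system-aarch64 -machine virt,dumpdtb=virt.dtb -cpu cortex-a57
+  $ dtc -I dtb -O dts virt.dtb > virt.dts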
+ +QEMU supports two types of guest image boot for ``virt``, and +the way for the guest code to locate the dtb binary differs: + +- For guests using the Linux kernel boot protocol (this means any + non-ELF file passed to the QEMU ``-kernel`` option) the address + of the DTB is passed in a register (``r2`` for 32-bit guests, + or ``x0`` for 64-bit guests) + +- For guests booting as "bare-metal" (any other kind of boot), + the DTB is at the start of RAM (0x4000_0000) diff --git a/docs/system/arm/xlnx-versal-virt.rst b/docs/system/arm/xlnx-versal-virt.rst new file mode 100644 index 000000000..92ad10d2d --- /dev/null +++ b/docs/system/arm/xlnx-versal-virt.rst @@ -0,0 +1,226 @@ +Xilinx Versal Virt (``xlnx-versal-virt``) +========================================= + +Xilinx Versal is a family of heterogeneous multi-core SoCs +(System on Chip) that combine traditional hardened CPUs and I/O +peripherals in a Processing System (PS) with runtime programmable +FPGA logic (PL) and an Artificial Intelligence Engine (AIE). + +More details here: +https://www.xilinx.com/products/silicon-devices/acap/versal.html + +The family of Versal SoCs share a single architecture but come in +different parts with different speed grades, amounts of PL and +other differences. + +The Xilinx Versal Virt board in QEMU is a model of a virtual board +(does not exist in reality) with a virtual Versal SoC without I/O +limitations. Currently, we support the following cores and devices: + +Implemented CPU cores: + +- 2 ACPUs (ARM Cortex-A72) + +Implemented devices: + +- Interrupt controller (ARM GICv3) +- 2 UARTs (ARM PL011) +- An RTC (Versal built-in) +- 2 GEMs (Cadence MACB Ethernet MACs) +- 8 ADMA (Xilinx zDMA) channels +- 2 SD Controllers +- OCM (256KB of On Chip Memory) +- XRAM (4MB of on chip Accelerator RAM) +- DDR memory +- BBRAM (36 bytes of Battery-backed RAM) +- eFUSE (3072 bytes of one-time field-programmable bit array) + +QEMU does not yet model any other devices, including the PL and the AI Engine. + +Other differences between the hardware and the QEMU model: + +- QEMU allows the amount of DDR memory provided to be specified with the + ``-m`` argument. If a DTB is provided on the command line then QEMU will + edit it to include suitable entries describing the Versal DDR memory ranges. + +- QEMU provides 8 virtio-mmio virtio transports; these start at + address ``0xa0000000`` and have IRQs from 111 and upwards. + +Running +""""""" +If the user provides an Operating System to be loaded, we expect users +to use the ``-kernel`` command line option. + +Users can load firmware or boot-loaders with the ``-device loader`` options. + +When loading an OS, QEMU generates a DTB and selects an appropriate address +where it gets loaded. This DTB will be passed to the kernel in register x0. + +If there's no ``-kernel`` option, we generate a DTB and place it at 0x1000 +for boot-loaders or firmware to pick it up. + +If users want to provide their own DTB, they can use the ``-dtb`` option. +These DTBs will have their memory nodes modified to match QEMU's +selected ram_size option before they get passed to the kernel or FW. + +When loading an OS, we turn on QEMU's PSCI implementation with SMC +as the PSCI conduit. When there's no ``-kernel`` option, we assume the user +provides EL3 firmware to handle PSCI. + +A few examples: + +Direct Linux boot of a generic ARM64 upstream Linux kernel: + +.. 
code-block:: bash + + $ qemu-system-aarch64 -M xlnx-versal-virt -m 2G \ + -serial mon:stdio -display none \ + -kernel arch/arm64/boot/Image \ + -nic user -nic user \ + -device virtio-rng-device,bus=virtio-mmio-bus.0 \ + -drive if=none,index=0,file=hd0.qcow2,id=hd0,snapshot \ + -drive file=qemu_sd.qcow2,if=sd,index=0,snapshot \ + -device virtio-blk-device,drive=hd0 -append root=/dev/vda + +Direct Linux boot of PetaLinux 2019.2: + +.. code-block:: bash + + $ qemu-system-aarch64 -M xlnx-versal-virt -m 2G \ + -serial mon:stdio -display none \ + -kernel petalinux-v2019.2/Image \ + -append "rdinit=/sbin/init console=ttyAMA0,115200n8 earlycon=pl011,mmio,0xFF000000,115200n8" \ + -net nic,model=cadence_gem,netdev=net0 -netdev user,id=net0 \ + -device virtio-rng-device,bus=virtio-mmio-bus.0,rng=rng0 \ + -object rng-random,filename=/dev/urandom,id=rng0 + +Boot PetaLinux 2019.2 via ARM Trusted Firmware (2018.3 because the 2019.2 +version of ATF tries to configure the CCI which we don't model) and U-boot: + +.. code-block:: bash + + $ qemu-system-aarch64 -M xlnx-versal-virt -m 2G \ + -serial stdio -display none \ + -device loader,file=petalinux-v2018.3/bl31.elf,cpu-num=0 \ + -device loader,file=petalinux-v2019.2/u-boot.elf \ + -device loader,addr=0x20000000,file=petalinux-v2019.2/Image \ + -nic user -nic user \ + -device virtio-rng-device,bus=virtio-mmio-bus.0,rng=rng0 \ + -object rng-random,filename=/dev/urandom,id=rng0 + +Run the following at the U-Boot prompt: + +.. code-block:: bash + + Versal> + fdt addr $fdtcontroladdr + fdt move $fdtcontroladdr 0x40000000 + fdt set /timer clock-frequency <0x3dfd240> + setenv bootargs "rdinit=/sbin/init maxcpus=1 console=ttyAMA0,115200n8 earlycon=pl011,mmio,0xFF000000,115200n8" + booti 20000000 - 40000000 + fdt addr $fdtcontroladdr + +Boot Linux as DOM0 on Xen via U-Boot: + +.. code-block:: bash + + $ qemu-system-aarch64 -M xlnx-versal-virt -m 4G \ + -serial stdio -display none \ + -device loader,file=petalinux-v2019.2/u-boot.elf,cpu-num=0 \ + -device loader,addr=0x30000000,file=linux/2018-04-24/xen \ + -device loader,addr=0x40000000,file=petalinux-v2019.2/Image \ + -nic user -nic user \ + -device virtio-rng-device,bus=virtio-mmio-bus.0,rng=rng0 \ + -object rng-random,filename=/dev/urandom,id=rng0 + +Run the following at the U-Boot prompt: + +.. code-block:: bash + + Versal> + fdt addr $fdtcontroladdr + fdt move $fdtcontroladdr 0x20000000 + fdt set /timer clock-frequency <0x3dfd240> + fdt set /chosen xen,xen-bootargs "console=dtuart dtuart=/uart@ff000000 dom0_mem=640M bootscrub=0 maxcpus=1 timer_slop=0" + fdt set /chosen xen,dom0-bootargs "rdinit=/sbin/init clk_ignore_unused console=hvc0 maxcpus=1" + fdt mknode /chosen dom0 + fdt set /chosen/dom0 compatible "xen,multiboot-module" + fdt set /chosen/dom0 reg <0x00000000 0x40000000 0x0 0x03100000> + booti 30000000 - 20000000 + +Boot Linux as Dom0 on Xen via ARM Trusted Firmware and U-Boot: + +.. code-block:: bash + + $ qemu-system-aarch64 -M xlnx-versal-virt -m 4G \ + -serial stdio -display none \ + -device loader,file=petalinux-v2018.3/bl31.elf,cpu-num=0 \ + -device loader,file=petalinux-v2019.2/u-boot.elf \ + -device loader,addr=0x30000000,file=linux/2018-04-24/xen \ + -device loader,addr=0x40000000,file=petalinux-v2019.2/Image \ + -nic user -nic user \ + -device virtio-rng-device,bus=virtio-mmio-bus.0,rng=rng0 \ + -object rng-random,filename=/dev/urandom,id=rng0 + +Run the following at the U-Boot prompt: + +.. 
code-block:: bash + + Versal> + fdt addr $fdtcontroladdr + fdt move $fdtcontroladdr 0x20000000 + fdt set /timer clock-frequency <0x3dfd240> + fdt set /chosen xen,xen-bootargs "console=dtuart dtuart=/uart@ff000000 dom0_mem=640M bootscrub=0 maxcpus=1 timer_slop=0" + fdt set /chosen xen,dom0-bootargs "rdinit=/sbin/init clk_ignore_unused console=hvc0 maxcpus=1" + fdt mknode /chosen dom0 + fdt set /chosen/dom0 compatible "xen,multiboot-module" + fdt set /chosen/dom0 reg <0x00000000 0x40000000 0x0 0x03100000> + booti 30000000 - 20000000 + +BBRAM File Backend +"""""""""""""""""" +BBRAM can have an optional file backend, which must be a seekable +binary file with a size of 36 bytes or larger. A file with all +binary 0s is a 'blank'. + +To add a file-backend for the BBRAM: + +.. code-block:: bash + + -drive if=pflash,index=0,file=versal-bbram.bin,format=raw + +To use a different index value, N, from default of 0, add: + +.. code-block:: bash + + -global xlnx,bbram-ctrl.drive-index=N + +eFUSE File Backend +"""""""""""""""""" +eFUSE can have an optional file backend, which must be a seekable +binary file with a size of 3072 bytes or larger. A file with all +binary 0s is a 'blank'. + +To add a file-backend for the eFUSE: + +.. code-block:: bash + + -drive if=pflash,index=1,file=versal-efuse.bin,format=raw + +To use a different index value, N, from default of 1, add: + +.. code-block:: bash + + -global xlnx,efuse.drive-index=N + +.. warning:: + In actual physical Versal, BBRAM and eFUSE contain sensitive data. + The QEMU device models do **not** encrypt nor obfuscate any data + when holding them in models' memory or when writing them to their + file backends. + + Thus, a file backend should be used with caution, and 'format=luks' + is highly recommended (albeit with usage complexity). + + Better yet, do not use actual product data when running guest image + on this Xilinx Versal Virt board. diff --git a/docs/system/arm/xscale.rst b/docs/system/arm/xscale.rst new file mode 100644 index 000000000..d2d5949e1 --- /dev/null +++ b/docs/system/arm/xscale.rst @@ -0,0 +1,35 @@ +Sharp XScale-based PDA models (``akita``, ``borzoi``, ``spitz``, ``terrier``, ``tosa``) +======================================================================================= + +The Sharp Zaurus are PDAs based on XScale, able to run Linux ('SL series'). + +The SL-6000 (\"Tosa\"), released in 2005, uses a PXA255 System-on-chip. + +The SL-C3000 (\"Spitz\"), SL-C1000 (\"Akita\"), SL-C3100 (\"Borzoi\") and +SL-C3200 (\"Terrier\") use a PXA270. + +The clamshell PDA models emulation includes the following peripherals: + +- Intel PXA255/PXA270 System-on-chip (ARMv5TE core) + +- NAND Flash memory - not in \"Tosa\" + +- IBM/Hitachi DSCM microdrive in a PXA PCMCIA slot - not in \"Akita\" + +- On-chip OHCI USB controller - not in \"Tosa\" + +- On-chip LCD controller + +- On-chip Real Time Clock + +- TI ADS7846 touchscreen controller on SSP bus + +- Maxim MAX1111 analog-digital converter on |I2C| bus + +- GPIO-connected keyboard controller and LEDs + +- Secure Digital card connected to PXA MMC/SD host + +- Three on-chip UARTs + +- WM8750 audio CODEC on |I2C| and |I2S| busses diff --git a/docs/system/authz.rst b/docs/system/authz.rst new file mode 100644 index 000000000..55b7315e4 --- /dev/null +++ b/docs/system/authz.rst @@ -0,0 +1,257 @@ +.. 
_client authorization:
+
+Client authorization
+--------------------
+
+When configuring a QEMU network backend with either TLS certificates or SASL
+authentication, access will be granted if the client successfully proves
+their identity. If the authorization identity database is scoped to the QEMU
+client this may be sufficient. It is common, however, for the identity database
+to be much broader and thus authentication alone does not enable sufficient
+access control. In this case QEMU provides a flexible system for enforcing
+finer grained authorization on clients post-authentication.
+
+Identity providers
+~~~~~~~~~~~~~~~~~~
+
+At the time of writing there are two authentication frameworks used by QEMU
+that emit an identity upon completion.
+
+ * TLS x509 certificate distinguished name.
+
+   When configuring the QEMU backend as a network server with TLS, there
+   is a choice of credentials to use. The most common scenario is to utilize
+   x509 certificates. The simplest configuration only involves issuing
+   certificates to the servers, allowing the client to avoid a MITM attack
+   against their intended server.
+
+   It is possible, however, to enable mutual verification by requiring that
+   the client provide a certificate to the server to prove its own identity.
+   This is done by setting the property ``verify-peer=yes`` on the
+   ``tls-creds-x509`` object, which is in fact the default.
+
+   When peer verification is enabled, the client will need to be issued with
+   a certificate by the same certificate authority as the server. If this is
+   still not sufficiently strong access control, the Distinguished Name of
+   the certificate can be used as an identity in the QEMU authorization
+   framework.
+
+ * SASL username.
+
+   When configuring the QEMU backend as a network server with SASL, upon
+   completion of the SASL authentication mechanism, a username will be
+   provided. The format of this username will vary depending on the choice
+   of mechanism configured for SASL. It might be a simple UNIX style user
+   ``joebloggs``, while if using Kerberos/GSSAPI it can have a realm
+   attached ``joebloggs@QEMU.ORG``. Whatever format the username is presented
+   in, it can be used with the QEMU authorization framework.
+
+Authorization drivers
+~~~~~~~~~~~~~~~~~~~~~
+
+The QEMU authorization framework is a general purpose design with a choice
+of user-customizable drivers. These are provided as objects that can be
+created at startup using the ``-object`` argument, or at runtime using the
+``object_add`` monitor command.
+
+Simple
+^^^^^^
+
+This authorization driver provides a simple mechanism for granting access
+based on an exact match against a single identity. This is useful when it is
+known that only a single client is to be allowed access.
+
+A possible use case would be when configuring QEMU for an incoming live
+migration. It is known exactly which source QEMU the migration is expected
+to arrive from. The x509 certificate associated with this source QEMU would
+thus be used as the identity to match against. Alternatively if the virtual
+machine is dedicated to a specific tenant, then the VNC server would be
+configured with SASL and the username of only that tenant listed.
+
+To create an instance of this driver via QMP:
+
+::
+
+   {
+     "execute": "object-add",
+     "arguments": {
+       "qom-type": "authz-simple",
+       "id": "authz0",
+       "identity": "fred"
+     }
+   }
+
+
+Or via the command line
+
+::
+
+   -object authz-simple,id=authz0,identity=fred
+
+
+List
+^^^^
+
+In some network backends it will be desirable to grant access to a range of
+clients. This authorization driver provides a list mechanism for granting
+access by matching identities against a list of permitted ones. Each match
+rule has an associated policy and a catch-all policy applies if no rule
+matches. The match can either be done as an exact string comparison, or can
+use the shell-like glob syntax, which allows for use of wildcards.
+
+To create an instance of this class via QMP:
+
+::
+
+   {
+     "execute": "object-add",
+     "arguments": {
+       "qom-type": "authz-list",
+       "id": "authz0",
+       "rules": [
+          { "match": "fred", "policy": "allow", "format": "exact" },
+          { "match": "bob", "policy": "allow", "format": "exact" },
+          { "match": "danb", "policy": "deny", "format": "exact" },
+          { "match": "dan*", "policy": "allow", "format": "glob" }
+       ],
+       "policy": "deny"
+     }
+   }
+
+
+Due to the way this driver requires setting nested properties, creating
+it on the command line will require use of the JSON syntax for ``-object``.
+In most cases, however, the next driver will be more suitable.
+
+List file
+^^^^^^^^^
+
+This is a variant on the previous driver that allows for a more dynamic
+access control policy by storing the match rules in a standalone file
+that can be reloaded automatically upon change.
+
+To create an instance of this class via QMP:
+
+::
+
+   {
+     "execute": "object-add",
+     "arguments": {
+       "qom-type": "authz-list-file",
+       "id": "authz0",
+       "filename": "/etc/qemu/myvm-vnc.acl",
+       "refresh": true
+     }
+   }
+
+
+If ``refresh`` is enabled, inotify is used to monitor for changes
+to the file and auto-reload the rules.
+
+The ``myvm-vnc.acl`` file should contain the match rules in a format that
+closely matches the previous driver:
+
+::
+
+   {
+     "rules": [
+        { "match": "fred", "policy": "allow", "format": "exact" },
+        { "match": "bob", "policy": "allow", "format": "exact" },
+        { "match": "danb", "policy": "deny", "format": "exact" },
+        { "match": "dan*", "policy": "allow", "format": "glob" }
+     ],
+     "policy": "deny"
+   }
+
+
+The object can be created on the command line using
+
+::
+
+   -object authz-list-file,id=authz0,\
+           filename=/etc/qemu/myvm-vnc.acl,refresh=on
+
+
+PAM
+^^^
+
+In some scenarios it might be desirable to integrate with authorization
+mechanisms that are implemented outside of QEMU. In order to allow maximum
+flexibility, QEMU provides a driver that uses the ``PAM`` framework.
+
+To create an instance of this class via QMP:
+
+::
+
+   {
+     "execute": "object-add",
+     "arguments": {
+       "qom-type": "authz-pam",
+       "id": "authz0",
+       "parameters": {
+         "service": "qemu-vnc-tls"
+       }
+     }
+   }
+
+
+The driver only uses the PAM "account" verification
+subsystem. The above config would require a config
+file ``/etc/pam.d/qemu-vnc-tls``. For a simple file
+lookup it would contain
+
+::
+
+   account requisite pam_listfile.so item=user sense=allow \
+       file=/etc/qemu/vnc.allow
+
+
+The external file would then contain a list of usernames.
+
+If an x509 certificate was being used as the username, a suitable
+entry would match the distinguished name:
+
+::
+
+   CN=laptop.berrange.com,O=Berrange Home,L=London,ST=London,C=GB
+
+
+On the command line it can be created using
+
+::
+
+   -object authz-pam,id=authz0,service=qemu-vnc-tls
+
+
+There are a variety of PAM plugins that can be used which are not illustrated
+here, and it is possible to implement brand new plugins using the PAM API.
+
+
+Connecting backends
+~~~~~~~~~~~~~~~~~~~
+
+The authorization driver is created using the ``-object`` argument and then
+needs to be associated with a network service. The authorization driver object
+will be given a unique ID that needs to be referenced.
+
+The property to set in the network service will vary depending on the type of
+identity to verify. By convention, any network server backend that uses TLS
+will provide a ``tls-authz`` property, while any server using SASL will provide
+a ``sasl-authz`` property.
+
+Thus an example using SASL and authorization for the VNC server would look
+like:
+
+::
+
+   $QEMU --object authz-simple,id=authz0,identity=fred \
+         --vnc 0.0.0.0:1,sasl,sasl-authz=authz0
+
+While to validate both the x509 certificate and SASL username:
+
+::
+
+   echo "CN=laptop.qemu.org,O=QEMU Project,L=London,ST=London,C=GB" >> tls.acl
+   $QEMU --object authz-simple,id=authz0,identity=fred \
+         --object authz-list-file,id=authz1,filename=tls.acl \
+         --object tls-creds-x509,id=tls0,dir=/etc/qemu/tls,verify-peer=yes \
+         --vnc 0.0.0.0:1,sasl,sasl-authz=authz0,tls-creds=tls0,tls-authz=authz1
diff --git a/docs/system/barrier.rst b/docs/system/barrier.rst
new file mode 100644
index 000000000..155d7d290
--- /dev/null
+++ b/docs/system/barrier.rst
@@ -0,0 +1,44 @@
+QEMU Barrier Client
+===================
+
+Generally, mouse and keyboard are grabbed through the QEMU video
+interface emulation.
+
+But when we want to use a video graphic adapter via PCI passthrough
+there is no way to provide keyboard and mouse input to the VM
+except by plugging a second keyboard and mouse into the host
+or by installing KVM software in the guest OS.
+
+The QEMU Barrier client avoids this by implementing the Barrier
+protocol directly in QEMU.
+
+`Barrier <https://github.com/debauchee/barrier>`__
+is a KVM (Keyboard-Video-Mouse) software forked from Symless's
+synergy 1.9 codebase.
+
+This protocol is enabled by adding an input-barrier object to QEMU.
+
+Syntax::
+
+    input-barrier,id=<object-id>,name=<guest display name>
+    [,server=<barrier server address>][,port=<barrier server port>]
+    [,x-origin=<x-origin>][,y-origin=<y-origin>]
+    [,width=<width>][,height=<height>]
+
+The object can be added on the QEMU command line, for instance with::
+
+    -object input-barrier,id=barrier0,name=VM-1
+
+where VM-1 is the name of the display configured in the Barrier server
+on the host providing the mouse and the keyboard events.
+
+By default ``<barrier server address>`` is ``localhost``,
+``<port>`` is ``24800``, ``<x-origin>`` and ``<y-origin>`` are set to ``0``,
+and ``<width>`` and ``<height>`` to ``1920`` and ``1080``.
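+
+For instance, to reach a Barrier server running on another machine
+(the address and port below are placeholders for your own server's)::
+
+    -object input-barrier,id=barrier0,name=VM-1,server=192.168.0.10,port=24800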
+
+If the Barrier server is stopped, QEMU needs to be reconnected manually,
+by removing and re-adding the input-barrier object, for instance
+with the help of the HMP monitor::
+
+    (qemu) object_del barrier0
+    (qemu) object_add input-barrier,id=barrier0,name=VM-1
diff --git a/docs/system/bootindex.rst b/docs/system/bootindex.rst
new file mode 100644
index 000000000..8b057f812
--- /dev/null
+++ b/docs/system/bootindex.rst
@@ -0,0 +1,76 @@
+Managing device boot order with bootindex properties
+====================================================
+
+QEMU can tell QEMU-aware guest firmware (like the x86 PC BIOS) in
+which order it should look for a bootable OS on which devices.
+A simple way to set this order is to use the ``-boot order=`` option,
+but you can also do this more flexibly, by setting a ``bootindex``
+property on the individual block or net devices you specify
+on the QEMU command line.
+
+The ``bootindex`` properties are used to determine the order in which
+firmware will consider devices for booting the guest OS. If the
+``bootindex`` property is not set for a device, it gets the lowest
+boot priority. There is no particular order in which devices with no
+``bootindex`` property set will be considered for booting, but they
+will still be bootable.
+
+Some guest machine types (for instance the s390x machines) do
+not support ``-boot order=``; on those machines you must always
+use ``bootindex`` properties.
+
+There is no way to set a ``bootindex`` property if you are using
+a short-form option like ``-hda`` or ``-cdrom``, so to use
+``bootindex`` properties you will need to expand out those options
+into long-form ``-drive`` and ``-device`` option pairs.
+
+Example
+-------
+
+Let's assume we have a QEMU machine with two NICs (virtio, e1000) and two
+disks (IDE, virtio):
+
+.. parsed-literal::
+
+  |qemu_system| -drive file=disk1.img,if=none,id=disk1 \\
+                -device ide-hd,drive=disk1,bootindex=4 \\
+                -drive file=disk2.img,if=none,id=disk2 \\
+                -device virtio-blk-pci,drive=disk2,bootindex=3 \\
+                -netdev type=user,id=net0 \\
+                -device virtio-net-pci,netdev=net0,bootindex=2 \\
+                -netdev type=user,id=net1 \\
+                -device e1000,netdev=net1,bootindex=1
+
+Given the command above, firmware should try to boot from the e1000 NIC
+first. If this fails, it should try the virtio NIC next; if this fails
+too, it should try the virtio disk, and then the IDE disk.
+
+Limitations
+-----------
+
+Some firmware has limitations on which devices can be considered for
+booting. For instance, the PC BIOS boot specification allows only one
+disk to be bootable. If boot from disk fails for some reason, the BIOS
+won't retry booting from another disk. It can still try to boot from
+floppy or net, though.
+
+Sometimes, firmware cannot map the device path QEMU wants firmware to
+boot from to a boot method. It doesn't happen for devices the firmware
+can natively boot from, but if firmware relies on an option ROM for
+booting, and the same option ROM is used for booting from more than one
+device, the firmware may not be able to ask the option ROM to boot from
+a particular device reliably. For instance with the PC BIOS, if a SCSI HBA
+has three bootable devices target1, target3, target5 connected to it,
+the option ROM will have a boot method for each of them, but it is not
+possible to map from boot method back to a specific target. This is a
+shortcoming of the PC BIOS boot specification.
+ +Mixing bootindex and boot order parameters +------------------------------------------ + +Note that it does not make sense to use the bootindex property together +with the ``-boot order=...`` (or ``-boot once=...``) parameter. The guest +firmware implementations normally either support the one or the other, +but not both parameters at the same time. Mixing them will result in +undefined behavior, and thus the guest firmware will likely not boot +from the expected devices. diff --git a/docs/system/cpu-hotplug.rst b/docs/system/cpu-hotplug.rst new file mode 100644 index 000000000..015ce2b6e --- /dev/null +++ b/docs/system/cpu-hotplug.rst @@ -0,0 +1,142 @@ +=================== +Virtual CPU hotplug +=================== + +A complete example of vCPU hotplug (and hot-unplug) using QMP +``device_add`` and ``device_del``. + +vCPU hotplug +------------ + +(1) Launch QEMU as follows (note that the "maxcpus" is mandatory to + allow vCPU hotplug):: + + $ qemu-system-x86_64 -display none -no-user-config -m 2048 \ + -nodefaults -monitor stdio -machine pc,accel=kvm,usb=off \ + -smp 1,maxcpus=2 -cpu IvyBridge-IBRS \ + -qmp unix:/tmp/qmp-sock,server=on,wait=off + +(2) Run 'qmp-shell' (located in the source tree, under: "scripts/qmp/) + to connect to the just-launched QEMU:: + + $> ./qmp-shell -p -v /tmp/qmp-sock + [...] + (QEMU) + +(3) Find out which CPU types could be plugged, and into which sockets:: + + (QEMU) query-hotpluggable-cpus + { + "execute": "query-hotpluggable-cpus", + "arguments": {} + } + { + "return": [ + { + "type": "IvyBridge-IBRS-x86_64-cpu", + "vcpus-count": 1, + "props": { + "socket-id": 1, + "core-id": 0, + "thread-id": 0 + } + }, + { + "qom-path": "/machine/unattached/device[0]", + "type": "IvyBridge-IBRS-x86_64-cpu", + "vcpus-count": 1, + "props": { + "socket-id": 0, + "core-id": 0, + "thread-id": 0 + } + } + ] + } + (QEMU) + +(4) The ``query-hotpluggable-cpus`` command returns an object for CPUs + that are present (containing a "qom-path" member) or which may be + hot-plugged (no "qom-path" member). From its output in step (3), we + can see that ``IvyBridge-IBRS-x86_64-cpu`` is present in socket 0, + while hot-plugging a CPU into socket 1 requires passing the listed + properties to QMP ``device_add``:: + + (QEMU) device_add id=cpu-2 driver=IvyBridge-IBRS-x86_64-cpu socket-id=1 core-id=0 thread-id=0 + { + "execute": "device_add", + "arguments": { + "socket-id": 1, + "driver": "IvyBridge-IBRS-x86_64-cpu", + "id": "cpu-2", + "core-id": 0, + "thread-id": 0 + } + } + { + "return": {} + } + (QEMU) + +(5) Optionally, run QMP ``query-cpus-fast`` for some details about the + vCPUs:: + + (QEMU) query-cpus-fast + { + "execute": "query-cpus-fast", + "arguments": {} + } + { + "return": [ + { + "qom-path": "/machine/unattached/device[0]", + "target": "x86_64", + "thread-id": 11534, + "cpu-index": 0, + "props": { + "socket-id": 0, + "core-id": 0, + "thread-id": 0 + }, + "arch": "x86" + }, + { + "qom-path": "/machine/peripheral/cpu-2", + "target": "x86_64", + "thread-id": 12106, + "cpu-index": 1, + "props": { + "socket-id": 1, + "core-id": 0, + "thread-id": 0 + }, + "arch": "x86" + } + ] + } + (QEMU) + +vCPU hot-unplug +--------------- + +From the 'qmp-shell', invoke the QMP ``device_del`` command:: + + (QEMU) device_del id=cpu-2 + { + "execute": "device_del", + "arguments": { + "id": "cpu-2" + } + } + { + "return": {} + } + (QEMU) + +.. note:: + vCPU hot-unplug requires guest cooperation; so the ``device_del`` + command above does not guarantee vCPU removal -- it's a "request to + unplug". 
At this point, the guest will get a System Control
+   Interrupt (SCI) and call the ACPI handler for the affected vCPU
+   device. Then the guest kernel will bring the vCPU offline and tell
+   QEMU to unplug it.
diff --git a/docs/system/cpu-models-mips.rst.inc b/docs/system/cpu-models-mips.rst.inc
new file mode 100644
index 000000000..02cc4bb88
--- /dev/null
+++ b/docs/system/cpu-models-mips.rst.inc
@@ -0,0 +1,111 @@
+Supported CPU model configurations on MIPS hosts
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU supports a variety of MIPS CPU models:
+
+Supported CPU models for MIPS32 hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are supported for use on MIPS32 hosts.
+Administrators / applications are recommended to use the CPU model that
+matches the generation of the host CPUs in use. In a deployment with a
+mixture of host CPU models between machines, if live migration
+compatibility is required, use the newest CPU model that is compatible
+across all desired hosts.
+
+``mips32r6-generic``
+  MIPS32 Processor (Release 6, 2015)
+
+``P5600``
+  MIPS32 Processor (P5600, 2014)
+
+``M14K``, ``M14Kc``
+  MIPS32 Processor (M14K, 2009)
+
+``74Kf``
+  MIPS32 Processor (74K, 2007)
+
+``34Kf``
+  MIPS32 Processor (34K, 2006)
+
+``24Kc``, ``24KEc``, ``24Kf``
+  MIPS32 Processor (24K, 2003)
+
+``4Kc``, ``4Km``, ``4KEcR1``, ``4KEmR1``, ``4KEc``, ``4KEm``
+  MIPS32 Processor (4K, 1999)
+
+
+Supported CPU models for MIPS64 hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are supported for use on MIPS64 hosts.
+Administrators / applications are recommended to use the CPU model that
+matches the generation of the host CPUs in use. In a deployment with a
+mixture of host CPU models between machines, if live migration
+compatibility is required, use the newest CPU model that is compatible
+across all desired hosts.
+
+``I6400``
+  MIPS64 Processor (Release 6, 2014)
+
+``Loongson-2E``
+  MIPS64 Processor (Loongson 2, 2006)
+
+``Loongson-2F``
+  MIPS64 Processor (Loongson 2, 2008)
+
+``Loongson-3A1000``
+  MIPS64 Processor (Loongson 3, 2010)
+
+``Loongson-3A4000``
+  MIPS64 Processor (Loongson 3, 2018)
+
+``mips64dspr2``
+  MIPS64 Processor (Release 2, 2006)
+
+``MIPS64R2-generic``, ``5KEc``, ``5KEf``
+  MIPS64 Processor (Release 2, 2002)
+
+``20Kc``
+  MIPS64 Processor (20K, 2000)
+
+``5Kc``, ``5Kf``
+  MIPS64 Processor (5K, 1999)
+
+``VR5432``
+  MIPS64 Processor (VR, 1998)
+
+``R4000``
+  MIPS64 Processor (MIPS III, 1991)
+
+
+Supported CPU models for nanoMIPS hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are supported for use on nanoMIPS hosts.
+Administrators / applications are recommended to use the CPU model that
+matches the generation of the host CPUs in use. In a deployment with a
+mixture of host CPU models between machines, if live migration
+compatibility is required, use the newest CPU model that is compatible
+across all desired hosts.
+
+``I7200``
+  MIPS I7200 (nanoMIPS, 2018)
+
+Preferred CPU models for MIPS hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are preferred for use on different MIPS hosts:
+
+``MIPS III``
+  R4000
+
+``MIPS32R2``
+  34Kf
+
+``MIPS64R6``
+  I6400
+
+``nanoMIPS``
+  I7200
+
diff --git a/docs/system/cpu-models-x86-abi.csv b/docs/system/cpu-models-x86-abi.csv
new file mode 100644
index 000000000..f3f3b60be
--- /dev/null
+++ b/docs/system/cpu-models-x86-abi.csv
@@ -0,0 +1,67 @@
+Model,baseline,v2,v3,v4
+486-v1,,,,
+Broadwell-v1,✅,✅,✅,
+Broadwell-v2,✅,✅,✅,
+Broadwell-v3,✅,✅,✅,
+Broadwell-v4,✅,✅,✅,
+Cascadelake-Server-v1,✅,✅,✅,✅
+Cascadelake-Server-v2,✅,✅,✅,✅
+Cascadelake-Server-v3,✅,✅,✅,✅
+Cascadelake-Server-v4,✅,✅,✅,✅
+Conroe-v1,✅,,,
+Cooperlake-v1,✅,✅,✅,✅
+Denverton-v1,✅,✅,,
+Denverton-v2,✅,✅,,
+Dhyana-v1,✅,✅,✅,
+EPYC-Milan-v1,✅,✅,✅,
+EPYC-Rome-v1,✅,✅,✅,
+EPYC-Rome-v2,✅,✅,✅,
+EPYC-v1,✅,✅,✅,
+EPYC-v2,✅,✅,✅,
+EPYC-v3,✅,✅,✅,
+Haswell-v1,✅,✅,✅,
+Haswell-v2,✅,✅,✅,
+Haswell-v3,✅,✅,✅,
+Haswell-v4,✅,✅,✅,
+Icelake-Client-v1,✅,✅,✅,
+Icelake-Client-v2,✅,✅,✅,
+Icelake-Server-v1,✅,✅,✅,✅
+Icelake-Server-v2,✅,✅,✅,✅
+Icelake-Server-v3,✅,✅,✅,✅
+Icelake-Server-v4,✅,✅,✅,✅
+IvyBridge-v1,✅,✅,,
+IvyBridge-v2,✅,✅,,
+KnightsMill-v1,✅,✅,✅,
+Nehalem-v1,✅,✅,,
+Nehalem-v2,✅,✅,,
+Opteron_G1-v1,✅,,,
+Opteron_G2-v1,✅,,,
+Opteron_G3-v1,✅,,,
+Opteron_G4-v1,✅,✅,,
+Opteron_G5-v1,✅,✅,,
+Penryn-v1,✅,,,
+SandyBridge-v1,✅,✅,,
+SandyBridge-v2,✅,✅,,
+Skylake-Client-v1,✅,✅,✅,
+Skylake-Client-v2,✅,✅,✅,
+Skylake-Client-v3,✅,✅,✅,
+Skylake-Server-v1,✅,✅,✅,✅
+Skylake-Server-v2,✅,✅,✅,✅
+Skylake-Server-v3,✅,✅,✅,✅
+Skylake-Server-v4,✅,✅,✅,✅
+Snowridge-v1,✅,✅,,
+Snowridge-v2,✅,✅,,
+Westmere-v1,✅,✅,,
+Westmere-v2,✅,✅,,
+athlon-v1,,,,
+core2duo-v1,✅,,,
+coreduo-v1,,,,
+kvm32-v1,,,,
+kvm64-v1,✅,,,
+n270-v1,,,,
+pentium-v1,,,,
+pentium2-v1,,,,
+pentium3-v1,,,,
+phenom-v1,✅,,,
+qemu32-v1,,,,
+qemu64-v1,✅,,,
diff --git a/docs/system/cpu-models-x86.rst.inc b/docs/system/cpu-models-x86.rst.inc
new file mode 100644
index 000000000..7f6368f99
--- /dev/null
+++ b/docs/system/cpu-models-x86.rst.inc
@@ -0,0 +1,440 @@
+Recommendations for KVM CPU model configuration on x86 hosts
+============================================================
+
+The information that follows provides recommendations for configuring
+CPU models on x86 hosts. The goals are to maximise performance, while
+protecting the guest OS against various CPU hardware flaws, and optionally
+enabling live migration between hosts with heterogeneous CPU models.
+
+
+Two ways to configure CPU models with QEMU / KVM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+(1) **Host passthrough**
+
+    This passes the host CPU model features, model, stepping, exactly to
+    the guest. Note that KVM may filter out some host CPU model features
+    if they cannot be supported with virtualization. Live migration is
+    unsafe when this mode is used as libvirt / QEMU cannot guarantee a
+    stable CPU is exposed to the guest across hosts. This is the
+    recommended CPU to use, provided live migration is not required.
+
+(2) **Named model**
+
+    QEMU comes with a number of predefined named CPU models that
+    typically refer to specific generations of hardware released by
+    Intel and AMD. These allow the guest VMs to have a degree of
+    isolation from the host CPU, allowing greater flexibility in live
+    migrating between hosts with differing hardware.
+
+In both cases, it is possible to optionally add or remove individual CPU
+features, to alter what is presented to the guest by default.
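+
+The full list of named models known to a given QEMU binary, along with
+the features each one includes, can be printed with ``-cpu help``; a
+quick check:
+
+.. parsed-literal::
+
+   |qemu_system| -cpu help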
+
+Libvirt supports a third way to configure CPU models known as "Host
+model". This uses the QEMU "Named model" feature, automatically picking
+a CPU model that is similar to the host CPU, and then adding extra features
+to approximate the host model as closely as possible. This does not
+guarantee the CPU family, stepping, etc will precisely match the host
+CPU, as they would with "Host passthrough", but gives much of the
+benefit of passthrough, while making live migration safe.
+
+
+ABI compatibility levels for CPU models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The x86_64 architecture has a number of `ABI compatibility levels`_
+defined. Traditionally most operating systems and toolchains would
+only target the original baseline ABI. It is expected that future
+OS and toolchain versions will target newer ABIs. The
+table that follows illustrates which ABI compatibility levels
+can be satisfied by the QEMU CPU models. Note that the table only
+lists the long term stable CPU model versions (e.g. Haswell-v4).
+In addition to what is listed, there are also many CPU model
+aliases which resolve to a different CPU model version,
+depending on which machine type is in use.
+
+.. _ABI compatibility levels: https://gitlab.com/x86-psABIs/x86-64-ABI/
+
+.. csv-table:: x86-64 ABI compatibility levels
+   :file: cpu-models-x86-abi.csv
+   :widths: 40,15,15,15,15
+   :header-rows: 1
+
+
+Preferred CPU models for Intel x86 hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are preferred for use on Intel hosts.
+Administrators / applications are recommended to use the CPU model that
+matches the generation of the host CPUs in use. In a deployment with a
+mixture of host CPU models between machines, if live migration
+compatibility is required, use the newest CPU model that is compatible
+across all desired hosts.
+
+``Cascadelake-Server``, ``Cascadelake-Server-noTSX``
+  Intel Xeon Processor (Cascade Lake, 2019), with "stepping" levels 6
+  or 7 only. (The Cascade Lake Xeon processor with *stepping 5 is
+  vulnerable to MDS variants*.)
+
+``Skylake-Server``, ``Skylake-Server-IBRS``, ``Skylake-Server-noTSX-IBRS``
+  Intel Xeon Processor (Skylake, 2016)
+
+``Skylake-Client``, ``Skylake-Client-IBRS``, ``Skylake-Client-noTSX-IBRS``
+  Intel Core Processor (Skylake, 2015)
+
+``Broadwell``, ``Broadwell-IBRS``, ``Broadwell-noTSX``, ``Broadwell-noTSX-IBRS``
+  Intel Core Processor (Broadwell, 2014)
+
+``Haswell``, ``Haswell-IBRS``, ``Haswell-noTSX``, ``Haswell-noTSX-IBRS``
+  Intel Core Processor (Haswell, 2013)
+
+``IvyBridge``, ``IvyBridge-IBRS``
+  Intel Xeon E3-12xx v2 (Ivy Bridge, 2012)
+
+``SandyBridge``, ``SandyBridge-IBRS``
+  Intel Xeon E312xx (Sandy Bridge, 2011)
+
+``Westmere``, ``Westmere-IBRS``
+  Westmere E56xx/L56xx/X56xx (Nehalem-C, 2010)
+
+``Nehalem``, ``Nehalem-IBRS``
+  Intel Core i7 9xx (Nehalem Class Core i7, 2008)
+
+``Penryn``
+  Intel Core 2 Duo P9xxx (Penryn Class Core 2, 2007)
+
+``Conroe``
+  Intel Celeron_4x0 (Conroe/Merom Class Core 2, 2006)
+
+
+Important CPU features for Intel x86 hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following are important CPU features that should be used on Intel
+x86 hosts, when available in the host CPU. Some of them require explicit
+configuration to enable, as they are not included by default in some, or
+all, of the named CPU models listed above. In general all of these
+features are included if using "Host passthrough" or "Host model".
+
+``pcid``
+  Recommended to mitigate the cost of the Meltdown (CVE-2017-5754) fix.
+
+  Included by default in Haswell, Broadwell & Skylake Intel CPU models.
+
+  Should be explicitly turned on for Westmere, SandyBridge, and
+  IvyBridge Intel CPU models. Note that some desktop/mobile Westmere
+  CPUs cannot support this feature.
+
+``spec-ctrl``
+  Required to enable the Spectre v2 (CVE-2017-5715) fix.
+
+  Included by default in Intel CPU models with -IBRS suffix.
+
+  Must be explicitly turned on for Intel CPU models without -IBRS
+  suffix.
+
+  Requires the host CPU microcode to support this feature before it
+  can be used for guest CPUs.
+
+``stibp``
+  Required to enable stronger Spectre v2 (CVE-2017-5715) fixes in some
+  operating systems.
+
+  Must be explicitly turned on for all Intel CPU models.
+
+  Requires the host CPU microcode to support this feature before it can
+  be used for guest CPUs.
+
+``ssbd``
+  Required to enable the CVE-2018-3639 fix.
+
+  Not included by default in any Intel CPU model.
+
+  Must be explicitly turned on for all Intel CPU models.
+
+  Requires the host CPU microcode to support this feature before it
+  can be used for guest CPUs.
+
+``pdpe1gb``
+  Recommended to allow guest OS to use 1GB size pages.
+
+  Not included by default in any Intel CPU model.
+
+  Should be explicitly turned on for all Intel CPU models.
+
+  Note that not all CPU hardware will support this feature.
+
+``md-clear``
+  Required to confirm the MDS (CVE-2018-12126, CVE-2018-12127,
+  CVE-2018-12130, CVE-2019-11091) fixes.
+
+  Not included by default in any Intel CPU model.
+
+  Must be explicitly turned on for all Intel CPU models.
+
+  Requires the host CPU microcode to support this feature before it
+  can be used for guest CPUs.
+
+``mds-no``
+  Recommended to inform the guest OS that the host is *not* vulnerable
+  to any of the MDS variants ([MFBDS] CVE-2018-12130, [MLPDS]
+  CVE-2018-12127, [MSBDS] CVE-2018-12126).
+
+  This is an MSR (Model-Specific Register) feature rather than a CPUID feature,
+  so it will not appear in the Linux ``/proc/cpuinfo`` in the host or
+  guest. Instead, the host kernel uses it to populate the MDS
+  vulnerability file in ``sysfs``.
+
+  So it should only be enabled for VMs if the host reports
+  ``Not affected`` in the ``/sys/devices/system/cpu/vulnerabilities/mds``
+  file.
+
+``taa-no``
+  Recommended to inform the guest that the host is *not*
+  vulnerable to CVE-2019-11135, TSX Asynchronous Abort (TAA).
+
+  This too is an MSR feature, so it does not show up in the Linux
+  ``/proc/cpuinfo`` in the host or guest.
+
+  It should only be enabled for VMs if the host reports ``Not affected``
+  in the ``/sys/devices/system/cpu/vulnerabilities/tsx_async_abort``
+  file.
+
+``tsx-ctrl``
+  Recommended to inform the guest that it can disable the Intel TSX
+  (Transactional Synchronization Extensions) feature; or, if the
+  processor is vulnerable, use the Intel VERW instruction (a
+  processor-level instruction that performs checks on memory access) as
+  a mitigation for the TAA vulnerability. (For details, refer to
+  Intel's `deep dive into MDS
+  <https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling>`_.)
+
+  Expose this to the guest OS if and only if: (a) the host has TSX
+  enabled; *and* (b) the guest has the ``rtm`` CPU flag enabled.
+
+  By disabling TSX, KVM-based guests can avoid paying the price of
+  mitigating TSX-based attacks.
+
+  Note that ``tsx-ctrl`` too is an MSR feature, so it does not show
+  up in the Linux ``/proc/cpuinfo`` in the host or guest.
+
+  To validate that Intel TSX is indeed disabled for the guest, there are
+  two ways: (a) check for the *absence* of ``rtm`` in the guest's
+  ``/proc/cpuinfo``; or (b) the
+  ``/sys/devices/system/cpu/vulnerabilities/tsx_async_abort`` file in
+  the guest should report ``Mitigation: TSX disabled``.
+
+
+Preferred CPU models for AMD x86 hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are preferred for use on AMD hosts.
+Administrators / applications are recommended to use the CPU model that
+matches the generation of the host CPUs in use. In a deployment with a
+mixture of host CPU models between machines, if live migration
+compatibility is required, use the newest CPU model that is compatible
+across all desired hosts.
+
+``EPYC``, ``EPYC-IBPB``
+  AMD EPYC Processor (2017)
+
+``Opteron_G5``
+  AMD Opteron 63xx class CPU (2012)
+
+``Opteron_G4``
+  AMD Opteron 62xx class CPU (2011)
+
+``Opteron_G3``
+  AMD Opteron 23xx (Gen 3 Class Opteron, 2009)
+
+``Opteron_G2``
+  AMD Opteron 22xx (Gen 2 Class Opteron, 2006)
+
+``Opteron_G1``
+  AMD Opteron 240 (Gen 1 Class Opteron, 2004)
+
+
+Important CPU features for AMD x86 hosts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following are important CPU features that should be used on AMD x86
+hosts, when available in the host CPU. Some of them require explicit
+configuration to enable, as they are not included by default in some, or
+all, of the named CPU models listed above. In general all of these
+features are included if using "Host passthrough" or "Host model".
+
+``ibpb``
+  Required to enable the Spectre v2 (CVE-2017-5715) fix.
+
+  Included by default in AMD CPU models with -IBPB suffix.
+
+  Must be explicitly turned on for AMD CPU models without -IBPB suffix.
+
+  Requires the host CPU microcode to support this feature before it
+  can be used for guest CPUs.
+
+``stibp``
+  Required to enable stronger Spectre v2 (CVE-2017-5715) fixes in some
+  operating systems.
+
+  Must be explicitly turned on for all AMD CPU models.
+
+  Requires the host CPU microcode to support this feature before it
+  can be used for guest CPUs.
+
+``virt-ssbd``
+  Required to enable the CVE-2018-3639 fix.
+
+  Not included by default in any AMD CPU model.
+
+  Must be explicitly turned on for all AMD CPU models.
+
+  This should be provided to guests, even if amd-ssbd is also provided,
+  for maximum guest compatibility.
+
+  Note that for some QEMU / libvirt versions, this must be force enabled
+  when using "Host model", because this is a virtual feature that
+  doesn't exist in the physical host CPUs.
+
+``amd-ssbd``
+  Required to enable the CVE-2018-3639 fix.
+
+  Not included by default in any AMD CPU model.
+
+  Must be explicitly turned on for all AMD CPU models.
+
+  This provides higher performance than ``virt-ssbd`` so should be
+  exposed to guests whenever available in the host. ``virt-ssbd`` should
+  nonetheless also be exposed for maximum guest compatibility as some
+  kernels only know about ``virt-ssbd``.
+
+``amd-no-ssb``
+  Recommended to indicate the host is not vulnerable to CVE-2018-3639.
+
+  Not included by default in any AMD CPU model.
+
+  Future hardware generations of CPU will not be vulnerable to
+  CVE-2018-3639, and thus the guest should be told not to enable
+  its mitigations, by exposing amd-no-ssb. This is mutually
+  exclusive with virt-ssbd and amd-ssbd.
+
+``pdpe1gb``
+  Recommended to allow guest OS to use 1GB size pages.
+
+  Not included by default in any AMD CPU model.
+
+  Should be explicitly turned on for all AMD CPU models.
+
+  Note that not all CPU hardware will support this feature.
+
+
+Default x86 CPU models
+^^^^^^^^^^^^^^^^^^^^^^
+
+The default QEMU CPU models are designed such that they can run on all
+hosts. If an application does not wish to perform any host
+compatibility checks before launching guests, the default is guaranteed
+to work.
+
+The default CPU models will, however, leave the guest OS vulnerable to
+various CPU hardware flaws, so their use is strongly discouraged.
+Applications should follow the earlier guidance to set up a better CPU
+configuration, with host passthrough recommended if live migration is
+not needed.
+
+``qemu32``, ``qemu64``
+  QEMU Virtual CPU version 2.5+ (32 & 64 bit variants)
+
+``qemu64`` is used for x86_64 guests and ``qemu32`` is used for i686
+guests, when no ``-cpu`` argument is given to QEMU, or no ``<cpu>`` is
+provided in libvirt XML.
+
+Other non-recommended x86 CPUs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following CPU models are compatible with most AMD and Intel x86
+hosts, but their usage is discouraged, as they expose a very limited
+featureset, which prevents guests having optimal performance.
+
+``kvm32``, ``kvm64``
+  Common KVM processor (32 & 64 bit variants).
+
+  Legacy models just for historical compatibility with ancient QEMU
+  versions.
+
+``486``, ``athlon``, ``phenom``, ``coreduo``, ``core2duo``, ``n270``, ``pentium``, ``pentium2``, ``pentium3``
+  Various very old x86 CPU models, mostly predating the introduction
+  of hardware assisted virtualization, that should thus not be
+  required for running virtual machines.
+
+
+Syntax for configuring CPU models
+=================================
+
+The examples below illustrate the approach to configuring the various
+CPU models / features in QEMU and libvirt.
+
+QEMU command line
+^^^^^^^^^^^^^^^^^
+
+Host passthrough:
+
+.. parsed-literal::
+
+  |qemu_system| -cpu host
+
+Host passthrough with feature customization:
+
+.. parsed-literal::
+
+  |qemu_system| -cpu host,vmx=off,...
+
+Named CPU models:
+
+.. parsed-literal::
+
+  |qemu_system| -cpu Westmere
+
+Named CPU models with feature customization:
+
+.. parsed-literal::
+
+  |qemu_system| -cpu Westmere,pcid=on,...
+
+Libvirt guest XML
+^^^^^^^^^^^^^^^^^
+
+Host passthrough::
+
+   <cpu mode='host-passthrough'/>
+
+Host passthrough with feature customization::
+
+   <cpu mode='host-passthrough'>
+       <feature name="vmx" policy="disable"/>
+       ...
+   </cpu>
+
+Host model::
+
+   <cpu mode='host-model'/>
+
+Host model with feature customization::
+
+   <cpu mode='host-model'>
+       <feature name="vmx" policy="disable"/>
+       ...
+   </cpu>
+
+Named model::
+
+   <cpu mode='custom'>
+       <model name="Westmere"/>
+   </cpu>
+
+Named model with feature customization::
+
+   <cpu mode='custom'>
+       <model name="Westmere"/>
+       <feature name="pcid" policy="require"/>
+       ...
+   </cpu>
diff --git a/docs/system/device-emulation.rst b/docs/system/device-emulation.rst
new file mode 100644
index 000000000..19944f526
--- /dev/null
+++ b/docs/system/device-emulation.rst
@@ -0,0 +1,91 @@
+.. _device-emulation:
+
+Device Emulation
+----------------
+
+QEMU supports the emulation of a large number of devices, from
+peripherals such as network cards and USB devices to integrated systems
+on a chip (SoCs). Configuration of these is often a source of
+confusion so it helps to have an understanding of some of the terms
+used to describe devices within QEMU.
+
+Common Terms
+~~~~~~~~~~~~
+
+Device Front End
+================
+
+A device front end is how a device is presented to the guest.
+The type of device presented should match the hardware that the guest
+operating system is expecting to see. All devices can be specified
+with the ``--device`` command line option. Running QEMU with the
+command line options ``--device help`` will list all devices it is
+aware of. Using the command line ``--device foo,help`` will list the
+additional configuration options available for that device.
+
+A front end is often paired with a back end, which describes how the
+host's resources are used in the emulation.
+
+Device Buses
+============
+
+Most devices will exist on a BUS of some sort. Depending on the
+machine model you choose (``-M foo``) a number of buses will have been
+automatically created. In most cases the BUS a device is attached to
+can be inferred, for example PCI devices are generally automatically
+allocated to the next free address of the first PCI bus found. However
+in complicated configurations you can explicitly specify what bus
+(``bus=ID``) a device is attached to along with its address
+(``addr=N``).
+
+Some devices, for example a PCI SCSI host controller, will add
+additional buses to the system that other devices can be attached to.
+A hypothetical chain of devices might look like:
+
+  --device foo,bus=pci.0,addr=0,id=foo
+  --device bar,bus=foo.0,addr=1,id=baz
+
+which would be a bar device (with the ID of baz) which is attached to
+the first foo bus (foo.0) at address 1. The foo device which provides
+that bus is itself attached to the first PCI bus (pci.0).
+
+
+Device Back End
+===============
+
+The back end describes how the data from the emulated device will be
+processed by QEMU. The configuration of the back end is usually
+specific to the class of device being emulated. For example serial
+devices will be backed by a ``--chardev`` which can redirect the data
+to a file or socket or some other system. Storage devices are handled
+by ``--blockdev`` which will specify how blocks are handled, for
+example being stored in a qcow2 file or accessing a raw host disk
+partition. Back ends can sometimes be stacked to implement features
+like snapshots.
+
+While the choice of back end is generally transparent to the guest,
+there are cases where features will not be reported to the guest if
+the back end is unable to support it.
+
+Device Pass Through
+===================
+
+Device pass through is where the device is actually given access to
+the underlying hardware. This can be as simple as exposing a single
+USB device on the host system to the guest or dedicating a video card
+in a PCI slot to the exclusive use of the guest.
+
+
+Emulated Devices
+~~~~~~~~~~~~~~~~
+
+.. toctree::
+   :maxdepth: 1
+
+   devices/ivshmem.rst
+   devices/net.rst
+   devices/nvme.rst
+   devices/usb.rst
+   devices/vhost-user.rst
+   devices/virtio-pmem.rst
+   devices/vhost-user-rng.rst
diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc
new file mode 100644
index 000000000..7dbc525fa
--- /dev/null
+++ b/docs/system/device-url-syntax.rst.inc
@@ -0,0 +1,210 @@
+
+In addition to using normal file images for the emulated storage
+devices, QEMU can also use networked resources such as iSCSI devices.
+These are specified using a special URL syntax.
+
+``iSCSI``
+  iSCSI support allows QEMU to access iSCSI resources directly and use
+  them as images for the guest storage. Both disk and cdrom images are
+  supported.
+
+  Syntax for specifying iSCSI LUNs is
+  "iscsi://<target-ip>[:<port>]/<target-iqn>/<lun>"
+
+  By default QEMU will use the iSCSI initiator-name
+  'iqn.2008-11.org.linux-kvm[:<name>]' but this can also be set from
+  the command line or a configuration file.
+
+  Since QEMU 2.4 it is possible to specify an iSCSI request
+  timeout to detect stalled requests and force a reestablishment of the
+  session. The timeout is specified in seconds. The default is 0 which
+  means no timeout. Libiscsi 1.15.0 or greater is required for this
+  feature.
+
+  Example (without authentication):
+
+  .. parsed-literal::
+
+     |qemu_system| -iscsi initiator-name=iqn.2001-04.com.example:my-initiator \\
+                   -cdrom iscsi://192.0.2.1/iqn.2001-04.com.example/2 \\
+                   -drive file=iscsi://192.0.2.1/iqn.2001-04.com.example/1
+
+  Example (CHAP username/password via URL):
+
+  .. parsed-literal::
+
+     |qemu_system| -drive file=iscsi://user%password@192.0.2.1/iqn.2001-04.com.example/1
+
+  Example (CHAP username/password via environment variables):
+
+  .. parsed-literal::
+
+     LIBISCSI_CHAP_USERNAME="user" \\
+     LIBISCSI_CHAP_PASSWORD="password" \\
+     |qemu_system| -drive file=iscsi://192.0.2.1/iqn.2001-04.com.example/1
+
+``NBD``
+  QEMU supports NBD (Network Block Devices) both using the TCP protocol
+  and Unix Domain Sockets. With TCP, the default port is 10809.
+
+  Syntax for specifying an NBD device using TCP, in preferred URI form:
+  "nbd://<server-ip>[:<port>]/[<export>]"
+
+  Syntax for specifying an NBD device using Unix Domain Sockets;
+  remember that '?' is a shell glob character and may need quoting:
+  "nbd+unix:///[<export>]?socket=<domain-socket>"
+
+  Older syntax that is also recognized:
+  "nbd:<server-ip>:<port>[:exportname=<export>]"
+
+  Syntax for specifying an NBD device using Unix Domain Sockets:
+  "nbd:unix:<domain-socket>[:exportname=<export>]"
+
+  Example for TCP
+
+  .. parsed-literal::
+
+     |qemu_system| --drive file=nbd:192.0.2.1:30000
+
+  Example for Unix Domain Sockets
+
+  .. parsed-literal::
+
+     |qemu_system| --drive file=nbd:unix:/tmp/nbd-socket
+
+``SSH``
+  QEMU supports SSH (Secure Shell) access to remote disks.
+
+  Examples:
+
+  .. parsed-literal::
+
+     |qemu_system| -drive file=ssh://user@host/path/to/disk.img
+     |qemu_system| -drive file.driver=ssh,file.user=user,file.host=host,file.port=22,file.path=/path/to/disk.img
+
+  Currently authentication must be done using ssh-agent. Other
+  authentication methods may be supported in future.
+
+``GlusterFS``
+  GlusterFS is a user space distributed file system. QEMU supports the
+  use of GlusterFS volumes for hosting VM disk images using TCP, Unix
+  Domain Sockets and RDMA transport protocols.
+
+  Syntax for specifying a VM disk image on GlusterFS volume is
+
+  .. parsed-literal::
+
+     URI:
+     gluster[+type]://[host[:port]]/volume/path[?socket=...][,debug=N][,logfile=...]
+
+     JSON:
+     'json:{"driver":"qcow2","file":{"driver":"gluster","volume":"testvol","path":"a.img","debug":N,"logfile":"...",
+                                     "server":[{"type":"tcp","host":"...","port":"..."},
+                                               {"type":"unix","socket":"..."}]}}'
+
+  Example
+
+  ..
parsed-literal:: + + URI: + |qemu_system| --drive file=gluster://192.0.2.1/testvol/a.img, + file.debug=9,file.logfile=/var/log/qemu-gluster.log + + JSON: + |qemu_system| 'json:{"driver":"qcow2", + "file":{"driver":"gluster", + "volume":"testvol","path":"a.img", + "debug":9,"logfile":"/var/log/qemu-gluster.log", + "server":[{"type":"tcp","host":"1.2.3.4","port":24007}, + {"type":"unix","socket":"/var/run/glusterd.socket"}]}}' + |qemu_system| -drive driver=qcow2,file.driver=gluster,file.volume=testvol,file.path=/path/a.img, + file.debug=9,file.logfile=/var/log/qemu-gluster.log, + file.server.0.type=tcp,file.server.0.host=1.2.3.4,file.server.0.port=24007, + file.server.1.type=unix,file.server.1.socket=/var/run/glusterd.socket + + See also http://www.gluster.org. + +``HTTP/HTTPS/FTP/FTPS`` + QEMU supports read-only access to files accessed over http(s) and + ftp(s). + + Syntax using a single filename: + + :: + + <protocol>://[<username>[:<password>]@]<host>/<path> + + where: + + ``protocol`` + 'http', 'https', 'ftp', or 'ftps'. + + ``username`` + Optional username for authentication to the remote server. + + ``password`` + Optional password for authentication to the remote server. + + ``host`` + Address of the remote server. + + ``path`` + Path on the remote server, including any query string. + + The following options are also supported: + + ``url`` + The full URL when passing options to the driver explicitly. + + ``readahead`` + The amount of data to read ahead with each range request to the + remote server. This value may optionally have the suffix 'T', 'G', + 'M', 'K', 'k' or 'b'. If it does not have a suffix, it will be + assumed to be in bytes. The value must be a multiple of 512 bytes. + It defaults to 256k. + + ``sslverify`` + Whether to verify the remote server's certificate when connecting + over SSL. It can have the value 'on' or 'off'. It defaults to + 'on'. + + ``cookie`` + Send this cookie (it can also be a list of cookies separated by + ';') with each outgoing request. Only supported when using + protocols such as HTTP which support cookies, otherwise ignored. + + ``timeout`` + Set the timeout in seconds of the CURL connection. This timeout is + the time that CURL waits for a response from the remote server to + get the size of the image to be downloaded. If not set, the + default timeout of 5 seconds is used. + + Note that when passing options to qemu explicitly, ``driver`` is the + value of <protocol>. + + Example: boot from a remote Fedora 20 live ISO image + + .. parsed-literal:: + + |qemu_system_x86| --drive media=cdrom,file=https://archives.fedoraproject.org/pub/archive/fedora/linux/releases/20/Live/x86_64/Fedora-Live-Desktop-x86_64-20-1.iso,readonly + + |qemu_system_x86| --drive media=cdrom,file.driver=http,file.url=http://archives.fedoraproject.org/pub/fedora/linux/releases/20/Live/x86_64/Fedora-Live-Desktop-x86_64-20-1.iso,readonly + + Example: boot from a remote Fedora 20 cloud image using a local + overlay for writes, copy-on-read, and a readahead of 64k + + .. 
parsed-literal:: + + qemu-img create -f qcow2 -o backing_file='json:{"file.driver":"http",, "file.url":"http://archives.fedoraproject.org/pub/archive/fedora/linux/releases/20/Images/x86_64/Fedora-x86_64-20-20131211.1-sda.qcow2",, "file.readahead":"64k"}' /tmp/Fedora-x86_64-20-20131211.1-sda.qcow2 + + |qemu_system_x86| -drive file=/tmp/Fedora-x86_64-20-20131211.1-sda.qcow2,copy-on-read=on + + Example: boot from an image stored on a VMware vSphere server with a + self-signed certificate using a local overlay for writes, a readahead + of 64k and a timeout of 10 seconds. + + .. parsed-literal:: + + qemu-img create -f qcow2 -o backing_file='json:{"file.driver":"https",, "file.url":"https://user:password@vsphere.example.com/folder/test/test-flat.vmdk?dcPath=Datacenter&dsName=datastore1",, "file.sslverify":"off",, "file.readahead":"64k",, "file.timeout":10}' /tmp/test.qcow2 + + |qemu_system_x86| -drive file=/tmp/test.qcow2 diff --git a/docs/system/devices/ivshmem.rst b/docs/system/devices/ivshmem.rst new file mode 100644 index 000000000..b03a48afa --- /dev/null +++ b/docs/system/devices/ivshmem.rst @@ -0,0 +1,64 @@ +.. _pcsys_005fivshmem: + +Inter-VM Shared Memory device +----------------------------- + +On Linux hosts, a shared memory device is available. The basic syntax +is: + +.. parsed-literal:: + + |qemu_system_x86| -device ivshmem-plain,memdev=hostmem + +where hostmem names a host memory backend. For a POSIX shared memory +backend, use something like + +:: + + -object memory-backend-file,size=1M,share,mem-path=/dev/shm/ivshmem,id=hostmem + +If desired, interrupts can be sent between guest VMs accessing the same +shared memory region. Interrupt support requires using a shared memory +server and using a chardev socket to connect to it. The code for the +shared memory server is qemu.git/contrib/ivshmem-server. An example +syntax when using the shared memory server is: + +.. parsed-literal:: + + # First start the ivshmem server once and for all + ivshmem-server -p pidfile -S path -m shm-name -l shm-size -n vectors + + # Then start your qemu instances with matching arguments + |qemu_system_x86| -device ivshmem-doorbell,vectors=vectors,chardev=id + -chardev socket,path=path,id=id + +When using the server, the guest will be assigned a VM ID (>=0) that +allows guests using the same server to communicate via interrupts. +Guests can read their VM ID from a device register (see +ivshmem-spec.txt). + +Migration with ivshmem +~~~~~~~~~~~~~~~~~~~~~~ + +With device property ``master=on``, the guest will copy the shared +memory on migration to the destination host. With ``master=off``, the +guest will not be able to migrate with the device attached. In the +latter case, the device should be detached and then reattached after +migration using the PCI hotplug support. + +At most one of the devices sharing the same memory can be master. The +master must complete migration before you plug back the other devices. + +ivshmem and hugepages +~~~~~~~~~~~~~~~~~~~~~ + +Instead of specifying the <shm size> using POSIX shm, you may specify a +memory backend that has hugepage support: + +.. parsed-literal:: + + |qemu_system_x86| -object memory-backend-file,size=1G,mem-path=/dev/hugepages/my-shmem-file,share,id=mb1 + -device ivshmem-plain,memdev=mb1 + +ivshmem-server also supports hugepages mount points with the ``-m`` +memory path argument. 
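As a quick way to verify the wiring from the guest side (a sketch only:
the PCI address and exact ``lspci`` output will vary between guests), the
ivshmem device appears in a Linux guest as a PCI "RAM memory" device with
vendor/device ID ``1af4:1110``, and the shared region is its BAR 2, which
applications typically ``mmap()`` via the sysfs resource file:

::

   guest# lspci -d 1af4:1110
   00:04.0 RAM memory: Red Hat, Inc. Inter-VM shared memory (rev 01)
   guest# ls /sys/bus/pci/devices/0000:00:04.0/resource2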
diff --git a/docs/system/devices/net.rst b/docs/system/devices/net.rst new file mode 100644 index 000000000..4b2640c44 --- /dev/null +++ b/docs/system/devices/net.rst @@ -0,0 +1,100 @@

.. _pcsys_005fnetwork:

Network emulation
-----------------

QEMU can simulate several network cards (e.g. PCI or ISA cards on the PC
target) and can connect them to a network backend on the host or an
emulated hub. The various host network backends can either be used to
connect the NIC of the guest to a real network (e.g. by using a TAP
device or the non-privileged user mode network stack), or to other
guest instances running in another QEMU process (e.g. by using the
socket host network backend).

Using TAP network interfaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is the standard way to connect QEMU to a real network. QEMU adds a
virtual network device on your host (called ``tapN``), and you can then
configure it as if it were a real ethernet card.

Linux host
^^^^^^^^^^

As an example, you can download the ``linux-test-xxx.tar.gz`` archive
and copy the script ``qemu-ifup`` to ``/etc`` and configure ``sudo``
properly so that the command ``ifconfig`` contained in ``qemu-ifup`` can
be executed as root. You must verify that your host kernel supports the
TAP network interfaces: the device ``/dev/net/tun`` must be present.

See :ref:`sec_005finvocation` for examples of command
lines using the TAP network interfaces.

Windows host
^^^^^^^^^^^^

There is a virtual ethernet driver for Windows 2000/XP systems, called
TAP-Win32. It is not included in standard QEMU for Windows, so you
will need to get it separately. It is part of the OpenVPN package, so
download OpenVPN from: https://openvpn.net/.

Using the user mode network stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By using the option ``-net user`` (default configuration if no ``-net``
option is specified), QEMU uses a completely user mode network stack
(you don't need root privilege to use the virtual network). The virtual
network configuration is the following::

   guest (10.0.2.15)  <------>  Firewall/DHCP server <-----> Internet
                         |          (10.0.2.2)
                         |
                         ---->  DNS server (10.0.2.3)
                         |
                         ---->  SMB server (10.0.2.4)

The QEMU VM behaves as if it were behind a firewall which blocks all
incoming connections. You can use a DHCP client to automatically
configure the network in the QEMU VM. The DHCP server assigns addresses
to the hosts starting from 10.0.2.15.

In order to check that the user mode network is working, you can ping
the address 10.0.2.2 and verify that you got an address in the range
10.0.2.x from the QEMU virtual DHCP server.

Note that ICMP traffic in general does not work with user mode
networking. ``ping``, aka ICMP echo, to the local router (10.0.2.2)
shall work, however. If you're using QEMU on Linux >= 3.0, it can use
unprivileged ICMP ping sockets to allow ``ping`` to the Internet. The
host admin has to set the ping_group_range in order to grant access to
those sockets. To allow ping for GID 100 (usually the users group)::

   echo 100 100 > /proc/sys/net/ipv4/ping_group_range

When using the built-in TFTP server, the router is also the TFTP server.

When using the ``'-netdev user,hostfwd=...'`` option, TCP or UDP
connections can be redirected from the host to the guest. This allows,
for example, redirecting X11, telnet or SSH connections.

Hubs
~~~~

QEMU can simulate several hubs. A hub can be thought of as a virtual
connection between several network devices.
These devices can be, for example, QEMU virtual ethernet cards or
virtual host ethernet devices (TAP devices). You can connect guest NICs
or host network backends to such a hub using the ``-netdev hubport`` or
``-nic hubport`` options. The legacy ``-net`` option also connects the
given device to the emulated hub with ID 0 (i.e. the default hub) unless
you specify a netdev with ``-net nic,netdev=xxx`` here.

Connecting emulated networks between QEMU instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the ``-netdev socket`` (or ``-nic socket`` or ``-net socket``)
option, it is possible to create emulated networks that span several
QEMU instances. See the description of the ``-netdev socket`` option in
:ref:`sec_005finvocation` for a basic example.
diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst new file mode 100644 index 000000000..b5acb2a9c --- /dev/null +++ b/docs/system/devices/nvme.rst @@ -0,0 +1,241 @@

==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller to the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
   Set the maximum number of allowed I/O queue pairs. This replaces the
   deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
   The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
   Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
   Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
   Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
   previously used.

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

    -device nvme,id=nvme-ctrl-0,serial=deadbeef
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.
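As an illustrative check (a sketch assuming a Linux guest with the
``nvme-cli`` package installed; output will vary), the automatically
allocated namespace identifiers can be listed from inside the guest:

.. code-block:: console

    # nvme list-ns /dev/nvme0
    [   0]:0x1
    [   1]:0x2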
There are a number of parameters available:

``nsid`` (default: ``0``)
   Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
   Set the UUID of the namespace. This will be reported as a "Namespace UUID"
   descriptor in the Namespace Identification Descriptor List.

``eui64``
   Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
   Unique Identifier" descriptor in the Namespace Identification Descriptor
   List. Since machine type 6.1 a non-zero default value is used if the
   parameter is not provided. For earlier machine types the field defaults
   to 0.

``bus``
   If there are more ``nvme`` devices defined, this parameter may be used to
   attach the namespace to a specific ``nvme`` device (identified by an ``id``
   parameter on the controller device).

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
    -device nvme,serial=a,subsys=nvme-subsys-0
    -device nvme,serial=b,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
   Specifies that the namespace will be attached to all controllers in the
   subsystem. If set to ``off``, the namespace will remain a private namespace
   and may only be attached to a single controller at a time. Shared namespaces
   are always automatically attached to all controllers (also when controllers
   are hotplugged).

``detached`` (default: ``off``)
   If set to ``on``, the namespace will be available in the subsystem, but
   not attached to any controllers initially. A shared namespace with this set
   to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,nsid=1
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
   This adds a Controller Memory Buffer of the given size at offset zero in
   BAR 2.

``legacy-cmb`` (default: ``off``)
   By default, the device uses the "v1.4 scheme" for the Controller Memory
   Buffer support (i.e., the CMB is initially disabled and must be explicitly
   enabled by the host). Set this to ``on`` to behave as a v1.3 device wrt. the
   CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
   Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
   number of logical blocks that may be specified in each source range.
``mcl=UINT32`` (default: ``128``)
   Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
   blocks that may be specified in a Copy command (the total for all source
   ranges).

``msrc=UINT8`` (default: ``127``)
   Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
   source ranges that may be used in a Copy command. This is a 0's based value.

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
   Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
   Define the zone capacity (``ZCAP``). If left at the default (``0``), the
   zone capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
   Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
   bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
   Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
   Set the maximum number of active resources (``MAR``). The default (``0``)
   allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
   Set the maximum number of open resources (``MOR``). The default (``0``)
   allows all zones to be open. If ``zoned.max_active`` is specified, this
   value must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
   Set the maximum data transfer size for the Zone Append command. Like
   ``mdts``, the value is specified as a power of two (2^n) and is in units of
   the minimum memory page size (CAP.MPSMIN). The default value (``0``)
   has this property inherit the ``mdts`` value.

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
   Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
   Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
   Enable protection information of the specified type (type ``1``, ``2`` or
   ``3``).

``pil=UINT8`` (default: ``0``)
   Controls the location of the protection information within the metadata. Set
   to ``1`` to transfer protection information as the first eight bytes of
   metadata. Otherwise, the protection information is transferred as the last
   eight bytes.
diff --git a/docs/system/devices/usb.rst b/docs/system/devices/usb.rst new file mode 100644 index 000000000..afb7d6c22 --- /dev/null +++ b/docs/system/devices/usb.rst @@ -0,0 +1,351 @@

.. _pcsys_005fusb:

USB emulation
-------------

QEMU can emulate a PCI UHCI, OHCI, EHCI or XHCI USB controller. You can
plug in virtual USB devices or real host USB devices (this only works
with certain host operating systems). QEMU will automatically create and
connect virtual USB hubs as necessary to connect multiple USB devices.

USB controllers
~~~~~~~~~~~~~~~

XHCI controller support
^^^^^^^^^^^^^^^^^^^^^^^

QEMU has XHCI host adapter support. The XHCI hardware design is much
more virtualization-friendly when compared to EHCI and UHCI, thus XHCI
emulation uses fewer resources (especially CPU).
So if your guest supports XHCI (which should be the case for any operating
system released around 2010 or later) we recommend using it::

    qemu -device qemu-xhci

XHCI supports USB 1.1, USB 2.0 and USB 3.0 devices, so this is the
only controller you need. With only a single USB controller (and
therefore only a single USB bus) present in the system there is no
need to use the bus= parameter when adding USB devices.


EHCI controller support
^^^^^^^^^^^^^^^^^^^^^^^

The QEMU EHCI Adapter supports USB 2.0 devices. It can be used either
standalone or with companion controllers (UHCI, OHCI) for USB 1.1
devices. The companion controller setup is more convenient to use
because it provides a single USB bus supporting both USB 2.0 and USB
1.1 devices. See the next section for details.

When running EHCI in standalone mode you can add UHCI or OHCI
controllers for USB 1.1 devices too. Each controller creates its own
bus though, so there are two completely separate USB buses: one USB
1.1 bus driven by the UHCI controller and one USB 2.0 bus driven by
the EHCI controller. Devices must be attached to the correct
controller manually.

The easiest way to add a UHCI controller to a ``pc`` machine is the
``-usb`` switch. QEMU will create the UHCI controller as a function of
the PIIX3 chipset. The USB 1.1 bus will carry the name ``usb-bus.0``.

You can use the standard ``-device`` switch to add an EHCI controller to
your virtual machine. It is strongly recommended to specify an ID for
the controller so the USB 2.0 bus gets an individual name, for example
``-device usb-ehci,id=ehci``. This will give you a USB 2.0 bus named
``ehci.0``.

When adding USB devices using the ``-device`` switch you can specify the
bus they should be attached to. Here is a complete example:

.. parsed-literal::

   |qemu_system| -M pc ${otheroptions}                           \\
      -drive if=none,id=usbstick,format=raw,file=/path/to/image  \\
      -usb                                                       \\
      -device usb-ehci,id=ehci                                   \\
      -device usb-tablet,bus=usb-bus.0                           \\
      -device usb-storage,bus=ehci.0,drive=usbstick

This attaches a USB tablet to the UHCI adapter and a USB mass storage
device to the EHCI adapter.


Companion controller support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The UHCI and OHCI controllers can attach to a USB bus created by EHCI
as companion controllers. This is done by specifying the ``masterbus``
and ``firstport`` properties. ``masterbus`` specifies the bus name the
controller should attach to. ``firstport`` specifies the first port the
controller should attach to, which is needed as usually one EHCI
controller with six ports has three UHCI companion controllers with
two ports each.

There is a config file in docs which will do all this for
you, which you can use like this:

.. parsed-literal::

   |qemu_system| -readconfig docs/config/ich9-ehci-uhci.cfg

Then use ``bus=ehci.0`` to assign your USB devices to that bus.

Using the ``-usb`` switch for ``q35`` machines will create a similar
USB controller configuration.


.. _Connecting USB devices:

Connecting USB devices
~~~~~~~~~~~~~~~~~~~~~~

USB devices can be connected with the ``-device usb-...`` command line
option or the ``device_add`` monitor command. Available devices are:

``usb-mouse``
   Virtual Mouse. This will override the PS/2 mouse emulation when
   activated.

``usb-tablet``
   Pointer device that uses absolute coordinates (like a touchscreen).
   This means QEMU is able to report the mouse position without having
   to grab the mouse.
Also overrides the PS/2 mouse emulation when + activated. + +``usb-storage,drive=drive_id`` + Mass storage device backed by drive_id (see the :ref:`disk images` + chapter in the System Emulation Users Guide). This is the classic + bulk-only transport protocol used by 99% of USB sticks. This + example shows it connected to an XHCI USB controller and with + a drive backed by a raw format disk image: + + .. parsed-literal:: + + |qemu_system| [...] \\ + -drive if=none,id=stick,format=raw,file=/path/to/file.img \\ + -device nec-usb-xhci,id=xhci \\ + -device usb-storage,bus=xhci.0,drive=stick + +``usb-uas`` + USB attached SCSI device. This does not create a SCSI disk, so + you need to explicitly create a ``scsi-hd`` or ``scsi-cd`` device + on the command line, as well as using the ``-drive`` option to + specify what those disks are backed by. One ``usb-uas`` device can + handle multiple logical units (disks). This example creates three + logical units: two disks and one cdrom drive: + + .. parsed-literal:: + + |qemu_system| [...] \\ + -drive if=none,id=uas-disk1,format=raw,file=/path/to/file1.img \\ + -drive if=none,id=uas-disk2,format=raw,file=/path/to/file2.img \\ + -drive if=none,id=uas-cdrom,media=cdrom,format=raw,file=/path/to/image.iso \\ + -device nec-usb-xhci,id=xhci \\ + -device usb-uas,id=uas,bus=xhci.0 \\ + -device scsi-hd,bus=uas.0,scsi-id=0,lun=0,drive=uas-disk1 \\ + -device scsi-hd,bus=uas.0,scsi-id=0,lun=1,drive=uas-disk2 \\ + -device scsi-cd,bus=uas.0,scsi-id=0,lun=5,drive=uas-cdrom + +``usb-bot`` + Bulk-only transport storage device. This presents the guest with the + same USB bulk-only transport protocol interface as ``usb-storage``, but + the QEMU command line option works like ``usb-uas`` and does not + automatically create SCSI disks for you. ``usb-bot`` supports up to + 16 LUNs. Unlike ``usb-uas``, the LUN numbers must be continuous, + i.e. for three devices you must use 0+1+2. The 0+1+5 numbering from the + ``usb-uas`` example above won't work with ``usb-bot``. + +``usb-mtp,rootdir=dir`` + Media transfer protocol device, using dir as root of the file tree + that is presented to the guest. + +``usb-host,hostbus=bus,hostaddr=addr`` + Pass through the host device identified by bus and addr + +``usb-host,vendorid=vendor,productid=product`` + Pass through the host device identified by vendor and product ID + +``usb-wacom-tablet`` + Virtual Wacom PenPartner tablet. This device is similar to the + ``tablet`` above but it can be used with the tslib library because in + addition to touch coordinates it reports touch pressure. + +``usb-kbd`` + Standard USB keyboard. Will override the PS/2 keyboard (if present). + +``usb-serial,chardev=id`` + Serial converter. This emulates an FTDI FT232BM chip connected to + host character device id. + +``usb-braille,chardev=id`` + Braille device. This will use BrlAPI to display the braille output on + a real or fake device referenced by id. + +``usb-net[,netdev=id]`` + Network adapter that supports CDC ethernet and RNDIS protocols. id + specifies a netdev defined with ``-netdev …,id=id``. For instance, + user-mode networking can be used with + + .. parsed-literal:: + + |qemu_system| [...] 
-netdev user,id=net0 -device usb-net,netdev=net0

``usb-ccid``
   Smartcard reader device

``usb-audio``
   USB audio device

``u2f-{emulated,passthru}``
   Universal Second Factor device

Physical port addressing
^^^^^^^^^^^^^^^^^^^^^^^^

For all the above USB devices, by default QEMU will plug the device
into the next available port on the specified USB bus, or onto
some available USB bus if you didn't specify one explicitly.
If you need to, you can also specify the physical port where
the device will show up in the guest. This can be done using the
``port`` property. UHCI has two root ports (1,2). EHCI has six root
ports (1-6), and the emulated (1.1) USB hub has eight ports.

Plugging a tablet into UHCI port 1 works like this::

   -device usb-tablet,bus=usb-bus.0,port=1

Plugging a hub into UHCI port 2 works like this::

   -device usb-hub,bus=usb-bus.0,port=2

Plugging a virtual USB stick into port 4 of the hub we just plugged in
works this way::

   -device usb-storage,bus=usb-bus.0,port=2.4,drive=...

In the monitor, the ``device_add`` command also accepts a ``port``
property specification. If you want to unplug devices too you should
specify some unique id which you can use to refer to the device.
You can then use ``device_del`` to unplug the device later.
For example::

   (qemu) device_add usb-tablet,bus=usb-bus.0,port=1,id=my-tablet
   (qemu) device_del my-tablet

Hotplugging USB storage
~~~~~~~~~~~~~~~~~~~~~~~

The ``usb-bot`` and ``usb-uas`` devices can be hotplugged. In the hotplug
case they are added with ``attached = false`` so the guest will not see
the device until the ``attached`` property is explicitly set to true.
That allows you to attach one or more scsi devices before making the
device visible to the guest. The workflow looks like this:

#. ``device_add usb-bot,id=foo``
#. ``device_add scsi-{hd,cd},bus=foo.0,lun=0``
#. optionally add more devices (luns 1 ... 15)
#. ``scripts/qmp/qom-set foo.attached = true``

.. _host_005fusb_005fdevices:

Using host USB devices on a Linux host
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

WARNING: this is an experimental feature. QEMU will slow down when using
it. USB devices requiring real time streaming (e.g. USB video cameras)
are not supported yet.

1. If you use an early Linux 2.4 kernel, verify that no Linux driver is
   actually using the USB device. A simple way to do that is simply to
   disable the corresponding kernel module by renaming it from
   ``mydriver.o`` to ``mydriver.o.disabled``.

2. Verify that ``/proc/bus/usb`` is working (most Linux distributions
   should enable it by default). You should see something like this:

   ::

      ls /proc/bus/usb
      001  devices  drivers

3. Since only root can access the USB devices directly, you can
   either launch QEMU as root or change the permissions of the USB
   devices you want to use. For testing, the following suffices:

   ::

      chown -R myuid /proc/bus/usb

4. Launch QEMU and do in the monitor:

   ::

      info usbhost
      Device 1.2, speed 480 Mb/s
      Class 00: USB device 1234:5678, USB DISK

   You should see the list of the devices you can use (never try to use
   hubs, it won't work).

5. Add the device in QEMU by using:

   ::

      device_add usb-host,vendorid=0x1234,productid=0x5678

   Normally the guest OS should report that a new USB device is plugged
   in. You can use the option ``-device usb-host,...`` to do the same.

6. Now you can try to use the host USB device in QEMU.
When relaunching QEMU, you may have to unplug and plug again the USB
device to make it work again (this is a bug).

``usb-host`` properties for specifying the host device
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example above uses the ``vendorid`` and ``productid`` to
specify which host device to pass through, but this is not
the only way to specify the host device. ``usb-host`` supports
the following properties:

``hostbus=<nr>``
   Specifies the bus number the device must be attached to.
``hostaddr=<nr>``
   Specifies the device address the device got assigned by the host OS.
``hostport=<str>``
   Specifies the physical port the device is attached to.
``vendorid=<hexnr>``
   Specifies the vendor ID of the device.
``productid=<hexnr>``
   Specifies the product ID of the device.

In theory you can combine all these properties as you like. In
practice only a few combinations are useful:

- ``vendorid`` and ``productid`` -- match for a specific device, pass it to
  the guest when it shows up somewhere in the host.

- ``hostbus`` and ``hostport`` -- match for a specific physical port in the
  host, any device which is plugged in there gets passed to the
  guest.

- ``hostbus`` and ``hostaddr`` -- most useful for ad-hoc pass through as the
  hostaddr isn't stable. The next time you plug the device into the host it
  will get a new hostaddr.

Note that on the host, USB 1.1 devices are handled by UHCI/OHCI and USB
2.0 by EHCI. That means different USB devices plugged into the very
same physical port on the host may show up on different host buses
depending on the speed. Supposing that devices plugged into a given
physical port appear as bus 1 + port 1 for 2.0 devices and bus 3 + port 1
for 1.1 devices, you can pass through any device plugged into that port
and also assign it to the correct USB bus in QEMU like this:

.. parsed-literal::

   |qemu_system| -M pc [...]                             \\
      -usb                                               \\
      -device usb-ehci,id=ehci                           \\
      -device usb-host,bus=usb-bus.0,hostbus=3,hostport=1 \\
      -device usb-host,bus=ehci.0,hostbus=1,hostport=1
diff --git a/docs/system/devices/vhost-user-rng.rst b/docs/system/devices/vhost-user-rng.rst new file mode 100644 index 000000000..a145d4105 --- /dev/null +++ b/docs/system/devices/vhost-user-rng.rst @@ -0,0 +1,39 @@

QEMU vhost-user-rng - RNG emulation
===================================

Background
----------

What follows builds on the material presented in vhost-user.rst - it should
be reviewed before moving forward with the content in this file.

Description
-----------

The vhost-user-rng device implementation was designed to work with a random
number generator daemon such as the one found in the vhost-device crate of
the rust-vmm project available on GitHub [1].

[1]. https://github.com/rust-vmm/vhost-device

Examples
--------

The daemon should be started first:

::

   host# vhost-device-rng --socket-path=rng.sock -c 1 -m 512 -p 1000

The QEMU invocation needs to create a chardev socket the device can
use to communicate as well as share the guest's memory over a memfd.

::

   host# qemu-system                                                  \
       -chardev socket,path=$(PATH)/rng.sock,id=rng0                  \
       -device vhost-user-rng-pci,chardev=rng0                        \
       -m 4096                                                        \
       -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on \
       -numa node,memdev=mem                                          \
       ...
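Once the guest has booted, the device behaves like any other virtio
entropy source. As a sketch of how to verify it from a Linux guest
(this assumes the guest ``virtio_rng`` driver is loaded; names and
output will vary):

::

   guest# cat /sys/class/misc/hw_random/rng_available
   virtio_rng.0
   guest# dd if=/dev/hwrng of=/dev/null bs=16 count=4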
diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst new file mode 100644 index 000000000..86128114f --- /dev/null +++ b/docs/system/devices/vhost-user.rst @@ -0,0 +1,59 @@

.. _vhost_user:

vhost-user back ends
--------------------

vhost-user back ends are a way to service the requests of VirtIO
devices outside of QEMU itself. To do this there are a number of things
required.

vhost-user device
=================

These are simple stub devices that ensure the VirtIO device is visible
to the guest. The code is mostly boilerplate although each device has
a ``chardev`` option which specifies the ID of the ``--chardev``
device that connects via a socket to the vhost-user *daemon*.

vhost-user daemon
=================

This is a separate process that is connected to by QEMU via a socket
following the :ref:`vhost_user_proto`. There are a number of daemons
that can be built when enabled by the project although any daemon that
meets the specification for a given device can be used.

Shared memory object
====================

In order for the daemon to access the VirtIO queues to process the
requests it needs access to the guest's address space. This is
achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
objects. A reference to a file-descriptor which can access this object
will be passed via the socket as part of the protocol negotiation.

Currently the shared memory object needs to match the size of the main
system memory as defined by the ``-m`` argument.

Example
=======

First start your daemon:

.. parsed-literal::

   $ virtio-foo --socket-path=/var/run/foo.sock $OTHER_ARGS

Then you start your QEMU instance, specifying the device, chardev and
memory objects:

.. parsed-literal::

   $ |qemu_system| \\
       -m 4096 \\
       -chardev socket,id=ba1,path=/var/run/foo.sock \\
       -device vhost-user-foo,chardev=ba1,$OTHER_ARGS \\
       -object memory-backend-memfd,id=mem,size=4G,share=on \\
       -numa node,memdev=mem \\
       ...

diff --git a/docs/system/devices/virtio-pmem.rst b/docs/system/devices/virtio-pmem.rst new file mode 100644 index 000000000..c82ac0673 --- /dev/null +++ b/docs/system/devices/virtio-pmem.rst @@ -0,0 +1,76 @@

===========
virtio pmem
===========

This document explains the setup and usage of the virtio pmem device.
The virtio pmem device is a paravirtualized persistent memory device
on regular (i.e. non-NVDIMM) storage.

Usecase
-------

Virtio pmem allows the guest to bypass its page cache and directly use
the host page cache. This reduces the guest memory footprint as the host
can make efficient memory reclaim decisions under memory pressure.

How does virtio-pmem compare to the nvdimm emulation?
-----------------------------------------------------

NVDIMM emulation on regular (i.e. non-NVDIMM) host storage does not
persist the guest writes as there are no defined semantics in the device
specification. The virtio pmem device provides guest write persistence
on non-NVDIMM host storage.

virtio pmem usage
-----------------

A virtio pmem device backed by a memory-backend-file can be created on
the QEMU command line as in the following example::

   -object memory-backend-file,id=mem1,share,mem-path=./virtio_pmem.img,size=4G
   -device virtio-pmem-pci,memdev=mem1,id=nv1

where:

  - "object memory-backend-file,id=mem1,share,mem-path=<image>, size=<image size>"
    creates a backend file with the specified size.
+ + - "device virtio-pmem-pci,id=nvdimm1,memdev=mem1" creates a virtio pmem + pci device whose storage is provided by above memory backend device. + +Multiple virtio pmem devices can be created if multiple pairs of "-object" +and "-device" are provided. + +Hotplug +------- + +Virtio pmem devices can be hotplugged via the QEMU monitor. First, the +memory backing has to be added via 'object_add'; afterwards, the virtio +pmem device can be added via 'device_add'. + +For example, the following commands add another 4GB virtio pmem device to +the guest:: + + (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=virtio_pmem2.img,size=4G + (qemu) device_add virtio-pmem-pci,id=virtio_pmem2,memdev=mem2 + +Guest Data Persistence +---------------------- + +Guest data persistence on non-NVDIMM requires guest userspace applications +to perform fsync/msync. This is different from a real nvdimm backend where +no additional fsync/msync is required. This is to persist guest writes in +host backing file which otherwise remains in host page cache and there is +risk of losing the data in case of power failure. + +With virtio pmem device, MAP_SYNC mmap flag is not supported. This provides +a hint to application to perform fsync for write persistence. + +Limitations +----------- + +- Real nvdimm device backend is not supported. +- virtio pmem hotunplug is not supported. +- ACPI NVDIMM features like regions/namespaces are not supported. +- ndctl command is not supported. diff --git a/docs/system/gdb.rst b/docs/system/gdb.rst new file mode 100644 index 000000000..453eb73f6 --- /dev/null +++ b/docs/system/gdb.rst @@ -0,0 +1,194 @@ +.. _GDB usage: + +GDB usage +--------- + +QEMU supports working with gdb via gdb's remote-connection facility +(the "gdbstub"). This allows you to debug guest code in the same +way that you might with a low-level debug facility like JTAG +on real hardware. You can stop and start the virtual machine, +examine state like registers and memory, and set breakpoints and +watchpoints. + +In order to use gdb, launch QEMU with the ``-s`` and ``-S`` options. +The ``-s`` option will make QEMU listen for an incoming connection +from gdb on TCP port 1234, and ``-S`` will make QEMU not start the +guest until you tell it to from gdb. (If you want to specify which +TCP port to use or to use something other than TCP for the gdbstub +connection, use the ``-gdb dev`` option instead of ``-s``. See +`Using unix sockets`_ for an example.) + +.. parsed-literal:: + + |qemu_system| -s -S -kernel bzImage -hda rootdisk.img -append "root=/dev/hda" + +QEMU will launch but will silently wait for gdb to connect. + +Then launch gdb on the 'vmlinux' executable:: + + > gdb vmlinux + +In gdb, connect to QEMU:: + + (gdb) target remote localhost:1234 + +Then you can use gdb normally. For example, type 'c' to launch the +kernel:: + + (gdb) c + +Here are some useful tips in order to use gdb on system code: + +1. Use ``info reg`` to display all the CPU registers. + +2. Use ``x/10i $eip`` to display the code at the PC position. + +3. Use ``set architecture i8086`` to dump 16 bit code. Then use + ``x/10i $cs*16+$eip`` to dump the code at the PC position. + +Debugging multicore machines +============================ + +GDB's abstraction for debugging targets with multiple possible +parallel flows of execution is a two layer one: it supports multiple +"inferiors", each of which can have multiple "threads". 
When the QEMU +machine has more than one CPU, QEMU exposes each CPU cluster as a +separate "inferior", where each CPU within the cluster is a separate +"thread". Most QEMU machine types have identical CPUs, so there is a +single cluster which has all the CPUs in it. A few machine types are +heterogeneous and have multiple clusters: for example the ``sifive_u`` +machine has a cluster with one E51 core and a second cluster with four +U54 cores. Here the E51 is the only thread in the first inferior, and +the U54 cores are all threads in the second inferior. + +When you connect gdb to the gdbstub, it will automatically +connect to the first inferior; you can display the CPUs in this +cluster using the gdb ``info thread`` command, and switch between +them using gdb's usual thread-management commands. + +For multi-cluster machines, unfortunately gdb does not by default +handle multiple inferiors, and so you have to explicitly connect +to them. First, you must connect with the ``extended-remote`` +protocol, not ``remote``:: + + (gdb) target extended-remote localhost:1234 + +Once connected, gdb will have a single inferior, for the +first cluster. You need to create inferiors for the other +clusters and attach to them, like this:: + + (gdb) add-inferior + Added inferior 2 + (gdb) inferior 2 + [Switching to inferior 2 [<null>] (<noexec>)] + (gdb) attach 2 + Attaching to process 2 + warning: No executable has been specified and target does not support + determining executable automatically. Try using the "file" command. + 0x00000000 in ?? () + +Once you've done this, ``info threads`` will show CPUs in +all the clusters you have attached to:: + + (gdb) info threads + Id Target Id Frame + 1.1 Thread 1.1 (cortex-m33-arm-cpu cpu [running]) 0x00000000 in ?? () + * 2.1 Thread 2.2 (cortex-m33-arm-cpu cpu [halted ]) 0x00000000 in ?? () + +You probably also want to set gdb to ``schedule-multiple`` mode, +so that when you tell gdb to ``continue`` it resumes all CPUs, +not just those in the cluster you are currently working on:: + + (gdb) set schedule-multiple on + +Using unix sockets +================== + +An alternate method for connecting gdb to the QEMU gdbstub is to use +a unix socket (if supported by your operating system). This is useful when +running several tests in parallel, or if you do not have a known free TCP +port (e.g. when running automated tests). + +First create a chardev with the appropriate options, then +instruct the gdbserver to use that device: + +.. parsed-literal:: + + |qemu_system| -chardev socket,path=/tmp/gdb-socket,server=on,wait=off,id=gdb0 -gdb chardev:gdb0 -S ... + +Start gdb as before, but this time connect using the path to +the socket:: + + (gdb) target remote /tmp/gdb-socket + +Note that to use a unix socket for the connection you will need +gdb version 9.0 or newer. + +Advanced debugging options +========================== + +Changing single-stepping behaviour +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The default single stepping behavior is step with the IRQs and timer +service routines off. It is set this way because when gdb executes a +single step it expects to advance beyond the current instruction. With +the IRQs and timer service routines on, a single step might jump into +the one of the interrupt or exception vectors instead of executing the +current instruction. This means you may hit the same breakpoint a number +of times before executing the instruction gdb wants to have executed. 
Because there are rare circumstances where you want to single step into
an interrupt vector, the behavior can be controlled from GDB. There are
three commands with which you can query and set the single-step behavior:

``maintenance packet qqemu.sstepbits``
   This will display the MASK bits used to control the single stepping,
   for example:

   ::

      (gdb) maintenance packet qqemu.sstepbits
      sending: "qqemu.sstepbits"
      received: "ENABLE=1,NOIRQ=2,NOTIMER=4"

``maintenance packet qqemu.sstep``
   This will display the current value of the mask used when single
   stepping, for example:

   ::

      (gdb) maintenance packet qqemu.sstep
      sending: "qqemu.sstep"
      received: "0x7"

``maintenance packet Qqemu.sstep=HEX_VALUE``
   This will change the single step mask, so if you wanted to enable IRQs
   on the single step, but not timers, you would use:

   ::

      (gdb) maintenance packet Qqemu.sstep=0x5
      sending: "qemu.sstep=0x5"
      received: "OK"

Examining physical memory
^^^^^^^^^^^^^^^^^^^^^^^^^

Another feature that the QEMU gdbstub provides is the ability to toggle
which memory GDB works with. By default, GDB will show the current
process memory, respecting the virtual address translation.

If you want to examine/change the physical memory you can set the gdbstub
to work with the physical memory rather than the virtual one.

The memory mode can be checked by sending the following command:

``maintenance packet qqemu.PhyMemMode``
   This will return either 0 or 1; 1 indicates you are currently in the
   physical memory mode.

``maintenance packet Qqemu.PhyMemMode:1``
   This will change the memory mode to physical memory.

``maintenance packet Qqemu.PhyMemMode:0``
   This will change it back to normal memory mode.
diff --git a/docs/system/generic-loader.rst b/docs/system/generic-loader.rst new file mode 100644 index 000000000..4f9fb005f --- /dev/null +++ b/docs/system/generic-loader.rst @@ -0,0 +1,120 @@

..
   Copyright (c) 2016, Xilinx Inc.

   This work is licensed under the terms of the GNU GPL, version 2 or later. See
   the COPYING file in the top-level directory.

Generic Loader
--------------

The 'loader' device allows the user to load multiple images or values into
QEMU at startup.

Loading Data into Memory Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The loader device allows memory values to be set from the command line. This
can be done by following the syntax below::

   -device loader,addr=<addr>,data=<data>,data-len=<data-len> \
     [,data-be=<data-be>][,cpu-num=<cpu-num>]

``<addr>``
   The address to store the data in.

``<data>``
   The value to be written to the address. The maximum size of the data
   is 8 bytes.

``<data-len>``
   The length of the data in bytes. This argument must be included if
   the data argument is.

``<data-be>``
   Set to true if the data to be stored on the guest should be written
   as big endian data. The default is to write little endian data.

``<cpu-num>``
   The number of the CPU whose address space the data should be
   loaded into. If not specified, the address space of the first CPU is
   used.

All values are parsed using the standard QemuOpts parsing. This allows the
user to specify any values in any format supported. By default the values
will be parsed as decimal. To use hex values the user should prefix the number
with a '0x'.
An example of loading value 0x8000000e to address 0xfd1a0104 is::

   -device loader,addr=0xfd1a0104,data=0x8000000e,data-len=4

Setting a CPU's Program Counter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The loader device allows the CPU's PC to be set from the command line. This
can be done by following the syntax below::

   -device loader,addr=<addr>,cpu-num=<cpu-num>

``<addr>``
   The value to use as the CPU's PC.

``<cpu-num>``
   The number of the CPU whose PC should be set to the specified value.

All values are parsed using the standard QemuOpts parsing. This allows the
user to specify any values in any format supported. By default the values
will be parsed as decimal. To use hex values the user should prefix the number
with a '0x'.

An example of setting CPU 0's PC to 0x8000 is::

   -device loader,addr=0x8000,cpu-num=0

Loading Files
^^^^^^^^^^^^^

The loader device also allows files to be loaded into memory. It can load ELF,
U-Boot, and Intel HEX executable formats as well as raw images. The syntax is
shown below::

   -device loader,file=<file>[,addr=<addr>][,cpu-num=<cpu-num>][,force-raw=<raw>]

``<file>``
   A file to be loaded into memory.

``<addr>``
   The memory address where the file should be loaded. This is required
   for raw images and ignored for non-raw files.

``<cpu-num>``
   This specifies the CPU that should be used. This is an
   optional argument and will cause the CPU's PC to be set to the
   memory address where the raw file is loaded or the entry point
   specified in the executable format header. This option should only
   be used for the boot image. This will also cause the image to be
   written to the specified CPU's address space. If not specified, the
   default is CPU 0.

``<force-raw>``
   Setting 'force-raw=on' forces the file to be treated as a raw image.
   This can be used to load supported executable formats as if they
   were raw.

All values are parsed using the standard QemuOpts parsing. This allows the
user to specify any values in any format supported. By default the values
will be parsed as decimal. To use hex values the user should prefix the number
with a '0x'.

An example of loading an ELF file which CPU0 will boot is shown below::

   -device loader,file=./images/boot.elf,cpu-num=0

Restrictions and ToDos
^^^^^^^^^^^^^^^^^^^^^^

At the moment it is just assumed that if you specify a cpu-num then
you want to set the PC as well. This might not always be the case. In
future the internal state 'set_pc' (which exists in the generic loader
now) should be exposed to the user so that they can choose if the PC
is set or not.


diff --git a/docs/system/guest-loader.rst b/docs/system/guest-loader.rst new file mode 100644 index 000000000..9ef9776bf --- /dev/null +++ b/docs/system/guest-loader.rst @@ -0,0 +1,54 @@

..
   Copyright (c) 2020, Linaro

Guest Loader
------------

The guest loader is similar to the ``generic-loader`` although it is
aimed at the particular use case of loading hypervisor guests. This is
useful for debugging hypervisors without having to jump through the
hoops of firmware and boot-loaders.

The guest loader does two things:

  - load blobs (kernels and initial ram disks) into memory
  - set platform FDT data so hypervisors can find and boot them

This is what is typically done by a boot-loader like grub using its
multi-boot capability. A typical example would look like:

.. parsed-literal::
   |qemu_system| -kernel ~/xen.git/xen/xen \
      -append "dom0_mem=1G,max:1G loglvl=all guest_loglvl=all" \
      -device guest-loader,addr=0x42000000,kernel=Image,bootargs="root=/dev/sda2 ro console=hvc0 earlyprintk=xen" \
      -device guest-loader,addr=0x47000000,initrd=rootfs.cpio

In the above example the Xen hypervisor is loaded by the -kernel
parameter and passed its boot arguments via -append. The Dom0 guest
is loaded into the specified areas of memory. Each blob will get a
``/chosen/module@<addr>`` entry in the FDT to indicate its location and
size. Additional information can be passed by using additional
arguments.

Currently the only supported machines which use FDT data to boot are
the ARM and RISC-V ``virt`` machines.

Arguments
^^^^^^^^^

The full syntax of the guest-loader is::

   -device guest-loader,addr=<addr>[,kernel=<file>,[bootargs=<args>]][,initrd=<file>]

``addr=<addr>``
   This is mandatory and indicates the start address of the blob.

``kernel|initrd=<file>``
   Indicates the filename of the kernel or initrd blob. Both blobs will
   have the "multiboot,module" compatibility string as well as
   "multiboot,kernel" or "multiboot,ramdisk" as appropriate.

``bootargs=<args>``
   This is an optional field for kernel blobs which will pass the
   command line via the ``/chosen/module@<addr>/bootargs`` node.
diff --git a/docs/system/i386/cpu.rst b/docs/system/i386/cpu.rst new file mode 100644 index 000000000..738719da9 --- /dev/null +++ b/docs/system/i386/cpu.rst @@ -0,0 +1 @@

.. include:: ../cpu-models-x86.rst.inc
diff --git a/docs/system/i386/kvm-pv.rst b/docs/system/i386/kvm-pv.rst new file mode 100644 index 000000000..1e5a9923e --- /dev/null +++ b/docs/system/i386/kvm-pv.rst @@ -0,0 +1,100 @@

Paravirtualized KVM features
============================

Description
-----------

In cases where implementing a hardware interface in software is slow,
``KVM`` implements its own paravirtualized interfaces.

Setup
-----

Paravirtualized ``KVM`` features are represented as CPU flags. The following
features are enabled by default for any CPU model when ``KVM`` acceleration is
enabled:

- ``kvmclock``
- ``kvm-nopiodelay``
- ``kvm-asyncpf``
- ``kvm-steal-time``
- ``kvm-pv-eoi``
- ``kvmclock-stable-bit``

The ``kvm-msi-ext-dest-id`` feature is enabled by default in x2apic mode with
split irqchip (e.g. "-machine ...,kernel-irqchip=split -cpu ...,x2apic").

Note: when CPU model ``host`` is used, QEMU passes through all supported
paravirtualized ``KVM`` features to the guest.

Existing features
-----------------

``kvmclock``
   Expose a ``KVM`` specific paravirtualized clocksource to the guest.
   Supported since Linux v2.6.26.

``kvm-nopiodelay``
   The guest doesn't need to perform delays on PIO operations. Supported since
   Linux v2.6.26.

``kvm-mmu``
   This feature is deprecated.

``kvm-asyncpf``
   Enable the asynchronous page fault mechanism. Supported since Linux v2.6.38.
   Note: since Linux v5.10 the feature is deprecated and not enabled by
   ``KVM``. Use ``kvm-asyncpf-int`` instead.

``kvm-steal-time``
   Enable stolen (when guest vCPU is not running) time accounting. Supported
   since Linux v3.1.

``kvm-pv-eoi``
   Enable paravirtualized end-of-interrupt signaling. Supported since Linux
   v3.10.

``kvm-pv-unhalt``
   Enable paravirtualized spinlocks support. Supported since Linux v3.12.

``kvm-pv-tlb-flush``
   Enable the paravirtualized TLB flush mechanism. Supported since Linux
   v4.16.
``kvm-pv-ipi``
   Enable the paravirtualized IPI mechanism. Supported since Linux v4.19.

``kvm-poll-control``
   Enable host-side polling on HLT control from the guest. Supported since
   Linux v5.10.

``kvm-pv-sched-yield``
   Enable the paravirtualized sched yield feature. Supported since Linux
   v5.10.

``kvm-asyncpf-int``
   Enable the interrupt based asynchronous page fault mechanism. Supported
   since Linux v5.10.

``kvm-msi-ext-dest-id``
   Support 'Extended Destination ID' for external interrupts. The feature
   allows using up to 32768 CPUs without IRQ remapping (but other limits may
   apply, making the number of supported vCPUs for a given configuration
   lower). Supported since Linux v5.10.

``kvmclock-stable-bit``
   Tell the guest that the guest-visible TSC value can be fully trusted for
   kvmclock computations and no warps are expected. Supported since Linux
   v2.6.35.

Supplementary features
----------------------

``kvm-pv-enforce-cpuid``
   Limit the supported paravirtualized feature set to the exposed features
   only. Note, by default, ``KVM`` allows the guest to use all currently
   supported paravirtualized features even when they were not announced in
   guest visible CPUIDs. Supported since Linux v5.10.


Useful links
------------

Please refer to Documentation/virt/kvm in Linux for additional details.
diff --git a/docs/system/i386/microvm.rst b/docs/system/i386/microvm.rst new file mode 100644 index 000000000..1675e37d3 --- /dev/null +++ b/docs/system/i386/microvm.rst @@ -0,0 +1,128 @@

'microvm' virtual platform (``microvm``)
========================================

``microvm`` is a machine type inspired by ``Firecracker`` and
constructed after its machine model.

It's a minimalist machine type without ``PCI`` or ``ACPI`` support,
designed for short-lived guests. microvm also establishes a baseline
for benchmarking and optimizing both QEMU and guest operating systems,
since it is optimized for both boot time and footprint.


Supported devices
-----------------

The microvm machine type supports the following devices:

- ISA bus
- i8259 PIC (optional)
- i8254 PIT (optional)
- MC146818 RTC (optional)
- One ISA serial port (optional)
- LAPIC
- IOAPIC (with kernel-irqchip=split by default)
- kvmclock (if using KVM)
- fw_cfg
- Up to eight virtio-mmio devices (configured by the user)


Limitations
-----------

Currently, microvm does *not* support the following features:

- PCI-only devices.
- Hotplug of any kind.
- Live migration across QEMU versions.


Using the microvm machine type
------------------------------

Machine-specific options
~~~~~~~~~~~~~~~~~~~~~~~~

It supports the following machine-specific options:

- microvm.x-option-roms=bool (Set off to disable loading option ROMs)
- microvm.pit=OnOffAuto (Enable i8254 PIT)
- microvm.isa-serial=bool (Set off to disable the instantiation of an ISA serial port)
- microvm.pic=OnOffAuto (Enable i8259 PIC)
- microvm.rtc=OnOffAuto (Enable MC146818 RTC)
- microvm.auto-kernel-cmdline=bool (Set off to disable adding virtio-mmio devices to the kernel cmdline)


Boot options
~~~~~~~~~~~~

By default, microvm uses ``qboot`` as its BIOS to obtain better boot
times, but it's also compatible with ``SeaBIOS``.

As no current firmware is able to boot from a block device using
``virtio-mmio`` as its transport, a microvm-based VM needs to be run
using a host-side kernel and, optionally, an initrd image.
+
+
+Running a microvm-based VM
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, microvm aims for maximum compatibility, enabling both
+legacy and non-legacy devices. In this example, a VM is created
+without passing any additional machine-specific option, using the
+legacy ``ISA serial`` device as console::
+
+  $ qemu-system-x86_64 -M microvm \
+     -enable-kvm -cpu host -m 512m -smp 2 \
+     -kernel vmlinux -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/vda" \
+     -nodefaults -no-user-config -nographic \
+     -serial stdio \
+     -drive id=test,file=test.img,format=raw,if=none \
+     -device virtio-blk-device,drive=test \
+     -netdev tap,id=tap0,script=no,downscript=no \
+     -device virtio-net-device,netdev=tap0
+
+While the example above works, you might be interested in reducing the
+footprint further by disabling some legacy devices. If you're using
+``KVM``, you can disable the ``RTC``, making the guest rely on
+``kvmclock`` exclusively. Additionally, if your host's CPUs have the
+``TSC_DEADLINE`` feature, you can also disable both the i8259 PIC and
+the i8254 PIT (make sure you're also emulating a CPU with that feature
+in the guest).
+
+This is an example of a VM with all optional legacy features
+disabled::
+
+  $ qemu-system-x86_64 \
+     -M microvm,x-option-roms=off,pit=off,pic=off,isa-serial=off,rtc=off \
+     -enable-kvm -cpu host -m 512m -smp 2 \
+     -kernel vmlinux -append "console=hvc0 root=/dev/vda" \
+     -nodefaults -no-user-config -nographic \
+     -chardev stdio,id=virtiocon0 \
+     -device virtio-serial-device \
+     -device virtconsole,chardev=virtiocon0 \
+     -drive id=test,file=test.img,format=raw,if=none \
+     -device virtio-blk-device,drive=test \
+     -netdev tap,id=tap0,script=no,downscript=no \
+     -device virtio-net-device,netdev=tap0
+
+
+Triggering a guest-initiated shut down
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As the microvm machine type includes just a small set of system
+devices, some x86 mechanisms for rebooting or shutting down the
+system, like sending a key sequence to the keyboard or writing to an
+ACPI register, don't have any effect in the VM.
+
+The recommended way to trigger a guest-initiated shut down is by
+generating a ``triple-fault``, which will cause the VM to initiate a
+reboot. Additionally, if the ``-no-reboot`` argument is present on the
+command line, QEMU will detect this event and terminate its own
+execution gracefully.
+
+Linux does support this mechanism, but by default it will only be used
+after other options have been tried and have failed, causing the reboot
+to be delayed by a small number of seconds. It's possible to instruct it
+to try the triple-fault mechanism first, by adding ``reboot=t`` to the
+kernel's command line.
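+
+Putting this together, here is a sketch based on the first example above
+(the kernel image and disk are placeholders) of a VM that shuts down
+cleanly via the triple-fault mechanism::
+
+  $ qemu-system-x86_64 -M microvm -no-reboot \
+     -enable-kvm -cpu host -m 512m -smp 2 \
+     -kernel vmlinux -append "console=ttyS0 reboot=t root=/dev/vda" \
+     -nodefaults -no-user-config -nographic -serial stdio \
+     -drive id=test,file=test.img,format=raw,if=none \
+     -device virtio-blk-device,drive=test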
diff --git a/docs/system/i386/pc.rst b/docs/system/i386/pc.rst
new file mode 100644
index 000000000..d543c11a5
--- /dev/null
+++ b/docs/system/i386/pc.rst
@@ -0,0 +1,7 @@
+i440fx PC (``pc-i440fx``, ``pc``)
+=================================
+
+Peripherals
+~~~~~~~~~~~
+
+.. include:: ../target-i386-desc.rst.inc
diff --git a/docs/system/i386/sgx.rst b/docs/system/i386/sgx.rst
new file mode 100644
index 000000000..f8fade5ac
--- /dev/null
+++ b/docs/system/i386/sgx.rst
@@ -0,0 +1,165 @@
+Software Guard eXtensions (SGX)
+===============================
+
+Overview
+--------
+
+Intel Software Guard eXtensions (SGX) is a set of instructions and memory
+access mechanisms designed to provide secure access for sensitive
+applications and data. SGX allows an application to set aside part of its
+address space as an *enclave*, a protected area that provides confidentiality
+and integrity even in the presence of privileged malware. Accesses to the
+enclave memory area from any software not resident in the enclave are
+prevented, including those from privileged software.
+
+Virtual SGX
+-----------
+
+The SGX feature is exposed to the guest via SGX CPUID. For most of the SGX
+CPUID leaves, QEMU can report the same CPUID info to the guest as on the
+host. By reporting the same CPUID, the guest is able to use the full
+capacity of SGX, and KVM doesn't need to emulate that info.
+
+The guest's EPC base and size are determined by QEMU, and KVM needs QEMU to
+pass this information to it before it can initialize SGX for the guest.
+
+Virtual EPC
+~~~~~~~~~~~
+
+By default, QEMU does not assign EPC to a VM, i.e. fully enabling SGX in a VM
+requires explicit allocation of EPC to the VM. Similar to other specialized
+memory types, e.g. hugetlbfs, EPC is exposed as a memory backend.
+
+SGX EPC is enumerated through CPUID, i.e. EPC "devices" need to be realized
+prior to realizing the vCPUs themselves, which occurs long before generic
+devices are parsed and realized. This limitation means that EPC does not
+require -maxmem as EPC is not treated as {cold,hot}plugged memory.
+
+QEMU does not artificially restrict the number of EPC sections exposed to a
+guest, e.g. QEMU will happily allow you to create 64 1M EPC sections. Be aware
+that some kernels may not recognize all EPC sections, e.g. the Linux SGX driver
+is hardwired to support only 8 EPC sections.
+
+The following QEMU snippet creates two EPC sections, with 64M pre-allocated
+to the VM and an additional 28M mapped but not allocated::
+
+    -object memory-backend-epc,id=mem1,size=64M,prealloc=on \
+    -object memory-backend-epc,id=mem2,size=28M \
+    -M sgx-epc.0.memdev=mem1,sgx-epc.1.memdev=mem2
+
+Note:
+
+The size and location of the virtual EPC are far less restricted compared
+to physical EPC. Because physical EPC is protected via range registers,
+the size of the physical EPC must be a power of two (though software sees
+a subset of the full EPC, e.g. 92M or 128M) and the EPC must be naturally
+aligned. KVM SGX's virtual EPC is purely a software construct and only
+requires the size and location to be page aligned. QEMU enforces that the
+EPC size is a multiple of 4k and will ensure the base of the EPC is 4k
+aligned. To simplify the implementation, EPC is always located above 4g in
+the guest physical address space.
+
+Migration
+~~~~~~~~~
+
+QEMU/KVM doesn't prevent live migrating SGX VMs, although from the hardware's
+perspective, SGX doesn't support live migration, since both EPC and the SGX
+key hierarchy are bound to the physical platform. However, live migration
+can be supported in the sense that the guest software stack can recreate
+enclaves when it suffers a sudden loss of EPC, and guest enclaves can detect
+SGX keys having changed and handle that gracefully. For instance, when ERESUME
+fails with #PF.SGX, guest software can gracefully detect it and recreate
+enclaves; and when an enclave fails to unseal sensitive information, it can
+detect the error and the sensitive information can be provisioned to it again.
+
+CPUID
+~~~~~
+
+Due to its myriad dependencies, SGX is currently not listed as supported
+in any of QEMU's built-in CPU configurations. To expose SGX (and SGX Launch
+Control) to a guest, you must either use ``-cpu host`` to pass through the
+host CPU model, or explicitly enable SGX when using a built-in CPU model,
+e.g. via ``-cpu <model>,+sgx`` or ``-cpu <model>,+sgx,+sgxlc``.
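+
+For example, a sketch enabling SGX and SGX Launch Control on a named CPU
+model (the model picked here is purely illustrative)::
+
+    -cpu Skylake-Server,+sgx,+sgxlc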
+
+All SGX sub-features enumerated through CPUID, e.g. SGX2, MISCSELECT,
+ATTRIBUTES, etc., can be restricted via CPUID flags. Be aware that enforcing
+restriction of MISCSELECT, ATTRIBUTES and XFRM requires intercepting ECREATE,
+i.e. it may marginally reduce SGX performance in the guest. All SGX
+sub-features controlled via -cpu are prefixed with "sgx", e.g.::
+
+    $ qemu-system-x86_64 -cpu help | xargs printf "%s\n" | grep sgx
+    sgx
+    sgx-debug
+    sgx-encls-c
+    sgx-enclv
+    sgx-exinfo
+    sgx-kss
+    sgx-mode64
+    sgx-provisionkey
+    sgx-tokenkey
+    sgx1
+    sgx2
+    sgxlc
+
+The following QEMU snippet passes through the host CPU but restricts access to
+the provision and EINIT token keys::
+
+    -cpu host,-sgx-provisionkey,-sgx-tokenkey
+
+SGX sub-features cannot be emulated, i.e. sub-features that are not present
+in hardware cannot be forced on via '-cpu'.
+
+Virtualize SGX Launch Control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU SGX support for Launch Control (LC) is passive, in the sense that it
+does not actively change the LC configuration. QEMU SGX provides the user
+the ability to set/clear the CPUID flag (and by extension the associated
+IA32_FEATURE_CONTROL MSR bit in fw_cfg) and saves/restores the LE Hash MSRs
+when getting/putting guest state, but QEMU does not add new controls to
+directly modify the LC configuration. Similar to hardware behavior, locking
+the LC configuration to a non-Intel value is left to guest firmware. Unlike
+the host BIOS setting for SGX Launch Control (LC), there is by design no
+special BIOS setting for the SGX guest. Even if the host is in locked mode,
+creating a VM with SGX is still allowed.
+
+Feature Control
+~~~~~~~~~~~~~~~
+
+QEMU SGX updates the ``etc/msr_feature_control`` fw_cfg entry to set the SGX
+(bit 18) and SGX LC (bit 17) flags based on their respective CPUID support,
+i.e. existing guest firmware will automatically set SGX and SGX LC accordingly,
+assuming said firmware supports fw_cfg.msr_feature_control.
+
+Launching a guest
+-----------------
+
+To launch an SGX guest:
+
+.. parsed-literal::
+
+   |qemu_system_x86| \\
+      -cpu host,+sgx-provisionkey \\
+      -object memory-backend-epc,id=mem1,size=64M,prealloc=on \\
+      -object memory-backend-epc,id=mem2,size=28M \\
+      -M sgx-epc.0.memdev=mem1,sgx-epc.1.memdev=mem2
+
+Utilizing SGX in the guest requires a kernel/OS with SGX support.
+The support can be determined in the guest by::
+
+    $ grep sgx /proc/cpuinfo
+
+and the SGX EPC info by::
+
+    $ dmesg | grep sgx
+    [    1.242142] sgx: EPC section 0x180000000-0x181bfffff
+    [    1.242319] sgx: EPC section 0x181c00000-0x1837fffff
+
+References
+----------
+
+- `SGX Homepage <https://software.intel.com/sgx>`__
+
+- `SGX SDK <https://github.com/intel/linux-sgx.git>`__
+
+- SGX specification: Intel SDM Volume 3
diff --git a/docs/system/images.rst b/docs/system/images.rst
new file mode 100644
index 000000000..d000bd6b6
--- /dev/null
+++ b/docs/system/images.rst
@@ -0,0 +1,85 @@
+.. _disk images:
+
+Disk Images
+-----------
+
+QEMU supports many disk image formats, including growable disk images
+(their size increases as non-empty sectors are written), compressed and
+encrypted disk images.
+
+.. _disk_005fimages_005fquickstart:
+
+Quick start for disk image creation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can create a disk image with the command::
+
+  qemu-img create myimage.img mysize
+
+where myimage.img is the disk image filename and mysize is its size (in
+bytes if no suffix is given). You can add a ``K`` suffix to give the size
+in kilobytes, an ``M`` suffix for megabytes and a ``G`` suffix for
+gigabytes.
+
+See the ``qemu-img`` invocation documentation for more information.
+
+.. _disk_005fimages_005fsnapshot_005fmode:
+
+Snapshot mode
+~~~~~~~~~~~~~
+
+If you use the option ``-snapshot``, all disk images are considered as
+read-only. Whenever sectors are written, they are written to a temporary
+file created in ``/tmp``. You can however force the write back to the raw
+disk images by using the ``commit`` monitor command (or C-a s in the
+serial console).
+
+.. _vm_005fsnapshots:
+
+VM snapshots
+~~~~~~~~~~~~
+
+VM snapshots are snapshots of the complete virtual machine including CPU
+state, RAM, device state and the content of all the writable disks. In
+order to use VM snapshots, you must have at least one non-removable and
+writable block device using the ``qcow2`` disk image format. Normally
+this device is the first virtual hard drive.
+
+Use the monitor command ``savevm`` to create a new VM snapshot or
+replace an existing one. A human readable name can be assigned to each
+snapshot in addition to its numerical ID.
+
+Use ``loadvm`` to restore a VM snapshot and ``delvm`` to remove a VM
+snapshot. ``info snapshots`` lists the available snapshots with their
+associated information::
+
+  (qemu) info snapshots
+  Snapshot devices: hda
+  Snapshot list (from hda):
+  ID        TAG              VM SIZE                DATE       VM CLOCK
+  1         start                41M 2006-08-06 12:38:02   00:00:14.954
+  2                              40M 2006-08-06 12:43:29   00:00:18.633
+  3         msys                 40M 2006-08-06 12:44:04   00:00:23.514
+
+A VM snapshot is made of VM state info (its size is shown in
+``info snapshots``) and a snapshot of every writable disk image. The VM
+state info is stored in the first ``qcow2`` non-removable and writable
+block device. The disk image snapshots are stored in every disk image.
+The size of a snapshot in a disk image is difficult to evaluate and is
+not shown by ``info snapshots`` because the associated disk sectors are
+shared among all the snapshots to save disk space (otherwise each
+snapshot would need a full copy of all the disk images).
+
+When using the (unrelated) ``-snapshot`` option
+(:ref:`disk_005fimages_005fsnapshot_005fmode`),
+you can always make VM snapshots, but they are deleted as soon as you
+exit QEMU.
+
+VM snapshots currently have the following known limitations:
+
+- They cannot cope with removable devices if they are removed or
+  inserted after a snapshot is done.
+
+- A few device drivers still have incomplete snapshot support so their
+  state is not saved or restored properly (in particular USB).
+
+.. include:: qemu-block-drivers.rst.inc
diff --git a/docs/system/index.rst b/docs/system/index.rst
new file mode 100644
index 000000000..73bbedbc2
--- /dev/null
+++ b/docs/system/index.rst
@@ -0,0 +1,36 @@
+----------------
+System Emulation
+----------------
+
+This section of the manual is the overall guide for users using QEMU
+for full system emulation (as opposed to user-mode emulation).
+This includes working with hypervisors such as KVM, Xen, Hax
+or Hypervisor.Framework.
+
+..
toctree:: + :maxdepth: 3 + + quickstart + invocation + device-emulation + keys + mux-chardev + monitor + images + virtio-net-failover + linuxboot + generic-loader + guest-loader + barrier + vnc-security + tls + secrets + authz + gdb + managed-startup + bootindex + cpu-hotplug + pr-manager + targets + security + multi-process diff --git a/docs/system/invocation.rst b/docs/system/invocation.rst new file mode 100644 index 000000000..4ba38fc23 --- /dev/null +++ b/docs/system/invocation.rst @@ -0,0 +1,18 @@ +.. _sec_005finvocation: + +Invocation +---------- + +.. parsed-literal:: + + |qemu_system| [options] [disk_image] + +disk_image is a raw hard disk image for IDE hard disk 0. Some targets do +not need a disk image. + +.. hxtool-doc:: qemu-options.hx + +Device URL Syntax +~~~~~~~~~~~~~~~~~ + +.. include:: device-url-syntax.rst.inc diff --git a/docs/system/keys.rst b/docs/system/keys.rst new file mode 100644 index 000000000..e596ae6c4 --- /dev/null +++ b/docs/system/keys.rst @@ -0,0 +1,6 @@ +.. _pcsys_005fkeys: + +Keys in the graphical frontends +------------------------------- + +.. include:: keys.rst.inc diff --git a/docs/system/keys.rst.inc b/docs/system/keys.rst.inc new file mode 100644 index 000000000..bd9b8e5f6 --- /dev/null +++ b/docs/system/keys.rst.inc @@ -0,0 +1,35 @@ +During the graphical emulation, you can use special key combinations to +change modes. The default key mappings are shown below, but if you use +``-alt-grab`` then the modifier is Ctrl-Alt-Shift (instead of Ctrl-Alt) +and if you use ``-ctrl-grab`` then the modifier is the right Ctrl key +(instead of Ctrl-Alt): + +Ctrl-Alt-f + Toggle full screen + +Ctrl-Alt-+ + Enlarge the screen + +Ctrl-Alt\-- + Shrink the screen + +Ctrl-Alt-u + Restore the screen's un-scaled dimensions + +Ctrl-Alt-n + Switch to virtual console 'n'. Standard console mappings are: + + *1* + Target system display + + *2* + Monitor + + *3* + Serial port + +Ctrl-Alt + Toggle mouse and keyboard grab. + +In the virtual consoles, you can use Ctrl-Up, Ctrl-Down, Ctrl-PageUp and +Ctrl-PageDown to move in the back log. diff --git a/docs/system/linuxboot.rst b/docs/system/linuxboot.rst new file mode 100644 index 000000000..228650abc --- /dev/null +++ b/docs/system/linuxboot.rst @@ -0,0 +1,30 @@ +.. _direct_005flinux_005fboot: + +Direct Linux Boot +----------------- + +This section explains how to launch a Linux kernel inside QEMU without +having to make a full bootable image. It is very useful for fast Linux +kernel testing. + +The syntax is: + +.. parsed-literal:: + + |qemu_system| -kernel bzImage -hda rootdisk.img -append "root=/dev/hda" + +Use ``-kernel`` to provide the Linux kernel image and ``-append`` to +give the kernel command line arguments. The ``-initrd`` option can be +used to provide an INITRD image. + +If you do not need graphical output, you can disable it and redirect the +virtual serial port and the QEMU monitor to the console with the +``-nographic`` option. The typical command line is: + +.. parsed-literal:: + + |qemu_system| -kernel bzImage -hda rootdisk.img \ + -append "root=/dev/hda console=ttyS0" -nographic + +Use Ctrl-a c to switch between the serial console and the monitor (see +:ref:`pcsys_005fkeys`). 
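+
+The ``-initrd`` option mentioned above is used the same way; as a sketch
+(the initrd file name is a placeholder):
+
+.. parsed-literal::
+
+   |qemu_system| -kernel bzImage -initrd rootfs.cpio.gz \
+                 -append "root=/dev/ram0 console=ttyS0" -nographic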
diff --git a/docs/system/managed-startup.rst b/docs/system/managed-startup.rst
new file mode 100644
index 000000000..9bcf98ea7
--- /dev/null
+++ b/docs/system/managed-startup.rst
@@ -0,0 +1,35 @@
+Managed start up options
+========================
+
+In system mode emulation, it's possible to create a VM in a paused
+state using the ``-S`` command line option. In this state the machine
+is completely initialized according to command line options and ready
+to execute VM code but VCPU threads are not executing any code. The VM
+state in this paused state depends on the way QEMU was started. It
+could be in:
+
+- the initial state (after reset/power on)
+- with direct kernel loading, the initial state could be amended to execute
+  code loaded by QEMU in the VM's RAM
+- with incoming migration, the initial state will be amended with the migrated
+  machine state after migration completes
+
+This paused state is typically used by users to query machine state and/or
+additionally configure the machine (by hotplugging devices) at runtime before
+allowing VM code to run.
+
+However, at the ``-S`` pause point, it's impossible to configure options
+that affect initial VM creation (like: ``-smp``/``-m``/``-numa`` ...) or to
+cold plug devices. The experimental ``--preconfig`` command line option
+allows pausing QEMU before the initial VM creation, in a "preconfig" state,
+where additional queries and configuration can be performed via QMP
+before moving on to start the resulting configuration. In the
+preconfig state, QEMU only allows a limited set of commands over the
+QMP monitor, where the commands do not depend on an initialized
+machine, including but not limited to:
+
+- ``qmp_capabilities``
+- ``query-qmp-schema``
+- ``query-commands``
+- ``query-status``
+- ``x-exit-preconfig``
diff --git a/docs/system/monitor.rst b/docs/system/monitor.rst
new file mode 100644
index 000000000..ff5c43461
--- /dev/null
+++ b/docs/system/monitor.rst
@@ -0,0 +1,31 @@
+.. _QEMU monitor:
+
+QEMU Monitor
+------------
+
+The QEMU monitor is used to give complex commands to the QEMU emulator.
+You can use it to:
+
+- Remove or insert removable media images (such as CD-ROM or
+  floppies).
+
+- Freeze/unfreeze the Virtual Machine (VM) and save or restore its
+  state from a disk file.
+
+- Inspect the VM state without an external debugger.
+
+Commands
+~~~~~~~~
+
+The following commands are available:
+
+.. hxtool-doc:: hmp-commands.hx
+
+.. hxtool-doc:: hmp-commands-info.hx
+
+Integer expressions
+~~~~~~~~~~~~~~~~~~~
+
+The monitor understands integer expressions for every integer argument.
+You can use register names to get the value of specific CPU registers
+by prefixing them with *$*.
diff --git a/docs/system/multi-process.rst b/docs/system/multi-process.rst
new file mode 100644
index 000000000..210531ee1
--- /dev/null
+++ b/docs/system/multi-process.rst
@@ -0,0 +1,64 @@
+Multi-process QEMU
+==================
+
+This document describes how to configure and use multi-process QEMU.
+For the design document refer to docs/devel/qemu-multiprocess.
+
+1) Configuration
+----------------
+
+Multi-process is enabled by default for targets that enable KVM.
+
+
+2) Usage
+--------
+
+Multi-process QEMU requires an orchestrator to launch.
+
+The following is a description of the command line used to launch mpqemu.
+
+* Orchestrator:
+
+  - The orchestrator creates a Unix socketpair
+
+  - It launches the remote process and passes one of the
+    sockets to it via the command line.
+
+  - It then launches QEMU and specifies the other socket as an option
+    to the Proxy device object
+
+* Remote Process:
+
+  - QEMU can enter remote process mode by using the "remote" machine
+    option.
+
+  - The orchestrator creates a "remote-object" with details about
+    the device and the file descriptor for the device
+
+  - The remaining options are no different from how one launches QEMU with
+    devices.
+
+  - An example command line for the remote process is as follows:
+
+      /usr/bin/qemu-system-x86_64 \
+      -machine x-remote \
+      -device lsi53c895a,id=lsi0 \
+      -drive id=drive_image2,file=/build/ol7-nvme-test-1.qcow2 \
+      -device scsi-hd,id=drive2,drive=drive_image2,bus=lsi0.0,scsi-id=0 \
+      -object x-remote-object,id=robj1,devid=lsi0,fd=4,
+
+* QEMU:
+
+  - Since parts of the RAM are shared between QEMU and the remote process, a
+    memory-backend-memfd is required to facilitate this, as follows:
+
+    -object memory-backend-memfd,id=mem,size=2G
+
+  - A "x-pci-proxy-dev" device is created for each of the PCI devices emulated
+    in the remote process. A "socket" sub-option specifies the other end of
+    the Unix channel created by the orchestrator. The "id" sub-option must be
+    specified and should be the same as the "id" specified for the remote PCI
+    device.
+
+  - An example command line for QEMU is as follows:
+
+      -device x-pci-proxy-dev,id=lsi0,socket=3
diff --git a/docs/system/mux-chardev.rst b/docs/system/mux-chardev.rst
new file mode 100644
index 000000000..05064068a
--- /dev/null
+++ b/docs/system/mux-chardev.rst
@@ -0,0 +1,6 @@
+.. _keys in the character backend multiplexer:
+
+Keys in the character backend multiplexer
+-----------------------------------------
+
+.. include:: mux-chardev.rst.inc
diff --git a/docs/system/mux-chardev.rst.inc b/docs/system/mux-chardev.rst.inc
new file mode 100644
index 000000000..84ea12cbf
--- /dev/null
+++ b/docs/system/mux-chardev.rst.inc
@@ -0,0 +1,27 @@
+During emulation, if you are using a character backend multiplexer
+(which is the default if you are using ``-nographic``) then several
+commands are available via an escape sequence. These key sequences all
+start with an escape character, which is Ctrl-a by default, but can be
+changed with ``-echr``. The list below assumes you're using the default.
+
+Ctrl-a h
+  Print this help
+
+Ctrl-a x
+  Exit emulator
+
+Ctrl-a s
+  Save disk data back to file (if -snapshot)
+
+Ctrl-a t
+  Toggle console timestamps
+
+Ctrl-a b
+  Send break (magic sysrq in Linux)
+
+Ctrl-a c
+  Rotate between the frontends connected to the multiplexer (usually
+  this switches between the monitor and the console)
+
+Ctrl-a Ctrl-a
+  Send the escape character to the frontend
diff --git a/docs/system/ppc/embedded.rst b/docs/system/ppc/embedded.rst
new file mode 100644
index 000000000..cfffbda24
--- /dev/null
+++ b/docs/system/ppc/embedded.rst
@@ -0,0 +1,10 @@
+Embedded family boards
+======================
+
+- ``bamboo``       bamboo
+- ``mpc8544ds``    mpc8544ds
+- ``ppce500``      generic paravirt e500 platform
+- ``ref405ep``     ref405ep
+- ``sam460ex``     aCube Sam460ex
+- ``taihu``        taihu
+- ``virtex-ml507`` Xilinx Virtex ML507 reference design
diff --git a/docs/system/ppc/powermac.rst b/docs/system/ppc/powermac.rst
new file mode 100644
index 000000000..04334ba21
--- /dev/null
+++ b/docs/system/ppc/powermac.rst
@@ -0,0 +1,34 @@
+PowerMac family boards (``g3beige``, ``mac99``)
+==================================================================
+
+Use the executable ``qemu-system-ppc`` to simulate a complete PowerMac
+PowerPC system.
+
+- ``g3beige`` Heathrow based PowerMac
+- ``mac99``   Mac99 based PowerMac
+
+Supported devices
+-----------------
+
+QEMU emulates the following PowerMac peripherals:
+
+ * UniNorth or Grackle PCI Bridge
+ * PCI VGA compatible card with VESA Bochs Extensions
+ * 2 PMAC IDE interfaces with hard disk and CD-ROM support
+ * NE2000 PCI adapters
+ * Non Volatile RAM
+ * VIA-CUDA with ADB keyboard and mouse.
+
+
+Missing devices
+---------------
+
+ * To be identified
+
+Firmware
+--------
+
+Since version 0.9.1, QEMU uses OpenBIOS https://www.openbios.org/ for
+the g3beige and mac99 PowerMac and the 40p machines. OpenBIOS is a free
+(GPL v2) portable firmware implementation. The goal is to implement a
+100% IEEE 1275-1994 (referred to as Open Firmware) compliant firmware.
diff --git a/docs/system/ppc/powernv.rst b/docs/system/ppc/powernv.rst
new file mode 100644
index 000000000..86186b7d2
--- /dev/null
+++ b/docs/system/ppc/powernv.rst
@@ -0,0 +1,192 @@
+PowerNV family boards (``powernv8``, ``powernv9``)
+==================================================================
+
+PowerNV (as in Non-Virtualized) is the "baremetal" platform using the
+OPAL firmware. It runs Linux on IBM and OpenPOWER systems and it can
+be used as a hypervisor OS, running KVM guests, or simply as a host
+OS.
+
+The PowerNV QEMU machine tries to emulate a PowerNV system at the
+level of the skiboot firmware, which loads the OS and provides some
+runtime services. Power Systems have a lower-level firmware (HostBoot)
+that does low-level system initialization, like DRAM training. This is
+beyond the scope of what QEMU addresses today.
+
+Supported devices
+-----------------
+
+ * Multiprocessor support for POWER8, POWER8NVL and POWER9
+ * XSCOM, serial communication sideband bus to configure chiplets
+ * Simple LPC Controller
+ * Processor Service Interface (PSI) Controller
+ * Interrupt Controller, XICS (POWER8) and XIVE (POWER9)
+ * POWER8 PHB3 PCIe Host bridge and POWER9 PHB4 PCIe Host bridge
+ * Simple OCC, an on-chip microcontroller used for power management
+   tasks
+ * iBT device to handle BMC communication, with the internal BMC
+   simulator provided by QEMU or an external BMC such as an Aspeed
+   QEMU machine
+ * PNOR containing the different firmware partitions
+
+Missing devices
+---------------
+
+A lot is still missing, among which:
+
+ * POWER10 processor
+ * XIVE2 (POWER10) interrupt controller
+ * I2C controllers (yet to be merged)
+ * NPU/NPU2/NPU3 controllers
+ * EEH support for PCIe Host bridge controllers
+ * NX controller
+ * VAS controller
+ * chipTOD (Time Of Day)
+ * Self Boot Engine (SBE)
+ * FSI bus
+
+Firmware
+--------
+
+The OPAL firmware (OpenPower Abstraction Layer) for OpenPower systems
+includes the runtime services ``skiboot`` and the bootloader kernel and
+initramfs ``skiroot``. Source code can be found on GitHub:
+
+  https://github.com/open-power.
+
+Prebuilt images of ``skiboot`` and ``skiroot`` are made available on the `OpenPOWER <https://github.com/open-power/op-build/releases/>`__ site.
+
+QEMU includes a prebuilt image of ``skiboot`` which is updated when a
+more recent version is required by the models.
+
+Boot options
+------------
+
+Here is a simple setup with one e1000e NIC:
+
+.. code-block:: bash
+
+  $ qemu-system-ppc64 -m 2G -machine powernv9 -smp 2,cores=2,threads=1 \
+    -accel tcg,thread=single \
+    -device e1000e,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pcie.0,addr=0x0 \
+    -netdev user,id=net0,hostfwd=::20022-:22,hostname=pnv \
+    -kernel ./zImage.epapr \
+    -initrd ./rootfs.cpio.xz \
+    -nographic
+
+and a SATA disk:
+
+.. code-block:: bash
+
+    -device ich9-ahci,id=sata0,bus=pcie.1,addr=0x0 \
+    -drive file=./ubuntu-ppc64le.qcow2,if=none,id=drive0,format=qcow2,cache=none \
+    -device ide-hd,bus=sata0.0,unit=0,drive=drive0,id=ide,bootindex=1 \
+
+Complex PCIe configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Six PHBs are defined per chip (POWER9) but no default PCI layout is
+provided (to be compatible with libvirt). One PCI device can be added
+on any of the available PCIe slots using command line options such as:
+
+.. code-block:: bash
+
+    -device e1000e,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pcie.0,addr=0x0
+    -netdev bridge,id=net0,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=hostnet0
+
+    -device megasas,id=scsi0,bus=pcie.0,addr=0x0
+    -drive file=./ubuntu-ppc64le.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none
+    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2
+
+Here is a full example with two different storage controllers on
+different PHBs, each with a disk; the second PHB is left empty:
+
+.. code-block:: bash
+
+  $ qemu-system-ppc64 -m 2G -machine powernv9 -smp 2,cores=2,threads=1 -accel tcg,thread=single \
+    -kernel ./zImage.epapr -initrd ./rootfs.cpio.xz -bios ./skiboot.lid \
+    \
+    -device megasas,id=scsi0,bus=pcie.0,addr=0x0 \
+    -drive file=./rhel7-ppc64le.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
+    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2 \
+    \
+    -device pcie-pci-bridge,id=bridge1,bus=pcie.1,addr=0x0 \
+    \
+    -device ich9-ahci,id=sata0,bus=bridge1,addr=0x1 \
+    -drive file=./ubuntu-ppc64le.qcow2,if=none,id=drive0,format=qcow2,cache=none \
+    -device ide-hd,bus=sata0.0,unit=0,drive=drive0,id=ide,bootindex=1 \
+    -device e1000e,netdev=net0,mac=C0:FF:EE:00:00:02,bus=bridge1,addr=0x2 \
+    -netdev bridge,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=net0 \
+    -device nec-usb-xhci,bus=bridge1,addr=0x7 \
+    \
+    -serial mon:stdio -nographic
+
+You can also use VIRTIO devices:
+
+.. code-block:: bash
+
+    -drive file=./fedora-ppc64le.qcow2,if=none,snapshot=on,id=drive0 \
+    -device virtio-blk-pci,drive=drive0,id=blk0,bus=pcie.0 \
+    \
+    -netdev tap,helper=/usr/lib/qemu/qemu-bridge-helper,br=virbr0,id=netdev0 \
+    -device virtio-net-pci,netdev=netdev0,id=net0,bus=pcie.1 \
+    \
+    -fsdev local,id=fsdev0,path=$HOME,security_model=passthrough \
+    -device virtio-9p-pci,fsdev=fsdev0,mount_tag=host,bus=pcie.2
+
+Multi sockets
+~~~~~~~~~~~~~
+
+The number of sockets is deduced from the number of CPUs and the
+number of cores. ``-smp 2,cores=1`` will define a machine with 2
+sockets of 1 core, whereas ``-smp 2,cores=2`` will define a machine
+with 1 socket of 2 cores, and ``-smp 8,cores=2`` a machine with 4
+sockets of 2 cores.
+
+BMC configuration
+~~~~~~~~~~~~~~~~~
+
+OpenPOWER systems negotiate the shutdown and reboot with their
+BMC. The QEMU PowerNV machine embeds an IPMI BMC simulator using the
+iBT interface and should offer the same power features.
+
+If you want to define your own BMC, use ``-nodefaults`` and specify
+one on the command line:
+
+.. code-block:: bash
+
+    -device ipmi-bmc-sim,id=bmc0 -device isa-ipmi-bt,bmc=bmc0,irq=10
+
+The files `palmetto-SDR.bin <http://www.kaod.org/qemu/powernv/palmetto-SDR.bin>`__
+and `palmetto-FRU.bin <http://www.kaod.org/qemu/powernv/palmetto-FRU.bin>`__
+define a Sensor Data Record repository and a Field Replaceable Unit
+inventory for a palmetto BMC. They can be used to extend the QEMU BMC
+simulator.
+
+.. code-block:: bash
+
+    -device ipmi-bmc-sim,sdrfile=./palmetto-SDR.bin,fruareasize=256,frudatafile=./palmetto-FRU.bin,id=bmc0 \
+    -device isa-ipmi-bt,bmc=bmc0,irq=10
+
+The PowerNV machine can also be run with an external IPMI BMC device
+connected to a remote QEMU machine acting as BMC, using these options:
+
+.. code-block:: bash
+
+    -chardev socket,id=ipmi0,host=localhost,port=9002,reconnect=10 \
+    -device ipmi-bmc-extern,id=bmc0,chardev=ipmi0 \
+    -device isa-ipmi-bt,bmc=bmc0,irq=10 \
+    -nodefaults
+
+NVRAM
+~~~~~
+
+Use an MTD drive to add a PNOR to the machine and get an NVRAM:
+
+.. code-block:: bash
+
+    -drive file=./witherspoon.pnor,format=raw,if=mtd
+
+CAVEATS
+-------
+
+ * No support for multiple HW threads (SMT=1). Same as pseries.
+ * The CPU can hang when doing intensive I/Os. Use ``-append powersave=off``
+   in that case.
diff --git a/docs/system/ppc/ppce500.rst b/docs/system/ppc/ppce500.rst
new file mode 100644
index 000000000..9beef3917
--- /dev/null
+++ b/docs/system/ppc/ppce500.rst
@@ -0,0 +1,164 @@
+ppce500 generic platform (``ppce500``)
+======================================
+
+QEMU for PPC supports a special ``ppce500`` machine designed for emulation and
+virtualization purposes.
+
+Supported devices
+-----------------
+
+The ``ppce500`` machine supports the following devices:
+
+* PowerPC e500 series core (e500v2/e500mc/e5500/e6500)
+* Configuration, Control, and Status Register (CCSR)
+* Multicore Programmable Interrupt Controller (MPIC) with MSI support
+* 1 16550A UART device
+* 1 Freescale MPC8xxx I2C controller
+* 1 Pericom pt7c4338 RTC via I2C
+* 1 Freescale MPC8xxx GPIO controller
+* Power-off functionality via one GPIO pin
+* 1 Freescale MPC8xxx PCI host controller
+* VirtIO devices via PCI bus
+* 1 Freescale Enhanced Triple Speed Ethernet controller (eTSEC)
+
+Hardware configuration information
+----------------------------------
+
+The ``ppce500`` machine automatically generates a device tree blob ("dtb")
+which it passes to the guest, if there is no ``-dtb`` option. This provides
+information about the addresses, interrupt lines and other configuration of
+the various devices in the system.
+
+If users want to provide their own DTB, they can use the ``-dtb`` option.
+These DTBs should meet the following requirements:
+
+* The number of subnodes under the /cpus node should match QEMU's ``-smp`` option
+* The /memory reg size should match QEMU's selected ram_size via ``-m``
+
+Both ``qemu-system-ppc`` and ``qemu-system-ppc64`` provide emulation for the
+following 32-bit PowerPC CPUs:
+
+* e500v2
+* e500mc
+
+Additionally ``qemu-system-ppc64`` provides support for the following 64-bit
+PowerPC CPUs:
+
+* e5500
+* e6500
+
+The CPU type can be specified via the ``-cpu`` command line option. If not
+specified, it creates a machine with an e500v2 core. The following example
+shows an e6500 based machine creation:
+
+.. code-block:: bash
+
+  $ qemu-system-ppc64 -nographic -M ppce500 -cpu e6500
+
+Boot options
+------------
+
+The ``ppce500`` machine can start using the standard -kernel functionality
+for loading a payload like an OS kernel (e.g. Linux), or U-Boot firmware.
+
+When -bios is omitted, the default pc-bios/u-boot.e500 firmware image is used
+as the BIOS. QEMU follows the truth table below to select which payload to
+execute:
+
+===== ========== =======
+-bios    -kernel payload
+===== ========== =======
+    N          N  u-boot
+    N          Y  kernel
+    Y don't care  u-boot
+===== ========== =======
+
+When both -bios and -kernel are present, QEMU loads U-Boot, and U-Boot in turn
+automatically loads the kernel image specified by the -kernel parameter via
+U-Boot's built-in "bootm" command, hence a legacy uImage format is required in
+such a scenario.
+
+Running Linux kernel
+--------------------
+
+The Linux mainline v5.11 release was tested at the time of writing. To build a
+Linux mainline kernel that can be booted by the ``ppce500`` machine in
+64-bit mode, simply configure the kernel using the defconfig configuration:
+
+.. code-block:: bash
+
+  $ export ARCH=powerpc
+  $ export CROSS_COMPILE=powerpc-linux-
+  $ make corenet64_smp_defconfig
+  $ make menuconfig
+
+then manually select the following configuration:
+
+  Platform support > Freescale Book-E Machine Type > QEMU generic e500 platform
+
+To boot the newly built Linux kernel in QEMU with the ``ppce500`` machine:
+
+.. code-block:: bash
+
+  $ qemu-system-ppc64 -M ppce500 -cpu e5500 -smp 4 -m 2G \
+      -display none -serial stdio \
+      -kernel vmlinux \
+      -initrd /path/to/rootfs.cpio \
+      -append "root=/dev/ram"
+
+To build a Linux mainline kernel that can be booted by the ``ppce500`` machine
+in 32-bit mode, use the same 64-bit configuration steps except that the
+defconfig file should be corenet32_smp_defconfig.
+
+To boot the 32-bit Linux kernel:
+
+.. code-block:: bash
+
+  $ qemu-system-ppc{64|32} -M ppce500 -cpu e500mc -smp 4 -m 2G \
+      -display none -serial stdio \
+      -kernel vmlinux \
+      -initrd /path/to/rootfs.cpio \
+      -append "root=/dev/ram"
+
+Running U-Boot
+--------------
+
+The U-Boot mainline v2021.07 release was tested at the time of writing. To
+build a U-Boot mainline bootloader that can be booted by the ``ppce500``
+machine, use the qemu-ppce500_defconfig with similar commands as described
+above for Linux:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=powerpc-linux-
+  $ make qemu-ppce500_defconfig
+
+You will get the u-boot file in the build tree.
+
+When U-Boot boots, you will notice the following when using ``-cpu e6500``:
+
+.. code-block:: none
+
+  CPU:   Unknown, Version: 0.0, (0x00000000)
+  Core:  e6500, Version: 2.0, (0x80400020)
+
+This is because we only specified a core name to QEMU and it does not have a
+meaningful SVR value which represents an actual SoC that integrates such a
+core. You can specify a real-world SoC device for which QEMU has built-in
+support, but all these SoCs are e500v2 based MPC85xx series, hence you cannot
+test anything built for P4080 (e500mc), P5020 (e5500) or T2080 (e6500).
+
+By default a VirtIO standard PCI networking device is connected as an ethernet
+interface at PCI address 0.1.0, but we can switch that to an e1000 NIC by:
+
+.. code-block:: bash
+
+  $ qemu-system-ppc -M ppce500 -smp 4 -m 2G \
+      -display none -serial stdio \
+      -bios u-boot \
+      -nic tap,ifname=tap0,script=no,downscript=no,model=e1000
+
+The QEMU ``ppce500`` machine can also dynamically instantiate an eTSEC device
+if ``-device eTSEC`` is given to QEMU:
+
+.. code-block:: bash
+
+  -netdev tap,ifname=tap0,script=no,downscript=no,id=net0 -device eTSEC,netdev=net0
diff --git a/docs/system/ppc/prep.rst b/docs/system/ppc/prep.rst
new file mode 100644
index 000000000..bd9eb8eab
--- /dev/null
+++ b/docs/system/ppc/prep.rst
@@ -0,0 +1,18 @@
+Prep machine (``40p``)
+==================================================================
+
+Use the executable ``qemu-system-ppc`` to simulate a complete 40P (PREP)
+system.
+
+Supported devices
+-----------------
+
+QEMU emulates the following 40P (PREP) peripherals:
+
+ * PCI Bridge
+ * PCI VGA compatible card with VESA Bochs Extensions
+ * 2 IDE interfaces with hard disk and CD-ROM support
+ * Floppy disk
+ * PCnet network adapters
+ * Serial port
+ * PREP Non Volatile RAM
+ * PC compatible keyboard and mouse.
diff --git a/docs/system/ppc/pseries.rst b/docs/system/ppc/pseries.rst
new file mode 100644
index 000000000..932d4dd17
--- /dev/null
+++ b/docs/system/ppc/pseries.rst
@@ -0,0 +1,12 @@
+pSeries family boards (``pseries``)
+===================================
+
+Supported devices
+-----------------
+
+Missing devices
+---------------
+
+
+Firmware
+--------
diff --git a/docs/system/pr-manager.rst b/docs/system/pr-manager.rst
new file mode 100644
index 000000000..b19a0c15e
--- /dev/null
+++ b/docs/system/pr-manager.rst
@@ -0,0 +1,83 @@
+===============================
+Persistent reservation managers
+===============================
+
+SCSI persistent reservations allow restricting access to block devices
+to specific initiators in a shared storage setup. When implementing
+clustering of virtual machines, it is a common requirement for virtual
+machines to send persistent reservation SCSI commands. However,
+the operating system restricts sending these commands to unprivileged
+programs because incorrect usage can disrupt regular operation of the
+storage fabric.
+
+For this reason, QEMU's SCSI passthrough devices, ``scsi-block``
+and ``scsi-generic`` (both are only available on Linux) can delegate
+implementation of persistent reservations to a separate object,
+the "persistent reservation manager". Only PERSISTENT RESERVE OUT and
+PERSISTENT RESERVE IN commands are passed to the persistent reservation
+manager object; other commands are processed by QEMU as usual.
+
+-----------------------------------------
+Defining a persistent reservation manager
+-----------------------------------------
+
+A persistent reservation manager is an instance of a subclass of the
+"pr-manager" QOM class.
+
+Right now only one subclass is defined, ``pr-manager-helper``, which
+forwards the commands to an external privileged helper program
+over Unix sockets. The helper program only allows sending persistent
+reservation commands to devices for which QEMU has a file descriptor,
+so that QEMU will not be able to effect persistent reservations
+unless it has access to both the socket and the device.
+
+``pr-manager-helper`` has a single string property, ``path``, which
+accepts the path to the helper program's Unix socket. For example,
+the following command line defines a ``pr-manager-helper`` object and
+attaches it to a SCSI passthrough device::
+
+    $ qemu-system-x86_64 \
+        -device virtio-scsi \
+        -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock \
+        -drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0 \
+        -device scsi-block,drive=hd
+
+Alternatively, using ``-blockdev``::
+
+    $ qemu-system-x86_64 \
+        -device virtio-scsi \
+        -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock \
+        -blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0 \
+        -device scsi-block,drive=hd
+
+You will also need to ensure that the helper program
+:command:`qemu-pr-helper` is running, and that it has been
+set up to use the same socket filename as your QEMU command line
+specifies. See the qemu-pr-helper documentation or manpage for
+further details.
+
+---------------------------------------------
+Multipath devices and persistent reservations
+---------------------------------------------
+
+Proper support of persistent reservation for multipath devices requires
+communication with the multipath daemon, so that the reservation is
+registered and applied when a path is newly discovered or becomes online
+again. :command:`qemu-pr-helper` can do this if the ``libmpathpersist``
+library was available on the system at build time.
+
+As of August 2017, a reservation key must be specified in ``multipath.conf``
+for ``multipathd`` to check for persistent reservation for newly
+discovered paths or reinstated paths. The attribute can be added
+to the ``defaults`` section or the ``multipaths`` section; for example::
+
+    multipaths {
+        multipath {
+            wwid             XXXXXXXXXXXXXXXX
+            alias            yellow
+            reservation_key  0x123abc
+        }
+    }
+
+Linking :program:`qemu-pr-helper` to ``libmpathpersist`` does not impede
+its usage on regular SCSI devices.
diff --git a/docs/system/qemu-block-drivers.rst b/docs/system/qemu-block-drivers.rst
new file mode 100644
index 000000000..c2c0114ce
--- /dev/null
+++ b/docs/system/qemu-block-drivers.rst
@@ -0,0 +1,24 @@
+:orphan:
+
+============================
+QEMU block drivers reference
+============================
+
+--------
+Synopsis
+--------
+
+QEMU block driver reference manual
+
+-----------
+Description
+-----------
+
+.. include:: qemu-block-drivers.rst.inc
+
+--------
+See also
+--------
+
+The HTML documentation of QEMU for more precise information and Linux
+user mode emulator invocation.
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
new file mode 100644
index 000000000..e31378442
--- /dev/null
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -0,0 +1,911 @@
+Disk image file formats
+~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU supports many image file formats that can be used with VMs as well as with
+any of the tools (like ``qemu-img``). This includes the preferred formats
+raw and qcow2 as well as formats that are supported for compatibility with
+older QEMU versions or other hypervisors.
+
+Depending on the image format, different options can be passed to
+``qemu-img create`` and ``qemu-img convert`` using the ``-o`` option.
+This section describes each format and the options that are supported for it.
+
+.. program:: image-formats
+.. option:: raw
+
+  Raw disk image format. This format has the advantage of
+  being simple and easily exportable to all other emulators. If your
+  file system supports *holes* (for example in ext2 or ext3 on
+  Linux or NTFS on Windows), then only the written sectors will reserve
+  space. Use ``qemu-img info`` to know the real size used by the
+  image or ``ls -ls`` on Unix/Linux.
+
+  Supported options:
+
+  .. program:: raw
+  .. option:: preallocation
+
+    Preallocation mode (allowed values: ``off``, ``falloc``,
+    ``full``). ``falloc`` mode preallocates space for the image by
+    calling ``posix_fallocate()``. ``full`` mode preallocates space
+    for the image by writing data to the underlying storage. This data
+    may or may not be zero, depending on the storage location.
+
+.. program:: image-formats
+.. option:: qcow2
+
+  QEMU image format, the most versatile format. Use it to have smaller
+  images (useful if your filesystem does not support holes, for example
+  on Windows), zlib based compression and support of multiple VM
+  snapshots.
+
+  Supported options:
+
+  .. program:: qcow2
+  .. option:: compat
+
+    Determines the qcow2 version to use. ``compat=0.10`` uses the
+    traditional image format that can be read by any QEMU since 0.10.
+    ``compat=1.1`` enables image format extensions that only QEMU 1.1 and
+    newer understand (this is the default). Amongst others, this includes
+    zero clusters, which allow efficient copy-on-read for sparse images.
+
+  .. option:: backing_file
+
+    File name of a base image (see ``create`` subcommand)
+
+  .. option:: backing_fmt
+
+    Image format of the base image
+
+  .. option:: encryption
+
+    This option is deprecated and equivalent to ``encrypt.format=aes``
+
+  .. option:: encrypt.format
+
+    If this is set to ``luks``, it requests that the qcow2 payload (not
+    qcow2 header) be encrypted using the LUKS format. The passphrase to
+    use to unlock the LUKS key slot is given by the ``encrypt.key-secret``
+    parameter. LUKS encryption parameters can be tuned with the other
+    ``encrypt.*`` parameters.
+
+    If this is set to ``aes``, the image is encrypted with 128-bit AES-CBC.
+    The encryption key is given by the ``encrypt.key-secret`` parameter.
+    This encryption format is considered to be flawed by modern cryptography
+    standards, suffering from a number of design problems:
+
+    - The AES-CBC cipher is used with predictable initialization vectors based
+      on the sector number. This makes it vulnerable to chosen plaintext attacks
+      which can reveal the existence of encrypted data.
+    - The user passphrase is directly used as the encryption key. A poorly
+      chosen or short passphrase will compromise the security of the encryption.
+    - In the event of the passphrase being compromised there is no way to
+      change the passphrase to protect data in any qcow images. The files must
+      be cloned, using a different encryption passphrase in the new file. The
+      original file must then be securely erased using a program like shred,
+      though even this is ineffective with many modern storage technologies.
+
+    The use of this is no longer supported in system emulators. Support only
+    remains in the command line utilities, for the purposes of data liberation
+    and interoperability with old versions of QEMU. The ``luks`` format
+    should be used instead.
+
+  .. option:: encrypt.key-secret
+
+    Provides the ID of a ``secret`` object that contains the passphrase
+    (``encrypt.format=luks``) or encryption key (``encrypt.format=aes``).
+
+  .. option:: encrypt.cipher-alg
+
+    Name of the cipher algorithm and key length. Currently defaults
+    to ``aes-256``. Only used when ``encrypt.format=luks``.
+
+  .. option:: encrypt.cipher-mode
+
+    Name of the encryption mode to use. Currently defaults to ``xts``.
+    Only used when ``encrypt.format=luks``.
+
+  .. option:: encrypt.ivgen-alg
+
+    Name of the initialization vector generator algorithm. Currently defaults
+    to ``plain64``. Only used when ``encrypt.format=luks``.
+
+  .. option:: encrypt.ivgen-hash-alg
+
+    Name of the hash algorithm to use with the initialization vector generator
+    (if required). Defaults to ``sha256``. Only used when ``encrypt.format=luks``.
+
+  .. option:: encrypt.hash-alg
+
+    Name of the hash algorithm to use for the PBKDF algorithm.
+    Defaults to ``sha256``. Only used when ``encrypt.format=luks``.
+
+  .. option:: encrypt.iter-time
+
+    Amount of time, in milliseconds, to use for the PBKDF algorithm per key slot.
+    Defaults to ``2000``. Only used when ``encrypt.format=luks``.
+
+  .. option:: cluster_size
+
+    Changes the qcow2 cluster size (must be between 512 and 2M). Smaller cluster
+    sizes can improve the image file size whereas larger cluster sizes generally
+    provide better performance.
+
+  .. option:: preallocation
+
+    Preallocation mode (allowed values: ``off``, ``metadata``, ``falloc``,
+    ``full``). An image with preallocated metadata is initially larger but can
+    improve performance when the image needs to grow. The ``falloc`` and
+    ``full`` preallocation modes are like the same options of the ``raw``
+    format, but they set up the metadata as well.
+
+  .. option:: lazy_refcounts
+
+    If this option is set to ``on``, reference count updates are postponed with
+    the goal of avoiding metadata I/O and improving performance. This is
+    particularly interesting with :option:`cache=writethrough` which doesn't batch
+    metadata updates. The tradeoff is that after a host crash, the reference count
+    tables must be rebuilt, i.e. on the next open an (automatic) ``qemu-img
+    check -r all`` is required, which may take some time.
+
+    This option can only be enabled if ``compat=1.1`` is specified.
+
+  .. option:: nocow
+
+    If this option is set to ``on``, it will turn off COW of the file. It's only
+    valid on btrfs and has no effect on other file systems.
+
+    Btrfs has low performance when hosting a VM image file, even more so
+    when the guest in the VM is also using btrfs as its file system. Turning
+    off COW is a way to mitigate this bad performance. Generally there are two
+    ways to turn off COW on btrfs:
+
+    - Disable it by mounting with nodatacow, then all newly created files
+      will be NOCOW.
+    - For an empty file, add the NOCOW file attribute. That's what this
+      option does.
+
+    Note: this option is only valid for new or empty files. If there is
+    an existing file which is COW and has data blocks already, it cannot
+    be changed to NOCOW by setting ``nocow=on``. One can issue ``lsattr
+    filename`` to check if the NOCOW flag is set or not (capital 'C' is the
+    NOCOW flag).
+
+.. program:: image-formats
+.. option:: qed
+
+  Old QEMU image format with support for backing files and compact image files
+  (when your filesystem or transport medium does not support holes).
+
+  When converting QED images to qcow2, you might want to consider using the
+  ``lazy_refcounts=on`` option to get a more QED-like behaviour.
+
+  Supported options:
+
+  .. program:: qed
+  .. option:: backing_file
+
+    File name of a base image (see ``create`` subcommand).
+
+  .. option:: backing_fmt
+
+    Image file format of the backing file (optional). Useful if the format
+    cannot be autodetected because it has no header, like some vhd/vpc files.
+
+  .. option:: cluster_size
+
+    Changes the cluster size (must be a power of 2 between 4K and 64K). Smaller
+    cluster sizes can improve the image file size whereas larger cluster sizes
+    generally provide better performance.
+
+  .. option:: table_size
+
+    Changes the number of clusters per L1/L2 table (must be a
+    power of 2 between 1 and 16). There is normally no need to
+    change this value but this option can be used for
+    performance benchmarking.
+
+.. program:: image-formats
+.. option:: qcow
+
+  Old QEMU image format with support for backing files, compact image files,
+  encryption and compression.
+
+  Supported options:
+
+  .. program:: qcow
+  .. option:: backing_file
+
+    File name of a base image (see ``create`` subcommand)
+
+  .. option:: encryption
+
+    This option is deprecated and equivalent to ``encrypt.format=aes``
+
+  .. option:: encrypt.format
+
+    If this is set to ``aes``, the image is encrypted with 128-bit AES-CBC.
+    The encryption key is given by the ``encrypt.key-secret`` parameter.
+    This encryption format is considered to be flawed by modern cryptography
+    standards, suffering from a number of design problems enumerated previously
+    against the ``qcow2`` image format.
+
+    The use of this is no longer supported in system emulators. Support only
+    remains in the command line utilities, for the purposes of data liberation
+    and interoperability with old versions of QEMU.
+
+    Users requiring native encryption should use the ``qcow2`` format
+    instead with ``encrypt.format=luks``.
+
+  .. option:: encrypt.key-secret
+
+    Provides the ID of a ``secret`` object that contains the encryption
+    key (``encrypt.format=aes``).
+
+.. program:: image-formats
+.. option:: luks
+
+  LUKS v1 encryption format, compatible with Linux dm-crypt/cryptsetup
+
+  Supported options:
+
+  .. program:: luks
+  .. option:: key-secret
+
+    Provides the ID of a ``secret`` object that contains the passphrase.
+
+  .. option:: cipher-alg
+
+    Name of the cipher algorithm and key length. Currently defaults
+    to ``aes-256``.
+
+  .. option:: cipher-mode
+
+    Name of the encryption mode to use. Currently defaults to ``xts``.
+
+  .. option:: ivgen-alg
+
+    Name of the initialization vector generator algorithm. Currently defaults
+    to ``plain64``.
+
+  .. option:: ivgen-hash-alg
+
+    Name of the hash algorithm to use with the initialization vector generator
+    (if required). Defaults to ``sha256``.
+
+  .. option:: hash-alg
+
+    Name of the hash algorithm to use for the PBKDF algorithm.
+    Defaults to ``sha256``.
+
+  .. option:: iter-time
+
+    Amount of time, in milliseconds, to use for the PBKDF algorithm per key slot.
+    Defaults to ``2000``.
+
+.. program:: image-formats
+.. option:: vdi
+
+  VirtualBox 1.1 compatible image format.
+
+  Supported options:
+
+  .. program:: vdi
+  .. option:: static
+
+    If this option is set to ``on``, the image is created with metadata
+    preallocation.
+
+.. program:: image-formats
+.. option:: vmdk
+
+  VMware 3 and 4 compatible image format.
+
+  Supported options:
+
+  .. program:: vmdk
+  .. option:: backing_file
+
+    File name of a base image (see ``create`` subcommand).
+
+  .. option:: compat6
+
+    Create a VMDK version 6 image (instead of version 4)
+
+  .. option:: hwversion
+
+    Specify the vmdk virtual hardware version. The compat6 flag cannot be
+    enabled if hwversion is specified.
+
+  .. option:: subformat
+
+    Specifies which VMDK subformat to use. Valid options are
+    ``monolithicSparse`` (default),
+    ``monolithicFlat``,
+    ``twoGbMaxExtentSparse``,
+    ``twoGbMaxExtentFlat`` and
+    ``streamOptimized``.
+
+.. program:: image-formats
+.. option:: vpc
+
+  VirtualPC compatible image format (VHD).
+
+  Supported options:
+
+  .. program:: vpc
+  .. option:: subformat
+
+    Specifies which VHD subformat to use. Valid options are
+    ``dynamic`` (default) and ``fixed``.
+
+.. program:: image-formats
+.. option:: VHDX
+
+  Hyper-V compatible image format (VHDX).
+
+  Supported options:
+
+  .. program:: VHDX
+  .. option:: subformat
+
+    Specifies which VHDX subformat to use. Valid options are
+    ``dynamic`` (default) and ``fixed``.
+
+  .. option:: block_state_zero
+
+    Force use of payload blocks of type 'ZERO'. Can be set to ``on`` (default)
+    or ``off``. When set to ``off``, new blocks will be created as
+    ``PAYLOAD_BLOCK_NOT_PRESENT``, which means parsers are free to return
+    arbitrary data for those blocks. Do not set to ``off`` when using
+    ``qemu-img convert`` with ``subformat=dynamic``.
+
+  .. option:: block_size
+
+    Block size; min 1 MB, max 256 MB. 0 means auto-calculate based on
+    image size.
+
+  .. option:: log_size
+
+    Log size; min 1 MB.
+
+Read-only formats
+~~~~~~~~~~~~~~~~~
+
+More disk image file formats are supported in a read-only mode.
+
+.. program:: image-formats
+.. option:: bochs
+
+  Bochs images of ``growing`` type.
+
+.. program:: image-formats
+.. option:: cloop
+
+  Linux Compressed Loop image, useful only to reuse directly compressed
+  CD-ROM images present for example in the Knoppix CD-ROMs.
+
+.. program:: image-formats
+.. option:: dmg
+
+  Apple disk image.
+
+.. program:: image-formats
+.. option:: parallels
+
+  Parallels disk image format.
+
+Using host drives
+~~~~~~~~~~~~~~~~~
+
+In addition to disk image files, QEMU can directly access host
+devices. We describe here the usage for QEMU version >= 0.8.3.
+
+Linux
+^^^^^
+
+On Linux, you can directly use the host device filename instead of a
+disk image filename provided you have enough privileges to access
+it. For example, use ``/dev/cdrom`` to access the CDROM.
+
+CD
+  You can specify a CDROM device even if no CDROM is loaded. QEMU has
+  specific code to detect CDROM insertion or removal. CDROM ejection by
+  the guest OS is supported. Currently only data CDs are supported.
+
+Floppy
+  You can specify a floppy device even if no floppy is loaded. Floppy
+  removal is currently not detected accurately (if you change the floppy
+  without doing any floppy access while the floppy is not loaded, the guest
+  OS will think that the same floppy is loaded).
+  Use of the host's floppy device is deprecated, and support for it will
+  be removed in a future release.
+
+Hard disks
+  Hard disks can be used. Normally you must specify the whole disk
+  (``/dev/hdb`` instead of ``/dev/hdb1``) so that the guest OS can
+  see it as a partitioned disk. WARNING: unless you know what you are
+  doing, it is better to only make READ-ONLY accesses to the hard disk
+  otherwise you may corrupt your host data (use the ``-snapshot`` command
+  line option or modify the device permissions accordingly).
+
+Windows
+^^^^^^^
+
+CD
+  The preferred syntax is the drive letter (e.g. ``d:``). The
+  alternate syntax ``\\.\d:`` is supported. ``/dev/cdrom`` is
+  supported as an alias to the first CDROM drive.
+
+  Currently there is no specific code to handle removable media, so it
+  is better to use the ``change`` or ``eject`` monitor commands to
+  change or eject media.
+
+Hard disks
+  Hard disks can be used with the syntax: ``\\.\PhysicalDriveN``
+  where *N* is the drive number (0 is the first hard disk).
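+
+  For instance, a hypothetical invocation (the drive number is purely
+  illustrative; double-check which physical drive you are exposing)::
+
+    qemu-system-x86_64 -drive file=\\.\PhysicalDrive1,format=raw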
+
+  WARNING: unless you know what you are doing, it is better to only make
+  READ-ONLY accesses to the hard disk, otherwise you may corrupt your
+  host data (use the ``-snapshot`` command line option so that the
+  modifications are written in a temporary file).
+
+Mac OS X
+^^^^^^^^
+
+``/dev/cdrom`` is an alias to the first CD-ROM.
+
+Currently there is no specific code to handle removable media, so it
+is better to use the ``change`` or ``eject`` monitor commands to
+change or eject media.
+
+Virtual FAT disk images
+~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU can automatically create a virtual FAT disk image from a
+directory tree. In order to use it, just type:
+
+.. parsed-literal::
+
+  |qemu_system| linux.img -hdb fat:/my_directory
+
+Then you can access all the files in the ``/my_directory``
+directory without having to copy them into a disk image or to export
+them via SAMBA or NFS. The default access is *read-only*.
+
+Floppies can be emulated with the ``:floppy:`` option:
+
+.. parsed-literal::
+
+  |qemu_system| linux.img -fda fat:floppy:/my_directory
+
+Read/write support is available for testing (beta stage) with the
+``:rw:`` option:
+
+.. parsed-literal::
+
+  |qemu_system| linux.img -fda fat:floppy:rw:/my_directory
+
+What you should *never* do:
+
+- use non-ASCII filenames
+- use "-snapshot" together with ":rw:"
+- expect it to work when loadvm'ing
+- write to the FAT directory on the host system while accessing it with
+  the guest system
+
+NBD access
+~~~~~~~~~~
+
+QEMU can directly access a block device exported using the Network Block
+Device (NBD) protocol.
+
+.. parsed-literal::
+
+  |qemu_system| linux.img -hdb nbd://my_nbd_server.mydomain.org:1024/
+
+If the NBD server is located on the same host, you can use a Unix socket
+instead of an inet socket:
+
+.. parsed-literal::
+
+  |qemu_system| linux.img -hdb nbd+unix://?socket=/tmp/my_socket
+
+In this case, the block device must be exported using ``qemu-nbd``:
+
+.. parsed-literal::
+
+  qemu-nbd --socket=/tmp/my_socket my_disk.qcow2
+
+The use of ``qemu-nbd`` allows sharing of a disk between several guests:
+
+.. parsed-literal::
+
+  qemu-nbd --socket=/tmp/my_socket --share=2 my_disk.qcow2
+
+and then you can use it with two guests:
+
+.. parsed-literal::
+
+  |qemu_system| linux1.img -hdb nbd+unix://?socket=/tmp/my_socket
+  |qemu_system| linux2.img -hdb nbd+unix://?socket=/tmp/my_socket
+
+If the ``nbd-server`` uses named exports (supported since NBD 2.9.18, or with
+QEMU's own embedded NBD server), you must specify an export name in the URI:
+
+.. parsed-literal::
+
+  |qemu_system| -cdrom nbd://localhost/debian-500-ppc-netinst
+  |qemu_system| -cdrom nbd://localhost/openSUSE-11.1-ppc-netinst
+
+The URI syntax for NBD is supported since QEMU 1.3. An alternative syntax is
+also available. Here are some examples of the older syntax:
+
+.. parsed-literal::
+
+  |qemu_system| linux.img -hdb nbd:my_nbd_server.mydomain.org:1024
+  |qemu_system| linux2.img -hdb nbd:unix:/tmp/my_socket
+  |qemu_system| -cdrom nbd:localhost:10809:exportname=debian-500-ppc-netinst
+
+iSCSI LUNs
+~~~~~~~~~~
+
+iSCSI is a popular protocol used to access SCSI devices across a computer
+network.
+
+There are two different ways iSCSI devices can be used by QEMU.
+
+The first method is to mount the iSCSI LUN on the host, making it appear as
+any other ordinary SCSI device on the host, and then to access this device as
+a /dev/sd device from QEMU. How to do this differs between host OSes.
+
+The second method involves using the iSCSI initiator that is built into
+QEMU. 
This provides a mechanism that works the same way regardless of which
+host OS you are running QEMU on. This section will describe this second
+method of using iSCSI together with QEMU.
+
+In QEMU, iSCSI devices are described using special iSCSI URLs. URL syntax:
+
+::
+
+  iscsi://[<username>[%<password>]@]<host>[:<port>]/<target-iqn-name>/<lun>
+
+Username and password are optional and only used if your target is set up
+using CHAP authentication for access control.
+Alternatively, the username and password can also be set via environment
+variables so that they do not show up in the process list:
+
+::
+
+  export LIBISCSI_CHAP_USERNAME=<username>
+  export LIBISCSI_CHAP_PASSWORD=<password>
+  iscsi://<host>/<target-iqn-name>/<lun>
+
+Various session-related parameters can be set via special options, either
+in a configuration file provided via ``-readconfig`` or directly on the
+command line.
+
+If the initiator-name is not specified, QEMU will use a default name
+of 'iqn.2008-11.org.linux-kvm[:<uuid>]' where <uuid> is the UUID of the
+virtual machine. If the UUID is not specified, QEMU will use
+'iqn.2008-11.org.linux-kvm[:<name>]' where <name> is the name of the
+virtual machine.
+
+Setting a specific initiator name to use when logging in to the target:
+
+::
+
+  -iscsi initiator-name=iqn.qemu.test:my-initiator
+
+Controlling which type of header digest to negotiate with the target:
+
+::
+
+  -iscsi header-digest=CRC32C|CRC32C-NONE|NONE-CRC32C|NONE
+
+These can also be set via a configuration file:
+
+::
+
+  [iscsi]
+  user = "CHAP username"
+  password = "CHAP password"
+  initiator-name = "iqn.qemu.test:my-initiator"
+  # header digest is one of CRC32C|CRC32C-NONE|NONE-CRC32C|NONE
+  header-digest = "CRC32C"
+
+Setting the target name allows different options for different targets:
+
+::
+
+  [iscsi "iqn.target.name"]
+  user = "CHAP username"
+  password = "CHAP password"
+  initiator-name = "iqn.qemu.test:my-initiator"
+  # header digest is one of CRC32C|CRC32C-NONE|NONE-CRC32C|NONE
+  header-digest = "CRC32C"
+
+How to use a configuration file to set iSCSI configuration options:
+
+.. parsed-literal::
+
+  cat >iscsi.conf <<EOF
+  [iscsi]
+  user = "me"
+  password = "my password"
+  initiator-name = "iqn.qemu.test:my-initiator"
+  header-digest = "CRC32C"
+  EOF
+
+  |qemu_system| -drive file=iscsi://127.0.0.1/iqn.qemu.test/1 \\
+    -readconfig iscsi.conf
+
+How to set up a simple iSCSI target on loopback and access it via QEMU:
+this example shows how to set up an iSCSI target with one CDROM and one DISK
+using the Linux STGT software target. This target is available on Red
+Hat-based systems as the package 'scsi-target-utils'.
+
+.. parsed-literal::
+
+  tgtd --iscsi portal=127.0.0.1:3260
+  tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.qemu.test
+  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \\
+    -b /IMAGES/disk.img --device-type=disk
+  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 2 \\
+    -b /IMAGES/cd.iso --device-type=cd
+  tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
+
+  |qemu_system| -iscsi initiator-name=iqn.qemu.test:my-initiator \\
+    -boot d -drive file=iscsi://127.0.0.1/iqn.qemu.test/1 \\
+    -cdrom iscsi://127.0.0.1/iqn.qemu.test/2
+
+GlusterFS disk images
+~~~~~~~~~~~~~~~~~~~~~
+
+GlusterFS is a user-space distributed file system.
+
+You can boot from a GlusterFS disk image with the command:
+
+URI:
+
+.. parsed-literal::
+
+  |qemu_system| -drive file=gluster[+TYPE]://[HOST[:PORT]]/VOLUME/PATH
+                       [?socket=...][,file.debug=9][,file.logfile=...] 
+
+JSON:
+
+.. parsed-literal::
+
+  |qemu_system| 'json:{"driver":"qcow2",
+                       "file":{"driver":"gluster",
+                               "volume":"testvol","path":"a.img","debug":9,"logfile":"...",
+                               "server":[{"type":"tcp","host":"...","port":"..."},
+                                         {"type":"unix","socket":"..."}]}}'
+
+*gluster* is the protocol.
+
+*TYPE* specifies the transport type used to connect to the gluster
+management daemon (glusterd). Valid transport types are
+tcp and unix. In the URI form, if a transport type isn't specified,
+then the tcp type is assumed.
+
+*HOST* specifies the server where the volume file specification for
+the given volume resides. This can be either a hostname or an IPv4 address.
+If the transport type is unix, then the *HOST* field should not be specified.
+Instead, the *socket* field needs to be populated with the path to the Unix
+domain socket.
+
+*PORT* is the port number on which glusterd is listening. This is optional
+and if not specified, it defaults to port 24007. If the transport type is
+unix, then *PORT* should not be specified.
+
+*VOLUME* is the name of the gluster volume which contains the disk image.
+
+*PATH* is the path to the actual disk image that resides on the gluster
+volume.
+
+*debug* is the logging level of the gluster protocol driver. Debug levels
+are 0-9, with 9 being the most verbose, and 0 representing no debugging
+output. The default level is 4. The current logging levels defined in the
+gluster source are: 0 - None, 1 - Emergency, 2 - Alert, 3 - Critical,
+4 - Error, 5 - Warning, 6 - Notice, 7 - Info, 8 - Debug, 9 - Trace.
+
+*logfile* specifies the path of the file to which the gluster protocol
+driver logs are written; this also helps in persisting the gfapi logs. The
+default is stderr.
+
+You can create a GlusterFS disk image with the command:
+
+.. parsed-literal::
+
+  qemu-img create gluster://HOST/VOLUME/PATH SIZE
+
+Examples:
+
+.. parsed-literal::
+
+  |qemu_system| -drive file=gluster://1.2.3.4/testvol/a.img
+  |qemu_system| -drive file=gluster+tcp://1.2.3.4/testvol/a.img
+  |qemu_system| -drive file=gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
+  |qemu_system| -drive file=gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
+  |qemu_system| -drive file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
+  |qemu_system| -drive file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
+  |qemu_system| -drive file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
+  |qemu_system| -drive file=gluster+rdma://1.2.3.4:24007/testvol/a.img
+  |qemu_system| -drive file=gluster://1.2.3.4/testvol/a.img,file.debug=9,file.logfile=/var/log/qemu-gluster.log
+  |qemu_system| 'json:{"driver":"qcow2",
+                       "file":{"driver":"gluster",
+                               "volume":"testvol","path":"a.img",
+                               "debug":9,"logfile":"/var/log/qemu-gluster.log",
+                               "server":[{"type":"tcp","host":"1.2.3.4","port":24007},
+                                         {"type":"unix","socket":"/var/run/glusterd.socket"}]}}'
+  |qemu_system| -drive driver=qcow2,file.driver=gluster,file.volume=testvol,file.path=/path/a.img,
+                       file.debug=9,file.logfile=/var/log/qemu-gluster.log,
+                       file.server.0.type=tcp,file.server.0.host=1.2.3.4,file.server.0.port=24007,
+                       file.server.1.type=unix,file.server.1.socket=/var/run/glusterd.socket
+
+Secure Shell (ssh) disk images
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can access disk images located on a remote ssh server
+by using the ssh protocol:
+
+.. parsed-literal::
+
+  |qemu_system| -drive file=ssh://[USER@]SERVER[:PORT]/PATH[?host_key_check=HOST_KEY_CHECK]
+
+Alternative syntax using properties:
+
+.. 
parsed-literal:: + + |qemu_system| -drive file.driver=ssh[,file.user=USER],file.host=SERVER[,file.port=PORT],file.path=PATH[,file.host_key_check=HOST_KEY_CHECK] + +*ssh* is the protocol. + +*USER* is the remote user. If not specified, then the local +username is tried. + +*SERVER* specifies the remote ssh server. Any ssh server can be +used, but it must implement the sftp-server protocol. Most Unix/Linux +systems should work without requiring any extra configuration. + +*PORT* is the port number on which sshd is listening. By default +the standard ssh port (22) is used. + +*PATH* is the path to the disk image. + +The optional *HOST_KEY_CHECK* parameter controls how the remote +host's key is checked. The default is ``yes`` which means to use +the local ``.ssh/known_hosts`` file. Setting this to ``no`` +turns off known-hosts checking. Or you can check that the host key +matches a specific fingerprint: +``host_key_check=md5:78:45:8e:14:57:4f:d5:45:83:0a:0e:f3:49:82:c9:c8`` +(``sha1:`` can also be used as a prefix, but note that OpenSSH +tools only use MD5 to print fingerprints). + +Currently authentication must be done using ssh-agent. Other +authentication methods may be supported in future. + +Note: Many ssh servers do not support an ``fsync``-style operation. +The ssh driver cannot guarantee that disk flush requests are +obeyed, and this causes a risk of disk corruption if the remote +server or network goes down during writes. The driver will +print a warning when ``fsync`` is not supported: + +:: + + warning: ssh server ssh.example.com:22 does not support fsync + +With sufficiently new versions of libssh and OpenSSH, ``fsync`` is +supported. + +NVMe disk images +~~~~~~~~~~~~~~~~ + +NVM Express (NVMe) storage controllers can be accessed directly by a userspace +driver in QEMU. This bypasses the host kernel file system and block layers +while retaining QEMU block layer functionalities, such as block jobs, I/O +throttling, image formats, etc. Disk I/O performance is typically higher than +with ``-drive file=/dev/sda`` using either thread pool or linux-aio. + +The controller will be exclusively used by the QEMU process once started. To be +able to share storage between multiple VMs and other applications on the host, +please use the file based protocols. + +Before starting QEMU, bind the host NVMe controller to the host vfio-pci +driver. For example: + +.. parsed-literal:: + + # modprobe vfio-pci + # lspci -n -s 0000:06:0d.0 + 06:0d.0 0401: 1102:0002 (rev 08) + # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind + # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id + + # |qemu_system| -drive file=nvme://HOST:BUS:SLOT.FUNC/NAMESPACE + +Alternative syntax using properties: + +.. parsed-literal:: + + |qemu_system| -drive file.driver=nvme,file.device=HOST:BUS:SLOT.FUNC,file.namespace=NAMESPACE + +*HOST*:*BUS*:*SLOT*.\ *FUNC* is the NVMe controller's PCI device +address on the host. + +*NAMESPACE* is the NVMe namespace number, starting from 1. + +Disk image file locking +~~~~~~~~~~~~~~~~~~~~~~~ + +By default, QEMU tries to protect image files from unexpected concurrent +access, as long as it's supported by the block protocol driver and host +operating system. If multiple QEMU processes (including QEMU emulators and +utilities) try to open the same image with conflicting accessing modes, all but +the first one will get an error. 
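+
+For example, with locking active, a second process that tries to open the
+same image with a conflicting access mode while a VM is still running is
+expected to be refused (an illustrative sketch; ``test.qcow2`` is a
+placeholder image name)::
+
+  # the first instance opens the image read-write and acquires the lock
+  qemu-system-x86_64 -drive file=test.qcow2 &
+
+  # a concurrent open with a conflicting access mode fails with a lock error
+  qemu-img check test.qcow2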
+
+This feature is currently supported by the file protocol on Linux with the
+Open File Descriptor (OFD) locking API, and can fall back to POSIX locking
+if the host doesn't support Linux OFD locking.
+
+To explicitly enable image locking, specify "locking=on" in the file protocol
+driver options. If OFD locking is not possible, a warning will be printed and
+the POSIX locking API will be used. In this case there is a risk that the lock
+will get silently lost when doing hot plugging and block jobs, due to the
+shortcomings of the POSIX locking API.
+
+QEMU transparently handles lock handover during shared storage migration. For
+virtual disk images shared between multiple VMs, the "share-rw" device option
+should be used.
+
+By default, the guest has exclusive write access to its disk image. If the
+guest can safely share the disk image with other writers, the
+``-device ...,share-rw=on`` parameter can be used. This is only safe if
+the guest is running software, such as a cluster file system, that
+coordinates disk accesses to avoid corruption.
+
+Note that share-rw=on only declares the guest's ability to share the disk.
+Some QEMU features, such as image file formats, require exclusive write access
+to the disk image, and this is unaffected by the share-rw=on option.
+
+Alternatively, locking can be fully disabled with the "locking=off" block
+device option. In the command line, the option is usually in the form of
+"file.locking=off", as the protocol driver is normally placed as a "file"
+child under a format driver. For example:
+
+::
+
+  -blockdev driver=qcow2,file.filename=/path/to/image,file.locking=off,file.driver=file
+
+To check if image locking is active, check the output of the "lslocks" command
+on the host and see if there are locks held by the QEMU process on the image
+file. More than one byte could be locked by the QEMU instance, each byte of
+which reflects a particular permission that is acquired or protected by the
+running block driver.
+
+Filter drivers
+~~~~~~~~~~~~~~
+
+QEMU supports several filter drivers, which don't store any data but perform
+some additional tasks, hooking I/O requests.
+
+.. program:: filter-drivers
+.. option:: preallocate
+
+  The preallocate filter driver is intended to be inserted between format
+  and protocol nodes and preallocates some additional space
+  (expanding the protocol file) when writing past the file’s end. This can be
+  useful for file systems with slow allocation.
+
+  Supported options:
+
+  .. program:: preallocate
+  .. option:: prealloc-align
+
+    On preallocation, align the file length to this value (in bytes),
+    default 1M.
+
+  .. program:: preallocate
+  .. option:: prealloc-size
+
+    How much to preallocate (in bytes), default 128M.
diff --git a/docs/system/qemu-cpu-models.rst b/docs/system/qemu-cpu-models.rst
new file mode 100644
index 000000000..5cf6e46f8
--- /dev/null
+++ b/docs/system/qemu-cpu-models.rst
@@ -0,0 +1,24 @@
+:orphan:
+
+==================================
+QEMU / KVM CPU model configuration
+==================================
+
+--------
+Synopsis
+--------
+
+QEMU CPU Modelling Infrastructure manual
+
+-----------
+Description
+-----------
+
+.. include:: cpu-models-x86.rst.inc
+.. include:: cpu-models-mips.rst.inc
+
+--------
+See also
+--------
+
+The HTML documentation of QEMU for more precise information and Linux user mode emulator invocation. 
diff --git a/docs/system/qemu-manpage.rst b/docs/system/qemu-manpage.rst
new file mode 100644
index 000000000..c47a41275
--- /dev/null
+++ b/docs/system/qemu-manpage.rst
@@ -0,0 +1,51 @@
+:orphan:
+
+..
+  This file is the skeleton for the qemu.1 manpage. It mostly
+  should simply include the .rst.inc files corresponding to the
+  parts of the documentation that go in the manpage as well as the
+  HTML manual.
+
+=======================
+QEMU User Documentation
+=======================
+
+--------
+Synopsis
+--------
+
+.. parsed-literal::
+
+  |qemu_system| [options] [disk_image]
+
+-----------
+Description
+-----------
+
+.. include:: target-i386-desc.rst.inc
+
+-------
+Options
+-------
+
+disk_image is a raw hard disk image for IDE hard disk 0. Some targets do
+not need a disk image.
+
+.. hxtool-doc:: qemu-options.hx
+
+.. include:: keys.rst.inc
+
+.. include:: mux-chardev.rst.inc
+
+-----
+Notes
+-----
+
+.. include:: device-url-syntax.rst.inc
+
+--------
+See also
+--------
+
+The HTML documentation of QEMU for more precise information and Linux
+user mode emulator invocation.
diff --git a/docs/system/quickstart.rst b/docs/system/quickstart.rst
new file mode 100644
index 000000000..681678c86
--- /dev/null
+++ b/docs/system/quickstart.rst
@@ -0,0 +1,21 @@
+.. _pcsys_005fquickstart:
+
+Quick Start
+-----------
+
+Download and uncompress a PC hard disk image with Linux installed (e.g.
+``linux.img``) and type:
+
+.. parsed-literal::
+
+  |qemu_system| linux.img
+
+Linux should boot and give you a prompt.
+
+Users should be aware that the above example elides a lot of the complexity
+of setting up a VM with x86_64-specific defaults and assumes the
+first non-switch argument is a PC-compatible disk image with a boot
+sector. For a non-x86 system where we emulate a broad range of machine
+types, the command lines are generally more explicit in defining the
+machine and boot behaviour. You will find more example command lines
+in the :ref:`system-targets-ref` section of the manual.
diff --git a/docs/system/riscv/microchip-icicle-kit.rst b/docs/system/riscv/microchip-icicle-kit.rst
new file mode 100644
index 000000000..40798b1aa
--- /dev/null
+++ b/docs/system/riscv/microchip-icicle-kit.rst
@@ -0,0 +1,149 @@
+Microchip PolarFire SoC Icicle Kit (``microchip-icicle-kit``)
+=============================================================
+
+The Microchip PolarFire SoC Icicle Kit integrates a PolarFire SoC, with one
+SiFive E51 core, four U54 cores, many on-chip peripherals and an FPGA.
+
+For more details about the Microchip PolarFire SoC, please see:
+https://www.microsemi.com/product-directory/soc-fpgas/5498-polarfire-soc-fpga
+
+The Icicle Kit board information can be found here:
+https://www.microsemi.com/existing-parts/parts/152514
+
+Supported devices
+-----------------
+
+The ``microchip-icicle-kit`` machine supports the following devices:
+
+* 1 E51 core
+* 4 U54 cores
+* Core Local Interruptor (CLINT)
+* Platform-Level Interrupt Controller (PLIC)
+* L2 Loosely Integrated Memory (L2-LIM)
+* DDR memory controller
+* 5 MMUARTs
+* 1 DMA controller
+* 2 GEM Ethernet controllers
+* 1 SDHC storage controller
+
+Boot options
+------------
+
+The ``microchip-icicle-kit`` machine can start using the standard -bios
+functionality for loading its BIOS image, aka Hart Software Services (HSS_).
+HSS loads the second stage bootloader U-Boot from an SD card. A kernel can
+then be loaded from U-Boot. The machine also supports direct kernel booting
+via the -kernel option along with the device tree blob via -dtb. 
When direct kernel
+boot is used, the OpenSBI fw_dynamic BIOS image is used to boot a payload
+like U-Boot or an OS kernel directly.
+
+The user-provided DTB should meet the following requirements:
+
+* The /cpus node should contain at least one subnode for E51 and the number
+  of subnodes should match QEMU's ``-smp`` option
+* The /memory reg size should match QEMU’s selected ram_size via ``-m``
+* It should contain a node for the CLINT device with a compatible string
+  "riscv,clint0"
+
+QEMU follows the truth table below to select which payload to execute:
+
+===== ========== ========== =======
+-bios -kernel    -dtb       payload
+===== ========== ========== =======
+  N       N      don't care   HSS
+  Y   don't care don't care   HSS
+  N       Y          Y       kernel
+===== ========== ========== =======
+
+The memory is set to 1537 MiB by default, which is the minimum high memory
+size required by HSS. A sanity check on the RAM size is performed in the
+machine init routine, prompting the user to increase the RAM size when less
+than 1537 MiB is detected.
+
+Running HSS
+-----------
+
+The HSS 2020.12 release is tested at the time of writing. To build an HSS
+image that can be booted by the ``microchip-icicle-kit`` machine, type the
+following in the HSS source tree:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ cp boards/mpfs-icicle-kit-es/def_config .config
+  $ make BOARD=mpfs-icicle-kit-es
+
+Download the official SD card image released by Microchip and prepare it for
+QEMU usage:
+
+.. code-block:: bash
+
+  $ wget ftp://ftpsoc.microsemi.com/outgoing/core-image-minimal-dev-icicle-kit-es-sd-20201009141623.rootfs.wic.gz
+  $ gunzip core-image-minimal-dev-icicle-kit-es-sd-20201009141623.rootfs.wic.gz
+  $ qemu-img resize core-image-minimal-dev-icicle-kit-es-sd-20201009141623.rootfs.wic 4G
+
+Then we can boot the machine with:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M microchip-icicle-kit -smp 5 \
+      -bios path/to/hss.bin -sd path/to/sdcard.img \
+      -nic user,model=cadence_gem \
+      -nic tap,ifname=tap,model=cadence_gem,script=no \
+      -display none -serial stdio \
+      -chardev socket,id=serial1,path=serial1.sock,server=on,wait=on \
+      -serial chardev:serial1
+
+With the above command line, the current terminal session is used for the
+first serial port. Open another terminal window and use ``minicom`` to
+connect to the second serial port.
+
+.. code-block:: bash
+
+  $ minicom -D unix\#serial1.sock
+
+HSS output is on the first serial port (stdio) and U-Boot output is on the
+second serial port. U-Boot will automatically load the Linux kernel from
+the SD card image.
+
+Direct Kernel Boot
+------------------
+
+Sometimes we just want to test booting a new kernel, and transforming the
+kernel image into the format required by the HSS bootflow is tedious. We can
+use '-kernel' for direct kernel booting just like other RISC-V machines do.
+
+In this mode, the OpenSBI fw_dynamic BIOS image for the 'generic' platform
+is used to boot an S-mode payload like U-Boot or an OS kernel directly.
+
+For example, the following commands show building a U-Boot image from U-Boot
+mainline v2021.07 for the Microchip Icicle Kit board:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ make microchip_mpfs_icicle_defconfig
+
+Then we can boot the machine by:
+
+.. 
code-block:: bash
+
+  $ qemu-system-riscv64 -M microchip-icicle-kit -smp 5 -m 2G \
+      -sd path/to/sdcard.img \
+      -nic user,model=cadence_gem \
+      -nic tap,ifname=tap,model=cadence_gem,script=no \
+      -display none -serial stdio \
+      -kernel path/to/u-boot/build/dir/u-boot.bin \
+      -dtb path/to/u-boot/build/dir/u-boot.dtb
+
+CAVEATS:
+
+* Check the "stdout-path" property in the /chosen node in the DTB to determine
+  which serial port is used for the serial console, e.g.: if the console is set
+  to the second serial port, change to use "-serial null -serial stdio".
+* The default U-Boot configuration uses CONFIG_OF_SEPARATE, hence the ELF image
+  ``u-boot`` cannot be passed to "-kernel" as it does not contain the DTB;
+  ``u-boot.bin``, which does contain one, has to be used instead. To use the
+  ELF image, we need to change to CONFIG_OF_EMBED or CONFIG_OF_PRIOR_STAGE.
+
+.. _HSS: https://github.com/polarfire-soc/hart-software-services
diff --git a/docs/system/riscv/shakti-c.rst b/docs/system/riscv/shakti-c.rst
new file mode 100644
index 000000000..fea57f7b6
--- /dev/null
+++ b/docs/system/riscv/shakti-c.rst
@@ -0,0 +1,82 @@
+Shakti C Reference Platform (``shakti_c``)
+==========================================
+
+The Shakti C Reference Platform is a reference platform based on the Arty A7
+100T for the Shakti SoC.
+
+The Shakti SoC is based on the Shakti C-class processor core. Shakti C
+is a 64-bit RV64GCSUN processor core.
+
+For more details on the Shakti SoC, please see:
+https://gitlab.com/shaktiproject/cores/shakti-soc/-/blob/master/fpga/boards/artya7-100t/c-class/README.rst
+
+For more info on the Shakti C-class core, please see:
+https://c-class.readthedocs.io/en/latest/
+
+Supported devices
+-----------------
+
+The ``shakti_c`` machine supports the following devices:
+
+ * 1 C-class core
+ * Core Local Interruptor (CLINT)
+ * Platform-Level Interrupt Controller (PLIC)
+ * 1 UART
+
+Boot options
+------------
+
+The ``shakti_c`` machine can start using the standard -bios
+functionality for loading a bare-metal application or OpenSBI.
+
+Boot the machine
+----------------
+
+Shakti SDK
+~~~~~~~~~~
+The Shakti SDK can be used to generate the bare-metal example UART
+applications.
+
+.. code-block:: bash
+
+  $ git clone https://gitlab.com/behindbytes/shakti-sdk.git
+  $ cd shakti-sdk
+  $ make software PROGRAM=loopback TARGET=artix7_100t
+
+The binary will be generated at:
+  software/examples/uart_applns/loopback/output/loopback.shakti
+
+You could also download the precompiled example applications using the
+commands below.
+
+.. code-block:: bash
+
+  $ wget -c https://gitlab.com/behindbytes/shakti-binaries/-/raw/master/sdk/shakti_sdk_qemu.zip
+  $ unzip shakti_sdk_qemu.zip
+
+Then we can run the UART example using:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M shakti_c -nographic \
+      -bios path/to/shakti_sdk_qemu/loopback.shakti
+
+OpenSBI
+~~~~~~~
+We can also run OpenSBI with the test payload.
+
+.. code-block:: bash
+
+  $ git clone https://github.com/riscv/opensbi.git -b v0.9
+  $ cd opensbi
+  $ wget -c https://gitlab.com/behindbytes/shakti-binaries/-/raw/master/dts/shakti.dtb
+  $ export CROSS_COMPILE=riscv64-unknown-elf-
+  $ export FW_FDT_PATH=./shakti.dtb
+  $ make PLATFORM=generic
+
+fw_payload.elf will be generated at
+build/platform/generic/firmware/fw_payload.elf.
+Boot it using the QEMU command below.
+
+.. 
code-block:: bash + + $ qemu-system-riscv64 -M shakti_c -nographic \ + -bios path/to/fw_payload.elf diff --git a/docs/system/riscv/sifive_u.rst b/docs/system/riscv/sifive_u.rst new file mode 100644 index 000000000..7b166567f --- /dev/null +++ b/docs/system/riscv/sifive_u.rst @@ -0,0 +1,375 @@ +SiFive HiFive Unleashed (``sifive_u``) +====================================== + +SiFive HiFive Unleashed Development Board is the ultimate RISC-V development +board featuring the Freedom U540 multi-core RISC-V processor. + +Supported devices +----------------- + +The ``sifive_u`` machine supports the following devices: + +* 1 E51 / E31 core +* Up to 4 U54 / U34 cores +* Core Local Interruptor (CLINT) +* Platform-Level Interrupt Controller (PLIC) +* Power, Reset, Clock, Interrupt (PRCI) +* L2 Loosely Integrated Memory (L2-LIM) +* DDR memory controller +* 2 UARTs +* 1 GEM Ethernet controller +* 1 GPIO controller +* 1 One-Time Programmable (OTP) memory with stored serial number +* 1 DMA controller +* 2 QSPI controllers +* 1 ISSI 25WP256 flash +* 1 SD card in SPI mode +* PWM0 and PWM1 + +Please note the real world HiFive Unleashed board has a fixed configuration of +1 E51 core and 4 U54 core combination and the RISC-V core boots in 64-bit mode. +With QEMU, one can create a machine with 1 E51 core and up to 4 U54 cores. It +is also possible to create a 32-bit variant with the same peripherals except +that the RISC-V cores are replaced by the 32-bit ones (E31 and U34), to help +testing of 32-bit guest software. + +Hardware configuration information +---------------------------------- + +The ``sifive_u`` machine automatically generates a device tree blob ("dtb") +which it passes to the guest, if there is no ``-dtb`` option. This provides +information about the addresses, interrupt lines and other configuration of +the various devices in the system. Guest software should discover the devices +that are present in the generated DTB instead of using a DTB for the real +hardware, as some of the devices are not modeled by QEMU and trying to access +these devices may cause unexpected behavior. + +If users want to provide their own DTB, they can use the ``-dtb`` option. +These DTBs should have the following requirements: + +* The /cpus node should contain at least one subnode for E51 and the number + of subnodes should match QEMU's ``-smp`` option +* The /memory reg size should match QEMU’s selected ram_size via ``-m`` +* Should contain a node for the CLINT device with a compatible string + "riscv,clint0" if using with OpenSBI BIOS images + +Boot options +------------ + +The ``sifive_u`` machine can start using the standard -kernel functionality +for loading a Linux kernel, a VxWorks kernel, a modified U-Boot bootloader +(S-mode) or ELF executable with the default OpenSBI firmware image as the +-bios. It also supports booting the unmodified U-Boot bootloader using the +standard -bios functionality. + +Machine-specific options +------------------------ + +The following machine-specific options are supported: + +- serial=nnn + + The board serial number. When not given, the default serial number 1 is used. + + SiFive reserves the first 1 KiB of the 16 KiB OTP memory for internal use. + The current usage is only used to store the serial number of the board at + offset 0xfc. U-Boot reads the serial number from the OTP memory, and uses + it to generate a unique MAC address to be programmed to the on-chip GEM + Ethernet controller. 
When multiple QEMU ``sifive_u`` machines are created
+  and connected to the same subnet, they will all have the same MAC address,
+  which creates an unusable network. In such a scenario, users should give
+  different values to serial= when creating different ``sifive_u`` machines.
+
+- start-in-flash
+
+  When given, QEMU's ROM codes jump to the QSPI memory-mapped flash directly.
+  Otherwise QEMU will jump to DRAM or L2LIM, depending on the msel= value;
+  when this option is not given, direct DRAM booting is the default.
+
+- msel=[6|11]
+
+  Mode Select (MSEL[3:0]) pins value, used to control where to boot from.
+
+  The FU540 SoC supports booting from several sources, which are controlled
+  using the Mode Select pins on the chip. Typically, the boot process runs
+  through several stages before it begins execution of user-provided programs.
+  These stages typically include the following:
+
+  1. Zeroth Stage Boot Loader (ZSBL), which is contained in an on-chip mask
+     ROM and provided by QEMU. Note that the ROM code implemented by QEMU is
+     not the same as what is programmed in the hardware: the QEMU version is
+     simplified, but it provides the same functionality as the hardware.
+  2. First Stage Boot Loader (FSBL), which brings up PLLs and DDR memory.
+     This is U-Boot SPL.
+  3. Second Stage Boot Loader (SSBL), which further initializes additional
+     peripherals as needed. This is U-Boot proper combined with an OpenSBI
+     fw_dynamic firmware image.
+
+  msel=6 means FSBL and SSBL are both on the QSPI flash. msel=11 means FSBL
+  and SSBL are both on the SD card.
+
+Running Linux kernel
+--------------------
+
+The Linux mainline v5.10 release is tested at the time of writing. To build a
+Linux mainline kernel that can be booted by the ``sifive_u`` machine in
+64-bit mode, simply configure the kernel using the defconfig configuration:
+
+.. code-block:: bash
+
+  $ export ARCH=riscv
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ make defconfig
+  $ make
+
+To boot the newly built Linux kernel in QEMU with the ``sifive_u`` machine:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M sifive_u -smp 5 -m 2G \
+      -display none -serial stdio \
+      -kernel arch/riscv/boot/Image \
+      -initrd /path/to/rootfs.ext4 \
+      -append "root=/dev/ram"
+
+Alternatively, we can use a custom DTB to boot the machine, by inserting a
+CLINT node in fu540-c000.dtsi in the Linux kernel,
+
+.. code-block:: none
+
+  clint: clint@2000000 {
+      compatible = "riscv,clint0";
+      interrupts-extended = <&cpu0_intc 3 &cpu0_intc 7
+                             &cpu1_intc 3 &cpu1_intc 7
+                             &cpu2_intc 3 &cpu2_intc 7
+                             &cpu3_intc 3 &cpu3_intc 7
+                             &cpu4_intc 3 &cpu4_intc 7>;
+      reg = <0x00 0x2000000 0x00 0x10000>;
+  };
+
+with the following command line options:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M sifive_u -smp 5 -m 8G \
+      -display none -serial stdio \
+      -kernel arch/riscv/boot/Image \
+      -dtb arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dtb \
+      -initrd /path/to/rootfs.ext4 \
+      -append "root=/dev/ram"
+
+To build a Linux mainline kernel that can be booted by the ``sifive_u``
+machine in 32-bit mode, use the rv32_defconfig configuration. A patch is
+required to fix the 32-bit boot issue for Linux kernel v5.10.
+
+.. 
code-block:: bash
+
+  $ export ARCH=riscv
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ curl https://patchwork.kernel.org/project/linux-riscv/patch/20201219001356.2887782-1-atish.patra@wdc.com/mbox/ > riscv.patch
+  $ git am riscv.patch
+  $ make rv32_defconfig
+  $ make
+
+Replace ``qemu-system-riscv64`` with ``qemu-system-riscv32`` in the command
+line above to boot the 32-bit Linux kernel. A rootfs image containing 32-bit
+applications should be used in order for the kernel to boot to user space.
+
+Running VxWorks kernel
+----------------------
+
+The VxWorks 7 SR0650 release is tested at the time of writing. To build a
+64-bit VxWorks mainline kernel that can be booted by the ``sifive_u``
+machine, simply create a VxWorks source build project based on the
+sifive_generic BSP, and a VxWorks image project to generate the bootable
+VxWorks image, by following the BSP documentation instructions.
+
+A pre-built 64-bit VxWorks 7 image for the HiFive Unleashed board is
+available as part of the VxWorks SDK for testing as well. Instructions to
+download the SDK:
+
+.. code-block:: bash
+
+  $ wget https://labs.windriver.com/downloads/wrsdk-vxworks7-sifive-hifive-1.01.tar.bz2
+  $ tar xvf wrsdk-vxworks7-sifive-hifive-1.01.tar.bz2
+  $ ls bsps/sifive_generic_1_0_0_0/uboot/uVxWorks
+
+To boot the VxWorks kernel in QEMU with the ``sifive_u`` machine, use:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M sifive_u -smp 5 -m 2G \
+      -display none -serial stdio \
+      -nic tap,ifname=tap0,script=no,downscript=no \
+      -kernel /path/to/vxWorks \
+      -append "gem(0,0)host:vxWorks h=192.168.200.1 e=192.168.200.2:ffffff00 u=target pw=vxTarget f=0x01"
+
+It is also possible to test 32-bit VxWorks on the ``sifive_u`` machine. Create
+a 32-bit project to build the 32-bit VxWorks image, and use exactly the same
+command line options with ``qemu-system-riscv32``.
+
+Running U-Boot
+--------------
+
+The U-Boot mainline v2021.07 release is tested at the time of writing. To
+build a U-Boot mainline bootloader that can be booted by the ``sifive_u``
+machine, use the sifive_unleashed_defconfig with similar commands as
+described above for Linux:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ export OPENSBI=/path/to/opensbi-riscv64-generic-fw_dynamic.bin
+  $ make sifive_unleashed_defconfig
+
+You will get the spl/u-boot-spl.bin and u-boot.itb files in the build tree.
+
+To start U-Boot using the ``sifive_u`` machine, prepare an SPI flash image or
+an SD card image that is properly partitioned and populated with the correct
+contents. genimage_ can be used to generate these images.
+
+A sample configuration file for a 128 MiB SD card image is:
+
+.. code-block:: bash
+
+  $ cat genimage_sdcard.cfg
+  image sdcard.img {
+      size = 128M
+
+      hdimage {
+          gpt = true
+      }
+
+      partition u-boot-spl {
+          image = "u-boot-spl.bin"
+          offset = 17K
+          partition-type-uuid = 5B193300-FC78-40CD-8002-E86C45580B47
+      }
+
+      partition u-boot {
+          image = "u-boot.itb"
+          offset = 1041K
+          partition-type-uuid = 2E54B353-1271-4842-806F-E436D6AF6985
+      }
+  }
+
+The SPI flash image has slightly different partition offsets, and its size
+has to be 32 MiB to match the ISSI 25WP256 flash on the real board:
+
+.. 
code-block:: bash
+
+  $ cat genimage_spi-nor.cfg
+  image spi-nor.img {
+      size = 32M
+
+      hdimage {
+          gpt = true
+      }
+
+      partition u-boot-spl {
+          image = "u-boot-spl.bin"
+          offset = 20K
+          partition-type-uuid = 5B193300-FC78-40CD-8002-E86C45580B47
+      }
+
+      partition u-boot {
+          image = "u-boot.itb"
+          offset = 1044K
+          partition-type-uuid = 2E54B353-1271-4842-806F-E436D6AF6985
+      }
+  }
+
+Assuming the U-Boot binaries are put in the same directory as the config
+file, we can generate the image by:
+
+.. code-block:: bash
+
+  $ genimage --config genimage_<boot_src>.cfg --inputpath .
+
+Boot U-Boot from the SD card by specifying msel=11 and passing the SD card
+image to the QEMU ``sifive_u`` machine:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M sifive_u,msel=11 -smp 5 -m 8G \
+      -display none -serial stdio \
+      -bios /path/to/u-boot-spl.bin \
+      -drive file=/path/to/sdcard.img,if=sd
+
+Changing the msel= value to 6 allows booting U-Boot from the SPI flash:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M sifive_u,msel=6 -smp 5 -m 8G \
+      -display none -serial stdio \
+      -bios /path/to/u-boot-spl.bin \
+      -drive file=/path/to/spi-nor.img,if=mtd
+
+Note that when testing U-Boot, QEMU's automatically generated device tree
+blob is not used, because U-Boot itself embeds device tree blobs for U-Boot
+SPL and U-Boot proper. Hence the number of cores and the size of memory have
+to match the real hardware, i.e. 5 cores (-smp 5) and 8 GiB memory (-m 8G).
+
+The above use case runs upstream U-Boot for the SiFive HiFive Unleashed
+board on the QEMU ``sifive_u`` machine out of the box. This allows users to
+develop and test the recommended RISC-V boot flow with a real world use
+case: ZSBL (in QEMU) loads U-Boot SPL from SD card or SPI flash to L2LIM,
+then U-Boot SPL loads the combined payload image of OpenSBI fw_dynamic
+firmware and U-Boot proper.
+
+However, sometimes we want a quick test of booting U-Boot on QEMU without
+the need to prepare SPI flash or SD card images. An alternate way is to
+create a U-Boot S-mode image by modifying the configuration of U-Boot:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ make sifive_unleashed_defconfig
+  $ make menuconfig
+
+then manually select the following configuration:
+
+  * Device Tree Control ---> Provider of DTB for DT Control ---> Prior Stage bootloader DTB
+
+and unselect the following configuration:
+
+  * Library routines ---> Allow access to binman information in the device tree
+
+This changes U-Boot to use the QEMU-generated device tree blob, and bypasses
+running the U-Boot SPL stage.
+
+Boot the 64-bit U-Boot S-mode image directly:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M sifive_u -smp 5 -m 2G \
+      -display none -serial stdio \
+      -kernel /path/to/u-boot.bin
+
+It's possible to create a 32-bit U-Boot S-mode image as well.
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ make sifive_unleashed_defconfig
+  $ make menuconfig
+
+then manually update the following configuration in U-Boot:
+
+  * Device Tree Control ---> Provider of DTB for DT Control ---> Prior Stage bootloader DTB
+  * RISC-V architecture ---> Base ISA ---> RV32I
+  * Boot options ---> Boot images ---> Text Base ---> 0x80400000
+
+and unselect the following configuration:
+
+  * Library routines ---> Allow access to binman information in the device tree
+
+Use the same command line options to boot the 32-bit U-Boot S-mode image:
+
+.. 
code-block:: bash + + $ qemu-system-riscv32 -M sifive_u -smp 5 -m 2G \ + -display none -serial stdio \ + -kernel /path/to/u-boot.bin + +.. _genimage: https://github.com/pengutronix/genimage diff --git a/docs/system/riscv/virt.rst b/docs/system/riscv/virt.rst new file mode 100644 index 000000000..fa016584b --- /dev/null +++ b/docs/system/riscv/virt.rst @@ -0,0 +1,148 @@ +'virt' Generic Virtual Platform (``virt``) +========================================== + +The ``virt`` board is a platform which does not correspond to any real hardware; +it is designed for use in virtual machines. It is the recommended board type +if you simply want to run a guest such as Linux and do not care about +reproducing the idiosyncrasies and limitations of a particular bit of +real-world hardware. + +Supported devices +----------------- + +The ``virt`` machine supports the following devices: + +* Up to 8 generic RV32GC/RV64GC cores, with optional extensions +* Core Local Interruptor (CLINT) +* Platform-Level Interrupt Controller (PLIC) +* CFI parallel NOR flash memory +* 1 NS16550 compatible UART +* 1 Google Goldfish RTC +* 1 SiFive Test device +* 8 virtio-mmio transport devices +* 1 generic PCIe host bridge +* The fw_cfg device that allows a guest to obtain data from QEMU + +Note that the default CPU is a generic RV32GC/RV64GC. Optional extensions +can be enabled via command line parameters, e.g.: ``-cpu rv64,x-h=true`` +enables the hypervisor extension for RV64. + +Hardware configuration information +---------------------------------- + +The ``virt`` machine automatically generates a device tree blob ("dtb") +which it passes to the guest, if there is no ``-dtb`` option. This provides +information about the addresses, interrupt lines and other configuration of +the various devices in the system. Guest software should discover the devices +that are present in the generated DTB. + +If users want to provide their own DTB, they can use the ``-dtb`` option. +These DTBs should have the following requirements: + +* The number of subnodes of the /cpus node should match QEMU's ``-smp`` option +* The /memory reg size should match QEMU’s selected ram_size via ``-m`` +* Should contain a node for the CLINT device with a compatible string + "riscv,clint0" if using with OpenSBI BIOS images + +Boot options +------------ + +The ``virt`` machine can start using the standard -kernel functionality +for loading a Linux kernel, a VxWorks kernel, an S-mode U-Boot bootloader +with the default OpenSBI firmware image as the -bios. It also supports +the recommended RISC-V bootflow: U-Boot SPL (M-mode) loads OpenSBI fw_dynamic +firmware and U-Boot proper (S-mode), using the standard -bios functionality. + +Machine-specific options +------------------------ + +The following machine-specific options are supported: + +- aclint=[on|off] + + When this option is "on", ACLINT devices will be emulated instead of + SiFive CLINT. When not specified, this option is assumed to be "off". + +Running Linux kernel +-------------------- + +Linux mainline v5.12 release is tested at the time of writing. To build a +Linux mainline kernel that can be booted by the ``virt`` machine in +64-bit mode, simply configure the kernel using the defconfig configuration: + +.. code-block:: bash + + $ export ARCH=riscv + $ export CROSS_COMPILE=riscv64-linux- + $ make defconfig + $ make + +To boot the newly built Linux kernel in QEMU with the ``virt`` machine: + +.. 
code-block:: bash
+
+  $ qemu-system-riscv64 -M virt -smp 4 -m 2G \
+      -display none -serial stdio \
+      -kernel arch/riscv/boot/Image \
+      -initrd /path/to/rootfs.cpio \
+      -append "root=/dev/ram"
+
+To build a Linux mainline kernel that can be booted by the ``virt`` machine
+in 32-bit mode, use the rv32_defconfig configuration. A patch is required to
+fix the 32-bit boot issue for Linux kernel v5.12.
+
+.. code-block:: bash
+
+  $ export ARCH=riscv
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ curl https://patchwork.kernel.org/project/linux-riscv/patch/20210627135117.28641-1-bmeng.cn@gmail.com/mbox/ > riscv.patch
+  $ git am riscv.patch
+  $ make rv32_defconfig
+  $ make
+
+Replace ``qemu-system-riscv64`` with ``qemu-system-riscv32`` in the command
+line above to boot the 32-bit Linux kernel. A rootfs image containing 32-bit
+applications should be used in order for the kernel to boot to user space.
+
+Running U-Boot
+--------------
+
+The U-Boot mainline v2021.04 release is tested at the time of writing. To
+build an S-mode U-Boot bootloader that can be booted by the ``virt`` machine,
+use the qemu-riscv64_smode_defconfig with similar commands as described above
+for Linux:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ make qemu-riscv64_smode_defconfig
+
+Boot the 64-bit U-Boot S-mode image directly:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M virt -smp 4 -m 2G \
+      -display none -serial stdio \
+      -kernel /path/to/u-boot.bin
+
+To test booting U-Boot SPL, which runs in M-mode and in turn loads a FIT
+image that bundles OpenSBI fw_dynamic firmware and U-Boot proper (S-mode)
+together, build the U-Boot images using qemu-riscv64_spl_defconfig:
+
+.. code-block:: bash
+
+  $ export CROSS_COMPILE=riscv64-linux-
+  $ export OPENSBI=/path/to/opensbi-riscv64-generic-fw_dynamic.bin
+  $ make qemu-riscv64_spl_defconfig
+
+The minimal QEMU commands to run U-Boot SPL are:
+
+.. code-block:: bash
+
+  $ qemu-system-riscv64 -M virt -smp 4 -m 2G \
+      -display none -serial stdio \
+      -bios /path/to/u-boot-spl \
+      -device loader,file=/path/to/u-boot.itb,addr=0x80200000
+
+To test 32-bit U-Boot images, switch to the qemu-riscv32_smode_defconfig and
+qemu-riscv32_spl_defconfig builds, and replace ``qemu-system-riscv64`` with
+``qemu-system-riscv32`` in the command lines above to boot the 32-bit U-Boot.
diff --git a/docs/system/s390x/3270.rst b/docs/system/s390x/3270.rst
new file mode 100644
index 000000000..0e173b323
--- /dev/null
+++ b/docs/system/s390x/3270.rst
@@ -0,0 +1,63 @@
+3270 devices
+============
+
+The 3270 is the classic 'green-screen' console of the mainframes (see the
+`IBM 3270 Wikipedia article <https://en.wikipedia.org/wiki/IBM_3270>`__).
+
+The 3270 data stream is not implemented within QEMU; the device only provides
+TN3270 (a telnet extension; see `RFC 854 <https://tools.ietf.org/html/rfc854>`__
+and `RFC 1576 <https://tools.ietf.org/html/rfc1576>`__) and leaves the heavy
+lifting to an external 3270 terminal emulator (such as ``x3270``) to make a
+single 3270 device available to a guest. Note that this supports basic
+features only.
+
+To provide a 3270 device to a guest, create an ``x-terminal3270`` device
+linked to a ``tn3270`` chardev. The guest will see a 3270 channel device. In
+order to actually be able to use it, attach the ``x3270`` emulator to the
+chardev.
+
+Example configuration
+---------------------
+
+* Make sure that 3270 support is enabled in the guest's Linux kernel. 
You need + ``CONFIG_TN3270`` and at least one of ``CONFIG_TN3270_TTY`` (for additional + ttys) or ``CONFIG_TN3270_CONSOLE`` (for a 3270 console). + +* Add a ``tn3270`` chardev and a ``x-terminal3270`` to the QEMU command line:: + + -chardev socket,id=ch0,host=0.0.0.0,port=2300,wait=off,server=on,tn3270=on + -device x-terminal3270,chardev=ch0,devno=fe.0.000a,id=terminal0 + +* Start the guest. In the guest, use ``chccwdev -e 0.0.000a`` to enable + the device. + +* On the host, start the ``x3270`` emulator:: + + x3270 <host>:2300 + +* In the guest, locate the 3270 device node under ``/dev/3270/`` (say, + ``tty1``) and start a getty on it:: + + systemctl start serial-getty@3270-tty1.service + + This should get you an additional tty for logging into the guest. + +* If you want to use the 3270 device as the Linux kernel console instead of + an additional tty, you can also append ``conmode=3270 condev=000a`` to + the guest's kernel command line. The kernel then should use the 3270 as + console after the next boot. + +Restrictions +------------ + +3270 support is very basic. In particular: + +* Only one 3270 device is supported. + +* It has only been tested with Linux guests and the x3270 emulator. + +* TLS/SSL is not supported. + +* Resizing on reattach is not supported. + +* Multiple commands in one inbound buffer (for example, when the reset key + is pressed while the network is slow) are not supported. diff --git a/docs/system/s390x/bootdevices.rst b/docs/system/s390x/bootdevices.rst new file mode 100644 index 000000000..9e591cb9d --- /dev/null +++ b/docs/system/s390x/bootdevices.rst @@ -0,0 +1,82 @@ +Boot devices on s390x +===================== + +Booting with bootindex parameter +-------------------------------- + +For classical mainframe guests (i.e. LPAR or z/VM installations), you always +have to explicitly specify the disk where you want to boot from (or "IPL" from, +in s390x-speak -- IPL means "Initial Program Load"). In particular, there can +also be only one boot device according to the architecture specification, thus +specifying multiple boot devices is not possible (yet). + +So for booting an s390x guest in QEMU, you should always mark the +device where you want to boot from with the ``bootindex`` property, for +example:: + + qemu-system-s390x -drive if=none,id=dr1,file=guest.qcow2 \ + -device virtio-blk,drive=dr1,bootindex=1 + +For booting from a CD-ROM ISO image (which needs to include El-Torito boot +information in order to be bootable), it is recommended to specify a ``scsi-cd`` +device, for example like this:: + + qemu-system-s390x -blockdev file,node-name=c1,filename=... \ + -device virtio-scsi \ + -device scsi-cd,drive=c1,bootindex=1 + +Note that you really have to use the ``bootindex`` property to select the +boot device. The old-fashioned ``-boot order=...`` command of QEMU (and +also ``-boot once=...``) is not supported on s390x. + + +Booting without bootindex parameter +----------------------------------- + +The QEMU guest firmware (the so-called s390-ccw bios) has also some rudimentary +support for scanning through the available block devices. So in case you did +not specify a boot device with the ``bootindex`` property, there is still a +chance that it finds a bootable device on its own and starts a guest operating +system from it. However, this scanning algorithm is still very rough and may +be incomplete, so that it might fail to detect a bootable device in many cases. 
+It is really recommended to always specify the boot device with the
+``bootindex`` property instead.
+
+This also means that you should avoid the classical short-cut commands like
+``-hda``, ``-cdrom`` or ``-drive if=virtio``, since it is not possible to
+specify the ``bootindex`` with these commands. Note that the convenience
+``-cdrom`` option does not even give you a real (virtio-scsi) CD-ROM device on
+s390x. Due to technical limitations in the QEMU code base, you will get a
+virtio-blk device with this parameter instead, which might not be the right
+device type for installing a Linux distribution via ISO image. It is
+recommended to specify a CD-ROM device via ``-device scsi-cd`` (as mentioned
+above) instead.
+
+
+Booting from a network device
+-----------------------------
+
+Besides the normal guest firmware (which is loaded from the file
+``s390-ccw.img`` in the data directory of QEMU, or via the ``-bios`` option),
+QEMU ships with a small TFTP network bootloader firmware for virtio-net-ccw
+devices, too. This firmware is loaded from a file called ``s390-netboot.img``
+in the QEMU data directory. In case you want to load it from a different
+filename instead, you can specify it via the
+``-global s390-ipl.netboot_fw=filename`` command line option.
+
+The ``bootindex`` property is especially important for booting via the
+network. If you don't specify the ``bootindex`` property here, the network
+bootloader firmware code won't get loaded into the guest memory, and the
+network boot will fail. For a successful network boot, try something like
+this::
+
+  qemu-system-s390x -netdev user,id=n1,tftp=...,bootfile=... \
+                    -device virtio-net-ccw,netdev=n1,bootindex=1
+
+The network bootloader firmware also has basic support for pxelinux.cfg-style
+configuration files. See the `PXELINUX Configuration page
+<https://wiki.syslinux.org/wiki/index.php?title=PXELINUX#Configuration>`__
+for details on how to set up the configuration file on your TFTP server.
+The supported configuration file entries are ``DEFAULT``, ``LABEL``,
+``KERNEL``, ``INITRD`` and ``APPEND`` (see the `Syslinux Config file syntax
+<https://wiki.syslinux.org/wiki/index.php?title=Config>`__ for more
+information).
diff --git a/docs/system/s390x/css.rst b/docs/system/s390x/css.rst
new file mode 100644
index 000000000..3b4016118
--- /dev/null
+++ b/docs/system/s390x/css.rst
@@ -0,0 +1,86 @@
+The virtual channel subsystem
+=============================
+
+QEMU implements a virtual channel subsystem with subchannels, (mostly
+functionless) channel paths, and channel devices (virtio-ccw, 3270, and
+devices passed via vfio-ccw). It supports multiple subchannel sets (MSS) and
+multiple channel subsystems extended (MCSS-E).
+
+All channel devices support the ``devno`` property, which takes a parameter
+in the form ``<cssid>.<ssid>.<device number>``.
+
+The default channel subsystem image id (``<cssid>``) is ``0xfe``. Devices in
+there will show up in channel subsystem image ``0`` to guests that do not
+enable MCSS-E. Note that devices with a different cssid will not be visible
+if the guest OS does not enable MCSS-E (which is true for all supported guest
+operating systems today).
+
+Supported values for the subchannel set id (``<ssid>``) range from ``0-3``.
+Devices with an ssid that is not ``0`` will not be visible if the guest OS
+does not enable MSS (any Linux version that supports virtio also enables MSS).
+Any device may be put into any subchannel set; there is no restriction by
+device type. 
+ +The device number can range from ``0-0xffff``. + +If the ``devno`` property is not specified for a device, QEMU will choose the +next free device number in subchannel set 0, skipping to the next subchannel +set if no more device numbers are free. + +QEMU places a device at the first free subchannel in the specified subchannel +set. If a device is hotunplugged and later replugged, it may appear at a +different subchannel. (This is similar to how z/VM works.) + + +Examples +-------- + +* a virtio-net device, cssid/ssid/devno automatically assigned:: + + -device virtio-net-ccw + + In a Linux guest (without default devices and no other devices specified + prior to this one), this will show up as ``0.0.0000`` under subchannel + ``0.0.0000``. + + The auto-assigned-properties in QEMU (as seen via e.g. ``info qtree``) + would be ``dev_id = "fe.0.0000"`` and ``subch_id = "fe.0.0000"``. + +* a virtio-rng device in subchannel set ``0``:: + + -device virtio-rng-ccw,devno=fe.0.0042 + + If added to the same Linux guest as above, it would show up as ``0.0.0042`` + under subchannel ``0.0.0001``. + + The properties for the device would be ``dev_id = "fe.0.0042"`` and + ``subch_id = "fe.0.0001"``. + +* a virtio-gpu device in subchannel set ``2``:: + + -device virtio-gpu-ccw,devno=fe.2.1111 + + If added to the same Linux guest as above, it would show up as ``0.2.1111`` + under subchannel ``0.2.0000``. + + The properties for the device would be ``dev_id = "fe.2.1111"`` and + ``subch_id = "fe.2.0000"``. + +* a virtio-mouse device in a non-standard channel subsystem image:: + + -device virtio-mouse-ccw,devno=2.0.2222 + + This would not show up in a standard Linux guest. + + The properties for the device would be ``dev_id = "2.0.2222"`` and + ``subch_id = "2.0.0000"``. + +* a virtio-keyboard device in another non-standard channel subsystem image:: + + -device virtio-keyboard-ccw,devno=0.0.1234 + + This would not show up in a standard Linux guest, either, as ``0`` is not + the standard channel subsystem image id. + + The properties for the device would be ``dev_id = "0.0.1234"`` and + ``subch_id = "0.0.0000"``. diff --git a/docs/system/s390x/protvirt.rst b/docs/system/s390x/protvirt.rst new file mode 100644 index 000000000..aee63ed7e --- /dev/null +++ b/docs/system/s390x/protvirt.rst @@ -0,0 +1,67 @@ +Protected Virtualization on s390x +================================= + +The memory and most of the registers of Protected Virtual Machines +(PVMs) are encrypted or inaccessible to the hypervisor, effectively +prohibiting VM introspection when the VM is running. At rest, PVMs are +encrypted and can only be decrypted by the firmware, represented by an +entity called Ultravisor, of specific IBM Z machines. + + +Prerequisites +------------- + +To run PVMs, a machine with the Protected Virtualization feature, as +indicated by the Ultravisor Call facility (stfle bit 158), is +required. The Ultravisor needs to be initialized at boot by setting +``prot_virt=1`` on the host's kernel command line. + +Running PVMs requires using the KVM hypervisor. + +If those requirements are met, the capability ``KVM_CAP_S390_PROTECTED`` +will indicate that KVM can support PVMs on that LPAR. 
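+
+For a quick host-side sanity check (an illustrative sketch, not an official
+verification procedure), you can confirm that the Ultravisor was enabled on
+the host kernel command line::
+
+  $ grep -o 'prot_virt=1' /proc/cmdline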
+
+
+Running a Protected Virtual Machine
+-----------------------------------
+
+To run a PVM you will need to select a CPU model which includes the
+``Unpack facility`` (stfle bit 161 represented by the feature
+``unpack``/``S390_FEAT_UNPACK``), and add these options to the command line::
+
+    -object s390-pv-guest,id=pv0 \
+    -machine confidential-guest-support=pv0
+
+Adding these options will:
+
+* Ensure the ``unpack`` facility is available
+* Enable the IOMMU by default for all I/O devices
+* Initialize the PV mechanism
+
+Passthrough (vfio) devices are currently not supported.
+
+Host huge page backings are not supported. However, guests can use huge
+pages as indicated by their facilities.
+
+
+Boot Process
+------------
+
+A secure guest image can either be loaded from disk or supplied on the
+QEMU command line. Booting from disk is done by the unmodified
+s390-ccw BIOS. That is, the bootmap is interpreted, multiple components
+are read into memory, and control is transferred to one of the
+components (zipl stage3). Stage3 does some fixups and then transfers
+control to some program residing in guest memory, which is normally
+the OS kernel. The secure image has another component prepended
+(stage3a) that uses the new diag308 subcodes 8 and 10 to trigger the
+transition into secure mode.
+
+Booting from the image supplied on the QEMU command line requires that
+the file passed via -kernel has the same memory layout as would result
+from the disk boot. This memory layout includes the encrypted
+components (kernel, initrd, cmdline), the stage3a loader and
+metadata. In case this boot method is used, the command line
+options -initrd and -cmdline are ineffective. The preparation of a PVM
+image is done via the ``genprotimg`` tool from the s390-tools
+collection.
diff --git a/docs/system/s390x/vfio-ap.rst b/docs/system/s390x/vfio-ap.rst
new file mode 100644
index 000000000..084ba9c4e
--- /dev/null
+++ b/docs/system/s390x/vfio-ap.rst
@@ -0,0 +1,916 @@
+Adjunct Processor (AP) Device
+=============================
+
+.. contents::
+
+Introduction
+------------
+
+The IBM Adjunct Processor (AP) Cryptographic Facility comprises
+three AP instructions and from 1 to 256 PCIe cryptographic adapter cards.
+These AP devices provide cryptographic functions to all CPUs assigned to a
+linux system running in an IBM Z system LPAR.
+
+On s390x, AP adapter cards are exposed via the AP bus. This document
+describes how those cards may be made available to KVM guests using the
+VFIO mediated device framework.
+
+AP Architectural Overview
+-------------------------
+
+In order to understand the terminology used in the rest of this document,
+let's start with some definitions:
+
+* AP adapter
+
+  An AP adapter is an IBM Z adapter card that can perform cryptographic
+  functions. There can be from 0 to 256 adapters assigned to an LPAR depending
+  on the machine model. Adapters assigned to the LPAR in which a linux host is
+  running will be available to the linux host. Each adapter is identified by a
+  number from 0 to 255; however, the maximum adapter number allowed is
+  determined by machine model. When installed, an AP adapter is accessed by
+  AP instructions executed by any CPU.
+
+* AP domain
+
+  An adapter is partitioned into domains. Each domain can be thought of as
+  a set of hardware registers for processing AP instructions. An adapter can
+  hold up to 256 domains; however, the maximum domain number allowed is
+  determined by machine model. Each domain is identified by a number from 0
+  to 255.
+  Domains can be further classified into two types:
+
+  * Usage domains are domains that can be accessed directly to process AP
+    commands
+
+  * Control domains are domains that are accessed indirectly by AP
+    commands sent to a usage domain to control or change the domain; for
+    example, to set a secure private key for the domain.
+
+* AP Queue
+
+  An AP queue is the means by which an AP command-request message is sent to an
+  AP usage domain inside a specific AP. An AP queue is identified by a tuple
+  comprised of an AP adapter ID (APID) and an AP queue index (APQI). The
+  APQI corresponds to a given usage domain number within the adapter. This tuple
+  forms an AP Queue Number (APQN) uniquely identifying an AP queue. AP
+  instructions include a field containing the APQN to identify the AP queue to
+  which the AP command-request message is to be sent for processing.
+
+* AP Instructions
+
+  There are three AP instructions:
+
+  * NQAP: to enqueue an AP command-request message to a queue
+  * DQAP: to dequeue an AP command-reply message from a queue
+  * PQAP: to administer the queues
+
+  AP instructions identify the domain that is targeted to process the AP
+  command; this must be one of the usage domains. An AP command may modify a
+  domain that is not one of the usage domains, but the modified domain
+  must be one of the control domains.
+
+Start Interpretive Execution (SIE) Instruction
+----------------------------------------------
+
+A KVM guest is started by executing the Start Interpretive Execution (SIE)
+instruction. The SIE state description is a control block that contains the
+state information for a KVM guest and is supplied as input to the SIE
+instruction. The SIE state description contains a satellite control block called
+the Crypto Control Block (CRYCB). The CRYCB contains three fields to identify
+the adapters, usage domains and control domains assigned to the KVM guest:
+
+* The AP Mask (APM) field is a bit mask that identifies the AP adapters assigned
+  to the KVM guest. Each bit in the mask, from left to right, corresponds to
+  an APID from 0-255. If a bit is set, the corresponding adapter is valid for
+  use by the KVM guest.
+
+* The AP Queue Mask (AQM) field is a bit mask identifying the AP usage domains
+  assigned to the KVM guest. Each bit in the mask, from left to right,
+  corresponds to an AP queue index (APQI) from 0-255. If a bit is set, the
+  corresponding queue is valid for use by the KVM guest.
+
+* The AP Domain Mask (ADM) field is a bit mask that identifies the AP control
+  domains assigned to the KVM guest. The ADM bit mask controls which domains
+  can be changed by an AP command-request message sent to a usage domain from
+  the guest. Each bit in the mask, from left to right, corresponds to a domain
+  from 0-255. If a bit is set, the corresponding domain can be modified by an
+  AP command-request message sent to a usage domain.
+
+Recall from the description of an AP Queue that AP instructions include
+an APQN to identify the AP adapter and AP queue to which an AP command-request
+message is to be sent (NQAP and PQAP instructions), or from which a
+command-reply message is to be received (DQAP instruction). The validity of an
+APQN is defined by the matrix calculated from the APM and AQM; it is the
+cross product of all assigned adapter numbers (APM) with all assigned queue
+indexes (AQM). For example, if adapters 1 and 2 and usage domains 5 and 6 are
+assigned to a guest, the APQNs (1,5), (1,6), (2,5) and (2,6) will be valid for
+the guest.
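+
+To illustrate, the leftmost bits of the masks for that example would look
+like this (a sketch only, with the masks truncated)::
+
+  APM: 01100000...  bits 1 and 2 set -> adapters 1 and 2
+  AQM: 00000110...  bits 5 and 6 set -> usage domains 5 and 6
+
+  valid APQNs: (1,5) (1,6) (2,5) (2,6)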
+
+The APQNs can provide secure key functionality - i.e., a private key is stored
+on the adapter card for each of its domains - so each APQN must be assigned to
+at most one guest or to the linux host.
+
+Example 1: Valid configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++----------+--------+--------+
+|          | Guest1 | Guest2 |
++==========+========+========+
+| adapters |  1, 2  |  1, 2  |
++----------+--------+--------+
+| domains  |  5, 6  |   7    |
++----------+--------+--------+
+
+This is valid because both guests have a unique set of APQNs:
+
+* Guest1 has APQNs (1,5), (1,6), (2,5) and (2,6);
+* Guest2 has APQNs (1,7) and (2,7).
+
+Example 2: Valid configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++----------+--------+--------+
+|          | Guest1 | Guest2 |
++==========+========+========+
+| adapters |  1, 2  |  3, 4  |
++----------+--------+--------+
+| domains  |  5, 6  |  5, 6  |
++----------+--------+--------+
+
+This is also valid because both guests have a unique set of APQNs:
+
+* Guest1 has APQNs (1,5), (1,6), (2,5), (2,6);
+* Guest2 has APQNs (3,5), (3,6), (4,5), (4,6).
+
+Example 3: Invalid configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++----------+--------+--------+
+|          | Guest1 | Guest2 |
++==========+========+========+
+| adapters |  1, 2  |   1    |
++----------+--------+--------+
+| domains  |  5, 6  |  6, 7  |
++----------+--------+--------+
+
+This is an invalid configuration because both guests have access to
+APQN (1,6).
+
+AP Matrix Configuration on Linux Host
+-------------------------------------
+
+A linux system is a guest of the LPAR in which it is running and has access to
+the AP resources configured for the LPAR. The LPAR's AP matrix is
+configured via its Activation Profile which can be edited on the HMC. When the
+linux system is started, the AP bus will detect the AP devices assigned to the
+LPAR and create the following in sysfs::
+
+  /sys/bus/ap
+  ... [devices]
+  ...... xx.yyyy
+  ...... ...
+  ...... cardxx
+  ...... ...
+
+Where:
+
+``cardxx``
+  is AP adapter number xx (in hex)
+
+``xx.yyyy``
+  is an APQN with xx specifying the APID and yyyy specifying the APQI
+
+For example, if AP adapters 5 and 6 and domains 4, 71 (0x47), 171 (0xab) and
+255 (0xff) are configured for the LPAR, the sysfs representation on the linux
+host system would look like this::
+
+  /sys/bus/ap
+  ... [devices]
+  ...... 05.0004
+  ...... 05.0047
+  ...... 05.00ab
+  ...... 05.00ff
+  ...... 06.0004
+  ...... 06.0047
+  ...... 06.00ab
+  ...... 06.00ff
+  ...... card05
+  ...... card06
+
+A set of default device drivers is also created to control each type of AP
+device that can be assigned to the LPAR on which a linux host is running::
+
+  /sys/bus/ap
+  ... [drivers]
+  ...... [cex2acard]    for Crypto Express 2/3 accelerator cards
+  ...... [cex2aqueue]   for AP queues served by Crypto Express 2/3
+                        accelerator cards
+  ...... [cex4card]     for Crypto Express 4/5/6 accelerator and coprocessor
+                        cards
+  ...... [cex4queue]    for AP queues served by Crypto Express 4/5/6
+                        accelerator and coprocessor cards
+  ...... [pcixcccard]   for Crypto Express 2/3 coprocessor cards
+  ...... [pcixccqueue]  for AP queues served by Crypto Express 2/3
+                        coprocessor cards
+
+Binding AP devices to device drivers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are two sysfs files that specify bitmasks marking a subset of the APQN
+range as 'usable by the default AP queue device drivers' or 'not usable by the
+default device drivers' and thus available for use by the alternate device
+driver(s).
+The sysfs locations of the masks are::
+
+  /sys/bus/ap/apmask
+  /sys/bus/ap/aqmask
+
+The ``apmask`` is a 256-bit mask that identifies a set of AP adapter IDs
+(APID). Each bit in the mask, from left to right (i.e., from most significant
+to least significant bit in big endian order), corresponds to an APID from
+0-255. If a bit is set, the APID is marked as usable only by the default AP
+queue device drivers; otherwise, the APID is usable by the vfio_ap
+device driver.
+
+The ``aqmask`` is a 256-bit mask that identifies a set of AP queue indexes
+(APQI). Each bit in the mask, from left to right (i.e., from most significant
+to least significant bit in big endian order), corresponds to an APQI from
+0-255. If a bit is set, the APQI is marked as usable only by the default AP
+queue device drivers; otherwise, the APQI is usable by the vfio_ap device
+driver.
+
+Take, for example, the following mask::
+
+  0x7dffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
+
+It indicates:
+
+  1, 2, 3, 4, 5, and 7-255 belong to the default drivers' pool, and 0 and 6
+  belong to the vfio_ap device driver's pool.
+
+The APQN of each AP queue device assigned to the linux host is checked by the
+AP bus against the set of APQNs derived from the cross product of APIDs
+and APQIs marked as usable only by the default AP queue device drivers. If a
+match is detected, only the default AP queue device drivers will be probed;
+otherwise, the vfio_ap device driver will be probed.
+
+By default, the two masks are set to reserve all APQNs for use by the default
+AP queue device drivers. There are two ways the default masks can be changed:
+
+  1. The sysfs mask files can be edited by echoing a string into the
+     respective sysfs mask file in one of two formats:
+
+     * An absolute hex string starting with 0x - like "0x12345678" - sets
+       the mask. If the given string is shorter than the mask, it is padded
+       with 0s on the right; for example, specifying a mask value of 0x41 is
+       the same as specifying::
+
+         0x4100000000000000000000000000000000000000000000000000000000000000
+
+       Keep in mind that the mask reads from left to right (i.e., most
+       significant to least significant bit in big endian order), so the mask
+       above identifies device numbers 1 and 7 (``01000001``).
+
+       If the string is longer than the mask, the operation is terminated with
+       an error (EINVAL).
+
+     * Individual bits in the mask can be switched on and off by specifying
+       each bit number to be switched in a comma separated list. Each bit
+       number string must be prepended with a plus (``+``) or minus (``-``) to
+       indicate the corresponding bit is to be switched on (``+``) or off
+       (``-``). Some valid values are::
+
+         "+0"    switches bit 0 on
+         "-13"   switches bit 13 off
+         "+0x41" switches bit 65 on
+         "-0xff" switches bit 255 off
+
+       The following example::
+
+         +0,-6,+0x47,-0xf0
+
+       Switches bits 0 and 71 (0x47) on
+       Switches bits 6 and 240 (0xf0) off
+
+       Note that the bits not specified in the list remain as they were before
+       the operation.
+
+  2. The masks can also be changed at boot time via parameters on the kernel
+     command line like this::
+
+       ap.apmask=0xffff ap.aqmask=0x40
+
+     This would create the following masks:
+
+     apmask::
+
+       0xffff000000000000000000000000000000000000000000000000000000000000
+
+     aqmask::
+
+       0x4000000000000000000000000000000000000000000000000000000000000000
+
+     Resulting in these two pools::
+
+       default drivers pool:    adapter 0-15, domain 1
+       alternate drivers pool:  adapter 16-255, domains 0, 2-255
+
+Configuring an AP matrix for a linux guest
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sysfs interfaces for configuring an AP matrix for a guest are built on the
+VFIO mediated device framework. To configure an AP matrix for a guest, a
+mediated matrix device must first be created for the ``/sys/devices/vfio_ap/matrix``
+device. When the vfio_ap device driver is loaded, it registers with the VFIO
+mediated device framework. When the driver registers, the sysfs interfaces for
+creating mediated matrix devices are created::
+
+  /sys/devices
+  ... [vfio_ap]
+  ......[matrix]
+  ......... [mdev_supported_types]
+  ............ [vfio_ap-passthrough]
+  ............... create
+  ............... [devices]
+
+A mediated AP matrix device is created by writing a UUID to the attribute file
+named ``create``, for example::
+
+  uuidgen > create
+
+or
+
+::
+
+  echo $uuid > create
+
+When a mediated AP matrix device is created, a sysfs directory named after
+the UUID is created in the ``devices`` subdirectory::
+
+  /sys/devices
+  ... [vfio_ap]
+  ......[matrix]
+  ......... [mdev_supported_types]
+  ............ [vfio_ap-passthrough]
+  ............... create
+  ............... [devices]
+  .................. [$uuid]
+
+There will also be three sets of attribute files created in the mediated
+matrix device's sysfs directory to configure an AP matrix for the
+KVM guest::
+
+  /sys/devices
+  ... [vfio_ap]
+  ......[matrix]
+  ......... [mdev_supported_types]
+  ............ [vfio_ap-passthrough]
+  ............... create
+  ............... [devices]
+  .................. [$uuid]
+  ..................... assign_adapter
+  ..................... assign_control_domain
+  ..................... assign_domain
+  ..................... matrix
+  ..................... unassign_adapter
+  ..................... unassign_control_domain
+  ..................... unassign_domain
+
+``assign_adapter``
+  To assign an AP adapter to the mediated matrix device, its APID is written
+  to the ``assign_adapter`` file. This may be done multiple times to assign more
+  than one adapter. The APID may be specified using conventional semantics
+  as a decimal, hexadecimal, or octal number. For example, to assign adapters
+  4, 5 and 16 to a mediated matrix device in decimal, hexadecimal and octal
+  respectively::
+
+    echo 4 > assign_adapter
+    echo 0x5 > assign_adapter
+    echo 020 > assign_adapter
+
+  In order to successfully assign an adapter:
+
+  * The adapter number specified must represent a value from 0 up to the
+    maximum adapter number allowed by the machine model. If an adapter number
+    higher than the maximum is specified, the operation will terminate with
+    an error (ENODEV).
+
+  * All APQNs that can be derived from the adapter ID being assigned and the
+    IDs of the previously assigned domains must be bound to the vfio_ap device
+    driver. If no domains have yet been assigned, then there must be at least
+    one APQN with the specified APID bound to the vfio_ap driver.
+    If no such APQNs are bound to the driver, the operation will terminate
+    with an error (EADDRNOTAVAIL).
+
+  * No APQN that can be derived from the adapter ID and the IDs of the
+    previously assigned domains can be assigned to another mediated matrix
+    device. If an APQN is assigned to another mediated matrix device, the
+    operation will terminate with an error (EADDRINUSE).
+
+``unassign_adapter``
+  To unassign an AP adapter, its APID is written to the ``unassign_adapter``
+  file. This may also be done multiple times to unassign more than one adapter.
+
+``assign_domain``
+  To assign a usage domain, the domain number is written into the
+  ``assign_domain`` file. This may be done multiple times to assign more than one
+  usage domain. The domain number is specified using conventional semantics as
+  a decimal, hexadecimal, or octal number. For example, to assign usage domains
+  4, 8, and 71 to a mediated matrix device in decimal, hexadecimal and octal
+  respectively::
+
+    echo 4 > assign_domain
+    echo 0x8 > assign_domain
+    echo 0107 > assign_domain
+
+  In order to successfully assign a domain:
+
+  * The domain number specified must represent a value from 0 up to the
+    maximum domain number allowed by the machine model. If a domain number
+    higher than the maximum is specified, the operation will terminate with
+    an error (ENODEV).
+
+  * All APQNs that can be derived from the domain ID being assigned and the IDs
+    of the previously assigned adapters must be bound to the vfio_ap device
+    driver. If no adapters have yet been assigned, then there must be at least
+    one APQN with the specified APQI bound to the vfio_ap driver. If no such
+    APQNs are bound to the driver, the operation will terminate with an
+    error (EADDRNOTAVAIL).
+
+  * No APQN that can be derived from the domain ID being assigned and the IDs
+    of the previously assigned adapters can be assigned to another mediated
+    matrix device. If an APQN is assigned to another mediated matrix device,
+    the operation will terminate with an error (EADDRINUSE).
+
+``unassign_domain``
+  To unassign a usage domain, the domain number is written into the
+  ``unassign_domain`` file. This may be done multiple times to unassign more than
+  one usage domain.
+
+``assign_control_domain``
+  To assign a control domain, the domain number is written into the
+  ``assign_control_domain`` file. This may be done multiple times to
+  assign more than one control domain. The domain number may be specified using
+  conventional semantics as a decimal, hexadecimal, or octal number. For
+  example, to assign control domains 4, 8, and 71 to a mediated matrix device
+  in decimal, hexadecimal and octal respectively::
+
+    echo 4 > assign_control_domain
+    echo 0x8 > assign_control_domain
+    echo 0107 > assign_control_domain
+
+  In order to successfully assign a control domain, the domain number
+  specified must represent a value from 0 up to the maximum domain number
+  allowed by the machine model. If a control domain number higher than the
+  maximum is specified, the operation will terminate with an error (ENODEV).
+
+``unassign_control_domain``
+  To unassign a control domain, the domain number is written into the
+  ``unassign_control_domain`` file. This may be done multiple times to unassign
+  more than one control domain.
+
+Note: No changes to the AP matrix will be allowed while a guest using
+the mediated matrix device is running. Attempts to assign an adapter,
+domain or control domain will be rejected and an error (EBUSY) returned.
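+
+Putting these attribute files together, a minimal (purely illustrative)
+configuration sequence for a mediated device might look like this, with the
+resulting APQN read back from the ``matrix`` attribute::
+
+  cd /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/devices/$uuid
+  echo 5    > assign_adapter
+  echo 0x47 > assign_domain
+  cat matrix
+  05.0047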
+
+Starting a Linux Guest Configured with an AP Matrix
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To provide a mediated matrix device for use by a guest, the following option
+must be specified on the QEMU command line::
+
+  -device vfio-ap,sysfsdev=$path-to-mdev
+
+The sysfsdev parameter specifies the path to the mediated matrix device.
+There are a number of ways to specify this path::
+
+  /sys/devices/vfio_ap/matrix/$uuid
+  /sys/bus/mdev/devices/$uuid
+  /sys/bus/mdev/drivers/vfio_mdev/$uuid
+  /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/devices/$uuid
+
+When the linux guest is started, the guest will open the mediated
+matrix device's file descriptor to get information about the mediated matrix
+device. The ``vfio_ap`` device driver will update the APM, AQM, and ADM fields in
+the guest's CRYCB with the adapter, usage domain and control domains assigned
+via the mediated matrix device's sysfs attribute files. Programs running on the
+linux guest will then:
+
+1. Have direct access to the APQNs derived from the cross product of the AP
+   adapter numbers (APID) and queue indexes (APQI) specified in the APM and AQM
+   fields of the guest's CRYCB respectively. These APQNs identify the AP queues
+   that are valid for use by the guest; meaning, AP commands can be sent by the
+   guest to any of these queues for processing.
+
+2. Have authorization to process AP commands to change a control domain
+   identified in the ADM field of the guest's CRYCB. The AP command must be sent
+   to a valid APQN (see 1 above).
+
+CPU model features:
+
+Three CPU model features are available for controlling guest access to AP
+facilities:
+
+1. AP facilities feature
+
+   The AP facilities feature indicates that AP facilities are installed on the
+   guest. This feature will be exposed for use only if the AP facilities
+   are installed on the host system. The feature is s390-specific and is
+   represented as a parameter of the -cpu option on the QEMU command line::
+
+     qemu-system-s390x -cpu $model,ap=on|off
+
+   Where:
+
+   ``$model``
+     is the CPU model defined for the guest (defaults to the model of
+     the host system if not specified).
+
+   ``ap=on|off``
+     indicates whether AP facilities are installed (on) or not
+     (off). The default for CPU models zEC12 or newer
+     is ``ap=on``. AP facilities must be installed on the guest if a
+     vfio-ap device (``-device vfio-ap,sysfsdev=$path``) is configured
+     for the guest, or the guest will fail to start.
+
+2. Query Configuration Information (QCI) facility
+
+   The QCI facility is used by the AP bus running on the guest to query the
+   configuration of the AP facilities. This facility will be available
+   only if the QCI facility is installed on the host system. The feature is
+   s390-specific and is represented as a parameter of the -cpu option on the
+   QEMU command line::
+
+     qemu-system-s390x -cpu $model,apqci=on|off
+
+   Where:
+
+   ``$model``
+     is the CPU model defined for the guest
+
+   ``apqci=on|off``
+     indicates whether the QCI facility is installed (on) or
+     not (off). The default for CPU models zEC12 or newer
+     is ``apqci=on``; for older models, QCI will not be installed.
+
+     If QCI is installed (``apqci=on``) but AP facilities are not
+     (``ap=off``), an error message will be logged, but the guest
+     will be allowed to start. It makes no sense to have QCI
+     installed if the AP facilities are not; this is considered
+     an invalid configuration.
+
+     If the QCI facility is not installed, APQNs with an APQI
+     greater than 15 will not be detected by the AP bus
+     running on the guest.
+
+3. Adjunct Process Facility Test (APFT) facility
+
+   The APFT facility is used by the AP bus running on the guest to test the
+   AP facilities available for a given AP queue. This facility will be available
+   only if the APFT facility is installed on the host system. The feature is
+   s390-specific and is represented as a parameter of the -cpu option on the
+   QEMU command line::
+
+     qemu-system-s390x -cpu $model,apft=on|off
+
+   Where:
+
+   ``$model``
+     is the CPU model defined for the guest (defaults to the model of
+     the host system if not specified).
+
+   ``apft=on|off``
+     indicates whether the APFT facility is installed (on) or
+     not (off). The default for CPU models zEC12 and
+     newer is ``apft=on``; for older models, APFT will not be
+     installed.
+
+     If APFT is installed (``apft=on``) but AP facilities are not
+     (``ap=off``), an error message will be logged, but the guest
+     will be allowed to start. It makes no sense to have APFT
+     installed if the AP facilities are not; this is considered
+     an invalid configuration.
+
+     It also makes no sense to turn APFT off because the AP bus
+     running on the guest will not detect CEX4 and newer devices
+     without it. Since only CEX4 and newer devices are supported
+     for guest usage, no AP devices can be made accessible to a
+     guest started without APFT installed.
+
+Hot plug a vfio-ap device into a running guest
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Only one vfio-ap device can be attached to the virtual machine's ap-bus, so a
+vfio-ap device can be hot plugged if and only if no vfio-ap device is attached
+to the bus already, whether via the QEMU command line or a prior hot plug
+action.
+
+To hot plug a vfio-ap device, use the QEMU ``device_add`` command::
+
+  (qemu) device_add vfio-ap,sysfsdev="$path-to-mdev",id="$id"
+
+Where the ``$path-to-mdev`` value specifies the absolute path to a mediated
+device to which AP resources to be used by the guest have been assigned.
+``$id`` is the name value for the optional id parameter.
+
+Note that on Linux guests, the AP devices will be created in the
+``/sys/bus/ap/devices`` directory when the AP bus subsequently performs its
+periodic scan, so there may be a short delay before the AP devices are
+accessible on the guest.
+
+The command will fail if:
+
+* A vfio-ap device has already been attached to the virtual machine's ap-bus.
+
+* The CPU model features for controlling guest access to AP facilities are not
+  enabled (see 'CPU model features' subsection in the previous section).
+
+Hot unplug a vfio-ap device from a running guest
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A vfio-ap device can be unplugged from a running KVM guest if a vfio-ap device
+has been attached to the virtual machine's ap-bus via the QEMU command line
+or a prior hot plug action.
+
+To hot unplug a vfio-ap device, use the QEMU ``device_del`` command::
+
+  (qemu) device_del "$id"
+
+Where ``$id`` is the same id that was specified at device creation.
+
+On a Linux guest, the AP devices will be removed from the ``/sys/bus/ap/devices``
+directory on the guest when the AP bus subsequently performs its periodic scan,
+so there may be a short delay before the AP devices are no longer accessible by
+the guest.
+
+The command will fail if the ``$id`` specified on the ``device_del`` command
+does not match the id that was specified when the vfio-ap device was attached
+to the virtual machine's ap-bus.
+
+Example: Configure AP Matrices for Three Linux Guests
+-----------------------------------------------------
+
+Let's now provide an example to illustrate how KVM guests may be given
+access to AP facilities. For this example, we will show how to configure
+three guests such that executing the lszcrypt command on the guests would
+look like this:
+
+Guest1::
+
+  CARD.DOMAIN TYPE  MODE
+  ------------------------------
+  05          CEX5C CCA-Coproc
+  05.0004     CEX5C CCA-Coproc
+  05.00ab     CEX5C CCA-Coproc
+  06          CEX5A Accelerator
+  06.0004     CEX5A Accelerator
+  06.00ab     CEX5A Accelerator
+
+Guest2::
+
+  CARD.DOMAIN TYPE  MODE
+  ------------------------------
+  05          CEX5A Accelerator
+  05.0047     CEX5A Accelerator
+  05.00ff     CEX5A Accelerator
+
+Guest3::
+
+  CARD.DOMAIN TYPE  MODE
+  ------------------------------
+  06          CEX5A Accelerator
+  06.0047     CEX5A Accelerator
+  06.00ff     CEX5A Accelerator
+
+These are the steps:
+
+1. Install the vfio_ap module on the linux host. The dependency chain for the
+   vfio_ap module is:
+
+   * iommu
+   * s390
+   * zcrypt
+   * vfio
+   * vfio_mdev
+   * vfio_mdev_device
+   * KVM
+
+   To build the vfio_ap module, the kernel build must be configured with the
+   following Kconfig elements selected:
+
+   * IOMMU_SUPPORT
+   * S390
+   * ZCRYPT
+   * S390_AP_IOMMU
+   * VFIO
+   * VFIO_MDEV
+   * VFIO_MDEV_DEVICE
+   * KVM
+
+   If using make menuconfig, select the following to build the vfio_ap module::
+
+     -> Device Drivers
+        -> IOMMU Hardware Support
+           select S390 AP IOMMU Support
+        -> VFIO Non-Privileged userspace driver framework
+           -> Mediated device driver framework
+              -> VFIO driver for Mediated devices
+     -> I/O subsystem
+        -> VFIO support for AP devices
+
+2. Secure the AP queues to be used by the three guests so that the host cannot
+   access them. To secure the AP queues 05.0004, 05.0047, 05.00ab, 05.00ff,
+   06.0004, 06.0047, 06.00ab, and 06.00ff for use by the vfio_ap device driver,
+   the corresponding APQNs must be removed from the default queue drivers pool
+   as follows::
+
+     echo -5,-6 > /sys/bus/ap/apmask
+
+     echo -4,-0x47,-0xab,-0xff > /sys/bus/ap/aqmask
+
+   This will result in AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, 06.0004,
+   06.0047, 06.00ab, and 06.00ff getting bound to the vfio_ap device driver. The
+   sysfs directory for the vfio_ap device driver will now contain symbolic links
+   to the AP queue devices bound to it::
+
+     /sys/bus/ap
+     ... [drivers]
+     ...... [vfio_ap]
+     ......... [05.0004]
+     ......... [05.0047]
+     ......... [05.00ab]
+     ......... [05.00ff]
+     ......... [06.0004]
+     ......... [06.0047]
+     ......... [06.00ab]
+     ......... [06.00ff]
+
+   Keep in mind that only type 10 and newer adapters (i.e., CEX4 and later)
+   can be bound to the vfio_ap device driver. The reason for this is to
+   simplify the implementation by not needlessly complicating the design by
+   supporting older devices that will go out of service in the relatively near
+   future, and for which there are few older systems on which to test.
+
+   The administrator, therefore, must take care to secure only AP queues that
+   can be bound to the vfio_ap device driver. The device type for a given AP
+   queue device can be read from the parent card's sysfs directory.
+   For example, to see the hardware type of the queue 05.0004::
+
+     cat /sys/bus/ap/devices/card05/hwtype
+
+   The hwtype must be 10 or higher (CEX4 or newer) in order to be bound to the
+   vfio_ap device driver.
+
+3. Create the mediated devices needed to configure the AP matrices for the
+   three guests and to provide an interface to the vfio_ap driver for
+   use by the guests::
+
+     /sys/devices/vfio_ap/matrix/
+     ... [mdev_supported_types]
+     ...... [vfio_ap-passthrough] (passthrough mediated matrix device type)
+     ......... create
+     ......... [devices]
+
+   To create the mediated devices for the three guests::
+
+     uuidgen > create
+     uuidgen > create
+     uuidgen > create
+
+   or
+
+   ::
+
+     echo $uuid1 > create
+     echo $uuid2 > create
+     echo $uuid3 > create
+
+   This will create three mediated devices in the [devices] subdirectory named
+   after the UUID used to create the mediated device. We'll call them $uuid1,
+   $uuid2 and $uuid3 and this is the sysfs directory structure after creation::
+
+     /sys/devices/vfio_ap/matrix/
+     ... [mdev_supported_types]
+     ...... [vfio_ap-passthrough]
+     ......... [devices]
+     ............ [$uuid1]
+     ............... assign_adapter
+     ............... assign_control_domain
+     ............... assign_domain
+     ............... matrix
+     ............... unassign_adapter
+     ............... unassign_control_domain
+     ............... unassign_domain
+
+     ............ [$uuid2]
+     ............... assign_adapter
+     ............... assign_control_domain
+     ............... assign_domain
+     ............... matrix
+     ............... unassign_adapter
+     ............... unassign_control_domain
+     ............... unassign_domain
+
+     ............ [$uuid3]
+     ............... assign_adapter
+     ............... assign_control_domain
+     ............... assign_domain
+     ............... matrix
+     ............... unassign_adapter
+     ............... unassign_control_domain
+     ............... unassign_domain
+
+4. The administrator now needs to configure the matrices for the mediated
+   devices $uuid1 (for Guest1), $uuid2 (for Guest2) and $uuid3 (for Guest3).
+
+   This is how the matrix is configured for Guest1::
+
+     echo 5 > assign_adapter
+     echo 6 > assign_adapter
+     echo 4 > assign_domain
+     echo 0xab > assign_domain
+
+   Control domains can similarly be assigned using the assign_control_domain
+   sysfs file.
+
+   If a mistake is made configuring an adapter, domain or control domain,
+   you can use the ``unassign_xxx`` interfaces to unassign the adapter, domain or
+   control domain.
+
+   To display the matrix configuration for Guest1::
+
+     cat matrix
+
+   The output will display the APQNs in the format ``xx.yyyy``, where xx is
+   the adapter number and yyyy is the domain number. The output for Guest1
+   will look like this::
+
+     05.0004
+     05.00ab
+     06.0004
+     06.00ab
+
+   This is how the matrix is configured for Guest2::
+
+     echo 5 > assign_adapter
+     echo 0x47 > assign_domain
+     echo 0xff > assign_domain
+
+   This is how the matrix is configured for Guest3::
+
+     echo 6 > assign_adapter
+     echo 0x47 > assign_domain
+     echo 0xff > assign_domain
+
+5. Start Guest1::
+
+     /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid1 ...
+
+6. Start Guest2::
+
+     /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid2 ...
+
+7. Start Guest3::
+
+     /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid3 ...
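+
+Once a guest is up, the configuration can be verified from inside that guest,
+assuming the s390-tools package providing ``lszcrypt`` is installed there; for
+Guest1 the output should match the listing shown at the beginning of this
+example::
+
+  [root@guest1 ~]# lszcrypt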
+
+When the guest is shut down, the mediated matrix devices may be removed.
+
+Using our example again, to remove the mediated matrix device $uuid1::
+
+  /sys/devices/vfio_ap/matrix/
+  ... [mdev_supported_types]
+  ...... [vfio_ap-passthrough]
+  ......... [devices]
+  ............ [$uuid1]
+  ............... remove
+
+
+  echo 1 > remove
+
+This will remove all of the mdev matrix device's sysfs structures including
+the mdev device itself. To recreate and reconfigure the mdev matrix device,
+all of the steps starting with step 3 will have to be performed again. Note
+that the remove will fail if a guest using the mdev is still running.
+
+It is not necessary to remove an mdev matrix device, but one may want to
+remove it if no guest will use it during the remaining lifetime of the linux
+host. If the mdev matrix device is removed, one may want to also reconfigure
+the pool of adapters and queues reserved for use by the default drivers.
+
+Limitations
+-----------
+
+* The KVM/kernel interfaces do not provide a way to prevent restoring an APQN
+  to the default drivers pool of a queue that is still assigned to a mediated
+  device in use by a guest. It is incumbent upon the administrator to
+  ensure there is no mediated device in use by a guest to which the APQN is
+  assigned lest the host be given access to the private data of the AP queue
+  device, such as a private key configured specifically for the guest.
+
+* Dynamically assigning AP resources to or unassigning AP resources from a
+  mediated matrix device - see `Configuring an AP matrix for a linux guest`_
+  section above - while a running guest is using it is currently not supported.
+
+* Live guest migration is not supported for guests using AP devices. If a guest
+  is using AP devices, the vfio-ap device configured for the guest must be
+  unplugged before migrating the guest (see `Hot unplug a vfio-ap device from a
+  running guest`_ section above.)
diff --git a/docs/system/s390x/vfio-ccw.rst b/docs/system/s390x/vfio-ccw.rst
new file mode 100644
index 000000000..41e0bad5b
--- /dev/null
+++ b/docs/system/s390x/vfio-ccw.rst
@@ -0,0 +1,77 @@
+Subchannel passthrough via vfio-ccw
+===================================
+
+vfio-ccw (based upon the mediated vfio device infrastructure) allows certain
+I/O subchannels and their devices to be made available to a guest. The
+host will not interact with those subchannels/devices any more.
+
+Note that while vfio-ccw should work with most non-QDIO devices, only ECKD
+DASDs have really been tested.
+
+Example configuration
+---------------------
+
+Step 1: configure the host device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As every mdev is identified by a uuid, the first step is to obtain one::
+
+  [root@host ~]# uuidgen
+  7e270a25-e163-4922-af60-757fc8ed48c6
+
+Note: it is recommended to use the ``mdevctl`` tool for actually configuring
+the host device.
+
+To define the same device as configured below and have it started
+automatically, use
+
+::
+
+  [root@host ~]# driverctl -b css set-override 0.0.0313 vfio_ccw
+  [root@host ~]# mdevctl define -u 7e270a25-e163-4922-af60-757fc8ed48c6 \
+                 -p 0.0.0313 -t vfio_ccw-io -a
+
+If using ``mdevctl`` is not possible or wanted, follow the manual procedure
+below.
+
+* Locate the subchannel for the device (in this example, ``0.0.2b09``)::
+
+    [root@host ~]# lscss | grep 0.0.2b09 | awk '{print $2}'
+    0.0.0313
+
+* Unbind the subchannel (in this example, ``0.0.0313``) from the standard
+  I/O subchannel driver and bind it to the vfio-ccw driver::
+
+    [root@host ~]# echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind
+    [root@host ~]# echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind
+
+* Create the mediated device (identified by the uuid)::
+
+    [root@host ~]# echo "7e270a25-e163-4922-af60-757fc8ed48c6" > \
+    /sys/bus/css/devices/0.0.0313/mdev_supported_types/vfio_ccw-io/create
+
+Step 2: configure QEMU
+~~~~~~~~~~~~~~~~~~~~~~
+
+* Reference the created mediated device and (optionally) pick a device id to
+  be presented in the guest (here, ``fe.0.1234``, which will end up visible
+  in the guest as ``0.0.1234``)::
+
+    -device vfio-ccw,devno=fe.0.1234,sysfsdev=\
+    /sys/bus/mdev/devices/7e270a25-e163-4922-af60-757fc8ed48c6
+
+* Start the guest. The device (here, ``0.0.1234``) should now be usable::
+
+    [root@guest ~]# lscss -d 0.0.1234
+    Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPID
+    ----------------------------------------------------------------------
+    0.0.1234 0.0.0007  3390/0e 3990/e9      f0  f0  ff   1a2a3a0a 00000000
+    [root@guest ~]# chccwdev -e 0.0.1234
+    Setting device 0.0.1234 online
+    Done
+    [root@guest ~]# dmesg -t
+    (...)
+    dasd-eckd 0.0.1234: A channel path to the device has become operational
+    dasd-eckd 0.0.1234: New DASD 3390/0E (CU 3990/01) with 10017 cylinders, 15 heads, 224 sectors
+    dasd-eckd 0.0.1234: DASD with 4 KB/block, 7212240 KB total size, 48 KB/track, compatible disk layout
+    dasda:VOL1/  0X2B09: dasda1
diff --git a/docs/system/secrets.rst b/docs/system/secrets.rst
new file mode 100644
index 000000000..4a177369b
--- /dev/null
+++ b/docs/system/secrets.rst
@@ -0,0 +1,162 @@
+.. _secret data:
+
+Providing secret data to QEMU
+-----------------------------
+
+There are a variety of objects in QEMU which require secret data to be provided
+by the administrator or management application. For example, network block
+devices often require a password, LUKS block devices require a passphrase to
+unlock key material, and remote desktop services require an access password.
+QEMU has a general purpose mechanism for providing secret data to QEMU in a
+secure manner, using the ``secret`` object type.
+
+At startup this can be done using the ``-object secret,...`` command line
+argument. At runtime this can be done using the ``object_add`` QMP / HMP
+monitor commands. The examples that follow will illustrate use of ``-object``
+command lines, but they all apply equivalently in QMP / HMP. When creating
+a ``secret`` object it must be given a unique ID string. This ID is then
+used to identify the object when configuring the thing which needs the data.
+
+
+INSECURE: Passing secrets as clear text inline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**The following should never be done in a production environment or on a
+multi-user host. Command line arguments are usually visible in the process
+listings and are often collected in log files by system monitoring agents
+or bug reporting tools. QMP/HMP commands and their arguments are also often
+logged and attached to bug reports. This all risks compromising secrets that
+are passed inline.**
+
+For the convenience of people debugging / developing with QEMU, it is possible
+to pass secret data inline on the command line.
+
+::
+
+  -object secret,id=secvnc0,data=87539319
+
+
+It is also possible to provide the data in base64 encoded format, which is
+particularly useful if the data contains binary characters that would clash
+with argument parsing.
+
+::
+
+  -object secret,id=secvnc0,data=ODc1MzkzMTk=,format=base64
+
+
+**Note: base64 encoding does not provide any security benefit.**
+
+Passing secrets as clear text via a file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest approach to providing data securely is to use a file to store
+the secret:
+
+::
+
+  -object secret,id=secvnc0,file=vnc-password.txt
+
+
+In this example the file ``vnc-password.txt`` contains the plain text secret
+data. It is important to note that the contents of the file are treated as an
+opaque blob. The entire raw file contents are used as the value, thus it is
+important not to mistakenly add any trailing newline character in the file if
+this newline is not intended to be part of the secret data.
+
+In some cases it might be more convenient to pass the secret data in base64
+format and have QEMU decode it to get the raw bytes before use:
+
+::
+
+  -object secret,id=sec0,file=vnc-password.txt,format=base64
+
+
+The file should generally be given mode ``0600`` or ``0400`` permissions, and
+have its user/group ownership set to the same account that the QEMU process
+will be launched under. If using mandatory access control such as SELinux, then
+the file should be labelled to only grant access to the specific QEMU process
+that needs access. This will prevent other processes/users from compromising the
+secret data.
+
+
+Passing secrets as cipher text inline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To address the insecurity of passing secrets inline as clear text, it is
+possible to configure a second secret as an AES key to use for decrypting
+the data.
+
+The secret used as the AES key must always be configured using the file based
+storage mechanism:
+
+::
+
+  -object secret,id=secmaster,file=masterkey.data,format=base64
+
+
+In this case the ``masterkey.data`` file would be initialized with 32
+cryptographically secure random bytes, which are then base64 encoded.
+The contents of this file will be used as an AES-256 key to encrypt the
+real secret, which can now be safely passed to QEMU inline as cipher text.
+
+::
+
+  -object secret,id=secvnc0,keyid=secmaster,data=BASE64-CIPHERTEXT,iv=BASE64-IV,format=base64
+
+
+In this example ``BASE64-CIPHERTEXT`` is the result of AES-256-CBC encrypting
+the secret with ``masterkey.data`` and then base64 encoding the ciphertext.
+The ``BASE64-IV`` data is 16 random bytes which have been base64 encoded.
+These bytes are used as the initialization vector for the AES-256-CBC
+encryption.
+
+A single master key can be used to encrypt all subsequent secrets, **but it is
+critical that a different initialization vector is used for every secret**.
+
+Passing secrets via the Linux keyring
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The mechanisms described earlier are platform agnostic. If using QEMU on a Linux
+host, it is further possible to pass secrets to QEMU using the Linux keyring:
+
+::
+
+  -object secret_keyring,id=secvnc0,serial=1729
+
+
+This instructs QEMU to load data from the Linux keyring secret identified by
+the serial number ``1729``.
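+
+The secret must have been placed in the keyring beforehand, for example with
+the ``keyctl`` tool. The following is only a sketch with a hypothetical
+payload; the key serial printed by ``keyctl add`` is the value to pass to
+QEMU as ``serial``:
+
+::
+
+  $ keyctl add user qemu-vnc-pass 87539319 @u
+  1729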
+
+It is possible to combine use of the keyring with other features mentioned
+earlier, such as base64 encoding:
+
+::
+
+  -object secret_keyring,id=secvnc0,serial=1729,format=base64
+
+
+and also encryption with a master key:
+
+::
+
+  -object secret_keyring,id=secvnc0,keyid=secmaster,serial=1729,iv=BASE64-IV
+
+
+Best practice
+~~~~~~~~~~~~~
+
+It is recommended for production deployments to use a master key secret, and
+then pass all subsequent inline secrets encrypted with the master key.
+
+Each QEMU instance must have a distinct master key, and that key must be
+generated from a cryptographically secure random data source. The master key
+should be deleted immediately upon QEMU shutdown. If passing the master key as
+a file, the key file must have access control rules applied that restrict
+access to just the one QEMU process that is intended to use it. Alternatively
+the Linux keyring can be used to pass the master key to QEMU.
+
+The secrets for individual QEMU device backends must all then be encrypted
+with this master key.
+
+This procedure helps ensure that the individual secrets for QEMU backends will
+not be compromised, even if ``-object`` CLI args or ``object_add`` monitor
+commands are collected in log files and attached to public bug support tickets.
+The only item that needs strongly protecting is the master key file.
diff --git a/docs/system/security.rst b/docs/system/security.rst
new file mode 100644
index 000000000..f2092c876
--- /dev/null
+++ b/docs/system/security.rst
@@ -0,0 +1,173 @@
+Security
+========
+
+Overview
+--------
+
+This chapter explains the security requirements that QEMU is designed to meet
+and principles for securely deploying QEMU.
+
+Security Requirements
+---------------------
+
+QEMU supports many different use cases, some of which have stricter security
+requirements than others. The community has agreed on the overall security
+requirements that users may depend on. These requirements define what is
+considered supported from a security perspective.
+
+Virtualization Use Case
+'''''''''''''''''''''''
+
+The virtualization use case covers cloud and virtual private server (VPS)
+hosting, as well as traditional data center and desktop virtualization. These
+use cases rely on hardware virtualization extensions to execute guest code
+safely on the physical CPU at close-to-native speed.
+
+The following entities are untrusted, meaning that they may be buggy or
+malicious:
+
+- Guest
+- User-facing interfaces (e.g. VNC, SPICE, WebSocket)
+- Network protocols (e.g. NBD, live migration)
+- User-supplied files (e.g. disk images, kernels, device trees)
+- Passthrough devices (e.g. PCI, USB)
+
+Bugs affecting these entities are evaluated on whether they can cause damage in
+real-world use cases and are treated as security bugs if this is the case.
+
+Non-virtualization Use Case
+'''''''''''''''''''''''''''
+
+The non-virtualization use case covers emulation using the Tiny Code Generator
+(TCG). In principle the TCG and device emulation code used in conjunction with
+the non-virtualization use case should meet the same security requirements as
+the virtualization use case. However, for historical reasons much of the
+non-virtualization use case code was not written with these security
+requirements in mind.
+
+Bugs affecting the non-virtualization use case are not considered security
+bugs at this time. Users with non-virtualization use cases must not rely on
+QEMU to provide guest isolation or any security guarantees.
+
+Architecture
+------------
+
+This section describes the design principles that ensure the security
+requirements are met.
+
+Guest Isolation
+'''''''''''''''
+
+Guest isolation is the confinement of guest code to the virtual machine. When
+guest code gains control of execution on the host, this is called escaping the
+virtual machine. Isolation also includes resource limits such as throttling of
+CPU, memory, disk, or network. Guests must be unable to exceed their resource
+limits.
+
+QEMU presents an attack surface to the guest in the form of emulated devices.
+The guest must not be able to gain control of QEMU. Bugs in emulated devices
+could allow malicious guests to gain code execution in QEMU. At this point the
+guest has escaped the virtual machine and is able to act in the context of the
+QEMU process on the host.
+
+Guests often interact with other guests and share resources with them. A
+malicious guest must not gain control of other guests or access their data.
+Disk image files and network traffic must be protected from other guests unless
+explicitly shared between them by the user.
+
+Principle of Least Privilege
+''''''''''''''''''''''''''''
+
+The principle of least privilege states that each component only has access to
+the privileges necessary for its function. In the case of QEMU this means that
+each process only has access to resources belonging to the guest.
+
+The QEMU process should not have access to any resources that are inaccessible
+to the guest. This way the guest does not gain anything by escaping into the
+QEMU process since it already has access to those same resources from within
+the guest.
+
+Following the principle of least privilege immediately fulfills guest isolation
+requirements. For example, guest A only has access to its own disk image file
+``a.img`` and not guest B's disk image file ``b.img``.
+
+In reality certain resources are inaccessible to the guest but must be
+available to QEMU to perform its function. For example, host system calls are
+necessary for QEMU but are not exposed to guests. A guest that escapes into
+the QEMU process can then begin invoking host system calls.
+
+New features must be designed to follow the principle of least privilege.
+Should this not be possible for technical reasons, the security risk must be
+clearly documented so users are aware of the trade-off of enabling the feature.
+
+Isolation mechanisms
+''''''''''''''''''''
+
+Several isolation mechanisms are available to realize this architecture of
+guest isolation and the principle of least privilege. With the exception of
+Linux seccomp, these mechanisms are all deployed by management tools that
+launch QEMU, such as libvirt. They are also platform-specific, so they are only
+described briefly for Linux here.
+
+The fundamental isolation mechanism is that QEMU processes must run as
+unprivileged users. Sometimes it seems more convenient to launch QEMU as
+root to give it access to host devices (e.g. ``/dev/net/tun``) but this poses a
+huge security risk. File descriptor passing can be used to give an otherwise
+unprivileged QEMU process access to host devices without running QEMU as root.
+It is also possible to launch QEMU as a non-root user and configure UNIX groups
+for access to ``/dev/kvm``, ``/dev/net/tun``, and other device nodes.
+Some Linux distros already ship with UNIX groups for these devices by default.
+
+- SELinux and AppArmor make it possible to confine processes beyond the
+  traditional UNIX process and file permissions model.
+  They restrict the QEMU
+  process from accessing processes and files on the host system that are not
+  needed by QEMU.
+
+- Resource limits and cgroup controllers provide throughput and utilization
+  limits on key resources such as CPU time, memory, and I/O bandwidth.
+
+- Linux namespaces can be used to make process, file system, and other system
+  resources unavailable to QEMU. A namespaced QEMU process is restricted to only
+  those resources that were granted to it.
+
+- Linux seccomp is available via the QEMU ``--sandbox`` option. It disables
+  system calls that are not needed by QEMU, thereby reducing the host kernel
+  attack surface.
+
+Sensitive configurations
+------------------------
+
+There are aspects of QEMU that can have security implications which users and
+management applications must be aware of.
+
+Monitor console (QMP and HMP)
+'''''''''''''''''''''''''''''
+
+The monitor console (whether used with QMP or HMP) provides an interface
+to dynamically control many aspects of QEMU's runtime operation. Many of the
+commands exposed will instruct QEMU to access content on the host file system
+and/or trigger spawning of external processes.
+
+For example, the ``migrate`` command allows for the spawning of arbitrary
+processes for the purpose of tunnelling the migration data stream. The
+``blockdev-add`` command instructs QEMU to open arbitrary files, exposing
+their content to the guest as a virtual disk.
+
+Unless QEMU is otherwise confined using technologies such as SELinux, AppArmor,
+or Linux namespaces, the monitor console should be considered to have privileges
+equivalent to those of the user account QEMU is running under.
+
+It is further important to consider the security of the character device backend
+over which the monitor console is exposed. It needs to have protection against
+malicious third parties that might try to make unauthorized connections, or
+perform man-in-the-middle attacks. Many of the character device backends do not
+satisfy this requirement and so must not be used for the monitor console.
+
+The general recommendation is that the monitor console should be exposed over
+a UNIX domain socket backend to the local host only. Use of the TCP based
+character device backend is inappropriate unless configured to use both TLS
+encryption and authorization control policy on client connections.
+
+In summary, the monitor console is considered a privileged control interface to
+QEMU and as such should only be made accessible to a trusted management
+application or user.
diff --git a/docs/system/target-arm.rst b/docs/system/target-arm.rst
new file mode 100644
index 000000000..91ebc26c6
--- /dev/null
+++ b/docs/system/target-arm.rst
@@ -0,0 +1,120 @@
+.. _ARM-System-emulator:
+
+Arm System emulator
+-------------------
+
+QEMU can emulate both 32-bit and 64-bit Arm CPUs. Use the
+``qemu-system-aarch64`` executable to simulate a 64-bit Arm machine.
+You can use either ``qemu-system-arm`` or ``qemu-system-aarch64``
+to simulate a 32-bit Arm machine: in general, command lines that
+work for ``qemu-system-arm`` will behave the same when used with
+``qemu-system-aarch64``.
+
+QEMU has generally good support for Arm guests. It has support for
+nearly fifty different machines. The reason we support so many is that
+Arm hardware is much more widely varying than x86 hardware.
+Arm CPUs
+are generally built into "system-on-chip" (SoC) designs created by
+many different companies with different devices, and these SoCs are
+then built into machines which can vary still further even if they use
+the same SoC. Even with fifty boards QEMU does not cover more than a
+small fraction of the Arm hardware ecosystem.
+
+The situation for 64-bit Arm is fairly similar, except that we don't
+implement so many different machines.
+
+As well as the more common "A-profile" CPUs (which have MMUs and will
+run Linux) QEMU also supports "M-profile" CPUs such as the Cortex-M0,
+Cortex-M4 and Cortex-M33 (which are microcontrollers used in deeply
+embedded boards). For most boards the CPU type is fixed (matching what
+the hardware has), so typically you don't need to specify the CPU type
+by hand, except for special cases like the ``virt`` board.
+
+Choosing a board model
+======================
+
+For QEMU's Arm system emulation, you must specify which board
+model you want to use with the ``-M`` or ``--machine`` option;
+there is no default.
+
+Because Arm systems differ so much and in fundamental ways, typically
+operating system or firmware images intended to run on one machine
+will not run at all on any other. This is often surprising for new
+users who are used to the x86 world where every system looks like a
+standard PC. (Once the kernel has booted, most userspace software
+cares much less about the detail of the hardware.)
+
+If you already have a system image or a kernel that works on hardware
+and you want to boot with QEMU, check whether QEMU lists that machine
+in its ``-machine help`` output. If it is listed, then you can probably
+use that board model. If it is not listed, then unfortunately your image
+will almost certainly not boot on QEMU. (You might be able to
+extract the filesystem and use that with a different kernel which
+boots on a system that QEMU does emulate.)
+
+If you don't care about reproducing the idiosyncrasies of a particular
+bit of hardware, such as a small amount of RAM, no PCI or other hard
+disk, etc., and just want to run Linux, the best option is to use the
+``virt`` board. This is a platform which doesn't correspond to any
+real hardware and is designed for use in virtual machines. You'll
+need to compile Linux with a suitable configuration for running on
+the ``virt`` board. ``virt`` supports PCI, virtio, recent CPUs and
+large amounts of RAM. It also supports 64-bit CPUs.
+
+Board-specific documentation
+============================
+
+Unfortunately many of the Arm boards QEMU supports are currently
+undocumented; you can get a complete list by running
+``qemu-system-aarch64 --machine help``.
+
+..
+   This table of contents should be kept sorted alphabetically
+   by the title text of each file, which isn't the same ordering
+   as an alphabetical sort by filename.
+
+.. toctree::
+   :maxdepth: 1
+
+   arm/integratorcp
+   arm/mps2
+   arm/musca
+   arm/realview
+   arm/sbsa
+   arm/versatile
+   arm/vexpress
+   arm/aspeed
+   arm/sabrelite
+   arm/digic
+   arm/cubieboard
+   arm/emcraft-sf2
+   arm/highbank
+   arm/musicpal
+   arm/gumstix
+   arm/mainstone
+   arm/kzm
+   arm/nrf
+   arm/nseries
+   arm/nuvoton
+   arm/imx25-pdk
+   arm/orangepi
+   arm/palm
+   arm/raspi
+   arm/xscale
+   arm/collie
+   arm/sx1
+   arm/stellaris
+   arm/stm32
+   arm/virt
+   arm/xlnx-versal-virt
+
+Emulated CPU architecture support
+=================================
+
+.. toctree::
+   arm/emulation
+
+Arm CPU features
+================
+
+.. toctree::
+   arm/cpu-features
diff --git a/docs/system/target-avr.rst b/docs/system/target-avr.rst
new file mode 100644
index 000000000..03d5ab51c
--- /dev/null
+++ b/docs/system/target-avr.rst
@@ -0,0 +1,48 @@
+.. _AVR-System-emulator:
+
+AVR System emulator
+-------------------
+
+Use the executable ``qemu-system-avr`` to emulate an AVR 8-bit based machine.
+These can have one of the following cores: avr1, avr2, avr25, avr3, avr31,
+avr35, avr4, avr5, avr51, avr6, avrtiny, xmega2, xmega3, xmega4, xmega5,
+xmega6 and xmega7.
+
+For now it supports a few Arduino boards for educational and testing
+purposes. These boards use an ATmega controller, whose model is limited to
+USART and 16-bit timer devices, enough to run FreeRTOS based applications
+(like
+https://github.com/seharris/qemu-avr-tests/blob/master/free-rtos/Demo/AVR_ATMega2560_GCC/demo.elf
+).
+
+The following are examples of possible usage, assuming demo.elf is compiled
+for an AVR CPU:
+
+- Continuous, uninterrupted execution::
+
+   qemu-system-avr -machine mega2560 -bios demo.elf
+
+- Continuous, uninterrupted execution with serial output into a telnet window::
+
+   qemu-system-avr -M mega2560 -bios demo.elf -nographic \
+       -serial tcp::5678,server=on,wait=off
+
+  and then in another shell::
+
+   telnet localhost 5678
+
+- Debugging with GDB::
+
+   qemu-system-avr -machine mega2560 -bios demo.elf -s -S
+
+  and then in another shell::
+
+   avr-gdb demo.elf
+
+  and then within the GDB shell::
+
+   target remote :1234
+
+- Print out executed instructions (that have not been translated by the JIT
+  compiler yet)::
+
+   qemu-system-avr -machine mega2560 -bios demo.elf -d in_asm
diff --git a/docs/system/target-i386-desc.rst.inc b/docs/system/target-i386-desc.rst.inc
new file mode 100644
index 000000000..7d1fffacb
--- /dev/null
+++ b/docs/system/target-i386-desc.rst.inc
@@ -0,0 +1,73 @@
+The QEMU PC System emulator simulates the following peripherals:
+
+- i440FX host PCI bridge and PIIX3 PCI to ISA bridge
+
+- Cirrus CLGD 5446 PCI VGA card or dummy VGA card with Bochs VESA
+  extensions (hardware level, including all non-standard modes).
+
+- PS/2 mouse and keyboard
+
+- 2 PCI IDE interfaces with hard disk and CD-ROM support
+
+- Floppy disk
+
+- PCI and ISA network adapters
+
+- Serial ports
+
+- IPMI BMC, either an internal or external one
+
+- Creative SoundBlaster 16 sound card
+
+- ENSONIQ AudioPCI ES1370 sound card
+
+- Intel 82801AA AC97 Audio compatible sound card
+
+- Intel HD Audio Controller and HDA codec
+
+- Adlib (OPL2) - Yamaha YM3812 compatible chip
+
+- Gravis Ultrasound GF1 sound card
+
+- CS4231A compatible sound card
+
+- PC speaker
+
+- PCI UHCI, OHCI, EHCI or XHCI USB controller and a virtual USB-1.1
+  hub.
+
+SMP is supported with up to 255 CPUs.
+
+QEMU uses the PC BIOS from the Seabios project and the Plex86/Bochs LGPL
+VGA BIOS.
+
+QEMU uses YM3812 emulation by Tatsuyuki Satoh.
+
+QEMU uses GUS emulation (GUSEMU32 http://www.deinmeister.de/gusemu/) by
+Tibor \"TS\" Schütz.
+
+Note that, by default, GUS shares IRQ(7) with parallel ports and so QEMU
+must be told to not have parallel ports to have working GUS.
+
+.. parsed-literal::
+
+   |qemu_system_x86| dos.img -device gus -parallel none
+
+Alternatively:
+
+.. parsed-literal::
+
+   |qemu_system_x86| dos.img -device gus,irq=5
+
+Or some other unclaimed IRQ.
+
+CS4231A is the chip used in Windows Sound System and GUSMAX products.
+
+The PC speaker audio device can be configured using the pcspk-audiodev
+machine property, e.g.
+
+.. parsed-literal::
+
+   |qemu_system_x86| some.img \
+       -audiodev <backend>,id=<name> \
+       -machine pcspk-audiodev=<name>
diff --git a/docs/system/target-i386.rst b/docs/system/target-i386.rst
new file mode 100644
index 000000000..4daa53c35
--- /dev/null
+++ b/docs/system/target-i386.rst
@@ -0,0 +1,40 @@
+.. _QEMU-PC-System-emulator:
+
+x86 System emulator
+-------------------
+
+.. _pcsys_005fdevices:
+
+Board-specific documentation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+..
+   This table of contents should be kept sorted alphabetically
+   by the title text of each file, which isn't the same ordering
+   as an alphabetical sort by filename.
+
+.. toctree::
+   :maxdepth: 1
+
+   i386/microvm
+   i386/pc
+
+Architectural features
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. toctree::
+   :maxdepth: 1
+
+   i386/cpu
+   i386/kvm-pv
+   i386/sgx
+
+.. _pcsys_005freq:
+
+OS requirements
+~~~~~~~~~~~~~~~
+
+On x86_64 hosts, the default set of CPU features enabled by the KVM
+accelerator requires the host to be running Linux v4.5 or newer. Red Hat
+Enterprise Linux 7 is also supported, since the required
+functionality was backported.
diff --git a/docs/system/target-m68k.rst b/docs/system/target-m68k.rst
new file mode 100644
index 000000000..d28d3b92e
--- /dev/null
+++ b/docs/system/target-m68k.rst
@@ -0,0 +1,21 @@
+.. _ColdFire-System-emulator:
+
+ColdFire System emulator
+------------------------
+
+Use the executable ``qemu-system-m68k`` to simulate a ColdFire machine.
+The emulator is able to boot a uClinux kernel.
+
+The M5208EVB emulation includes the following devices:
+
+- MCF5208 ColdFire V2 Microprocessor (ISA A+ with EMAC).
+
+- Three on-chip UARTs.
+
+- Fast Ethernet Controller (FEC)
+
+The AN5206 emulation includes the following devices:
+
+- MCF5206 ColdFire V2 Microprocessor.
+
+- Two on-chip UARTs.
diff --git a/docs/system/target-mips.rst b/docs/system/target-mips.rst
new file mode 100644
index 000000000..138441bde
--- /dev/null
+++ b/docs/system/target-mips.rst
@@ -0,0 +1,130 @@
+.. _MIPS-System-emulator:
+
+MIPS System emulator
+--------------------
+
+Four executables cover simulation of 32-bit and 64-bit MIPS systems in
+both endian options: ``qemu-system-mips``, ``qemu-system-mipsel``,
+``qemu-system-mips64`` and ``qemu-system-mips64el``. Five different
+machine types are emulated:
+
+- A generic ISA PC-like machine \"mips\"
+
+- The MIPS Malta prototype board \"malta\"
+
+- An ACER Pica \"pica61\". This machine needs the 64-bit emulator.
+
+- MIPS emulator pseudo board \"mipssim\"
+
+- A MIPS Magnum R4000 machine \"magnum\". This machine needs the
+  64-bit emulator.
+
+The generic emulation is supported by Debian 'Etch' and is able to
+install Debian into a virtual disk image. The following devices are
+emulated:
+
+- A range of MIPS CPUs, default is the 24Kf
+
+- PC style serial port
+
+- PC style IDE disk
+
+- NE2000 network card
+
+The Malta emulation supports the following devices:
+
+- Core board with MIPS 24Kf CPU and Galileo system controller
+
+- PIIX4 PCI/USB/SMbus controller
+
+- The Multi-I/O chip's serial device
+
+- PCI network cards (PCnet32 and others)
+
+- Malta FPGA serial device
+
+- Cirrus (default) or any other PCI VGA graphics card
+
+The Boston board emulation supports the following devices:
+
+- Xilinx FPGA, which includes a PCIe root port and a UART
+
+- Intel EG20T PCH connects the I/O peripherals, but only the SATA bus
+  is emulated
+
+The ACER Pica emulation supports:
+
+- MIPS R4000 CPU
+
+- PC-style IRQ and DMA controllers
+
+- PC Keyboard
+
+- IDE controller
+
+The MIPS Magnum R4000 emulation supports:
+
+- MIPS R4000 CPU
+
+- PC-style IRQ controller
+
+- PC Keyboard
+
+- SCSI controller
+
+- G364 framebuffer
+
+The Fuloong 2E emulation supports:
+
+- Loongson 2E CPU
+
+- Bonito64 system controller as North Bridge
+
+- VT82C686 chipset as South Bridge
+
+- RTL8139D as a network card chipset
+
+The Loongson-3 virtual platform emulation supports:
+
+- Loongson 3A CPU
+
+- LIOINTC as interrupt controller
+
+- GPEX and virtio as peripheral devices
+
+- Both KVM and TCG supported
+
+The mipssim pseudo board emulation provides an environment similar to
+what the proprietary MIPS emulator uses for running Linux. It supports:
+
+- A range of MIPS CPUs, default is the 24Kf
+
+- PC style serial port
+
+- MIPSnet network emulation
+
+.. include:: cpu-models-mips.rst.inc
+
+.. _nanoMIPS-System-emulator:
+
+nanoMIPS System emulator
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``qemu-system-mipsel`` executable also covers simulation of a 32-bit
+nanoMIPS system in little-endian mode:
+
+- nanoMIPS I7200 CPU
+
+An example of ``qemu-system-mipsel`` usage for nanoMIPS is shown below.
+
+Download ``<disk_image_file>`` from
+https://mipsdistros.mips.com/LinuxDistro/nanomips/buildroot/index.html.
+
+Download ``<kernel_image_file>`` from
+https://mipsdistros.mips.com/LinuxDistro/nanomips/kernels/v4.15.18-432-gb2eb9a8b07a1-20180627102142/index.html.
+
+Start system emulation of the Malta board with a nanoMIPS I7200 CPU::
+
+  qemu-system-mipsel -cpu I7200 -kernel <kernel_image_file> \
+      -M malta -serial stdio -m <memory_size> -hda <disk_image_file> \
+      -append "mem=256m@0x0 rw console=ttyS0 vga=cirrus vesa=0x111 root=/dev/sda"
diff --git a/docs/system/target-ppc.rst b/docs/system/target-ppc.rst
new file mode 100644
index 000000000..4f6eb93b1
--- /dev/null
+++ b/docs/system/target-ppc.rst
@@ -0,0 +1,25 @@
+.. _PowerPC-System-emulator:
+
+PowerPC System emulator
+-----------------------
+
+Board-specific documentation
+============================
+
+You can get a complete list by running ``qemu-system-ppc64 --machine
+help``.
+
+..
+   This table of contents should be kept sorted alphabetically
+   by the title text of each file, which isn't the same ordering
+   as an alphabetical sort by filename.
+
+.. toctree::
+   :maxdepth: 1
+
+   ppc/embedded
+   ppc/powermac
+   ppc/powernv
+   ppc/ppce500
+   ppc/prep
+   ppc/pseries
diff --git a/docs/system/target-riscv.rst b/docs/system/target-riscv.rst
new file mode 100644
index 000000000..89a866e4f
--- /dev/null
+++ b/docs/system/target-riscv.rst
@@ -0,0 +1,86 @@
+.. _RISC-V-System-emulator:
+
+RISC-V System emulator
+======================
+
+QEMU can emulate both 32-bit and 64-bit RISC-V CPUs. Use the
+``qemu-system-riscv64`` executable to simulate a 64-bit RISC-V machine,
+and the ``qemu-system-riscv32`` executable to simulate a 32-bit RISC-V
+machine.
+
+QEMU has generally good support for RISC-V guests. It has support for
+several different machines. The reason we support so many is that
+RISC-V hardware is much more widely varying than x86 hardware. RISC-V
+CPUs are generally built into "system-on-chip" (SoC) designs created by
+many different companies with different devices, and these SoCs are
+then built into machines which can vary still further even if they use
+the same SoC.
+
+For most boards the CPU type is fixed (matching what the hardware has),
+so typically you don't need to specify the CPU type by hand, except for
+special cases like the ``virt`` board.
+
+Choosing a board model
+----------------------
+
+For QEMU's RISC-V system emulation, you must specify which board
+model you want to use with the ``-M`` or ``--machine`` option;
+there is no default.
+
+Because RISC-V systems differ so much and in fundamental ways, typically
+operating system or firmware images intended to run on one machine
+will not run at all on any other. This is often surprising for new
+users who are used to the x86 world where every system looks like a
+standard PC. (Once the kernel has booted, most user space software
+cares much less about the detail of the hardware.)
+
+If you already have a system image or a kernel that works on hardware
+and you want to boot with QEMU, check whether QEMU lists that machine
+in its ``-machine help`` output. If it is listed, then you can probably
+use that board model. If it is not listed, then unfortunately your image
+will almost certainly not boot on QEMU. (You might be able to
+extract the file system and use that with a different kernel which
+boots on a system that QEMU does emulate.)
+
+If you don't care about reproducing the idiosyncrasies of a particular
+bit of hardware, such as a small amount of RAM, no PCI or other hard
+disk, etc., and just want to run Linux, the best option is to use the
+``virt`` board. This is a platform which doesn't correspond to any
+real hardware and is designed for use in virtual machines. You'll
+need to compile Linux with a suitable configuration for running on
+the ``virt`` board. ``virt`` supports PCI, virtio, recent CPUs and
+large amounts of RAM. It also supports 64-bit CPUs.
+
+Board-specific documentation
+----------------------------
+
+Unfortunately many of the RISC-V boards QEMU supports are currently
+undocumented; you can get a complete list by running
+``qemu-system-riscv64 --machine help``, or
+``qemu-system-riscv32 --machine help``.
+
+..
+   This table of contents should be kept sorted alphabetically
+   by the title text of each file, which isn't the same ordering
+   as an alphabetical sort by filename.
+
+.. toctree::
+   :maxdepth: 1
+
+   riscv/microchip-icicle-kit
+   riscv/shakti-c
+   riscv/sifive_u
+   riscv/virt
+
+RISC-V CPU firmware
+-------------------
+
+When using the ``sifive_u`` or ``virt`` machine there are three different
+firmware boot options:
+
+1. ``-bios default`` - This is the default behaviour if no -bios option
+   is included. This option will load the default OpenSBI firmware
+   automatically. The firmware is included with the QEMU release and no
+   user interaction is required. All a user needs to do is specify the
+   kernel they want to boot with the -kernel option.
+2. ``-bios none`` - QEMU will not automatically load any firmware. It is
+   up to the user to load all the images they need.
+3. ``-bios <file>`` - Tells QEMU to load the specified file as the firmware.
diff --git a/docs/system/target-rx.rst b/docs/system/target-rx.rst
new file mode 100644
index 000000000..4a20a89a0
--- /dev/null
+++ b/docs/system/target-rx.rst
@@ -0,0 +1,36 @@
+.. _RX-System-emulator:
+
+RX System emulator
+------------------
+
+Use the executable ``qemu-system-rx`` to simulate an RX target (GDB simulator).
+This target emulates the following devices:
+
+- R5F562N8 MCU
+
+  - On-chip memory (ROM 512KB, RAM 96KB)
+  - Interrupt Control Unit (ICUa)
+  - 8-bit Timer x 1CH (TMR0,1)
+  - Compare Match Timer x 2CH (CMT0,1)
+  - Serial Communication Interface x 1CH (SCI0)
+
+- External memory 16MByte
+
+An example of ``qemu-system-rx`` usage is shown below.
+
+Download ``<u-boot_image_file>`` from
+https://osdn.net/users/ysato/pf/qemu/dl/u-boot.bin.gz
+
+Start emulation of rx-virt::
+
+  qemu-system-rx -M gdbsim-r5f562n8 -bios <u-boot_image_file>
+
+Download ``<kernel_image_file>`` from
+https://osdn.net/users/ysato/pf/qemu/dl/zImage
+
+Download ``<device_tree_blob>`` from
+https://osdn.net/users/ysato/pf/qemu/dl/rx-virt.dtb
+
+Start emulation of rx-virt::
+
+  qemu-system-rx -M gdbsim-r5f562n8 \
+      -kernel <kernel_image_file> -dtb <device_tree_blob> \
+      -append "earlycon"
diff --git a/docs/system/target-s390x.rst b/docs/system/target-s390x.rst
new file mode 100644
index 000000000..c636f6411
--- /dev/null
+++ b/docs/system/target-s390x.rst
@@ -0,0 +1,35 @@
+.. _s390x-System-emulator:
+
+s390x System emulator
+---------------------
+
+QEMU can emulate z/Architecture (in particular, 64-bit) s390x systems
+via the ``qemu-system-s390x`` binary. Only one machine type,
+``s390-ccw-virtio``, is supported (with versioning for compatibility
+handling).
+
+When using KVM as the accelerator, QEMU can emulate CPUs up to the
+generation of the host. When using the default CPU model with TCG as the
+accelerator, QEMU will emulate a subset of z13 CPU features that should
+be enough to run distributions built for the z13.
+
+Device support
+==============
+
+QEMU will not emulate most of the traditional devices found under LPAR or
+z/VM; virtio devices (especially using virtio-ccw) make up the bulk of
+the available devices. Passthrough of host devices via vfio-pci, vfio-ccw,
+or vfio-ap is also available.
+
+.. toctree::
+   s390x/vfio-ap
+   s390x/css
+   s390x/3270
+   s390x/vfio-ccw
+
+Architectural features
+======================
+
+.. toctree::
+   s390x/bootdevices
+   s390x/protvirt
diff --git a/docs/system/target-sparc.rst b/docs/system/target-sparc.rst
new file mode 100644
index 000000000..b55f8d09e
--- /dev/null
+++ b/docs/system/target-sparc.rst
@@ -0,0 +1,62 @@
+.. _Sparc32-System-emulator:
+
+Sparc32 System emulator
+-----------------------
+
+Use the executable ``qemu-system-sparc`` to simulate the following Sun4m
+architecture machines:
+
+- SPARCstation 4
+
+- SPARCstation 5
+
+- SPARCstation 10
+
+- SPARCstation 20
+
+- SPARCserver 600MP
+
+- SPARCstation LX
+
+- SPARCstation Voyager
+
+- SPARCclassic
+
+- SPARCbook
+
+The emulation is somewhat complete. SMP up to 16 CPUs is supported, but
+Linux limits the number of usable CPUs to 4.
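+
+For example, a kernel built for one of these machines could be booted
+along these lines (the kernel file name here is only an illustration)::
+
+  qemu-system-sparc -M SS-5 -m 256 -kernel vmlinux \
+      -append "console=ttyS0" -nographic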
+ +QEMU emulates the following sun4m peripherals: + +- IOMMU + +- TCX or cgthree Frame buffer + +- Lance (Am7990) Ethernet + +- Non Volatile RAM M48T02/M48T08 + +- Slave I/O: timers, interrupt controllers, Zilog serial ports, + keyboard and power/reset logic + +- ESP SCSI controller with hard disk and CD-ROM support + +- Floppy drive (not on SS-600MP) + +- CS4231 sound device (only on SS-5, not working yet) + +The number of peripherals is fixed in the architecture. Maximum memory +size depends on the machine type, for SS-5 it is 256MB and for others +2047MB. + +Since version 0.8.2, QEMU uses OpenBIOS https://www.openbios.org/. +OpenBIOS is a free (GPL v2) portable firmware implementation. The goal +is to implement a 100% IEEE 1275-1994 (referred to as Open Firmware) +compliant firmware. + +A sample Linux 2.6 series kernel and ram disk image are available on the +QEMU web site. There are still issues with NetBSD and OpenBSD, but most +kernel versions work. Please note that currently older Solaris kernels +don't work probably due to interface issues between OpenBIOS and +Solaris. diff --git a/docs/system/target-sparc64.rst b/docs/system/target-sparc64.rst new file mode 100644 index 000000000..97e334b93 --- /dev/null +++ b/docs/system/target-sparc64.rst @@ -0,0 +1,37 @@ +.. _Sparc64-System-emulator: + +Sparc64 System emulator +----------------------- + +Use the executable ``qemu-system-sparc64`` to simulate a Sun4u +(UltraSPARC PC-like machine), Sun4v (T1 PC-like machine), or generic +Niagara (T1) machine. The Sun4u emulator is mostly complete, being able +to run Linux, NetBSD and OpenBSD in headless (-nographic) mode. The +Sun4v emulator is still a work in progress. + +The Niagara T1 emulator makes use of firmware and OS binaries supplied +in the S10image/ directory of the OpenSPARC T1 project +http://download.oracle.com/technetwork/systems/opensparc/OpenSPARCT1_Arch.1.5.tar.bz2 +and is able to boot the disk.s10hw2 Solaris image. + +:: + + qemu-system-sparc64 -M niagara -L /path-to/S10image/ \ + -nographic -m 256 \ + -drive if=pflash,readonly=on,file=/S10image/disk.s10hw2 + +QEMU emulates the following peripherals: + +- UltraSparc IIi APB PCI Bridge + +- PCI VGA compatible card with VESA Bochs Extensions + +- PS/2 mouse and keyboard + +- Non Volatile RAM M48T59 + +- PC-compatible serial ports + +- 2 PCI IDE interfaces with hard disk and CD-ROM support + +- Floppy disk diff --git a/docs/system/target-xtensa.rst b/docs/system/target-xtensa.rst new file mode 100644 index 000000000..8d703ad76 --- /dev/null +++ b/docs/system/target-xtensa.rst @@ -0,0 +1,27 @@ +.. _Xtensa-System-emulator: + +Xtensa System emulator +---------------------- + +Two executables cover simulation of both Xtensa endian options, +``qemu-system-xtensa`` and ``qemu-system-xtensaeb``. Two different +machine types are emulated: + +- Xtensa emulator pseudo board \"sim\" + +- Avnet LX60/LX110/LX200 board + +The sim pseudo board emulation provides an environment similar to one +provided by the proprietary Tensilica ISS. It supports: + +- A range of Xtensa CPUs, default is the DC232B + +- Console and filesystem access via semihosting calls + +The Avnet LX60/LX110/LX200 emulation supports: + +- A range of Xtensa CPUs, default is the DC232B + +- 16550 UART + +- OpenCores 10/100 Mbps Ethernet MAC diff --git a/docs/system/targets.rst b/docs/system/targets.rst new file mode 100644 index 000000000..9dcd95dd8 --- /dev/null +++ b/docs/system/targets.rst @@ -0,0 +1,30 @@ +.. 
_system-targets-ref:
+
+QEMU System Emulator Targets
+============================
+
+QEMU is a generic emulator and it emulates many machines. Most of the
+options are similar for all machines. Specific information about the
+various targets is provided in the following sections.
+
+Contents:
+
+..
+   This table of contents should be kept sorted alphabetically
+   by the title text of each file, which isn't the same ordering
+   as an alphabetical sort by filename.
+
+.. toctree::
+
+   target-arm
+   target-avr
+   target-m68k
+   target-mips
+   target-ppc
+   target-riscv
+   target-rx
+   target-s390x
+   target-sparc
+   target-sparc64
+   target-i386
+   target-xtensa
diff --git a/docs/system/tls.rst b/docs/system/tls.rst
new file mode 100644
index 000000000..1a0467436
--- /dev/null
+++ b/docs/system/tls.rst
@@ -0,0 +1,328 @@
+.. _network_005ftls:
+
+TLS setup for network services
+------------------------------
+
+Almost all network services in QEMU have the ability to use TLS for
+session data encryption, along with x509 certificates for simple client
+authentication. What follows is a description of how to generate
+certificates suitable for usage with QEMU, and applies to the VNC
+server, character devices with the TCP backend, NBD server and client,
+and migration server and client.
+
+At a high level, QEMU requires certificates and private keys to be
+provided in PEM format. Aside from the core fields, the certificates
+should include various extension data sets, including v3 basic
+constraints data, key purpose, key usage and subject alt name.
+
+The GnuTLS package includes a command called ``certtool`` which can be
+used to easily generate certificates and keys in the required format
+with expected data present. Alternatively a certificate management
+service may be used.
+
+At a minimum it is necessary to set up a certificate authority, and
+issue certificates to each server. If using x509 certificates for
+authentication, then each client will also need to be issued a
+certificate.
+
+Assuming that the QEMU network services will only ever be exposed to
+clients on a private intranet, there is no need to use a commercial
+certificate authority to create certificates. A self-signed CA is
+sufficient, and in fact likely to be more secure since it removes the
+ability of malicious 3rd parties to trick the CA into mis-issuing certs
+for impersonating your services. The only likely exception where a
+commercial CA might be desirable is if enabling the VNC websockets
+server and exposing it directly to remote browser clients. In such a
+case it might be useful to use a commercial CA to avoid needing to
+install custom CA certs in the web browsers.
+
+The recommendation is for the server to keep its certificates in either
+``/etc/pki/qemu`` or, for unprivileged users, in ``$HOME/.pki/qemu``.
+
+.. _tls_005fgenerate_005fca:
+
+Setup the Certificate Authority
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This step only needs to be performed once per organization /
+organizational unit. First the CA needs a private key. This key must be
+kept VERY secret and secure. If this key is compromised the entire trust
+chain of the certificates issued with it is lost.
+
+::
+
+  # certtool --generate-privkey > ca-key.pem
+
+To generate a self-signed certificate requires one core piece of
+information, the name of the organization. A template file ``ca.info``
+should be populated with the desired data to avoid having to deal with
+interactive prompts from certtool::
+
+  # cat > ca.info <<EOF
+  cn = Name of your organization
+  ca
+  cert_signing_key
+  EOF
+  # certtool --generate-self-signed \
+             --load-privkey ca-key.pem \
+             --template ca.info \
+             --outfile ca-cert.pem
+
+The ``ca`` keyword in the template sets the v3 basic constraints
+extension to indicate this certificate is for a CA, while
+``cert_signing_key`` sets the key usage extension to indicate this will
+be used for signing other keys. The generated ``ca-cert.pem`` file
+should be copied to all servers and clients wishing to utilize TLS
+support in the VNC server. The ``ca-key.pem`` must not be
+disclosed/copied anywhere except the host responsible for issuing
+certificates.
+
+.. _tls_005fgenerate_005fserver:
+
+Issuing server certificates
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each server (or host) needs to be issued with a key and certificate.
+When connecting, the certificate is sent to the client which validates it
+against the CA certificate. The core pieces of information for a server
+certificate are the hostnames and/or IP addresses that will be used by
+clients when connecting. The hostname / IP address that the client
+specifies when connecting will be validated against the hostname(s) and
+IP address(es) recorded in the server certificate, and if no match is
+found the client will close the connection.
+
+Thus it is recommended that the server certificate include both the
+fully qualified and unqualified hostnames. If the server will have
+permanently assigned IP address(es), and clients are likely to use them
+when connecting, they may also be included in the certificate. Both IPv4
+and IPv6 addresses are supported. Historically certificates only
+included 1 hostname in the ``CN`` field, however, usage of this field
+for validation is now deprecated. Instead modern TLS clients will
+validate against the Subject Alt Name extension data, which allows for
+multiple entries. In the future usage of the ``CN`` field may be
+discontinued entirely, so providing SAN extension data is strongly
+recommended.
+
+On the host holding the CA, create template files containing the
+information for each server, and use them to issue server certificates.
+
+::
+
+  # cat > server-hostNNN.info <<EOF
+  organization = Name of your organization
+  cn = hostNNN.foo.example.com
+  dns_name = hostNNN
+  dns_name = hostNNN.foo.example.com
+  ip_address = 10.0.1.87
+  ip_address = 192.8.0.92
+  ip_address = 2620:0:cafe::87
+  ip_address = 2001:24::92
+  tls_www_server
+  encryption_key
+  signing_key
+  EOF
+  # certtool --generate-privkey > server-hostNNN-key.pem
+  # certtool --generate-certificate \
+             --load-ca-certificate ca-cert.pem \
+             --load-ca-privkey ca-key.pem \
+             --load-privkey server-hostNNN-key.pem \
+             --template server-hostNNN.info \
+             --outfile server-hostNNN-cert.pem
+
+The ``dns_name`` and ``ip_address`` fields in the template are setting
+the subject alt name extension data. The ``tls_www_server`` keyword is
+the key purpose extension to indicate this certificate is intended for
+usage in a web server. Although QEMU network services are not in fact
+HTTP servers (except for VNC websockets), setting this key purpose is
+still recommended. The ``encryption_key`` and ``signing_key`` keywords
+set the key usage extension to indicate this certificate is intended
+for usage in the data session.
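+
+Before distributing the files it can be worth double-checking that the
+extensions were recorded as intended, for example with::
+
+  # certtool --certificate-info --infile server-hostNNN-cert.pem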
+
+The ``server-hostNNN-key.pem`` and ``server-hostNNN-cert.pem`` files
+should now be securely copied to the server for which they were
+generated, and renamed to ``server-key.pem`` and ``server-cert.pem``
+when added to the ``/etc/pki/qemu`` directory on the target host. The
+``server-key.pem`` file is security sensitive and should be kept
+protected with file mode 0600 to prevent disclosure.
+
+.. _tls_005fgenerate_005fclient:
+
+Issuing client certificates
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The QEMU x509 TLS credential setup defaults to enabling client
+verification using certificates, providing a simple authentication
+mechanism. If this default is used, each client also needs to be issued
+a certificate. The client certificate contains enough metadata to
+uniquely identify the client within the scope of the certificate
+authority. The client certificate would typically include fields for
+organization, state, city, building, etc.
+
+Once again on the host holding the CA, create template files containing
+the information for each client, and use them to issue client
+certificates.
+
+::
+
+  # cat > client-hostNNN.info <<EOF
+  country = GB
+  state = London
+  locality = City Of London
+  organization = Name of your organization
+  cn = hostNNN.foo.example.com
+  tls_www_client
+  encryption_key
+  signing_key
+  EOF
+  # certtool --generate-privkey > client-hostNNN-key.pem
+  # certtool --generate-certificate \
+             --load-ca-certificate ca-cert.pem \
+             --load-ca-privkey ca-key.pem \
+             --load-privkey client-hostNNN-key.pem \
+             --template client-hostNNN.info \
+             --outfile client-hostNNN-cert.pem
+
+The subject alt name extension data is not required for clients, so
+the ``dns_name`` and ``ip_address`` fields are not included. The
+``tls_www_client`` keyword is the key purpose extension to indicate this
+certificate is intended for usage in a web client. Although QEMU network
+clients are not in fact HTTP clients, setting this key purpose is still
+recommended. The ``encryption_key`` and ``signing_key`` keywords set the
+key usage extension to indicate this certificate is intended for usage
+in the data session.
+
+The ``client-hostNNN-key.pem`` and ``client-hostNNN-cert.pem`` files
+should now be securely copied to the client for which they were
+generated, and renamed to ``client-key.pem`` and ``client-cert.pem``
+when added to the ``/etc/pki/qemu`` directory on the target host. The
+``client-key.pem`` file is security sensitive and should be kept
+protected with file mode 0600 to prevent disclosure.
+
+If a single host is going to be using TLS in both a client and server
+role, it is possible to create a single certificate to cover both roles.
+This would be quite common for the migration and NBD services, where a
+QEMU process will be started by accepting a TLS protected incoming
+migration, and later itself be migrated out to another host. To generate
+a single certificate, simply include the template data from both the
+client and server instructions in one.
+
+::
+
+  # cat > both-hostNNN.info <<EOF
+  country = GB
+  state = London
+  locality = City Of London
+  organization = Name of your organization
+  cn = hostNNN.foo.example.com
+  dns_name = hostNNN
+  dns_name = hostNNN.foo.example.com
+  ip_address = 10.0.1.87
+  ip_address = 192.8.0.92
+  ip_address = 2620:0:cafe::87
+  ip_address = 2001:24::92
+  tls_www_server
+  tls_www_client
+  encryption_key
+  signing_key
+  EOF
+  # certtool --generate-privkey > both-hostNNN-key.pem
+  # certtool --generate-certificate \
+             --load-ca-certificate ca-cert.pem \
+             --load-ca-privkey ca-key.pem \
+             --load-privkey both-hostNNN-key.pem \
+             --template both-hostNNN.info \
+             --outfile both-hostNNN-cert.pem
+
+When copying the PEM files to the target host, save them twice, once as
+``server-cert.pem`` and ``server-key.pem``, and again as
+``client-cert.pem`` and ``client-key.pem``.
+
+.. _tls_005fcreds_005fsetup:
+
+TLS x509 credential configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU has a standard mechanism for loading x509 credentials that will be
+used for network services and clients. It requires specifying the
+``tls-creds-x509`` class name to the ``--object`` command line argument
+for the system emulators. Each set of credentials loaded should be given
+a unique string identifier via the ``id`` parameter. A single set of TLS
+credentials can be used for multiple network backends, so VNC,
+migration, NBD, and character devices can all share the same credentials.
+Note, however, that credentials for use in a client endpoint must be
+loaded separately from those used in a server endpoint.
+
+When specifying the object, the ``dir`` parameter specifies which
+directory contains the credential files. This directory is expected to
+contain files with the names mentioned previously: ``ca-cert.pem``,
+``server-key.pem``, ``server-cert.pem``, ``client-key.pem`` and
+``client-cert.pem`` as appropriate. It is also possible to include a set
+of pre-generated Diffie-Hellman (DH) parameters in a file
+``dh-params.pem``, which can be created using the
+``certtool --generate-dh-params`` command. If omitted, QEMU will
+dynamically generate DH parameters when loading the credentials.
+
+The ``endpoint`` parameter indicates whether the credentials will be
+used for a network client or server, and determines which PEM files are
+loaded.
+
+The ``verify`` parameter determines whether x509 certificate validation
+should be performed. This defaults to enabled, meaning clients will
+always validate the server hostname against the certificate subject alt
+name fields and/or CN field. It also means that servers will request
+that clients provide a certificate and validate them. Verification
+should never be turned off for client endpoints; however, it may be
+turned off for server endpoints if an alternative mechanism is used to
+authenticate clients. For example, the VNC server can use SASL to
+authenticate clients instead.
+
+To load server credentials with client certificate validation enabled:
+
+.. parsed-literal::
+
+   |qemu_system| -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server
+
+while to load client credentials use:
+
+.. parsed-literal::
+
+   |qemu_system| -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=client
+
+Network services which support TLS will all have a ``tls-creds``
+parameter which expects the ID of the TLS credentials object. For
+example with VNC:
+
+.. parsed-literal::
+
+   |qemu_system| -vnc 0.0.0.0:0,tls-creds=tls0
+
+.. _tls_005fpsk:
+
+TLS Pre-Shared Keys (PSK)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Instead of using certificates, you may also use TLS Pre-Shared Keys
+(TLS-PSK). This can be simpler to set up than certificates but is less
+scalable.
+
+Use the GnuTLS ``psktool`` program to generate a ``keys.psk`` file
+containing one or more usernames and random keys::
+
+  mkdir -m 0700 /tmp/keys
+  psktool -u rich -p /tmp/keys/keys.psk
+
+TLS-enabled servers such as ``qemu-nbd`` can use this directory like so::
+
+  qemu-nbd \
+    -t -x / \
+    --object tls-creds-psk,id=tls0,endpoint=server,dir=/tmp/keys \
+    --tls-creds tls0 \
+    image.qcow2
+
+When connecting from a qemu-based client you must specify the directory
+containing ``keys.psk`` and an optional username (defaults to "qemu")::
+
+  qemu-img info \
+    --object tls-creds-psk,id=tls0,dir=/tmp/keys,username=rich,endpoint=client \
+    --image-opts \
+    file.driver=nbd,file.host=localhost,file.port=10809,file.tls-creds=tls0,file.export=/
diff --git a/docs/system/virtio-net-failover.rst b/docs/system/virtio-net-failover.rst
new file mode 100644
index 000000000..6002dc5d9
--- /dev/null
+++ b/docs/system/virtio-net-failover.rst
@@ -0,0 +1,68 @@
+======================================
+QEMU virtio-net standby (net_failover)
+======================================
+
+This document explains the setup and usage of the virtio-net standby
+feature, which is used to create a net_failover pair of devices.
+
+The general idea is that we have a pair of devices, a (vfio-)pci and a
+virtio-net device. Before migration the vfio device is unplugged and data
+flows through the virtio-net device; on the target side another vfio-pci
+device is plugged in to take over the data path. In the guest the
+net_failover kernel module will pair net devices with the same MAC
+address.
+
+The two devices are called primary and standby device. The fast hardware
+based networking device is called the primary device and the virtio-net
+device is the standby device.
+
+Restrictions
+------------
+
+Currently only PCIe devices are allowed as primary devices; this
+restriction can be lifted in the future with enhanced QEMU support. Also,
+only networking devices are allowed as primary devices. The user needs to
+ensure that primary and standby devices are not plugged into the same
+PCIe slot.
+
+Usecase
+-------
+
+ Virtio-net standby allows easy migration while using a passed-through
+ fast networking device by falling back to a virtio-net device for the
+ duration of the migration. It is like a simple version of a bond; the
+ difference is that it requires no configuration in the guest. When a
+ guest is live-migrated to another host QEMU will unplug the primary
+ device via the PCIe based hotplug handler and traffic will go through
+ the virtio-net device. On the target system the primary device will be
+ automatically plugged back and the net_failover module registers it
+ again as the primary device.
+
+Usage
+-----
+
+ The primary device can be hotplugged or be part of the startup
+ configuration::
+
+  -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc, \
+          bus=root2,failover=on
+
+ With the parameter failover=on the VIRTIO_NET_F_STANDBY feature will be
+ enabled::
+
+  -device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,failover_pair_id=net1
+
+ failover_pair_id references the id of the virtio-net standby device.
+ This is only for pairing the devices within QEMU. The guest kernel
+ module net_failover will match devices with identical MAC addresses.
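+
+ The same pairing can also be set up at run time via the QMP
+ ``device_add`` command; as a sketch (reusing the IDs, bus names and
+ host PCI address from the examples above)::
+
+  { "execute": "device_add",
+    "arguments": { "driver": "vfio-pci", "host": "5e:00.2",
+                   "id": "hostdev0", "bus": "root1",
+                   "failover_pair_id": "net1" } }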
+ +Hotplug +------- + + Both primary and standby device can be hotplugged via the QEMU monitor. Note + that if the virtio-net device is plugged first a warning will be issued that it + couldn't find the primary device. + +Migration +--------- + + A new migration state wait-unplug was added for this feature. If failover primary + devices are present in the configuration, migration will go into this state. + It will wait until the device unplug is completed in the guest and then move into + active state. On the target system the primary devices will be automatically hotplugged + when the feature bit was negotiated for the virtio-net standby device. diff --git a/docs/system/vnc-security.rst b/docs/system/vnc-security.rst new file mode 100644 index 000000000..4c1769eeb --- /dev/null +++ b/docs/system/vnc-security.rst @@ -0,0 +1,203 @@ +.. _VNC security: + +VNC security +------------ + +The VNC server capability provides access to the graphical console of +the guest VM across the network. This has a number of security +considerations depending on the deployment scenarios. + +.. _vnc_005fsec_005fnone: + +Without passwords +~~~~~~~~~~~~~~~~~ + +The simplest VNC server setup does not include any form of +authentication. For this setup it is recommended to restrict it to +listen on a UNIX domain socket only. For example + +.. parsed-literal:: + + |qemu_system| [...OPTIONS...] -vnc unix:/home/joebloggs/.qemu-myvm-vnc + +This ensures that only users on local box with read/write access to that +path can access the VNC server. To securely access the VNC server from a +remote machine, a combination of netcat+ssh can be used to provide a +secure tunnel. + +.. _vnc_005fsec_005fpassword: + +With passwords +~~~~~~~~~~~~~~ + +The VNC protocol has limited support for password based authentication. +Since the protocol limits passwords to 8 characters it should not be +considered to provide high security. The password can be fairly easily +brute-forced by a client making repeat connections. For this reason, a +VNC server using password authentication should be restricted to only +listen on the loopback interface or UNIX domain sockets. Password +authentication is not supported when operating in FIPS 140-2 compliance +mode as it requires the use of the DES cipher. Password authentication +is requested with the ``password`` option, and then once QEMU is running +the password is set with the monitor. Until the monitor is used to set +the password all clients will be rejected. + +.. parsed-literal:: + + |qemu_system| [...OPTIONS...] -vnc :1,password=on -monitor stdio + (qemu) change vnc password + Password: ******** + (qemu) + +.. _vnc_005fsec_005fcertificate: + +With x509 certificates +~~~~~~~~~~~~~~~~~~~~~~ + +The QEMU VNC server also implements the VeNCrypt extension allowing use +of TLS for encryption of the session, and x509 certificates for +authentication. The use of x509 certificates is strongly recommended, +because TLS on its own is susceptible to man-in-the-middle attacks. +Basic x509 certificate support provides a secure session, but no +authentication. This allows any client to connect, and provides an +encrypted session. + +.. parsed-literal:: + + |qemu_system| [...OPTIONS...] \ + -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=off \ + -vnc :1,tls-creds=tls0 -monitor stdio + +In the above example ``/etc/pki/qemu`` should contain at least three +files, ``ca-cert.pem``, ``server-cert.pem`` and ``server-key.pem``. 
+Unprivileged users will want to use a private directory, for example
+``$HOME/.pki/qemu``. NB the ``server-key.pem`` file should be protected
+with file mode 0600 to only be readable by the user owning it.
+
+.. _vnc_005fsec_005fcertificate_005fverify:
+
+With x509 certificates and client verification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Certificates can also provide a means to authenticate the client
+connecting. The server will request that the client provide a
+certificate, which it will then validate against the CA certificate.
+This is a good choice if deploying in an environment with a private
+internal certificate authority. It uses the same syntax as previously,
+but with ``verify-peer`` set to ``on`` instead.
+
+.. parsed-literal::
+
+   |qemu_system| [...OPTIONS...] \
+     -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
+     -vnc :1,tls-creds=tls0 -monitor stdio
+
+.. _vnc_005fsec_005fcertificate_005fpw:
+
+With x509 certificates, client verification and passwords
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Finally, the previous method can be combined with VNC password
+authentication to provide two layers of authentication for clients.
+
+.. parsed-literal::
+
+   |qemu_system| [...OPTIONS...] \
+     -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
+     -vnc :1,tls-creds=tls0,password=on -monitor stdio
+   (qemu) change vnc password
+   Password: ********
+   (qemu)
+
+.. _vnc_005fsec_005fsasl:
+
+With SASL authentication
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The SASL authentication method is a VNC extension that provides an
+easily extendable, pluggable authentication method. This allows for
+integration with a wide range of authentication mechanisms, such as PAM,
+GSSAPI/Kerberos, LDAP, SQL databases, one-time keys and more. The
+strength of the authentication depends on the exact mechanism
+configured. If the chosen mechanism also provides a SSF layer, then it
+will encrypt the datastream as well.
+
+Refer to the later docs on how to choose the exact SASL mechanism used
+for authentication, but assuming use of one supporting SSF, then QEMU
+can be launched with:
+
+.. parsed-literal::
+
+   |qemu_system| [...OPTIONS...] -vnc :1,sasl=on -monitor stdio
+
+.. _vnc_005fsec_005fcertificate_005fsasl:
+
+With x509 certificates and SASL authentication
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the desired SASL authentication mechanism does not support SSF
+layers, then it is strongly advised to run it in combination with TLS
+and x509 certificates. This provides a securely encrypted data stream,
+avoiding the risk of compromising the security credentials. This can be
+enabled by combining the 'sasl' option with the aforementioned TLS +
+x509 options:
+
+.. parsed-literal::
+
+   |qemu_system| [...OPTIONS...] \
+     -object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=server,verify-peer=on \
+     -vnc :1,tls-creds=tls0,sasl=on -monitor stdio
+
+.. _vnc_005fsetup_005fsasl:
+
+Configuring SASL mechanisms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following documentation assumes use of the Cyrus SASL implementation
+on a Linux host, but the principles should apply to any other SASL
+implementation or host. When SASL is enabled, the mechanism
+configuration will be loaded from the system default SASL service config
+/etc/sasl2/qemu.conf. If running QEMU as an unprivileged user, an
+environment variable SASL_CONF_PATH can be used to make it search
+alternate locations for the service config file.
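+
+For example, an unprivileged user could keep a private service config in
+``$HOME/.sasl2`` (the path shown is just an illustration) and launch QEMU
+with:
+
+.. parsed-literal::
+
+   SASL_CONF_PATH=$HOME/.sasl2 |qemu_system| [...OPTIONS...] -vnc :1,sasl=on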
+ +If the TLS option is enabled for VNC, then it will provide session +encryption, otherwise the SASL mechanism will have to provide +encryption. In the latter case the list of possible plugins that can be +used is drastically reduced. In fact only the GSSAPI SASL mechanism +provides an acceptable level of security by modern standards. Previous +versions of QEMU referred to the DIGEST-MD5 mechanism, however, it has +multiple serious flaws described in detail in RFC 6331 and thus should +never be used any more. The SCRAM-SHA-256 mechanism provides a simple +username/password auth facility similar to DIGEST-MD5, but does not +support session encryption, so can only be used in combination with TLS. + +When not using TLS the recommended configuration is + +:: + + mech_list: gssapi + keytab: /etc/qemu/krb5.tab + +This says to use the 'GSSAPI' mechanism with the Kerberos v5 protocol, +with the server principal stored in /etc/qemu/krb5.tab. For this to work +the administrator of your KDC must generate a Kerberos principal for the +server, with a name of 'qemu/somehost.example.com@EXAMPLE.COM' replacing +'somehost.example.com' with the fully qualified host name of the machine +running QEMU, and 'EXAMPLE.COM' with the Kerberos Realm. + +When using TLS, if username+password authentication is desired, then a +reasonable configuration is + +:: + + mech_list: scram-sha-256 + sasldb_path: /etc/qemu/passwd.db + +The ``saslpasswd2`` program can be used to populate the ``passwd.db`` +file with accounts. Note that the ``passwd.db`` file stores passwords +in clear text. + +Other SASL configurations will be left as an exercise for the reader. +Note that all mechanisms, except GSSAPI, should be combined with use of +TLS to ensure a secure data channel. diff --git a/docs/throttle.txt b/docs/throttle.txt new file mode 100644 index 000000000..0a0453a5e --- /dev/null +++ b/docs/throttle.txt @@ -0,0 +1,359 @@ +The QEMU throttling infrastructure +================================== +Copyright (C) 2016,2020 Igalia, S.L. +Author: Alberto Garcia <berto@igalia.com> + +This work is licensed under the terms of the GNU GPL, version 2 or +later. See the COPYING file in the top-level directory. + +Introduction +------------ +QEMU includes a throttling module that can be used to set limits to +I/O operations. The code itself is generic and independent of the I/O +units, but it is currently used to limit the number of bytes per second +and operations per second (IOPS) when performing disk I/O. + +This document explains how to use the throttling code in QEMU, and how +it works internally. The implementation is in throttle.c. + + +Using throttling to limit disk I/O +---------------------------------- +Two aspects of the disk I/O can be limited: the number of bytes per +second and the number of operations per second (IOPS). For each one of +them the user can set a global limit or separate limits for read and +write operations. This gives us a total of six different parameters. + +I/O limits can be set using the throttling.* parameters of -drive, or +using the QMP 'block_set_io_throttle' command. 
the parameters for both cases:
+
+|-----------------------+-----------------------|
+| -drive                | block_set_io_throttle |
+|-----------------------+-----------------------|
+| throttling.iops-total | iops                  |
+| throttling.iops-read  | iops_rd               |
+| throttling.iops-write | iops_wr               |
+| throttling.bps-total  | bps                   |
+| throttling.bps-read   | bps_rd                |
+| throttling.bps-write  | bps_wr                |
+|-----------------------+-----------------------|
+
+It is possible to set limits for both IOPS and bps at the same time,
+and for each case we can decide whether to have separate read and
+write limits or not, but note that if iops-total is set then neither
+iops-read nor iops-write can be set. The same applies to bps-total and
+bps-read/write.
+
+The default value of these parameters is 0, and it means 'unlimited'.
+
+In its most basic usage, the user can add a drive to QEMU with a limit
+of 100 IOPS with the following -drive line:
+
+  -drive file=hd0.qcow2,throttling.iops-total=100
+
+We can do the same using QMP. In this case all these parameters are
+mandatory, so we must set to 0 the ones that we don't want to limit:
+
+  { "execute": "block_set_io_throttle",
+    "arguments": {
+      "device": "virtio0",
+      "iops": 100,
+      "iops_rd": 0,
+      "iops_wr": 0,
+      "bps": 0,
+      "bps_rd": 0,
+      "bps_wr": 0
+    }
+  }
+
+
+I/O bursts
+----------
+In addition to the basic limits we have just seen, QEMU allows the
+user to do bursts of I/O for a configurable amount of time. A burst is
+an amount of I/O that can exceed the basic limit. Bursts are useful to
+allow better performance when there are peaks of activity (the OS
+boots, a service needs to be restarted) while keeping the average
+limits lower the rest of the time.
+
+Two parameters control bursts: their length and the maximum amount of
+I/O they allow. These two can be configured separately for each one of
+the six basic parameters described in the previous section, but in
+this section we'll use 'iops-total' as an example.
+
+The I/O limit during bursts is set using 'iops-total-max', and the
+maximum length (in seconds) is set with 'iops-total-max-length'. So if
+we want to configure a drive with a basic limit of 100 IOPS and allow
+bursts of 2000 IOPS for 60 seconds, we would do it like this (the line
+is split for clarity):
+
+  -drive file=hd0.qcow2,
+         throttling.iops-total=100,
+         throttling.iops-total-max=2000,
+         throttling.iops-total-max-length=60
+
+Or, with QMP:
+
+  { "execute": "block_set_io_throttle",
+    "arguments": {
+      "device": "virtio0",
+      "iops": 100,
+      "iops_rd": 0,
+      "iops_wr": 0,
+      "bps": 0,
+      "bps_rd": 0,
+      "bps_wr": 0,
+      "iops_max": 2000,
+      "iops_max_length": 60
+    }
+  }
+
+With this, the user can perform I/O on hd0.qcow2 at a rate of 2000
+IOPS for 1 minute before it's throttled down to 100 IOPS.
+
+The user will be able to do bursts again if there's a sufficiently
+long period of time with unused I/O (see below for details).
+
+The default value for 'iops-total-max' is 0 and it means that bursts
+are not allowed. 'iops-total-max-length' can only be set if
+'iops-total-max' is set as well, and its default value is 1 second.
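+
+Bursts can likewise be configured separately for reads and writes; for
+example, to allow write bursts only (the values here are illustrative):
+
+  -drive file=hd0.qcow2,
+         throttling.iops-write=100,
+         throttling.iops-write-max=1000,
+         throttling.iops-write-max-length=30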
+
+Here's the complete list of parameters for configuring bursts:
+
+|----------------------------------+-----------------------|
+| -drive                           | block_set_io_throttle |
+|----------------------------------+-----------------------|
+| throttling.iops-total-max        | iops_max              |
+| throttling.iops-total-max-length | iops_max_length       |
+| throttling.iops-read-max         | iops_rd_max           |
+| throttling.iops-read-max-length  | iops_rd_max_length    |
+| throttling.iops-write-max        | iops_wr_max           |
+| throttling.iops-write-max-length | iops_wr_max_length    |
+| throttling.bps-total-max         | bps_max               |
+| throttling.bps-total-max-length  | bps_max_length        |
+| throttling.bps-read-max          | bps_rd_max            |
+| throttling.bps-read-max-length   | bps_rd_max_length     |
+| throttling.bps-write-max         | bps_wr_max            |
+| throttling.bps-write-max-length  | bps_wr_max_length     |
+|----------------------------------+-----------------------|
+
+
+Controlling the size of I/O operations
+--------------------------------------
+When applying IOPS limits all I/O operations are treated equally
+regardless of their size. This means that the user can take advantage
+of this in order to circumvent the limits and submit one huge I/O
+request instead of several smaller ones.
+
+QEMU provides a setting called throttling.iops-size to prevent this
+from happening. This setting specifies the size (in bytes) of an I/O
+request for accounting purposes. Larger requests will be counted
+proportionally to this size.
+
+For example, if iops-size is set to 4096 then an 8KB request will be
+counted as two, and a 6KB request will be counted as one and a
+half. This only applies to requests larger than iops-size: smaller
+requests will always be counted as one, no matter their size.
+
+The default value of iops-size is 0 and it means that the size of the
+requests is never taken into account when applying IOPS limits.
+
+
+Applying I/O limits to groups of disks
+--------------------------------------
+In all the examples so far we have seen how to apply limits to the I/O
+performed on individual drives, but QEMU allows grouping drives so
+they all share the same limits.
+
+The way it works is that each drive with I/O limits is assigned to a
+group named using the throttling.group parameter. If this parameter is
+not specified, then the device name (e.g. 'virtio0', 'ide0-hd0') will
+be used as the group name.
+
+Limits set using the throttling.* parameters discussed earlier in this
+document apply to the combined I/O of all members of a group.
+
+Consider this example:
+
+  -drive file=hd1.qcow2,throttling.iops-total=6000,throttling.group=foo
+  -drive file=hd2.qcow2,throttling.iops-total=6000,throttling.group=foo
+  -drive file=hd3.qcow2,throttling.iops-total=3000,throttling.group=bar
+  -drive file=hd4.qcow2,throttling.iops-total=6000,throttling.group=foo
+  -drive file=hd5.qcow2,throttling.iops-total=3000,throttling.group=bar
+  -drive file=hd6.qcow2,throttling.iops-total=5000
+
+Here hd1, hd2 and hd4 are all members of a group named 'foo' with a
+combined IOPS limit of 6000, and hd3 and hd5 are members of 'bar'. hd6
+is left alone (technically it is part of a 1-member group).
+
+Limits are applied in a round-robin fashion so if there are concurrent
+I/O requests on several drives of the same group they will be
+distributed evenly.
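+
+For example, a drive could be moved into the 'bar' group at run time,
+updating that group's limits in the process, by using the optional
+'group' parameter of 'block_set_io_throttle' (the device name here is
+just an illustration); the exact semantics are described below:
+
+  { "execute": "block_set_io_throttle",
+    "arguments": {
+      "device": "virtio0",
+      "group": "bar",
+      "iops": 3000,
+      "iops_rd": 0,
+      "iops_wr": 0,
+      "bps": 0,
+      "bps_rd": 0,
+      "bps_wr": 0
+    }
+  }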
+
+When I/O limits are applied to an existing drive using the QMP command
+'block_set_io_throttle', the following things need to be taken into
+account:
+
+  - I/O limits are shared within the same group, so new values will
+    affect all members and overwrite the previous settings. In other
+    words: if different limits are applied to members of the same
+    group, the last one wins.
+
+  - If 'group' is unset it is assumed to be the current group of that
+    drive. If the drive is not in a group yet, it will be added to a
+    group named after the device name.
+
+  - If 'group' is set then the drive will be moved to that group if
+    it was a member of a different one. In this case the limits
+    specified in the parameters will be applied to the new group
+    only.
+
+  - I/O limits can be disabled by setting all of them to 0. In this
+    case the device will be removed from its group and the rest of
+    its members will not be affected. The 'group' parameter is
+    ignored.
+
+
+The Leaky Bucket algorithm
+--------------------------
+I/O limits in QEMU are implemented using the leaky bucket algorithm
+(specifically the "Leaky bucket as a meter" variant).
+
+This algorithm uses the analogy of a bucket that leaks water
+constantly. The water that gets into the bucket represents the I/O
+that has been performed, and no more I/O is allowed once the bucket is
+full.
+
+To see the way this corresponds to the throttling parameters in QEMU,
+consider the following values:
+
+  iops-total=100
+  iops-total-max=2000
+  iops-total-max-length=60
+
+  - Water leaks from the bucket at a rate of 100 IOPS.
+  - Water can be added to the bucket at a rate of 2000 IOPS.
+  - The size of the bucket is 2000 x 60 = 120000.
+  - If 'iops-total-max-length' is unset then it defaults to 1 and the
+    size of the bucket is 2000.
+  - If 'iops-total-max' is unset then 'iops-total-max-length' must be
+    unset as well. In this case the bucket size is 100.
+
+The bucket is initially empty, therefore water can be added until it's
+full at a rate of 2000 IOPS (the burst rate). Once the bucket is full
+we can only add as much water as it leaks, therefore the I/O rate is
+reduced to 100 IOPS. If we add less water than it leaks then the
+bucket will start to empty, allowing for bursts again.
+
+Note that since water is leaking from the bucket even during bursts,
+it will take a bit more than 60 seconds at 2000 IOPS to fill it
+up. After those 60 seconds the bucket will have leaked 60 x 100 =
+6000, allowing for 3 more seconds of I/O at 2000 IOPS.
+
+Also, due to the way the algorithm works, longer bursts can be done at
+a lower I/O rate, e.g. 1000 IOPS during 120 seconds.
+
+
+The 'throttle' block filter
+---------------------------
+Since QEMU 2.11 it is possible to configure the I/O limits using a
+'throttle' block filter. This filter uses the exact same throttling
+infrastructure described above but can be used anywhere in the node
+graph, allowing for more flexibility.
+
+The user can create an arbitrary number of filters and each one of
+them must be assigned to a group that contains the actual I/O limits.
+Different filters can use the same group so the limits are shared as
+described earlier in "Applying I/O limits to groups of disks".
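+
+Before moving on to the filter examples, here is a minimal Python sketch
+of the "leaky bucket as a meter" variant described above, to make the
+arithmetic concrete (an illustration only, not QEMU's actual code):
+
+  # Bucket size = burst rate x burst length; leaks at the basic rate.
+  class LeakyBucket:
+      def __init__(self, rate, burst_rate, burst_length):
+          self.rate = rate                  # basic limit, e.g. iops-total
+          self.capacity = burst_rate * burst_length
+          self.level = 0.0                  # current amount of "water"
+          self.last = 0.0                   # time of the last update
+
+      def allow(self, now, cost=1.0):
+          # Leak water for the elapsed time, then try to add this request.
+          self.level = max(0.0, self.level - (now - self.last) * self.rate)
+          self.last = now
+          if self.level + cost > self.capacity:
+              return False                  # bucket full: throttle
+          self.level += cost
+          return True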
+
+A group can be created using the object-add QMP command:
+
+  { "execute": "object-add",
+    "arguments": {
+      "qom-type": "throttle-group",
+      "id": "group0",
+      "limits": {
+        "iops-total": 1000,
+        "bps-write": 2097152
+      }
+    }
+  }
+
+throttle-group has a 'limits' property (of type ThrottleLimits as
+defined in qapi/block-core.json) which can be set on creation or later
+with 'qom-set'.
+
+A throttle-group can also be created with the -object command line
+option but at the moment there is no way to pass a 'limits' parameter
+that contains a ThrottleLimits structure. The solution is to set the
+individual values directly, like in this example:
+
+  -object throttle-group,id=group0,x-iops-total=1000,x-bps-write=2097152
+
+Note however that this is not a stable API (hence the 'x-' prefixes) and
+will disappear when -object gains support for structured options and
+enables use of 'limits'.
+
+Once we have a throttle-group we can use the throttle block filter,
+where the 'file' property must be set to the block device that we want
+to filter:
+
+  { "execute": "blockdev-add",
+    "arguments": {
+      "driver": "qcow2",
+      "node-name": "disk0",
+      "file": {
+        "driver": "file",
+        "filename": "/path/to/disk.qcow2"
+      }
+    }
+  }
+
+  { "execute": "blockdev-add",
+    "arguments": {
+      "driver": "throttle",
+      "node-name": "throttle0",
+      "throttle-group": "group0",
+      "file": "disk0"
+    }
+  }
+
+A similar setup can also be done with the command line, for example:
+
+  -drive driver=throttle,throttle-group=group0,
+         file.driver=qcow2,file.file.filename=/path/to/disk.qcow2
+
+The scenario described so far is very simple but the throttle block
+filter allows for more complex configurations. For example, let's say
+that we have three different drives and we want to set I/O limits for
+each one of them and an additional set of limits for the combined I/O
+of all three drives.
+
+First we would define all the throttle groups, one for each one of the
+drives and one that would apply to all of them:
+
+  -object throttle-group,id=limits0,x-iops-total=2000
+  -object throttle-group,id=limits1,x-iops-total=2500
+  -object throttle-group,id=limits2,x-iops-total=3000
+  -object throttle-group,id=limits012,x-iops-total=4000
+
+Now we can define the drives, and for each one of them we use two
+chained throttle filters: the drive's own filter and the combined
+filter.
+
+  -drive driver=throttle,throttle-group=limits012,
+         file.driver=throttle,file.throttle-group=limits0,
+         file.file.driver=qcow2,file.file.file.filename=/path/to/disk0.qcow2
+  -drive driver=throttle,throttle-group=limits012,
+         file.driver=throttle,file.throttle-group=limits1,
+         file.file.driver=qcow2,file.file.file.filename=/path/to/disk1.qcow2
+  -drive driver=throttle,throttle-group=limits012,
+         file.driver=throttle,file.throttle-group=limits2,
+         file.file.driver=qcow2,file.file.file.filename=/path/to/disk2.qcow2
+
+In this example the individual drives have IOPS limits of 2000, 2500
+and 3000 respectively but the total combined I/O can never exceed 4000
+IOPS.
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
new file mode 100644
index 000000000..1edd5a805
--- /dev/null
+++ b/docs/tools/index.rst
@@ -0,0 +1,17 @@
+-----
+Tools
+-----
+
+This section of the manual documents QEMU's "tools": its
+command line utilities and other standalone programs.
+
+.. toctree::
+   :maxdepth: 2
+
+   qemu-img
+   qemu-storage-daemon
+   qemu-nbd
+   qemu-pr-helper
+   qemu-trace-stap
+   virtfs-proxy-helper
+   virtiofsd
diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
new file mode 100644
index 000000000..d663dd92b
--- /dev/null
+++ b/docs/tools/qemu-img.rst
@@ -0,0 +1,921 @@
+=======================
+QEMU disk image utility
+=======================
+
+Synopsis
+--------
+
+**qemu-img** [*standard options*] *command* [*command options*]
+
+Description
+-----------
+
+qemu-img allows you to create, convert and modify images offline. It can handle
+all image formats supported by QEMU.
+
+**Warning:** Never use qemu-img to modify images in use by a running virtual
+machine or any other process; this may destroy the image. Also, be aware that
+querying an image that is being modified by another process may encounter
+inconsistent state.
+
+Options
+-------
+
+.. program:: qemu-img
+
+Standard options:
+
+.. option:: -h, --help
+
+  Display this help and exit
+
+.. option:: -V, --version
+
+  Display version information and exit
+
+.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE]
+
+  .. include:: ../qemu-option-trace.rst.inc
+
+The following commands are supported:
+
+.. hxtool-doc:: qemu-img-cmds.hx
+
+Command parameters:
+
+*FILENAME* is a disk image filename.
+
+*FMT* is the disk image format. It is guessed automatically in most
+cases. See below for a description of the supported disk formats.
+
+*SIZE* is the disk image size in bytes. Optional suffixes ``k`` or
+``K`` (kilobyte, 1024), ``M`` (megabyte, 1024k), ``G`` (gigabyte,
+1024M) and ``T`` (terabyte, 1024G) are supported. ``b`` is ignored.
+
+*OUTPUT_FILENAME* is the destination disk image filename.
+
+*OUTPUT_FMT* is the destination format.
+
+*OPTIONS* is a comma separated list of format specific options in a
+name=value format. Use ``-o ?`` for an overview of the options supported
+by the used format or see the format descriptions below for details.
+
+*SNAPSHOT_PARAM* is the parameter used to select an internal snapshot;
+its format is 'snapshot.id=[ID],snapshot.name=[NAME]' or '[ID_OR_NAME]'.
+
+..
+  Note the use of a new 'program'; otherwise Sphinx complains about
+  the -h option appearing both in the above option list and this one.
+
+.. program:: qemu-img-common-opts
+
+.. option:: --object OBJECTDEF
+
+  is a QEMU user creatable object definition. See the :manpage:`qemu(1)`
+  manual page for a description of the object properties. The most common
+  object type is a ``secret``, which is used to supply passwords and/or
+  encryption keys.
+
+.. option:: --image-opts
+
+  Indicates that the source *FILENAME* parameter is to be interpreted as a
+  full option string, not a plain filename. This parameter is mutually
+  exclusive with the *-f* parameter.
+
+.. option:: --target-image-opts
+
+  Indicates that the *OUTPUT_FILENAME* parameter(s) are to be interpreted as
+  a full option string, not a plain filename. This parameter is mutually
+  exclusive with the *-O* parameter. It is currently required to also use
+  the *-n* parameter to skip image creation. This restriction may be relaxed
+  in a future release.
+
+.. option:: --force-share (-U)
+
+  If specified, ``qemu-img`` will open the image in shared mode, allowing
+  other QEMU processes to open it in write mode. For example, this can be used to
+  get the image information (with 'info' subcommand) when the image is used by a
+  running guest. Note that this could produce inconsistent results because of
+  concurrent metadata changes, etc. 
This option is only allowed when opening + images in read-only mode. + +.. option:: --backing-chain + + Will enumerate information about backing files in a disk image chain. Refer + below for further description. + +.. option:: -c + + Indicates that target image must be compressed (qcow format only). + +.. option:: -h + + With or without a command, shows help and lists the supported formats. + +.. option:: -p + + Display progress bar (compare, convert and rebase commands only). + If the *-p* option is not used for a command that supports it, the + progress is reported when the process receives a ``SIGUSR1`` or + ``SIGINFO`` signal. + +.. option:: -q + + Quiet mode - do not print any output (except errors). There's no progress bar + in case both *-q* and *-p* options are used. + +.. option:: -S SIZE + + Indicates the consecutive number of bytes that must contain only zeros + for ``qemu-img`` to create a sparse image during conversion. This value is + rounded down to the nearest 512 bytes. You may use the common size suffixes + like ``k`` for kilobytes. + +.. option:: -t CACHE + + Specifies the cache mode that should be used with the (destination) file. See + the documentation of the emulator's ``-drive cache=...`` option for allowed + values. + +.. option:: -T SRC_CACHE + + Specifies the cache mode that should be used with the source file(s). See + the documentation of the emulator's ``-drive cache=...`` option for allowed + values. + +Parameters to compare subcommand: + +.. program:: qemu-img-compare + +.. option:: -f + + First image format + +.. option:: -F + + Second image format + +.. option:: -s + + Strict mode - fail on different image size or sector allocation + +Parameters to convert subcommand: + +.. program:: qemu-img-convert + +.. option:: --bitmaps + + Additionally copy all persistent bitmaps from the top layer of the source + +.. option:: -n + + Skip the creation of the target volume + +.. option:: -m + + Number of parallel coroutines for the convert process + +.. option:: -W + + Allow out-of-order writes to the destination. This option improves performance, + but is only recommended for preallocated devices like host devices or other + raw block devices. + +.. option:: -C + + Try to use copy offloading to move data from source image to target. This may + improve performance if the data is remote, such as with NFS or iSCSI backends, + but will not automatically sparsify zero sectors, and may result in a fully + allocated target image depending on the host support for getting allocation + information. + +.. option:: -r + + Rate limit for the convert process + +.. option:: --salvage + + Try to ignore I/O errors when reading. Unless in quiet mode (``-q``), errors + will still be printed. Areas that cannot be read from the source will be + treated as containing only zeroes. + +.. option:: --target-is-zero + + Assume that reading the destination image will always return + zeros. This parameter is mutually exclusive with a destination image + that has a backing file. It is required to also use the ``-n`` + parameter to skip image creation. + +Parameters to dd subcommand: + +.. program:: qemu-img-dd + +.. option:: bs=BLOCK_SIZE + + Defines the block size + +.. option:: count=BLOCKS + + Sets the number of input blocks to copy + +.. option:: if=INPUT + + Sets the input file + +.. option:: of=OUTPUT + + Sets the output file + +.. option:: skip=BLOCKS + + Sets the number of input blocks to skip + +Parameters to snapshot subcommand: + +.. program:: qemu-img-snapshot + +.. 
option:: snapshot
+
+  Is the name of the snapshot to create, apply or delete
+
+.. option:: -a
+
+  Applies a snapshot (revert disk to saved state)
+
+.. option:: -c
+
+  Creates a snapshot
+
+.. option:: -d
+
+  Deletes a snapshot
+
+.. option:: -l
+
+  Lists all snapshots in the given image
+
+Command description:
+
+.. program:: qemu-img-commands
+
+.. option:: amend [--object OBJECTDEF] [--image-opts] [-p] [-q] [-f FMT] [-t CACHE] [--force] -o OPTIONS FILENAME
+
+  Amends the image format specific *OPTIONS* for the image file
+  *FILENAME*. Not all file formats support this operation.
+
+  The set of options that can be amended are dependent on the image
+  format, but note that amending the backing chain relationship should
+  instead be performed with ``qemu-img rebase``.
+
+  --force allows some unsafe operations. Currently, for ``-f luks``, it
+  allows erasing the last encryption key and overwriting an active
+  encryption key.
+
+.. option:: bench [-c COUNT] [-d DEPTH] [-f FMT] [--flush-interval=FLUSH_INTERVAL] [-i AIO] [-n] [--no-drain] [-o OFFSET] [--pattern=PATTERN] [-q] [-s BUFFER_SIZE] [-S STEP_SIZE] [-t CACHE] [-w] [-U] FILENAME
+
+  Run a simple sequential I/O benchmark on the specified image. If ``-w`` is
+  specified, a write test is performed, otherwise a read test is performed.
+
+  A total number of *COUNT* I/O requests is performed, each *BUFFER_SIZE*
+  bytes in size, and with *DEPTH* requests in parallel. The first request
+  starts at the position given by *OFFSET*, each following request increases
+  the current position by *STEP_SIZE*. If *STEP_SIZE* is not given,
+  *BUFFER_SIZE* is used for its value.
+
+  If *FLUSH_INTERVAL* is specified for a write test, the request queue is
+  drained and a flush is issued before new writes are made whenever the number of
+  remaining requests is a multiple of *FLUSH_INTERVAL*. If additionally
+  ``--no-drain`` is specified, a flush is issued without draining the request
+  queue first.
+
+  If ``-i`` is specified, the *AIO* option can be used to specify different
+  AIO backends: ``threads``, ``native`` or ``io_uring``.
+
+  If ``-n`` is specified, the native AIO backend is used if possible. On
+  Linux, this option only works if ``-t none`` or ``-t directsync`` is
+  specified as well.
+
+  For write tests, by default a buffer filled with zeros is written. This can be
+  overridden with a pattern byte specified by *PATTERN*.
+
+.. option:: bitmap (--merge SOURCE | --add | --remove | --clear | --enable | --disable)... [-b SOURCE_FILE [-F SOURCE_FMT]] [-g GRANULARITY] [--object OBJECTDEF] [--image-opts | -f FMT] FILENAME BITMAP
+
+  Perform one or more modifications of the persistent bitmap *BITMAP*
+  in the disk image *FILENAME*. The various modifications are:
+
+  ``--add`` to create *BITMAP*, enabled to record future edits.
+
+  ``--remove`` to remove *BITMAP*.
+
+  ``--clear`` to clear *BITMAP*.
+
+  ``--enable`` to change *BITMAP* to start recording future edits.
+
+  ``--disable`` to change *BITMAP* to stop recording future edits.
+
+  ``--merge`` to merge the contents of the *SOURCE* bitmap into *BITMAP*.
+
+  Additional options include ``-g`` which sets a non-default
+  *GRANULARITY* for ``--add``, and ``-b`` and ``-F`` which select an
+  alternative source file for all *SOURCE* bitmaps used by
+  ``--merge``.
+
+  To see what bitmaps are present in an image, use ``qemu-img info``.
+
+.. 
option:: check [--object OBJECTDEF] [--image-opts] [-q] [-f FMT] [--output=OFMT] [-r [leaks | all]] [-T SRC_CACHE] [-U] FILENAME + + Perform a consistency check on the disk image *FILENAME*. The command can + output in the format *OFMT* which is either ``human`` or ``json``. + The JSON output is an object of QAPI type ``ImageCheck``. + + If ``-r`` is specified, qemu-img tries to repair any inconsistencies found + during the check. ``-r leaks`` repairs only cluster leaks, whereas + ``-r all`` fixes all kinds of errors, with a higher risk of choosing the + wrong fix or hiding corruption that has already occurred. + + Only the formats ``qcow2``, ``qed`` and ``vdi`` support + consistency checks. + + In case the image does not have any inconsistencies, check exits with ``0``. + Other exit codes indicate the kind of inconsistency found or if another error + occurred. The following table summarizes all exit codes of the check subcommand: + + 0 + Check completed, the image is (now) consistent + 1 + Check not completed because of internal errors + 2 + Check completed, image is corrupted + 3 + Check completed, image has leaked clusters, but is not corrupted + 63 + Checks are not supported by the image format + + If ``-r`` is specified, exit codes representing the image state refer to the + state after (the attempt at) repairing it. That is, a successful ``-r all`` + will yield the exit code 0, independently of the image state before. + +.. option:: commit [--object OBJECTDEF] [--image-opts] [-q] [-f FMT] [-t CACHE] [-b BASE] [-r RATE_LIMIT] [-d] [-p] FILENAME + + Commit the changes recorded in *FILENAME* in its base image or backing file. + If the backing file is smaller than the snapshot, then the backing file will be + resized to be the same size as the snapshot. If the snapshot is smaller than + the backing file, the backing file will not be truncated. If you want the + backing file to match the size of the smaller snapshot, you can safely truncate + it yourself once the commit operation successfully completes. + + The image *FILENAME* is emptied after the operation has succeeded. If you do + not need *FILENAME* afterwards and intend to drop it, you may skip emptying + *FILENAME* by specifying the ``-d`` flag. + + If the backing chain of the given image file *FILENAME* has more than one + layer, the backing file into which the changes will be committed may be + specified as *BASE* (which has to be part of *FILENAME*'s backing + chain). If *BASE* is not specified, the immediate backing file of the top + image (which is *FILENAME*) will be used. Note that after a commit operation + all images between *BASE* and the top image will be invalid and may return + garbage data when read. For this reason, ``-b`` implies ``-d`` (so that + the top image stays valid). + + The rate limit for the commit process is specified by ``-r``. + +.. option:: compare [--object OBJECTDEF] [--image-opts] [-f FMT] [-F FMT] [-T SRC_CACHE] [-p] [-q] [-s] [-U] FILENAME1 FILENAME2 + + Check if two images have the same content. You can compare images with + different format or settings. + + The format is probed unless you specify it by ``-f`` (used for + *FILENAME1*) and/or ``-F`` (used for *FILENAME2*) option. + + By default, images with different size are considered identical if the larger + image contains only unallocated and/or zeroed sectors in the area after the end + of the other image. In addition, if any sector is not allocated in one image + and contains only zero bytes in the second one, it is evaluated as equal. 
You
+can use Strict mode by specifying the ``-s`` option. When compare runs in
+Strict mode, it fails if the image sizes differ or if a sector is allocated
+in one image and is not allocated in the second one.
+
+By default, compare prints out a result message. This message reports
+whether the two images are identical or the position of the first differing
+byte. In addition, the result message can report a difference in image size
+when Strict mode is used.
+
+Compare exits with ``0`` in case the images are equal and with ``1``
+in case the images differ. Other exit codes mean an error occurred during
+execution and standard error output should contain an error message.
+The following table summarizes all exit codes of the compare subcommand:
+
+  0
+    Images are identical (or requested help was printed)
+  1
+    Images differ
+  2
+    Error on opening an image
+  3
+    Error on checking a sector allocation
+  4
+    Error on reading data
+
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [--bitmaps [--skip-broken-bitmaps]] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE [-F BACKING_FMT]] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+
+  Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
+  to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
+  be optionally compressed (``-c`` option) or use any format specific
+  options like encryption (``-o`` option).
+
+  Only the formats ``qcow`` and ``qcow2`` support compression. The
+  compression is read-only: if a compressed sector is rewritten, it is
+  rewritten as uncompressed data.
+
+  Image conversion is also useful to get a smaller image when using a
+  growable format such as ``qcow``: the empty sectors are detected and
+  suppressed from the destination image.
+
+  *SPARSE_SIZE* indicates the consecutive number of bytes (defaults to 4k)
+  that must contain only zeros for ``qemu-img`` to create a sparse image during
+  conversion. If *SPARSE_SIZE* is 0, the source will not be scanned for
+  unallocated or zero sectors, and the destination image will always be
+  fully allocated.
+
+  You can use the *BACKING_FILE* option to force the output image to be
+  created as a copy on write image of the specified base image; the
+  *BACKING_FILE* should have the same content as the input's base image,
+  however the path, image format (as given by *BACKING_FMT*), etc. may differ.
+
+  If a relative path name is given, the backing file is looked up relative to
+  the directory containing *OUTPUT_FILENAME*.
+
+  If the ``-n`` option is specified, the target volume creation will be
+  skipped. This is useful for formats such as ``rbd`` if the target
+  volume has already been created with site specific options that cannot
+  be supplied through ``qemu-img``.
+
+  Out of order writes can be enabled with ``-W`` to improve performance.
+  This is only recommended for preallocated devices like host devices or other
+  raw block devices. Out of order writes do not work in combination with
+  creating compressed images.
+
+  *NUM_COROUTINES* specifies how many coroutines work in parallel during
+  the convert process (defaults to 8).
+
+  Use of ``--bitmaps`` requests that any persistent bitmaps present in
+  the original are also copied to the destination. 
If any bitmap is
+  inconsistent in the source, the conversion will fail unless
+  ``--skip-broken-bitmaps`` is also specified to copy only the
+  consistent bitmaps.
+
+.. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE]
+
+  Create the new disk image *FILENAME* of size *SIZE* and format
+  *FMT*. Depending on the file format, you can add one or more *OPTIONS*
+  that enable additional features of this format.
+
+  If the option *BACKING_FILE* is specified, then the image will record
+  only the differences from *BACKING_FILE*. No size needs to be specified in
+  this case. *BACKING_FILE* will never be modified unless you use the
+  ``commit`` monitor command (or ``qemu-img commit``).
+
+  If a relative path name is given, the backing file is looked up relative to
+  the directory containing *FILENAME*.
+
+  Note that a given backing file will be opened to check that it is valid. Use
+  the ``-u`` option to enable unsafe backing file mode, which means that the
+  image will be created even if the associated backing file cannot be opened. A
+  matching backing file must be created or additional options be used to make the
+  backing file specification valid when you want to use an image created this
+  way.
+
+  The size can also be specified using the *SIZE* option with ``-o``, in
+  which case it doesn't need to be specified separately.
+
+
+.. option:: dd [--image-opts] [-U] [-f FMT] [-O OUTPUT_FMT] [bs=BLOCK_SIZE] [count=BLOCKS] [skip=BLOCKS] if=INPUT of=OUTPUT
+
+  dd copies from the *INPUT* file to the *OUTPUT* file, converting it from
+  *FMT* format to *OUTPUT_FMT* format.
+
+  The data is by default read and written using blocks of 512 bytes but can be
+  modified by specifying *BLOCK_SIZE*. If count=\ *BLOCKS* is specified,
+  dd will stop reading input after reading *BLOCKS* input blocks.
+
+  The size syntax is similar to :manpage:`dd(1)`'s size syntax.
+
+.. option:: info [--object OBJECTDEF] [--image-opts] [-f FMT] [--output=OFMT] [--backing-chain] [-U] FILENAME
+
+  Give information about the disk image *FILENAME*. Use it in
+  particular to know the size reserved on disk, which can be different
+  from the displayed size. If VM snapshots are stored in the disk image,
+  they are displayed too.
+
+  If a disk image has a backing file chain, information about each disk image in
+  the chain can be recursively enumerated by using the option ``--backing-chain``.
+
+  For instance, if you have an image chain like:
+
+  ::
+
+    base.qcow2 <- snap1.qcow2 <- snap2.qcow2
+
+  To enumerate information about each disk image in the above chain, starting from top to base, do:
+
+  ::
+
+    qemu-img info --backing-chain snap2.qcow2
+
+  The command can output in the format *OFMT* which is either ``human`` or
+  ``json``. The JSON output is an object of QAPI type ``ImageInfo``; with
+  ``--backing-chain``, it is an array of ``ImageInfo`` objects.
+
+  ``--output=human`` reports the following information (for every image in the
+  chain):
+
+  *image*
+    The image file name
+
+  *file format*
+    The image format
+
+  *virtual size*
+    The size of the guest disk
+
+  *disk size*
+    How much space the image file occupies on the host file system (may be
+    shown as 0 if this information is unavailable, e.g. 
because there is no
+    file system)
+
+  *cluster_size*
+    Cluster size of the image format, if applicable
+
+  *encrypted*
+    Whether the image is encrypted (only present if so)
+
+  *cleanly shut down*
+    This is shown as ``no`` if the image is dirty and will have to be
+    auto-repaired the next time it is opened in qemu.
+
+  *backing file*
+    The backing file name, if present
+
+  *backing file format*
+    The format of the backing file, if the image enforces it
+
+  *Snapshot list*
+    A list of all internal snapshots
+
+  *Format specific information*
+    Further information whose structure depends on the image format. This
+    section is a textual representation of the respective
+    ``ImageInfoSpecific*`` QAPI object (e.g. ``ImageInfoSpecificQCow2``
+    for qcow2 images).
+
+.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] [--start-offset=OFFSET] [--max-length=LEN] [--output=OFMT] [-U] FILENAME
+
+  Dump the metadata of image *FILENAME* and its backing file chain.
+  In particular, this command dumps the allocation state of every sector
+  of *FILENAME*, together with the topmost file that allocates it in
+  the backing file chain.
+
+  Two output formats are possible. The default format (``human``)
+  only dumps known-nonzero areas of the file. Known-zero parts of the
+  file are omitted altogether, and likewise for parts that are not allocated
+  throughout the chain. ``qemu-img`` output will identify a file
+  from which the data can be read, and the offset in the file. Each line
+  will include four fields, the first three of which are hexadecimal
+  numbers. For example the first line of:
+
+  ::
+
+    Offset          Length          Mapped to       File
+    0               0x20000         0x50000         /tmp/overlay.qcow2
+    0x100000        0x10000         0x95380000      /tmp/backing.qcow2
+
+  means that 0x20000 (131072) bytes starting at offset 0 in the image are
+  available in /tmp/overlay.qcow2 (opened in ``raw`` format) starting
+  at offset 0x50000 (327680). Data that is compressed, encrypted, or
+  otherwise not available in raw format will cause an error if ``human``
+  format is in use. Note that file names can include newlines, thus it is
+  not safe to parse this output format in scripts.
+
+  The alternative format ``json`` will return an array of dictionaries
+  in JSON format. It will include similar information in
+  the ``start``, ``length``, ``offset`` fields;
+  it will also include other more specific information:
+
+  - boolean field ``data``: true if the sectors contain actual data,
+    false if the sectors are either unallocated or stored as optimized
+    all-zero clusters
+  - boolean field ``zero``: true if the data is known to read as zero
+  - boolean field ``present``: true if the data belongs to the backing
+    chain, false if rebasing the backing chain onto a deeper file
+    would pick up data from the deeper file
+  - integer field ``depth``: the depth within the backing chain at
+    which the data was resolved; for example, a depth of 2 refers to
+    the backing file of the backing file of *FILENAME*.
+
+  In JSON format, the ``offset`` field is optional; it is absent in
+  cases where ``human`` format would omit the entry or exit with an error.
+  If ``data`` is false and the ``offset`` field is present, the
+  corresponding sectors in the file are not yet in use, but they are
+  preallocated.
+
+  For more information, consult ``include/block/block.h`` in QEMU's
+  source code.
+
+.. 
option:: measure [--output=OFMT] [-O OUTPUT_FMT] [-o OPTIONS] [--size N | [--object OBJECTDEF] [--image-opts] [-f FMT] [-l SNAPSHOT_PARAM] FILENAME]
+
+  Calculate the file size required for a new image. This information
+  can be used to size logical volumes or SAN LUNs appropriately for
+  the image that will be placed in them. The values reported are
+  guaranteed to be large enough to fit the image. The command can
+  output in the format *OFMT* which is either ``human`` or ``json``.
+  The JSON output is an object of QAPI type ``BlockMeasureInfo``.
+
+  If the size *N* is given, then act as if creating a new empty image file
+  using ``qemu-img create``. If *FILENAME* is given, then act as if
+  converting an existing image file using ``qemu-img convert``. The format
+  of the new file is given by *OUTPUT_FMT* while the format of an existing
+  file is given by *FMT*.
+
+  A snapshot in an existing image can be specified using *SNAPSHOT_PARAM*.
+
+  The following fields are reported:
+
+  ::
+
+    required size: 524288
+    fully allocated size: 1074069504
+    bitmaps size: 0
+
+  The ``required size`` is the file size of the new image. It may be smaller
+  than the virtual disk size if the image format supports compact representation.
+
+  The ``fully allocated size`` is the file size of the new image once data has
+  been written to all sectors. This is the maximum size that the image file can
+  occupy with the exception of internal snapshots, dirty bitmaps, vmstate data,
+  and other advanced image format features.
+
+  The ``bitmaps size`` is the additional size required in order to
+  copy bitmaps from a source image in addition to the guest-visible
+  data. The line is omitted if either the source or the destination lacks
+  bitmap support, and is 0 if bitmaps are supported but there is nothing
+  to copy.
+
+.. option:: snapshot [--object OBJECTDEF] [--image-opts] [-U] [-q] [-l | -a SNAPSHOT | -c SNAPSHOT | -d SNAPSHOT] FILENAME
+
+  List, apply, create or delete snapshots in image *FILENAME*.
+
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+
+  Changes the backing file of an image. Only the formats ``qcow2`` and
+  ``qed`` support changing the backing file.
+
+  The backing file is changed to *BACKING_FILE* and (if the image format of
+  *FILENAME* supports this) the backing file format is changed to
+  *BACKING_FMT*. If *BACKING_FILE* is specified as "" (the empty
+  string), then the image is rebased onto no backing file (i.e. it will exist
+  independently of any backing file).
+
+  If a relative path name is given, the backing file is looked up relative to
+  the directory containing *FILENAME*.
+
+  *CACHE* specifies the cache mode to be used for *FILENAME*, whereas
+  *SRC_CACHE* specifies the cache mode for reading backing files.
+
+  There are two different modes in which ``rebase`` can operate:
+
+  Safe mode
+    This is the default mode and performs a real rebase operation. The
+    new backing file may differ from the old one and ``qemu-img rebase``
+    will take care of keeping the guest-visible content of *FILENAME*
+    unchanged.
+
+    In order to achieve this, any clusters that differ between
+    *BACKING_FILE* and the old backing file of *FILENAME* are merged
+    into *FILENAME* before actually changing the backing file.
+
+    Note that the safe mode is an expensive operation, comparable to
+    converting an image. It only works if the old backing file still
+    exists.
+
+  Unsafe mode
+    ``qemu-img`` uses the unsafe mode if ``-u`` is specified. In this
+    mode, only the backing file name and format of *FILENAME* are changed
+    without any checks on the file contents. The user must take care of
+    specifying the correct new backing file, or the guest-visible
+    content of the image will be corrupted.
+
+    This mode is useful for renaming or moving the backing file to
+    somewhere else. It can be used without an accessible old backing
+    file, i.e. you can use it to fix an image whose backing file has
+    already been moved/renamed.
+
+  You can use ``rebase`` to perform a "diff" operation on two
+  disk images. This can be useful when you have copied or cloned
+  a guest, and you want to get back to a thin image on top of a
+  template or base image.
+
+  Say that ``base.img`` has been cloned as ``modified.img`` by
+  copying it, and that the ``modified.img`` guest has run so there
+  are now some changes compared to ``base.img``. To construct a thin
+  image called ``diff.qcow2`` that contains just the differences, do:
+
+  ::
+
+    qemu-img create -f qcow2 -b modified.img diff.qcow2
+    qemu-img rebase -b base.img diff.qcow2
+
+  At this point, ``modified.img`` can be discarded, since
+  ``base.img + diff.qcow2`` contains the same information.
+
+.. option:: resize [--object OBJECTDEF] [--image-opts] [-f FMT] [--preallocation=PREALLOC] [-q] [--shrink] FILENAME [+ | -]SIZE
+
+  Change the disk image as if it had been created with *SIZE*.
+
+  Before using this command to shrink a disk image, you MUST use file system and
+  partitioning tools inside the VM to reduce allocated file systems and partition
+  sizes accordingly. Failure to do so will result in data loss!
+
+  When shrinking images, the ``--shrink`` option must be given. This informs
+  ``qemu-img`` that the user acknowledges all loss of data beyond the truncated
+  image's end.
+
+  After using this command to grow a disk image, you must use file system and
+  partitioning tools inside the VM to actually begin using the new space on the
+  device.
+
+  When growing an image, the ``--preallocation`` option may be used to specify
+  how the additional image area should be allocated on the host. See the format
+  description in the :ref:`notes` section for which values are allowed. Using
+  this option may result in slightly more data being allocated than necessary.
+
+.. _notes:
+
+Notes
+-----
+
+Supported image file formats:
+
+``raw``
+
+  Raw disk image format (default). This format has the advantage of
+  being simple and easily exportable to all other emulators. If your
+  file system supports *holes* (for example in ext2 or ext3 on
+  Linux or NTFS on Windows), then only the written sectors will reserve
+  space. Use ``qemu-img info`` to know the real size used by the
+  image or ``ls -ls`` on Unix/Linux.
+
+  Supported options:
+
+  ``preallocation``
+    Preallocation mode (allowed values: ``off``, ``falloc``,
+    ``full``). ``falloc`` mode preallocates space for the image by
+    calling ``posix_fallocate()``. ``full`` mode preallocates space
+    for the image by writing data to the underlying storage. This data
+    may or may not be zero, depending on the storage location.
+
+``qcow2``
+
+  QEMU image format, the most versatile format. Use it to have smaller
+  images (useful if your filesystem does not support holes, for example
+  on Windows), optional AES encryption, zlib based compression and
+  support of multiple VM snapshots.
+
+  Supported options:
+
+  ``compat``
+    Determines the qcow2 version to use. 
``compat=0.10`` uses the
+    traditional image format that can be read by any QEMU since 0.10.
+    ``compat=1.1`` enables image format extensions that only QEMU 1.1 and
+    newer understand (this is the default). Amongst others, this includes
+    zero clusters, which allow efficient copy-on-read for sparse images.
+
+  ``backing_file``
+    File name of a base image (see ``create`` subcommand)
+
+  ``backing_fmt``
+    Image format of the base image
+
+  ``encryption``
+    If this option is set to ``on``, the image is encrypted with
+    128-bit AES-CBC.
+
+    The use of encryption in qcow and qcow2 images is considered to be
+    flawed by modern cryptography standards, suffering from a number
+    of design problems:
+
+    - The AES-CBC cipher is used with predictable initialization
+      vectors based on the sector number. This makes it vulnerable to
+      chosen plaintext attacks which can reveal the existence of
+      encrypted data.
+
+    - The user passphrase is directly used as the encryption key. A
+      poorly chosen or short passphrase will compromise the security
+      of the encryption.
+
+    - In the event of the passphrase being compromised there is no way
+      to change the passphrase to protect data in any qcow images. The
+      files must be cloned, using a different encryption passphrase in
+      the new file. The original file must then be securely erased
+      using a program like shred, though even this is ineffective with
+      many modern storage technologies.
+
+    - Initialization vectors used to encrypt sectors are based on the
+      guest virtual sector number, instead of the host physical
+      sector. When a disk image has multiple internal snapshots this
+      means that data in multiple physical sectors is encrypted with
+      the same initialization vector. With the CBC mode, this opens
+      the possibility of watermarking attacks if the attacker can
+      collect multiple sectors encrypted with the same IV and some
+      predictable data. Having multiple qcow2 images with the same
+      passphrase also exposes this weakness since the passphrase is
+      directly used as the key.
+
+    Use of qcow / qcow2 encryption is thus strongly discouraged. Users are
+    recommended to use an alternative encryption technology such as the
+    Linux dm-crypt / LUKS system.
+
+  ``cluster_size``
+    Changes the qcow2 cluster size (must be between 512 and
+    2M). Smaller cluster sizes can improve the image file size whereas
+    larger cluster sizes generally provide better performance.
+
+  ``preallocation``
+    Preallocation mode (allowed values: ``off``, ``metadata``,
+    ``falloc``, ``full``). An image with preallocated metadata is
+    initially larger but can improve performance when the image needs
+    to grow. ``falloc`` and ``full`` preallocations are like the same
+    options of the ``raw`` format, but they set up the metadata as well.
+
+  ``lazy_refcounts``
+    If this option is set to ``on``, reference count updates are
+    postponed with the goal of avoiding metadata I/O and improving
+    performance. This is particularly interesting with
+    ``cache=writethrough`` which doesn't batch metadata
+    updates. The tradeoff is that after a host crash, the reference
+    count tables must be rebuilt, i.e. on the next open an (automatic)
+    ``qemu-img check -r all`` is required, which may take some time.
+
+    This option can only be enabled if ``compat=1.1`` is specified.
+
+  ``nocow``
+    If this option is set to ``on``, it will turn off COW of the file.
+    It is only valid on btrfs; it has no effect on other file systems.
+
+    Btrfs has low performance when hosting a VM image file, even more
+    so when the guest in the VM is also using btrfs as its file system.
+    Turning off COW is a way to mitigate this bad performance. Generally
+    there are two ways to turn off COW on btrfs:
+
+    - Disable it by mounting with nodatacow; then all newly created files
+      will be NOCOW.
+    - For an empty file, add the NOCOW file attribute. That's what this
+      option does.
+
+    Note: this option is only valid for new or empty files. If there is
+    an existing file which is COW and already has data blocks, it cannot
+    be changed to NOCOW by setting ``nocow=on``. One can issue
+    ``lsattr filename`` to check if the NOCOW flag is set or not (a
+    capital 'C' is the NOCOW flag).
+
+  ``data_file``
+    Filename where all guest data will be stored. If this option is used,
+    the qcow2 file will only contain the image's metadata.
+
+    Note: Data loss will occur if the given filename already exists when
+    using this option with ``qemu-img create`` since ``qemu-img`` will create
+    the data file anew, overwriting the file's original contents. To simply
+    update the reference to point to the given pre-existing file, use
+    ``qemu-img amend``.
+
+  ``data_file_raw``
+    If this option is set to ``on``, QEMU will always keep the external data
+    file consistent as a standalone read-only raw image.
+
+    It does this by forwarding all write accesses to the qcow2 file through to
+    the raw data file, including their offsets. Therefore, data that is visible
+    on the qcow2 node (i.e., to the guest) at some offset is visible at the same
+    offset in the raw data file. This results in a read-only raw image. Writes
+    that bypass the qcow2 metadata may corrupt the qcow2 metadata because the
+    out-of-band writes may result in the metadata falling out of sync with the
+    raw image.
+
+    If this option is ``off``, QEMU will use the data file to store data in an
+    arbitrary manner. The file's content will not make sense without the
+    accompanying qcow2 metadata. Where data is written will have no relation to
+    its offset as seen by the guest, and some writes (specifically zero writes)
+    may not be forwarded to the data file at all, but will only be handled by
+    modifying qcow2 metadata.
+
+    This option can only be enabled if ``data_file`` is set.
+
+``Other``
+
+  QEMU also supports various other image file formats for
+  compatibility with older QEMU versions or other hypervisors,
+  including VMDK, VDI, VHD (vpc), VHDX, qcow1 and QED. For a full list
+  of supported formats see ``qemu-img --help``. For a more detailed
+  description of these formats, see the QEMU block drivers reference
+  documentation.
+
+  The main purpose of the block drivers for these formats is image
+  conversion. For running VMs, it is recommended to convert the disk
+  images to either raw or qcow2 in order to achieve good performance.
diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst
new file mode 100644
index 000000000..6031f9689
--- /dev/null
+++ b/docs/tools/qemu-nbd.rst
@@ -0,0 +1,265 @@
+=====================================
+QEMU Disk Network Block Device Server
+=====================================
+
+Synopsis
+--------
+
+**qemu-nbd** [*OPTION*]... *filename*
+
+**qemu-nbd** -L [*OPTION*]...
+
+**qemu-nbd** -d *dev*
+
+Description
+-----------
+
+Export a QEMU disk image using the NBD protocol.
+
+Other uses:
+
+- Bind a /dev/nbdX block device to a QEMU server (on Linux).
+- As a client to query exports of a remote NBD server.
+
+Options
+-------
+
+.. program:: qemu-nbd
+
+*filename* is a disk image filename, or a set of block
+driver options if ``--image-opts`` is specified.
+
+*dev* is an NBD device.
+
+.. option:: --object type,id=ID,...
+
+  Define a new instance of the *type* object class identified by *ID*.
+  See the :manpage:`qemu(1)` manual page for full details of the properties
+  supported. The common object types that it makes sense to define are the
+  ``secret`` object, which is used to supply passwords and/or encryption
+  keys, and the ``tls-creds`` object, which is used to supply TLS
+  credentials for the ``qemu-nbd`` server or client.
+
+.. option:: -p, --port=PORT
+
+  TCP port to listen on as a server, or connect to as a client
+  (default ``10809``).
+
+.. option:: -o, --offset=OFFSET
+
+  The offset into the image.
+
+.. option:: -b, --bind=IFACE
+
+  The interface to bind to as a server, or connect to as a client
+  (default ``0.0.0.0``).
+
+.. option:: -k, --socket=PATH
+
+  Use a unix socket with path *PATH*.
+
+.. option:: --image-opts
+
+  Treat *filename* as a set of image options, instead of a plain
+  filename. If this flag is specified, the ``-f`` flag should
+  not be used; instead, the :option:`format=` option should be set.
+
+.. option:: -f, --format=FMT
+
+  Force the use of the block driver for format *FMT* instead of
+  auto-detecting.
+
+.. option:: -r, --read-only
+
+  Export the disk as read-only.
+
+.. option:: -A, --allocation-depth
+
+  Expose allocation depth information via the
+  ``qemu:allocation-depth`` metadata context accessible through
+  NBD_OPT_SET_META_CONTEXT.
+
+.. option:: -B, --bitmap=NAME
+
+  If *filename* has a qcow2 persistent bitmap *NAME*, expose
+  that bitmap via the ``qemu:dirty-bitmap:NAME`` metadata context
+  accessible through NBD_OPT_SET_META_CONTEXT.
+
+.. option:: -s, --snapshot
+
+  Use *filename* as an external snapshot, creating a temporary
+  file with ``backing_file=``\ *filename* and redirecting writes to
+  the temporary one.
+
+.. option:: -l, --load-snapshot=SNAPSHOT_PARAM
+
+  Load an internal snapshot inside *filename* and export it
+  as a read-only device; the *SNAPSHOT_PARAM* format is
+  ``snapshot.id=[ID],snapshot.name=[NAME]`` or ``[ID_OR_NAME]``.
+
+.. option:: --cache=CACHE
+
+  The cache mode to be used with the file. Valid values are:
+  ``none``, ``writeback`` (the default), ``writethrough``,
+  ``directsync`` and ``unsafe``. See the documentation of
+  the emulator's ``-drive cache=...`` option for more info.
+
+.. option:: -n, --nocache
+
+  Equivalent to :option:`--cache=none`.
+
+.. option:: --aio=AIO
+
+  Set the asynchronous I/O mode between ``threads`` (the default),
+  ``native`` (Linux only), and ``io_uring`` (Linux 5.1+).
+
+.. option:: --discard=DISCARD
+
+  Control whether ``discard`` (also known as ``trim`` or ``unmap``)
+  requests are ignored or passed to the filesystem. *DISCARD* is one of
+  ``ignore`` (or ``off``), ``unmap`` (or ``on``). The default is
+  ``ignore``.
+
+.. option:: --detect-zeroes=DETECT_ZEROES
+
+  Control the automatic conversion of plain zero writes by the OS to
+  driver-specific optimized zero write commands. *DETECT_ZEROES* is one of
+  ``off``, ``on``, or ``unmap``. ``unmap``
+  converts a zero write to an unmap operation and can only be used if
+  *DISCARD* is set to ``unmap``. The default is ``off``.
+
+.. option:: -c, --connect=DEV
+
+  Connect *filename* to NBD device *DEV* (Linux only).
+
+.. option:: -d, --disconnect
+
+  Disconnect the device *DEV* (Linux only).
+
+.. 
option:: -e, --shared=NUM + + Allow up to *NUM* clients to share the device (default + ``1``), 0 for unlimited. Safe for readers, but for now, + consistency is not guaranteed between multiple writers. + +.. option:: -t, --persistent + + Don't exit on the last connection. + +.. option:: -x, --export-name=NAME + + Set the NBD volume export name (default of a zero-length string). + +.. option:: -D, --description=DESCRIPTION + + Set the NBD volume export description, as a human-readable + string. + +.. option:: -L, --list + + Connect as a client and list all details about the exports exposed by + a remote NBD server. This enables list mode, and is incompatible + with options that change behavior related to a specific export (such as + :option:`--export-name`, :option:`--offset`, ...). + +.. option:: --tls-creds=ID + + Enable mandatory TLS encryption for the server by setting the ID + of the TLS credentials object previously created with the --object + option; or provide the credentials needed for connecting as a client + in list mode. + +.. option:: --fork + + Fork off the server process and exit the parent once the server is running. + +.. option:: --pid-file=PATH + + Store the server's process ID in the given file. + +.. option:: --tls-authz=ID + + Specify the ID of a qauthz object previously created with the + :option:`--object` option. This will be used to authorize connecting users + against their x509 distinguished name. + +.. option:: -v, --verbose + + Display extra debugging information. + +.. option:: -h, --help + + Display this help and exit. + +.. option:: -V, --version + + Display version information and exit. + +.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE] + + .. include:: ../qemu-option-trace.rst.inc + +Examples +-------- + +Start a server listening on port 10809 that exposes only the +guest-visible contents of a qcow2 file, with no TLS encryption, and +with the default export name (an empty string). The command is +one-shot, and will block until the first successful client +disconnects: + +:: + + qemu-nbd -f qcow2 file.qcow2 + +Start a long-running server listening with encryption on port 10810, +and whitelist clients with a specific X.509 certificate to connect to +a 1 megabyte subset of a raw file, using the export name 'subset': + +:: + + qemu-nbd \ + --object tls-creds-x509,id=tls0,endpoint=server,dir=/path/to/qemutls \ + --object 'authz-simple,id=auth0,identity=CN=laptop.example.com,,\ + O=Example Org,,L=London,,ST=London,,C=GB' \ + --tls-creds tls0 --tls-authz auth0 \ + -t -x subset -p 10810 \ + --image-opts driver=raw,offset=1M,size=1M,file.driver=file,file.filename=file.raw + +Serve a read-only copy of a guest image over a Unix socket with as +many as 5 simultaneous readers, with a persistent process forked as a +daemon: + +:: + + qemu-nbd --fork --persistent --shared=5 --socket=/path/to/sock \ + --read-only --format=qcow2 file.qcow2 + +Expose the guest-visible contents of a qcow2 file via a block device +/dev/nbd0 (and possibly creating /dev/nbd0p1 and friends for +partitions found within), then disconnect the device when done. +Access to bind ``qemu-nbd`` to a /dev/nbd device generally requires root +privileges, and may also require the execution of ``modprobe nbd`` +to enable the kernel NBD client module. *CAUTION*: Do not use +this method to mount filesystems from an untrusted guest image - a +malicious guest may have prepared the image to attempt to trigger +kernel bugs in partition probing or file system mounting. 
+ +:: + + qemu-nbd -c /dev/nbd0 -f qcow2 file.qcow2 + qemu-nbd -d /dev/nbd0 + +Query a remote server to see details about what export(s) it is +serving on port 10809, and authenticating via PSK: + +:: + + qemu-nbd \ + --object tls-creds-psk,id=tls0,dir=/tmp/keys,username=eblake,endpoint=client \ + --tls-creds tls0 -L -b remote.example.com + +See also +-------- + +:manpage:`qemu(1)`, :manpage:`qemu-img(1)` diff --git a/docs/tools/qemu-pr-helper.rst b/docs/tools/qemu-pr-helper.rst new file mode 100644 index 000000000..eaebe40da --- /dev/null +++ b/docs/tools/qemu-pr-helper.rst @@ -0,0 +1,91 @@ +================================== +QEMU persistent reservation helper +================================== + +Synopsis +-------- + +**qemu-pr-helper** [*OPTION*] + +Description +----------- + +Implements the persistent reservation helper for QEMU. + +SCSI persistent reservations allow restricting access to block devices +to specific initiators in a shared storage setup. When implementing +clustering of virtual machines, it is a common requirement for virtual +machines to send persistent reservation SCSI commands. However, +the operating system restricts sending these commands to unprivileged +programs because incorrect usage can disrupt regular operation of the +storage fabric. QEMU's SCSI passthrough devices ``scsi-block`` +and ``scsi-generic`` support passing guest persistent reservation +requests to a privileged external helper program. :program:`qemu-pr-helper` +is that external helper; it creates a socket which QEMU can +connect to to communicate with it. + +If you want to run VMs in a setup like this, this helper should be +started as a system service, and you should read the QEMU manual +section on "persistent reservation managers" to find out how to +configure QEMU to connect to the socket created by +:program:`qemu-pr-helper`. + +After connecting to the socket, :program:`qemu-pr-helper` can +optionally drop root privileges, except for those capabilities that +are needed for its operation. + +:program:`qemu-pr-helper` can also use the systemd socket activation +protocol. In this case, the systemd socket unit should specify a +Unix stream socket, like this:: + + [Socket] + ListenStream=/var/run/qemu-pr-helper.sock + +Options +------- + +.. program:: qemu-pr-helper + +.. option:: -d, --daemon + + run in the background (and create a PID file) + +.. option:: -q, --quiet + + decrease verbosity + +.. option:: -v, --verbose + + increase verbosity + +.. option:: -f, --pidfile=PATH + + PID file when running as a daemon. By default the PID file + is created in the system runtime state directory, for example + :file:`/var/run/qemu-pr-helper.pid`. + +.. option:: -k, --socket=PATH + + path to the socket. By default the socket is created in + the system runtime state directory, for example + :file:`/var/run/qemu-pr-helper.sock`. + +.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE] + + .. include:: ../qemu-option-trace.rst.inc + +.. option:: -u, --user=USER + + user to drop privileges to + +.. option:: -g, --group=GROUP + + group to drop privileges to + +.. option:: -h, --help + + Display a help message and exit. + +.. option:: -V, --version + + Display version information and exit. 
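+
+Example
+-------
+
+A typical invocation (an illustrative sketch; the socket path and the
+``qemu`` user and group are assumptions) runs the helper as a daemon,
+dropping privileges to a dedicated account::
+
+  # qemu-pr-helper -d -u qemu -g qemu -k /var/run/qemu-pr-helper.sock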
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst new file mode 100644 index 000000000..3e5a9dc03 --- /dev/null +++ b/docs/tools/qemu-storage-daemon.rst @@ -0,0 +1,223 @@ +=================== +QEMU Storage Daemon +=================== + +Synopsis +-------- + +**qemu-storage-daemon** [options] + +Description +----------- + +``qemu-storage-daemon`` provides disk image functionality from QEMU, +``qemu-img``, and ``qemu-nbd`` in a long-running process controlled via QMP +commands without running a virtual machine. +It can export disk images, run block job operations, and +perform other disk-related operations. The daemon is controlled via a QMP +monitor and initial configuration from the command-line. + +The daemon offers the following subset of QEMU features: + +* Block nodes +* Block jobs +* Block exports +* Throttle groups +* Character devices +* Crypto and secrets +* QMP +* IOThreads + +Commands can be sent over a QEMU Monitor Protocol (QMP) connection. See the +:manpage:`qemu-storage-daemon-qmp-ref(7)` manual page for a description of the +commands. + +The daemon runs until it is stopped using the ``quit`` QMP command or +SIGINT/SIGHUP/SIGTERM. + +**Warning:** Never modify images in use by a running virtual machine or any +other process; this may destroy the image. Also, be aware that querying an +image that is being modified by another process may encounter inconsistent +state. + +Options +------- + +.. program:: qemu-storage-daemon + +Standard options: + +.. option:: -h, --help + + Display help and exit + +.. option:: -V, --version + + Display version information and exit + +.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE] + + .. include:: ../qemu-option-trace.rst.inc + +.. option:: --blockdev BLOCKDEVDEF + + is a block node definition. See the :manpage:`qemu(1)` manual page for a + description of block node properties and the :manpage:`qemu-block-drivers(7)` + manual page for a description of driver-specific parameters. + +.. option:: --chardev CHARDEVDEF + + is a character device definition. See the :manpage:`qemu(1)` manual page for + a description of character device properties. A common character device + definition configures a UNIX domain socket:: + + --chardev socket,id=char1,path=/var/run/qsd-qmp.sock,server=on,wait=off + +.. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>] + --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>] + --export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>] + --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off] + + is a block export definition. ``node-name`` is the block node that should be + exported. ``writable`` determines whether or not the export allows write + requests for modifying data (the default is off). + + The ``nbd`` export type requires ``--nbd-server`` (see below). ``name`` is + the NBD export name (if not specified, it defaults to the given + ``node-name``). ``bitmap`` is the name of a dirty bitmap reachable from the + block node, so the NBD client can use NBD_OPT_SET_META_CONTEXT with the + metadata context name "qemu:dirty-bitmap:BITMAP" to inspect the bitmap. 
+
+  The ``vhost-user-blk`` export type takes a vhost-user socket address on
+  which it accepts incoming connections. Both
+  ``addr.type=unix,addr.path=<socket-path>`` for UNIX domain sockets and
+  ``addr.type=fd,addr.str=<fd>`` for file descriptor passing are supported.
+  ``logical-block-size`` sets the logical block size in bytes (the default is
+  512). ``num-queues`` sets the number of virtqueues (the default is 1).
+
+  The ``fuse`` export type takes a mount point, which must be a regular file,
+  on which to export the given block node. That file will not be changed; it
+  will just appear to have the block node's content while the export is active
+  (very much like mounting a filesystem on a directory does not change what the
+  directory contains; it only shows different content while the filesystem is
+  mounted). Consequently, applications that have opened the given file before
+  the export became active will continue to see its original content. If
+  ``growable`` is set, writes after the end of the exported file will grow the
+  block node to fit.
+
+.. option:: --monitor MONITORDEF
+
+  is a QMP monitor definition. See the :manpage:`qemu(1)` manual page for
+  a description of QMP monitor properties. A common QMP monitor definition
+  configures a monitor on character device ``char1``::
+
+    --monitor chardev=char1
+
+.. option:: --nbd-server addr.type=inet,addr.host=<host>,addr.port=<port>[,tls-creds=<id>][,tls-authz=<id>][,max-connections=<n>]
+            --nbd-server addr.type=unix,addr.path=<path>[,tls-creds=<id>][,tls-authz=<id>][,max-connections=<n>]
+            --nbd-server addr.type=fd,addr.str=<fd>[,tls-creds=<id>][,tls-authz=<id>][,max-connections=<n>]
+
+  is a server for NBD exports. Both TCP and UNIX domain sockets are supported.
+  A listen socket can be provided via file descriptor passing (see Examples
+  below). TLS encryption can be configured using ``--object`` tls-creds-* and
+  authz-* objects (see below).
+
+  To configure an NBD server on UNIX domain socket path
+  ``/var/run/qsd-nbd.sock``::
+
+    --nbd-server addr.type=unix,addr.path=/var/run/qsd-nbd.sock
+
+.. option:: --object help
+            --object <type>,help
+            --object <type>[,<property>=<value>...]
+
+  is a QEMU user creatable object definition. List object types with ``help``.
+  List object properties with ``<type>,help``. See the :manpage:`qemu(1)`
+  manual page for a description of the object properties.
+
+.. option:: --pidfile PATH
+
+  is the path to a file where the daemon writes its pid. This allows scripts to
+  stop the daemon by sending a signal::
+
+    $ kill -SIGTERM $(<path/to/qsd.pid)
+
+  A file lock is applied to the file so only one instance of the daemon can run
+  with a given pid file path. The daemon unlinks its pid file when terminating.
+
+  The pid file is written after chardevs, exports, and NBD servers have been
+  created but before accepting connections. The daemon has started successfully
+  when the pid file is written and clients may begin connecting.
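+
+  For example, a wrapper script can wait for the pid file to appear before
+  connecting to the daemon (an illustrative sketch; the pid file path is an
+  assumption)::
+
+    $ qemu-storage-daemon \
+        --chardev socket,path=qmp.sock,server=on,wait=off,id=char1 \
+        --monitor chardev=char1 \
+        --pidfile /tmp/qsd.pid &
+    $ while [ ! -e /tmp/qsd.pid ]; do sleep 0.1; done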
+
+Examples
+--------
+Launch the daemon with QMP monitor socket ``qmp.sock`` so clients can execute
+QMP commands::
+
+  $ qemu-storage-daemon \
+      --chardev socket,path=qmp.sock,server=on,wait=off,id=char1 \
+      --monitor chardev=char1
+
+Launch the daemon from Python with a QMP monitor socket using file descriptor
+passing so there is no need to busy-wait for the QMP monitor to become
+available::
+
+  #!/usr/bin/env python3
+  import subprocess
+  import socket
+
+  sock_path = '/var/run/qmp.sock'
+
+  with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listen_sock:
+      listen_sock.bind(sock_path)
+      listen_sock.listen()
+
+      fd = listen_sock.fileno()
+
+      subprocess.Popen(
+          ['qemu-storage-daemon',
+           '--chardev', f'socket,fd={fd},server=on,id=char1',
+           '--monitor', 'chardev=char1'],
+          pass_fds=[fd],
+      )
+
+  # listen_sock was automatically closed when leaving the 'with' statement
+  # body. If the daemon process terminated early then the following connect()
+  # will fail with "Connection refused" because no process has the listen
+  # socket open anymore. Launch errors can be detected this way.
+
+  qmp_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+  qmp_sock.connect(sock_path)
+  ...QMP interaction...
+
+The same socket spawning approach also works with the ``--nbd-server
+addr.type=fd,addr.str=<fd>`` and ``--export
+type=vhost-user-blk,addr.type=fd,addr.str=<fd>`` options.
+
+Export raw image file ``disk.img`` over NBD UNIX domain socket ``nbd.sock``::
+
+  $ qemu-storage-daemon \
+      --blockdev driver=file,node-name=disk,filename=disk.img \
+      --nbd-server addr.type=unix,addr.path=nbd.sock \
+      --export type=nbd,id=export,node-name=disk,writable=on
+
+Export a qcow2 image file ``disk.qcow2`` as a vhost-user-blk device over UNIX
+domain socket ``vhost-user-blk.sock``::
+
+  $ qemu-storage-daemon \
+      --blockdev driver=file,node-name=file,filename=disk.qcow2 \
+      --blockdev driver=qcow2,node-name=qcow2,file=file \
+      --export type=vhost-user-blk,id=export,addr.type=unix,addr.path=vhost-user-blk.sock,node-name=qcow2
+
+Export a qcow2 image file ``disk.qcow2`` via FUSE on itself, so the disk image
+file will then appear as a raw image::
+
+  $ qemu-storage-daemon \
+      --blockdev driver=file,node-name=file,filename=disk.qcow2 \
+      --blockdev driver=qcow2,node-name=qcow2,file=file \
+      --export type=fuse,id=export,node-name=qcow2,mountpoint=disk.qcow2,writable=on
+
+See also
+--------
+
+:manpage:`qemu(1)`, :manpage:`qemu-block-drivers(7)`, :manpage:`qemu-storage-daemon-qmp-ref(7)`
diff --git a/docs/tools/qemu-trace-stap.rst b/docs/tools/qemu-trace-stap.rst
new file mode 100644
index 000000000..d53073b52
--- /dev/null
+++ b/docs/tools/qemu-trace-stap.rst
@@ -0,0 +1,125 @@
+=========================
+QEMU SystemTap trace tool
+=========================
+
+Synopsis
+--------
+
+**qemu-trace-stap** [*GLOBAL-OPTIONS*] *COMMAND* [*COMMAND-OPTIONS*] *ARGS*...
+
+Description
+-----------
+
+The ``qemu-trace-stap`` program facilitates tracing of the execution
+of QEMU emulators using SystemTap.
+
+It is required to have the SystemTap runtime environment installed to use
+this program, since it is a wrapper around execution of the ``stap``
+program.
+
+Options
+-------
+
+.. program:: qemu-trace-stap
+
+The following global options may be used regardless of which command
+is executed:
+
+.. option:: --verbose, -v
+
+  Display verbose information about command execution.
+
+The following commands are valid:
+
+.. option:: list BINARY PATTERN...
+
+  List all the probe names provided by *BINARY* that match
+  *PATTERN*.
+
+  If *BINARY* is not an absolute path, it will be located by searching
+  the directories listed in the ``$PATH`` environment variable.
+
+  *PATTERN* is a plain string that is used to filter the results of
+  this command. It may optionally contain a ``*`` wildcard to facilitate
+  matching multiple probes without listing each one explicitly. Multiple
+  *PATTERN* arguments may be given, causing listing of probes that match
+  any of the listed names. If no *PATTERN* is given, all possible
+  probes will be listed.
+
+  For example, to list all probes available in the ``qemu-system-x86_64``
+  binary:
+
+  ::
+
+     $ qemu-trace-stap list qemu-system-x86_64
+
+  To filter the list to only cover probes related to QEMU's cryptographic
+  subsystem, in a binary outside ``$PATH``:
+
+  ::
+
+     $ qemu-trace-stap list /opt/qemu/4.0.0/bin/qemu-system-x86_64 'qcrypto*'
+
+.. option:: run OPTIONS BINARY PATTERN...
+
+  Run a trace session, printing formatted output any time a process that is
+  executing *BINARY* triggers a probe matching *PATTERN*.
+
+  If *BINARY* is not an absolute path, it will be located by searching
+  the directories listed in the ``$PATH`` environment variable.
+
+  *PATTERN* is a plain string that matches a probe name shown by the
+  ``list`` command. It may optionally contain a ``*`` wildcard to
+  facilitate matching multiple probes without listing each one explicitly.
+  Multiple *PATTERN* arguments may be given, causing all matching probes
+  to be monitored. At least one *PATTERN* is required, since stap is not
+  capable of tracing all known QEMU probes concurrently without overflowing
+  its trace buffer.
+
+  Invocation of this command does not need to be synchronized with
+  invocation of the QEMU process(es). It will match probes on all
+  existing running processes and all future launched processes,
+  unless told to only monitor a specific process.
+
+  Valid command specific options are:
+
+  .. program:: qemu-trace-stap-run
+
+  .. option:: --pid=PID, -p PID
+
+     Restrict the tracing session so that it only triggers for the process
+     identified by *PID*.
+
+  For example, to monitor all processes executing ``qemu-system-x86_64``
+  as found on ``$PATH``, displaying all I/O related probes:
+
+  ::
+
+     $ qemu-trace-stap run qemu-system-x86_64 'qio*'
+
+  To monitor only the QEMU process with PID 1732:
+
+  ::
+
+     $ qemu-trace-stap run --pid=1732 qemu-system-x86_64 'qio*'
+
+  To monitor QEMU processes running an alternative binary outside of
+  ``$PATH``, displaying verbose information about setup of the
+  tracing environment:
+
+  ::
+
+     $ qemu-trace-stap -v run /opt/qemu/4.0.0/qemu-system-x86_64 'qio*'
+
+See also
+--------
+
+:manpage:`qemu(1)`, :manpage:`stap(1)`
+
+..
+  Copyright (C) 2019 Red Hat, Inc.
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
diff --git a/docs/tools/virtfs-proxy-helper.rst b/docs/tools/virtfs-proxy-helper.rst
new file mode 100644
index 000000000..6cdeedf8e
--- /dev/null
+++ b/docs/tools/virtfs-proxy-helper.rst
@@ -0,0 +1,72 @@
+QEMU 9p virtfs proxy filesystem helper
+======================================
+
+Synopsis
+--------
+
+**virtfs-proxy-helper** [*OPTIONS*]
+
+Description
+-----------
+
+The pass-through security model in the QEMU 9p server needs root privilege
+to do a few file operations (like chown, or chmod to any mode/uid:gid).
+There are two issues with the pass-through security model:
+
+- TOCTTOU vulnerability: Following symbolic links in the server could
+  provide access to files beyond the 9p export path.
+
+- Running QEMU with root privilege could be a security issue.
+
+To overcome these issues, the following approach is used: a new filesystem
+type 'proxy' is introduced. Proxy FS uses a chroot + socket combination to
+secure against the vulnerability of following symbolic links. The intention
+of adding a new filesystem type is to allow QEMU to run in non-root mode,
+while doing privileged operations using socket IO.
+
+The proxy helper (a standalone binary that is part of QEMU) is invoked with
+root privileges. It chroots into the 9p export path and creates a socket
+pair or a named socket based on the command line parameter. QEMU and the
+proxy helper communicate using this socket. The QEMU proxy fs driver sends
+filesystem requests to the proxy helper and receives the responses from it.
+
+The proxy helper is designed so that it can drop root privileges except
+for the capabilities needed for doing filesystem operations.
+
+Options
+-------
+
+The following options are supported:
+
+.. program:: virtfs-proxy-helper
+
+.. option:: -h
+
+  Display help and exit
+
+.. option:: -p, --path PATH
+
+  Path to export for the proxy filesystem driver
+
+.. option:: -f, --fd SOCKET_ID
+
+  Use the given file descriptor as the socket descriptor for communicating
+  with the QEMU proxy fs driver. Usually a helper like libvirt will create a
+  socketpair and pass one of the fds as a parameter to this option.
+
+.. option:: -s, --socket SOCKET_FILE
+
+  Create a named socket file for communicating with the QEMU proxy fs driver
+
+.. option:: -u, --uid UID
+
+  uid to give access to the named socket file; used in combination with -g.
+
+.. option:: -g, --gid GID
+
+  gid to give access to the named socket file; used in combination with -u.
+
+.. option:: -n, --nodaemon
+
+  Run as a normal program. By default the program runs in daemon mode.
diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
new file mode 100644
index 000000000..07ac0be55
--- /dev/null
+++ b/docs/tools/virtiofsd.rst
@@ -0,0 +1,366 @@
+QEMU virtio-fs shared file system daemon
+========================================
+
+Synopsis
+--------
+
+**virtiofsd** [*OPTIONS*]
+
+Description
+-----------
+
+Share a host directory tree with a guest through a virtio-fs device. This
+program is a vhost-user backend that implements the virtio-fs device. Each
+virtio-fs device instance requires its own virtiofsd process.
+
+This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
+but should work with any virtual machine monitor (VMM) that supports
+vhost-user. See the Examples section below.
+
+This program must be run as the root user. The program drops privileges where
+possible during startup, although it must be able to create and access files
+with any uid/gid:
+
+* The ability to invoke syscalls is limited using seccomp(2).
+* Linux capabilities(7) are dropped.
+
+In "namespace" sandbox mode the program switches into a new file system
+namespace and invokes pivot_root(2) to make the shared directory tree its
+root. A new pid and net namespace is also created to isolate the process.
+
+In "chroot" sandbox mode the program invokes chroot(2) to make the shared
+directory tree its root. This mode is intended for container environments
+where the container runtime has already set up the namespaces and the program
+does not have permission to create namespaces itself.
+
+Both sandbox modes prevent "file system escapes" due to symlinks and other
+file system objects that might lead to files outside the shared directory.
+
+Options
+-------
+
+.. program:: virtiofsd
+
+.. option:: -h, --help
+
+  Print help.
+
+.. option:: -V, --version
+
+  Print version.
+
+.. option:: -d
+
+  Enable debug output.
+
+.. option:: --syslog
+
+  Print log messages to syslog instead of stderr.
+
+.. option:: -o OPTION
+
+  * debug -
+    Enable debug output.
+
+  * flock|no_flock -
+    Enable/disable flock. The default is ``no_flock``.
+
+  * modcaps=CAPLIST -
+    Modify the list of capabilities allowed; CAPLIST is a colon separated
+    list of capabilities, each preceded by either + or -, e.g.
+    '+sys_admin:-chown'.
+
+  * log_level=LEVEL -
+    Print only log messages matching LEVEL or more severe. LEVEL is one of
+    ``err``, ``warn``, ``info``, or ``debug``. The default is ``info``.
+
+  * posix_lock|no_posix_lock -
+    Enable/disable remote POSIX locks. The default is ``no_posix_lock``.
+
+  * readdirplus|no_readdirplus -
+    Enable/disable readdirplus. The default is ``readdirplus``.
+
+  * sandbox=namespace|chroot -
+    Sandbox mode:
+    - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
+      the shared directory.
+    - chroot: chroot(2) into shared directory (use in containers).
+    The default is "namespace".
+
+  * source=PATH -
+    Share host directory tree located at PATH. This option is required.
+
+  * timeout=TIMEOUT -
+    I/O timeout in seconds. The default depends on the ``cache=`` option.
+
+  * writeback|no_writeback -
+    Enable/disable writeback cache. The cache allows the FUSE client to
+    buffer and merge write requests. The default is ``no_writeback``.
+
+  * xattr|no_xattr -
+    Enable/disable extended attributes (xattr) on files and directories. The
+    default is ``no_xattr``.
+
+  * posix_acl|no_posix_acl -
+    Enable/disable POSIX ACL support. POSIX ACLs are disabled by default.
+
+.. option:: --socket-path=PATH
+
+  Listen on vhost-user UNIX domain socket at PATH.
+
+.. option:: --socket-group=GROUP
+
+  Set the vhost-user UNIX domain socket gid to GROUP.
+
+.. option:: --fd=FDNUM
+
+  Accept connections from vhost-user UNIX domain socket file descriptor
+  FDNUM. The file descriptor must already be listening for connections.
+
+.. option:: --thread-pool-size=NUM
+
+  Restrict the number of worker threads per request queue to NUM. The default
+  is 64.
+
+.. option:: --cache=none|auto|always
+
+  Select the desired trade-off between coherency and performance. ``none``
+  forbids the FUSE client from caching to achieve best coherency at the cost
+  of performance. ``auto`` acts similar to NFS with a 1 second metadata cache
+  timeout. ``always`` sets a long cache lifetime at the expense of coherency.
+  The default is ``auto``.
+
+Extended attribute (xattr) mapping
+----------------------------------
+
+By default the names of xattrs used by the client are passed through to the
+server file system. This can be a problem where either those xattr names are
+used by something on the server (e.g.
SELinux client/server confusion) or if
+``virtiofsd`` is running in a container with restricted privileges where it
+cannot access some attributes.
+
+Mapping syntax
+~~~~~~~~~~~~~~
+
+A mapping of xattr names can be made using -o xattrmap=mapping where the
+``mapping`` string consists of a series of rules.
+
+The first matching rule terminates the mapping.
+The set of rules must include a terminating rule to match any remaining
+attributes at the end.
+
+Each rule consists of a number of fields separated with a separator that is
+the first non-white space character in the rule. This separator must then be
+used for the whole rule.
+White space may be added before and after each rule.
+
+Using ':' as the separator a rule is of the form:
+
+``:type:scope:key:prepend:``
+
+**scope** is:
+
+- 'client' - match 'key' against an xattr name from the client for
+  setxattr/getxattr/removexattr
+- 'server' - match 'prepend' against an xattr name from the server
+  for listxattr
+- 'all' - can be used to make a single rule where both the server
+  and client matches are triggered.
+
+**type** is one of:
+
+- 'prefix' - is designed to prepend and strip a prefix; the modified
+  attributes then being passed on to the client/server.
+
+- 'ok' - Causes the rule set to be terminated when a match is found
+  while allowing matching xattrs through unchanged.
+  It is intended both as a way of explicitly terminating
+  the list of rules, and to allow some xattrs to skip following rules.
+
+- 'bad' - If a client tries to use a name matching 'key' it's
+  denied using EPERM; when the server passes an attribute
+  name matching 'prepend' it's hidden. In many ways its use is very like
+  'ok' as either an explicit terminator or for special handling of certain
+  patterns.
+
+- 'unsupported' - If a client tries to use a name matching 'key' it's
+  denied using ENOTSUP; when the server passes an attribute
+  name matching 'prepend' it's hidden. In many ways its use is very like
+  'ok' as either an explicit terminator or for special handling of certain
+  patterns.
+
+**key** is a string tested as a prefix on an attribute name originating
+on the client. It may be empty, in which case a 'client' rule
+will always match on client names.
+
+**prepend** is a string tested as a prefix on an attribute name originating
+on the server, and used as a new prefix. It may be empty,
+in which case a 'server' rule will always match on all names from
+the server.
+
+e.g.:
+
+  ``:prefix:client:trusted.:user.virtiofs.:``
+
+  will match 'trusted.' attributes in client calls and prefix them before
+  passing them to the server.
+
+  ``:prefix:server::user.virtiofs.:``
+
+  will strip 'user.virtiofs.' from all server replies.
+
+  ``:prefix:all:trusted.:user.virtiofs.:``
+
+  combines the previous two cases into a single rule.
+
+  ``:ok:client:user.::``
+
+  will allow get/set xattr for 'user.' xattrs and ignore
+  following rules.
+
+  ``:ok:server::security.:``
+
+  will pass 'security.' xattrs in listxattr from the server
+  and ignore following rules.
+
+  ``:ok:all:::``
+
+  will terminate the rule search passing any remaining attributes
+  in both directions.
+
+  ``:bad:server::security.:``
+
+  would hide 'security.' xattrs in listxattr from the server.
+
+A simpler 'map' type provides a shorter syntax for the common case:
+
+``:map:key:prepend:``
+
+The 'map' type adds a number of separate rules to add **prepend** as a prefix
+to the matched **key** (or all attributes if **key** is empty).
+There may be at most one 'map' rule and it must be the last rule in the set.
+
+Note: When the 'security.capability' xattr is remapped, the daemon has to do
+extra work to remove it during many operations, which the host kernel
+normally does itself.
+
+Security considerations
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Operating systems typically partition the xattr namespace using
+well defined name prefixes. Each partition may have different
+access controls applied. For example, on Linux there are multiple
+partitions:
+
+  * ``system.*`` - access varies depending on attribute & filesystem
+  * ``security.*`` - only processes with CAP_SYS_ADMIN
+  * ``trusted.*`` - only processes with CAP_SYS_ADMIN
+  * ``user.*`` - any process granted by file permissions / ownership
+
+Other operating systems, such as FreeBSD, have different name prefixes
+and access control rules.
+
+When remapping attributes on the host, it is important to
+ensure that the remapping does not allow a guest user to
+evade the guest access control rules.
+
+Consider if ``trusted.*`` from the guest was remapped to
+``user.virtiofs.trusted*`` in the host. An unprivileged
+user in a Linux guest has the ability to write to xattrs
+under ``user.*``. Thus the user can evade the access
+control restriction on ``trusted.*`` by instead writing
+to ``user.virtiofs.trusted.*``.
+
+As noted above, the partitions used and the access controls
+applied will vary across guest OSes, so it is not wise to
+try to predict what the guest OS will use.
+
+The simplest way to avoid an insecure configuration is
+to remap all xattrs at once, to a given fixed prefix.
+This is shown in example (1) below.
+
+If selectively mapping only a subset of xattr prefixes,
+then rules must be added to explicitly block direct
+access to the target of the remapping. This is shown
+in example (2) below.
+
+Mapping examples
+~~~~~~~~~~~~~~~~
+
+1) Prefix all attributes with 'user.virtiofs.'
+
+::
+
+  -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"
+
+
+This uses two rules, using : as the field separator;
+the first rule prefixes and strips 'user.virtiofs.',
+the second rule hides any non-prefixed attributes that
+the host set.
+
+This is equivalent to the 'map' rule:
+
+::
+
+  -o xattrmap=":map::user.virtiofs.:"
+
+2) Prefix 'trusted.' attributes, allow others through
+
+::
+
+  "/prefix/all/trusted./user.virtiofs./
+   /bad/server//trusted./
+   /bad/client/user.virtiofs.//
+   /ok/all///"
+
+
+Here there are four rules, using / as the field
+separator, and also demonstrating that new lines can
+be included between rules.
+The first rule is the prefixing of 'trusted.' and
+stripping of 'user.virtiofs.'.
+The second rule hides unprefixed 'trusted.' attributes
+on the host.
+The third rule stops a guest from explicitly setting
+the 'user.virtiofs.' path directly to prevent access
+control bypass on the target of the earlier prefix
+remapping.
+Finally, the fourth rule lets all remaining attributes
+through.
+
+This is equivalent to the 'map' rule:
+
+::
+
+  -o xattrmap="/map/trusted./user.virtiofs./"
+
+3) Hide 'security.' attributes, and allow everything else
+
+::
+
+  "/bad/all/security./security./
+   /ok/all///"
+
+The first rule combines what could be separate client and server
+rules into a single 'all' rule, matching 'security.' in either
+client arguments or lists returned from the host. This stops
+the client seeing any 'security.' attributes on the server and
+stops it setting any.
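+
+To make the rewriting direction of a 'prefix' rule concrete, here is a rough
+Python model of a single ``:prefix:all:key:prepend:`` rule. It is an
+illustration of the semantics described above, not virtiofsd's actual
+implementation::
+
+  def client_to_server(name, key, prepend):
+      """Client-originated name (setxattr/getxattr): add the prefix."""
+      if name.startswith(key):
+          return prepend + name
+      return None  # no match; a later rule must handle the name
+
+  def server_to_client(name, key, prepend):
+      """Server-originated name (listxattr): strip the prefix."""
+      if name.startswith(prepend):
+          return name[len(prepend):]
+      return None
+
+  assert client_to_server('trusted.x', 'trusted.', 'user.virtiofs.') \
+      == 'user.virtiofs.trusted.x'
+  assert server_to_client('user.virtiofs.trusted.x', 'trusted.',
+                          'user.virtiofs.') == 'trusted.x'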
+
+Examples
+--------
+
+Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
+``/var/run/vm001-vhost-fs.sock``:
+
+.. parsed-literal::
+
+  host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
+  host# |qemu_system| \\
+      -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\
+      -device vhost-user-fs-pci,chardev=char0,tag=myfs \\
+      -object memory-backend-memfd,id=mem,size=4G,share=on \\
+      -numa node,memdev=mem \\
+      ...
+  guest# mount -t virtiofs myfs /mnt
diff --git a/docs/u2f.txt b/docs/u2f.txt
new file mode 100644
index 000000000..7f5813a0b
--- /dev/null
+++ b/docs/u2f.txt
@@ -0,0 +1,110 @@
+QEMU U2F Key Device Documentation.
+
+Contents
+1. USB U2F key device
+2. Building
+3. Using u2f-emulated
+4. Using u2f-passthru
+5. Libu2f-emu
+
+1. USB U2F key device
+
+U2F is an open authentication standard that enables relying parties
+exposed to the internet to offer a strong second factor option for end
+user authentication.
+
+The standard brings many advantages to both parties, client and server:
+it reduces over-reliance on passwords and increases authentication
+security.
+
+The second factor is materialized by a device implementing the U2F
+protocol. In case of a USB U2F security key, it is a USB HID device
+that implements the U2F protocol.
+
+In QEMU, the USB U2F key device offers dedicated U2F support, allowing
+guests to use USB FIDO/U2F security keys in two possible modes:
+pass-through and emulated.
+
+The pass-through mode consists of passing all requests made from the guest
+to the physical security key connected to the host machine and vice versa.
+In addition, the dedicated pass-through allows a U2F security key to be
+shared among several guests, which is not possible with a simple host
+device assignment pass-through.
+
+The emulated mode consists of completely emulating the behavior of a
+U2F device in software. Libu2f-emu is used for that.
+
+
+2. Building
+
+To build the u2f-emulated device variant, which depends on libu2f-emu,
+configure and build with:
+
+    ./configure --enable-u2f && make
+
+The pass-through mode is built by default on Linux. To take advantage
+of the autoscan option it provides, make sure you have a working libudev
+installed on the host.
+
+
+3. Using u2f-emulated
+
+To work, an emulated U2F device must have four elements:
+ * ec x509 certificate
+ * ec private key
+ * counter (four bytes value)
+ * 48 bytes of entropy (random bits)
+
+To use this type of device, it has to be configured, and these
+four elements must be passed one way or another.
+
+Assuming that you have a working libu2f-emu installed on the host,
+there are three possible ways to configure the device:
+ * ephemeral
+ * setup directory
+ * manual
+
+Ephemeral is the simplest way to configure; it lets the device generate
+all the elements it needs for a single use over the lifetime of the device.
+
+    qemu -usb -device u2f-emulated
+
+The setup directory method configures the device from a directory
+containing four files:
+ * certificate.pem: ec x509 certificate
+ * private-key.pem: ec private key
+ * counter: counter value
+ * entropy: 48 bytes of entropy
+
+    qemu -usb -device u2f-emulated,dir=$dir
+
+Manual configuration specifies each of the elements necessary for the
+device individually:
+ * cert
+ * priv
+ * counter
+ * entropy
+
+    qemu -usb -device u2f-emulated,cert=$DIR1/$FILE1,priv=$DIR2/$FILE2,counter=$DIR3/$FILE3,entropy=$DIR4/$FILE4
+
+
+4. Using u2f-passthru
+
+On the host specify the u2f-passthru device with a suitable hidraw:
+
+    qemu -usb -device u2f-passthru,hidraw=/dev/hidraw0
+
+Alternatively, the u2f-passthru device can autoscan to take the first
+U2F device it finds on the host (this requires a working libudev):
+
+    qemu -usb -device u2f-passthru
+
+
+5. Libu2f-emu
+
+The u2f-emulated device uses libu2f-emu for the U2F key emulation.
+Libu2f-emu completely implements the device side of the U2F protocol
+for all transports specified by the FIDO Alliance.
+
+For more information about libu2f-emu see this page:
+https://github.com/MattGorko/libu2f-emu.
diff --git a/docs/user/index.rst b/docs/user/index.rst
new file mode 100644
index 000000000..2c4e29f3d
--- /dev/null
+++ b/docs/user/index.rst
@@ -0,0 +1,12 @@
+-------------------
+User Mode Emulation
+-------------------
+
+This section of the manual is the overall guide for using QEMU in
+user-mode emulation. In this mode, QEMU can launch
+processes compiled for one CPU on another CPU.
+
+.. toctree::
+  :maxdepth: 2
+
+  main
diff --git a/docs/user/main.rst b/docs/user/main.rst
new file mode 100644
index 000000000..e08d4be63
--- /dev/null
+++ b/docs/user/main.rst
@@ -0,0 +1,247 @@
+QEMU User space emulator
+========================
+
+Supported Operating Systems
+---------------------------
+
+The following operating systems are supported in user space emulation:
+
+- Linux (referred to as qemu-linux-user)
+
+- BSD (referred to as qemu-bsd-user)
+
+Features
+--------
+
+QEMU user space emulation has the following notable features:
+
+**System call translation:**
+  QEMU includes a generic system call translator. This means that the
+  parameters of the system calls can be converted to fix endianness and
+  32/64-bit mismatches between hosts and targets. IOCTLs can be
+  converted too.
+
+**POSIX signal handling:**
+  QEMU can redirect to the running program all signals coming from the
+  host (such as ``SIGALRM``), as well as synthesize signals from
+  virtual CPU exceptions (for example ``SIGFPE`` when the program
+  executes a division by zero).
+
+  QEMU relies on the host kernel to emulate most signal system calls,
+  for example to emulate the signal mask. On Linux, QEMU supports both
+  normal and real-time signals.
+
+**Threading:**
+  On Linux, QEMU can emulate the ``clone`` syscall and create a real
+  host thread (with a separate virtual CPU) for each emulated thread.
+  Note that not all targets currently emulate atomic operations
+  correctly. x86 and Arm use a global lock in order to preserve their
+  semantics.
+
+QEMU was conceived so that ultimately it can emulate itself. Although it
+is not very useful, it is an important test to show the power of the
+emulator.
+
+Linux User space emulator
+-------------------------
+
+Command line options
+~~~~~~~~~~~~~~~~~~~~
+
+::
+
+  qemu-i386 [-h] [-d] [-L path] [-s size] [-cpu model] [-g port] [-B offset] [-R size] program [arguments...]
+
+``-h``
+  Print the help
+
+``-L path``
+  Set the x86 elf interpreter prefix (default=/usr/local/qemu-i386)
+
+``-s size``
+  Set the x86 stack size in bytes (default=524288)
+
+``-cpu model``
+  Select CPU model (-cpu help for list and additional feature
+  selection)
+
+``-E var=value``
+  Set environment var to value.
+
+``-U var``
+  Remove var from the environment.
+
+``-B offset``
+  Offset guest address by the specified number of bytes. This is useful
+  when the address region required by guest applications is reserved on
+  the host. This option is currently only supported on some hosts.
+
+``-R size``
+  Pre-allocate a guest virtual address space of the given size (in
+  bytes). "G", "M", and "k" suffixes may be used when specifying
+  the size.
+
+Debug options:
+
+``-d item1,...``
+  Activate logging of the specified items (use '-d help' for a list of
+  log items)
+
+``-p pagesize``
+  Act as if the host page size was 'pagesize' bytes
+
+``-g port``
+  Wait for a gdb connection on the given port
+
+``-singlestep``
+  Run the emulation in single step mode.
+
+Environment variables:
+
+QEMU_STRACE
+  Print system calls and arguments similar to the 'strace' program
+  (NOTE: the actual 'strace' program will not work because the user
+  space emulator hasn't implemented ptrace). At the moment this is
+  incomplete. All system calls that don't have a specific argument
+  format are printed with information for six arguments. Many
+  flag-style arguments don't have decoders and will show up as numbers.
+
+Other binaries
+~~~~~~~~~~~~~~
+
+- user mode (Alpha)
+
+  * ``qemu-alpha`` TODO.
+
+- user mode (Arm)
+
+  * ``qemu-armeb`` TODO.
+
+  * ``qemu-arm`` is also capable of running Arm "Angel" semihosted ELF
+    binaries (as implemented by the arm-elf and arm-eabi Newlib/GDB
+    configurations), and arm-uclinux bFLT format binaries.
+
+- user mode (ColdFire)
+
+- user mode (M68K)
+
+  * ``qemu-m68k`` is capable of running semihosted binaries using the BDM
+    (m5xxx-ram-hosted.ld) or m68k-sim (sim.ld) syscall interfaces, and
+    coldfire uClinux bFLT format binaries.
+
+    The binary format is detected automatically.
+
+- user mode (Cris)
+
+  * ``qemu-cris`` TODO.
+
+- user mode (i386)
+
+  * ``qemu-i386`` TODO.
+  * ``qemu-x86_64`` TODO.
+
+- user mode (Microblaze)
+
+  * ``qemu-microblaze`` TODO.
+
+- user mode (MIPS)
+
+  * ``qemu-mips`` executes 32-bit big endian MIPS binaries (MIPS O32 ABI).
+
+  * ``qemu-mipsel`` executes 32-bit little endian MIPS binaries (MIPS O32 ABI).
+
+  * ``qemu-mips64`` executes 64-bit big endian MIPS binaries (MIPS N64 ABI).
+
+  * ``qemu-mips64el`` executes 64-bit little endian MIPS binaries (MIPS N64
+    ABI).
+
+  * ``qemu-mipsn32`` executes 32-bit big endian MIPS binaries (MIPS N32 ABI).
+
+  * ``qemu-mipsn32el`` executes 32-bit little endian MIPS binaries (MIPS N32
+    ABI).
+
+- user mode (NiosII)
+
+  * ``qemu-nios2`` TODO.
+
+- user mode (PowerPC)
+
+  * ``qemu-ppc64abi32`` TODO.
+  * ``qemu-ppc64`` TODO.
+  * ``qemu-ppc`` TODO.
+
+- user mode (SH4)
+
+  * ``qemu-sh4eb`` TODO.
+  * ``qemu-sh4`` TODO.
+
+- user mode (SPARC)
+
+  * ``qemu-sparc`` can execute Sparc32 binaries (Sparc32 CPU, 32 bit ABI).
+
+  * ``qemu-sparc32plus`` can execute Sparc32 and SPARC32PLUS binaries
+    (Sparc64 CPU, 32 bit ABI).
+
+  * ``qemu-sparc64`` can execute some Sparc64 (Sparc64 CPU, 64 bit ABI) and
+    SPARC32PLUS binaries (Sparc64 CPU, 32 bit ABI).
+
+BSD User space emulator
+-----------------------
+
+BSD Status
+~~~~~~~~~~
+
+- target Sparc64 on Sparc64: Some trivial programs work.
+
+Quick Start
+~~~~~~~~~~~
+
+In order to launch a BSD process, QEMU needs the process executable
+itself and all the target dynamic libraries used by it.
+
+- On Sparc64, you can just try to launch any process by using the
+  native libraries::
+
+     qemu-sparc64 /bin/ls
+
+Command line options
+~~~~~~~~~~~~~~~~~~~~
+
+::
+
+  qemu-sparc64 [-h] [-d] [-L path] [-s size] [-bsd type] program [arguments...]
+
+``-h``
+  Print the help
+
+``-L path``
+  Set the library root path (default=/)
+
+``-s size``
+  Set the stack size in bytes (default=524288)
+
+``-ignore-environment``
+  Start with an empty environment.
Without this option, the initial
+  environment is a copy of the caller's environment.
+
+``-E var=value``
+  Set environment var to value.
+
+``-U var``
+  Remove var from the environment.
+
+``-bsd type``
+  Set the type of the emulated BSD operating system. Valid values are
+  FreeBSD, NetBSD and OpenBSD (default).
+
+Debug options:
+
+``-d item1,...``
+  Activate logging of the specified items (use '-d help' for a list of
+  log items)
+
+``-p pagesize``
+  Act as if the host page size was 'pagesize' bytes
+
+``-singlestep``
+  Run the emulation in single step mode.
diff --git a/docs/virtio-balloon-stats.txt b/docs/virtio-balloon-stats.txt
new file mode 100644
index 000000000..1732cc8c8
--- /dev/null
+++ b/docs/virtio-balloon-stats.txt
@@ -0,0 +1,109 @@
+virtio balloon memory statistics
+================================
+
+The virtio balloon driver supports guest memory statistics reporting. These
+statistics are available to QEMU users as QOM (QEMU Object Model) device
+properties via a polling mechanism.
+
+Before querying the available stats, clients first have to enable polling.
+This is done by writing a time interval value (in seconds) to the
+guest-stats-polling-interval property. This value can be:
+
+  > 0  enables polling at the specified interval. If polling is already
+       enabled, the polling time interval is changed to the new value
+
+  0    disables polling. Previously polled statistics are still valid and
+       can be queried.
+
+Once polling is enabled, the virtio-balloon device in QEMU will start
+polling the guest's balloon driver for new stats in the specified time
+interval.
+
+To retrieve those stats, clients have to query the guest-stats property,
+which will return a dictionary containing:
+
+  o A key named 'stats', containing all available stats. If the guest
+    doesn't support a particular stat, or if it couldn't be retrieved,
+    its value will be -1. Currently, the following stats are supported:
+
+    - stat-swap-in
+    - stat-swap-out
+    - stat-major-faults
+    - stat-minor-faults
+    - stat-free-memory
+    - stat-total-memory
+    - stat-available-memory
+    - stat-disk-caches
+    - stat-htlb-pgalloc
+    - stat-htlb-pgfail
+
+  o A key named 'last-update', which contains the last stats update
+    timestamp in seconds. Since this timestamp is generated by the host,
+    a buggy guest can't influence its value. The value is 0 if the guest
+    has not updated the stats (yet).
+
+It's also important to note the following:
+
+  - Previously polled statistics remain available even if the polling is
+    later disabled
+
+  - As noted above, if a guest doesn't support a particular stat its value
+    will always be -1. However, it's also possible that a guest temporarily
+    couldn't update one or even all stats. If this happens, just wait for
+    the next update
+
+  - Polling can be enabled even if the guest doesn't have stats support
+    or the balloon driver wasn't loaded in the guest. If this is the case
+    and stats are queried, last-update will be 0.
+
+  - The polling timer is only re-armed when the guest responds to the
+    statistics request. This means that if a (buggy) guest doesn't ever
+    respond to the request the timer will never be re-armed, which has
+    the same effect as disabling polling
+
+Here are a few examples. QEMU is started with '-device virtio-balloon',
+which generates '/machine/peripheral-anon/device[1]' as the QOM path for
+the balloon device.
+
+Enable polling with a 2 second interval:
+
+{ "execute": "qom-set",
+  "arguments": { "path": "/machine/peripheral-anon/device[1]",
+                 "property": "guest-stats-polling-interval", "value": 2 } }
+
+{ "return": {} }
+
+Change polling to 10 seconds:
+
+{ "execute": "qom-set",
+  "arguments": { "path": "/machine/peripheral-anon/device[1]",
+                 "property": "guest-stats-polling-interval", "value": 10 } }
+
+{ "return": {} }
+
+Get stats:
+
+{ "execute": "qom-get",
+  "arguments": { "path": "/machine/peripheral-anon/device[1]",
+                 "property": "guest-stats" } }
+{
+  "return": {
+    "stats": {
+      "stat-swap-out": 0,
+      "stat-free-memory": 844943360,
+      "stat-minor-faults": 219028,
+      "stat-major-faults": 235,
+      "stat-total-memory": 1044406272,
+      "stat-swap-in": 0
+    },
+    "last-update": 1358529861
+  }
+}
+
+Disable polling:
+
+{ "execute": "qom-set",
+  "arguments": { "path": "/machine/peripheral-anon/device[1]",
+                 "property": "guest-stats-polling-interval", "value": 0 } }
+
+{ "return": {} }
diff --git a/docs/xbzrle.txt b/docs/xbzrle.txt
new file mode 100644
index 000000000..bcb3f0c90
--- /dev/null
+++ b/docs/xbzrle.txt
@@ -0,0 +1,138 @@
+XBZRLE (Xor Based Zero Run Length Encoding)
+===========================================
+
+Using XBZRLE (Xor Based Zero Run Length Encoding) allows for the reduction
+of VM downtime and the total live-migration time of virtual machines.
+It is particularly useful for virtual machines running memory write intensive
+workloads that are typical of large enterprise applications such as SAP ERP
+systems, and generally speaking for any application that uses a sparse memory
+update pattern.
+
+Instead of sending the changed guest memory page, this solution sends a
+compressed version of the updates, thus reducing the amount of data sent
+during live migration.
+In order to be able to calculate the update, the previous memory pages need
+to be stored on the source. Those pages are stored in a dedicated cache
+(hash table) and are accessed by their address.
+The larger the cache size the better the chances are that the page has
+already been stored in the cache.
+A small cache size will result in a high cache miss rate.
+Cache size can be changed before and during migration.
+
+Format
+======
+
+The compression format performs an XOR between the previous and current
+content of the page, where zero represents an unchanged value.
+The page data delta is represented by zero and non zero runs.
+A zero run is represented by its length (in bytes).
+A non zero run is represented by its length (in bytes) and the new data.
+The run length is encoded using ULEB128 (http://en.wikipedia.org/wiki/LEB128).
+
+There can be more than one valid encoding; the sender may send a longer
+encoding for the benefit of reducing computation cost.
+
+page = zrun nzrun
+     | zrun nzrun page
+
+zrun = length
+
+nzrun = length byte...
+
+length = uleb128 encoded integer
+
+On the sender side XBZRLE is used as a compact delta encoding of page
+updates, retrieving the old page content from the cache (default size of
+64MB). The receiving side uses the existing page's content and XBZRLE to
+decode the new page's content.
+
+This work was originally based on research results published at
+VEE 2011: Evaluation of Delta Compression Techniques for Efficient Live
+Migration of Large Virtual Machines by Benoit, Svard, Tordsson and Elmroth.
+Additionally, the delta encoder XBRLE was further improved by using XBZRLE
+instead.
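+
+To make the format concrete, here is a minimal Python encoder sketch that
+follows the grammar above (illustrative only; QEMU's actual encoder is
+written in C and also enforces an overflow limit on the encoded size):
+
+    def uleb128(n):
+        # Encode a non-negative integer as ULEB128.
+        out = bytearray()
+        while True:
+            byte = n & 0x7f
+            n >>= 7
+            if n:
+                out.append(byte | 0x80)
+            else:
+                out.append(byte)
+                return bytes(out)
+
+    def xbzrle_encode(old, new):
+        assert len(old) == len(new)
+        delta = bytes(o ^ n for o, n in zip(old, new))
+        out = bytearray()
+        i = 0
+        while i < len(delta):
+            # zero run (length may be 0 when non-zero bytes follow directly)
+            start = i
+            while i < len(delta) and delta[i] == 0:
+                i += 1
+            if i == len(delta):
+                break               # trailing zeros are not encoded
+            out += uleb128(i - start)
+            # non-zero run: length followed by the new data bytes
+            start = i
+            while i < len(delta) and delta[i] != 0:
+                i += 1
+            out += uleb128(i - start)
+            out += new[start:i]
+        return bytes(out)
+
+Running this on the buffers in the example below reproduces the encoded
+stream shown there.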
+
+XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads, making
+it ideal for in-line, real-time encoding such as is needed for live
+migration.
+
+Example
+old buffer:
+1001 zeros
+05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 68 00 00 6b 00 6d
+3074 zeros
+
+new buffer:
+1001 zeros
+01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 68 00 00 67 00 69
+3074 zeros
+
+encoded buffer:
+
+encoded length 24
+e9 07 0f 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 03 01 67 01 01 69
+
+Cache update strategy
+=====================
+Keeping the hot pages in the cache is effective for decreasing cache
+misses. XBZRLE uses a counter as the age of each page. The counter will
+increase after each ram dirty bitmap sync. When a cache conflict is
+detected, XBZRLE will only evict pages in the cache that are older than
+a threshold.
+
+Usage
+=====
+1. Verify the destination QEMU version is able to decode the new format.
+    {qemu} info migrate_capabilities
+    {qemu} xbzrle: off, ...
+
+2. Activate xbzrle on both source and destination:
+    {qemu} migrate_set_capability xbzrle on
+
+3. Set the XBZRLE cache size - the cache size is in MBytes and should be a
+power of 2. The cache default value is 64 MBytes. (on source only)
+    {qemu} migrate_set_parameter xbzrle-cache-size 256m
+
+4. Start outgoing migration
+    {qemu} migrate -d tcp:destination.host:4444
+    {qemu} info migrate
+    capabilities: xbzrle: on
+    Migration status: active
+    transferred ram: A kbytes
+    remaining ram: B kbytes
+    total ram: C kbytes
+    total time: D milliseconds
+    duplicate: E pages
+    normal: F pages
+    normal bytes: G kbytes
+    cache size: H bytes
+    xbzrle transferred: I kbytes
+    xbzrle pages: J pages
+    xbzrle cache miss: K pages
+    xbzrle cache miss rate: L
+    xbzrle encoding rate: M
+    xbzrle overflow: N
+
+xbzrle cache miss: the number of cache misses to date - a high cache-miss
+rate indicates that the cache size is set too low.
+xbzrle overflow: the number of overflows in the decoding where the delta
+could not be compressed. This can happen if the changes in the pages are too
+large or there are many short changes; for example, changing every second
+byte (half a page).
+
+Testing: live migration with XBZRLE completed in 110 seconds, whereas
+without it the migration was not able to complete.
+
+A simple synthetic memory r/w load generator:
+
+    #include <stdlib.h>
+    #include <stdio.h>
+
+    int main(void)
+    {
+        char *buf = (char *) calloc(4096, 4096);
+        while (1) {
+            int i;
+            for (i = 0; i < 4096 * 4; i++) {
+                buf[i * 4096 / 4]++;
+            }
+            printf(".");
+        }
+        return 0;
+    }
diff --git a/docs/xen-save-devices-state.txt b/docs/xen-save-devices-state.txt
new file mode 100644
index 000000000..1912ecad2
--- /dev/null
+++ b/docs/xen-save-devices-state.txt
@@ -0,0 +1,33 @@
+= Save Devices =
+
+QEMU has code to load/save the state of the guest that it is running.
+These are two complementary operations. Saving the state just does
+that: it saves the state of each device that the guest is running.
+
+These operations are normally used with migration (see migration.txt);
+however, it is also possible to save the state of all devices to a file,
+without saving the RAM or the block devices of the VM.
+
+The save operation is available as QMP command xen-save-devices-state.
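+
+For example, a QMP client might issue (the filename here is illustrative):
+
+{ "execute": "xen-save-devices-state",
+  "arguments": { "filename": "/tmp/vm-device-state" } }
+
+{ "return": {} }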
+
+
+The binary format used in the file is the following:
+
+
+-------------------------------------------
+
+32 bit big endian: QEMU_VM_FILE_MAGIC
+32 bit big endian: QEMU_VM_FILE_VERSION
+
+for_each_device
+{
+    8 bit: QEMU_VM_SECTION_FULL
+    32 bit big endian: section_id
+    8 bit: idstr (ID string) length
+    string: idstr (ID string)
+    32 bit big endian: instance_id
+    32 bit big endian: version_id
+    buffer: device specific data
+}
+
+8 bit: QEMU_VM_EOF
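+
+A reader for the container can be sketched directly from this layout. The
+sketch below is Python; the QEMU_VM_* constant values are assumptions taken
+from QEMU's savevm code (magic "QEVM", version 3, SECTION_FULL 0x04,
+EOF 0x00) and should be verified against the source:
+
+    import struct
+
+    QEMU_VM_FILE_MAGIC   = 0x5145564d  # "QEVM" (assumed)
+    QEMU_VM_FILE_VERSION = 0x00000003  # (assumed)
+    QEMU_VM_SECTION_FULL = 0x04        # (assumed)
+    QEMU_VM_EOF          = 0x00        # (assumed)
+
+    def read_first_section(f):
+        magic, version = struct.unpack('>II', f.read(8))
+        if magic != QEMU_VM_FILE_MAGIC or version != QEMU_VM_FILE_VERSION:
+            raise ValueError('not a xen-save-devices-state file')
+        (section,) = struct.unpack('>B', f.read(1))
+        if section == QEMU_VM_EOF:
+            return None
+        if section != QEMU_VM_SECTION_FULL:
+            raise ValueError('unexpected section type %#x' % section)
+        (section_id,) = struct.unpack('>I', f.read(4))
+        (idstr_len,) = struct.unpack('>B', f.read(1))
+        idstr = f.read(idstr_len).decode('ascii')
+        instance_id, version_id = struct.unpack('>II', f.read(8))
+        # The device specific data that follows carries no generic length
+        # field, so skipping to the next section requires the matching
+        # device's own loader; a generic reader must stop here.
+        return (section_id, idstr, instance_id, version_id)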