diff options
Diffstat (limited to 'docs/system/s390x')
-rw-r--r-- | docs/system/s390x/3270.rst | 63 | ||||
-rw-r--r-- | docs/system/s390x/bootdevices.rst | 82 | ||||
-rw-r--r-- | docs/system/s390x/css.rst | 86 | ||||
-rw-r--r-- | docs/system/s390x/protvirt.rst | 67 | ||||
-rw-r--r-- | docs/system/s390x/vfio-ap.rst | 916 | ||||
-rw-r--r-- | docs/system/s390x/vfio-ccw.rst | 77 |
6 files changed, 1291 insertions, 0 deletions
diff --git a/docs/system/s390x/3270.rst b/docs/system/s390x/3270.rst new file mode 100644 index 000000000..0e173b323 --- /dev/null +++ b/docs/system/s390x/3270.rst @@ -0,0 +1,63 @@ +3270 devices +============ + +The 3270 is the classic 'green-screen' console of the mainframes (see the +`IBM 3270 Wikipedia article <https://en.wikipedia.org/wiki/IBM_3270>`__). + +The 3270 data stream is not implemented within QEMU; the device only provides +TN3270 (a telnet extension; see `RFC 854 <https://tools.ietf.org/html/rfc854>`__ +and `RFC 1576 <https://tools.ietf.org/html/rfc1576>`__) and leaves the heavy +lifting to an external 3270 terminal emulator (such as ``x3270``) to make a +single 3270 device available to a guest. Note that this supports basic +features only. + +To provide a 3270 device to a guest, create a ``x-terminal3270`` linked to +a ``tn3270`` chardev. The guest will see a 3270 channel device. In order +to actually be able to use it, attach the ``x3270`` emulator to the chardev. + +Example configuration +--------------------- + +* Make sure that 3270 support is enabled in the guest's Linux kernel. You need + ``CONFIG_TN3270`` and at least one of ``CONFIG_TN3270_TTY`` (for additional + ttys) or ``CONFIG_TN3270_CONSOLE`` (for a 3270 console). + +* Add a ``tn3270`` chardev and a ``x-terminal3270`` to the QEMU command line:: + + -chardev socket,id=ch0,host=0.0.0.0,port=2300,wait=off,server=on,tn3270=on + -device x-terminal3270,chardev=ch0,devno=fe.0.000a,id=terminal0 + +* Start the guest. In the guest, use ``chccwdev -e 0.0.000a`` to enable + the device. + +* On the host, start the ``x3270`` emulator:: + + x3270 <host>:2300 + +* In the guest, locate the 3270 device node under ``/dev/3270/`` (say, + ``tty1``) and start a getty on it:: + + systemctl start serial-getty@3270-tty1.service + + This should get you an additional tty for logging into the guest. + +* If you want to use the 3270 device as the Linux kernel console instead of + an additional tty, you can also append ``conmode=3270 condev=000a`` to + the guest's kernel command line. The kernel then should use the 3270 as + console after the next boot. + +Restrictions +------------ + +3270 support is very basic. In particular: + +* Only one 3270 device is supported. + +* It has only been tested with Linux guests and the x3270 emulator. + +* TLS/SSL is not supported. + +* Resizing on reattach is not supported. + +* Multiple commands in one inbound buffer (for example, when the reset key + is pressed while the network is slow) are not supported. diff --git a/docs/system/s390x/bootdevices.rst b/docs/system/s390x/bootdevices.rst new file mode 100644 index 000000000..9e591cb9d --- /dev/null +++ b/docs/system/s390x/bootdevices.rst @@ -0,0 +1,82 @@ +Boot devices on s390x +===================== + +Booting with bootindex parameter +-------------------------------- + +For classical mainframe guests (i.e. LPAR or z/VM installations), you always +have to explicitly specify the disk where you want to boot from (or "IPL" from, +in s390x-speak -- IPL means "Initial Program Load"). In particular, there can +also be only one boot device according to the architecture specification, thus +specifying multiple boot devices is not possible (yet). + +So for booting an s390x guest in QEMU, you should always mark the +device where you want to boot from with the ``bootindex`` property, for +example:: + + qemu-system-s390x -drive if=none,id=dr1,file=guest.qcow2 \ + -device virtio-blk,drive=dr1,bootindex=1 + +For booting from a CD-ROM ISO image (which needs to include El-Torito boot +information in order to be bootable), it is recommended to specify a ``scsi-cd`` +device, for example like this:: + + qemu-system-s390x -blockdev file,node-name=c1,filename=... \ + -device virtio-scsi \ + -device scsi-cd,drive=c1,bootindex=1 + +Note that you really have to use the ``bootindex`` property to select the +boot device. The old-fashioned ``-boot order=...`` command of QEMU (and +also ``-boot once=...``) is not supported on s390x. + + +Booting without bootindex parameter +----------------------------------- + +The QEMU guest firmware (the so-called s390-ccw bios) has also some rudimentary +support for scanning through the available block devices. So in case you did +not specify a boot device with the ``bootindex`` property, there is still a +chance that it finds a bootable device on its own and starts a guest operating +system from it. However, this scanning algorithm is still very rough and may +be incomplete, so that it might fail to detect a bootable device in many cases. +It is really recommended to always specify the boot device with the +``bootindex`` property instead. + +This also means that you should avoid the classical short-cut commands like +``-hda``, ``-cdrom`` or ``-drive if=virtio``, since it is not possible to +specify the ``bootindex`` with these commands. Note that the convenience +``-cdrom`` option even does not give you a real (virtio-scsi) CD-ROM device on +s390x. Due to technical limitations in the QEMU code base, you will get a +virtio-blk device with this parameter instead, which might not be the right +device type for installing a Linux distribution via ISO image. It is +recommended to specify a CD-ROM device via ``-device scsi-cd`` (as mentioned +above) instead. + + +Booting from a network device +----------------------------- + +Beside the normal guest firmware (which is loaded from the file ``s390-ccw.img`` +in the data directory of QEMU, or via the ``-bios`` option), QEMU ships with +a small TFTP network bootloader firmware for virtio-net-ccw devices, too. This +firmware is loaded from a file called ``s390-netboot.img`` in the QEMU data +directory. In case you want to load it from a different filename instead, +you can specify it via the ``-global s390-ipl.netboot_fw=filename`` +command line option. + +The ``bootindex`` property is especially important for booting via the network. +If you don't specify the the ``bootindex`` property here, the network bootloader +firmware code won't get loaded into the guest memory so that the network boot +will fail. For a successful network boot, try something like this:: + + qemu-system-s390x -netdev user,id=n1,tftp=...,bootfile=... \ + -device virtio-net-ccw,netdev=n1,bootindex=1 + +The network bootloader firmware also has basic support for pxelinux.cfg-style +configuration files. See the `PXELINUX Configuration page +<https://wiki.syslinux.org/wiki/index.php?title=PXELINUX#Configuration>`__ +for details how to set up the configuration file on your TFTP server. +The supported configuration file entries are ``DEFAULT``, ``LABEL``, +``KERNEL``, ``INITRD`` and ``APPEND`` (see the `Syslinux Config file syntax +<https://wiki.syslinux.org/wiki/index.php?title=Config>`__ for more +information). diff --git a/docs/system/s390x/css.rst b/docs/system/s390x/css.rst new file mode 100644 index 000000000..3b4016118 --- /dev/null +++ b/docs/system/s390x/css.rst @@ -0,0 +1,86 @@ +The virtual channel subsystem +============================= + +QEMU implements a virtual channel subsystem with subchannels, (mostly +functionless) channel paths, and channel devices (virtio-ccw, 3270, and +devices passed via vfio-ccw). It supports multiple subchannel sets (MSS) and +multiple channel subsystems extended (MCSS-E). + +All channel devices support the ``devno`` property, which takes a parameter +in the form ``<cssid>.<ssid>.<device number>``. + +The default channel subsystem image id (``<cssid>``) is ``0xfe``. Devices in +there will show up in channel subsystem image ``0`` to guests that do not +enable MCSS-E. Note that devices with a different cssid will not be visible +if the guest OS does not enable MCSS-E (which is true for all supported guest +operating systems today). + +Supported values for the subchannel set id (``<ssid>``) range from ``0-3``. +Devices with a ssid that is not ``0`` will not be visible if the guest OS +does not enable MSS (any Linux version that supports virtio also enables MSS). +Any device may be put into any subchannel set, there is no restriction by +device type. + +The device number can range from ``0-0xffff``. + +If the ``devno`` property is not specified for a device, QEMU will choose the +next free device number in subchannel set 0, skipping to the next subchannel +set if no more device numbers are free. + +QEMU places a device at the first free subchannel in the specified subchannel +set. If a device is hotunplugged and later replugged, it may appear at a +different subchannel. (This is similar to how z/VM works.) + + +Examples +-------- + +* a virtio-net device, cssid/ssid/devno automatically assigned:: + + -device virtio-net-ccw + + In a Linux guest (without default devices and no other devices specified + prior to this one), this will show up as ``0.0.0000`` under subchannel + ``0.0.0000``. + + The auto-assigned-properties in QEMU (as seen via e.g. ``info qtree``) + would be ``dev_id = "fe.0.0000"`` and ``subch_id = "fe.0.0000"``. + +* a virtio-rng device in subchannel set ``0``:: + + -device virtio-rng-ccw,devno=fe.0.0042 + + If added to the same Linux guest as above, it would show up as ``0.0.0042`` + under subchannel ``0.0.0001``. + + The properties for the device would be ``dev_id = "fe.0.0042"`` and + ``subch_id = "fe.0.0001"``. + +* a virtio-gpu device in subchannel set ``2``:: + + -device virtio-gpu-ccw,devno=fe.2.1111 + + If added to the same Linux guest as above, it would show up as ``0.2.1111`` + under subchannel ``0.2.0000``. + + The properties for the device would be ``dev_id = "fe.2.1111"`` and + ``subch_id = "fe.2.0000"``. + +* a virtio-mouse device in a non-standard channel subsystem image:: + + -device virtio-mouse-ccw,devno=2.0.2222 + + This would not show up in a standard Linux guest. + + The properties for the device would be ``dev_id = "2.0.2222"`` and + ``subch_id = "2.0.0000"``. + +* a virtio-keyboard device in another non-standard channel subsystem image:: + + -device virtio-keyboard-ccw,devno=0.0.1234 + + This would not show up in a standard Linux guest, either, as ``0`` is not + the standard channel subsystem image id. + + The properties for the device would be ``dev_id = "0.0.1234"`` and + ``subch_id = "0.0.0000"``. diff --git a/docs/system/s390x/protvirt.rst b/docs/system/s390x/protvirt.rst new file mode 100644 index 000000000..aee63ed7e --- /dev/null +++ b/docs/system/s390x/protvirt.rst @@ -0,0 +1,67 @@ +Protected Virtualization on s390x +================================= + +The memory and most of the registers of Protected Virtual Machines +(PVMs) are encrypted or inaccessible to the hypervisor, effectively +prohibiting VM introspection when the VM is running. At rest, PVMs are +encrypted and can only be decrypted by the firmware, represented by an +entity called Ultravisor, of specific IBM Z machines. + + +Prerequisites +------------- + +To run PVMs, a machine with the Protected Virtualization feature, as +indicated by the Ultravisor Call facility (stfle bit 158), is +required. The Ultravisor needs to be initialized at boot by setting +``prot_virt=1`` on the host's kernel command line. + +Running PVMs requires using the KVM hypervisor. + +If those requirements are met, the capability ``KVM_CAP_S390_PROTECTED`` +will indicate that KVM can support PVMs on that LPAR. + + +Running a Protected Virtual Machine +----------------------------------- + +To run a PVM you will need to select a CPU model which includes the +``Unpack facility`` (stfle bit 161 represented by the feature +``unpack``/``S390_FEAT_UNPACK``), and add these options to the command line:: + + -object s390-pv-guest,id=pv0 \ + -machine confidential-guest-support=pv0 + +Adding these options will: + +* Ensure the ``unpack`` facility is available +* Enable the IOMMU by default for all I/O devices +* Initialize the PV mechanism + +Passthrough (vfio) devices are currently not supported. + +Host huge page backings are not supported. However guests can use huge +pages as indicated by its facilities. + + +Boot Process +------------ + +A secure guest image can either be loaded from disk or supplied on the +QEMU command line. Booting from disk is done by the unmodified +s390-ccw BIOS. I.e., the bootmap is interpreted, multiple components +are read into memory and control is transferred to one of the +components (zipl stage3). Stage3 does some fixups and then transfers +control to some program residing in guest memory, which is normally +the OS kernel. The secure image has another component prepended +(stage3a) that uses the new diag308 subcodes 8 and 10 to trigger the +transition into secure mode. + +Booting from the image supplied on the QEMU command line requires that +the file passed via -kernel has the same memory layout as would result +from the disk boot. This memory layout includes the encrypted +components (kernel, initrd, cmdline), the stage3a loader and +metadata. In case this boot method is used, the command line +options -initrd and -cmdline are ineffective. The preparation of a PVM +image is done via the ``genprotimg`` tool from the s390-tools +collection. diff --git a/docs/system/s390x/vfio-ap.rst b/docs/system/s390x/vfio-ap.rst new file mode 100644 index 000000000..084ba9c4e --- /dev/null +++ b/docs/system/s390x/vfio-ap.rst @@ -0,0 +1,916 @@ +Adjunct Processor (AP) Device +============================= + +.. contents:: + +Introduction +------------ + +The IBM Adjunct Processor (AP) Cryptographic Facility is comprised +of three AP instructions and from 1 to 256 PCIe cryptographic adapter cards. +These AP devices provide cryptographic functions to all CPUs assigned to a +linux system running in an IBM Z system LPAR. + +On s390x, AP adapter cards are exposed via the AP bus. This document +describes how those cards may be made available to KVM guests using the +VFIO mediated device framework. + +AP Architectural Overview +------------------------- + +In order understand the terminology used in the rest of this document, let's +start with some definitions: + +* AP adapter + + An AP adapter is an IBM Z adapter card that can perform cryptographic + functions. There can be from 0 to 256 adapters assigned to an LPAR depending + on the machine model. Adapters assigned to the LPAR in which a linux host is + running will be available to the linux host. Each adapter is identified by a + number from 0 to 255; however, the maximum adapter number allowed is + determined by machine model. When installed, an AP adapter is accessed by + AP instructions executed by any CPU. + +* AP domain + + An adapter is partitioned into domains. Each domain can be thought of as + a set of hardware registers for processing AP instructions. An adapter can + hold up to 256 domains; however, the maximum domain number allowed is + determined by machine model. Each domain is identified by a number from 0 to + 255. Domains can be further classified into two types: + + * Usage domains are domains that can be accessed directly to process AP + commands + + * Control domains are domains that are accessed indirectly by AP + commands sent to a usage domain to control or change the domain; for + example, to set a secure private key for the domain. + +* AP Queue + + An AP queue is the means by which an AP command-request message is sent to an + AP usage domain inside a specific AP. An AP queue is identified by a tuple + comprised of an AP adapter ID (APID) and an AP queue index (APQI). The + APQI corresponds to a given usage domain number within the adapter. This tuple + forms an AP Queue Number (APQN) uniquely identifying an AP queue. AP + instructions include a field containing the APQN to identify the AP queue to + which the AP command-request message is to be sent for processing. + +* AP Instructions: + + There are three AP instructions: + + * NQAP: to enqueue an AP command-request message to a queue + * DQAP: to dequeue an AP command-reply message from a queue + * PQAP: to administer the queues + + AP instructions identify the domain that is targeted to process the AP + command; this must be one of the usage domains. An AP command may modify a + domain that is not one of the usage domains, but the modified domain + must be one of the control domains. + +Start Interpretive Execution (SIE) Instruction +---------------------------------------------- + +A KVM guest is started by executing the Start Interpretive Execution (SIE) +instruction. The SIE state description is a control block that contains the +state information for a KVM guest and is supplied as input to the SIE +instruction. The SIE state description contains a satellite control block called +the Crypto Control Block (CRYCB). The CRYCB contains three fields to identify +the adapters, usage domains and control domains assigned to the KVM guest: + +* The AP Mask (APM) field is a bit mask that identifies the AP adapters assigned + to the KVM guest. Each bit in the mask, from left to right, corresponds to + an APID from 0-255. If a bit is set, the corresponding adapter is valid for + use by the KVM guest. + +* The AP Queue Mask (AQM) field is a bit mask identifying the AP usage domains + assigned to the KVM guest. Each bit in the mask, from left to right, + corresponds to an AP queue index (APQI) from 0-255. If a bit is set, the + corresponding queue is valid for use by the KVM guest. + +* The AP Domain Mask field is a bit mask that identifies the AP control domains + assigned to the KVM guest. The ADM bit mask controls which domains can be + changed by an AP command-request message sent to a usage domain from the + guest. Each bit in the mask, from left to right, corresponds to a domain from + 0-255. If a bit is set, the corresponding domain can be modified by an AP + command-request message sent to a usage domain. + +If you recall from the description of an AP Queue, AP instructions include +an APQN to identify the AP adapter and AP queue to which an AP command-request +message is to be sent (NQAP and PQAP instructions), or from which a +command-reply message is to be received (DQAP instruction). The validity of an +APQN is defined by the matrix calculated from the APM and AQM; it is the +cross product of all assigned adapter numbers (APM) with all assigned queue +indexes (AQM). For example, if adapters 1 and 2 and usage domains 5 and 6 are +assigned to a guest, the APQNs (1,5), (1,6), (2,5) and (2,6) will be valid for +the guest. + +The APQNs can provide secure key functionality - i.e., a private key is stored +on the adapter card for each of its domains - so each APQN must be assigned to +at most one guest or the linux host. + +Example 1: Valid configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++----------+--------+--------+ +| | Guest1 | Guest2 | ++==========+========+========+ +| adapters | 1, 2 | 1, 2 | ++----------+--------+--------+ +| domains | 5, 6 | 7 | ++----------+--------+--------+ + +This is valid because both guests have a unique set of APQNs: + +* Guest1 has APQNs (1,5), (1,6), (2,5) and (2,6); +* Guest2 has APQNs (1,7) and (2,7). + +Example 2: Valid configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++----------+--------+--------+ +| | Guest1 | Guest2 | ++==========+========+========+ +| adapters | 1, 2 | 3, 4 | ++----------+--------+--------+ +| domains | 5, 6 | 5, 6 | ++----------+--------+--------+ + +This is also valid because both guests have a unique set of APQNs: + +* Guest1 has APQNs (1,5), (1,6), (2,5), (2,6); +* Guest2 has APQNs (3,5), (3,6), (4,5), (4,6) + +Example 3: Invalid configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++----------+--------+--------+ +| | Guest1 | Guest2 | ++==========+========+========+ +| adapters | 1, 2 | 1 | ++----------+--------+--------+ +| domains | 5, 6 | 6, 7 | ++----------+--------+--------+ + +This is an invalid configuration because both guests have access to +APQN (1,6). + +AP Matrix Configuration on Linux Host +------------------------------------- + +A linux system is a guest of the LPAR in which it is running and has access to +the AP resources configured for the LPAR. The LPAR's AP matrix is +configured via its Activation Profile which can be edited on the HMC. When the +linux system is started, the AP bus will detect the AP devices assigned to the +LPAR and create the following in sysfs:: + + /sys/bus/ap + ... [devices] + ...... xx.yyyy + ...... ... + ...... cardxx + ...... ... + +Where: + +``cardxx`` + is AP adapter number xx (in hex) + +``xx.yyyy`` + is an APQN with xx specifying the APID and yyyy specifying the APQI + +For example, if AP adapters 5 and 6 and domains 4, 71 (0x47), 171 (0xab) and +255 (0xff) are configured for the LPAR, the sysfs representation on the linux +host system would look like this:: + + /sys/bus/ap + ... [devices] + ...... 05.0004 + ...... 05.0047 + ...... 05.00ab + ...... 05.00ff + ...... 06.0004 + ...... 06.0047 + ...... 06.00ab + ...... 06.00ff + ...... card05 + ...... card06 + +A set of default device drivers are also created to control each type of AP +device that can be assigned to the LPAR on which a linux host is running:: + + /sys/bus/ap + ... [drivers] + ...... [cex2acard] for Crypto Express 2/3 accelerator cards + ...... [cex2aqueue] for AP queues served by Crypto Express 2/3 + accelerator cards + ...... [cex4card] for Crypto Express 4/5/6 accelerator and coprocessor + cards + ...... [cex4queue] for AP queues served by Crypto Express 4/5/6 + accelerator and coprocessor cards + ...... [pcixcccard] for Crypto Express 2/3 coprocessor cards + ...... [pcixccqueue] for AP queues served by Crypto Express 2/3 + coprocessor cards + +Binding AP devices to device drivers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are two sysfs files that specify bitmasks marking a subset of the APQN +range as 'usable by the default AP queue device drivers' or 'not usable by the +default device drivers' and thus available for use by the alternate device +driver(s). The sysfs locations of the masks are:: + + /sys/bus/ap/apmask + /sys/bus/ap/aqmask + +The ``apmask`` is a 256-bit mask that identifies a set of AP adapter IDs +(APID). Each bit in the mask, from left to right (i.e., from most significant +to least significant bit in big endian order), corresponds to an APID from +0-255. If a bit is set, the APID is marked as usable only by the default AP +queue device drivers; otherwise, the APID is usable by the vfio_ap +device driver. + +The ``aqmask`` is a 256-bit mask that identifies a set of AP queue indexes +(APQI). Each bit in the mask, from left to right (i.e., from most significant +to least significant bit in big endian order), corresponds to an APQI from +0-255. If a bit is set, the APQI is marked as usable only by the default AP +queue device drivers; otherwise, the APQI is usable by the vfio_ap device +driver. + +Take, for example, the following mask:: + + 0x7dffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + +It indicates: + + 1, 2, 3, 4, 5, and 7-255 belong to the default drivers' pool, and 0 and 6 + belong to the vfio_ap device driver's pool. + +The APQN of each AP queue device assigned to the linux host is checked by the +AP bus against the set of APQNs derived from the cross product of APIDs +and APQIs marked as usable only by the default AP queue device drivers. If a +match is detected, only the default AP queue device drivers will be probed; +otherwise, the vfio_ap device driver will be probed. + +By default, the two masks are set to reserve all APQNs for use by the default +AP queue device drivers. There are two ways the default masks can be changed: + + 1. The sysfs mask files can be edited by echoing a string into the + respective sysfs mask file in one of two formats: + + * An absolute hex string starting with 0x - like "0x12345678" - sets + the mask. If the given string is shorter than the mask, it is padded + with 0s on the right; for example, specifying a mask value of 0x41 is + the same as specifying:: + + 0x4100000000000000000000000000000000000000000000000000000000000000 + + Keep in mind that the mask reads from left to right (i.e., most + significant to least significant bit in big endian order), so the mask + above identifies device numbers 1 and 7 (``01000001``). + + If the string is longer than the mask, the operation is terminated with + an error (EINVAL). + + * Individual bits in the mask can be switched on and off by specifying + each bit number to be switched in a comma separated list. Each bit + number string must be prepended with a (``+``) or minus (``-``) to indicate + the corresponding bit is to be switched on (``+``) or off (``-``). Some + valid values are:: + + "+0" switches bit 0 on + "-13" switches bit 13 off + "+0x41" switches bit 65 on + "-0xff" switches bit 255 off + + The following example:: + + +0,-6,+0x47,-0xf0 + + Switches bits 0 and 71 (0x47) on + Switches bits 6 and 240 (0xf0) off + + Note that the bits not specified in the list remain as they were before + the operation. + + 2. The masks can also be changed at boot time via parameters on the kernel + command line like this:: + + ap.apmask=0xffff ap.aqmask=0x40 + + This would create the following masks: + + apmask:: + + 0xffff000000000000000000000000000000000000000000000000000000000000 + + aqmask:: + + 0x4000000000000000000000000000000000000000000000000000000000000000 + + Resulting in these two pools:: + + default drivers pool: adapter 0-15, domain 1 + alternate drivers pool: adapter 16-255, domains 0, 2-255 + +Configuring an AP matrix for a linux guest +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sysfs interfaces for configuring an AP matrix for a guest are built on the +VFIO mediated device framework. To configure an AP matrix for a guest, a +mediated matrix device must first be created for the ``/sys/devices/vfio_ap/matrix`` +device. When the vfio_ap device driver is loaded, it registers with the VFIO +mediated device framework. When the driver registers, the sysfs interfaces for +creating mediated matrix devices is created:: + + /sys/devices + ... [vfio_ap] + ......[matrix] + ......... [mdev_supported_types] + ............ [vfio_ap-passthrough] + ............... create + ............... [devices] + +A mediated AP matrix device is created by writing a UUID to the attribute file +named ``create``, for example:: + + uuidgen > create + +or + +:: + + echo $uuid > create + +When a mediated AP matrix device is created, a sysfs directory named after +the UUID is created in the ``devices`` subdirectory:: + + /sys/devices + ... [vfio_ap] + ......[matrix] + ......... [mdev_supported_types] + ............ [vfio_ap-passthrough] + ............... create + ............... [devices] + .................. [$uuid] + +There will also be three sets of attribute files created in the mediated +matrix device's sysfs directory to configure an AP matrix for the +KVM guest:: + + /sys/devices + ... [vfio_ap] + ......[matrix] + ......... [mdev_supported_types] + ............ [vfio_ap-passthrough] + ............... create + ............... [devices] + .................. [$uuid] + ..................... assign_adapter + ..................... assign_control_domain + ..................... assign_domain + ..................... matrix + ..................... unassign_adapter + ..................... unassign_control_domain + ..................... unassign_domain + +``assign_adapter`` + To assign an AP adapter to the mediated matrix device, its APID is written + to the ``assign_adapter`` file. This may be done multiple times to assign more + than one adapter. The APID may be specified using conventional semantics + as a decimal, hexadecimal, or octal number. For example, to assign adapters + 4, 5 and 16 to a mediated matrix device in decimal, hexadecimal and octal + respectively:: + + echo 4 > assign_adapter + echo 0x5 > assign_adapter + echo 020 > assign_adapter + + In order to successfully assign an adapter: + + * The adapter number specified must represent a value from 0 up to the + maximum adapter number allowed by the machine model. If an adapter number + higher than the maximum is specified, the operation will terminate with + an error (ENODEV). + + * All APQNs that can be derived from the adapter ID being assigned and the + IDs of the previously assigned domains must be bound to the vfio_ap device + driver. If no domains have yet been assigned, then there must be at least + one APQN with the specified APID bound to the vfio_ap driver. If no such + APQNs are bound to the driver, the operation will terminate with an + error (EADDRNOTAVAIL). + + * No APQN that can be derived from the adapter ID and the IDs of the + previously assigned domains can be assigned to another mediated matrix + device. If an APQN is assigned to another mediated matrix device, the + operation will terminate with an error (EADDRINUSE). + +``unassign_adapter`` + To unassign an AP adapter, its APID is written to the ``unassign_adapter`` + file. This may also be done multiple times to unassign more than one adapter. + +``assign_domain`` + To assign a usage domain, the domain number is written into the + ``assign_domain`` file. This may be done multiple times to assign more than one + usage domain. The domain number is specified using conventional semantics as + a decimal, hexadecimal, or octal number. For example, to assign usage domains + 4, 8, and 71 to a mediated matrix device in decimal, hexadecimal and octal + respectively:: + + echo 4 > assign_domain + echo 0x8 > assign_domain + echo 0107 > assign_domain + + In order to successfully assign a domain: + + * The domain number specified must represent a value from 0 up to the + maximum domain number allowed by the machine model. If a domain number + higher than the maximum is specified, the operation will terminate with + an error (ENODEV). + + * All APQNs that can be derived from the domain ID being assigned and the IDs + of the previously assigned adapters must be bound to the vfio_ap device + driver. If no domains have yet been assigned, then there must be at least + one APQN with the specified APQI bound to the vfio_ap driver. If no such + APQNs are bound to the driver, the operation will terminate with an + error (EADDRNOTAVAIL). + + * No APQN that can be derived from the domain ID being assigned and the IDs + of the previously assigned adapters can be assigned to another mediated + matrix device. If an APQN is assigned to another mediated matrix device, + the operation will terminate with an error (EADDRINUSE). + +``unassign_domain`` + To unassign a usage domain, the domain number is written into the + ``unassign_domain`` file. This may be done multiple times to unassign more than + one usage domain. + +``assign_control_domain`` + To assign a control domain, the domain number is written into the + ``assign_control_domain`` file. This may be done multiple times to + assign more than one control domain. The domain number may be specified using + conventional semantics as a decimal, hexadecimal, or octal number. For + example, to assign control domains 4, 8, and 71 to a mediated matrix device + in decimal, hexadecimal and octal respectively:: + + echo 4 > assign_domain + echo 0x8 > assign_domain + echo 0107 > assign_domain + + In order to successfully assign a control domain, the domain number + specified must represent a value from 0 up to the maximum domain number + allowed by the machine model. If a control domain number higher than the + maximum is specified, the operation will terminate with an error (ENODEV). + +``unassign_control_domain`` + To unassign a control domain, the domain number is written into the + ``unassign_domain`` file. This may be done multiple times to unassign more than + one control domain. + +Notes: No changes to the AP matrix will be allowed while a guest using +the mediated matrix device is running. Attempts to assign an adapter, +domain or control domain will be rejected and an error (EBUSY) returned. + +Starting a Linux Guest Configured with an AP Matrix +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To provide a mediated matrix device for use by a guest, the following option +must be specified on the QEMU command line:: + + -device vfio_ap,sysfsdev=$path-to-mdev + +The sysfsdev parameter specifies the path to the mediated matrix device. +There are a number of ways to specify this path:: + + /sys/devices/vfio_ap/matrix/$uuid + /sys/bus/mdev/devices/$uuid + /sys/bus/mdev/drivers/vfio_mdev/$uuid + /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/devices/$uuid + +When the linux guest is started, the guest will open the mediated +matrix device's file descriptor to get information about the mediated matrix +device. The ``vfio_ap`` device driver will update the APM, AQM, and ADM fields in +the guest's CRYCB with the adapter, usage domain and control domains assigned +via the mediated matrix device's sysfs attribute files. Programs running on the +linux guest will then: + +1. Have direct access to the APQNs derived from the cross product of the AP + adapter numbers (APID) and queue indexes (APQI) specified in the APM and AQM + fields of the guests's CRYCB respectively. These APQNs identify the AP queues + that are valid for use by the guest; meaning, AP commands can be sent by the + guest to any of these queues for processing. + +2. Have authorization to process AP commands to change a control domain + identified in the ADM field of the guest's CRYCB. The AP command must be sent + to a valid APQN (see 1 above). + +CPU model features: + +Three CPU model features are available for controlling guest access to AP +facilities: + +1. AP facilities feature + + The AP facilities feature indicates that AP facilities are installed on the + guest. This feature will be exposed for use only if the AP facilities + are installed on the host system. The feature is s390-specific and is + represented as a parameter of the -cpu option on the QEMU command line:: + + qemu-system-s390x -cpu $model,ap=on|off + + Where: + + ``$model`` + is the CPU model defined for the guest (defaults to the model of + the host system if not specified). + + ``ap=on|off`` + indicates whether AP facilities are installed (on) or not + (off). The default for CPU models zEC12 or newer + is ``ap=on``. AP facilities must be installed on the guest if a + vfio-ap device (``-device vfio-ap,sysfsdev=$path``) is configured + for the guest, or the guest will fail to start. + +2. Query Configuration Information (QCI) facility + + The QCI facility is used by the AP bus running on the guest to query the + configuration of the AP facilities. This facility will be available + only if the QCI facility is installed on the host system. The feature is + s390-specific and is represented as a parameter of the -cpu option on the + QEMU command line:: + + qemu-system-s390x -cpu $model,apqci=on|off + + Where: + + ``$model`` + is the CPU model defined for the guest + + ``apqci=on|off`` + indicates whether the QCI facility is installed (on) or + not (off). The default for CPU models zEC12 or newer + is ``apqci=on``; for older models, QCI will not be installed. + + If QCI is installed (``apqci=on``) but AP facilities are not + (``ap=off``), an error message will be logged, but the guest + will be allowed to start. It makes no sense to have QCI + installed if the AP facilities are not; this is considered + an invalid configuration. + + If the QCI facility is not installed, APQNs with an APQI + greater than 15 will not be detected by the AP bus + running on the guest. + +3. Adjunct Process Facility Test (APFT) facility + + The APFT facility is used by the AP bus running on the guest to test the + AP facilities available for a given AP queue. This facility will be available + only if the APFT facility is installed on the host system. The feature is + s390-specific and is represented as a parameter of the -cpu option on the + QEMU command line:: + + qemu-system-s390x -cpu $model,apft=on|off + + Where: + + ``$model`` + is the CPU model defined for the guest (defaults to the model of + the host system if not specified). + + ``apft=on|off`` + indicates whether the APFT facility is installed (on) or + not (off). The default for CPU models zEC12 and + newer is ``apft=on`` for older models, APFT will not be + installed. + + If APFT is installed (``apft=on``) but AP facilities are not + (``ap=off``), an error message will be logged, but the guest + will be allowed to start. It makes no sense to have APFT + installed if the AP facilities are not; this is considered + an invalid configuration. + + It also makes no sense to turn APFT off because the AP bus + running on the guest will not detect CEX4 and newer devices + without it. Since only CEX4 and newer devices are supported + for guest usage, no AP devices can be made accessible to a + guest started without APFT installed. + +Hot plug a vfio-ap device into a running guest +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Only one vfio-ap device can be attached to the virtual machine's ap-bus, so a +vfio-ap device can be hot plugged if and only if no vfio-ap device is attached +to the bus already, whether via the QEMU command line or a prior hot plug +action. + +To hot plug a vfio-ap device, use the QEMU ``device_add`` command:: + + (qemu) device_add vfio-ap,sysfsdev="$path-to-mdev",id="$id" + +Where the ``$path-to-mdev`` value specifies the absolute path to a mediated +device to which AP resources to be used by the guest have been assigned. +``$id`` is the name value for the optional id parameter. + +Note that on Linux guests, the AP devices will be created in the +``/sys/bus/ap/devices`` directory when the AP bus subsequently performs its periodic +scan, so there may be a short delay before the AP devices are accessible on the +guest. + +The command will fail if: + +* A vfio-ap device has already been attached to the virtual machine's ap-bus. + +* The CPU model features for controlling guest access to AP facilities are not + enabled (see 'CPU model features' subsection in the previous section). + +Hot unplug a vfio-ap device from a running guest +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A vfio-ap device can be unplugged from a running KVM guest if a vfio-ap device +has been attached to the virtual machine's ap-bus via the QEMU command line +or a prior hot plug action. + +To hot unplug a vfio-ap device, use the QEMU ``device_del`` command:: + + (qemu) device_del "$id" + +Where ``$id`` is the same id that was specified at device creation. + +On a Linux guest, the AP devices will be removed from the ``/sys/bus/ap/devices`` +directory on the guest when the AP bus subsequently performs its periodic scan, +so there may be a short delay before the AP devices are no longer accessible by +the guest. + +The command will fail if the ``$path-to-mdev`` specified on the ``device_del`` command +does not match the value specified when the vfio-ap device was attached to +the virtual machine's ap-bus. + +Example: Configure AP Matrices for Three Linux Guests +----------------------------------------------------- + +Let's now provide an example to illustrate how KVM guests may be given +access to AP facilities. For this example, we will show how to configure +three guests such that executing the lszcrypt command on the guests would +look like this: + +Guest1:: + + CARD.DOMAIN TYPE MODE + ------------------------------ + 05 CEX5C CCA-Coproc + 05.0004 CEX5C CCA-Coproc + 05.00ab CEX5C CCA-Coproc + 06 CEX5A Accelerator + 06.0004 CEX5A Accelerator + 06.00ab CEX5C CCA-Coproc + +Guest2:: + + CARD.DOMAIN TYPE MODE + ------------------------------ + 05 CEX5A Accelerator + 05.0047 CEX5A Accelerator + 05.00ff CEX5A Accelerator + +Guest3:: + + CARD.DOMAIN TYPE MODE + ------------------------------ + 06 CEX5A Accelerator + 06.0047 CEX5A Accelerator + 06.00ff CEX5A Accelerator + +These are the steps: + +1. Install the vfio_ap module on the linux host. The dependency chain for the + vfio_ap module is: + + * iommu + * s390 + * zcrypt + * vfio + * vfio_mdev + * vfio_mdev_device + * KVM + + To build the vfio_ap module, the kernel build must be configured with the + following Kconfig elements selected: + + * IOMMU_SUPPORT + * S390 + * ZCRYPT + * S390_AP_IOMMU + * VFIO + * VFIO_MDEV + * VFIO_MDEV_DEVICE + * KVM + + If using make menuconfig select the following to build the vfio_ap module:: + -> Device Drivers + -> IOMMU Hardware Support + select S390 AP IOMMU Support + -> VFIO Non-Privileged userspace driver framework + -> Mediated device driver framework + -> VFIO driver for Mediated devices + -> I/O subsystem + -> VFIO support for AP devices + +2. Secure the AP queues to be used by the three guests so that the host can not + access them. To secure the AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, + 06.0004, 06.0047, 06.00ab, and 06.00ff for use by the vfio_ap device driver, + the corresponding APQNs must be removed from the default queue drivers pool + as follows:: + + echo -5,-6 > /sys/bus/ap/apmask + + echo -4,-0x47,-0xab,-0xff > /sys/bus/ap/aqmask + + This will result in AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, 06.0004, + 06.0047, 06.00ab, and 06.00ff getting bound to the vfio_ap device driver. The + sysfs directory for the vfio_ap device driver will now contain symbolic links + to the AP queue devices bound to it:: + + /sys/bus/ap + ... [drivers] + ...... [vfio_ap] + ......... [05.0004] + ......... [05.0047] + ......... [05.00ab] + ......... [05.00ff] + ......... [06.0004] + ......... [06.0047] + ......... [06.00ab] + ......... [06.00ff] + + Keep in mind that only type 10 and newer adapters (i.e., CEX4 and later) + can be bound to the vfio_ap device driver. The reason for this is to + simplify the implementation by not needlessly complicating the design by + supporting older devices that will go out of service in the relatively near + future, and for which there are few older systems on which to test. + + The administrator, therefore, must take care to secure only AP queues that + can be bound to the vfio_ap device driver. The device type for a given AP + queue device can be read from the parent card's sysfs directory. For example, + to see the hardware type of the queue 05.0004:: + + cat /sys/bus/ap/devices/card05/hwtype + + The hwtype must be 10 or higher (CEX4 or newer) in order to be bound to the + vfio_ap device driver. + +3. Create the mediated devices needed to configure the AP matrixes for the + three guests and to provide an interface to the vfio_ap driver for + use by the guests:: + + /sys/devices/vfio_ap/matrix/ + ... [mdev_supported_types] + ...... [vfio_ap-passthrough] (passthrough mediated matrix device type) + ......... create + ......... [devices] + + To create the mediated devices for the three guests:: + + uuidgen > create + uuidgen > create + uuidgen > create + + or + + :: + + echo $uuid1 > create + echo $uuid2 > create + echo $uuid3 > create + + This will create three mediated devices in the [devices] subdirectory named + after the UUID used to create the mediated device. We'll call them $uuid1, + $uuid2 and $uuid3 and this is the sysfs directory structure after creation:: + + /sys/devices/vfio_ap/matrix/ + ... [mdev_supported_types] + ...... [vfio_ap-passthrough] + ......... [devices] + ............ [$uuid1] + ............... assign_adapter + ............... assign_control_domain + ............... assign_domain + ............... matrix + ............... unassign_adapter + ............... unassign_control_domain + ............... unassign_domain + + ............ [$uuid2] + ............... assign_adapter + ............... assign_control_domain + ............... assign_domain + ............... matrix + ............... unassign_adapter + ............... unassign_control_domain + ............... unassign_domain + + ............ [$uuid3] + ............... assign_adapter + ............... assign_control_domain + ............... assign_domain + ............... matrix + ............... unassign_adapter + ............... unassign_control_domain + ............... unassign_domain + +4. The administrator now needs to configure the matrixes for the mediated + devices $uuid1 (for Guest1), $uuid2 (for Guest2) and $uuid3 (for Guest3). + + This is how the matrix is configured for Guest1:: + + echo 5 > assign_adapter + echo 6 > assign_adapter + echo 4 > assign_domain + echo 0xab > assign_domain + + Control domains can similarly be assigned using the assign_control_domain + sysfs file. + + If a mistake is made configuring an adapter, domain or control domain, + you can use the ``unassign_xxx`` interfaces to unassign the adapter, domain or + control domain. + + To display the matrix configuration for Guest1:: + + cat matrix + + The output will display the APQNs in the format ``xx.yyyy``, where xx is + the adapter number and yyyy is the domain number. The output for Guest1 + will look like this:: + + 05.0004 + 05.00ab + 06.0004 + 06.00ab + + This is how the matrix is configured for Guest2:: + + echo 5 > assign_adapter + echo 0x47 > assign_domain + echo 0xff > assign_domain + + This is how the matrix is configured for Guest3:: + + echo 6 > assign_adapter + echo 0x47 > assign_domain + echo 0xff > assign_domain + +5. Start Guest1:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid1 ... + +7. Start Guest2:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid2 ... + +7. Start Guest3:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid3 ... + +When the guest is shut down, the mediated matrix devices may be removed. + +Using our example again, to remove the mediated matrix device $uuid1:: + + /sys/devices/vfio_ap/matrix/ + ... [mdev_supported_types] + ...... [vfio_ap-passthrough] + ......... [devices] + ............ [$uuid1] + ............... remove + + + echo 1 > remove + +This will remove all of the mdev matrix device's sysfs structures including +the mdev device itself. To recreate and reconfigure the mdev matrix device, +all of the steps starting with step 3 will have to be performed again. Note +that the remove will fail if a guest using the mdev is still running. + +It is not necessary to remove an mdev matrix device, but one may want to +remove it if no guest will use it during the remaining lifetime of the linux +host. If the mdev matrix device is removed, one may want to also reconfigure +the pool of adapters and queues reserved for use by the default drivers. + +Limitations +----------- + +* The KVM/kernel interfaces do not provide a way to prevent restoring an APQN + to the default drivers pool of a queue that is still assigned to a mediated + device in use by a guest. It is incumbent upon the administrator to + ensure there is no mediated device in use by a guest to which the APQN is + assigned lest the host be given access to the private data of the AP queue + device, such as a private key configured specifically for the guest. + +* Dynamically assigning AP resources to or unassigning AP resources from a + mediated matrix device - see `Configuring an AP matrix for a linux guest`_ + section above - while a running guest is using it is currently not supported. + +* Live guest migration is not supported for guests using AP devices. If a guest + is using AP devices, the vfio-ap device configured for the guest must be + unplugged before migrating the guest (see `Hot unplug a vfio-ap device from a + running guest`_ section above.) diff --git a/docs/system/s390x/vfio-ccw.rst b/docs/system/s390x/vfio-ccw.rst new file mode 100644 index 000000000..41e0bad5b --- /dev/null +++ b/docs/system/s390x/vfio-ccw.rst @@ -0,0 +1,77 @@ +Subchannel passthrough via vfio-ccw +=================================== + +vfio-ccw (based upon the mediated vfio device infrastructure) allows to +make certain I/O subchannels and their devices available to a guest. The +host will not interact with those subchannels/devices any more. + +Note that while vfio-ccw should work with most non-QDIO devices, only ECKD +DASDs have really been tested. + +Example configuration +--------------------- + +Step 1: configure the host device +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As every mdev is identified by a uuid, the first step is to obtain one:: + + [root@host ~]# uuidgen + 7e270a25-e163-4922-af60-757fc8ed48c6 + +Note: it is recommended to use the ``mdevctl`` tool for actually configuring +the host device. + +To define the same device as configured below to be started +automatically, use + +:: + + [root@host ~]# driverctl -b css set-override 0.0.0313 vfio_ccw + [root@host ~]# mdevctl define -u 7e270a25-e163-4922-af60-757fc8ed48c6 \ + -p 0.0.0313 -t vfio_ccw-io -a + +If using ``mdevctl`` is not possible or wanted, follow the manual procedure +below. + +* Locate the subchannel for the device (in this example, ``0.0.2b09``):: + + [root@host ~]# lscss | grep 0.0.2b09 | awk '{print $2}' + 0.0.0313 + +* Unbind the subchannel (in this example, ``0.0.0313``) from the standard + I/O subchannel driver and bind it to the vfio-ccw driver:: + + [root@host ~]# echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind + [root@host ~]# echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind + +* Create the mediated device (identified by the uuid):: + + [root@host ~]# echo "7e270a25-e163-4922-af60-757fc8ed48c6" > \ + /sys/bus/css/devices/0.0.0313/mdev_supported_types/vfio_ccw-io/create + +Step 2: configure QEMU +~~~~~~~~~~~~~~~~~~~~~~ + +* Reference the created mediated device and (optionally) pick a device id to + be presented in the guest (here, ``fe.0.1234``, which will end up visible + in the guest as ``0.0.1234``:: + + -device vfio-ccw,devno=fe.0.1234,sysfsdev=\ + /sys/bus/mdev/devices/7e270a25-e163-4922-af60-757fc8ed48c6 + +* Start the guest. The device (here, ``0.0.1234``) should now be usable:: + + [root@guest ~]# lscss -d 0.0.1234 + Device Subchan. DevType CU Type Use PIM PAM POM CHPID + ---------------------------------------------------------------------- + 0.0.1234 0.0.0007 3390/0e 3990/e9 f0 f0 ff 1a2a3a0a 00000000 + [root@guest ~]# chccwdev -e 0.0.1234 + Setting device 0.0.1234 online + Done + [root@guest ~]# dmesg -t + (...) + dasd-eckd 0.0.1234: A channel path to the device has become operational + dasd-eckd 0.0.1234: New DASD 3390/0E (CU 3990/01) with 10017 cylinders, 15 heads, 224 sectors + dasd-eckd 0.0.1234: DASD with 4 KB/block, 7212240 KB total size, 48 KB/track, compatible disk layout + dasda:VOL1/ 0X2B09: dasda1 |