author     Timos Ampelikiotis <t.ampelikiotis@virtualopensystems.com>  2023-10-10 11:40:56 +0000
committer  Timos Ampelikiotis <t.ampelikiotis@virtualopensystems.com>  2023-10-10 11:40:56 +0000
commit     e02cda008591317b1625707ff8e115a4841aa889 (patch)
tree       aee302e3cf8b59ec2d32ec481be3d1afddfc8968 /docs/pvrdma.txt
parent     cc668e6b7e0ffd8c9d130513d12053cf5eda1d3b (diff)
Introduce Virtio-loopback epsilon release:
The epsilon release introduces a new compatibility layer which makes the
virtio-loopback design work with QEMU and the rust-vmm vhost-user backend
without requiring any changes.
Signed-off-by: Timos Ampelikiotis <t.ampelikiotis@virtualopensystems.com>
Change-Id: I52e57563e08a7d0bdc002f8e928ee61ba0c53dd9
Diffstat (limited to 'docs/pvrdma.txt')
-rw-r--r--   docs/pvrdma.txt   345
1 file changed, 345 insertions, 0 deletions
diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
new file mode 100644
index 000000000..5c122fe81
--- /dev/null
+++ b/docs/pvrdma.txt
@@ -0,0 +1,345 @@

Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit, and, even though not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SRIOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SRIOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
this will be something like mlx5_6, which can be used as the backend.


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing
the required RDMA libraries.



3. Usage
========


3.1 VM memory settings
======================
Currently the device works only with memory-backend RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


3.2 MAD multiplexer
===================
The MAD multiplexer is a service that exposes a MAD-like interface to VMs in
order to overcome the limitation that only a single entity can register with
the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux, run:
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded, otherwise the rdmacm-mux service will fail to start.
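
For example, to check whether the modules are loaded and to remove them if
so (a minimal sketch; unloading may additionally require removing dependent
modules such as rdma_ucm first):
# lsmod | grep -E 'ib_cm|rdma_cm'
# modprobe -r rdma_cm ib_cm
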
The application accepts three command-line arguments and exposes a UNIX
socket to pass control and data to it.
-d rdma-device-name  Name of the RDMA device to register with
-s unix-socket-path  Path of the unix socket to listen on
                     (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of the RDMA device to register with
                     (default 1)
The final UNIX socket file name is a concatenation of these arguments, so,
for example, for device mlx5_0 on port 2 the socket
/var/run/rdmacm-mux-mlx5_0-2 will be created.

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
Control over the RDMA device's GID table is exercised by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The opposite also holds, i.e.
whenever an address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of the
backend Ethernet device.

pvrdma requires the libvirt service to be up.


3.4 PCI device settings
=======================
A RoCE device exposes two functions - an Ethernet function and an RDMA
function.
To support this, the pvrdma device is composed of two PCI functions: an
Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on
PCI function 1 of the same slot. The Ethernet function can be used for
other Ethernet purposes such as IP.


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this is the Ethernet
  device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, port 1 is used.
- dev-caps-max-mr-size: The maximum size of an MR.
- dev-caps-max-qp:      Maximum number of QPs.
- dev-caps-max-cq:      Maximum number of CQs.
- dev-caps-max-mr:      Maximum number of MRs.
- dev-caps-max-pd:      Maximum number of PDs.
- dev-caps-max-ah:      Maximum number of AHs.

Notes:
- The first three parameters are mandatory settings; the rest have
  defaults.
- The dev-caps-* parameters define the upper limits, but the final values
  are adjusted according to the backend device's limitations.
- netdev can be extracted from the ibdev's sysfs directory
  (/sys/class/infiniband/<ibdev>/device/net/); see the example below.
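
For example, assuming a hypothetical ibdev named mlx5_0, the matching netdev
can be listed with:
   ls /sys/class/infiniband/mlx5_0/device/net/
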

3.6 Example
===========
Define a bridge device with a vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define the pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>
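
Once the guest is up, the device can be verified from inside the guest with
the standard rdma-core tools (assuming rdma-core was installed as described
in section 2.1), for example:
# ibv_devices
# ibv_devinfo
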

4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma device
   requests a resource from the backend interface, maintaining a 1-1
   mapping between the guest and the host.
On the data path:
 - Every post_send/receive received from the guest is converted into
   a post_send/receive for the backend. The buffer data is not touched
   or copied, resulting in near bare-metal performance for large enough
   buffers.
 - Completions from the backend interface result in completions for
   the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
    BAR 0 - MSI-X
        MSI-X vectors:
        (0) Command    - used when execution of a command is completed.
        (1) Async      - not in use.
        (2) Completion - used when a completion event is placed in
                         the device's CQ ring.

    BAR 1 - Registers
        --------------------------------------------------------
        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  |  MAC |
        --------------------------------------------------------
        DSR - Address of the driver/device shared memory used
              for the command channel, used for passing:
                - General info such as the driver version
                - Address of 'command' and 'response'
                - Address of the async ring
                - Address of the device's CQ ring
                - Device capabilities
        CTL - Device control operations (activate, reset etc)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

    BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
        ---------------------------------------------------------
        - Offset 0 is used for QP operations (send and recv)
        - Offset 4 is used for CQ operations (arm and poll)
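
As an illustration only (the exact PCI address depends on the configuration;
with the example in section 3.6 the PVRDMA function sits at slot 0x10,
function 1), the BARs can be inspected from inside the guest with:
# lspci -vv -s 00:10.1
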

4.3 Major flows
===============

4.3.1 Create CQ
===============
 - Guest driver
    - Allocates pages for the CQ ring
    - Creates a page directory (pdir) to hold the CQ ring's pages
    - Initializes the CQ ring
    - Initializes the 'Create CQ' command object (cqe, pdir etc)
    - Copies the command to the 'command' address
    - Writes 0 into the REQ register
 - Device
    - Reads the request object from the 'command' address
    - Allocates a CQ object and initializes the CQ ring based on the pdir
    - Creates the backend CQ
    - Writes the operation status to the ERR register
    - Posts a command-interrupt to the guest
 - Guest driver
    - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
 - Guest driver
    - Allocates pages for the send and receive rings
    - Creates a page directory (pdir) to hold the rings' pages
    - Initializes the 'Create QP' command object (max_send_wr,
      send_cq_handle, recv_cq_handle, pdir etc)
    - Copies the object to the 'command' address
    - Writes 0 into the REQ register
 - Device
    - Reads the request object from the 'command' address
    - Allocates the QP object and initializes
      - The send and recv rings based on the pdir
      - The send and recv ring state
    - Creates the backend QP
    - Writes the operation status to the ERR register
    - Posts a command-interrupt to the guest
 - Guest driver
    - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
 - Guest driver
    - Initializes a wqe and places it on the recv ring
    - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
 - Device
    - Extracts the qpn from the UAR
    - Walks through the ring and does the following for each wqe
       - Prepares the backend CQE context to be used when
         receiving a completion from the backend (wr_id, op_code, emu_cq_num)
       - For each sge prepares a backend sge
       - Calls the backend's post_recv

4.3.4 Process backend events
============================
 - Done by a dedicated thread used to process backend events;
   at initialization it is attached to the device and creates
   the communication channel.
 - Thread main loop:
    - Polls for completions
    - Extracts emu_cq_num, wr_id and op_code from the context
    - Writes the CQE to the CQ ring
    - Writes the CQ number to the device CQ
    - Sends a completion-interrupt to the guest
    - Deallocates the context
    - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the guest Linux driver's implementation
  of the VMware device API features.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it into a contiguous virtual address range. Since
  this is not on the data path it should not matter much. If the default max
  MR size is increased, be aware that memory registration can take up to
  0.5 seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to initialize.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to initialize if these requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it becomes
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with two VMs, with the pvrdma devices connected to two VFs of the
same device.)

All of the above assumes that no memory registration is done on the data
path.
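
As a rough way to reproduce such measurements (a sketch only; it assumes the
perftest package is available in both guests, and the device name and buffer
size are illustrative), bandwidth between two guests can be measured with:
# ib_write_bw -d <guest-ibdev> -s 1048576                 (server side)
# ib_write_bw -d <guest-ibdev> -s 1048576 <server-ip>     (client side)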