diff options
Diffstat (limited to 'docs/tools/virtiofsd.rst')
-rw-r--r-- | docs/tools/virtiofsd.rst | 366 |
1 files changed, 366 insertions, 0 deletions
diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst new file mode 100644 index 000000000..07ac0be55 --- /dev/null +++ b/docs/tools/virtiofsd.rst @@ -0,0 +1,366 @@ +QEMU virtio-fs shared file system daemon +======================================== + +Synopsis +-------- + +**virtiofsd** [*OPTIONS*] + +Description +----------- + +Share a host directory tree with a guest through a virtio-fs device. This +program is a vhost-user backend that implements the virtio-fs device. Each +virtio-fs device instance requires its own virtiofsd process. + +This program is designed to work with QEMU's ``--device vhost-user-fs-pci`` +but should work with any virtual machine monitor (VMM) that supports +vhost-user. See the Examples section below. + +This program must be run as the root user. The program drops privileges where +possible during startup although it must be able to create and access files +with any uid/gid: + +* The ability to invoke syscalls is limited using seccomp(2). +* Linux capabilities(7) are dropped. + +In "namespace" sandbox mode the program switches into a new file system +namespace and invokes pivot_root(2) to make the shared directory tree its root. +A new pid and net namespace is also created to isolate the process. + +In "chroot" sandbox mode the program invokes chroot(2) to make the shared +directory tree its root. This mode is intended for container environments where +the container runtime has already set up the namespaces and the program does +not have permission to create namespaces itself. + +Both sandbox modes prevent "file system escapes" due to symlinks and other file +system objects that might lead to files outside the shared directory. + +Options +------- + +.. program:: virtiofsd + +.. option:: -h, --help + + Print help. + +.. option:: -V, --version + + Print version. + +.. option:: -d + + Enable debug output. + +.. option:: --syslog + + Print log messages to syslog instead of stderr. + +.. option:: -o OPTION + + * debug - + Enable debug output. + + * flock|no_flock - + Enable/disable flock. The default is ``no_flock``. + + * modcaps=CAPLIST + Modify the list of capabilities allowed; CAPLIST is a colon separated + list of capabilities, each preceded by either + or -, e.g. + ''+sys_admin:-chown''. + + * log_level=LEVEL - + Print only log messages matching LEVEL or more severe. LEVEL is one of + ``err``, ``warn``, ``info``, or ``debug``. The default is ``info``. + + * posix_lock|no_posix_lock - + Enable/disable remote POSIX locks. The default is ``no_posix_lock``. + + * readdirplus|no_readdirplus - + Enable/disable readdirplus. The default is ``readdirplus``. + + * sandbox=namespace|chroot - + Sandbox mode: + - namespace: Create mount, pid, and net namespaces and pivot_root(2) into + the shared directory. + - chroot: chroot(2) into shared directory (use in containers). + The default is "namespace". + + * source=PATH - + Share host directory tree located at PATH. This option is required. + + * timeout=TIMEOUT - + I/O timeout in seconds. The default depends on cache= option. + + * writeback|no_writeback - + Enable/disable writeback cache. The cache allows the FUSE client to buffer + and merge write requests. The default is ``no_writeback``. + + * xattr|no_xattr - + Enable/disable extended attributes (xattr) on files and directories. The + default is ``no_xattr``. + + * posix_acl|no_posix_acl - + Enable/disable posix acl support. Posix ACLs are disabled by default. + +.. option:: --socket-path=PATH + + Listen on vhost-user UNIX domain socket at PATH. + +.. option:: --socket-group=GROUP + + Set the vhost-user UNIX domain socket gid to GROUP. + +.. option:: --fd=FDNUM + + Accept connections from vhost-user UNIX domain socket file descriptor FDNUM. + The file descriptor must already be listening for connections. + +.. option:: --thread-pool-size=NUM + + Restrict the number of worker threads per request queue to NUM. The default + is 64. + +.. option:: --cache=none|auto|always + + Select the desired trade-off between coherency and performance. ``none`` + forbids the FUSE client from caching to achieve best coherency at the cost of + performance. ``auto`` acts similar to NFS with a 1 second metadata cache + timeout. ``always`` sets a long cache lifetime at the expense of coherency. + The default is ``auto``. + +Extended attribute (xattr) mapping +---------------------------------- + +By default the name of xattr's used by the client are passed through to the server +file system. This can be a problem where either those xattr names are used +by something on the server (e.g. selinux client/server confusion) or if the +``virtiofsd`` is running in a container with restricted privileges where it +cannot access some attributes. + +Mapping syntax +~~~~~~~~~~~~~~ + +A mapping of xattr names can be made using -o xattrmap=mapping where the ``mapping`` +string consists of a series of rules. + +The first matching rule terminates the mapping. +The set of rules must include a terminating rule to match any remaining attributes +at the end. + +Each rule consists of a number of fields separated with a separator that is the +first non-white space character in the rule. This separator must then be used +for the whole rule. +White space may be added before and after each rule. + +Using ':' as the separator a rule is of the form: + +``:type:scope:key:prepend:`` + +**scope** is: + +- 'client' - match 'key' against a xattr name from the client for + setxattr/getxattr/removexattr +- 'server' - match 'prepend' against a xattr name from the server + for listxattr +- 'all' - can be used to make a single rule where both the server + and client matches are triggered. + +**type** is one of: + +- 'prefix' - is designed to prepend and strip a prefix; the modified + attributes then being passed on to the client/server. + +- 'ok' - Causes the rule set to be terminated when a match is found + while allowing matching xattr's through unchanged. + It is intended both as a way of explicitly terminating + the list of rules, and to allow some xattr's to skip following rules. + +- 'bad' - If a client tries to use a name matching 'key' it's + denied using EPERM; when the server passes an attribute + name matching 'prepend' it's hidden. In many ways it's use is very like + 'ok' as either an explicit terminator or for special handling of certain + patterns. + +- 'unsupported' - If a client tries to use a name matching 'key' it's + denied using ENOTSUP; when the server passes an attribute + name matching 'prepend' it's hidden. In many ways it's use is very like + 'ok' as either an explicit terminator or for special handling of certain + patterns. + +**key** is a string tested as a prefix on an attribute name originating +on the client. It maybe empty in which case a 'client' rule +will always match on client names. + +**prepend** is a string tested as a prefix on an attribute name originating +on the server, and used as a new prefix. It may be empty +in which case a 'server' rule will always match on all names from +the server. + +e.g.: + + ``:prefix:client:trusted.:user.virtiofs.:`` + + will match 'trusted.' attributes in client calls and prefix them before + passing them to the server. + + ``:prefix:server::user.virtiofs.:`` + + will strip 'user.virtiofs.' from all server replies. + + ``:prefix:all:trusted.:user.virtiofs.:`` + + combines the previous two cases into a single rule. + + ``:ok:client:user.::`` + + will allow get/set xattr for 'user.' xattr's and ignore + following rules. + + ``:ok:server::security.:`` + + will pass 'securty.' xattr's in listxattr from the server + and ignore following rules. + + ``:ok:all:::`` + + will terminate the rule search passing any remaining attributes + in both directions. + + ``:bad:server::security.:`` + + would hide 'security.' xattr's in listxattr from the server. + +A simpler 'map' type provides a shorter syntax for the common case: + +``:map:key:prepend:`` + +The 'map' type adds a number of separate rules to add **prepend** as a prefix +to the matched **key** (or all attributes if **key** is empty). +There may be at most one 'map' rule and it must be the last rule in the set. + +Note: When the 'security.capability' xattr is remapped, the daemon has to do +extra work to remove it during many operations, which the host kernel normally +does itself. + +Security considerations +~~~~~~~~~~~~~~~~~~~~~~~ + +Operating systems typically partition the xattr namespace using +well defined name prefixes. Each partition may have different +access controls applied. For example, on Linux there are multiple +partitions + + * ``system.*`` - access varies depending on attribute & filesystem + * ``security.*`` - only processes with CAP_SYS_ADMIN + * ``trusted.*`` - only processes with CAP_SYS_ADMIN + * ``user.*`` - any process granted by file permissions / ownership + +While other OS such as FreeBSD have different name prefixes +and access control rules. + +When remapping attributes on the host, it is important to +ensure that the remapping does not allow a guest user to +evade the guest access control rules. + +Consider if ``trusted.*`` from the guest was remapped to +``user.virtiofs.trusted*`` in the host. An unprivileged +user in a Linux guest has the ability to write to xattrs +under ``user.*``. Thus the user can evade the access +control restriction on ``trusted.*`` by instead writing +to ``user.virtiofs.trusted.*``. + +As noted above, the partitions used and access controls +applied, will vary across guest OS, so it is not wise to +try to predict what the guest OS will use. + +The simplest way to avoid an insecure configuration is +to remap all xattrs at once, to a given fixed prefix. +This is shown in example (1) below. + +If selectively mapping only a subset of xattr prefixes, +then rules must be added to explicitly block direct +access to the target of the remapping. This is shown +in example (2) below. + +Mapping examples +~~~~~~~~~~~~~~~~ + +1) Prefix all attributes with 'user.virtiofs.' + +:: + + -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::" + + +This uses two rules, using : as the field separator; +the first rule prefixes and strips 'user.virtiofs.', +the second rule hides any non-prefixed attributes that +the host set. + +This is equivalent to the 'map' rule: + +:: + + -o xattrmap=":map::user.virtiofs.:" + +2) Prefix 'trusted.' attributes, allow others through + +:: + + "/prefix/all/trusted./user.virtiofs./ + /bad/server//trusted./ + /bad/client/user.virtiofs.// + /ok/all///" + + +Here there are four rules, using / as the field +separator, and also demonstrating that new lines can +be included between rules. +The first rule is the prefixing of 'trusted.' and +stripping of 'user.virtiofs.'. +The second rule hides unprefixed 'trusted.' attributes +on the host. +The third rule stops a guest from explicitly setting +the 'user.virtiofs.' path directly to prevent access +control bypass on the target of the earlier prefix +remapping. +Finally, the fourth rule lets all remaining attributes +through. + +This is equivalent to the 'map' rule: + +:: + + -o xattrmap="/map/trusted./user.virtiofs./" + +3) Hide 'security.' attributes, and allow everything else + +:: + + "/bad/all/security./security./ + /ok/all///' + +The first rule combines what could be separate client and server +rules into a single 'all' rule, matching 'security.' in either +client arguments or lists returned from the host. This stops +the client seeing any 'security.' attributes on the server and +stops it setting any. + +Examples +-------- + +Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket +``/var/run/vm001-vhost-fs.sock``: + +.. parsed-literal:: + + host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001 + host# |qemu_system| \\ + -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\ + -device vhost-user-fs-pci,chardev=char0,tag=myfs \\ + -object memory-backend-memfd,id=mem,size=4G,share=on \\ + -numa node,memdev=mem \\ + ... + guest# mount -t virtiofs myfs /mnt |