author     2023-10-10 11:40:56 +0000
committer  2023-10-10 11:40:56 +0000
commit     e02cda008591317b1625707ff8e115a4841aa889 (patch)
tree       aee302e3cf8b59ec2d32ec481be3d1afddfc8968 /docs/specs/ppc-spapr-numa.rst
parent     cc668e6b7e0ffd8c9d130513d12053cf5eda1d3b (diff)
Introduce Virtio-loopback epsilon release:

The epsilon release introduces a new compatibility layer which makes the
virtio-loopback design work with QEMU and the rust-vmm vhost-user backend
without requiring any changes.
Signed-off-by: Timos Ampelikiotis <t.ampelikiotis@virtualopensystems.com>
Change-Id: I52e57563e08a7d0bdc002f8e928ee61ba0c53dd9
Diffstat (limited to 'docs/specs/ppc-spapr-numa.rst')
-rw-r--r--  docs/specs/ppc-spapr-numa.rst  410
1 file changed, 410 insertions, 0 deletions
diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
new file mode 100644
index 000000000..ffa687dc8
--- /dev/null
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -0,0 +1,410 @@

NUMA mechanics for sPAPR (pseries machines)
===========================================

NUMA in sPAPR works differently from the System Locality Distance
Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
1.1 specification, chapter 15, "Non Uniform Memory Access (NUMA) Option".
This document aims to complement that specification, providing details
of the elements that impact how QEMU views NUMA in pseries.

Associativity and ibm,associativity property
--------------------------------------------

Associativity is defined as a group of platform resources that has
similar mean performance (or, in our context here, distance) relative to
everything else outside of the group.

The format of the ibm,associativity property varies with the value of
bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
bit 0 equal to zero is deprecated. The current format, with bit 0 set to
one, makes the ibm,associativity property represent the physical
hierarchy of the platform, as one or more lists that start with the
highest level grouping and go down to the smallest. Consider the
following topology:

::

    Mem M1 ---- Proc P1    |
    -----------------      | Socket S1 ---|
          chip C1          |              |
                                          | HW module 1 (MOD1)
    Mem M2 ---- Proc P2    |              |
    -----------------      | Socket S2 ---|
          chip C2          |

The ibm,associativity property for the processors would be:

* P1: {MOD1, S1, C1, P1}
* P2: {MOD1, S2, C2, P2}

Each allocable resource has an ibm,associativity property. The LOPAPR
specification allows multiple lists to be present in this property,
considering that the same resource can have multiple connections to the
platform.

Relative Performance Distance and ibm,associativity-reference-points
---------------------------------------------------------------------

The ibm,associativity-reference-points property is an array that is used
to define the relevant performance/distance-related boundaries, defining
the NUMA levels for the platform.

The definition of its elements also varies with the value of bit 0 of
byte 5 of the ibm,architecture-vec-5 property. The format with bit 0
equal to zero is also deprecated. With the current format, each integer
of ibm,associativity-reference-points represents a 1-based ordinal index
(i.e. the first element is 1) into the ibm,associativity array. The
first boundary is the most significant to application performance,
followed by less significant boundaries. Allocated resources that belong
to the same performance boundaries are expected to have relative NUMA
distances that match the relevancy of the boundary itself. Resources
that belong to the same first boundary will have the shortest distance
from each other. Subsequent boundaries represent greater distances and
degraded performance.
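This indexing can be made concrete with a small sketch. Taking the P1 and
P2 arrays from the topology above, and the reference points
{0x3, 0x2, 0x1} used in the example that follows, the sketch below
(illustrative Python, not QEMU or kernel code; REF_POINTS, P1, P2 and
first_matching_level are names made up for this document) shows how each
reference point, as a 1-based ordinal, selects one domain of an
associativity array, and how the first NUMA level at which two arrays
agree is found:

::

    # Illustrative sketch only (Python, not QEMU/kernel code); all the
    # names below are made up for this document.

    REF_POINTS = [3, 2, 1]           # ibm,associativity-reference-points
    P1 = ["MOD1", "S1", "C1", "P1"]  # ibm,associativity of P1
    P2 = ["MOD1", "S2", "C2", "P2"]  # ibm,associativity of P2

    def first_matching_level(a, b, ref_points):
        """Return the 0-based NUMA level at which the two associativity
        arrays first share a domain, or None when there is no match."""
        for level, ref in enumerate(ref_points):
            if a[ref - 1] == b[ref - 1]:  # each ref is a 1-based index
                return level
        return None

    print(first_matching_level(P1, P2, REF_POINTS))  # 2: only MOD1 is shared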
Using the previous example, the following reference-points setting
defines three NUMA levels:

* ibm,associativity-reference-points = {0x3, 0x2, 0x1}

The first NUMA level (0x3) is interpreted as the third element of each
ibm,associativity array, the second level is the second element and the
third level is the first element. Let's also consider that elements
belonging to the first NUMA level have a distance equal to 10 from each
other, and that each NUMA level doubles the distance of the previous
one. This means that the second level would be 20 and the third level
40. For the P1 and P2 processors, we would have the following NUMA
levels:

::

  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}

  * P1: associativity{MOD1, S1, C1, P1}

  First NUMA level (0x3) => associativity[2] = C1
  Second NUMA level (0x2) => associativity[1] = S1
  Third NUMA level (0x1) => associativity[0] = MOD1

  * P2: associativity{MOD1, S2, C2, P2}

  First NUMA level (0x3) => associativity[2] = C2
  Second NUMA level (0x2) => associativity[1] = S2
  Third NUMA level (0x1) => associativity[0] = MOD1

  P1 and P2 have the same third NUMA level, MOD1: distance between them = 40

Changing the ibm,associativity-reference-points array changes the
performance distance attributes for the same associativity arrays, as
the following example illustrates:

::

  * ibm,associativity-reference-points = {0x2}

  * P1: associativity{MOD1, S1, C1, P1}

  First NUMA level (0x2) => associativity[1] = S1

  * P2: associativity{MOD1, S2, C2, P2}

  First NUMA level (0x2) => associativity[1] = S2

  P1 and P2 do not have a common performance boundary. Since this is a
  one-level NUMA configuration, the distance between them is one boundary
  above the first level, 20.


In a hypothetical platform where all resources inside the same hardware
module are considered to be on the same performance boundary:

::

  * ibm,associativity-reference-points = {0x1}

  * P1: associativity{MOD1, S1, C1, P1}

  First NUMA level (0x1) => associativity[0] = MOD1

  * P2: associativity{MOD1, S2, C2, P2}

  First NUMA level (0x1) => associativity[0] = MOD1

  P1 and P2 belong to the same first order boundary. The distance between
  them is 10.


How the pseries Linux guest calculates NUMA distances
=====================================================

Another key difference between ACPI SLIT and LOPAPR regarding NUMA is
how the distances are expressed. The SLIT table provides the NUMA
distance value between the relevant resources. LOPAPR does not provide a
standard way to calculate it. We have the ibm,associativity for each
resource, which provides a common-performance hierarchy, and the
ibm,associativity-reference-points array that tells which levels of
associativity are considered to be relevant or not.

The result is that each OS is free to implement and interpret the
distance as it sees fit. For the pseries Linux guest, each NUMA level
doubles the distance of the previous level, and the maximum number of
levels is limited to MAX_DISTANCE_REF_POINTS = 4 (from
arch/powerpc/mm/numa.c in the kernel tree). This results in the
following distances:

* both resources in the first NUMA level: 10
* resources one NUMA level apart: 20
* resources two NUMA levels apart: 40
* resources three NUMA levels apart: 80
* resources four NUMA levels apart: 160
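Expressed as a sketch, the doubling rule looks as follows (illustrative
Python that mirrors the behavior described above, not the actual
arch/powerpc/mm/numa.c code; guest_distance is a made-up name):

::

    # Sketch of the doubling rule described above; not kernel code, just
    # the behavior it produces.

    MAX_DISTANCE_REF_POINTS = 4  # from arch/powerpc/mm/numa.c

    def guest_distance(levels_apart):
        """Distance between two resources whose associativity arrays first
        match 'levels_apart' NUMA levels away (0 = same first level)."""
        assert 0 <= levels_apart <= MAX_DISTANCE_REF_POINTS
        return 10 * (2 ** levels_apart)

    print([guest_distance(n) for n in range(5)])  # [10, 20, 40, 80, 160]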
pseries NUMA mechanics
======================

Starting in QEMU 5.2, the pseries machine considers user input when
setting the NUMA topology of the guest. The overall design is:

* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1},
  allowing for 4 distinct NUMA distance values based on the NUMA levels

* ibm,max-associativity-domains supports multiple associativity domains
  in all NUMA levels, granting user flexibility

* ibm,associativity for all resources varies with user input

These changes are only effective for pseries-5.2 and newer machines that
are created with more than one NUMA node (disregarding NUMA nodes
created by the machine itself, e.g. NVLink 2 GPUs). The now legacy
support has been around for a long time, with users seeing NUMA
distances 10 and 40 (and 80 if using NVLink2 GPUs), and there is no need
to disrupt the existing experience of those guests.

To bring pseries users the experience x86 users have when tuning NUMA,
we had to operate under the pseries Linux kernel logic described in
`How the pseries Linux guest calculates NUMA distances`_. The result is
that we needed to translate NUMA distance user input into pseries Linux
kernel input.

Translating user distance to kernel distance
--------------------------------------------

User input for NUMA distance can vary from 10 to 254. We need to
translate that to the values that the Linux kernel operates on (10, 20,
40, 80, 160). This is how it is being done:

* user distance 11 to 30 will be interpreted as 20
* user distance 31 to 60 will be interpreted as 40
* user distance 61 to 120 will be interpreted as 80
* user distance 121 and beyond will be interpreted as 160
* user distance 10 stays 10

The reasoning behind this approximation is to avoid any rounding up to
the local distance (10), keeping it exclusive to the 4th NUMA level
(which is still exclusive to the node_id). All other ranges were chosen
at the developer's discretion of what would be (somewhat) sensible
considering the user input. Any other strategy could be used here, but
in the end we have to accept that a large range of values will be
translated to the same NUMA topology in the guest. For example, this
user input:

::

       0   1   2
  0   10  31 120
  1   31  10  30
  2  120  30  10

and this other user input:

::

       0   1   2
  0   10  60  61
  1   60  10  11
  2   61  11  10

will both be translated to the same values internally:

::

       0   1   2
  0   10  40  80
  1   40  10  20
  2   80  20  10
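The collapse shown above can be sketched as follows (illustrative Python
encoding the ranges listed earlier, not the actual QEMU implementation;
translate_distance is a made-up name):

::

    # Sketch of the user-to-kernel distance translation ranges listed
    # above; illustrative Python, not the actual QEMU code.

    def translate_distance(user_distance):
        """Map a user distance (10-254) to 10, 20, 40, 80 or 160."""
        assert 10 <= user_distance <= 254
        if user_distance == 10:
            return 10
        if user_distance <= 30:
            return 20
        if user_distance <= 60:
            return 40
        if user_distance <= 120:
            return 80
        return 160

    # Both example inputs above collapse to the same internal distances:
    print([translate_distance(d) for d in (31, 120, 30)])  # [40, 80, 20]
    print([translate_distance(d) for d in (60, 61, 11)])   # [40, 80, 20]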
Users are encouraged to use only the kernel values in the NUMA
definition to avoid being surprised by what the guest actually sees in
the topology. There are enough potential surprises inherent to the
associativity domain assignment process, discussed below.


How associativity domains are assigned
--------------------------------------

LOPAPR allows more than one associativity array (or 'string') per
allocated resource. This would be used to represent that the resource
has multiple connections to the board, and then the operating system,
when deciding NUMA distancing, should consider the associativity
information that provides the shortest distance.

The spapr implementation does not support multiple associativity arrays
per resource, and neither does the pseries Linux kernel. We'll have to
represent the NUMA topology using one associativity array per resource,
which means that choices and compromises are going to be made.

Consider the following NUMA topology entered by user input:

::

      0   1   2   3
  0  10  40  20  40
  1  40  10  80  40
  2  20  80  10  20
  3  40  40  20  10

All the associativity arrays are initialized with the NUMA id in all
associativity domains:

* node 0: 0 0 0 0
* node 1: 1 1 1 1
* node 2: 2 2 2 2
* node 3: 3 3 3 3


Honoring just the relative distances of node 0 to every other node, we
find the NUMA level matches (considering the reference points
{0x4, 0x3, 0x2, 0x1}) for each distance:

* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
  at 0x2)
* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
  at 0x2)

We'll copy the associativity domains of node 0 to all other nodes, based
on the NUMA level matches. Between nodes 0 and 1 the match is at 0x2, so
we copy the domains 0x2 and 0x1 from node 0 to node 1 as well. This
gives us:

* node 0: 0 0 0 0
* node 1: 0 0 1 1

Doing the same for node 2 and node 3, these are the associativity arrays
after considering all matches with node 0:

* node 0: 0 0 0 0
* node 1: 0 0 1 1
* node 2: 0 0 0 2
* node 3: 0 0 3 3

The distances related to node 0 are accounted for. For node 1, and
keeping in mind that we don't need to revisit node 0 again, the distance
from node 1 to 2 is 80, a match at 0x1, and the distance from 1 to 3 is
40, a match at 0x2. Repeating the same logic of copying all domains up
to the NUMA level match:

* node 0: 0 0 0 0
* node 1: 1 0 1 1
* node 2: 1 0 0 2
* node 3: 1 0 3 3

In the last step we analyze just nodes 2 and 3. The desired distance
between 2 and 3 is 20, i.e. a match at 0x3:

* node 0: 0 0 0 0
* node 1: 1 0 1 1
* node 2: 1 0 0 2
* node 3: 1 0 0 3


The kernel will read these arrays and calculate the following NUMA
topology for the guest:

::

      0   1   2   3
  0  10  40  20  20
  1  40  10  40  40
  2  20  40  10  20
  3  20  40  20  10

Note that this is not what the user wanted: the desired distance between
0 and 3 is 40, but it was calculated as 20. This is what the current
logic and implementation constraints of the kernel and QEMU can provide
within the LOPAPR specification.

Users are welcome to use this knowledge and experiment with the input to
get the NUMA topology they want, or as close to it as they can. The
important thing is to keep expectations up to par with what we are
capable of providing at this moment: an approximation.
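The resulting matrix can be checked with a sketch of the distance
calculation the guest kernel performs on the final associativity arrays
above, using the reference points {0x4, 0x3, 0x2, 0x1} (illustrative
Python, not kernel code; kernel_distance is a made-up name):

::

    # Sketch of the guest kernel's distance calculation over the final
    # associativity arrays above; illustrative Python, not kernel code.

    REF_POINTS = [4, 3, 2, 1]  # reference points as 1-based indices

    assoc = [
        [0, 0, 0, 0],  # node 0
        [1, 0, 1, 1],  # node 1
        [1, 0, 0, 2],  # node 2
        [1, 0, 0, 3],  # node 3
    ]

    def kernel_distance(a, b):
        """Start at 10 and double for each NUMA level with no match."""
        for level, ref in enumerate(REF_POINTS):
            if a[ref - 1] == b[ref - 1]:
                return 10 * (2 ** level)
        return 10 * (2 ** len(REF_POINTS))  # no match at any level

    for a in assoc:
        print([kernel_distance(a, b) for b in assoc])
    # [10, 40, 20, 20]
    # [40, 10, 40, 40]
    # [20, 40, 10, 20]
    # [20, 40, 20, 10]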
Limitations of the implementation
---------------------------------

As mentioned above, the pSeries NUMA distance logic is, in fact, a way
to approximate user choice. The Linux kernel, and PAPR itself, do not
provide QEMU with the means to fully map user input to the actual NUMA
distances the guest will use. These constraints create two notable
limitations in our support:

* Asymmetrical topologies aren't supported. We only support NUMA
  topologies where the distance from node A to B is always the same as
  from B to A. We do not support any A-B pair where the distance back
  and forth is asymmetric. For example, the following topology isn't
  supported and the pSeries guest will not boot with this user input:

::

      0   1
  0  10  40
  1  20  10


* 'Non-transitive' topologies will be poorly translated to the guest.
  This is the kind of topology where the distance from a node A to B is
  X, from B to C is X, but the distance from A to C is not X. E.g.:

::

      0   1   2   3
  0  10  20  20  40
  1  20  10  80  40
  2  20  80  10  20
  3  40  40  20  10

  In the example above, the distance from 0 to 2 is 20, from 2 to 3 is
  20, but from 0 to 3 is 40. The kernel will always match with the
  shortest associativity domain possible, and we're attempting to retain
  the previously established relations between the nodes. This means
  that a distance equal to 20 between nodes 0 and 2 and the same
  distance 20 between nodes 2 and 3 will cause the distance between 0
  and 3 to also be 20.


Legacy (5.1 and older) pseries NUMA mechanics
=============================================

In short, we can summarize the NUMA distances seen in pseries Linux
guests, using QEMU up to 5.1, as follows:

* local distance, i.e. the distance of the resource to its own NUMA
  node: 10
* if it's an NVLink GPU device, distance: 80
* every other resource, distance: 40

The way the pseries Linux guest calculates NUMA distances has a direct
effect on what QEMU users can expect when doing NUMA tuning. As of QEMU
5.1, this is the default ibm,associativity-reference-points being used
in the pseries machine:

ibm,associativity-reference-points = {0x4, 0x4, 0x2}

The first and second levels are equal, 0x4, and a third one was added in
commit a6030d7e0b35 exclusively for NVLink GPU support. This means that,
regardless of how the ibm,associativity properties are being created in
the device tree, the pseries Linux guest will only recognize three
scenarios as far as NUMA distance goes:

* if the resources belong to the same first NUMA level, distance = 10
* the second level is skipped, since it's equal to the first
* all resources that aren't NVLink GPUs are guaranteed to belong to the
  same third NUMA level, with distance = 40
* for NVLink GPUs, distance = 80 from everything else

This also means that user input in the QEMU command line does not change
the NUMA distancing inside the guest for the pseries machine.
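For completeness, the three legacy scenarios above can be condensed into
a small sketch (illustrative Python, not QEMU code; legacy_distance is a
made-up name):

::

    # Minimal sketch of the three legacy scenarios listed above;
    # illustrative Python only, not QEMU code.

    def legacy_distance(same_node, nvlink_gpu_involved):
        """NUMA distance a pseries guest sees with QEMU 5.1 and older."""
        if same_node:
            return 10   # local distance
        if nvlink_gpu_involved:
            return 80   # NVLink 2 GPU to everything else
        return 40       # every other resource pair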