From e02cda008591317b1625707ff8e115a4841aa889 Mon Sep 17 00:00:00 2001 From: Timos Ampelikiotis Date: Tue, 10 Oct 2023 11:40:56 +0000 Subject: Introduce Virtio-loopback epsilon release: Epsilon release introduces a new compatibility layer which make virtio-loopback design to work with QEMU and rust-vmm vhost-user backend without require any changes. Signed-off-by: Timos Ampelikiotis Change-Id: I52e57563e08a7d0bdc002f8e928ee61ba0c53dd9 --- docs/devel/tcg.rst | 190 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 190 insertions(+) create mode 100644 docs/devel/tcg.rst (limited to 'docs/devel/tcg.rst') diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst new file mode 100644 index 000000000..a65fb7b1c --- /dev/null +++ b/docs/devel/tcg.rst @@ -0,0 +1,190 @@ +==================== +Translator Internals +==================== + +QEMU is a dynamic translator. When it first encounters a piece of code, +it converts it to the host instruction set. Usually dynamic translators +are very complicated and highly CPU dependent. QEMU uses some tricks +which make it relatively easily portable and simple while achieving good +performances. + +QEMU's dynamic translation backend is called TCG, for "Tiny Code +Generator". For more information, please take a look at ``tcg/README``. + +The following sections outline some notable features and implementation +details of QEMU's dynamic translator. + +CPU state optimisations +----------------------- + +The target CPUs have many internal states which change the way they +evaluate instructions. In order to achieve a good speed, the +translation phase considers that some state information of the virtual +CPU cannot change in it. The state is recorded in the Translation +Block (TB). If the state changes (e.g. privilege level), a new TB will +be generated and the previous TB won't be used anymore until the state +matches the state recorded in the previous TB. The same idea can be applied +to other aspects of the CPU state. For example, on x86, if the SS, +DS and ES segments have a zero base, then the translator does not even +generate an addition for the segment base. + +Direct block chaining +--------------------- + +After each translated basic block is executed, QEMU uses the simulated +Program Counter (PC) and other CPU state information (such as the CS +segment base value) to find the next basic block. + +In its simplest, less optimized form, this is done by exiting from the +current TB, going through the TB epilogue, and then back to the +main loop. That’s where QEMU looks for the next TB to execute, +translating it from the guest architecture if it isn’t already available +in memory. Then QEMU proceeds to execute this next TB, starting at the +prologue and then moving on to the translated instructions. + +Exiting from the TB this way will cause the ``cpu_exec_interrupt()`` +callback to be re-evaluated before executing additional instructions. +It is mandatory to exit this way after any CPU state changes that may +unmask interrupts. + +In order to accelerate the cases where the TB for the new +simulated PC is already available, QEMU has mechanisms that allow +multiple TBs to be chained directly, without having to go back to the +main loop as described above. These mechanisms are: + +``lookup_and_goto_ptr`` +^^^^^^^^^^^^^^^^^^^^^^^ + +Calling ``tcg_gen_lookup_and_goto_ptr()`` will emit a call to +``helper_lookup_tb_ptr``. This helper will look for an existing TB that +matches the current CPU state. If the destination TB is available its +code address is returned, otherwise the address of the JIT epilogue is +returned. The call to the helper is always followed by the tcg ``goto_ptr`` +opcode, which branches to the returned address. In this way, we either +branch to the next TB or return to the main loop. + +``goto_tb + exit_tb`` +^^^^^^^^^^^^^^^^^^^^^ + +The translation code usually implements branching by performing the +following steps: + +1. Call ``tcg_gen_goto_tb()`` passing a jump slot index (either 0 or 1) + as a parameter. + +2. Emit TCG instructions to update the CPU state with any information + that has been assumed constant and is required by the main loop to + correctly locate and execute the next TB. For most guests, this is + just the PC of the branch destination, but others may store additional + data. The information updated in this step must be inferable from both + ``cpu_get_tb_cpu_state()`` and ``cpu_restore_state()``. + +3. Call ``tcg_gen_exit_tb()`` passing the address of the current TB and + the jump slot index again. + +Step 1, ``tcg_gen_goto_tb()``, will emit a ``goto_tb`` TCG +instruction that later on gets translated to a jump to an address +associated with the specified jump slot. Initially, this is the address +of step 2's instructions, which update the CPU state information. Step 3, +``tcg_gen_exit_tb()``, exits from the current TB returning a tagged +pointer composed of the last executed TB’s address and the jump slot +index. + +The first time this whole sequence is executed, step 1 simply jumps +to step 2. Then the CPU state information gets updated and we exit from +the current TB. As a result, the behavior is very similar to the less +optimized form described earlier in this section. + +Next, the main loop looks for the next TB to execute using the +current CPU state information (creating the TB if it wasn’t already +available) and, before starting to execute the new TB’s instructions, +patches the previously executed TB by associating one of its jump +slots (the one specified in the call to ``tcg_gen_exit_tb()``) with the +address of the new TB. + +The next time this previous TB is executed and we get to that same +``goto_tb`` step, it will already be patched (assuming the destination TB +is still in memory) and will jump directly to the first instruction of +the destination TB, without going back to the main loop. + +For the ``goto_tb + exit_tb`` mechanism to be used, the following +conditions need to be satisfied: + +* The change in CPU state must be constant, e.g., a direct branch and + not an indirect branch. + +* The direct branch cannot cross a page boundary. Memory mappings + may change, causing the code at the destination address to change. + +Note that, on step 3 (``tcg_gen_exit_tb()``), in addition to the +jump slot index, the address of the TB just executed is also returned. +This address corresponds to the TB that will be patched; it may be +different than the one that was directly executed from the main loop +if the latter had already been chained to other TBs. + +Self-modifying code and translated code invalidation +---------------------------------------------------- + +Self-modifying code is a special challenge in x86 emulation because no +instruction cache invalidation is signaled by the application when code +is modified. + +User-mode emulation marks a host page as write-protected (if it is +not already read-only) every time translated code is generated for a +basic block. Then, if a write access is done to the page, Linux raises +a SEGV signal. QEMU then invalidates all the translated code in the page +and enables write accesses to the page. For system emulation, write +protection is achieved through the software MMU. + +Correct translated code invalidation is done efficiently by maintaining +a linked list of every translated block contained in a given page. Other +linked lists are also maintained to undo direct block chaining. + +On RISC targets, correctly written software uses memory barriers and +cache flushes, so some of the protection above would not be +necessary. However, QEMU still requires that the generated code always +matches the target instructions in memory in order to handle +exceptions correctly. + +Exception support +----------------- + +longjmp() is used when an exception such as division by zero is +encountered. + +The host SIGSEGV and SIGBUS signal handlers are used to get invalid +memory accesses. QEMU keeps a map from host program counter to +target program counter, and looks up where the exception happened +based on the host program counter at the exception point. + +On some targets, some bits of the virtual CPU's state are not flushed to the +memory until the end of the translation block. This is done for internal +emulation state that is rarely accessed directly by the program and/or changes +very often throughout the execution of a translation block---this includes +condition codes on x86, delay slots on SPARC, conditional execution on +Arm, and so on. This state is stored for each target instruction, and +looked up on exceptions. + +MMU emulation +------------- + +For system emulation QEMU uses a software MMU. In that mode, the MMU +virtual to physical address translation is done at every memory +access. + +QEMU uses an address translation cache (TLB) to speed up the translation. +In order to avoid flushing the translated code each time the MMU +mappings change, all caches in QEMU are physically indexed. This +means that each basic block is indexed with its physical address. + +In order to avoid invalidating the basic block chain when MMU mappings +change, chaining is only performed when the destination of the jump +shares a page with the basic block that is performing the jump. + +The MMU can also distinguish RAM and ROM memory areas from MMIO memory +areas. Access is faster for RAM and ROM because the translation cache also +hosts the offset between guest address and host memory. Accessing MMIO +memory areas instead calls out to C code for device emulation. +Finally, the MMU helps tracking dirty pages and pages pointed to by +translation blocks. + -- cgit 1.2.3-korg