Diffstat (limited to 'meta-eas/recipes-kernel/linux/linux-renesas/0045-sched-Documentation-for-scheduler-energy-cost-model.patch')
-rw-r--r-- | meta-eas/recipes-kernel/linux/linux-renesas/0045-sched-Documentation-for-scheduler-energy-cost-model.patch | 389 |
1 files changed, 389 insertions, 0 deletions
diff --git a/meta-eas/recipes-kernel/linux/linux-renesas/0045-sched-Documentation-for-scheduler-energy-cost-model.patch b/meta-eas/recipes-kernel/linux/linux-renesas/0045-sched-Documentation-for-scheduler-energy-cost-model.patch
new file mode 100644
index 0000000..273f0e8
--- /dev/null
+++ b/meta-eas/recipes-kernel/linux/linux-renesas/0045-sched-Documentation-for-scheduler-energy-cost-model.patch
@@ -0,0 +1,389 @@
+From ffa86114448cb4f15e3774072964960c5c4d6b5e Mon Sep 17 00:00:00 2001
+From: Morten Rasmussen <morten.rasmussen@arm.com>
+Date: Tue, 13 Jan 2015 13:43:28 +0000
+Subject: [PATCH 45/92] sched: Documentation for scheduler energy cost model
+
+This documentation patch provides an overview of the experimental
+scheduler energy costing model, associated data structures, and a
+reference recipe on how platforms can be characterized to derive energy
+models.
+
+Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
+(cherry picked from commit 8de9277814e36b07759c65bce220c82584f80aa3)
+Signed-off-by: Gaku Inami <gaku.inami.xw@bp.renesas.com>
+---
+ Documentation/scheduler/sched-energy.txt | 362 +++++++++++++++++++++++++++++++
+ 1 file changed, 362 insertions(+)
+ create mode 100644 Documentation/scheduler/sched-energy.txt
+
+diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
+new file mode 100644
+index 0000000..dab2f90
+--- /dev/null
++++ b/Documentation/scheduler/sched-energy.txt
+@@ -0,0 +1,362 @@
++Energy cost model for energy-aware scheduling (EXPERIMENTAL)
++
++Introduction
++============
++
++The basic energy model uses platform energy data stored in sched_group_energy
++data structures attached to the sched_groups in the sched_domain hierarchy. The
++energy cost model offers two functions that can be used to guide scheduling
++decisions:
++
++1. static unsigned int sched_group_energy(struct energy_env *eenv)
++2. static int energy_diff(struct energy_env *eenv)
++
++sched_group_energy() estimates the energy consumed by all cpus in a specific
++sched_group, including any shared resources owned exclusively by this group of
++cpus. Resources shared with other cpus are excluded (e.g. later level caches).
++
++energy_diff() estimates the total energy impact of a utilization change, that
++is, of adding, removing, or migrating utilization (tasks).
++
++Both functions use a struct energy_env to specify the scenario to be evaluated:
++
++        struct energy_env {
++                struct sched_group      *sg_top;
++                struct sched_group      *sg_cap;
++                int                     cap_idx;
++                int                     util_delta;
++                int                     src_cpu;
++                int                     dst_cpu;
++                int                     energy;
++        };
++
++sg_top: sched_group to be evaluated. Not used by energy_diff().
++
++sg_cap: sched_group covering the cpus in the same frequency domain. Set by
++sched_group_energy().
++
++cap_idx: Capacity state to be used for energy calculations. Set by
++find_new_capacity().
++
++util_delta: Amount of utilization to be added, removed, or migrated.
++
++src_cpu: Source cpu from which 'util_delta' utilization is removed. Should be
++-1 if there is no source (e.g. task wake-up).
++
++dst_cpu: Destination cpu to which 'util_delta' utilization is added. Should be
++-1 if utilization is removed (e.g. terminating tasks).
++
++energy: Result of sched_group_energy().
++
++The metric used to represent utilization is the actual per-entity running time
++averaged over time using a geometric series. It is very similar to the existing
++per-entity load-tracking, but it is _not_ scaled by task priority and it is
++capped by the capacity of the cpu. The latter property means that utilization
++may underestimate the compute requirements of tasks on fully/over-utilized
++cpus. The greatest potential for energy savings without affecting performance
++too much is in scenarios where the system isn't fully utilized. If the system
++is deemed fully utilized, load-balancing should instead be done with task load
++(which includes task priority) in the interest of fairness and performance.
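++
++As an illustration of how these fields are meant to be used, the sketch below
++fills in an energy_env for a task migration and queries energy_diff(). This is
++not actual kernel code: the helper name is made up, energy_diff() is a static
++scheduler-internal function, and the sign convention (a negative result read as
++an expected energy saving) is an assumption for the purpose of the example.
++
++        /* Hypothetical helper illustrating the energy_env fields. */
++        static int task_migration_energy_delta(int src, int dst, int task_util)
++        {
++                struct energy_env eenv = {
++                        .util_delta = task_util, /* utilization being moved */
++                        .src_cpu    = src,       /* would be -1 for a wake-up */
++                        .dst_cpu    = dst,       /* would be -1 when removing */
++                };
++
++                /*
++                 * sg_cap and cap_idx are set internally by sched_group_energy()
++                 * and find_new_capacity(); sg_top is not used by energy_diff().
++                 */
++                return energy_diff(&eenv);
++        }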
++
++
++Background and Terminology
++==========================
++
++To make it clear from the start:
++
++energy = [joule] (resource like a battery on powered devices)
++power = energy/time = [joule/second] = [watt]
++
++The goal of energy-aware scheduling is to minimize energy, while still getting
++the job done. That is, we want to maximize:
++
++        performance [inst/s]
++        --------------------
++            power [W]
++
++which is equivalent to minimizing:
++
++        energy [J]
++        -----------
++        instruction
++
++while still getting 'good' performance. It is essentially an alternative
++optimization objective to the current performance-only objective for the
++scheduler. This alternative considers two objectives: energy-efficiency and
++performance. Hence, there needs to be a user-controllable knob to switch the
++objective. Since it is early days, this is currently a sched_feature
++(ENERGY_AWARE).
++
++The idea behind introducing an energy cost model is to allow the scheduler to
++evaluate the implications of its decisions rather than blindly applying
++energy-saving techniques that may only have positive effects on some platforms.
++At the same time, the energy cost model must be as simple as possible to
++minimize the scheduler latency impact.
++
++Platform topology
++-----------------
++
++The system topology (cpus, caches, and NUMA information, not peripherals) is
++represented in the scheduler by the sched_domain hierarchy which has
++sched_groups attached at each level that cover one or more cpus (see
++sched-domains.txt for more details). To add energy awareness to the scheduler
++we need to consider power and frequency domains.
++
++Power domain:
++
++A power domain is a part of the system that can be powered on/off
++independently. Power domains are typically organized in a hierarchy where you
++may be able to power down just a cpu or a group of cpus along with any
++associated resources (e.g. shared caches). Powering up a cpu means that all
++power domains it is a part of in the hierarchy must be powered up. Hence, it is
++more expensive to power up the first cpu that belongs to a higher-level power
++domain than to power up additional cpus in the same high-level domain. A
++two-level power domain hierarchy example:
++
++            Power source
++                      +-------------------------------+----...
++per group PD          G                               G
++                      |           +----------+        |
++                 +--------+-------| Shared   |  (other groups)
++per-cpu PD       G        G       | resource |
++                 |        |       +----------+
++             +-------+ +-------+
++             | CPU 0 | | CPU 1 |
++             +-------+ +-------+
++
++Frequency domain:
++
++Frequency domains (P-states) typically cover the same group of cpus as one of
++the power domain levels. That is, there might be several smaller power domains
++sharing the same frequency (P-state) or there might be a power domain spanning
++multiple frequency domains.
++
++From a scheduling point of view there is no need to know the actual frequencies
++[Hz]. All the scheduler cares about is the compute capacity available at the
++current state (P-state) the cpu is in and any other available states. For that
++reason, and to also factor in any cpu micro-architecture differences, compute
++capacity scaling states are called 'capacity states' in this document. For SMP
++systems this is equivalent to P-states. For mixed micro-architecture systems
++(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
++performance relative to the other cpus in the system.
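++
++As a purely illustrative sketch of that scaling (the frequencies and the
++relative performance factor below are invented, not taken from any real
++platform), capacity values for the P-states of a 'little' cpu could be derived
++like this. The sketch takes 1024 to be the capacity of the fastest cpu at its
++highest P-state and expresses everything else relative to it:
++
++        #include <stdio.h>
++
++        int main(void)
++        {
++                /* Hypothetical little-cpu P-state frequencies in kHz. */
++                const unsigned int freq[] = { 350000, 500000, 700000, 1000000 };
++                const unsigned int nr_states = sizeof(freq) / sizeof(freq[0]);
++                /* Assumed: the fastest cpu runs at 1.4 GHz and has roughly
++                 * 1.9x the per-clock performance of the little cpu. */
++                const double max_freq = 1400000.0;
++                const double rel_perf = 1.0 / 1.9;
++                unsigned int i;
++
++                for (i = 0; i < nr_states; i++)
++                        printf("P-state %u: %u kHz -> capacity %u\n",
++                               i, freq[i],
++                               (unsigned int)(1024.0 * rel_perf * freq[i] / max_freq));
++                return 0;
++        }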
++
++Energy modelling:
++-----------------
++
++Due to the hierarchical nature of the power domains, the most obvious way to
++model energy costs is to associate power and energy costs with domains (groups
++of cpus). Energy costs of shared resources are associated with the group of
++cpus that share the resources; only the cost of powering the cpu itself and any
++private resources (e.g. private L1 caches) is associated with the per-cpu
++groups (lowest level).
++
++For example, for an SMP system with per-cpu power domains and a cluster-level
++(group of cpus) power domain, the overall energy cost is:
++
++        energy = energy_cluster + n * energy_cpu
++
++where 'n' is the number of cpus powered up and energy_cluster is the cost paid
++as soon as any cpu in the cluster is powered up.
++
++The power and frequency domains can naturally be mapped onto the existing
++sched_domain hierarchy and sched_groups by adding the necessary data to the
++existing data structures.
++
++The energy model considers energy consumption from two contributors (shown in
++the illustration below):
++
++1. Busy energy: Energy consumed while a cpu and the higher-level groups that it
++belongs to are busy running tasks. Busy energy is associated with the state of
++the cpu, not an event. The time the cpu spends in this state varies. Thus, the
++most obvious platform parameter for this contribution is busy power
++(energy/time).
++
++2. Idle energy: Energy consumed while a cpu and the higher-level groups that it
++belongs to are idle (in a C-state). Like busy energy, idle energy is associated
++with the state of the cpu. Thus, the platform parameter for this contribution
++is idle power (energy/time).
++
++Energy consumed during transitions from an idle-state (C-state) to a busy state
++(P-state) or going the other way is ignored by the model to simplify the energy
++model calculations.
++
++
++        Power
++        ^
++        |            busy->idle             idle->busy
++        |            transition             transition
++        |
++        |                _                      __
++        |               / \                    /  \__________________
++        |______________/   \                  /
++        |                   \                /
++        |  Busy              \    Idle      /        Busy
++        |  low P-state        \____________/         high P-state
++        |
++        +------------------------------------------------------------> time
++
++Busy    |--------------|                          |-----------------|
++
++Wakeup                 |------|                   |------|
++
++Idle                          |------------|
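++
++To make the energy_cluster/energy_cpu decomposition above concrete (the power
++numbers below are invented purely for illustration), the cost of keeping 'n'
++cpus of one cluster busy over the same interval can be sketched as:
++
++        #include <stdio.h>
++
++        int main(void)
++        {
++                /* Hypothetical busy power numbers, in mW. */
++                const unsigned int cluster_power = 150; /* shared cache, interconnect */
++                const unsigned int cpu_power = 400;     /* one fully busy cpu */
++                unsigned int n;
++
++                for (n = 1; n <= 4; n++)
++                        printf("%u busy cpu(s): %u mW\n",
++                               n, cluster_power + n * cpu_power);
++                return 0;
++        }
++
++With these made-up numbers the first busy cpu costs 550 mW because it also
++brings up the cluster, while each additional cpu only adds 400 mW, which is why
++consolidating work on an already powered-up cluster can be cheaper than
++spreading it out.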
++
++
++The basic algorithm
++===================
++
++The basic idea is to determine the total energy impact when utilization is
++added or removed by estimating the impact at each level in the sched_domain
++hierarchy, starting from the bottom (where a sched_group contains just a single
++cpu). The energy cost comes from busy time (the sched_group is awake because
++one or more cpus are busy) and from idle time (in an idle-state). Energy model
++numbers account for the energy costs associated with all cpus in the
++sched_group as a group.
++
++        for_each_domain(cpu, sd) {
++                sg = sched_group_of(cpu)
++                energy_before = curr_util(sg) * busy_power(sg)
++                                + (1-curr_util(sg)) * idle_power(sg)
++                energy_after = new_util(sg) * busy_power(sg)
++                                + (1-new_util(sg)) * idle_power(sg)
++                energy_diff += energy_before - energy_after
++        }
++
++        return energy_diff
++
++{curr, new}_util: The cpu utilization at the lowest level and the overall
++non-idle time for the entire group for higher levels. Utilization is in the
++range 0.0 to 1.0 in the pseudo-code.
++
++busy_power: The power consumption of the sched_group.
++
++idle_power: The power consumption of the sched_group when idle.
++
++Note: It is a fundamental assumption that the utilization is (roughly)
++scale-invariant. Task utilization tracking factors in any frequency scaling and
++any performance differences between cpu micro-architectures, such that task
++utilization can be used across the entire system.
++
++
++Platform energy data
++====================
++
++struct sched_group_energy can be attached to sched_groups in the sched_domain
++hierarchy and has the following members:
++
++cap_states:
++        List of struct capacity_state representing the supported capacity
++        states (P-states). struct capacity_state has two members: cap and
++        power, which represent the compute capacity and the busy power of the
++        state. The list must be ordered by capacity, low->high.
++
++nr_cap_states:
++        Number of capacity states in the cap_states list.
++
++idle_states:
++        List of struct idle_state containing the idle-state power cost for
++        each idle-state supported by the system, ordered shallowest state
++        first. All states must be included at all levels in the hierarchy,
++        i.e. a sched_group spanning just a single cpu must also include
++        coupled idle-states (cluster states). In addition to the cpuidle
++        idle-states, the list must also contain an entry for idling using the
++        arch default idle (arch_idle_cpu()). Although this state may not be a
++        true hardware idle-state, it is considered the shallowest idle-state
++        in the energy model and must be the first entry. cpus may enter this
++        state (possibly 'active idling') if cpuidle decides not to enter a
++        cpuidle idle-state. Default idle may not be used when cpuidle is
++        enabled. In that case, it should just be a copy of the first cpuidle
++        idle-state.
++
++nr_idle_states:
++        Number of idle states in the idle_states list.
++
++There are no unit requirements for the energy cost data. Data can be normalized
++with any reference; however, the normalization must be consistent across all
++energy cost data. That is, one bogo-joule/watt must be the same quantity for
++all data, but we don't care what it is.
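++
++As an illustration of the data layout just described (member names follow this
++document; the numbers are fictitious and the initializers are not taken from
++any real platform code), the energy data attached to a cpu-level sched_group
++could look like:
++
++        static struct capacity_state cap_states_cpu[] = {
++                /* .cap = compute capacity, .power = busy power at that state */
++                { .cap =  358, .power = 141 },
++                { .cap =  512, .power = 222 },
++                { .cap =  717, .power = 378 },
++                { .cap = 1024, .power = 729 }, /* ordered low->high */
++        };
++
++        static struct idle_state idle_states_cpu[] = {
++                { .power = 10 }, /* arch default idle ('active idle'), first */
++                { .power =  5 }, /* cpuidle: WFI */
++                { .power =  0 }, /* cpuidle: cpu off */
++                { .power =  0 }, /* cluster off: coupled state, listed at the
++                                  * cpu level too, as required above */
++        };
++
++        static struct sched_group_energy energy_cpu = {
++                .nr_cap_states  = ARRAY_SIZE(cap_states_cpu),
++                .cap_states     = cap_states_cpu,
++                .nr_idle_states = ARRAY_SIZE(idle_states_cpu),
++                .idle_states    = idle_states_cpu,
++        };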
++
++A recipe for platform characterization
++======================================
++
++Obtaining the actual model data for a particular platform requires some way of
++measuring power/energy. There isn't a tool to help with this (yet). This
++section provides a recipe for use as reference. It covers the steps used to
++characterize the ARM TC2 development platform. This sort of measurement is
++expected to be done anyway when tuning cpuidle and cpufreq for a given
++platform.
++
++The energy model needs two types of data (struct sched_group_energy holds
++these) for each sched_group where energy costs should be taken into account:
++
++1. Capacity state information
++
++A list containing, for each available capacity state, the compute capacity and
++the power consumption when fully utilized, attributed to the group as a whole.
++At the lowest level (where a group contains just a single cpu) this is the
++power of the cpu alone, without including power consumed by resources shared
++with other cpus. It basically needs to fit the basic modelling approach
++described in the "Background and Terminology" section:
++
++        energy_system = energy_shared + n * energy_cpu
++
++for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
++the lowest level. 'energy_shared' is included at the next level, which
++represents the group of cpus among which the resources are shared.
++
++This model is, of course, a simplification of reality. Thus, power/energy
++attributions might not always exactly represent how the hardware is designed.
++Also, busy power is likely to depend on the workload. It is therefore
++recommended to use a representative mix of workloads when characterizing the
++capacity states.
++
++If the group has no capacity scaling support, the list will contain a single
++state where power is the busy power attributed to the group. The capacity
++should be set to a default value (1024).
++
++When frequency domains include multiple power domains, the group representing
++the frequency domain and all child groups share capacity states. This must be
++indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
++all levels that share the capacity states must have the list of capacity states
++with the power set to the contribution of the individual group.
++
++2. Idle power information
++
++Stored in the idle_states list. The power number is the group idle power
++consumption in each idle state, as well as when the group is idle but has not
++entered an idle-state ('active idle' as mentioned earlier). Due to the way the
++energy model is defined, the idle power of the deepest group idle state can
++alternatively be accounted for in the parent group busy power. In that case the
++group idle-state power values are offset such that the idle power of the
++deepest state is zero. This is less intuitive, but it is easier to measure, as
++the idle power consumed by the group and the busy/idle power of the parent
++group cannot be distinguished without per-group measurement points.
++
++Measuring capacity states and idle power:
++
++The capacity states' capacity and power can be estimated by running a benchmark
++workload at each available capacity state. By restricting the benchmark to run
++on subsets of cpus it is possible to extrapolate the power consumption of
++shared resources.
++
++ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
++shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
++benchmark workload on just one cpu in a cluster means that power is consumed in
++the cluster (higher-level group) and in a single cpu (lowest-level group).
++Adding another benchmark task on another cpu increases the power consumption by
++the amount consumed by the additional cpu. Hence, it is possible to extrapolate
++the cluster busy power.
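++
++A sketch of that extrapolation (the counter-derived power readings below are
++invented for illustration; real numbers would come from the TC2 energy counters
++or an external probe):
++
++        #include <stdio.h>
++
++        int main(void)
++        {
++                /* Hypothetical cluster-level power readings, in mW, for the
++                 * same benchmark and P-state. */
++                const double one_cpu_busy = 550.0;   /* 1 cpu + shared resources */
++                const double two_cpus_busy = 950.0;  /* 2 cpus + shared resources */
++
++                double cpu_busy_power = two_cpus_busy - one_cpu_busy;
++                double cluster_busy_power = one_cpu_busy - cpu_busy_power;
++
++                printf("per-cpu busy power:  %.0f mW\n", cpu_busy_power);
++                printf("cluster busy power:  %.0f mW\n", cluster_busy_power);
++                return 0;
++        }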
++
++For platforms that don't have energy counters or equivalent instrumentation
++built in, it may be possible to use an external DAQ to acquire similar data.
++
++If the benchmark includes some performance score (for example the sysbench cpu
++benchmark), this can be used to record the compute capacity.
++
++Measuring idle power requires insight into the idle-state implementation on the
++particular platform, specifically whether the platform has coupled idle-states
++(or package states). To measure non-coupled per-cpu idle-states it is necessary
++to keep one cpu busy so that any shared resources stay alive, which isolates
++the idle power of the cpu from the idle/busy power of the shared resources. The
++cpu can be tricked into different per-cpu idle states by disabling the other
++states. Based on various combinations of measurements with specific cpus busy
++and specific idle-states disabled, it is possible to extrapolate the idle-state
++power.
+--
+1.9.1
+