Feature Introduction
Pod CPU Priorities
Rubik allows you to configure CPU priorities of services. In the hybrid deployment of online and offline services, Rubik ensures that online services preempt CPU resources.
Prerequisites
- The kernel of openEuler 22.03 or later is recommended. The kernel supports CPU priority configuration based on control groups (cgroups). The CPU subsystem provides the cpu.qos_level interface.
CPU Priority Kernel Interface
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/cpu directory, for example, /sys/fs/cgroup/cpu/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/.
- cpu.qos_level: enables the CPU priority configuration. The value can be 0 or -1, with 0 being the default.
  - 0 indicates an online service.
  - -1 indicates an offline service.
CPU Priority Configuration
Rubik automatically configures cpu.qos_level based on the annotation volcano.sh/preemptable in the YAML file of the pod. The default value is false.
```yaml
annotations:
  volcano.sh/preemptable: "true"
```
- true indicates an offline service.
- false indicates an online service.
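The annotation-to-cgroup mapping described above can be sketched as follows. This is a simplified illustration based on this description only, not Rubik's actual code; the helper name `qos_level` is hypothetical.

```python
# Sketch (not Rubik's source): map the volcano.sh/preemptable annotation to the
# cpu.qos_level / memory.qos_level cgroup value.
# "true" (offline) -> -1; anything else, including a missing annotation -> 0 (online).
def qos_level(annotations: dict) -> int:
    value = annotations.get("volcano.sh/preemptable", "false")
    return -1 if value == "true" else 0

print(qos_level({"volcano.sh/preemptable": "true"}))  # offline pod
print(qos_level({}))                                  # online pod (default)
```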
Pod Memory Priorities
Rubik allows you to configure memory priorities of services. In the hybrid deployment of online and offline services, Rubik ensures that offline services are first terminated in the case of out-of-memory (OOM).
Prerequisites
- openEuler 22.03 or later is recommended. The kernel of openEuler 22.03 or later supports memory priority configuration based on cgroups, that is, the memory subsystem interface memory.qos_level.
- To enable the memory priority feature, run:

```shell
echo 1 > /proc/sys/vm/memcg_qos_enable
```
Memory Priority Kernel Interface
- /proc/sys/vm/memcg_qos_enable: enables the memory priority feature. The value can be 0 or 1, with 0 being the default. You can run `echo 1 > /proc/sys/vm/memcg_qos_enable` to enable the feature.
  - 0: The feature is disabled.
  - 1: The feature is enabled.
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/.
- memory.qos_level: enables the memory priority configuration. The value can be 0 or -1, with 0 being the default.
  - 0 indicates an online service.
  - -1 indicates an offline service.
Memory Priority Configuration
Rubik automatically configures memory.qos_level based on the annotation volcano.sh/preemptable in the YAML file of the pod. See CPU Priority Configuration.
dynCache Memory Bandwidth and L3 Cache Access Limit
Rubik can limit pod memory bandwidth and L3 cache access for offline services to reduce the impact on online services.
Prerequisites
The cache access and memory bandwidth limit feature supports only physical machines.
- For x86 physical machines, the CAT and MBA functions of Intel RDT must be enabled in the OS by adding rdt=l3cat,mba to the kernel command line parameters (cmdline).
- For ARM physical machines, the MPAM function must be enabled in the OS by adding mpam=acpi to the kernel command line parameters (cmdline).
Due to kernel restrictions, RDT does not support the pseudo-locksetup mode.
New Permissions and Directories of Rubik
- Mount point: /sys/fs/resctrl. Rubik reads and sets files in the /sys/fs/resctrl directory. This directory must be mounted before Rubik is started and cannot be unmounted during Rubik running.
- Permission: SYS_ADMIN. To set files in the /sys/fs/resctrl directory on the host, the SYS_ADMIN permission must be assigned to the Rubik container.
- Namespace: PID namespace. Rubik obtains the PID of the service container process on the host. Therefore, the Rubik container needs to share the PID namespace with the host.
Rubik RDT Cgroups
Rubik creates five cgroups (rubik_max, rubik_high, rubik_middle, rubik_low and rubik_dynamic) in the RDT resctrl directory (/sys/fs/resctrl by default). Rubik writes the watermarks to the schemata file of each corresponding cgroup upon startup. The low, middle, and high watermarks can be configured in cacheConfig. The max cgroup uses the default maximum value. The initial watermark of the dynamic cgroup is the same as that of the low cgroup.
When an offline service pod is started, the cache level is set based on the volcano.sh/cache-limit annotation and added to the specified cgroup. For example, the pod with the following configuration is added to the rubik_low cgroup:
```yaml
annotations:
  volcano.sh/cache-limit: "low"
```
Rubik dynamic Cgroup
When offline pods whose cache level is dynamic exist, Rubik collects the cache miss and LLC miss metrics of online service pods on the current node and adjusts the watermark of the rubik_dynamic cgroup. In this way, Rubik dynamically controls offline service pods in the dynamic cgroup.
dynCache Kernel Interface
- Rubik creates five cgroup directories in /sys/fs/resctrl and modifies the schemata and tasks files of each cgroup.
dynCache Configuration
The dynCache function is configured in cacheConfig:
"cacheConfig": {
"enable": false,
"defaultLimitMode": "static",
"adjustInterval": 1000,
"perfDuration": 1000,
"l3Percent": {
"low": 20,
"mid": 30,
"high": 50
},
"memBandPercent": {
"low": 10,
"mid": 30,
"high": 50
}
},
l3Percent and memBandPercent: configure the L3 cache and memory bandwidth watermarks (in percent) of the low, mid, and high cgroups.
Assume that in the current environment rdt bitmask=fffff and numa=2. Based on the low value of l3Percent (20) and the low value of memBandPercent (10), Rubik configures /sys/fs/resctrl/rubik_low as follows:

```text
L3:0=f;1=f
MB:0=10;1=10
```
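The derivation of those schemata entries can be sketched as follows. The logic is assumed from the example above (percent of the cache bitmask bits, memory bandwidth written as a plain percentage per NUMA node), not taken from Rubik's source.

```python
# Illustrative sketch: derive resctrl schemata entries for a cgroup from the
# configured l3Percent / memBandPercent values.
def schemata(cbm_mask: str, numa_nodes: int, l3_percent: int, mb_percent: int):
    total_bits = int(cbm_mask, 16).bit_length()        # "fffff" -> 20 bits
    use_bits = max(1, total_bits * l3_percent // 100)  # 20% of 20 bits -> 4 bits
    cbm = format((1 << use_bits) - 1, "x")             # 4 bits -> "f"
    l3 = "L3:" + ";".join(f"{n}={cbm}" for n in range(numa_nodes))
    mb = "MB:" + ";".join(f"{n}={mb_percent}" for n in range(numa_nodes))
    return l3, mb

print(schemata("fffff", 2, 20, 10))  # reproduces the rubik_low example above
```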
defaultLimitMode: If the volcano.sh/cache-limit annotation is not specified for an offline pod, the defaultLimitMode of cacheConfig determines the cgroup to which the pod is added.
- If defaultLimitMode is static, the pod is added to the rubik_max cgroup.
- If defaultLimitMode is dynamic, the pod is added to the rubik_dynamic cgroup.
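The cgroup-selection rule above can be summarized in a short sketch; the helper name `target_cgroup` is hypothetical, and the logic is a restatement of the two bullets rather than Rubik's implementation.

```python
# Sketch of the cgroup-selection rule for offline pods:
# the annotation wins; otherwise defaultLimitMode decides.
def target_cgroup(annotations: dict, default_limit_mode: str) -> str:
    level = annotations.get("volcano.sh/cache-limit")
    if level is not None:
        return f"rubik_{level}"          # e.g. "low" -> rubik_low
    # No annotation: fall back to defaultLimitMode of cacheConfig
    return "rubik_max" if default_limit_mode == "static" else "rubik_dynamic"
```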
adjustInterval: interval for dynCache to dynamically adjust the rubik_dynamic cgroup, in milliseconds. The default value is 1000.
perfDuration: perf execution duration for dynCache, in milliseconds. The default value is 1000.
Precautions for dynCache
- dynCache takes effect only for offline pods.
- If a service container is manually restarted during running (the container ID remains unchanged but the container process ID changes), dynCache does not take effect for the container.
- After a service container is started and the dynCache level is set, the limit level cannot be changed.
- The sensitivity of adjusting the dynamic cgroup is affected by the adjustInterval and perfDuration values in the Rubik configuration file and by the number of online service pods on the node. If the impact detection result indicates that adjustment is required, the adjustment interval fluctuates within the range [adjustInterval + perfDuration, adjustInterval + perfDuration x number of pods]. You can set the configuration items based on your required sensitivity.
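The interval range stated above is simple arithmetic; a quick check with the default values:

```python
# Bounds of the dynamic-cgroup adjustment interval, per the range given above
# (all values in milliseconds; n_online_pods is the online pod count on the node).
def adjust_interval_range(adjust_interval: int, perf_duration: int, n_online_pods: int):
    lower = adjust_interval + perf_duration
    upper = adjust_interval + perf_duration * n_online_pods
    return lower, upper

print(adjust_interval_range(1000, 1000, 4))  # default config, 4 online pods
```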
blkio
The blkio configuration of a pod is specified by the annotation volcano.sh/blkio-limit. The configuration is applied upon pod creation and can be dynamically modified using kubectl annotate during pod running. Both offline and online pods are supported.
The configuration consists of four lists:
| Item | Description |
| --- | --- |
| device_read_bps | Specifies the maximum number of bytes per second read from one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
| device_read_iops | Specifies the maximum number of read operations per second for one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
| device_write_bps | Specifies the maximum number of bytes per second written to one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
| device_write_iops | Specifies the maximum number of write operations per second for one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
blkio Kernel Interface
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/blkio directory, for example, /sys/fs/cgroup/blkio/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. The annotation values are written into the following files:
  - blkio.throttle.read_bps_device
  - blkio.throttle.read_iops_device
  - blkio.throttle.write_bps_device
  - blkio.throttle.write_iops_device

  Key-value pairs in the list are in the same format as those of the cgroup interfaces.
- When the bps values are written into the pod configuration, they are converted to a multiple of the page size of the environment.
- The configuration takes effect only for devices whose minor value is 0.
- To cancel the limit, set the corresponding value to 0.
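The page-size conversion mentioned above can be sketched as follows. Rounding down and a page size of 4096 bytes are assumptions for illustration; the source only states that values are converted to a multiple of the page size.

```python
# Hedged sketch of the bps conversion: round a requested bytes-per-second value
# down to a multiple of the page size (4096 assumed here) before it is written
# to blkio.throttle.*_bps_device.
PAGE_SIZE = 4096  # assumed; query the environment in real use

def round_bps(value: int) -> int:
    return value // PAGE_SIZE * PAGE_SIZE

print(round_bps(10485760))  # already a multiple of 4096, unchanged
print(round_bps(1000000))   # rounded down to the nearest multiple
```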
blkio Configuration
Enabling or Disabling the blkio Module of Rubik
The blkio module can be enabled or disabled in blkioConfig.
"blkioConfig": {
"enable": true
}
- enable: whether to enable the I/O control module. The default value is false.
Pod Configuration Example
Four lists can be specified by the pod annotation: device_read_bps, device_read_iops, device_write_bps, and device_write_iops.
In the YAML file upon pod creation:
```yaml
volcano.sh/blkio-limit: '{"device_read_bps":[{"device":"/dev/sda1","value":"10485760"}, {"device":"/dev/sda","value":"20971520"}], "device_write_bps":[{"device":"/dev/sda1","value":"20971520"}], "device_read_iops":[{"device":"/dev/sda1","value":"200"}], "device_write_iops":[{"device":"/dev/sda1","value":"300"}]}'
```
Modify annotation: You can run the kubectl annotate command to dynamically modify annotation. For example:
```shell
kubectl annotate --overwrite pods <podname> volcano.sh/blkio-limit='{"device_read_bps":[{"device":"/dev/vda", "value":"211715200"}]}'
```
memory
Rubik supports multiple memory strategies. You can apply different memory allocation methods to different scenarios.
dynlevel: kernel cgroup-based multi-level control. Rubik monitors node memory usage to dynamically adjust the memory cgroup of offline services, ensuring the quality of online services.
fssr: kernel cgroup-based dynamic watermark control. memory.high is a memcg-level watermark interface provided by the kernel. Rubik continuously detects memory usage and dynamically adjusts the memory.high limit of offline services to suppress the memory usage of offline services, ensuring the quality of online services.
memory dynlevel Strategy Kernel Interface
The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. When the dynlevel strategy is used, Rubik adjusts the following values of offline service containers based on the memory usage of the current node:
- memory.soft_limit_in_bytes
- memory.force_empty
- memory.limit_in_bytes
- /proc/sys/vm/drop_caches
Memory dynlevel Strategy Configuration
The strategy and check interval of the memory module can be specified in memoryConfig:
"memoryConfig": {
"enable": true,
"strategy": "none",
"checkInterval": 5
}
enable: whether to enable the module.
The value of strategy can be dynlevel, fssr, or none. The default value is none.
- none: Memory is not dynamically adjusted.
- dynlevel: Dynamic reclamation strategy
- fssr: fast suppression and slow recovery strategy
  1. Upon Rubik startup, memory.high of all offline services is set to 80% of the total memory by default.
  2. When freeMemory (available memory) is less than reservedMemory (totalMemory x 5%), memory.high of all offline services is decreased by 10% of totalMemory: the latest memory.high = memory.high - totalMemory x 10%.
  3. If the available memory stays sufficient for a period of time, that is, freeMemory is more than 3 x reservedMemory, 1% of totalMemory is released back to offline services: the latest memory.high = memory.high + totalMemory x 1%. memory.high is increased repeatedly until freeMemory is between 1 and 3 x reservedMemory.
checkInterval specifies the check interval of the strategy, in seconds. The default value is 5.
memory fssr Strategy Kernel Interface
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. When the fssr strategy is used, Rubik adjusts the following value of offline service containers based on the memory usage of the current node:
  - memory.high
memory fssr Strategy Configuration
The strategy and check interval of the memory module can be specified in memoryConfig:
"memoryConfig": {
"enable": true,
"strategy": "fssr",
"checkInterval": 5
}
enable: whether to enable the module.
The value of strategy can be dynlevel, fssr, or none. The default value is none.
- none: Memory is not dynamically adjusted.
- dynlevel: Dynamic reclamation strategy
- fssr: fast suppression and slow recovery strategy
  1. Upon Rubik startup, memory.high of all offline services is set to 80% of the total memory by default.
  2. When freeMemory (available memory) is less than reservedMemory (totalMemory x 5%), memory.high of all offline services is decreased by 10% of totalMemory: the latest memory.high = memory.high - totalMemory x 10%.
  3. If the available memory stays sufficient for a period of time, that is, freeMemory is more than 3 x reservedMemory, 1% of totalMemory is released back to offline services: the latest memory.high = memory.high + totalMemory x 1%. memory.high is increased repeatedly until freeMemory is between 1 and 3 x reservedMemory.
checkInterval specifies the check interval of the strategy, in seconds. The default value is 5.
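One fssr check cycle can be simulated from the description above. This is an illustrative restatement, not Rubik's code: it ignores the "sufficient for a period of time" condition (a real implementation would require the headroom to persist across several checks), and the reservedMemory thresholds are taken directly from the text.

```python
# Illustrative simulation of one fssr check: fast suppression (-10% of total)
# under memory pressure, slow recovery (+1% of total) with sustained headroom.
def fssr_step(memory_high: int, total: int, free: int) -> int:
    reserved = total * 5 // 100          # reservedMemory = totalMemory x 5%
    if free < reserved:                  # pressure: suppress offline services fast
        memory_high -= total * 10 // 100
    elif free > 3 * reserved:            # headroom: recover slowly
        memory_high += total * 1 // 100
    return memory_high                   # otherwise: leave memory.high unchanged

total = 100 * 1024**3                    # a 100 GiB node, for illustration
high = total * 80 // 100                 # initial watermark: 80% of totalMemory
high = fssr_step(high, total, free=2 * 1024**3)  # free < reserved -> suppressed
```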
quota burst
The quota burst configuration of a pod is specified by the annotation volcano.sh/quota-burst-time. The configuration is applied upon pod creation and can be dynamically modified using kubectl annotate during pod running. Both offline and online pods are supported.
The default unit of quota burst of a pod is microseconds. Rubik allows a container to accumulate CPU resources when the CPU usage of the container is lower than the quota and uses the accumulated CPU resources when the CPU usage exceeds the quota.
quota burst kernel interface
The interface exists in the cgroup of the container in the /sys/fs/cgroup/cpu directory, for example, /sys/fs/cgroup/cpu/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. The annotation value is written into the following file:
- cpu.cfs_burst_us
The same restriction applies to the values of volcano.sh/quota-burst-time and cpu.cfs_burst_us.
- When the value of cpu.cfs_quota_us is not -1, the following conditions must be met: cpu.cfs_burst_us + cpu.cfs_quota_us <= 2^44-1 and cpu.cfs_burst_us <= cpu.cfs_quota_us.
- When cpu.cfs_quota_us is -1, the maximum value of cpu.cfs_burst_us is not limited and depends on the maximum value that can be set in the system.
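The two constraints above can be expressed as a small validity check; the helper name `burst_is_valid` is hypothetical and only restates the rules listed here.

```python
# Validity check for cpu.cfs_burst_us values, per the constraints above
# (values in microseconds; quota_us == -1 means "no quota", as in cpu.cfs_quota_us).
MAX_US = 2**44 - 1

def burst_is_valid(burst_us: int, quota_us: int) -> bool:
    if quota_us == -1:
        return True  # burst not constrained by these rules when quota is unlimited
    return burst_us + quota_us <= MAX_US and burst_us <= quota_us
```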
Pod configuration example
In the YAML file upon pod creation:
```yaml
metadata:
  annotations:
    volcano.sh/quota-burst-time: "2000"
```
Modify annotation: You can run the kubectl annotate command to dynamically modify annotation. For example:
```shell
kubectl annotate --overwrite pods <podname> volcano.sh/quota-burst-time='3000'
```
I/O Weight Control Based on iocost
Dependency Description
Rubik can control the I/O weight distribution of different pods through iocost of cgroup v1. Therefore, the kernel must support the following features:
- blkcg iocost of cgroup v1
- writeback of cgroup v1
Rubik Implementation Description
The procedure of the Rubik implementation is as follows:
- When Rubik is deployed, Rubik parses the configuration and sets iocost parameters.
- Rubik registers the detection event to the Kubernetes API server.
- When a pod is deployed, the pod configuration information is written back to Rubik.
- Rubik parses the pod configuration information and configures the pod iocost weight based on the QoS level.
Rubik Protocol Description
"nodeConfig": [
{
"nodeName": "slaver01",
"iocostEnable": true,
"iocostConfig": [
{
"dev": "sda",
"enable": false,
"model": "linear",
"param": {
"rbps": 174610612,
"rseqiops": 41788,
"rrandiops": 371,
"wbps": 178587889,
"wseqiops": 42792,
"wrandiops": 379
}
}
]
}
]
| Item | Type | Description |
| --- | --- | --- |
| nodeConfig | Array | Node configuration information |
| nodeName | String | Name of the node to be configured |
| iocostEnable | Bool | Whether to enable iocost for the node |
| iocostConfig | Array | Configuration array for different physical drives. This parameter is read when iocostEnable is set to true. |
| dev | String | Physical drive name |
| enable | Bool | Whether to enable iocost for the physical drive |
| model | String | Name of the iocost model. linear is the linear model provided by the kernel. |
| param | Object | Model parameters. When model is set to linear, the parameters below are the linear model parameters. |
| r(w)bps | int64 | Maximum read (write) bandwidth of the physical block device |
| r(w)seqiops | int64 | Maximum sequential read (write) IOPS of the physical block device |
| r(w)randiops | int64 | Maximum random read (write) IOPS of the physical block device |
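As a hedged sketch of how these parameters reach the kernel, the snippet below assembles a linear-model parameter line from one iocostConfig entry. The key=value layout follows the cgroup v2 io.cost.model documentation; the cgroup v1 blkio.cost.model interface mentioned in this section is assumed to be analogous, and "8:0" as sda's device number is an assumption for illustration.

```python
# Hedged sketch: build the linear-model parameter string for an iocost device
# entry (layout per the kernel's io.cost.model documentation; v1 assumed similar).
def linear_model_line(devno: str, param: dict) -> str:
    keys = ("rbps", "rseqiops", "rrandiops", "wbps", "wseqiops", "wrandiops")
    params = " ".join(f"{k}={param[k]}" for k in keys)
    return f"{devno} ctrl=user model=linear {params}"

cfg = {"rbps": 174610612, "rseqiops": 41788, "rrandiops": 371,
       "wbps": 178587889, "wseqiops": 42792, "wrandiops": 379}
print(linear_model_line("8:0", cfg))  # "8:0" assumed to be sda's major:minor
```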
Other
Parameters related to the iocost linear model can be obtained by running the iocost_coef_gen.py script.
The blkio.cost.qos and blkio.cost.model file interfaces exist in the root blkcg directory of the cgroup file system. For details about the implementation and interface description, see the openEuler kernel documentation.