Feature Introduction
Pod CPU Priorities
Rubik allows you to configure CPU priorities of services. In the hybrid deployment of online and offline services, Rubik ensures that online services preempt CPU resources.
Prerequisites
- The kernel of openEuler 22.03 or later is recommended. The kernel supports CPU priority configuration based on control groups (cgroups). The CPU subsystem provides the cpu.qos_level interface.
CPU Priority Kernel Interface
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/cpu directory, for example, /sys/fs/cgroup/cpu/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/.
- cpu.qos_level: enables the CPU priority configuration. The value can be 0 or -1, with 0 being the default.
  - 0 indicates an online service.
  - -1 indicates an offline service.
CPU Priority Configuration
Rubik automatically configures cpu.qos_level based on the annotation volcano.sh/preemptable in the YAML file of the pod. The default value is false.
```yaml
annotations:
  volcano.sh/preemptable: "true"
```
- true indicates an offline service.
- false indicates an online service.
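The annotation-to-cgroup mapping described above can be sketched as follows. This is a simplified illustration based on this description only, not Rubik's actual code; the helper name `qos_level` is hypothetical.

```python
# Sketch (not Rubik's source): map the volcano.sh/preemptable annotation to the
# cpu.qos_level / memory.qos_level cgroup value.
# "true" (offline) -> -1; anything else, including a missing annotation -> 0 (online).
def qos_level(annotations: dict) -> int:
    value = annotations.get("volcano.sh/preemptable", "false")
    return -1 if value == "true" else 0

print(qos_level({"volcano.sh/preemptable": "true"}))  # offline pod
print(qos_level({}))                                  # online pod (default)
```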
Pod Memory Priorities
Rubik allows you to configure memory priorities of services. In the hybrid deployment of online and offline services, Rubik ensures that offline services are first terminated in the case of out-of-memory (OOM).
Prerequisites
- openEuler 22.03 or later is recommended. The kernel of openEuler 22.03 or later supports memory priority configuration based on cgroups, that is, the memory subsystem interface memory.qos_level.
- To enable the memory priority feature, run:

```shell
echo 1 > /proc/sys/vm/memcg_qos_enable
```
Memory Priority Kernel Interface
- /proc/sys/vm/memcg_qos_enable: enables the memory priority feature. The value can be 0 or 1, with 0 being the default. You can run `echo 1 > /proc/sys/vm/memcg_qos_enable` to enable the feature.
  - 0: The feature is disabled.
  - 1: The feature is enabled.
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/.
- memory.qos_level: enables the memory priority configuration. The value can be 0 or -1, with 0 being the default.
  - 0 indicates an online service.
  - -1 indicates an offline service.
Memory Priority Configuration
Rubik automatically configures memory.qos_level based on the annotation volcano.sh/preemptable in the YAML file of the pod. See CPU Priority Configuration.
dynCache Memory Bandwidth and L3 Cache Access Limit
Rubik can limit pod memory bandwidth and L3 cache access for offline services to reduce the impact on online services.
Prerequisites
The cache access and memory bandwidth limit feature supports only physical machines.
- For x86 physical machines, the CAT and MBA functions of Intel RDT must be enabled in the OS by adding rdt=l3cat,mba to the kernel command line parameters (cmdline).
- For ARM physical machines, the MPAM function must be enabled in the OS by adding mpam=acpi to the kernel command line parameters (cmdline).
Due to kernel restrictions, RDT does not support the pseudo-locksetup mode.
New Permissions and Directories of Rubik
- Mount point: /sys/fs/resctrl. Rubik reads and sets files in the /sys/fs/resctrl directory. This directory must be mounted before Rubik is started and cannot be unmounted during Rubik running.
- Permission: SYS_ADMIN. To set files in the /sys/fs/resctrl directory on the host, the SYS_ADMIN permission must be assigned to the Rubik container.
- Namespace: PID namespace. Rubik obtains the PID of the service container process on the host. Therefore, the Rubik container needs to share the PID namespace with the host.
Rubik RDT Cgroups
Rubik creates five cgroups (rubik_max, rubik_high, rubik_middle, rubik_low and rubik_dynamic) in the RDT resctrl directory (/sys/fs/resctrl by default). Rubik writes the watermarks to the schemata file of each corresponding cgroup upon startup. The low, middle, and high watermarks can be configured in cacheConfig. The max cgroup uses the default maximum value. The initial watermark of the dynamic cgroup is the same as that of the low cgroup.
When an offline service pod is started, the cache level is set based on the volcano.sh/cache-limit annotation and added to the specified cgroup. For example, the pod with the following configuration is added to the rubik_low cgroup:
```yaml
annotations:
  volcano.sh/cache-limit: "low"
```
Rubik dynamic Cgroup
When offline pods whose cache level is dynamic exist, Rubik collects the cache miss and LLC miss metrics of online service pods on the current node and adjusts the watermark of the rubik_dynamic cgroup. In this way, Rubik dynamically controls offline service pods in the dynamic cgroup.
dynCache Kernel Interface
- Rubik creates five cgroup directories in /sys/fs/resctrl and modifies the schemata and tasks files of each cgroup.
dynCache Configuration
The dynCache function is configured in cacheConfig:
"cacheConfig": {
"enable": false,
"defaultLimitMode": "static",
"adjustInterval": 1000,
"perfDuration": 1000,
"l3Percent": {
"low": 20,
"mid": 30,
"high": 50
},
"memBandPercent": {
"low": 10,
"mid": 30,
"high": 50
}
},
l3Percent and memBandPercent: configure the L3 cache and memory bandwidth watermarks (in percent) of the low, mid, and high cgroups.
Assume that in the current environment rdt bitmask=fffff and numa=2. Based on the low value of l3Percent (20) and the low value of memBandPercent (10), Rubik configures /sys/fs/resctrl/rubik_low as follows:

```text
L3:0=f;1=f
MB:0=10;1=10
```
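The derivation of those schemata entries can be sketched as follows. The logic is assumed from the example above (percent of the cache bitmask bits, memory bandwidth written as a plain percentage per NUMA node), not taken from Rubik's source.

```python
# Illustrative sketch: derive resctrl schemata entries for a cgroup from the
# configured l3Percent / memBandPercent values.
def schemata(cbm_mask: str, numa_nodes: int, l3_percent: int, mb_percent: int):
    total_bits = int(cbm_mask, 16).bit_length()        # "fffff" -> 20 bits
    use_bits = max(1, total_bits * l3_percent // 100)  # 20% of 20 bits -> 4 bits
    cbm = format((1 << use_bits) - 1, "x")             # 4 bits -> "f"
    l3 = "L3:" + ";".join(f"{n}={cbm}" for n in range(numa_nodes))
    mb = "MB:" + ";".join(f"{n}={mb_percent}" for n in range(numa_nodes))
    return l3, mb

print(schemata("fffff", 2, 20, 10))  # reproduces the rubik_low example above
```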
defaultLimitMode: If the volcano.sh/cache-limit annotation is not specified for an offline pod, the defaultLimitMode of cacheConfig determines the cgroup to which the pod is added.
- If defaultLimitMode is static, the pod is added to the rubik_max cgroup.
- If defaultLimitMode is dynamic, the pod is added to the rubik_dynamic cgroup.
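The cgroup-selection rule above can be summarized in a short sketch; the helper name `target_cgroup` is hypothetical, and the logic is a restatement of the two bullets rather than Rubik's implementation.

```python
# Sketch of the cgroup-selection rule for offline pods:
# the annotation wins; otherwise defaultLimitMode decides.
def target_cgroup(annotations: dict, default_limit_mode: str) -> str:
    level = annotations.get("volcano.sh/cache-limit")
    if level is not None:
        return f"rubik_{level}"          # e.g. "low" -> rubik_low
    # No annotation: fall back to defaultLimitMode of cacheConfig
    return "rubik_max" if default_limit_mode == "static" else "rubik_dynamic"
```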
adjustInterval: interval for dynCache to dynamically adjust the rubik_dynamic cgroup, in milliseconds. The default value is 1000.
perfDuration: perf execution duration for dynCache, in milliseconds. The default value is 1000.
Precautions for dynCache
- dynCache takes effect only for offline pods.
- If a service container is manually restarted during running (the container ID remains unchanged but the container process ID changes), dynCache does not take effect for the container.
- After a service container is started and the dynCache level is set, the limit level cannot be changed.
- The sensitivity of adjusting the dynamic cgroup is affected by the adjustInterval and perfDuration values in the Rubik configuration file and by the number of online service pods on the node. If the impact detection result indicates that adjustment is required, the adjustment interval fluctuates within the range [adjustInterval + perfDuration, adjustInterval + perfDuration x number of pods]. You can set the configuration items based on your required sensitivity.
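The interval range stated above is simple arithmetic; a quick check with the default values:

```python
# Bounds of the dynamic-cgroup adjustment interval, per the range given above
# (all values in milliseconds; n_online_pods is the online pod count on the node).
def adjust_interval_range(adjust_interval: int, perf_duration: int, n_online_pods: int):
    lower = adjust_interval + perf_duration
    upper = adjust_interval + perf_duration * n_online_pods
    return lower, upper

print(adjust_interval_range(1000, 1000, 4))  # default config, 4 online pods
```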
blkio
The blkio configuration of a pod is specified by the annotation volcano.sh/blkio-limit. The configuration is applied upon pod creation and can be dynamically modified using kubectl annotate during pod running. Both offline and online pods are supported.
The configuration consists of four lists:
| Item | Description |
| --- | --- |
| device_read_bps | Specifies the maximum number of bytes per second read from one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
| device_read_iops | Specifies the maximum number of read operations per second for one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
| device_write_bps | Specifies the maximum number of bytes per second written to one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
| device_write_iops | Specifies the maximum number of write operations per second for one or more devices. device specifies the block device to be limited, and value specifies the maximum number. |
blkio Kernel Interface
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/blkio directory, for example, /sys/fs/cgroup/blkio/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. The annotation values are written into the following files:
  - blkio.throttle.read_bps_device
  - blkio.throttle.read_iops_device
  - blkio.throttle.write_bps_device
  - blkio.throttle.write_iops_device

  Key-value pairs in the list are in the same format as those of the cgroup interfaces.
- When the bps values are written into the pod configuration, they are converted to a multiple of the page size of the environment.
- The configuration takes effect only for devices whose minor value is 0.
- To cancel the limit, set the corresponding value to 0.
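The page-size conversion mentioned above can be sketched as follows. Rounding down and a page size of 4096 bytes are assumptions for illustration; the source only states that values are converted to a multiple of the page size.

```python
# Hedged sketch of the bps conversion: round a requested bytes-per-second value
# down to a multiple of the page size (4096 assumed here) before it is written
# to blkio.throttle.*_bps_device.
PAGE_SIZE = 4096  # assumed; query the environment in real use

def round_bps(value: int) -> int:
    return value // PAGE_SIZE * PAGE_SIZE

print(round_bps(10485760))  # already a multiple of 4096, unchanged
print(round_bps(1000000))   # rounded down to the nearest multiple
```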
blkio Configuration
Enabling or Disabling the blkio Module of Rubik
The blkio module can be enabled or disabled in blkioConfig.
"blkioConfig": {
"enable": true
}
- enable: whether to enable the I/O control module. The default value is false.
Pod Configuration Example
Four lists can be specified by the pod annotation: device_read_bps, device_read_iops, device_write_bps, and device_write_iops.
In the YAML file upon pod creation:
```yaml
volcano.sh/blkio-limit: '{"device_read_bps":[{"device":"/dev/sda1","value":"10485760"}, {"device":"/dev/sda","value":"20971520"}], "device_write_bps":[{"device":"/dev/sda1","value":"20971520"}], "device_read_iops":[{"device":"/dev/sda1","value":"200"}], "device_write_iops":[{"device":"/dev/sda1","value":"300"}]}'
```
Modify annotation: You can run the kubectl annotate command to dynamically modify annotation. For example:
```shell
kubectl annotate --overwrite pods <podname> volcano.sh/blkio-limit='{"device_read_bps":[{"device":"/dev/vda", "value":"211715200"}]}'
```
memory
Rubik supports multiple memory strategies. You can apply different memory allocation methods to different scenarios.
dynlevel: kernel cgroup-based multi-level control. Rubik monitors node memory usage to dynamically adjust the memory cgroup of offline services, ensuring the quality of online services.
fssr: kernel cgroup-based dynamic watermark control. memory.high is a memcg-level watermark interface provided by the kernel. Rubik continuously detects memory usage and dynamically adjusts the memory.high limit of offline services to suppress the memory usage of offline services, ensuring the quality of online services.
memory dynlevel Strategy Kernel Interface
The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. When the dynlevel strategy is used, Rubik adjusts the following values of offline service containers based on the memory usage of the current node:
- memory.soft_limit_in_bytes
- memory.force_empty
- memory.limit_in_bytes
- /proc/sys/vm/drop_caches
Memory dynlevel Strategy Configuration
The strategy and check interval of the memory module can be specified in memoryConfig:
"memoryConfig": {
"enable": true,
"strategy": "none",
"checkInterval": 5
}
enable: whether to enable the module.
The value of strategy can be dynlevel, fssr, or none. The default value is none.
- none: Memory is not dynamically adjusted.
- dynlevel: Dynamic reclamation strategy
- fssr: fast suppression and slow recovery strategy
  1. Upon Rubik startup, memory.high of all offline services is set to 80% of the total memory by default.
  2. When freeMemory (available memory) is less than reservedMemory (totalMemory x 5%), memory.high of all offline services is decreased by 10% of totalMemory: the latest memory.high = memory.high - totalMemory x 10%.
  3. If the available memory stays sufficient for a period of time, that is, freeMemory is more than 3 x reservedMemory, 1% of totalMemory is released back to offline services: the latest memory.high = memory.high + totalMemory x 1%. memory.high is increased repeatedly until freeMemory is between 1 and 3 x reservedMemory.
checkInterval specifies the check interval of the strategy, in seconds. The default value is 5.
memory fssr Strategy Kernel Interface
- The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. When the fssr strategy is used, Rubik adjusts the following value of offline service containers based on the memory usage of the current node:
  - memory.high
memory fssr Strategy Configuration
The strategy and check interval of the memory module can be specified in memoryConfig:
"memoryConfig": {
"enable": true,
"strategy": "fssr",
"checkInterval": 5
}
enable: whether to enable the module.
The value of strategy can be dynlevel, fssr, or none. The default value is none.
- none: Memory is not dynamically adjusted.
- dynlevel: Dynamic reclamation strategy
- fssr: fast suppression and slow recovery strategy
  1. Upon Rubik startup, memory.high of all offline services is set to 80% of the total memory by default.
  2. When freeMemory (available memory) is less than reservedMemory (totalMemory x 5%), memory.high of all offline services is decreased by 10% of totalMemory: the latest memory.high = memory.high - totalMemory x 10%.
  3. If the available memory stays sufficient for a period of time, that is, freeMemory is more than 3 x reservedMemory, 1% of totalMemory is released back to offline services: the latest memory.high = memory.high + totalMemory x 1%. memory.high is increased repeatedly until freeMemory is between 1 and 3 x reservedMemory.
checkInterval specifies the check interval of the strategy, in seconds. The default value is 5.
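One fssr check cycle can be simulated from the description above. This is an illustrative restatement, not Rubik's code: it ignores the "sufficient for a period of time" condition (a real implementation would require the headroom to persist across several checks), and the reservedMemory thresholds are taken directly from the text.

```python
# Illustrative simulation of one fssr check: fast suppression (-10% of total)
# under memory pressure, slow recovery (+1% of total) with sustained headroom.
def fssr_step(memory_high: int, total: int, free: int) -> int:
    reserved = total * 5 // 100          # reservedMemory = totalMemory x 5%
    if free < reserved:                  # pressure: suppress offline services fast
        memory_high -= total * 10 // 100
    elif free > 3 * reserved:            # headroom: recover slowly
        memory_high += total * 1 // 100
    return memory_high                   # otherwise: leave memory.high unchanged

total = 100 * 1024**3                    # a 100 GiB node, for illustration
high = total * 80 // 100                 # initial watermark: 80% of totalMemory
high = fssr_step(high, total, free=2 * 1024**3)  # free < reserved -> suppressed
```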
quota burst
The quota burst configuration of a pod is specified by the annotation volcano.sh/quota-burst-time. The configuration is applied upon pod creation and can be dynamically modified using kubectl annotate during pod running. Both offline and online pods are supported.
The default unit of quota burst of a pod is microseconds. Rubik allows a container to accumulate CPU resources when the CPU usage of the container is lower than the quota and uses the accumulated CPU resources when the CPU usage exceeds the quota.
quota burst kernel interface
The interface exists in the cgroup of the container in the /sys/fs/cgroup/cpu directory, for example, /sys/fs/cgroup/cpu/kubepods/burstable/&lt;pod-id&gt;/&lt;container-id&gt;/. The annotation value is written into the following file:
- cpu.cfs_burst_us
The same restriction applies to the values of volcano.sh/quota-burst-time and cpu.cfs_burst_us.
- When the value of cpu.cfs_quota_us is not -1, the following conditions must be met: cpu.cfs_burst_us + cpu.cfs_quota_us <= 2^44-1 and cpu.cfs_burst_us <= cpu.cfs_quota_us.
- When cpu.cfs_quota_us is -1, the maximum value of cpu.cfs_burst_us is not limited and depends on the maximum value that can be set in the system.
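The two constraints above can be expressed as a small validity check; the helper name `burst_is_valid` is hypothetical and only restates the rules listed here.

```python
# Validity check for cpu.cfs_burst_us values, per the constraints above
# (values in microseconds; quota_us == -1 means "no quota", as in cpu.cfs_quota_us).
MAX_US = 2**44 - 1

def burst_is_valid(burst_us: int, quota_us: int) -> bool:
    if quota_us == -1:
        return True  # burst not constrained by these rules when quota is unlimited
    return burst_us + quota_us <= MAX_US and burst_us <= quota_us
```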
Pod configuration example
In the YAML file upon pod creation:
```yaml
metadata:
  annotations:
    volcano.sh/quota-burst-time: "2000"
```
Modify annotation: You can run the kubectl annotate command to dynamically modify annotation. For example:
```shell
kubectl annotate --overwrite pods <podname> volcano.sh/quota-burst-time='3000'
```
I/O Weight Control Based on iocost
Dependency Description
Rubik can control the I/O weight distribution of different pods through iocost of cgroup v1. Therefore, the kernel must support the following features:
- blkcg iocost of cgroup v1
- writeback of cgroup v1
Rubik Implementation Description
The procedure of the Rubik implementation is as follows:
- When Rubik is deployed, Rubik parses the configuration and sets iocost parameters.
- Rubik registers the detection event to the Kubernetes API server.
- When a pod is deployed, the pod configuration information is written back to Rubik.
- Rubik parses the pod configuration information and configures the pod iocost weight based on the QoS level.
Rubik Protocol Description
"nodeConfig": [
{
"nodeName": "slaver01",
"iocostEnable": true,
"iocostConfig": [
{
"dev": "sda",
"enable": false,
"model": "linear",
"param": {
"rbps": 174610612,
"rseqiops": 41788,
"rrandiops": 371,
"wbps": 178587889,
"wseqiops": 42792,
"wrandiops": 379
}
}
]
}
]
| Item | Type | Description |
| --- | --- | --- |
| nodeConfig | Array | Node configuration information |
| nodeName | String | Name of the node to be configured |
| iocostEnable | Bool | Whether to enable iocost for the node |
| iocostConfig | Array | Configuration array for different physical drives. This parameter is read when iocostEnable is set to true. |
| dev | String | Physical drive name |
| enable | Bool | Whether to enable iocost for the physical drive |
| model | String | Name of the iocost model. linear is the linear model provided by the kernel. |
| param | Object | Model parameters. When model is set to linear, the parameters below are the linear model parameters. |
| r(w)bps | int64 | Maximum read (write) bandwidth of the physical block device |
| r(w)seqiops | int64 | Maximum sequential read (write) IOPS of the physical block device |
| r(w)randiops | int64 | Maximum random read (write) IOPS of the physical block device |
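As a hedged sketch of how these parameters reach the kernel, the snippet below assembles a linear-model parameter line from one iocostConfig entry. The key=value layout follows the cgroup v2 io.cost.model documentation; the cgroup v1 blkio.cost.model interface mentioned in this section is assumed to be analogous, and "8:0" as sda's device number is an assumption for illustration.

```python
# Hedged sketch: build the linear-model parameter string for an iocost device
# entry (layout per the kernel's io.cost.model documentation; v1 assumed similar).
def linear_model_line(devno: str, param: dict) -> str:
    keys = ("rbps", "rseqiops", "rrandiops", "wbps", "wseqiops", "wrandiops")
    params = " ".join(f"{k}={param[k]}" for k in keys)
    return f"{devno} ctrl=user model=linear {params}"

cfg = {"rbps": 174610612, "rseqiops": 41788, "rrandiops": 371,
       "wbps": 178587889, "wseqiops": 42792, "wrandiops": 379}
print(linear_model_line("8:0", cfg))  # "8:0" assumed to be sda's major:minor
```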
Other
Parameters related to the iocost linear model can be obtained by running the iocost_coef_gen.py script.
The blkio.cost.qos and blkio.cost.model file interfaces exist in the root blkcg directory of the cgroup file system. For details about the implementation and interface description, see the openEuler kernel documentation.