      Feature Introduction

      Pod CPU Priorities

      Rubik allows you to configure CPU priorities of services. In the hybrid deployment of online and offline services, Rubik ensures that online services preempt CPU resources.

      Prerequisites

      • The kernel of openEuler 22.03 or later is recommended. The kernel supports CPU priority configuration based on control groups (cgroups). The CPU subsystem provides the cpu.qos_level interface.

      CPU Priority Kernel Interface

      • The interface exists in the cgroup of the container in the /sys/fs/cgroup/cpu directory, for example, /sys/fs/cgroup/cpu/kubepods/burstable//.
        • cpu.qos_level: enables the CPU priority configuration. The value can be 0 or -1, with 0 being the default.
          • 0 indicates an online service.
          • -1 indicates an offline service.

      CPU Priority Configuration

      Rubik automatically configures cpu.qos_level based on the annotation volcano.sh/preemptable in the YAML file of the pod. The default value is false.

      annotations:
          volcano.sh/preemptable: "true"
      
      • true indicates an offline service.
      • false indicates an online service.
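
      The mapping from the annotation to the kernel interface can be sketched as follows. This is an illustration only, not Rubik's code: the helper name qos_level_for is hypothetical, and a missing annotation is treated as false, following the default stated above.

```python
def qos_level_for(annotations: dict) -> int:
    """Return the cpu.qos_level value implied by a pod's annotations.

    Hypothetical helper: -1 marks an offline service, 0 an online service.
    """
    # The annotation defaults to "false" (online service) when it is absent.
    value = str(annotations.get("volcano.sh/preemptable", "false")).lower()
    return -1 if value == "true" else 0


print(qos_level_for({"volcano.sh/preemptable": "true"}))  # offline pod
print(qos_level_for({}))                                   # online pod (default)
```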

      Pod Memory Priorities

      Rubik allows you to configure memory priorities of services. In the hybrid deployment of online and offline services, Rubik ensures that offline services are first terminated in the case of out-of-memory (OOM).

      Prerequisites

      • openEuler 22.03 or later is recommended. The kernel of openEuler 22.03 or later supports memory priority configuration based on cgroups, that is, the memory subsystem interface memory.qos_level.
      • To enable the memory priority feature, run echo 1 > /proc/sys/vm/memcg_qos_enable.

      Memory Priority Kernel Interface

      • /proc/sys/vm/memcg_qos_enable: enables the memory priority feature. The value can be 0 or 1, with 0 being the default. You can run echo 1 > /proc/sys/vm/memcg_qos_enable to enable the feature.

        • 0: The feature is disabled.
        • 1: The feature is enabled.
      • The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable//.

        • memory.qos_level: enables the memory priority configuration. The value can be 0 or -1, with 0 being the default.
          • 0 indicates an online service.
          • -1 indicates an offline service.
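
      Before relying on memory priorities, it can help to confirm that the switch is on. The following is a minimal sketch, assuming that a missing switch file means the kernel does not provide the feature; the function name memcg_qos_enabled is hypothetical.

```python
from pathlib import Path


def memcg_qos_enabled(switch: str = "/proc/sys/vm/memcg_qos_enable") -> bool:
    """Return True only if the switch file exists and contains 1."""
    path = Path(switch)
    if not path.is_file():
        # Assumption: no file means the kernel lacks the memory priority feature.
        return False
    return path.read_text().strip() == "1"


if __name__ == "__main__":
    print("memory priority enabled:", memcg_qos_enabled())
```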

      Memory Priority Configuration

      Rubik automatically configures memory.qos_level based on the annotation volcano.sh/preemptable in the YAML file of the pod. See CPU Priority Configuration.

      dynCache Memory Bandwidth and L3 Cache Access Limit

      Rubik can limit pod memory bandwidth and L3 cache access for offline services to reduce the impact on online services.

      Prerequisites

      • The cache access and memory bandwidth limit feature supports only physical machines.

        • For x86 physical machines, the CAT and MBA functions of Intel RDT must be enabled in the OS by adding rdt=l3cat,mba to the kernel command line parameters (cmdline).
        • For ARM physical machines, the MPAM function must be enabled in the OS by adding mpam=acpi to the kernel command line parameters (cmdline).
      • Due to kernel restrictions, RDT does not support the pseudo-locksetup mode.

      New Permissions and Directories of Rubik

      • Mount point: /sys/fs/resctrl. Rubik reads and sets files in the /sys/fs/resctrl directory. This directory must be mounted before Rubik is started and cannot be unmounted during Rubik running.
      • Permission: SYS_ADMIN. To set files in the /sys/fs/resctrl directory on the host, the SYS_ADMIN permission must be assigned to the Rubik container.
      • Namespace: PID namespace. Rubik obtains the PID of the service container process on the host. Therefore, the Rubik container needs to share the PID namespace with the host.

      Rubik RDT Cgroups

      Rubik creates five cgroups (rubik_max, rubik_high, rubik_middle, rubik_low and rubik_dynamic) in the RDT resctrl directory (/sys/fs/resctrl by default). Rubik writes the watermarks to the schemata file of each corresponding cgroup upon startup. The low, middle, and high watermarks can be configured in cacheConfig. The max cgroup uses the default maximum value. The initial watermark of the dynamic cgroup is the same as that of the low cgroup.

      When an offline service pod is started, Rubik sets its cache level based on the volcano.sh/cache-limit annotation and adds the pod to the corresponding cgroup. For example, the pod with the following configuration is added to the rubik_low cgroup:

      annotations:
          volcano.sh/cache-limit: "low"
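
      The group selection can be sketched as below. This is not Rubik's implementation: the function name resctrl_group is hypothetical, and the sketch assumes that annotation values other than "low" match the suffixes of the five group names. The fallback for pods without the annotation follows the defaultLimitMode item described in dynCache Configuration.

```python
from typing import Optional

# Assumption: each annotation value equals the suffix of a rubik_* group name.
LEVELS = ("low", "middle", "high", "max", "dynamic")


def resctrl_group(cache_limit: Optional[str], default_limit_mode: str = "static") -> str:
    """Pick the resctrl group directory name for an offline pod."""
    if cache_limit is None:
        # No annotation: static falls back to rubik_max, dynamic to rubik_dynamic.
        cache_limit = "dynamic" if default_limit_mode == "dynamic" else "max"
    if cache_limit not in LEVELS:
        raise ValueError(f"unknown cache level: {cache_limit!r}")
    return f"rubik_{cache_limit}"


print(resctrl_group("low"))            # rubik_low
print(resctrl_group(None, "dynamic"))  # rubik_dynamic
```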
      

      Rubik dynamic Cgroup

      When offline pods whose cache level is dynamic exist, Rubik collects the cache miss and LLC miss metrics of online service pods on the current node and adjusts the watermark of the rubik_dynamic cgroup. In this way, Rubik dynamically controls offline service pods in the dynamic cgroup.

      dynCache Kernel Interface

      • Rubik creates five cgroup directories in /sys/fs/resctrl and modifies the schemata and tasks files of each cgroup.

      dynCache Configuration

      The dynCache function is configured in cacheConfig:

      "cacheConfig": {
              "enable": false,
              "defaultLimitMode": "static",
              "adjustInterval": 1000,
              "perfDuration": 1000,
              "l3Percent": {
                  "low": 20,
                  "mid": 30,
                  "high": 50
              },
              "memBandPercent": {
                  "low": 10,
                  "mid": 30,
                  "high": 50
              }
          },
      
      • l3Percent and memBandPercent: watermarks of the low, mid, and high cgroups, given as percentages of the L3 cache and of the memory bandwidth, respectively.

        Assume that the RDT bitmask of the current environment is fffff and the number of NUMA nodes is 2. Based on the low value of l3Percent (20) and the low value of memBandPercent (10), Rubik configures /sys/fs/resctrl/rubik_low as follows:

        L3:0=f;1=f
        MB:0=10;1=10
        
      • defaultLimitMode: If the volcano.sh/cache-limit annotation is not specified for an offline pod, the defaultLimitMode of cacheConfig determines the cgroup to which the pod is added.

        • If defaultLimitMode is static, the pod is added to the rubik_max cgroup.
        • If defaultLimitMode is dynamic, the pod is added to the rubik_dynamic cgroup.
      • adjustInterval: interval for dynCache to dynamically adjust the rubik_dynamic cgroup, in milliseconds. The default value is 1000.

      • perfDuration: perf execution duration for dynCache, in milliseconds. The default value is 1000.
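
      The rubik_low example above can be reproduced with a short calculation. This is a sketch under assumptions, not Rubik's code: the function name schemata_lines is hypothetical, and rounding the number of cache ways down (with at least one way kept) is a guess that happens to match the documented example.

```python
def schemata_lines(full_mask: str, numa_nodes: int, l3_percent: int, mb_percent: int) -> list:
    """Build the L3 and MB schemata lines for one resctrl group."""
    # Number of cache ways available, taken from the full RDT bitmask.
    total_bits = bin(int(full_mask, 16)).count("1")
    # Assumption: round down, but always keep at least one way.
    ways = max(1, total_bits * l3_percent // 100)
    l3_mask = format((1 << ways) - 1, "x")
    l3 = ";".join(f"{node}={l3_mask}" for node in range(numa_nodes))
    mb = ";".join(f"{node}={mb_percent}" for node in range(numa_nodes))
    return [f"L3:{l3}", f"MB:{mb}"]


for line in schemata_lines("fffff", 2, 20, 10):
    print(line)
# L3:0=f;1=f
# MB:0=10;1=10
```

      With a 20-bit mask, 20% corresponds to 4 ways, which gives the mask f for each node.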

      Precautions for dynCache

      • dynCache takes effect only for offline pods.
      • If a service container is manually restarted during running (the container ID remains unchanged but the container process ID changes), dynCache does not take effect for the container.
      • After a service container is started and the dynCache level is set, the limit level cannot be changed.
      • The sensitivity of adjusting the dynamic cgroup is affected by the adjustInterval and perfDuration values in the Rubik configuration file and by the number of online service pods on the node. If the impact detection result indicates that adjustment is required, the adjustment interval fluctuates within the range [adjustInterval + perfDuration, adjustInterval + perfDuration x Number of pods]. You can set the configuration items based on your required sensitivity.
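
      The interval range from the last precaution can be worked out as follows. The function name adjustment_window_ms is hypothetical; the formula is the one stated above.

```python
def adjustment_window_ms(adjust_interval: int, perf_duration: int, online_pods: int) -> tuple:
    """Return (shortest, longest) adjustment interval in milliseconds."""
    if online_pods < 1:
        raise ValueError("at least one online service pod is required")
    lower = adjust_interval + perf_duration
    upper = adjust_interval + perf_duration * online_pods
    return lower, upper


# Default settings with four online service pods on the node.
print(adjustment_window_ms(1000, 1000, 4))  # (2000, 5000)
```

      With the default values, the interval is therefore between 2 and 5 seconds for four online pods; lowering perfDuration narrows the range.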

      blkio

      The blkio configuration of a pod is specified by annotation volcano.sh/blkio-limit. The configuration is applied upon pod creation and can be dynamically modified using kubectl annotate during pod running. Both offline and online pods are supported.

      The configuration consists of four lists:

      Item                 Description
      device_read_bps      Maximum number of bytes that can be read per second from one or more devices. device specifies the block device to be limited, and value specifies the maximum number.
      device_read_iops     Maximum number of read operations per second for one or more devices. device specifies the block device to be limited, and value specifies the maximum number.
      device_write_bps     Maximum number of bytes that can be written per second to one or more devices. device specifies the block device to be limited, and value specifies the maximum number.
      device_write_iops    Maximum number of write operations per second for one or more devices. device specifies the block device to be limited, and value specifies the maximum number.

      blkio Kernel Interface

      • The interface exists in the cgroup of the container in the /sys/fs/cgroup/blkio directory, for example, /sys/fs/cgroup/blkio/kubepods/burstable//.
        • blkio.throttle.read_bps_device
        • blkio.throttle.read_iops_device
        • blkio.throttle.write_bps_device
        • blkio.throttle.write_iops_device

      Key-value pairs in the list are in the same format as those of the cgroup.

      • Byte values written into the pod configuration are converted to a multiple of the page size of the environment.
      • The configuration takes effect only for devices whose minor number is 0.
      • To cancel the limit, set the corresponding value to 0.

      blkio Configuration

      Enabling or Disabling the blkio Module of Rubik

      The blkio module can be enabled or disabled in blkioConfig.

      "blkioConfig": {
              "enable": true
      }
      
      • enable: whether to enable the I/O control module. The default value is false.

      Pod Configuration Example

      Four lists can be specified by the pod annotation: device_write_bps, device_write_iops, device_read_bps, and device_read_iops.

      • In the YAML file upon pod creation:

        volcano.sh/blkio-limit: '{"device_read_bps":[{"device":"/dev/sda1","value":"10485760"}, {"device":"/dev/sda","value":"20971520"}],
                        "device_write_bps":[{"device":"/dev/sda1","value":"20971520"}],
                        "device_read_iops":[{"device":"/dev/sda1","value":"200"}],
                        "device_write_iops":[{"device":"/dev/sda1","value":"300"}]}'
        
      • Modify annotation: You can run the kubectl annotate command to dynamically modify the annotation. For example:

        kubectl annotate --overwrite pods <podname> volcano.sh/blkio-limit='{"device_read_bps":[{"device":"/dev/vda", "value":"211715200"}]}'
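
      Writing the annotation value by hand is error-prone, so it can be generated instead. The following is a minimal sketch; the helper name blkio_limit_annotation and its input shape are hypothetical, while the list names and the string-typed value field follow the examples above.

```python
import json

ALLOWED_LISTS = {
    "device_read_bps",
    "device_read_iops",
    "device_write_bps",
    "device_write_iops",
}


def blkio_limit_annotation(limits: dict) -> str:
    """Serialize limits such as {"device_read_bps": {"/dev/sda": 20971520}}."""
    payload = {}
    for list_name, devices in limits.items():
        if list_name not in ALLOWED_LISTS:
            raise ValueError(f"unsupported list: {list_name}")
        # Values are written as strings, as in the annotation examples.
        payload[list_name] = [
            {"device": dev, "value": str(value)} for dev, value in devices.items()
        ]
    return json.dumps(payload)


print(blkio_limit_annotation({"device_read_bps": {"/dev/vda": 211715200}}))
```

      The printed string can be used as the value of volcano.sh/blkio-limit in the pod YAML file or in the kubectl annotate command.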

      memory

      Rubik supports multiple memory strategies. You can apply different memory allocation methods to different scenarios.

      dynlevel: kernel cgroup-based multi-level control. Rubik monitors node memory usage to dynamically adjust the memory cgroup of offline services, ensuring the quality of online services.

      fssr: kernel cgroup-based dynamic watermark control. memory.high is a memcg-level watermark interface provided by the kernel. Rubik continuously detects memory usage and dynamically adjusts the memory.high limit of offline services to suppress the memory usage of offline services, ensuring the quality of online services.

      memory dynlevel Strategy Kernel Interface

      • The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable//. When the dynlevel strategy is used, Rubik adjusts the following values of offline service containers based on the memory usage of the current node:

        • memory.soft_limit_in_bytes
        • memory.force_empty
        • memory.limit_in_bytes
        • /proc/sys/vm/drop_caches

      memory dynlevel Strategy Configuration

      The strategy and check interval of the memory module can be specified in memoryConfig:

      "memoryConfig": {
              "enable": true,
              "strategy": "none",
              "checkInterval": 5
         }
      
      • enable: whether to enable the module.

      • The value of strategy can be dynlevel, fssr, or none. The default value is none.

        • none: Memory is not dynamically adjusted.
        • dynlevel: dynamic reclamation strategy.
        • fssr: fast suppression and slow recovery strategy.
          1) Upon Rubik startup, memory.high of all offline services is set to 80% of the total memory by default.
          2) When freeMemory (available memory) is less than reservedMemory (totalMemory x 5%), memory.high of all offline services is decreased by 10% of totalMemory: memory.high = memory.high - totalMemory x 10%.
          3) If the available memory stays sufficient for a period of time, that is, freeMemory is more than 3 x reservedMemory, 1% of totalMemory is released to offline services: memory.high = memory.high + totalMemory x 1%. memory.high is increased repeatedly until freeMemory is between 1 x and 3 x reservedMemory.
      • checkInterval specifies the check interval of the strategy, in seconds. The default value is 5.

      memory fssr Strategy Kernel Interface

      • The interface exists in the cgroup of the container in the /sys/fs/cgroup/memory directory, for example, /sys/fs/cgroup/memory/kubepods/burstable//. When the fssr strategy is used, Rubik adjusts the following value of offline service containers based on the memory usage of the current node:
        • memory.high

      memory fssr Strategy Configuration

      The strategy and check interval of the memory module can be specified in memoryConfig:

      "memoryConfig": {
              "enable": true,
              "strategy": "fssr",
              "checkInterval": 5
         }
      
      • enable: whether to enable the module.

      • The value of strategy can be dynlevel, fssr, or none. The default value is none.

        • none: Memory is not dynamically adjusted.
        • dynlevel: dynamic reclamation strategy.
        • fssr: fast suppression and slow recovery strategy.
          1) Upon Rubik startup, memory.high of all offline services is set to 80% of the total memory by default.
          2) When freeMemory (available memory) is less than reservedMemory (totalMemory x 5%), memory.high of all offline services is decreased by 10% of totalMemory: memory.high = memory.high - totalMemory x 10%.
          3) If the available memory stays sufficient for a period of time, that is, freeMemory is more than 3 x reservedMemory, 1% of totalMemory is released to offline services: memory.high = memory.high + totalMemory x 1%. memory.high is increased repeatedly until freeMemory is between 1 x and 3 x reservedMemory.
      • checkInterval specifies the check interval of the strategy, in seconds. The default value is 5.
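
      The fssr rules can be illustrated with a single check step. This is a sketch of the documented arithmetic only, not Rubik's implementation: the function name fssr_step is hypothetical, the "period of time" condition for recovery is reduced to one check, and no lower or upper bound on memory.high is applied because none is documented here.

```python
def fssr_step(memory_high: int, free_memory: int, total_memory: int) -> int:
    """Return the new memory.high after one check, per the fssr rules."""
    reserved = total_memory * 5 // 100  # reservedMemory = totalMemory x 5%
    if free_memory < reserved:
        # Fast suppression: cut 10% of totalMemory in one step.
        return memory_high - total_memory * 10 // 100
    if free_memory > 3 * reserved:
        # Slow recovery: give back 1% of totalMemory per step.
        return memory_high + total_memory // 100
    # freeMemory is between 1 x and 3 x reservedMemory: leave the limit alone.
    return memory_high


total = 1000                 # arbitrary units; reservedMemory is 50
high = total * 80 // 100     # initial memory.high: 800
high = fssr_step(high, free_memory=40, total_memory=total)
print(high)  # 700 (suppressed)
high = fssr_step(high, free_memory=200, total_memory=total)
print(high)  # 710 (recovering)
high = fssr_step(high, free_memory=100, total_memory=total)
print(high)  # 710 (unchanged)
```

      One suppression step removes as much as ten recovery steps restore, which is what makes suppression fast and recovery slow.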

      quota burst

      The quota burst configuration of a pod is specified by annotation volcano.sh/quota-burst-time. The configuration is applied upon pod creation and can be dynamically modified using kubectl annotate during pod running. Both offline and online pods are supported.

      The default unit of quota burst of a pod is microseconds. Rubik allows a container to accumulate CPU resources when the CPU usage of the container is lower than the quota and uses the accumulated CPU resources when the CPU usage exceeds the quota.

      quota burst Kernel Interface

      • The interface exists in the cgroup of the container in the /sys/fs/cgroup/cpu directory, for example, /sys/fs/cgroup/cpu/kubepods/burstable//. The annotation value is written into the following file:

        • cpu.cfs_burst_us
      • The same restriction applies to the values of volcano.sh/quota-burst-time and cpu.cfs_burst_us.

        • When the value of cpu.cfs_quota_us is not -1, the following conditions must be met: cpu.cfs_burst_us + cpu.cfs_quota_us <= 2^44-1 and cpu.cfs_burst_us <= cpu.cfs_quota_us.
        • When cpu.cfs_quota_us is -1, the maximum value of cpu.cfs_burst_us is not limited and depends on the maximum value that can be set in the system.
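
      The two restrictions can be checked before the annotation is set. The following is a minimal sketch; the function name burst_is_valid is hypothetical, and rejecting negative burst values is an added assumption.

```python
MAX_SUM = 2**44 - 1  # upper bound on cpu.cfs_burst_us + cpu.cfs_quota_us


def burst_is_valid(burst_us: int, quota_us: int) -> bool:
    """Check a quota burst value against the container's cpu.cfs_quota_us."""
    if burst_us < 0:
        # Assumption: a negative burst time is never meaningful.
        return False
    if quota_us == -1:
        # Unlimited quota: these rules impose no upper bound on the burst.
        return True
    return burst_us + quota_us <= MAX_SUM and burst_us <= quota_us


print(burst_is_valid(2000, 100000))    # True
print(burst_is_valid(200000, 100000))  # False: burst exceeds quota
```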

      Pod Configuration Example

      • In the YAML file upon pod creation:

        metadata:
          annotations:
            volcano.sh/quota-burst-time: "2000"
        
      • Modify annotation: You can run the kubectl annotate command to dynamically modify the annotation. For example:

        kubectl annotate --overwrite pods <podname> volcano.sh/quota-burst-time='3000'

      I/O Weight Control Based on iocost

      Dependency Description

      Rubik can control the I/O weight distribution of different pods through iocost of cgroup v1. Therefore, the kernel must support the following features:

      • blkcg iocost of cgroup v1
      • writeback of cgroup v1

      Rubik Implementation Description

      The procedure of the Rubik implementation is as follows:

      • When Rubik is deployed, Rubik parses the configuration and sets iocost parameters.
      • Rubik registers the detection event with the Kubernetes API server.
      • When a pod is deployed, the pod configuration information is written back to Rubik.
      • Rubik parses the pod configuration information and configures the pod iocost weight based on the QoS level.

      Rubik Protocol Description

      "nodeConfig": [
              {
                  "nodeName": "slaver01",
                  "iocostEnable": true,
                  "iocostConfig": [
                      {
                          "dev": "sda",
                          "enable": false,
                          "model": "linear",
                          "param": {
                              "rbps": 174610612,
                              "rseqiops": 41788,
                              "rrandiops": 371,
                              "wbps": 178587889,
                              "wseqiops": 42792,
                              "wrandiops": 379
                          }
                      }
                  ]
              }
          ]
      
      Item           Type     Description
      nodeConfig     Array    Node configuration information
      nodeName       String   Name of the node to be configured
      iocostEnable   Bool     Whether to enable iocost for the node
      iocostConfig   Array    Configuration array for different physical drives. This parameter is read when iocostEnable is set to true.
      dev            String   Physical drive name
      enable         Bool     Whether to enable iocost for the physical drive
      model          String   Name of the iocost model. linear is the linear model provided by the kernel.
      param          Object   Parameters of the model. When model is set to linear, the following parameters are linear model-related parameters.
      r(w)bps        int64    Maximum read (write) bandwidth of the physical block device
      r(w)seqiops    int64    Maximum sequential read (write) IOPS of the physical block device
      r(w)randiops   int64    Maximum random read (write) IOPS of the physical block device
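
      The two enable switches interact: a drive is controlled only when both iocostEnable of the node and enable of the drive are true. The sketch below shows one way to read the configuration; the function name enabled_devices is hypothetical, and the fragment above is wrapped in braces so that it parses as a JSON object.

```python
import json

CONFIG = """
{
  "nodeConfig": [
    {
      "nodeName": "slaver01",
      "iocostEnable": true,
      "iocostConfig": [
        {
          "dev": "sda",
          "enable": false,
          "model": "linear",
          "param": {
            "rbps": 174610612, "rseqiops": 41788, "rrandiops": 371,
            "wbps": 178587889, "wseqiops": 42792, "wrandiops": 379
          }
        }
      ]
    }
  ]
}
"""


def enabled_devices(config: dict, node_name: str) -> list:
    """List the drives on node_name for which iocost is switched on."""
    devices = []
    for node in config.get("nodeConfig", []):
        # iocostConfig is read only when iocostEnable is true for the node.
        if node.get("nodeName") != node_name or not node.get("iocostEnable", False):
            continue
        for dev in node.get("iocostConfig", []):
            if dev.get("enable", False):
                devices.append(dev["dev"])
    return devices


print(enabled_devices(json.loads(CONFIG), "slaver01"))  # [] because sda has enable=false
```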

      Other

      • Parameters related to the iocost linear model can be obtained by running the iocost_coef_gen.py script.

      • The blkio.cost.qos and blkio.cost.model file interfaces exist in the root directory of the blkcg cgroup. For details about the implementation and interface description, see the openEuler kernel documentation.
