
      Tiered-Reliability Memory Management for the OS

      Overview

      Memory is divided into two ranges by reliability: a highly reliable range and a low reliable range. Memory allocation and release must therefore be managed separately for each range, and the OS must be able to control the memory allocation path. By default, user-mode processes use low reliable memory and kernel-mode allocations use highly reliable memory. When highly reliable memory is insufficient, the allocation either falls back to the low reliable memory range or fails.

      In addition, highly reliable and low reliable memory must be allocatable on demand according to the reliability requirements and types of processes. For example, highly reliable memory can be specified for key processes to reduce the probability that they encounter memory errors. Currently, the kernel uses highly reliable memory while user-mode processes use low reliable memory. As a result, some key or core services, such as the service forwarding process, are unstable: if an exception occurs, I/Os are interrupted and service stability is affected. These key services must therefore use highly reliable memory to improve their stability.

      When a memory error occurs in the system, the OS overwrites the unallocated low reliable memory to clear the undetected memory error.

      Restrictions

      • High-reliability memory for key processes

        1. The abuse of the /proc/<pid>/reliable API may cause excessive use of highly reliable memory.
        2. The reliable attribute of a user-mode process is either inherited from its parent process or modified through the proc API after the process has started. systemd (pid=1) always uses highly reliable memory; its reliable attribute has no effect and is not inherited. The reliable attribute of kernel-mode threads is invalid.
        3. The program and data segments of processes use highly reliable memory. If highly reliable memory is insufficient, low reliable memory is used for startup instead.
        4. Common processes also use highly reliable memory in some scenarios, such as HugeTLB, page cache, vDSO, and TMPFS.
      • Overwrite of unallocated memory

        The overwrite of unallocated memory can be executed only once and does not support concurrent operations. Executing it has the following impacts:

        1. The overwrite takes a long time. While one CPU on each node is occupied by the overwrite thread, other tasks cannot be scheduled on that CPU.
        2. The zone lock must be held during the overwrite. Other service processes must wait until the overwrite is complete, so memory may not be allocated in time.
        3. Concurrent execution attempts are queued and blocked, resulting in a longer delay.

        On machines with poor performance, a kernel RCU stall or soft lockup alarm may be triggered, and process memory allocation may be blocked. Therefore, use this feature only on physical machines, and only when necessary; on VMs, the preceding problems are highly likely to occur.

        The following table lists the reference data of physical machines. (The actual time required depends on the hardware performance and system load.)

      Table 1 Test data when the TaiShan 2280 V2 server is unloaded

      Test Item     | Node 0 | Node 1 | Node 2 | Node 3
      Free Mem (MB) | 109290 | 81218  | 107365 | 112053

      The total time is 3.2s.

      Usage

      This sub-feature provides multiple APIs. You only need to perform steps 1 to 6 to enable and verify the sub-feature.

      1. Configure the startup parameter kernelcore=reliable to enable tiered-reliability memory management. The kernel configuration option CONFIG_MEMORY_RELIABLE is also mandatory; otherwise, tiered-reliability memory management is disabled for the entire system.

      2. The optional startup parameter reliable_debug=[F][,S][,P] disables the fallback function (F), TMPFS use of highly reliable memory (S), or read/write cache use of highly reliable memory (P). By default, all of these functions are enabled.

      3. Based on the address range reported by the BIOS, the system locates and marks the highly reliable memory. In a NUMA system, not every node needs to reserve reliable memory, but the lower 4 GB of physical address space on node 0 must be highly reliable memory. During startup, the system allocates memory; if highly reliable memory cannot be allocated, either low reliable memory is allocated (following the fallback logic of the mirroring function) or the system fails to start. If low reliable memory were used for these allocations, the entire system would be unstable. Therefore, highly reliable memory must be reserved on node 0, and its lower 4 GB of physical space must be highly reliable.

      4. After the startup, you can check whether memory tiering is enabled based on the startup log. If it is enabled, the following information is displayed:

        mem reliable: init succeed, mirrored memory
        
      5. The physical address ranges of the highly reliable memory can be found in the startup log: in the memory map reported by the EFI, ranges marked with the MR attribute are highly reliable. In the following excerpt, mem06 is highly reliable memory and mem07 is low reliable memory; their physical address ranges are listed. (The highly and low reliable address ranges cannot be queried directly in any other way.)

        [  0.000000] efi: mem06: [Conventional Memory|  |MR| | | | | |  |WB| | | ] range=[0x0000000100000000-0x000000013fffffff] (1024MB) 
        [  0.000000] efi: mem07: [Conventional Memory|  | | | | | | |  |WB| | | ] range=[0x0000000140000000-0x000000083eb6cfff] (28651MB)     
        
      6. During kernel-mode development, whether a struct page is reliable can be determined from the zone it belongs to: ZONE_MOVABLE is the low reliable memory zone, and any zone whose ID is smaller than ZONE_MOVABLE is a highly reliable memory zone. The following is an example:

        bool page_reliable(struct page *page)
        {
                if (!mem_reliable_status() || !page)
                        return false;
                return page_zonenum(page) < ZONE_MOVABLE;
        }
        

        In addition, the provided APIs are classified based on their functions:

        1. Checking whether the reliability function is enabled at the code layer: in a kernel module, use the following API to check whether tiered-reliability memory management is enabled. It returns true if the function is enabled and false otherwise.

          #include <linux/mem_reliable.h>   
          bool mem_reliable_status(void);
          
        2. Memory hot swap: if the kernel supports memory hot swap (logical memory hot-add), both highly and low reliable memory support this operation. The operation unit is the memory block, the same as in the native procedure.

          # Bring the memory online to the highly reliable memory range.
          echo online_kernel > /sys/devices/system/memory/auto_online_blocks 
          # Bring the memory online to the low reliable memory range.
          echo online_movable > /sys/devices/system/memory/auto_online_blocks
          
        3. Dynamically disabling tiered-management functions: a value of type long is interpreted bit by bit to determine whether each tiered-reliability memory management function is enabled or disabled.

          • bit0: enables tiered-reliability memory management.
          • bit1: disables fallback to the low reliable memory range.
          • bit2: disables TMPFS to use highly reliable memory.
          • bit3: disables the page cache to use highly reliable memory.

          Other bits are reserved for extension. To change the value, write to the following proc API (the permission is 600). The value range is 0-15. The sub-functions are processed only when bit 0, the general switch, is 1; otherwise, all functions are disabled.

          # bit0 is 1, so tiered management stays enabled; bits 1-3 disable the sub-functions.
          echo 15 > /proc/sys/vm/reliable_debug
          # bit0 is 0, so all functions are disabled regardless of the other bits.
          echo 14 > /proc/sys/vm/reliable_debug
          

          This API can only disable functions. It cannot re-enable a function that was disabled at startup or during running.

          Note: this API is an escape hatch, intended only for disabling tiered-reliability memory management in abnormal scenarios or during commissioning. Do not use it as a regular function.

        4. Viewing highly reliable memory statistics: Call the native /proc/meminfo API.

          • ReliableTotal: total size of reliable memory managed by the kernel.
          • ReliableUsed: total size of reliable memory used by the system, including the reserved memory used in the system.
          • ReliableBuddyMem: remaining reliable memory in the buddy system.
          • ReliableTaskUsed: highly reliable memory used by systemd and key user processes, including anonymous pages and file pages.
          • ReliableShmem: highly reliable memory usage of the shared memory, including the total highly reliable memory used by the shared memory, TMPFS, and rootfs.
          • ReliableFileCache: highly reliable memory usage of the read/write cache.
        5. Overwrite of unallocated memory: this function requires its configuration item to be enabled.

          Enable CONFIG_CLEAR_FREELIST_PAGE and add the startup parameter clear_freelist. Then call the proc API; the only valid value is 1 (the permission is 0200).

          echo 1 > /proc/sys/vm/clear_freelist_pages
          

          Note: this feature depends on the startup parameter clear_freelist. The kernel matches only the parameter prefix, so parameters with a misspelled suffix, such as clear_freelisttt, also enable the feature.

          To prevent misoperations, the kernel module parameter cfp_timeout_ms specifies the maximum execution duration of the overwrite. If this timeout is reached, the overwrite thread exits even if the operation is not complete. The default value is 2000 ms (the permission is 0644).

          echo 500 > /sys/module/clear_freelist_page/parameters/cfp_timeout_ms # Set the timeout to 500 ms.
          
        6. Checking and modifying the reliability attribute of a process: read /proc/<pid>/reliable to check whether a process is highly reliable. The attribute can be written while the process is running, and child processes created afterward inherit it; if a child process does not need the attribute, modify its attribute manually. systemd and kernel threads do not support reading or writing this attribute. The value can be 0 or 1; the default is 0, indicating a low reliable process (the permission is 0644).

          # Change the process whose PID is 1024 to a highly reliable process.
          # It then requests memory from the highly reliable range; if that
          # allocation fails, it may fall back to the low reliable range.
          echo 1 > /proc/1024/reliable
          
        7. Setting the upper limit of highly reliable memory for user-mode processes: write /proc/sys/vm/task_reliable_limit to modify the upper limit of highly reliable memory that user-mode processes can request. The value range is [ReliableTaskUsed, ReliableTotal], in bytes (the permission is 0644). Notes:

          • The default value is ULONG_MAX, indicating no limit.
          • If the value is 0, reliable processes cannot use highly reliable memory: with fallback enabled, allocations fall back to the low reliable memory range; otherwise, OOM occurs.
          • If the value is not 0 and the limit is reached: with fallback enabled, allocations fall back to the low reliable memory range; with fallback disabled, OOM is returned.

      Highly Reliable Memory for Read and Write Cache

      Overview

      A page cache is also called a file cache. When Linux reads or writes files, the page cache caches the logical content of the files to accelerate access to images and data on disks. If low reliable memory were allocated to page caches, a UCE could be triggered when the cache is accessed, causing system exceptions. Therefore, the read/write cache (page cache) needs to be placed in the highly reliable memory zone. In addition, to prevent highly reliable memory from being exhausted by excessive page cache allocations (unlimited by default), both the total page cache size and the amount of reliable memory it uses need to be limited.

      Restrictions

      1. When the page cache exceeds the limit, it is reclaimed periodically. If the generation speed of the page cache is faster than the reclamation speed, the number of page caches may be higher than the specified limit.
      2. The usage of /proc/sys/vm/reliable_pagecache_max_bytes has certain restrictions. In some scenarios the page cache forcibly uses reliable memory, for example, when file system metadata (such as inodes and dentries) is read; in those cases, the reliable memory used by the page cache can exceed the API limit. You can run echo 2 > /proc/sys/vm/drop_caches to release inodes and dentries.
      3. When the highly reliable memory used by the page cache exceeds the reliable_pagecache_max_bytes limit, low reliable memory is allocated by default. If low reliable memory cannot be allocated either, the native allocation procedure is followed.
      4. FileCache statistics are first accumulated in a percpu cache. Only when a per-CPU value exceeds a threshold is it added to the system-wide counter and shown in /proc/meminfo. ReliableFileCache has no such threshold, so its value may be greater than that of FileCache.
      5. Write cache scenarios are restricted by dirty_limit (controlled by /proc/sys/vm/dirty_ratio, the percentage of dirty pages allowed on a single memory node); if the threshold is exceeded, the current zone is skipped. Because highly and low reliable memory are in different zones under tiered reliability, the write cache may trigger fallback on the local node and use its low reliable memory. You can run echo 100 > /proc/sys/vm/dirty_ratio to remove the restriction.
      6. The highly reliable memory feature for the read/write cache limits the page cache usage. The system performance is affected in the following scenarios:
        • If the upper limit of the page cache is too small, the I/O increases and the system performance is affected.
        • If the page cache is reclaimed too frequently, system freezing may occur.
        • If a large amount of page cache is reclaimed each time after the page cache exceeds the limit, system freezing may occur.

      Usage

      The highly reliable memory is enabled by default for the read/write cache. To disable the highly reliable memory, configure reliable_debug=P. In addition, the page cache cannot be used unlimitedly. The function of limiting the page cache size depends on the CONFIG_SHRINK_PAGECACHE configuration item.

      FileCache in /proc/meminfo can be used to query the usage of the page cache, and ReliableFileCache can be used to query the usage of the reliable memory in the page cache.

      The function of limiting the page cache size depends on several proc APIs, which are defined in /proc/sys/vm/ to control the page cache usage. For details, see the following table.

      • cache_reclaim_enable (native), permission 644, default value 1
        Whether to enable the page cache restriction function.
        Value range: 0 or 1. If an invalid value is input, an error is returned.
        Example: echo 1 > cache_reclaim_enable

      • cache_limit_mbytes (new), permission 644, default value 0
        Upper limit of the cache, in MB.
        Value range: the minimum value is 0, indicating that the restriction function is disabled; the maximum value is the memory size in MB, for example, the value displayed by running free -m (MemTotal in meminfo converted to MB).
        Example: echo 1024 > cache_limit_mbytes
        Others: it is recommended that the cache upper limit be greater than or equal to half of the total memory; otherwise, I/O performance may be affected if the cache is too small.

      • cache_reclaim_s (native), permission 644, default value 0
        Interval for triggering cache reclamation, in seconds. The system creates one work queue per online CPU (n CPUs, n work queues), and each work queue performs reclamation every cache_reclaim_s seconds. This parameter follows CPU online and offline events: the number of work queues decreases when CPUs go offline and increases when they come online.
        Value range: the minimum value is 0, indicating that periodic reclamation is disabled; the maximum value is 43200. If an invalid value is input, an error is returned.
        Example: echo 120 > cache_reclaim_s
        Others: you are advised to set the reclamation interval to several minutes (for example, 2 minutes); otherwise, frequent reclamation may cause system freezing.

      • cache_reclaim_weight (native), permission 644, default value 1
        Weight of each reclamation. Each CPU expects to reclaim 32 x cache_reclaim_weight pages per pass. The weight applies both to reclamation triggered by the page upper limit and to periodic page cache reclamation.
        Value range: 1 to 100. If an invalid value is input, an error is returned.
        Example: echo 10 > cache_reclaim_weight
        Others: you are advised to set this parameter to 10 or smaller; otherwise, the system may freeze when too much memory is reclaimed at once.

      • reliable_pagecache_max_bytes (new), permission 644, default value ULONG_MAX (usage is not limited)
        Total amount of highly reliable memory available to the page cache.
        Value range: 0 to the maximum highly reliable memory, in bytes; the maximum can be queried from /proc/meminfo. If an invalid value is input, an error is returned.
        Example: echo 4096000 > reliable_pagecache_max_bytes

      Highly Reliable Memory for TMPFS

      Overview

      If TMPFS is used as rootfs, it stores core files and data used by the OS. However, TMPFS uses low reliable memory by default, which makes core files and data unreliable. Therefore, TMPFS must use highly reliable memory.

      Usage

      By default, highly reliable memory is enabled for TMPFS. To disable it at startup, configure reliable_debug=S. You can also disable it dynamically through /proc/sys/vm/reliable_debug, but you cannot dynamically enable it.

      When enabling TMPFS to use highly reliable memory, you can check ReliableShmem in /proc/meminfo to view the highly reliable memory that has been used by TMPFS.

      By default, the upper limit for TMPFS to use highly reliable memory is half of the physical memory (except when TMPFS is used as rootfs). Conventional SYS V shared memory is restricted by /proc/sys/kernel/shmmax and /proc/sys/kernel/shmall, which can be configured dynamically; it is additionally restricted by the limit on highly reliable memory used by TMPFS. For details, see the following table.

      Parameter | Description
      /proc/sys/kernel/shmmax (native) | Size of a single SYS V shared memory range.
      /proc/sys/kernel/shmall (native) | Total size of the SYS V shared memory that can be used.

      The /proc/sys/vm/shmem_reliable_bytes_limit API is added to set the amount of highly reliable memory, in bytes, available to system-level TMPFS. The default value is LONG_MAX, indicating no limit; the value ranges from 0 to the total reliable memory size of the system (the permission is 644). When fallback is disabled and usage reaches the limit, an error indicating that no memory is available is returned. When fallback is enabled, the system attempts to allocate from the low reliable memory zone. Example:

      echo 10000000 > /proc/sys/vm/shmem_reliable_bytes_limit
      

      No System Reset upon UCE During the Switch from User Mode to Kernel Mode

      Overview

      Under the tiered-reliability memory management solution, the kernel and key processes use highly reliable memory, while most user-mode processes use low reliable memory. At runtime, a large amount of data is exchanged between user mode and kernel mode. When data is passed to kernel mode, it is copied from the low reliable memory zone to the highly reliable memory zone, and this copy runs in kernel mode. If a UCE occurs while the user-mode data is read, that is, a kernel-mode memory-consumption UCE, the system triggers a panic. This sub-feature avoids a system reset for UCEs that occur during the switch from user mode to kernel mode in the copy-on-write (COW), copy_to_user, copy_from_user, get_user, put_user, and core dump scenarios. Other scenarios are not supported.

      Restrictions

      1. The processor must be ARMv8.2 or later and support the RAS feature.
      2. This feature changes the synchronous exception handling policy, so it takes effect only when the kernel receives a synchronous exception reported by the firmware.
      3. The kernel processing depends on the error type reported by the BIOS. The kernel cannot process fatal hardware errors but can process recoverable hardware errors.
      4. Only the COW, copy_to_user (including the read page cache), copy_from_user, get_user, put_user, and core dump scenarios are supported.
      5. In the core dump scenario, UCE tolerance needs to be implemented on the write API of the file system. This feature supports only three common file systems: ext4, TMPFS, and PipeFS. The corresponding error tolerance APIs are as follows:
        • PipeFS: copy_page_from_iter
        • ext4/TMPFS: iov_iter_copy_from_user_atomic

      Usage

      Ensure that CONFIG_ARCH_HAS_COPY_MC is enabled in the kernel. If /proc/sys/kernel/machine_check_safe is set to 1, this feature is enabled for all scenarios. If /proc/sys/kernel/machine_check_safe is set to 0, this feature is disabled for all scenarios. Other values are invalid.

      The fault tolerance mechanism in each scenario is as follows:

      No. | Scenario | Symptom | Mitigation Measure
      1 | copy_from/to_user: basic switch to user mode, involving syscall, sysctl, and procfs operations | If a UCE occurs during the copy, the kernel resets. | If a UCE occurs, kill the current process; the kernel does not reset.
      2 | get/put_user: simple variable copy, mainly in netlink scenarios | If a UCE occurs during the copy, the kernel resets. | If a UCE occurs, kill the current process; the kernel does not reset.
      3 | COW: fork creates a subprocess, which triggers COW | If a UCE occurs during COW, the kernel resets. | If a UCE occurs, kill the related processes; the kernel does not reset.
      4 | Read cache: user mode uses low reliable memory. When a user-mode program reads or writes files, the OS caches disk files in idle memory to improve performance, and the kernel accesses the cache first when the program reads a file. | A UCE causes the kernel to reset. | If a UCE occurs, kill the current process; the kernel does not reset.
      5 | Core dump: a UCE is triggered by memory access during a core dump | A UCE causes the kernel to reset. | If a UCE occurs, kill the current process; the kernel does not reset.
      6 | Write cache: a UCE is triggered when the write cache is flushed back to the disk | Cache flushing is disk DMA data migration; if a UCE is triggered during it, the page write fails after a timeout, data becomes inconsistent, and the file system becomes unavailable. If key data is affected, the kernel resets. | No solution is available; the kernel resets.
      7 | Kernel startup parameters and module parameters use highly reliable memory | / | Not supported; the risk is reduced.
      8 | relayfs: a file system that quickly forwards data from kernel mode to user mode | / | Not supported; the risk is reduced.
      9 | seq_file: transfers kernel data to user mode as a file | / | Not supported; the risk is reduced.

      Most user-mode data uses low reliable memory, so this project involves only the scenario where user-mode data is read in kernel mode. In Linux, data can be exchanged between user space and kernel space through kernel startup parameters, module parameters, sysfs, sysctl, syscall (system call), netlink, procfs, seq_file, debugfs, and relayfs. There are two additional cases: COW when a process is created, and the read/write file cache (page cache).

      In the sysfs, syscall, netlink, and procfs modes, data is transferred from user mode to kernel mode through copy_from_user or get_user.

      Data is transferred from user mode to kernel mode in the following scenarios: copy_from_user, get_user, COW, read cache, and write cache flushing.

      Data is transferred from kernel mode to user mode in the following scenarios: relayfs, seq_file, copy_to_user, and put_user.
