Restrictions

This section describes the general constraints of this feature. Each subfeature has specific constraints, which are described in the corresponding section.

Compatibility

Currently, this feature applies only to ARM64.
The hardware must support partial memory mirroring (address range mirroring): memory with the EFI_MEMORY_MORE_RELIABLE attribute is reported through the standard UEFI API, and common memory needs no attribute. Mirrored memory is the highly reliable memory, and common memory is the low reliable memory.
Tiering of highly reliable and low reliable memory is implemented with the kernel's memory management zones. Memory cannot flow dynamically between the tiers; that is, pages cannot move between zones.
Contiguous physical memory with different reliability levels is split into different memblocks. As a result, allocating large contiguous physical memory blocks may be restricted after memory tiering is enabled.
To enable this feature, set the kernelcore boot parameter to reliable. This setting is incompatible with other values of the parameter.
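The kernelcore requirement can be checked against a kernel command line. The following is a minimal sketch; the helper names (kernelcore_values, tiering_requested, read_cmdline) are hypothetical, and treating any additional non-reliable kernelcore value as a conflict is an assumption drawn from the incompatibility statement above.

```python
def kernelcore_values(cmdline):
    """Return every value given to kernelcore= on a kernel command line."""
    prefix = "kernelcore="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


def tiering_requested(cmdline):
    """True only if kernelcore is present and every occurrence is 'reliable'.

    Assumption: a second, different kernelcore value counts as a conflict.
    """
    values = kernelcore_values(cmdline)
    return bool(values) and all(v == "reliable" for v in values)


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a running system, tiering_requested(read_cmdline()) reports whether the boot parameter was passed. It does not prove that the firmware actually reported mirrored memory.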

Design Specifications

During kernel-mode development, pay attention to the following points when allocating memory:
- If the memory allocation API accepts a gfp_flag argument, only allocations whose gfp_flag contains __GFP_HIGHMEM and __GFP_MOVABLE are forced into the common memory range or redirected to the reliable memory range. Allocations with other gfp_flags are not affected.
- Allocations from slab, slub, and slob use highly reliable memory. (If a single allocation is larger than KMALLOC_MAX_CACHE_SIZE and its gfp_flag selects the common memory range, low reliable memory may be allocated.)
During user-mode development, pay attention to the following points when allocating memory:
- After a common process is changed to a key process, highly reliable memory is used only from the actual physical memory allocation phase (page fault) onward. The attribute of previously allocated memory does not change, and the same holds in the opposite direction. Therefore, memory allocated between the start of a common process and its change to a key process may not be highly reliable. To verify that the configuration has taken effect, check whether the physical address behind a virtual address lies in the highly reliable memory range.
- User-mode allocators such as ptmalloc (chunks in glibc), tcmalloc, and dpdk cache memory to improve performance. This caching makes the allocation logic seen by the user inconsistent with that of the kernel. When a common process becomes a key process, memory served from such a cache does not pick up the new attribute, because the attribute takes effect only when the kernel allocates memory.
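The physical-address check mentioned above can be sketched with /proc/<pid>/pagemap, which stores one 64-bit entry per virtual page (bits 0-54 hold the page frame number and bit 63 marks the page as present). This is a minimal sketch: the function names are hypothetical, the list of highly reliable physical ranges must be supplied by the caller because this section does not say where to obtain it, and reading real frame numbers requires CAP_SYS_ADMIN (otherwise the frame number reads as zero).

```python
import struct

PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number
PRESENT_BIT = 1 << 63      # bit 63: page is present in RAM


def decode_pagemap_entry(entry, page_size=4096):
    """Return the physical base address of the page, or None if not present."""
    if not entry & PRESENT_BIT:
        return None
    return (entry & PFN_MASK) * page_size


def physical_address(vaddr, pid="self", page_size=4096):
    """Translate a virtual address through /proc/<pid>/pagemap (Linux only)."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)   # one 8-byte entry per virtual page
        data = f.read(8)
    if len(data) != 8:
        return None
    (entry,) = struct.unpack("=Q", data)   # native byte order
    base = decode_pagemap_entry(entry, page_size)
    return None if base is None else base + vaddr % page_size


def in_reliable_ranges(phys_addr, ranges):
    """ranges: caller-supplied (start, end) pairs, end exclusive (assumption)."""
    return any(start <= phys_addr < end for start, end in ranges)
```

Because pages are assigned at page-fault time, the check is meaningful only for addresses that the process has already touched.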
When an upper-layer service requests memory and the highly reliable memory is insufficient (the native min watermark of the zone is hit) or the corresponding limit is reached, the page cache is released first in an attempt to reclaim highly reliable memory. If the memory still cannot be allocated, the kernel either triggers OOM or falls back to the low reliable memory range, depending on the fallback switch. (Fallback means that when a memory management zone or node runs short of memory, memory is allocated from other zones or nodes.)
Dynamic memory migration mechanisms such as NUMA_BALANCING may migrate allocated highly reliable or low reliable memory to another node. Because migration loses the allocation context and the target node may not have memory of the corresponding reliability, the reliability of the memory after migration may not be as expected.

The following configuration files are introduced based on the usage of the user-mode highly reliable memory:

/proc/sys/vm/task_reliable_limit: upper limit of the highly reliable memory used by key processes (including systemd). It contains anonymous pages and file pages. The SHMEM used by the process is also counted (included in anonymous pages).
/proc/sys/vm/reliable_pagecache_max_bytes: soft upper limit of the highly reliable memory used by the global page cache. It limits the amount of highly reliable page cache used by common processes. By default, the system does not limit the highly reliable memory used by the page cache. This limit does not apply to highly reliable processes, file system metadata, and similar scenarios. Regardless of whether fallback is enabled, when a common process reaches the upper limit, low reliable memory is allocated by default. If low reliable memory cannot be allocated, the native procedure is followed.
/proc/sys/vm/shmem_reliable_bytes_limit: soft upper limit of the highly reliable memory used by the global SHMEM. It limits the amount of highly reliable memory used by the SHMEM of common processes. By default, the system does not limit the highly reliable memory used by SHMEM. Highly reliable processes are not subject to this limit. When fallback is disabled and a common process reaches the upper limit, memory allocation fails but OOM does not occur (consistent with the native behavior).

If the above limits are reached, memory allocation fallback or OOM may occur.
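The three limits can be read together for inspection. The sketch below assumes that each file holds a single decimal integer and makes no assumption about units. The function name read_reliable_limits and the base parameter are hypothetical; the base directory is a parameter only so that the function can be exercised outside a kernel that provides these files.

```python
from pathlib import Path

LIMIT_FILES = (
    "task_reliable_limit",
    "reliable_pagecache_max_bytes",
    "shmem_reliable_bytes_limit",
)


def read_reliable_limits(base="/proc/sys/vm"):
    """Return {file name: integer value}, or None where a file is missing
    or does not start with an integer (assumed format)."""
    limits = {}
    for name in LIMIT_FILES:
        try:
            limits[name] = int((Path(base) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            limits[name] = None
    return limits
```

A value of None for all three names usually means that the running kernel does not provide this feature.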

Memory allocation caused by page faults generated by key processes in the TMPFS or page cache may trigger multiple limits. For details about the interaction between multiple limits, see the following table.

task_reliable_limit reached	reliable_pagecache_max_bytes or shmem_reliable_bytes_limit reached	Memory allocation policy
Yes	Yes	The page cache is reclaimed first to meet the allocation requirements. Otherwise, fallback or OOM occurs.
Yes	No	The page cache is reclaimed first to meet the allocation requirements. Otherwise, fallback or OOM occurs.
No	No	Highly reliable memory is allocated first. Otherwise, fallback or OOM occurs.
No	Yes	Highly reliable memory is allocated first. Otherwise, fallback or OOM occurs.
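The table can be read as a two-step policy in which only task_reliable_limit changes the first step. The following is an illustrative model of the table, not kernel code; the function name and the step labels are hypothetical.

```python
def allocation_steps(task_limit_reached, cache_or_shmem_limit_reached):
    """Return the ordered steps for a key-process page fault, per the table.

    The second argument is accepted to mirror the table columns; according
    to the table it does not change the outcome.
    """
    if task_limit_reached:
        first = "reclaim page cache"
    else:
        first = "allocate highly reliable memory"
    return (first, "fallback or OOM")
```

For example, allocation_steps(True, False) and allocation_steps(True, True) both start with page cache reclamation.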

Key processes comply with task_reliable_limit. If task_reliable_limit is greater than the TMPFS or page cache limit (shmem_reliable_bytes_limit or reliable_pagecache_max_bytes), the page cache and TMPFS generated by key processes still use highly reliable memory. As a result, the highly reliable memory used by the page cache and TMPFS can exceed the corresponding limit.

When task_reliable_limit is triggered, if the size of the highly reliable file cache is less than 4 MB, the file cache will not be reclaimed synchronously. If the highly reliable file cache is less than 4 MB when the page cache is generated, the allocation will fall back to the low reliable memory range. If the highly reliable file cache is greater than or equal to 4 MB, the page cache is reclaimed preferentially for allocation. However, when the size is close to 4 MB, direct cache reclamation is triggered more frequently. Because the lock overhead of direct cache reclamation is high, the CPU usage is high. In this case, the file read/write performance is close to the raw disk performance.
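The 4 MB rule for page cache allocation once task_reliable_limit is reached can be modeled as a single comparison. This is a sketch of the behavior described above, with a hypothetical function name and labels; it is not the kernel implementation.

```python
RECLAIM_THRESHOLD = 4 * 1024 * 1024  # 4 MB, as stated in the text


def page_cache_action(reliable_file_cache_bytes):
    """Action for a page cache allocation once task_reliable_limit is reached."""
    if reliable_file_cache_bytes < RECLAIM_THRESHOLD:
        # Too little reliable file cache to reclaim synchronously.
        return "fall back to low reliable memory"
    return "reclaim page cache"
```

Workloads whose highly reliable file cache hovers just above the threshold take the reclamation branch on most allocations, which is where the CPU overhead described above comes from.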

Even if the system has sufficient highly reliable memory, an allocation may fall back to the low reliable memory range.
- If the allocation cannot move to another node, it falls back to the low reliable memory range of the current node. The common scenarios are as follows:
  - The memory allocation contains __GFP_THISNODE (for example, transparent huge page allocation), so memory can be allocated only from the current node. If the highly reliable memory of the node cannot meet the allocation requirements, the system attempts to allocate from the low reliable memory range of the same node.
  - A process is bound, with commands such as taskset and numactl, to a node that contains common memory.
  - A process is scheduled to a common memory node by the system's native scheduling mechanism.
- A highly reliable memory allocation reaches the highly reliable memory usage threshold, which also causes fallback to the low reliable memory range.
If tiered-reliability memory fallback is disabled, highly reliable memory allocations cannot spill into low reliable memory. As a result, user-mode applications that estimate memory usage, for example by deriving the available memory from MemFree, may not be compatible with this feature.
If tiered-reliability memory fallback is enabled, the native fallback is affected. The main difference lies in the selection of the memory management zone and NUMA node.

Fallback process of common user processes: low reliable memory of the local node -> low reliable memory of the remote node.
Fallback process of key user processes: highly reliable memory of the local node -> highly reliable memory of the remote node. If no memory can be allocated and tiered-reliability memory fallback is enabled, the system retries as follows: low reliable memory of the local node -> low reliable memory of the remote node.
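The two fallback sequences above can be written as ordered search lists. This is an illustrative model with hypothetical names; each entry is a (node, tier) pair tried in order.

```python
def fallback_order(key_process, fallback_enabled):
    """Return the (node, tier) pairs tried in order for a user process."""
    low = [("local", "low"), ("remote", "low")]
    if not key_process:
        return low
    high = [("local", "high"), ("remote", "high")]
    # Key processes retry in low reliable memory only if fallback is enabled.
    return high + low if fallback_enabled else high
```

For a key process with fallback disabled, the list ends after the remote highly reliable memory, which corresponds to the OOM outcome described earlier.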

Scenarios

The default page size (PAGE_SIZE) is 4 KB.
The lowest 4 GB of memory on NUMA node 0 must be highly reliable, and the sizes of the highly reliable and low reliable memory must meet the kernel requirements. Otherwise, the system may fail to start. There is no requirement on the highly reliable memory size of other nodes. However, if a node has no highly reliable memory or too little of it, the per-node management structure may be placed in the highly reliable memory of another node (the per-node management structure is a kernel data structure and must reside in a highly reliable memory zone). As a result, kernel warnings such as vmemmap_verify warnings are generated and performance is affected.
Some statistics of this feature (such as the total amount of highly reliable memory used by TMPFS) are collected with per-CPU counters, which add overhead. To reduce the performance impact, the sum is computed approximately. An error of less than 10% is normal.
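The source of the statistical error can be illustrated with a batched per-CPU counter. This is a conceptual model under the assumption that per-CPU deltas are folded into a shared total only once they reach a batch size; the class name and the batch value are hypothetical and do not describe this feature's actual code.

```python
class BatchedCounter:
    """Per-CPU deltas are folded into the shared total only in batches."""

    def __init__(self, cpus, batch):
        self.batch = batch
        self.global_count = 0
        self.local = [0] * cpus

    def add(self, cpu, amount):
        self.local[cpu] += amount
        if abs(self.local[cpu]) >= self.batch:
            # Fold the local delta into the shared total.
            self.global_count += self.local[cpu]
            self.local[cpu] = 0

    def read_fast(self):
        """Cheap read: ignores deltas still held per CPU."""
        return self.global_count

    def read_exact(self):
        """Expensive read: walks every CPU."""
        return self.global_count + sum(self.local)
```

The gap between read_fast and read_exact is bounded by roughly the batch size times the number of CPUs, which is the kind of error the statistics tolerate in exchange for cheaper updates.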
Huge page limitations:
- In the startup phase, static huge pages are low reliable memory. By default, static huge pages allocated at run time are also low reliable memory. If the allocation occurs in the context of a key process, the allocated huge pages are highly reliable memory.
- In the transparent huge page (THP) scenario, if any one of the 512 4 KB pages to be merged (taking a 2 MB huge page as an example) is a highly reliable page, the newly allocated 2 MB huge page uses highly reliable memory. That is, THP increases the use of highly reliable memory.
- The allocation of reserved 2 MB huge pages follows the native fallback process. If the current node lacks low reliable memory, the allocation falls back to the highly reliable range.
- In the startup phase, 2 MB huge pages are reserved. If no memory node is specified, the reservation is balanced across all memory nodes. If a memory node lacks low reliable memory, highly reliable memory is used according to the native process.
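The THP rule in the list above (one highly reliable 4 KB page is enough to make the whole 2 MB page highly reliable) can be expressed as follows. This is a minimal sketch with hypothetical function names; it also computes how much additional highly reliable memory the merge consumes.

```python
PAGE_SIZE = 4096
PAGES_PER_THP = 512  # 2 MB / 4 KB


def merged_page_is_reliable(small_pages_reliable):
    """small_pages_reliable: 512 booleans, one per 4 KB page to be merged."""
    if len(small_pages_reliable) != PAGES_PER_THP:
        raise ValueError("expected 512 small pages")
    return any(small_pages_reliable)


def extra_reliable_bytes(small_pages_reliable):
    """Highly reliable memory consumed beyond what the small pages used."""
    if not merged_page_is_reliable(small_pages_reliable):
        return 0
    return (PAGES_PER_THP - sum(small_pages_reliable)) * PAGE_SIZE
```

In the worst case a single highly reliable 4 KB page causes 511 additional pages, just under 2 MB, to be taken from highly reliable memory.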
Currently, only the normal system startup scenario is supported. In some abnormal scenarios, kernel startup may be incompatible with the memory tiering function, for example, the kdump startup phase. (Currently, kdump can be automatically disabled. In other scenarios, it needs to be disabled by upper-layer services.)
In the swap-in and swap-out, memory offline, KSM, cma, and gigantic page processes, tiered-reliability memory is not taken into account when new pages are allocated. As a result, the type of these pages is undefined: for example, the highly reliable memory usage statistics may be inaccurate and the reliability level of the allocated memory may not be as expected.

Impact on Performance

Tiered-reliability memory management adds decision logic to physical page allocation, which affects performance. The impact depends on the system status, the memory type, and the margin of highly reliable and low reliable memory on each node.
This feature also collects highly reliable memory usage statistics, which affects system performance.
When task_reliable_limit is triggered, the cache in the highly reliable zone is reclaimed synchronously, which increases the CPU usage. In the scenario where task_reliable_limit is triggered by page cache allocation (file read/write operations, such as dd), if the available highly reliable memory (ReliableFileCache is considered as available memory) is close to 4 MB, cache reclamation is triggered more frequently. The overhead of direct cache reclamation is high, causing high CPU usage. In this case, the file read/write performance is close to the raw disk performance.
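To watch for the condition described above, the ReliableFileCache value can be compared with the 4 MB threshold. The sketch below assumes that ReliableFileCache appears in /proc/meminfo in the usual "Name: value kB" form, which this section does not state explicitly. The function names and the 1 MB margin used to define "close" are hypothetical.

```python
def parse_meminfo(text):
    """Map field name -> value in bytes for lines such as 'MemFree: 1024 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts or not parts[0].isdigit():
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


def near_reclaim_threshold(fields, margin_bytes=1024 * 1024):
    """True if ReliableFileCache is within margin_bytes of 4 MB (assumed margin)."""
    cache = fields.get("ReliableFileCache")
    if cache is None:
        return None  # field not reported by this kernel
    return abs(cache - 4 * 1024 * 1024) <= margin_bytes
```

On a system with the feature enabled, the input would be the contents of /proc/meminfo, sampled while the file read/write workload is running.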
