GCC Base Performance Optimization Guide
Overview
Optimizing the baseline performance of the compiler improves the development efficiency, runtime performance, and maintainability of applications. It is an important research direction in computer science and a key step in software development. Building on the general compilation optimization capabilities of GCC, GCC for openEuler enhances mid-end and back-end performance optimization technologies, including instruction optimization, vectorization enhancement, prefetch enhancement, and data flow analysis enhancement.
Installation and Deployment
Software Requirements
OS: openEuler 22.03 LTS SP3
Hardware Requirements
AArch64 architecture
Software Installation
Install GCC and related components as required. For example, install GCC:
yum install gcc
Usage
CRC Optimization
Description
Code that computes a cyclic redundancy check (CRC) is identified and compiled to efficient hardware instructions.
Usage
Add the -floop-crc option during compilation.
Note: -floop-crc must be used together with -O3 -march=armv8.1-a.
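As a rough illustration of the kind of source code a CRC-recognition pass looks for, the following C function computes a standard CRC-32 one bit at a time. The function name crc32_bitwise is hypothetical, and whether -floop-crc matches this exact loop shape (rather than, say, a table-driven loop) is an assumption, not something this guide states.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320.
   Hypothetical example; not taken from the GCC for openEuler sources. */
uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero. */
            uint32_t mask = -(crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

To see whether the loop was replaced, compile with -O3 -march=armv8.1-a -floop-crc and inspect the generated assembly for hardware CRC32 instructions.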
IF-conversion Enhancement
Description
IF-conversion is enhanced to use more registers to reduce conflicts.
Usage
This enhancement is part of the IF-conversion optimization of the Register Transfer Language (RTL). Enable the enhancement by using the following options.
-fifcvt-allow-complicated-cmps
--param=ifcvt-allow-register-renaming=[0,1,2]
The default value is 0. The value controls the scope of the optimization.
Note: This enhancement requires the -O2 optimization level and must be used together with --param=max-rtl-if-conversion-unpredictable-cost=48 and --param=max-rtl-if-conversion-predictable-cost=48.
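For readers unfamiliar with IF-conversion, the sketch below shows a small branch that RTL if-conversion can typically turn into a conditional-select instruction instead of a jump. The function name add_clamped is hypothetical, and stock GCC may already convert a case this simple; the enhanced options are aimed at more complicated comparisons and register usage than this example exercises.

```c
/* A short, data-dependent branch: a candidate for conversion into a
   branch-free conditional select. Hypothetical example. */
int add_clamped(int a, int b, int limit)
{
    int sum = a + b;

    if (sum > limit)
        sum = limit;
    return sum;
}
```

Comparing the assembly produced with and without the options listed above is one way to check whether the branch was removed.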
Multiplication Optimization
Description
Arm instructions are combined to convert low-order multiplications into high-order multiplication instructions.
Usage
Use the -fuaddsub-overflow-match-all and -fif-conversion-gimple options.
Note: This optimization requires the -O3 or higher optimization level and must be used together with the -ftree-fold-phiopt option.
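The sketch below shows one plausible shape of code this optimization is aimed at: a 64-bit by 64-bit multiplication built from 32-bit partial products, with carries detected through unsigned-addition overflow checks. That this is the targeted pattern is an assumption inferred from the option names; the function name mulhi_u64 is hypothetical.

```c
#include <stdint.h>

/* High 64 bits of a 64x64-bit unsigned product, computed from four
   32x32-bit partial products. Hypothetical example. */
uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Sum the two middle products; detect wraparound of the addition. */
    uint64_t mid = p1 + p2;
    uint64_t carry_mid = (mid < p1) ? 1u : 0u;

    /* Low word of the product; detect wraparound again. */
    uint64_t lo = p0 + (mid << 32);
    uint64_t carry_lo = (lo < p0) ? 1u : 0u;

    return p3 + (mid >> 32) + (carry_mid << 32) + carry_lo;
}
```

AArch64 provides a single instruction (UMULH) for the high half of a 64-bit product, so a compiler that recognizes the whole sequence can replace it with far fewer instructions.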
CMLT Instruction Generation
Description
CMLT instructions are generated for some elementary arithmetic operations to reduce the number of instructions.
Usage
Use the -mcmlt-arith option.
Note: This optimization requires the -O3 or higher optimization level.
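CMLT is the AArch64 SIMD "compare less than zero" instruction, which produces an all-ones mask for negative lanes. The sketch below shows a scalar sign-mask idiom of the kind such an instruction can replace once the loop is vectorized. Treating this idiom as what -mcmlt-arith matches is an assumption; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Absolute value of two signed 16-bit lanes packed into one word as
   (hi << 16) + lo with wrapping arithmetic, so a negative low lane
   borrows from the high lane. Hypothetical example. */
uint32_t abs2_packed(uint32_t a)
{
    /* s holds 0xFFFF in each lane whose sign bit (bit 15 or 31) is set. */
    uint32_t s = ((a >> 15) & 0x00010001u) * 0xFFFFu;

    /* Adding s and then XOR-ing with s negates exactly the marked lanes. */
    return (a + s) ^ s;
}

/* Apply the operation across an array so the loop can be vectorized. */
void abs2_array(uint32_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = abs2_packed(v[i]);
}
```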
Vectorization Enhancement
Description
Redundant instructions generated during vectorization are identified and simplified, and shorter arrays can be vectorized.
Usage
Use --param=tree-forwprop-perm=1 and --param=vect-alias-flexible-segment-len=1. Both parameters default to 0.
Note: This optimization requires the -O3 or higher optimization level.
maxmin and UZP1/UZP2 Instruction Optimization
Description
The maxmin and UZP1/UZP2 instructions are optimized to reduce the total instructions and improve performance.
Usage
Use the -fconvert-minmax option. UZP1/UZP2 instruction optimization is enabled by default at the -O3 or higher optimization level.
Note: This optimization requires the -O3 or higher optimization level.
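As an illustration, the loop below tracks a running maximum with an explicit comparison and assignment, which is the kind of conditional that can be expressed as a single MAX operation. Whether -fconvert-minmax handles this exact form is an assumption; the function name array_max is hypothetical.

```c
#include <stddef.h>

/* Running maximum written as a compare-and-assign; semantically a MAX
   of best and v[i]. Returns init when n is 0. Hypothetical example. */
int array_max(const int *v, size_t n, int init)
{
    int best = init;

    for (size_t i = 0; i < n; i++) {
        if (v[i] > best)
            best = v[i];
    }
    return best;
}
```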
LDP and STP Optimization
Description
Each poorly performing LDP or STP instruction is split into two LDR or two STR instructions, respectively.
Usage
Use the -fsplit-ldp-stp option. Use --param=param-ldp-dependency-search-range=[1,32] to control the search range. The default value is 16.
Note: This optimization requires the -O1 or higher optimization level.
AES Instruction Optimization
Description
Code that implements the AES algorithm is identified and accelerated with hardware instructions.
Usage
Use the -fcrypto-accel-aes option.
Note: This optimization requires the -O3 or higher optimization level.
Indirect Call Optimization
Description
Indirect calls in programs are identified and analyzed to convert them into direct calls.
Usage
Use the -ficp and -ficp-speculatively options.
Note: This optimization must be used together with -O2 -flto -flto-partition=one.
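The sketch below shows what an indirect call looks like in source form: reduce calls through a function pointer, and each caller passes a known function. With whole-program visibility, a compiler can in principle replace op(...) with a direct (and possibly inlined) call. All names here are hypothetical, and the exact analysis performed by -ficp is not described in this guide.

```c
#include <stddef.h>

typedef int (*binop_fn)(int, int);

static int add(int a, int b) { return a + b; }
static int mul(int a, int b) { return a * b; }

/* Folds v[0..n) into init using op; the call through op is indirect. */
int reduce(const int *v, size_t n, int init, binop_fn op)
{
    int acc = init;

    for (size_t i = 0; i < n; i++)
        acc = op(acc, v[i]);
    return acc;
}

/* Each caller passes a fixed target, so the pointer value is knowable. */
int sum_all(const int *v, size_t n)     { return reduce(v, n, 0, add); }
int product_all(const int *v, size_t n) { return reduce(v, n, 1, mul); }
```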
IPA-prefetch
Description
Indirect memory accesses in a loop are identified to insert a prefetch instruction, thereby reducing the delay of indirect memory accesses.
Usage
Use the -fipa-prefetch and -fipa-ic options.
Note: This optimization must be used together with -O3 -flto.
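To make "indirect memory access in a loop" concrete, the first function below reads data[index[i]], where the address of each load depends on a value loaded from another array. The second function is a hand-written approximation of the effect of inserting a prefetch, using the GCC builtin __builtin_prefetch. The function names and the prefetch distance of 8 iterations are illustrative assumptions, not values used by the compiler pass.

```c
#include <stddef.h>

/* Indirect access: the address of each data[] load depends on index[i]. */
long gather_sum(const int *data, const size_t *index, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++)
        total += data[index[i]];
    return total;
}

/* Same computation with a manual prefetch a few iterations ahead.
   PREFETCH_DIST is an arbitrary illustrative value. */
#define PREFETCH_DIST 8

long gather_sum_prefetch(const int *data, const size_t *index, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&data[index[i + PREFETCH_DIST]]);
        total += data[index[i]];
    }
    return total;
}
```

Both functions return the same result; the prefetch only hints to the hardware that a cache line will be needed soon.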
LLC-prefetch
Description
GCC for openEuler analyzes the main execution paths of a program, performs memory reuse analysis on the loops on those paths, ranks the hottest data, and inserts prefetch instructions that bring this data into the last-level cache (LLC) ahead of use, reducing LLC misses.
Usage
Use the -fllc-allocate option. The -O2 or higher optimization level is required.
Other related interfaces:
Option | Default Value | Description |
---|---|---|
--param=mem-access-ratio=[0,100] | 20 | Ratio of the number of memory accesses in a loop to the number of instructions. |
--param=mem-access-num=unsigned | 3 | Number of memory accesses in a loop. |
--param=outer-loop-nums=[1,10] | 1 | Maximum number of outer loop layers that can be unrolled. |
--param=filter-kernels=[0,1] | 1 | Whether to perform path series filtering on loops. |
--param=branch-prob-threshold=[50,100] | 80 | Probability threshold for a branch to be considered highly probable. |
--param=prefetch-offset=[1,999999] | 1024 | Prefetch offset distance. Generally, the value is a power of 2. |
--param=issue-topn=unsigned | 1 | Number of prefetch instructions. |
--param=force-issue=[0,1] | 0 | Whether to perform forcible prefetch, that is, the static mode. |
--param=llc-capacity-per-core=[0,999999] | 114 | Average LLC capacity allocated to each core in multi-branch prefetch mode. |