GCC Base Performance Optimization Guide

Overview

Optimizing compiler base performance is crucial to the development efficiency, runtime performance, and maintainability of applications. It is an important research direction in computer science and a key step in software development. Building on general-purpose compilation optimizations, GCC for openEuler enhances mid-end and back-end performance optimization technologies, including instruction optimization, vectorization enhancement, prefetch enhancement, and data flow analysis enhancement.

Installation and Deployment

Software Requirements

OS: openEuler 22.03 LTS SP4

Hardware Requirements

AArch64 architecture

Software Installation

Install GCC and related components as required. For example, install GCC:

```shell
yum install gcc
```

Usage

CRC Optimization

Description

Cyclic redundancy check (CRC) code is identified and compiled into efficient hardware instructions.

Usage

Add the -floop-crc option during compilation.

Note: -floop-crc must be used together with -O3 -march=armv8.1-a.
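As an illustration of the kind of source code this option targets, the following is a minimal sketch of a software CRC-32 loop. The function name `crc32_bitwise` is hypothetical, and whether the pass matches this particular loop shape (rather than, for example, a table-driven variant) is an assumption; only the compiler options in the comment come from this guide.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Possible build line, using only the options listed in this section:
 *   gcc -O3 -march=armv8.1-a -floop-crc -c crc.c
 * Assumption: the pass may or may not recognize this exact loop shape.
 */

/* Bit-at-a-time CRC-32 using the reflected polynomial 0xEDB88320. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;              /* standard initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = -(crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;                             /* final inversion */
}
```

Comparing the generated assembly with and without `-floop-crc` is the simplest way to check whether the loop was recognized.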

IF-conversion Enhancement

Description

IF-conversion is enhanced to use more registers to reduce conflicts.

Usage

This enhancement is part of the IF-conversion optimization of the Register Transfer Language (RTL). Enable the enhancement by using the following options.

-fifcvt-allow-complicated-cmps

--param=ifcvt-allow-register-renaming=[0,1,2] The default value is 0. The value controls the scope of the optimization.

Note: This enhancement requires the -O2 optimization level and must be used together with --param=max-rtl-if-conversion-unpredictable-cost=48 and --param=max-rtl-if-conversion-predictable-cost=48.
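The following sketch shows branchy code of the kind that RTL IF-conversion typically turns into conditional-select instructions. The function `clamp_diff` is a hypothetical example, and it is an assumption that the enhanced pass changes the code generated for it.

```c
#include <stdint.h>

/*
 * Possible build line, using only the options listed in this section:
 *   gcc -O2 -fifcvt-allow-complicated-cmps \
 *       --param=ifcvt-allow-register-renaming=1 \
 *       --param=max-rtl-if-conversion-unpredictable-cost=48 \
 *       --param=max-rtl-if-conversion-predictable-cost=48 -c clamp.c
 */

/* Clamp (a - b) to the range [-limit, limit]; limit is assumed >= 0. */
int32_t clamp_diff(int32_t a, int32_t b, int32_t limit)
{
    int32_t d = a - b;

    if (d > limit)
        d = limit;          /* candidate for a conditional select */
    else if (d < -limit)
        d = -limit;         /* second dependent comparison */
    return d;
}
```

Each branch only assigns a value, so the compiler can in principle compute both outcomes in separate registers and select one, which is where additional registers become useful.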

Multiplication Optimization

Description

Arm instructions are combined to convert low-order multiplications into high-order multiplication instructions.

Usage

Use the -fuaddsub-overflow-match-all and -fif-conversion-gimple options.

Note: This optimization requires the -O3 or higher optimization level and must be used together with the -ftree-fold-phiopt option.
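One plausible reading of this optimization is that a wide multiplication written out with narrower multiplications and manual carry checks is recognized and replaced with a single high-part multiply. The sketch below is written under that assumption; the function `mul_high_u64` is hypothetical, and the guide does not confirm that this exact pattern is matched.

```c
#include <stdint.h>

/*
 * Possible build line, using only the options listed in this section:
 *   gcc -O3 -fuaddsub-overflow-match-all -fif-conversion-gimple \
 *       -ftree-fold-phiopt -c mulhi.c
 * Assumption: this hand-written pattern is the kind the pass targets.
 */

/* Upper 64 bits of the 128-bit product a * b, built from 32-bit halves. */
uint64_t mul_high_u64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    uint64_t cross = hi_lo + lo_hi;
    uint64_t carry = 0;
    if (cross < hi_lo)              /* unsigned add overflowed */
        carry += 1ULL << 32;

    uint64_t low = lo_lo + (cross << 32);
    if (low < lo_lo)                /* carry out of the low 64 bits */
        carry += 1;

    return hi_hi + (cross >> 32) + carry;
}
```

The two `if` statements are unsigned overflow checks followed by a conditional increment, which matches the kind of code the option names suggest.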

CMLT Instruction Generation

Description

CMLT instructions are generated for some elementary arithmetic operations to reduce the number of instructions.

Usage

Use the -mcmlt-arith option.

Note: This optimization requires the -O3 or higher optimization level.
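CMLT is a vector compare-less-than instruction, so arithmetic that builds a per-lane sign mask is a natural candidate. The sketch below computes such a mask for two 16-bit lanes packed into a 32-bit word. The function `lane_sign_masks` is hypothetical, and it is an assumption that `-mcmlt-arith` recognizes this particular expression.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Possible build line, using only the option listed in this section:
 *   gcc -O3 -mcmlt-arith -c signmask.c
 * Assumption: this shift/mask/multiply idiom is one of the matched forms.
 */

/* For each packed word, set every 16-bit lane to 0xFFFF if the lane's
 * sign bit (bit 15 of the lane) is set, and to 0 otherwise. */
void lane_sign_masks(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = ((in[i] >> 15) & 0x10001u) * 0xFFFFu;
}
```

The shift, mask, and multiply together produce the same result as a signed "less than zero" comparison on each 16-bit lane, which a single vector compare can express.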

Vectorization Enhancement

Description

Redundant instructions generated during vectorization are identified and simplified, and shorter arrays can be vectorized.

Usage

Use --param=tree-forwprop-perm=1 and --param=vect-alias-flexible-segment-len=1. The default values are 0.

Note: This optimization requires the -O3 or higher optimization level.
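As a rough illustration of a short-array loop, the sketch below adds two 8-element rows. The function `add_row8` is hypothetical. Because the pointers are not `restrict`-qualified, the vectorizer has to reason about possible aliasing; it is an assumption that the parameters above affect this particular loop.

```c
#include <stdint.h>

/*
 * Possible build line, using only the options listed in this section:
 *   gcc -O3 --param=tree-forwprop-perm=1 \
 *       --param=vect-alias-flexible-segment-len=1 -c row.c
 */

/* Element-wise sum of two 8-byte rows, with wraparound on overflow. */
void add_row8(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```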

maxmin and UZP1/UZP2 Instruction Optimization

Description

The maxmin and UZP1/UZP2 instructions are optimized to reduce the total instructions and improve performance.

Usage

Use the -fconvert-minmax option. UZP1/UZP2 instruction optimization is enabled by default at the -O3 or higher optimization level.

Note: This optimization requires the -O3 or higher optimization level.
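A compare-and-assign reduction is a typical source of max and min operations. The sketch below is a hypothetical example (`max_element` is not from the guide), and it is an assumption that `-fconvert-minmax` changes the code generated for it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Possible build line, using only the option listed in this section:
 *   gcc -O3 -fconvert-minmax -c maxelem.c
 */

/* Largest element of v[0..n-1]; the caller must pass n >= 1. */
int32_t max_element(const int32_t *v, size_t n)
{
    int32_t best = v[0];

    for (size_t i = 1; i < n; i++) {
        if (v[i] > best)    /* compare-and-assign, i.e. best = max(best, v[i]) */
            best = v[i];
    }
    return best;
}
```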

LDP and STP Optimization

Description

An LDP or STP instruction that performs poorly is split into two LDR or two STR instructions, respectively.

Usage

Use the -fsplit-ldp-stp option. Use --param=param-ldp-dependency-search-range=[1,32] to control the search range. The default value is 16.

Note: This optimization requires the -O1 or higher optimization level.

AES Instruction Optimization

Description

AES algorithm code is identified and accelerated with hardware instructions.

Usage

Use the -fcrypto-accel-aes option.

Note: This optimization requires the -O3 or higher optimization level.

Indirect Call Optimization

Description

Indirect calls in programs are identified and analyzed to convert them into direct calls.

Usage

Use the -ficp and -ficp-speculatively options.

Note: This optimization must be used together with -O2 -flto -flto-partition=one.
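The sketch below shows the kind of indirect call this optimization is concerned with: a function pointer that, across the whole program, only ever refers to a small set of known functions. All names (`binary_op`, `apply_all`, `sum_of`, `product_of`) are hypothetical, and it is an assumption that the pass promotes these particular calls.

```c
#include <stddef.h>

/*
 * Possible build line, using only the options listed in this section:
 *   gcc -O2 -flto -flto-partition=one -ficp -ficp-speculatively calls.c
 */

typedef int (*binary_op)(int, int);

static int op_add(int a, int b) { return a + b; }
static int op_mul(int a, int b) { return a * b; }

/* Fold v[0..n-1] with op; op is called through a pointer (indirect call). */
int apply_all(binary_op op, const int *v, size_t n, int init)
{
    int acc = init;

    for (size_t i = 0; i < n; i++)
        acc = op(acc, v[i]);
    return acc;
}

int sum_of(const int *v, size_t n)     { return apply_all(op_add, v, n, 0); }
int product_of(const int *v, size_t n) { return apply_all(op_mul, v, n, 1); }
```

With whole-program visibility from LTO, the compiler can see that `op` is only ever `op_add` or `op_mul`, which is the information needed to replace the indirect call with direct ones.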

IPA-prefetch

Description

Indirect memory accesses in a loop are identified and prefetch instructions are inserted for them, reducing the latency of indirect memory accesses.

Usage

Use the -fipa-prefetch and -fipa-ic options.

Note: This optimization must be used together with -O3 -flto.
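An indirect memory access in a loop typically has the form `values[index[i]]`, where the address of one load depends on the result of another. The sketch below is a hypothetical example of such a loop (`gather_sum` is not from the guide); it is an assumption that the pass inserts a prefetch for it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Possible build line, using only the options listed in this section:
 *   gcc -O3 -flto -fipa-prefetch -fipa-ic gather.c
 */

/* Sum values[] at the positions given by index[0..n-1].
 * Every index[i] must be a valid position in values[]. */
int64_t gather_sum(const int64_t *values, const uint32_t *index, size_t n)
{
    int64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += values[index[i]];   /* indirect access: load depends on index[i] */
    return total;
}
```

Because `index[]` is read sequentially, the address of a future `values[]` access can be computed several iterations ahead, which is what makes prefetching possible here.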

LLC-prefetch

Description

GCC for openEuler analyzes the main execution paths of a program, performs memory reuse analysis on loops on the primary path, identifies and ranks the hottest data, and inserts prefetch instructions to pre-allocate that data to the last-level cache (LLC), reducing LLC misses.

Usage

Use the -fllc-allocate option. The -O2 or higher optimization level is required.
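As a rough illustration, the sketch below is a memory-bound loop with three memory accesses per iteration (two loads and one store). The function `triad` is hypothetical, and it is an assumption that the pass treats this loop as a prefetch candidate; in practice the decision depends on the program's hot paths and the tuning parameters.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Possible build line, using only the option listed in this section:
 *   gcc -O2 -fllc-allocate -c triad.c
 */

/* a[i] = b[i] + s * c[i] for i in [0, n). */
void triad(int64_t *a, const int64_t *b, const int64_t *c,
           int64_t s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];      /* two loads and one store per iteration */
}
```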

Other related interfaces:

| Option | Default Value | Description |
| --- | --- | --- |
| --param=mem-access-ratio=[0,100] | 20 | Ratio of the number of memory accesses in a loop to the number of instructions. |
| --param=mem-access-num=unsigned | 3 | Number of memory accesses in a loop. |
| --param=outer-loop-nums=[1,10] | 1 | Maximum number of outer loop layers that can be unrolled. |
| --param=filter-kernels=[0,1] | 1 | Whether to perform path series filtering on loops. |
| --param=branch-prob-threshold=[50,100] | 80 | Probability threshold for a branch to be considered highly probable. |
| --param=prefetch-offset=[1,999999] | 1024 | Prefetch offset distance. Generally, the value is a power of 2. |
| --param=issue-topn=unsigned | 1 | Number of prefetch instructions. |
| --param=force-issue=[0,1] | 0 | Whether to perform forcible prefetch, that is, the static mode. |
| --param=llc-capacity-per-core=[0,999999] | 114 | Average LLC capacity allocated to each core in multi-branch prefetch mode. |