GCC Base Performance Optimization Guide

Overview

Optimizing the compiler's base performance improves the development efficiency, runtime performance, and maintainability of applications. Building on general compilation optimization capabilities, GCC for openEuler enhances its middle-end and back-end performance optimization technologies, including instruction optimization, vectorization enhancement, prefetch enhancement, and data flow analysis enhancement.

Installation and Deployment

Software Requirements

OS: openEuler 25.03

Hardware Requirements

AArch64 architecture

Software Installation

Install GCC and related components as needed. The following uses GCC as an example:

yum install gcc

How to Use

Optimization for CRC

Description

GCC identifies cyclic redundancy check (CRC) code and generates efficient hardware instructions.

How to Use

Add -floop-crc during compilation.

Note: -floop-crc must be used together with -O3 -march=armv8.1-a.
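As an illustration of the kind of code this section refers to, the following is a minimal sketch of a software CRC-32 routine. The function name crc32_bitwise is hypothetical, and whether -floop-crc recognizes this exact loop shape is an assumption; the pass may only match specific patterns (for example, table-driven loops).

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). This is the same polynomial that the Arm
 * CRC32 instructions implement. */
uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = (uint32_t)-(int32_t)(crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

Assuming the code is saved as crc.c (a hypothetical file name), it can be compiled with gcc -O3 -march=armv8.1-a -floop-crc -S crc.c, and the generated assembly can then be inspected for crc32 instructions.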

If-conversion Enhancement

Description

If-conversion optimization is enhanced by using more registers to reduce conflicts.

How to Use

This enhancement is part of Register Transfer Language (RTL) if-conversion optimization. Enable the enhancement by using the following options:

-fifcvt-allow-complicated-cmps

--param=ifcvt-allow-register-renaming=[0,1,2], where the value (0, 1, or 2) controls the scope of the optimization

Note: This optimization requires the -O2 optimization level and must be used together with --param=max-rtl-if-conversion-unpredictable-cost=48 and --param=max-rtl-if-conversion-predictable-cost=48.
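A candidate for if-conversion is a short branch whose body is a cheap assignment. The sketch below uses a hypothetical function, pick_bounded, with a compound comparison as its condition. Whether the enhanced pass converts this particular branch into conditional-select instructions is an assumption that depends on the cost parameters and the target.

```c
/* Returns x - lo when lo <= x <= hi, otherwise the fallback value.
 * The branch guards a single cheap assignment, which is the kind of
 * shape RTL if-conversion tries to turn into branch-free code. */
int pick_bounded(int x, int lo, int hi, int fallback)
{
    int result = fallback;

    if (x >= lo && x <= hi) {
        result = x - lo;
    }
    return result;
}
```

A possible command line, assuming the file is named ifcvt.c, is gcc -O2 -fifcvt-allow-complicated-cmps --param=ifcvt-allow-register-renaming=1 --param=max-rtl-if-conversion-unpredictable-cost=48 --param=max-rtl-if-conversion-predictable-cost=48 -S ifcvt.c.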

Optimization for Multiplication

Description

Arm instructions are combined to convert low-order 32-bit multiplications into high-order 64-bit multiplication instructions.

How to Use

Use the -fuaddsub-overflow-match-all and -fif-conversion-gimple options.

Note: This optimization requires the -O3 or higher optimization level.
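One plausible reading of this optimization is that it targets 64-bit multiplication logic written by hand with 32-bit partial products. The sketch below computes the upper 64 bits of a 64 x 64-bit product that way. The function name mulhi_u64 is hypothetical, and it is an assumption that the compiler matches this exact form and replaces it with a single 64-bit multiply-high instruction.

```c
#include <stdint.h>

/* Upper 64 bits of the 128-bit product a * b, built from four
 * 32 x 32 -> 64-bit partial products. */
uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum of the middle terms plus the carry out of the low word;
     * this sum cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
```

With the options from this section, a possible command line is gcc -O3 -fuaddsub-overflow-match-all -fif-conversion-gimple -S mulhi.c, where mulhi.c is a hypothetical file name.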

Optimization for CMLT Instruction Generation

Description

cmlt instructions are generated for some elementary arithmetic operations to reduce the number of instructions.

How to Use

Use the -mcmlt-arith option.

Note: This optimization requires the -O3 or higher optimization level.

Optimization for Vectorization

Description

Redundant instructions generated during vectorization are identified and simplified, and shorter arrays can be vectorized.

How to Use

Use --param=vect-alias-flexible-segment-len=1. The default value is 0.

Note: This optimization requires the -O3 or higher optimization level.

Optimization for min/max and uzp1/uzp2 Instructions

Description

The min/max and uzp1/uzp2 instructions are optimized to reduce the total number of instructions and improve performance.

How to Use

Use the -fconvert-minmax option to enable min/max optimization. uzp1/uzp2 instruction optimization is enabled by default at -O3 or higher.

Note: This optimization requires the -O3 or higher optimization level.
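A typical source pattern for min/max conversion is a conditional expression that keeps the larger (or smaller) of two values. The following sketch uses a hypothetical function, array_max. That -fconvert-minmax rewrites this specific loop into max instructions is an assumption, not something the description above guarantees.

```c
#include <stddef.h>

/* Returns the largest element of v. The caller must pass n >= 1.
 * The conditional expression is a maximum written out by hand. */
int array_max(const int *v, size_t n)
{
    int best = v[0];

    for (size_t i = 1; i < n; i++) {
        best = (v[i] > best) ? v[i] : best;
    }
    return best;
}
```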

Optimization for LDP and STP

Description

LDP and STP instructions that perform poorly are each split into two LDR or two STR instructions, respectively.

How to Use

Use the -fsplit-ldp-stp option. Use --param=param-ldp-dependency-search-range=[1,32] to control the search range. The default value is 16.

Note: This optimization requires the -O1 or higher optimization level.

Optimization for AES Instruction

Description

The AES software instruction sequences are identified and accelerated using hardware instructions.

How to Use

Use the -fcrypto-accel-aes option.

Note: This optimization requires the -O3 or higher optimization level.

Optimization for Indirect Calls

Description

Indirect calls in programs are identified, analyzed, and then optimized into direct calls.

How to Use

Use the -ficp and -ficp-speculatively options.

Note: This optimization must be used together with -O2 -flto -flto-partition=one.
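The sketch below shows the kind of indirect call this optimization is concerned with: a call through a function pointer stored in a structure. All names (struct shape, total_area, and the helpers) are hypothetical, and it is an assumption that the pass would promote this particular call to direct calls.

```c
#include <stddef.h>

struct shape {
    int (*area)(const struct shape *self); /* target chosen at run time */
    int w;
    int h;
};

static int rect_area(const struct shape *s)   { return s->w * s->h; }
static int square_area(const struct shape *s) { return s->w * s->w; }

struct shape make_rect(int w, int h)
{
    struct shape s = { rect_area, w, h };
    return s;
}

struct shape make_square(int side)
{
    struct shape s = { square_area, side, side };
    return s;
}

/* Each iteration performs an indirect call through shapes[i].area. */
int total_area(const struct shape *shapes, size_t n)
{
    int total = 0;

    for (size_t i = 0; i < n; i++) {
        total += shapes[i].area(&shapes[i]);
    }
    return total;
}
```

Because the possible targets are only visible across the whole program, the required -flto -flto-partition=one options give the compiler the view it needs to analyze such calls.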

IPA-prefetch

Description

Indirect memory accesses in a loop are identified, and a prefetch instruction is inserted to reduce the delay.

How to Use

Use the -fipa-prefetch and -fipa-ic options.

Note: This optimization must be used together with -O3 -flto.
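An indirect memory access in a loop typically has the form data[index[i]], where the address of one load depends on the result of another. The following minimal sketch (the function name gather_sum is hypothetical) shows that shape. Whether a prefetch is actually inserted for it is an assumption that depends on the compiler's analysis.

```c
#include <stddef.h>

/* Sums data[index[i]] for i in [0, n). The load from data depends on
 * the value loaded from index, so hardware prefetchers may struggle
 * to predict the access pattern. */
long gather_sum(const long *data, const int *index, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += data[index[i]];
    }
    return sum;
}
```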

-fipa-struct-reorg

Description

This option optimizes memory layout. The structure members are rearranged in memory to improve the cache hit rate.

How to Use

Add -O3 -flto -flto-partition=one -fipa-struct-reorg to the compilation options.

Note: The -fipa-struct-reorg option can be enabled only when -O3 -flto -flto-partition=one is enabled globally.

-fipa-reorder-fields

Description

The memory space layout is optimized by arranging structure members from largest to smallest based on their size. This reduces padding caused by alignment boundaries, decreases overall memory usage, and improves the cache hit rate.

How to Use

Add -O3 -flto -flto-partition=one -fipa-reorder-fields to the compilation options.

Note: The -fipa-reorder-fields option can be enabled only when -O3 -flto -flto-partition=one is enabled globally.
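The effect of ordering members from largest to smallest can be seen by writing both layouts by hand. The two structures below are hypothetical and contain the same members; only the order differs. This is a source-level illustration of the padding effect, not the compiler's actual transformation, which is applied internally and presumably only to structures the compiler can prove are safe to change.

```c
#include <stddef.h>

/* Members in an arbitrary order: padding is needed before value
 * and before count to satisfy their alignment. */
struct record_as_written {
    char   tag;
    double value;
    char   flag;
    int    count;
};

/* Same members, ordered from largest to smallest. */
struct record_by_size {
    double value;
    int    count;
    char   tag;
    char   flag;
};

/* Number of bytes saved per object by the size-ordered layout. */
size_t bytes_saved(void)
{
    return sizeof(struct record_as_written) - sizeof(struct record_by_size);
}
```

On common LP64 ABIs such as AArch64 Linux, the first layout occupies 24 bytes and the second 16 bytes, so more objects fit in each cache line.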

-ftree-slp-transpose-vectorize

Description

In the loop splitting phase, temporary arrays are introduced to partition the loop, which enhances data-flow analysis for loops that read continuous memory. In the superword-level parallelism (SLP) vectorization phase, SLP analysis is performed for transposing grouped_stores.

How to Use

Add -O3 -ftree-slp-transpose-vectorize to the compilation options.

Note: The -ftree-slp-transpose-vectorize option can be enabled only when -O3 is enabled.

LLC-prefetch

Description

On the main execution paths of a program, memory-reuse patterns in loops are analyzed to identify and rank the hottest data. Prefetch instructions are then inserted to allocate that data to the last-level cache (LLC), reducing LLC misses.

How to Use

Use the -fllc-allocate option. The -O2 or higher optimization level is required.

Other related interfaces:

| Option | Default Value | Description |
| --- | --- | --- |
| --param=mem-access-ratio=[0,100] | 20 | Ratio of the number of memory accesses in a loop to the number of instructions. |
| --param=mem-access-num=unsigned | 3 | Number of memory accesses in a loop. |
| --param=outer-loop-nums=[1,10] | 1 | Maximum number of outer loop layers that can be unrolled. |
| --param=filter-kernels=[0,1] | 1 | Indicates whether to perform path series filtering on loops. |
| --param=branch-prob-threshold=[50,100] | 80 | Probability threshold for a branch to be considered highly probable. |
| --param=prefetch-offset=[1,999999] | 1024 | Prefetch offset distance, where the value is a power of 2. |
| --param=issue-topn=unsigned | 1 | Number of prefetch instructions. |
| --param=force-issue=[0,1] | 0 | Indicates whether to perform forcible prefetch, that is, the static mode. |
| --param=llc-capacity-per-core=[0,999999] | 107 | Average LLC capacity allocated to each core in multi-branch prefetch mode. |

-fipa-struct-sfc

Description

This option is used to statically compress structure members to reduce the structure size and improve the cache hit rate.

How to Use

Add -O3 -flto -flto-partition=one -fipa-reorder-fields -fipa-struct-sfc to the compilation options. You can use -fipa-struct-sfc-bitfield and -fipa-struct-sfc-shadow for further optimization.

Note: The -fipa-struct-sfc option can be enabled only when -O3 -flto -flto-partition=one is enabled globally and -fipa-reorder-fields or -fipa-struct-reorg>=2 is enabled.

-fipa-struct-dfc

Description

This option is used to dynamically compress structure members by cloning the program path and heuristically minimizing the structure size. At runtime, it improves the cache hit rate by checking execution paths and selecting the optimal one.

How to Use

Add -O3 -flto -flto-partition=one -fipa-reorder-fields -fipa-struct-dfc to the compilation options. You can use -fipa-struct-dfc-bitfield and -fipa-struct-dfc-shadow for further optimization.

Note: The -fipa-struct-dfc option can be enabled only when -O3 -flto -flto-partition=one is enabled globally and -fipa-reorder-fields or -fipa-struct-reorg>=2 is enabled.

-fipa-alignment-propagation

Description

This option is used to analyze and propagate the address-alignment values for local variables, optimizing the bitwise AND operations.

How to Use

Add -O3 -fipa-alignment-propagation to the compilation options.

Note: The -fipa-alignment-propagation option can be enabled only when -O3 is enabled.
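To make the idea concrete, the sketch below passes an explicitly aligned local buffer to a helper that masks the address with a bitwise AND. The names head_misalignment and aligned_local_misalignment are hypothetical. It is an assumption that this is the pattern the pass targets: if the 64-byte alignment of buf is propagated into the helper, the AND can be folded to zero at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Distance of p from the previous 64-byte boundary. */
static size_t head_misalignment(const unsigned char *p)
{
    return (size_t)((uintptr_t)p & 63u);
}

/* buf is a local variable with a known 64-byte alignment, so the
 * helper's result is always 0 for this call. */
size_t aligned_local_misalignment(void)
{
    _Alignas(64) unsigned char buf[128];

    memset(buf, 0, sizeof buf);
    return head_misalignment(buf);
}
```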

-fipa-localize-array

Description

This option is used to convert the global pointer variables allocated by calloc to local variables.

How to Use

Add -O3 -fipa-localize-array to the compilation options.

Note: The -fipa-localize-array option can be enabled only when -O3 is enabled.
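The following sketch shows a global pointer that is only ever assigned the result of calloc and is used entirely within one function. The names squares and sum_of_squares are hypothetical, and it is an assumption that the pass treats this case as eligible for conversion to a local variable.

```c
#include <stdlib.h>

/* Global pointer whose only definition is a calloc result. */
static int *squares;

/* Returns 0^2 + 1^2 + ... + (n-1)^2, or -1 if allocation fails. */
int sum_of_squares(int n)
{
    int sum = 0;

    if (n <= 0) {
        return 0;
    }

    squares = calloc((size_t)n, sizeof *squares);
    if (squares == NULL) {
        return -1;
    }

    for (int i = 0; i < n; i++) {
        squares[i] = i * i;
    }
    for (int i = 0; i < n; i++) {
        sum += squares[i];
    }

    free(squares);
    squares = NULL;
    return sum;
}
```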

-fipa-array-dse

Description

This option is used to analyze the transfer of arrays between functions and the usage of the arrays in the called functions, removing redundant array writes.

How to Use

Add -O3 -fipa-array-dse to the compilation options.

Note: The -fipa-array-dse option can be enabled only when -O3 is enabled.

-ffind-with-sve

Description

This option is used to identify std::find function calls and attempt to optimize them using SVE instructions.

How to Use

Add -ffind-with-sve to the compilation options.

-floop-sve-mode-opt

Description

This option analyzes static code characteristics to identify specific scenarios. When the conditions are met, it applies additional optimizations that use the SVE instruction set to improve performance.

How to Use

Add -O3 -floop-sve-mode-opt to the compilation options.

Note: The -floop-sve-mode-opt option can be enabled only when -O3 is enabled and SVE is included in the -march setting.