i386: Add pass_remove_partial_avx_dependency
With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

        vxorp[ds]       %xmmN, %xmmN, %xmmN
        ...
        vcvtss2sd       f(%rip), %xmmN, %xmmX
        ...
        vcvtsi2ss       i(%rip), %xmmN, %xmmY

to avoid a partial XMM register stall.  This patch adds a pass to generate
a single

        vxorps          %xmmN, %xmmN, %xmmN

at entry of the nearest dominator for basic blocks with SF/DF conversions,
which is in the fake loop that contains the whole function, instead of
generating one

        vxorp[ds]       %xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  A simple testcase shows this:

$ cat badcase.c
extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

        ...
        loop:
          if(j < k)
            vxorps      %xmm0, %xmm0, %xmm0
            vcvtss2sd   f(%rip), %xmm0, %xmm0
          ...
        loopend
        ...

This is because LCM only moves code when there is a certain benefit.  For
a conditional branch, LCM won't move

        vxorps          %xmm0, %xmm0, %xmm0

out of the loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE                   |Improvement|
|500.perlbench_r        |  0.55%    |
|538.imagick_r          |  8.43%    |
|544.nab_r              |  0.71%    |

2. LCM

|RATE                   |Improvement|
|500.perlbench_r        | -0.76%    |
|538.imagick_r          |  7.96%    |
|544.nab_r              | -0.13%    |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

-Ofast -flto -march=skylake-avx512 -funroll-loops

before

commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576
Author: uros <uros@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jan 31 20:06:42 2019 +0000

            PR target/89071
            * config/i386/i386.md (*extendsfdf2): Split out reg->reg
            alternative to avoid partial SSE register stall for TARGET_AVX.
            (truncdfsf2): Ditto.
            (sse4_1_round<mode>2): Ditto.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427 138bc75d-0d04-0410-961f-82ee72b054a4

are:

|INT RATE               |Improvement|
|500.perlbench_r        |  0.55%    |
|502.gcc_r              |  0.14%    |
|505.mcf_r              |  0.08%    |
|523.xalancbmk_r        |  0.18%    |
|525.x264_r             | -0.49%    |
|531.deepsjeng_r        | -0.04%    |
|541.leela_r            | -0.26%    |
|548.exchange2_r        | -0.3%     |
|557.xz_r               |BuildSame  |

|FP RATE                |Improvement|
|503.bwaves_r           | -0.29%    |
|507.cactuBSSN_r        |  0.04%    |
|508.namd_r             | -0.74%    |
|510.parest_r           | -0.01%    |
|511.povray_r           |  2.23%    |
|519.lbm_r              |  0.1%     |
|521.wrf_r              |  0.49%    |
|526.blender_r          |  0.13%    |
|527.cam4_r             |  0.65%    |
|538.imagick_r          |  8.43%    |
|544.nab_r              |  0.71%    |
|549.fotonik3d_r        |  0.15%    |
|554.roms_r             |  0.08%    |

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

-fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparison:

before:

   text    data     bss     dec     hex filename
2436377    8352    4528 2449257  255f69 imagick_r

after:

   text    data     bss     dec     hex filename
2425249    8352    4528 2438129  2533f1 imagick_r

2. Number of vxorps:

before          after           difference
4948            4135            -19.66%

3. Performance improvement:

|RATE                   |Improvement|
|538.imagick_r          |  5.5%     |

gcc/

2019-02-22  H.J. Lu  <hongjiu.lu@intel.com>
            Hongtao Liu  <hongtao.liu@intel.com>
            Sunil K Pandey  <sunil.k.pandey@intel.com>

        PR target/87007
        * config/i386/i386-passes.def: Add
        pass_remove_partial_avx_dependency.
        * config/i386/i386-protos.h
        (make_pass_remove_partial_avx_dependency): New.
        * config/i386/i386.c (make_pass_remove_partial_avx_dependency):
        New function.
        (pass_data_remove_partial_avx_dependency): New.
        (pass_remove_partial_avx_dependency): Likewise.
        (make_pass_remove_partial_avx_dependency): Likewise.
        * config/i386/i386.md (avx_partial_xmm_update): New attribute.
        (*extendsfdf2): Add avx_partial_xmm_update.
        (truncdfsf2): Likewise.
        (*float<SWI48:mode><MODEF:mode>2): Likewise.
        (SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-22  H.J. Lu  <hongjiu.lu@intel.com>
            Hongtao Liu  <hongtao.liu@intel.com>
            Sunil K Pandey  <sunil.k.pandey@intel.com>

        PR target/87007
        * gcc.target/i386/pr87007-1.c: New test.
        * gcc.target/i386/pr87007-2.c: Likewise.

Co-Authored-By: Hongtao Liu <hongtao.liu@intel.com>
Co-Authored-By: Sunil K Pandey <sunil.k.pandey@intel.com>

From-SVN: r269119
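
The placement rule described in the message (one vxorps at the nearest dominator of all blocks with SF/DF conversions, kept outside every real loop) can be illustrated with a small standalone sketch. This is not the code of the pass and does not use GCC's internal APIs. It assumes a toy CFG in which blocks are integers, block 0 is the function entry, `idom[b]` gives the immediate dominator of block `b` (with `idom[0] == 0`), and `loop_depth[b]` is 0 for blocks that belong only to the fake outermost loop. All function and array names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Depth of block b in the dominator tree; the entry block has depth 0.
   Assumes idom[entry] == entry (hypothetical toy representation).  */
int dom_depth(const int *idom, int b)
{
    int depth = 0;
    while (idom[b] != b) {
        b = idom[b];
        depth++;
    }
    return depth;
}

/* Nearest common dominator of blocks a and b: lift the deeper block to
   the depth of the other, then walk both up until they meet.  */
int nearest_common_dominator(const int *idom, int a, int b)
{
    int da = dom_depth(idom, a);
    int db = dom_depth(idom, b);
    while (da > db) { a = idom[a]; da--; }
    while (db > da) { b = idom[b]; db--; }
    while (a != b) { a = idom[a]; b = idom[b]; }
    return a;
}

/* Nearest dominator of every block that holds an SF/DF conversion.  */
int nearest_dominator_for_set(const int *idom, const int *blocks, size_t n)
{
    assert(n > 0);
    int dom = blocks[0];
    for (size_t i = 1; i < n; i++)
        dom = nearest_common_dominator(idom, dom, blocks[i]);
    return dom;
}

/* Walk up the dominator tree until the block is outside every real loop,
   i.e. it belongs only to the fake loop that contains the whole function.  */
int hoist_out_of_loops(const int *idom, const int *loop_depth, int b)
{
    while (loop_depth[b] > 0 && idom[b] != b)
        b = idom[b];
    return b;
}
```

For a CFG shaped like badcase.c (entry, loop header, the `j < k` test, the conversion block, the latch, and the exit), the only conversion block sits inside the loop, so its nearest dominator is the block itself. Walking up the dominator tree to loop depth 0 ends at the entry block, which is where a single vxorps would be placed under this sketch, whereas LCM leaves it under the conditional inside the loop.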