[AArch64] Use intrinsics for widening multiplies (PR91598) · 0b839322
Inline assembler instructions don't have latency info, and the scheduler does not attempt to schedule them at all; it does not even honor the latencies of asm source operands. As a result, SIMD intrinsics which are implemented using inline assembler perform very poorly, particularly on in-order cores. Add new patterns and intrinsics for widening multiplies, which results in a 63% speedup for the example in the PR, thus fixing the reported regression.

gcc/
	PR target/91598
	* config/aarch64/aarch64-builtins.c (TYPES_TERNOPU_LANE): Add define.
	* config/aarch64/aarch64-simd.md (aarch64_vec_<su>mult_lane<Qlane>):
	Add new insn for widening lane mul.
	(aarch64_vec_<su>mlal_lane<Qlane>): Likewise.
	* config/aarch64/aarch64-simd-builtins.def: Add intrinsics.
	* config/aarch64/arm_neon.h (vmlal_lane_s16): Expand using intrinsics
	rather than inline asm.
	(vmlal_lane_u16): Likewise.
	(vmlal_lane_s32): Likewise.
	(vmlal_lane_u32): Likewise.
	(vmlal_laneq_s16): Likewise.
	(vmlal_laneq_u16): Likewise.
	(vmlal_laneq_s32): Likewise.
	(vmlal_laneq_u32): Likewise.
	(vmull_lane_s16): Likewise.
	(vmull_lane_u16): Likewise.
	(vmull_lane_s32): Likewise.
	(vmull_lane_u32): Likewise.
	(vmull_laneq_s16): Likewise.
	(vmull_laneq_u16): Likewise.
	(vmull_laneq_s32): Likewise.
	(vmull_laneq_u32): Likewise.
	* config/aarch64/iterators.md (Vcondtype): New iterator for lane mul.
	(Qlane): Likewise.
Wilco Dijkstra committed
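
For context, a minimal usage sketch of the affected widening multiply-by-lane intrinsics (this example is not part of the commit; the function and variable names are hypothetical). After this change, calls like these expand through builtins backed by real instruction patterns, so the scheduler can model the resulting smull/smlal instructions instead of treating them as opaque inline-asm blocks:

#include <arm_neon.h>

/* Hypothetical example: widening multiply by one lane of a coefficient
   vector.  Each 16-bit product is widened to 32 bits.  */
int32x4_t
scale_by_lane (int16x4_t a, int16x4_t coeffs)
{
  /* a[i] * coeffs[2], widened to 32-bit lanes (maps to smull).  */
  return vmull_lane_s16 (a, coeffs, 2);
}

/* Hypothetical example: widening multiply-accumulate by one lane.  */
int32x4_t
mac_by_lane (int32x4_t acc, int16x4_t a, int16x4_t coeffs)
{
  /* acc[i] + a[i] * coeffs[3], widened to 32-bit lanes (maps to smlal).  */
  return vmlal_lane_s16 (acc, a, coeffs, 3);
}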