Commit b781a135 by Richard Sandiford

Add support for in-order addition reduction using SVE FADDA

This patch adds support for in-order floating-point addition reductions,
which are suitable even in strict IEEE mode.

Previously vect_is_simple_reduction would reject any cases that forbid
reassociation.  The idea is instead to tentatively accept them as
"FOLD_LEFT_REDUCTIONs" and only fail later if there is no support
for them.  Although this patch only handles the particular case of plus
and minus on floating-point types, there's no reason in principle why
we couldn't handle other cases.

The reductions use a new fold_left_plus_optab if available, otherwise
they fall back to elementwise additions or subtractions.
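
As an illustration (modelled on the new vect-reduc-in-order tests), a simple
accumulation like the one below can now be vectorized even without
-ffast-math: on SVE it maps to FADDA via the new IFN_FOLD_LEFT_PLUS internal
function, while other targets fall back to elementwise additions.

double
sum (const double *a, int n)
{
  double res = 0.0;
  for (int i = 0; i < n; i++)
    /* IEEE semantics require these additions to stay in this order.  */
    res += a[i];
  return res;
}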

The vect_force_simple_reduction change makes it easier for parloops
to read the type of reduction.

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
	* internal-fn.c (fold_left_direct): Define.
	(expand_fold_left_optab_fn): Likewise.
	(direct_fold_left_optab_supported_p): Likewise.
	* fold-const-call.c (fold_const_fold_left): New function.
	(fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_fn): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vect_expand_fold_left): Likewise.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Check for target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
	check for a message about using in-order reductions.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
	vectorized and check for a message about using in-order reductions.
	Expect targets with variable-length vectors to fall back to the
	fixed-length minimum.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-in-order-1.c: New test.
	* gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve/reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve/slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
	vect_fold_left_plus.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256639
parent b89fa419
2018-01-13 Richard Sandiford <richard.sandiford@linaro.org> 2018-01-13 Richard Sandiford <richard.sandiford@linaro.org>
Alan Hayward <alan.hayward@arm.com>
David Sherwood <david.sherwood@arm.com>
* optabs.def (fold_left_plus_optab): New optab.
* doc/md.texi (fold_left_plus_@var{m}): Document.
* internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
* internal-fn.c (fold_left_direct): Define.
(expand_fold_left_optab_fn): Likewise.
(direct_fold_left_optab_supported_p): Likewise.
* fold-const-call.c (fold_const_fold_left): New function.
(fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
* tree-parloops.c (valid_reduction_p): New function.
(gather_scalar_reductions): Use it.
* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
(vect_finish_replace_stmt): Declare.
* tree-vect-loop.c (fold_left_reduction_fn): New function.
(needs_fold_left_reduction_p): New function, split out from...
(vect_is_simple_reduction): ...here. Accept reductions that
forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
(vect_force_simple_reduction): Also store the reduction type in
the assignment's STMT_VINFO_REDUC_TYPE.
(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
(merge_with_identity): New function.
(vect_expand_fold_left): Likewise.
(vectorize_fold_left_reduction): Likewise.
(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION. Leave the
scalar phi in place for it. Check for target support and reject
cases that would reassociate the operation. Defer the transform
phase to vectorize_fold_left_reduction.
* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.
2018-01-13 Richard Sandiford <richard.sandiford@linaro.org>
* tree-if-conv.c (predicate_mem_writes): Remove redundant * tree-if-conv.c (predicate_mem_writes): Remove redundant
call to ifc_temp_var. call to ifc_temp_var.
gcc/config/aarch64/aarch64-sve.md
@@ -1550,6 +1550,45 @@
"<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
)
;; Unpredicated in-order FP reductions.
(define_expand "fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand")
(unspec:<VEL> [(match_dup 3)
(match_operand:<VEL> 1 "register_operand")
(match_operand:SVE_F 2 "register_operand")]
UNSPEC_FADDA))]
"TARGET_SVE"
{
operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
}
)
;; In-order FP reductions predicated with PTRUE.
(define_insn "*fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand" "=w")
(unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
(match_operand:<VEL> 2 "register_operand" "0")
(match_operand:SVE_F 3 "register_operand" "w")]
UNSPEC_FADDA))]
"TARGET_SVE"
"fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
)
;; Predicated form of the above in-order reduction.
(define_insn "*pred_fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand" "=w")
(unspec:<VEL>
[(match_operand:<VEL> 1 "register_operand" "0")
(unspec:SVE_F
[(match_operand:<VPRED> 2 "register_operand" "Upl")
(match_operand:SVE_F 3 "register_operand" "w")
(match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
UNSPEC_SEL)]
UNSPEC_FADDA))]
"TARGET_SVE"
"fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
)
;; Unpredicated floating-point addition.
(define_expand "add<mode>3"
[(set (match_operand:SVE_F 0 "register_operand")
gcc/config/aarch64/aarch64.md
@@ -165,6 +165,7 @@
UNSPEC_STN
UNSPEC_INSR
UNSPEC_CLASTB
UNSPEC_FADDA
])
(define_c_enum "unspecv" [
gcc/doc/md.texi
@@ -5236,6 +5236,14 @@ has mode @var{m} and operands 0 and 1 have the mode appropriate for
one element of @var{m}. Operand 2 has the usual mask mode for vectors
of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
@cindex @code{fold_left_plus_@var{m}} instruction pattern
@item @code{fold_left_plus_@var{m}}
Take scalar operand 1 and successively add each element from vector
operand 2. Store the result in scalar operand 0. The vector has
mode @var{m} and the scalars have the mode appropriate for one
element of @var{m}. The operation is strictly in-order: there is
no reassociation.
@cindex @code{sdot_prod@var{m}} instruction pattern
@item @samp{sdot_prod@var{m}}
@cindex @code{udot_prod@var{m}} instruction pattern
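In scalar terms, the fold_left_plus_@var{m} behaviour documented above corresponds to the sketch below (the function name is illustrative only):

double
fold_left_plus_example (double init, const double *vec, int nelts)
{
  /* Operand 1 is the initial scalar, operand 2 the input vector;
     elements are accumulated strictly in order, with no reassociation.  */
  double res = init;
  for (int i = 0; i < nelts; i++)
    res += vec[i];
  return res;
}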
gcc/fold-const-call.c
@@ -1195,6 +1195,28 @@ fold_const_call (combined_fn fn, tree type, tree arg)
}
}
/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
of type TYPE. */
static tree
fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
{
if (TREE_CODE (arg1) != VECTOR_CST)
return NULL_TREE;
unsigned HOST_WIDE_INT nelts;
if (!VECTOR_CST_NELTS (arg1).is_constant (&nelts))
return NULL_TREE;
for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
{
arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
return NULL_TREE;
}
return arg0;
}
/* Try to evaluate:
*RESULT = FN (*ARG0, *ARG1)
@@ -1500,6 +1522,9 @@ fold_const_call (combined_fn fn, tree type, tree arg0, tree arg1)
}
return NULL_TREE;
case CFN_FOLD_LEFT_PLUS:
return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
default:
return fold_const_call_1 (fn, type, arg0, arg1);
}
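As an illustration of the new folding path (with made-up constant values): a CFN_FOLD_LEFT_PLUS call whose second argument is a VECTOR_CST is now folded at compile time by applying const_binop element by element, so IFN_FOLD_LEFT_PLUS (1.0, { 2.0, 3.0, 4.0 }) becomes ((1.0 + 2.0) + 3.0) + 4.0 = 10.0, provided every intermediate result is itself a constant.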
gcc/internal-fn.c
@@ -92,6 +92,7 @@ init_internal_fns ()
#define cond_binary_direct { 1, 1, true }
#define while_direct { 0, 2, false }
#define fold_extract_direct { 2, 2, false }
#define fold_left_direct { 1, 1, false }
const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
#define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2897,6 +2898,9 @@ expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
#define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
expand_direct_optab_fn (FN, STMT, OPTAB, 3)
#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
expand_direct_optab_fn (FN, STMT, OPTAB, 2)
/* RETURN_TYPE and ARGS are a return type and argument list that are
in principle compatible with FN (which satisfies direct_internal_fn_p).
Return the types that should be used to determine whether the
@@ -2980,6 +2984,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
#define direct_while_optab_supported_p convert_optab_supported_p
#define direct_fold_extract_optab_supported_p direct_optab_supported_p
#define direct_fold_left_optab_supported_p direct_optab_supported_p
/* Return the optab used by internal function FN. */
gcc/internal-fn.def
@@ -58,6 +58,8 @@ along with GCC; see the file COPYING3. If not see
- cond_binary: a conditional binary optab, such as add<mode>cc
- fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
maps to one of two optabs, depending on the signedness of an input.
SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
@@ -162,6 +164,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
fold_extract_last, fold_extract)
DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
fold_left_plus, fold_left)
/* Unary math functions. */
DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
gcc/optabs.def
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
OPTAB_D (reduc_and_scal_optab, "reduc_and_scal_$a")
OPTAB_D (reduc_ior_scal_optab, "reduc_ior_scal_$a")
OPTAB_D (reduc_xor_scal_optab, "reduc_xor_scal_$a")
OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
OPTAB_D (extract_last_optab, "extract_last_$a")
OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
@@ -33,5 +33,5 @@ int main (void)
return main1 ();
}
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
gcc/testsuite/gcc.dg/vect/pr79920.c
/* { dg-do run } */
/* { dg-additional-options "-O3 -fno-fast-math" } */
#include "tree-vect.h"
@@ -41,4 +41,5 @@ int main()
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
@@ -46,5 +46,8 @@ int main (void)
return 0;
}
/* We can't handle the first loop with variable-length vectors and so
   fall back to the fixed-length minimum instead.  */
/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { xfail vect_variable_length } } } */
/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
/* { dg-require-effective-target vect_float } */
/* { dg-additional-options "-fno-fast-math" } */
#include <stdarg.h>
#include "tree-vect.h"
@@ -48,6 +49,5 @@ int main (void)
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a, double *b)
{
double r = 0, q = 3;
for (int i = 0; i < N; i++)
{
r += a[i];
q -= b[i];
}
return r * q;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double b[N];
double r = 0, q = 3;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
b[i] = (i * 0.3) * (i & 1 ? 1 : -1);
r += a[i];
q -= b[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a, b);
if (res != r * q)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */
gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *restrict a, int n)
{
double res = 0.0;
for (int i = 0; i < n; i++)
for (int j = 0; j < N; j++)
res += a[i];
return res;
}
int __attribute__ ((optimize (1)))
main ()
{
int n = 19;
double a[N];
double r = 0;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
asm volatile ("" ::: "memory");
}
for (int i = 0; i < n; i++)
for (int j = 0; j < N; j++)
{
r += a[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a, n);
if (res != r)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump {in-order double reduction not supported} "vect" } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a)
{
double r = 0;
for (int i = 0; i < N; i += 4)
{
r += a[i] * 2.0;
r += a[i + 1] * 3.0;
r += a[i + 2] * 4.0;
r += a[i + 3] * 5.0;
}
return r;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double r = 0;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
r += a[i] * (i % 4 + 2);
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a);
if (res != r)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */
gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a)
{
double r1 = 0;
double r2 = 0;
double r3 = 0;
double r4 = 0;
for (int i = 0; i < N; i += 4)
{
r1 += a[i];
r2 += a[i + 1];
r3 += a[i + 2];
r4 += a[i + 3];
}
return r1 * r2 * r3 * r4;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double r[4] = {};
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
r[i % 4] += a[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a);
if (res != r[0] * r[1] * r[2] * r[3])
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump {in-order unchained SLP reductions not supported} "vect" } } */
/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */
gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_1.c
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
#define DEF_REDUC_PLUS(TYPE) \
TYPE __attribute__ ((noinline, noclone)) \
reduc_plus_##TYPE (TYPE *a, TYPE *b) \
{ \
TYPE r = 0, q = 3; \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
r += a[i]; \
q -= b[i]; \
} \
return r * q; \
}
#define TEST_ALL(T) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (DEF_REDUC_PLUS)
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_1_run.c
/* { dg-do run { target { aarch64_sve_hw } } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "reduc_strict_1.c"
#define TEST_REDUC_PLUS(TYPE) \
{ \
TYPE a[NUM_ELEMS (TYPE)]; \
TYPE b[NUM_ELEMS (TYPE)]; \
TYPE r = 0, q = 3; \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
a[i] = (i * 0.1) * (i & 1 ? 1 : -1); \
b[i] = (i * 0.3) * (i & 1 ? 1 : -1); \
r += a[i]; \
q -= b[i]; \
asm volatile ("" ::: "memory"); \
} \
TYPE res = reduc_plus_##TYPE (a, b); \
if (res != r * q) \
__builtin_abort (); \
}
int __attribute__ ((optimize (1)))
main ()
{
TEST_ALL (TEST_REDUC_PLUS);
return 0;
}
gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_2.c
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
#define DEF_REDUC_PLUS(TYPE) \
void __attribute__ ((noinline, noclone)) \
reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)], \
TYPE *restrict r, int n) \
{ \
for (int i = 0; i < n; i++) \
{ \
r[i] = 0; \
for (int j = 0; j < NUM_ELEMS (TYPE); j++) \
r[i] += a[i][j]; \
} \
}
#define TEST_ALL(T) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (DEF_REDUC_PLUS)
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_2_run.c
/* { dg-do run { target { aarch64_sve_hw } } } */
/* { dg-options "-O2 -ftree-vectorize -fno-inline" } */
#include "reduc_strict_2.c"
#define NROWS 5
#define TEST_REDUC_PLUS(TYPE) \
{ \
TYPE a[NROWS][NUM_ELEMS (TYPE)]; \
TYPE r[NROWS]; \
TYPE expected[NROWS] = {}; \
for (int i = 0; i < NROWS; ++i) \
for (int j = 0; j < NUM_ELEMS (TYPE); ++j) \
{ \
a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1); \
expected[i] += a[i][j]; \
asm volatile ("" ::: "memory"); \
} \
reduc_plus_##TYPE (a, r, NROWS); \
for (int i = 0; i < NROWS; ++i) \
if (r[i] != expected[i]) \
__builtin_abort (); \
}
int __attribute__ ((optimize (1)))
main ()
{
TEST_ALL (TEST_REDUC_PLUS);
return 0;
}
gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_3.c
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -fno-inline -msve-vector-bits=256 -fdump-tree-vect-details" } */
double mat[100][4];
double mat2[100][8];
double mat3[100][12];
double mat4[100][3];
double
slp_reduc_plus (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat[i][0];
tmp = tmp + mat[i][1];
tmp = tmp + mat[i][2];
tmp = tmp + mat[i][3];
}
return tmp;
}
double
slp_reduc_plus2 (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat2[i][0];
tmp = tmp + mat2[i][1];
tmp = tmp + mat2[i][2];
tmp = tmp + mat2[i][3];
tmp = tmp + mat2[i][4];
tmp = tmp + mat2[i][5];
tmp = tmp + mat2[i][6];
tmp = tmp + mat2[i][7];
}
return tmp;
}
double
slp_reduc_plus3 (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat3[i][0];
tmp = tmp + mat3[i][1];
tmp = tmp + mat3[i][2];
tmp = tmp + mat3[i][3];
tmp = tmp + mat3[i][4];
tmp = tmp + mat3[i][5];
tmp = tmp + mat3[i][6];
tmp = tmp + mat3[i][7];
tmp = tmp + mat3[i][8];
tmp = tmp + mat3[i][9];
tmp = tmp + mat3[i][10];
tmp = tmp + mat3[i][11];
}
return tmp;
}
void
slp_non_chained_reduc (int n, double * restrict out)
{
for (int i = 0; i < 3; i++)
out[i] = 0;
for (int i = 0; i < n; i++)
{
out[0] = out[0] + mat4[i][0];
out[1] = out[1] + mat4[i][1];
out[2] = out[2] + mat4[i][2];
}
}
/* Strict FP reductions shouldn't be used for the outer loops, only the
inner loops. */
float
double_reduc1 (float (*restrict i)[16])
{
float l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 8; b++)
l += i[b][a];
return l;
}
float
double_reduc2 (float *restrict i)
{
float l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 16; b++)
{
l += i[b * 4];
l += i[b * 4 + 1];
l += i[b * 4 + 2];
l += i[b * 4 + 3];
}
return l;
}
float
double_reduc3 (float *restrict i, float *restrict j)
{
float k = 0, l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 8; b++)
{
k += i[b];
l += j[b];
}
return l * k;
}
/* We can't yet handle double_reduc1. */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3. Each one
is reported three times, once for SVE, once for 128-bit AdvSIMD and once
for 64-bit AdvSIMD. */
/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
before failing. */
/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_13.c
/* { dg-do compile } */
/* The cost model thinks that the double loop isn't a win for SVE-128.  */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -fno-vect-cost-model" } */
#include <stdint.h>
@@ -24,7 +25,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (VEC_PERM)
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
/* ??? We don't treat the uint loops as SLP. */
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
/* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
/* { dg-final { scan-assembler-not {\tfadd\n} } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gfortran.dg/vect/vect-8.f90
@@ -704,5 +704,5 @@ CALL track('KERNEL ')
RETURN
END SUBROUTINE kernel
! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }
! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
gcc/tree-parloops.c
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slot, void *data ATTRIBUTE_UNUSED)
return 1;
}
/* Return true if the type of reduction performed by STMT is suitable
for this pass. */
static bool
valid_reduction_p (gimple *stmt)
{
/* Parallelization would reassociate the operation, which isn't
allowed for in-order reductions. */
stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
return reduc_type != FOLD_LEFT_REDUCTION;
}
/* Detect all reductions in the LOOP, insert them into REDUCTION_LIST. */
static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, reduction_info_table_type *reduction_list
gimple *reduc_stmt
= vect_force_simple_reduction (simple_loop_info, phi,
&double_reduc, true);
if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
continue;
if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, reduction_info_table_type *reduction_list
= vect_force_simple_reduction (simple_loop_info, inner_phi,
&double_reduc, true);
gcc_assert (!double_reduc);
if (inner_reduc_stmt == NULL
|| !valid_reduction_p (inner_reduc_stmt))
continue;
build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
gcc/tree-vectorizer.h
@@ -74,7 +74,15 @@ enum vect_reduction_type {
for (int i = 0; i < VF; ++i)
res = cond[i] ? val[i] : res; */
EXTRACT_LAST_REDUCTION,
/* Use a folding reduction within the loop to implement:
for (int i = 0; i < VF; ++i)
res = res OP val[i];
(with no reassociation). */
FOLD_LEFT_REDUCTION
};
#define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def) \
@@ -1390,6 +1398,7 @@ extern void vect_model_load_cost (stmt_vec_info, int, vect_memory_access_type,
extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
enum vect_cost_for_stmt, stmt_vec_info,
int, enum vect_cost_model_location);
extern void vect_finish_replace_stmt (gimple *, gimple *);
extern void vect_finish_stmt_generation (gimple *, gimple *,
gimple_stmt_iterator *);
extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);