Commit b781a135 by Richard Sandiford, committed by Richard Sandiford

Add support for in-order addition reduction using SVE FADDA

This patch adds support for in-order floating-point addition reductions,
which are suitable even in strict IEEE mode.

Previously vect_is_simple_reduction would reject any cases that forbid
reassociation.  The idea is instead to tentatively accept them as
"FOLD_LEFT_REDUCTIONs" and only fail later if there is no support
for them.  Although this patch only handles the particular case of plus
and minus on floating-point types, there's no reason in principle why
we couldn't handle other cases.
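
As a concrete illustration (an example added for this write-up, not code
from the patch), the kind of loop that is now accepted is a plain
accumulation whose additions must be performed in source order under
strict IEEE semantics:

    /* With -fno-fast-math the additions below cannot be reassociated,
       so the loop is vectorized as a FOLD_LEFT_REDUCTION rather than
       being rejected outright.  */
    double
    sum (double *a, int n)
    {
      double res = 0.0;
      for (int i = 0; i < n; i++)
        res += a[i];
      return res;
    }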

The reductions use a new fold_left_plus_optab if available, otherwise
they fall back to elementwise additions or subtractions.
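
A rough sketch of that fallback (illustrative C rather than the GIMPLE the
vectorizer actually emits; the function name is made up for the example):

    /* Reduce one vector of NUNITS elements into the scalar accumulator,
       one element at a time, strictly left to right.  */
    static double
    fold_left_step (double res, const double *vec, int nunits)
    {
      for (int i = 0; i < nunits; i++)
        res += vec[i];
      return res;
    }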

The vect_force_simple_reduction change makes it easier for parloops
to read the type of reduction.

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
	* internal-fn.c (fold_left_direct): Define.
	(expand_fold_left_optab_fn): Likewise.
	(direct_fold_left_optab_supported_p): Likewise.
	* fold-const-call.c (fold_const_fold_left): New function.
	(fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_fn): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vect_expand_fold_left): Likewise.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Check for target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
	check for a message about using in-order reductions.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
	vectorized and check for a message about using in-order reductions.
	Expect targets with variable-length vectors to fall back to the
	fixed-length minimum.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-in-order-1.c: New test.
	* gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve/reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve/slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
	vect_fold_left_plus.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256639
@@ -1550,6 +1550,45 @@
"<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
)
;; Unpredicated in-order FP reductions.
(define_expand "fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand")
(unspec:<VEL> [(match_dup 3)
(match_operand:<VEL> 1 "register_operand")
(match_operand:SVE_F 2 "register_operand")]
UNSPEC_FADDA))]
"TARGET_SVE"
{
operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
}
)
;; In-order FP reductions predicated with PTRUE.
(define_insn "*fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand" "=w")
(unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
(match_operand:<VEL> 2 "register_operand" "0")
(match_operand:SVE_F 3 "register_operand" "w")]
UNSPEC_FADDA))]
"TARGET_SVE"
"fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
)
;; Predicated form of the above in-order reduction.
(define_insn "*pred_fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand" "=w")
(unspec:<VEL>
[(match_operand:<VEL> 1 "register_operand" "0")
(unspec:SVE_F
[(match_operand:<VPRED> 2 "register_operand" "Upl")
(match_operand:SVE_F 3 "register_operand" "w")
(match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
UNSPEC_SEL)]
UNSPEC_FADDA))]
"TARGET_SVE"
"fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
)
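
As a rough model of what the FADDA-based reduction computes (an editorial
illustration, not the architecture definition): the scalar accumulator is
updated once per active lane, in lane order, so each partial sum is rounded
before the next addition.

    /* Illustrative C model of a predicated, in-order FADDA accumulation.  */
    static double
    fadda_model (double acc, const double *lanes, const _Bool *pg, int nlanes)
    {
      for (int i = 0; i < nlanes; i++)
        if (pg[i])          /* inactive lanes are skipped, not reassociated */
          acc += lanes[i];
      return acc;
    }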
;; Unpredicated floating-point addition.
(define_expand "add<mode>3"
[(set (match_operand:SVE_F 0 "register_operand")
@@ -165,6 +165,7 @@
UNSPEC_STN
UNSPEC_INSR
UNSPEC_CLASTB
UNSPEC_FADDA
])
(define_c_enum "unspecv" [
@@ -5236,6 +5236,14 @@ has mode @var{m} and operands 0 and 1 have the mode appropriate for
one element of @var{m}. Operand 2 has the usual mask mode for vectors
of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
@cindex @code{fold_left_plus_@var{m}} instruction pattern
@item @code{fold_left_plus_@var{m}}
Take scalar operand 1 and successively add each element from vector
operand 2. Store the result in scalar operand 0. The vector has
mode @var{m} and the scalars have the mode appropriate for one
element of @var{m}. The operation is strictly in-order: there is
no reassociation.
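
Informally, and only as a sketch of the behaviour documented above (operand
names refer to the pattern's operands; this is not additional normative text):

    /* operands 0 and 1 are scalars, operand 2 is a vector of mode m.  */
    operand0 = operand1;
    for (i = 0; i < number_of_elements_in_m; i++)
      operand0 = operand0 + operand2[i];   /* strictly in order */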
@cindex @code{sdot_prod@var{m}} instruction pattern
@item @samp{sdot_prod@var{m}}
@cindex @code{udot_prod@var{m}} instruction pattern
@@ -1195,6 +1195,28 @@ fold_const_call (combined_fn fn, tree type, tree arg)
}
}
/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
of type TYPE. */
static tree
fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
{
if (TREE_CODE (arg1) != VECTOR_CST)
return NULL_TREE;
unsigned HOST_WIDE_INT nelts;
if (!VECTOR_CST_NELTS (arg1).is_constant (&nelts))
return NULL_TREE;
for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
{
arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
return NULL_TREE;
}
return arg0;
}
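/* Worked example (editorial, illustrative constants only):
   FOLD_LEFT_PLUS (1.0, { 2.0, 3.0, 4.0 }) folds to
   ((1.0 + 2.0) + 3.0) + 4.0 == 10.0, evaluated strictly left to right;
   if any intermediate const_binop fails, the call is left unfolded.  */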
/* Try to evaluate:
*RESULT = FN (*ARG0, *ARG1)
@@ -1500,6 +1522,9 @@ fold_const_call (combined_fn fn, tree type, tree arg0, tree arg1)
}
return NULL_TREE;
case CFN_FOLD_LEFT_PLUS:
return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
default:
return fold_const_call_1 (fn, type, arg0, arg1);
}
@@ -92,6 +92,7 @@ init_internal_fns ()
#define cond_binary_direct { 1, 1, true }
#define while_direct { 0, 2, false }
#define fold_extract_direct { 2, 2, false }
#define fold_left_direct { 1, 1, false }
const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
#define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2897,6 +2898,9 @@ expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
#define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
expand_direct_optab_fn (FN, STMT, OPTAB, 3)
#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
expand_direct_optab_fn (FN, STMT, OPTAB, 2)
/* RETURN_TYPE and ARGS are a return type and argument list that are
in principle compatible with FN (which satisfies direct_internal_fn_p).
Return the types that should be used to determine whether the
@@ -2980,6 +2984,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
#define direct_while_optab_supported_p convert_optab_supported_p
#define direct_fold_extract_optab_supported_p direct_optab_supported_p
#define direct_fold_left_optab_supported_p direct_optab_supported_p
/* Return the optab used by internal function FN. */
@@ -58,6 +58,8 @@ along with GCC; see the file COPYING3. If not see
- cond_binary: a conditional binary optab, such as add<mode>cc
- fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
maps to one of two optabs, depending on the signedness of an input.
SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
@@ -162,6 +164,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
fold_extract_last, fold_extract)
DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
fold_left_plus, fold_left)
/* Unary math functions. */
DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
OPTAB_D (reduc_and_scal_optab, "reduc_and_scal_$a")
OPTAB_D (reduc_ior_scal_optab, "reduc_ior_scal_$a")
OPTAB_D (reduc_xor_scal_optab, "reduc_xor_scal_$a")
OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
OPTAB_D (extract_last_optab, "extract_last_$a")
OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
@@ -33,5 +33,5 @@ int main (void)
return main1 ();
}
/* Requires fast-math. */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-do run } */
/* { dg-additional-options "-O3" } */
/* { dg-additional-options "-O3 -fno-fast-math" } */
#include "tree-vect.h"
@@ -41,4 +41,5 @@ int main()
return 0;
}
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
@@ -46,5 +46,8 @@ int main (void)
return 0;
}
/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
/* We can't handle the first loop with variable-length vectors and so
fall back to the fixed-length minimum instead. */
/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { xfail vect_variable_length } } } */
/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-require-effective-target vect_float } */
/* { dg-additional-options "-fno-fast-math" } */
#include <stdarg.h>
#include "tree-vect.h"
@@ -48,6 +49,5 @@ int main (void)
return 0;
}
/* Need -ffast-math to vectorize these loops. */
/* ARM NEON passes -ffast-math to these tests, so expect this to fail. */
/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a, double *b)
{
double r = 0, q = 3;
for (int i = 0; i < N; i++)
{
r += a[i];
q -= b[i];
}
return r * q;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double b[N];
double r = 0, q = 3;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
b[i] = (i * 0.3) * (i & 1 ? 1 : -1);
r += a[i];
q -= b[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a, b);
if (res != r * q)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *restrict a, int n)
{
double res = 0.0;
for (int i = 0; i < n; i++)
for (int j = 0; j < N; j++)
res += a[i];
return res;
}
int __attribute__ ((optimize (1)))
main ()
{
int n = 19;
double a[N];
double r = 0;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
asm volatile ("" ::: "memory");
}
for (int i = 0; i < n; i++)
for (int j = 0; j < N; j++)
{
r += a[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a, n);
if (res != r)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump {in-order double reduction not supported} "vect" } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a)
{
double r = 0;
for (int i = 0; i < N; i += 4)
{
r += a[i] * 2.0;
r += a[i + 1] * 3.0;
r += a[i + 2] * 4.0;
r += a[i + 3] * 5.0;
}
return r;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double r = 0;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
r += a[i] * (i % 4 + 2);
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a);
if (res != r)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a)
{
double r1 = 0;
double r2 = 0;
double r3 = 0;
double r4 = 0;
for (int i = 0; i < N; i += 4)
{
r1 += a[i];
r2 += a[i + 1];
r3 += a[i + 2];
r4 += a[i + 3];
}
return r1 * r2 * r3 * r4;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double r[4] = {};
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
r[i % 4] += a[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a);
if (res != r[0] * r[1] * r[2] * r[3])
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump {in-order unchained SLP reductions not supported} "vect" } } */
/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
#define DEF_REDUC_PLUS(TYPE) \
TYPE __attribute__ ((noinline, noclone)) \
reduc_plus_##TYPE (TYPE *a, TYPE *b) \
{ \
TYPE r = 0, q = 3; \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
r += a[i]; \
q -= b[i]; \
} \
return r * q; \
}
#define TEST_ALL(T) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (DEF_REDUC_PLUS)
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
/* { dg-do run { target { aarch64_sve_hw } } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "reduc_strict_1.c"
#define TEST_REDUC_PLUS(TYPE) \
{ \
TYPE a[NUM_ELEMS (TYPE)]; \
TYPE b[NUM_ELEMS (TYPE)]; \
TYPE r = 0, q = 3; \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
a[i] = (i * 0.1) * (i & 1 ? 1 : -1); \
b[i] = (i * 0.3) * (i & 1 ? 1 : -1); \
r += a[i]; \
q -= b[i]; \
asm volatile ("" ::: "memory"); \
} \
TYPE res = reduc_plus_##TYPE (a, b); \
if (res != r * q) \
__builtin_abort (); \
}
int __attribute__ ((optimize (1)))
main ()
{
TEST_ALL (TEST_REDUC_PLUS);
return 0;
}
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
#define DEF_REDUC_PLUS(TYPE) \
void __attribute__ ((noinline, noclone)) \
reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)], \
TYPE *restrict r, int n) \
{ \
for (int i = 0; i < n; i++) \
{ \
r[i] = 0; \
for (int j = 0; j < NUM_ELEMS (TYPE); j++) \
r[i] += a[i][j]; \
} \
}
#define TEST_ALL(T) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (DEF_REDUC_PLUS)
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
/* { dg-do run { target { aarch64_sve_hw } } } */
/* { dg-options "-O2 -ftree-vectorize -fno-inline" } */
#include "reduc_strict_2.c"
#define NROWS 5
#define TEST_REDUC_PLUS(TYPE) \
{ \
TYPE a[NROWS][NUM_ELEMS (TYPE)]; \
TYPE r[NROWS]; \
TYPE expected[NROWS] = {}; \
for (int i = 0; i < NROWS; ++i) \
for (int j = 0; j < NUM_ELEMS (TYPE); ++j) \
{ \
a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1); \
expected[i] += a[i][j]; \
asm volatile ("" ::: "memory"); \
} \
reduc_plus_##TYPE (a, r, NROWS); \
for (int i = 0; i < NROWS; ++i) \
if (r[i] != expected[i]) \
__builtin_abort (); \
}
int __attribute__ ((optimize (1)))
main ()
{
TEST_ALL (TEST_REDUC_PLUS);
return 0;
}
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -fno-inline -msve-vector-bits=256 -fdump-tree-vect-details" } */
double mat[100][4];
double mat2[100][8];
double mat3[100][12];
double mat4[100][3];
double
slp_reduc_plus (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat[i][0];
tmp = tmp + mat[i][1];
tmp = tmp + mat[i][2];
tmp = tmp + mat[i][3];
}
return tmp;
}
double
slp_reduc_plus2 (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat2[i][0];
tmp = tmp + mat2[i][1];
tmp = tmp + mat2[i][2];
tmp = tmp + mat2[i][3];
tmp = tmp + mat2[i][4];
tmp = tmp + mat2[i][5];
tmp = tmp + mat2[i][6];
tmp = tmp + mat2[i][7];
}
return tmp;
}
double
slp_reduc_plus3 (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat3[i][0];
tmp = tmp + mat3[i][1];
tmp = tmp + mat3[i][2];
tmp = tmp + mat3[i][3];
tmp = tmp + mat3[i][4];
tmp = tmp + mat3[i][5];
tmp = tmp + mat3[i][6];
tmp = tmp + mat3[i][7];
tmp = tmp + mat3[i][8];
tmp = tmp + mat3[i][9];
tmp = tmp + mat3[i][10];
tmp = tmp + mat3[i][11];
}
return tmp;
}
void
slp_non_chained_reduc (int n, double * restrict out)
{
for (int i = 0; i < 3; i++)
out[i] = 0;
for (int i = 0; i < n; i++)
{
out[0] = out[0] + mat4[i][0];
out[1] = out[1] + mat4[i][1];
out[2] = out[2] + mat4[i][2];
}
}
/* Strict FP reductions shouldn't be used for the outer loops, only the
inner loops. */
float
double_reduc1 (float (*restrict i)[16])
{
float l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 8; b++)
l += i[b][a];
return l;
}
float
double_reduc2 (float *restrict i)
{
float l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 16; b++)
{
l += i[b * 4];
l += i[b * 4 + 1];
l += i[b * 4 + 2];
l += i[b * 4 + 3];
}
return l;
}
float
double_reduc3 (float *restrict i, float *restrict j)
{
float k = 0, l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 8; b++)
{
k += i[b];
l += j[b];
}
return l * k;
}
/* We can't yet handle double_reduc1. */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3. Each one
is reported three times, once for SVE, once for 128-bit AdvSIMD and once
for 64-bit AdvSIMD. */
/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
before failing. */
/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
/* The cost model thinks that the double loop isn't a win for SVE-128. */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -fno-vect-cost-model" } */
#include <stdint.h>
@@ -24,7 +25,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t)
T (uint64_t) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (VEC_PERM)
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
/* ??? We don't treat the uint loops as SLP. */
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
/* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
/* { dg-final { scan-assembler-not {\tfadd\n} } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
@@ -704,5 +704,5 @@ CALL track('KERNEL ')
RETURN
END SUBROUTINE kernel
! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }
! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slot, void *data ATTRIBUTE_UNUSED)
return 1;
}
/* Return true if the type of reduction performed by STMT is suitable
for this pass. */
static bool
valid_reduction_p (gimple *stmt)
{
/* Parallelization would reassociate the operation, which isn't
allowed for in-order reductions. */
stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
return reduc_type != FOLD_LEFT_REDUCTION;
}
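/* Editorial illustration (not part of the patch): parallelizing such a
   reduction would compute e.g. (a[0] + a[1]) + (a[2] + a[3]) on two threads
   instead of ((a[0] + a[1]) + a[2]) + a[3], and the two orderings can round
   differently -- exactly the reassociation that FOLD_LEFT_REDUCTION
   forbids.  */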
/* Detect all reductions in the LOOP, insert them into REDUCTION_LIST. */
static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, reduction_info_table_type *reduction_list
gimple *reduc_stmt
= vect_force_simple_reduction (simple_loop_info, phi,
&double_reduc, true);
if (!reduc_stmt)
if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
continue;
if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, reduction_info_table_type *reduction_list
= vect_force_simple_reduction (simple_loop_info, inner_phi,
&double_reduc, true);
gcc_assert (!double_reduc);
if (inner_reduc_stmt == NULL)
if (inner_reduc_stmt == NULL
|| !valid_reduction_p (inner_reduc_stmt))
continue;
build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
@@ -74,7 +74,15 @@ enum vect_reduction_type {
for (int i = 0; i < VF; ++i)
res = cond[i] ? val[i] : res; */
EXTRACT_LAST_REDUCTION
EXTRACT_LAST_REDUCTION,
/* Use a folding reduction within the loop to implement:
for (int i = 0; i < VF; ++i)
res = res OP val[i];
(with no reassociation). */
FOLD_LEFT_REDUCTION
};
#define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def) \
@@ -1390,6 +1398,7 @@ extern void vect_model_load_cost (stmt_vec_info, int, vect_memory_access_type,
extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
enum vect_cost_for_stmt, stmt_vec_info,
int, enum vect_cost_model_location);
extern void vect_finish_replace_stmt (gimple *, gimple *);
extern void vect_finish_stmt_generation (gimple *, gimple *,
gimple_stmt_iterator *);
extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);