Commit b781a135 by Richard Sandiford, committed by Richard Sandiford

Add support for in-order addition reduction using SVE FADDA

This patch adds support for in-order floating-point addition reductions,
which are suitable even in strict IEEE mode.

Previously vect_is_simple_reduction would reject any cases that forbid
reassociation.  The idea is instead to tentatively accept them as
"FOLD_LEFT_REDUCTIONs" and only fail later if there is no support
for them.  Although this patch only handles the particular case of plus
and minus on floating-point types, there's no reason in principle why
we couldn't handle other cases.
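
As a concrete illustration (an example added for this write-up, not code
from the patch), the kind of loop that is now accepted is a plain
accumulation whose additions must be performed in source order under
strict IEEE semantics:

    /* With -fno-fast-math the additions below cannot be reassociated,
       so the loop is vectorized as a FOLD_LEFT_REDUCTION rather than
       being rejected outright.  */
    double
    sum (double *a, int n)
    {
      double res = 0.0;
      for (int i = 0; i < n; i++)
        res += a[i];
      return res;
    }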

The reductions use a new fold_left_plus_optab if available, otherwise
they fall back to elementwise additions or subtractions.
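
A rough sketch of that fallback (illustrative C rather than the GIMPLE the
vectorizer actually emits; the function name is made up for the example):

    /* Reduce one vector of NUNITS elements into the scalar accumulator,
       one element at a time, strictly left to right.  */
    static double
    fold_left_step (double res, const double *vec, int nunits)
    {
      for (int i = 0; i < nunits; i++)
        res += vec[i];
      return res;
    }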

The vect_force_simple_reduction change makes it easier for parloops
to read the type of reduction.

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (fold_left_plus_optab): New optab.
	* doc/md.texi (fold_left_plus_@var{m}): Document.
	* internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
	* internal-fn.c (fold_left_direct): Define.
	(expand_fold_left_optab_fn): Likewise.
	(direct_fold_left_optab_supported_p): Likewise.
	* fold-const-call.c (fold_const_fold_left): New function.
	(fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
	* tree-parloops.c (valid_reduction_p): New function.
	(gather_scalar_reductions): Use it.
	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
	(vect_finish_replace_stmt): Declare.
	* tree-vect-loop.c (fold_left_reduction_fn): New function.
	(needs_fold_left_reduction_p): New function, split out from...
	(vect_is_simple_reduction): ...here.  Accept reductions that
	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
	(vect_force_simple_reduction): Also store the reduction type in
	the assignment's STMT_VINFO_REDUC_TYPE.
	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
	(merge_with_identity): New function.
	(vect_expand_fold_left): Likewise.
	(vectorize_fold_left_reduction): Likewise.
	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
	scalar phi in place for it.  Check for target support and reject
	cases that would reassociate the operation.  Defer the transform
	phase to vectorize_fold_left_reduction.
	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
	check for a message about using in-order reductions.
	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
	vectorized and check for a message about using in-order reductions.
	Expect targets with variable-length vectors to fall back to the
	fixed-length minimum.
	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
	check for a message about using in-order reductions.
	* gcc.dg/vect/vect-reduc-in-order-1.c: New test.
	* gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
	* gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_1.c: New test.
	* gcc.target/aarch64/sve/reduc_strict_1_run.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_2.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_2_run.c: Likewise.
	* gcc.target/aarch64/sve/reduc_strict_3.c: Likewise.
	* gcc.target/aarch64/sve/slp_13.c: Add floating-point types.
	* gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
	vect_fold_left_plus.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256639
@@ -1550,6 +1550,45 @@
"<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
)
;; Unpredicated in-order FP reductions.
(define_expand "fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand")
(unspec:<VEL> [(match_dup 3)
(match_operand:<VEL> 1 "register_operand")
(match_operand:SVE_F 2 "register_operand")]
UNSPEC_FADDA))]
"TARGET_SVE"
{
operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
}
)
;; In-order FP reductions predicated with PTRUE.
(define_insn "*fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand" "=w")
(unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
(match_operand:<VEL> 2 "register_operand" "0")
(match_operand:SVE_F 3 "register_operand" "w")]
UNSPEC_FADDA))]
"TARGET_SVE"
"fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
)
;; Predicated form of the above in-order reduction.
(define_insn "*pred_fold_left_plus_<mode>"
[(set (match_operand:<VEL> 0 "register_operand" "=w")
(unspec:<VEL>
[(match_operand:<VEL> 1 "register_operand" "0")
(unspec:SVE_F
[(match_operand:<VPRED> 2 "register_operand" "Upl")
(match_operand:SVE_F 3 "register_operand" "w")
(match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
UNSPEC_SEL)]
UNSPEC_FADDA))]
"TARGET_SVE"
"fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
)
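
As a rough model of what the FADDA-based reduction computes (an editorial
illustration, not the architecture definition): the scalar accumulator is
updated once per active lane, in lane order, so each partial sum is rounded
before the next addition.

    /* Illustrative C model of a predicated, in-order FADDA accumulation.  */
    static double
    fadda_model (double acc, const double *lanes, const _Bool *pg, int nlanes)
    {
      for (int i = 0; i < nlanes; i++)
        if (pg[i])          /* inactive lanes are skipped, not reassociated */
          acc += lanes[i];
      return acc;
    }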
;; Unpredicated floating-point addition.
(define_expand "add<mode>3"
[(set (match_operand:SVE_F 0 "register_operand")
@@ -165,6 +165,7 @@
UNSPEC_STN
UNSPEC_INSR
UNSPEC_CLASTB
UNSPEC_FADDA
])
(define_c_enum "unspecv" [
@@ -5236,6 +5236,14 @@ has mode @var{m} and operands 0 and 1 have the mode appropriate for
one element of @var{m}. Operand 2 has the usual mask mode for vectors
of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
@cindex @code{fold_left_plus_@var{m}} instruction pattern
@item @code{fold_left_plus_@var{m}}
Take scalar operand 1 and successively add each element from vector
operand 2. Store the result in scalar operand 0. The vector has
mode @var{m} and the scalars have the mode appropriate for one
element of @var{m}. The operation is strictly in-order: there is
no reassociation.
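
Informally, and only as a sketch of the behaviour documented above (operand
names refer to the pattern's operands; this is not additional normative text):

    /* operands 0 and 1 are scalars, operand 2 is a vector of mode m.  */
    operand0 = operand1;
    for (i = 0; i < number_of_elements_in_m; i++)
      operand0 = operand0 + operand2[i];   /* strictly in order */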
@cindex @code{sdot_prod@var{m}} instruction pattern
@item @samp{sdot_prod@var{m}}
@cindex @code{udot_prod@var{m}} instruction pattern
@@ -1195,6 +1195,28 @@ fold_const_call (combined_fn fn, tree type, tree arg)
}
}
/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
of type TYPE. */
static tree
fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
{
if (TREE_CODE (arg1) != VECTOR_CST)
return NULL_TREE;
unsigned HOST_WIDE_INT nelts;
if (!VECTOR_CST_NELTS (arg1).is_constant (&nelts))
return NULL_TREE;
for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
{
arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
return NULL_TREE;
}
return arg0;
}
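/* Worked example (editorial, illustrative constants only):
   FOLD_LEFT_PLUS (1.0, { 2.0, 3.0, 4.0 }) folds to
   ((1.0 + 2.0) + 3.0) + 4.0 == 10.0, evaluated strictly left to right;
   if any intermediate const_binop fails, the call is left unfolded.  */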
/* Try to evaluate:
*RESULT = FN (*ARG0, *ARG1)
@@ -1500,6 +1522,9 @@ fold_const_call (combined_fn fn, tree type, tree arg0, tree arg1)
}
return NULL_TREE;
case CFN_FOLD_LEFT_PLUS:
return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
default:
return fold_const_call_1 (fn, type, arg0, arg1);
}
@@ -92,6 +92,7 @@ init_internal_fns ()
#define cond_binary_direct { 1, 1, true }
#define while_direct { 0, 2, false }
#define fold_extract_direct { 2, 2, false }
#define fold_left_direct { 1, 1, false }
const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
#define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2897,6 +2898,9 @@ expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
#define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
expand_direct_optab_fn (FN, STMT, OPTAB, 3)
#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
expand_direct_optab_fn (FN, STMT, OPTAB, 2)
/* RETURN_TYPE and ARGS are a return type and argument list that are
in principle compatible with FN (which satisfies direct_internal_fn_p).
Return the types that should be used to determine whether the
@@ -2980,6 +2984,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
#define direct_while_optab_supported_p convert_optab_supported_p
#define direct_fold_extract_optab_supported_p direct_optab_supported_p
#define direct_fold_left_optab_supported_p direct_optab_supported_p
/* Return the optab used by internal function FN. */
@@ -58,6 +58,8 @@ along with GCC; see the file COPYING3. If not see
- cond_binary: a conditional binary optab, such as add<mode>cc
- fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
maps to one of two optabs, depending on the signedness of an input.
SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
@@ -162,6 +164,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
fold_extract_last, fold_extract)
DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
fold_left_plus, fold_left)
/* Unary math functions. */
DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
OPTAB_D (reduc_and_scal_optab, "reduc_and_scal_$a")
OPTAB_D (reduc_ior_scal_optab, "reduc_ior_scal_$a")
OPTAB_D (reduc_xor_scal_optab, "reduc_xor_scal_$a")
OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
OPTAB_D (extract_last_optab, "extract_last_$a")
OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
@@ -33,5 +33,5 @@ int main (void)
return main1 ();
}
/* Requires fast-math. */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-do run } */
/* { dg-additional-options "-O3" } */
/* { dg-additional-options "-O3 -fno-fast-math" } */
#include "tree-vect.h"
@@ -41,4 +41,5 @@ int main()
return 0;
}
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
@@ -46,5 +46,8 @@ int main (void)
return 0;
}
/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
/* We can't handle the first loop with variable-length vectors and so
fall back to the fixed-length minimum instead. */
/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { xfail vect_variable_length } } } */
/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-require-effective-target vect_float } */
/* { dg-additional-options "-fno-fast-math" } */
#include <stdarg.h>
#include "tree-vect.h"
@@ -48,6 +49,5 @@ int main (void)
return 0;
}
/* Need -ffast-math to vectorize these loops. */
/* ARM NEON passes -ffast-math to these tests, so expect this to fail. */
/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a, double *b)
{
double r = 0, q = 3;
for (int i = 0; i < N; i++)
{
r += a[i];
q -= b[i];
}
return r * q;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double b[N];
double r = 0, q = 3;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
b[i] = (i * 0.3) * (i & 1 ? 1 : -1);
r += a[i];
q -= b[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a, b);
if (res != r * q)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *restrict a, int n)
{
double res = 0.0;
for (int i = 0; i < n; i++)
for (int j = 0; j < N; j++)
res += a[i];
return res;
}
int __attribute__ ((optimize (1)))
main ()
{
int n = 19;
double a[N];
double r = 0;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
asm volatile ("" ::: "memory");
}
for (int i = 0; i < n; i++)
for (int j = 0; j < N; j++)
{
r += a[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a, n);
if (res != r)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump {in-order double reduction not supported} "vect" } } */
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a)
{
double r = 0;
for (int i = 0; i < N; i += 4)
{
r += a[i] * 2.0;
r += a[i + 1] * 3.0;
r += a[i + 2] * 4.0;
r += a[i + 3] * 5.0;
}
return r;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double r = 0;
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
r += a[i] * (i % 4 + 2);
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a);
if (res != r)
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */
/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
/* { dg-require-effective-target vect_double } */
/* { dg-add-options ieee } */
/* { dg-additional-options "-fno-fast-math" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 17)
double __attribute__ ((noinline, noclone))
reduc_plus_double (double *a)
{
double r1 = 0;
double r2 = 0;
double r3 = 0;
double r4 = 0;
for (int i = 0; i < N; i += 4)
{
r1 += a[i];
r2 += a[i + 1];
r3 += a[i + 2];
r4 += a[i + 3];
}
return r1 * r2 * r3 * r4;
}
int __attribute__ ((optimize (1)))
main ()
{
double a[N];
double r[4] = {};
for (int i = 0; i < N; i++)
{
a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
r[i % 4] += a[i];
asm volatile ("" ::: "memory");
}
double res = reduc_plus_double (a);
if (res != r[0] * r[1] * r[2] * r[3])
__builtin_abort ();
return 0;
}
/* { dg-final { scan-tree-dump {in-order unchained SLP reductions not supported} "vect" } } */
/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
#define DEF_REDUC_PLUS(TYPE) \
TYPE __attribute__ ((noinline, noclone)) \
reduc_plus_##TYPE (TYPE *a, TYPE *b) \
{ \
TYPE r = 0, q = 3; \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
r += a[i]; \
q -= b[i]; \
} \
return r * q; \
}
#define TEST_ALL(T) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (DEF_REDUC_PLUS)
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
/* { dg-do run { target { aarch64_sve_hw } } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "reduc_strict_1.c"
#define TEST_REDUC_PLUS(TYPE) \
{ \
TYPE a[NUM_ELEMS (TYPE)]; \
TYPE b[NUM_ELEMS (TYPE)]; \
TYPE r = 0, q = 3; \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
a[i] = (i * 0.1) * (i & 1 ? 1 : -1); \
b[i] = (i * 0.3) * (i & 1 ? 1 : -1); \
r += a[i]; \
q -= b[i]; \
asm volatile ("" ::: "memory"); \
} \
TYPE res = reduc_plus_##TYPE (a, b); \
if (res != r * q) \
__builtin_abort (); \
}
int __attribute__ ((optimize (1)))
main ()
{
TEST_ALL (TEST_REDUC_PLUS);
return 0;
}
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
#define DEF_REDUC_PLUS(TYPE) \
void __attribute__ ((noinline, noclone)) \
reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)], \
TYPE *restrict r, int n) \
{ \
for (int i = 0; i < n; i++) \
{ \
r[i] = 0; \
for (int j = 0; j < NUM_ELEMS (TYPE); j++) \
r[i] += a[i][j]; \
} \
}
#define TEST_ALL(T) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (DEF_REDUC_PLUS)
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
/* { dg-do run { target { aarch64_sve_hw } } } */
/* { dg-options "-O2 -ftree-vectorize -fno-inline" } */
#include "reduc_strict_2.c"
#define NROWS 5
#define TEST_REDUC_PLUS(TYPE) \
{ \
TYPE a[NROWS][NUM_ELEMS (TYPE)]; \
TYPE r[NROWS]; \
TYPE expected[NROWS] = {}; \
for (int i = 0; i < NROWS; ++i) \
for (int j = 0; j < NUM_ELEMS (TYPE); ++j) \
{ \
a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1); \
expected[i] += a[i][j]; \
asm volatile ("" ::: "memory"); \
} \
reduc_plus_##TYPE (a, r, NROWS); \
for (int i = 0; i < NROWS; ++i) \
if (r[i] != expected[i]) \
__builtin_abort (); \
}
int __attribute__ ((optimize (1)))
main ()
{
TEST_ALL (TEST_REDUC_PLUS);
return 0;
}
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -fno-inline -msve-vector-bits=256 -fdump-tree-vect-details" } */
double mat[100][4];
double mat2[100][8];
double mat3[100][12];
double mat4[100][3];
double
slp_reduc_plus (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat[i][0];
tmp = tmp + mat[i][1];
tmp = tmp + mat[i][2];
tmp = tmp + mat[i][3];
}
return tmp;
}
double
slp_reduc_plus2 (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat2[i][0];
tmp = tmp + mat2[i][1];
tmp = tmp + mat2[i][2];
tmp = tmp + mat2[i][3];
tmp = tmp + mat2[i][4];
tmp = tmp + mat2[i][5];
tmp = tmp + mat2[i][6];
tmp = tmp + mat2[i][7];
}
return tmp;
}
double
slp_reduc_plus3 (int n)
{
double tmp = 0.0;
for (int i = 0; i < n; i++)
{
tmp = tmp + mat3[i][0];
tmp = tmp + mat3[i][1];
tmp = tmp + mat3[i][2];
tmp = tmp + mat3[i][3];
tmp = tmp + mat3[i][4];
tmp = tmp + mat3[i][5];
tmp = tmp + mat3[i][6];
tmp = tmp + mat3[i][7];
tmp = tmp + mat3[i][8];
tmp = tmp + mat3[i][9];
tmp = tmp + mat3[i][10];
tmp = tmp + mat3[i][11];
}
return tmp;
}
void
slp_non_chained_reduc (int n, double * restrict out)
{
for (int i = 0; i < 3; i++)
out[i] = 0;
for (int i = 0; i < n; i++)
{
out[0] = out[0] + mat4[i][0];
out[1] = out[1] + mat4[i][1];
out[2] = out[2] + mat4[i][2];
}
}
/* Strict FP reductions shouldn't be used for the outer loops, only the
inner loops. */
float
double_reduc1 (float (*restrict i)[16])
{
float l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 8; b++)
l += i[b][a];
return l;
}
float
double_reduc2 (float *restrict i)
{
float l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 16; b++)
{
l += i[b * 4];
l += i[b * 4 + 1];
l += i[b * 4 + 2];
l += i[b * 4 + 3];
}
return l;
}
float
double_reduc3 (float *restrict i, float *restrict j)
{
float k = 0, l = 0;
for (int a = 0; a < 8; a++)
for (int b = 0; b < 8; b++)
{
k += i[b];
l += j[b];
}
return l * k;
}
/* We can't yet handle double_reduc1. */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3. Each one
is reported three times, once for SVE, once for 128-bit AdvSIMD and once
for 64-bit AdvSIMD. */
/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
before failing. */
/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
/* The cost model thinks that the double loop isn't a win for SVE-128. */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -fno-vect-cost-model" } */
#include <stdint.h>
@@ -24,7 +25,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t)
T (uint64_t) \
T (_Float16) \
T (float) \
T (double)
TEST_ALL (VEC_PERM)
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
/* ??? We don't treat the uint loops as SLP. */
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
/* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
/* { dg-final { scan-assembler-not {\tfadd\n} } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
@@ -704,5 +704,5 @@ CALL track('KERNEL ')
RETURN
END SUBROUTINE kernel
! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }
! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slot, void *data ATTRIBUTE_UNUSED)
return 1;
}
/* Return true if the type of reduction performed by STMT is suitable
for this pass. */
static bool
valid_reduction_p (gimple *stmt)
{
/* Parallelization would reassociate the operation, which isn't
allowed for in-order reductions. */
stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
return reduc_type != FOLD_LEFT_REDUCTION;
}
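/* Editorial illustration (not part of the patch): parallelizing such a
   reduction would compute e.g. (a[0] + a[1]) + (a[2] + a[3]) on two threads
   instead of ((a[0] + a[1]) + a[2]) + a[3], and the two orderings can round
   differently -- exactly the reassociation that FOLD_LEFT_REDUCTION
   forbids.  */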
/* Detect all reductions in the LOOP, insert them into REDUCTION_LIST. */
static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, reduction_info_table_type *reduction_list
gimple *reduc_stmt
= vect_force_simple_reduction (simple_loop_info, phi,
&double_reduc, true);
if (!reduc_stmt)
if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
continue;
if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, reduction_info_table_type *reduction_list
= vect_force_simple_reduction (simple_loop_info, inner_phi,
&double_reduc, true);
gcc_assert (!double_reduc);
if (inner_reduc_stmt == NULL)
if (inner_reduc_stmt == NULL
|| !valid_reduction_p (inner_reduc_stmt))
continue;
build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
@@ -74,7 +74,15 @@ enum vect_reduction_type {
for (int i = 0; i < VF; ++i)
res = cond[i] ? val[i] : res; */
EXTRACT_LAST_REDUCTION
EXTRACT_LAST_REDUCTION,
/* Use a folding reduction within the loop to implement:
for (int i = 0; i < VF; ++i)
res = res OP val[i];
(with no reassociation). */
FOLD_LEFT_REDUCTION
};
#define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def) \
@@ -1390,6 +1398,7 @@ extern void vect_model_load_cost (stmt_vec_info, int, vect_memory_access_type,
extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
enum vect_cost_for_stmt, stmt_vec_info,
int, enum vect_cost_model_location);
extern void vect_finish_replace_stmt (gimple *, gimple *);
extern void vect_finish_stmt_generation (gimple *, gimple *,
gimple_stmt_iterator *);
extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);