Commit f1739b48 by Richard Sandiford

SLP reductions with variable-length vectors

Two things stopped us using SLP reductions with variable-length vectors:

(1) We didn't have a way of constructing the initial vector.
    This patch does it by creating a vector full of the neutral
    identity value and then using a shift-and-insert function
    to insert any non-identity inputs into the low-numbered elements.
    (The non-identity values are needed for double reductions.)
    Alternatively, for unchained MIN/MAX reductions that have no neutral
    value, we instead use the same duplicate-and-interleave approach as
    for SLP constant and external definitions (added by a previous
    patch).
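
    A minimal C model of this construction may help; it is illustrative
    only: vec_shl_insert below is a scalar stand-in for the new
    IFN_VEC_SHL_INSERT operation, and the fixed N stands in for the
    runtime vector length.

    #include <stdio.h>

    #define N 8   /* stand-in for the runtime vector length */

    /* Model of the shift-and-insert operation: shift every element one
       lane away from element 0 and put SCALAR into the vacated
       element 0.  */
    static void
    vec_shl_insert (int vec[N], int scalar)
    {
      for (int i = N - 1; i > 0; --i)
        vec[i] = vec[i - 1];
      vec[0] = scalar;
    }

    int
    main (void)
    {
      int init[N];

      /* Step 1: splat the neutral value (0 for an add reduction).  */
      for (int i = 0; i < N; ++i)
        init[i] = 0;

      /* Step 2: shift in the non-identity inputs, last SLP statement
         first, so that they end up in the low-numbered elements.  The
         two values are arbitrary initial accumulators for a
         two-statement reduction.  */
      int inputs[2] = { 40, 22 };
      for (int j = 1; j >= 0; --j)
        vec_shl_insert (init, inputs[j]);

      for (int i = 0; i < N; ++i)
        printf ("%d ", init[i]);   /* prints: 40 22 0 0 0 0 0 0 */
      printf ("\n");
      return 0;
    }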

(2) The epilogue for constant-length vectors would extract the vector
    elements associated with each SLP statement and do scalar arithmetic
    on these individual elements.  For variable-length vectors, the patch
    instead creates a reduction vector for each SLP statement, replacing
    the elements for other SLP statements with the identity value.
    It then uses a hardware reduction instruction on each vector.
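
    Likewise, a minimal C model of this epilogue, assuming an add
    reduction with two SLP statements interleaved in one vector;
    reduc_plus is a scalar stand-in for a hardware reduction instruction
    such as SVE's UADDV, and the lane layout and fixed N are purely
    illustrative.

    #include <stdio.h>

    #define N 8   /* stand-in for the runtime vector length */

    /* Model of a hardware reduction instruction.  */
    static int
    reduc_plus (const int vec[N])
    {
      int sum = 0;
      for (int i = 0; i < N; ++i)
        sum += vec[i];
      return sum;
    }

    int
    main (void)
    {
      /* Final loop-carried vector: even lanes belong to SLP statement 0,
         odd lanes to SLP statement 1.  */
      int acc[N] = { 1, 10, 2, 20, 3, 30, 4, 40 };

      for (int stmt = 0; stmt < 2; ++stmt)
        {
          int tmp[N];

          /* Keep this statement's lanes and stub out the other
             statement's lanes with the neutral value (0 for addition).  */
          for (int i = 0; i < N; ++i)
            tmp[i] = (i % 2 == stmt ? acc[i] : 0);

          /* Prints 10 (1+2+3+4) for statement 0, 100 for statement 1.  */
          printf ("scalar result %d = %d\n", stmt, reduc_plus (tmp));
        }
      return 0;
    }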

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (vec_shl_insert_@var{m}): New optab.
	* internal-fn.def (VEC_SHL_INSERT): New internal function.
	* optabs.def (vec_shl_insert_optab): New optab.
	* tree-vectorizer.h (can_duplicate_and_interleave_p): Declare.
	(duplicate_and_interleave): Likewise.
	* tree-vect-loop.c: Include internal-fn.h.
	(neutral_op_for_slp_reduction): New function, split out from
	get_initial_defs_for_reduction.
	(get_initial_def_for_reduction): Handle option 2 for variable-length
	vectors by loading the neutral value into a vector and then shifting
	the initial value into element 0.
	(get_initial_defs_for_reduction): Replace the code argument with
	the neutral value calculated by neutral_op_for_slp_reduction.
	Use gimple_build_vector for constant-length vectors.
	Use IFN_VEC_SHL_INSERT for variable-length vectors if all
	but the first group_size elements have a neutral value.
	Use duplicate_and_interleave otherwise.
	(vect_create_epilog_for_reduction): Take a neutral_op parameter.
	Update call to get_initial_defs_for_reduction.  Handle SLP
	reductions for variable-length vectors by creating one vector
	result for each scalar result, with the elements associated
	with other scalar results stubbed out with the neutral value.
	(vectorizable_reduction): Call neutral_op_for_slp_reduction.
	Require IFN_VEC_SHL_INSERT for double reductions on
	variable-length vectors, or SLP reductions that have
	a neutral value.  Require can_duplicate_and_interleave_p
	support for variable-length unchained SLP reductions if there
	is no neutral value, such as for MIN/MAX reductions.  Also require
	the number of vector elements to be a multiple of the number of
	SLP statements when doing variable-length unchained SLP reductions.
	Update call to vect_create_epilog_for_reduction.
	* tree-vect-slp.c (can_duplicate_and_interleave_p): Make public
	and remove initial values.
	(duplicate_and_interleave): Make public.
	* config/aarch64/aarch64.md (UNSPEC_INSR): New unspec.
	* config/aarch64/aarch64-sve.md (vec_shl_insert_<mode>): New insn.

gcc/testsuite/
	* gcc.dg/vect/pr37027.c: Remove XFAIL for variable-length vectors.
	* gcc.dg/vect/pr67790.c: Likewise.
	* gcc.dg/vect/slp-reduc-1.c: Likewise.
	* gcc.dg/vect/slp-reduc-2.c: Likewise.
	* gcc.dg/vect/slp-reduc-3.c: Likewise.
	* gcc.dg/vect/slp-reduc-5.c: Likewise.
	* gcc.target/aarch64/sve/slp_5.c: New test.
	* gcc.target/aarch64/sve/slp_5_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_6.c: Likewise.
	* gcc.target/aarch64/sve/slp_6_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_7.c: Likewise.
	* gcc.target/aarch64/sve/slp_7_run.c: Likewise.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256623
gcc/config/aarch64/aarch64-sve.md
@@ -2073,3 +2073,16 @@
     operands[5] = gen_reg_rtx (VNx4SImode);
   }
 )
+
+;; Shift an SVE vector left and insert a scalar into element 0.
+(define_insn "vec_shl_insert_<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w, w")
+	(unspec:SVE_ALL
+	  [(match_operand:SVE_ALL 1 "register_operand" "0, 0")
+	   (match_operand:<VEL> 2 "register_operand" "rZ, w")]
+	  UNSPEC_INSR))]
+  "TARGET_SVE"
+  "@
+   insr\t%0.<Vetype>, %<vwcore>2
+   insr\t%0.<Vetype>, %<Vetype>2"
+)
gcc/config/aarch64/aarch64.md
@@ -163,6 +163,7 @@
     UNSPEC_WHILE_LO
     UNSPEC_LDN
     UNSPEC_STN
+    UNSPEC_INSR
 ])

 (define_c_enum "unspecv" [
gcc/doc/md.texi
@@ -5224,6 +5224,14 @@
 operand 1.  Add operand 1 to operand 2 and place the widened result in
 operand 0.  (This is used express accumulation of elements into an accumulator
 of a wider mode.)

+@cindex @code{vec_shl_insert_@var{m}} instruction pattern
+@item @samp{vec_shl_insert_@var{m}}
+Shift the elements in vector input operand 1 left one element (i.e.
+away from element 0) and fill the vacated element 0 with the scalar
+in operand 2.  Store the result in vector output operand 0.  Operands
+0 and 1 have mode @var{m} and operand 2 has the mode appropriate for
+one element of @var{m}.
+
 @cindex @code{vec_shr_@var{m}} instruction pattern
 @item @samp{vec_shr_@var{m}}
 Whole vector right shift in bits, i.e. towards element 0.
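As a concrete reading of that description, here is a minimal C model of the
pattern's semantics on a fixed four-element vector; the type and function
names are hypothetical, and the optab itself operates on whole vector modes
rather than structs.

    #include <stdio.h>

    /* Hypothetical four-element vector, standing in for mode M.  */
    typedef struct { int e[4]; } v4int;

    static v4int
    vec_shl_insert_v4int (v4int op1, int op2)
    {
      v4int op0;
      op0.e[0] = op2;   /* the scalar fills the vacated element 0 */
      for (int i = 1; i < 4; ++i)
        op0.e[i] = op1.e[i - 1];   /* others shift away from element 0 */
      return op0;
    }

    int
    main (void)
    {
      v4int v = { { 1, 2, 3, 4 } };
      v = vec_shl_insert_v4int (v, 9);
      for (int i = 0; i < 4; ++i)
        printf ("%d ", v.e[i]);   /* prints: 9 1 2 3 */
      printf ("\n");
      return 0;
    }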
gcc/internal-fn.def
@@ -116,6 +116,9 @@
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
		        vec_mask_store_lanes, mask_store_lanes)

+DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
+		       vec_shl_insert, binary)
+
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)

 DEF_INTERNAL_OPTAB_FN (REDUC_PLUS, ECF_CONST | ECF_NOTHROW,
gcc/optabs.def
@@ -368,3 +368,4 @@
 OPTAB_D (set_thread_pointer_optab, "set_thread_pointer$I$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
+OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
gcc/testsuite/gcc.dg/vect/pr37027.c
@@ -32,5 +32,5 @@ foo (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
gcc/testsuite/gcc.dg/vect/pr67790.c
@@ -37,4 +37,4 @@ int main()
   return 0;
 }

-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail vect_variable_length } } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-1.c
@@ -43,5 +43,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-2.c
@@ -38,5 +38,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-3.c
@@ -58,7 +58,4 @@ int main (void)
 /* The initialization loop in main also gets vectorized.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 "vect" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target { vect_short_mult && { vect_widen_sum_hi_to_si && vect_unpack } } } } } */
-/* We can't yet create the necessary SLP constant vector for variable-length
-   SVE and so fall back to Advanced SIMD.  This means that we repeat each
-   analysis note.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { { ! vect_unpack } || { aarch64_sve && vect_variable_length } } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { ! vect_unpack } } } } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
@@ -43,5 +43,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail vect_no_int_min_max } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_min_max || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_5.c (new file)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */

#include <stdint.h>

#define VEC_PERM(TYPE)						\
void __attribute__ ((noinline, noclone))			\
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
{								\
  TYPE x0 = b[0];						\
  TYPE x1 = b[1];						\
  for (int i = 0; i < n; ++i)					\
    {								\
      x0 += a[i * 2];						\
      x1 += a[i * 2 + 1];					\
    }								\
  b[0] = x0;							\
  b[1] = x1;							\
}

#define TEST_ALL(T)				\
  T (int8_t)					\
  T (uint8_t)					\
  T (int16_t)					\
  T (uint16_t)					\
  T (int32_t)					\
  T (uint32_t)					\
  T (int64_t)					\
  T (uint64_t)					\
  T (_Float16)					\
  T (float)					\
  T (double)

TEST_ALL (VEC_PERM)

/* ??? We don't think it's worth using SLP for the 64-bit loops and fall
   back to the less efficient non-SLP implementation instead.  */
/* ??? At present we don't treat the int8_t and int16_t loops as
   reductions.  */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-not {\tld2b\t} } } */
/* { dg-final { scan-assembler-not {\tld2h\t} } } */
/* { dg-final { scan-assembler-not {\tld2w\t} } } */
/* { dg-final { scan-assembler-not {\tld2d\t} { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 2 } } */
/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 2 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_5_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include "slp_5.c"

#define N (141 * 2)

#define HARNESS(TYPE)					\
  {							\
    TYPE a[N], b[2] = { 40, 22 };			\
    for (unsigned int i = 0; i < N; ++i)		\
      {							\
	a[i] = i * 2 + i % 5;				\
	asm volatile ("" ::: "memory");			\
      }							\
    vec_slp_##TYPE (a, b, N / 2);			\
    TYPE x0 = 40;					\
    TYPE x1 = 22;					\
    for (unsigned int i = 0; i < N; i += 2)		\
      {							\
	x0 += a[i];					\
	x1 += a[i + 1];					\
	asm volatile ("" ::: "memory");			\
      }							\
    /* _Float16 isn't precise enough for this.  */	\
    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000		\
	&& (x0 != b[0] || x1 != b[1]))			\
      __builtin_abort ();				\
  }

int __attribute__ ((optimize (1)))
main (void)
{
  TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_6.c (new file)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */

#include <stdint.h>

#define VEC_PERM(TYPE)						\
void __attribute__ ((noinline, noclone))			\
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
{								\
  TYPE x0 = b[0];						\
  TYPE x1 = b[1];						\
  TYPE x2 = b[2];						\
  for (int i = 0; i < n; ++i)					\
    {								\
      x0 += a[i * 3];						\
      x1 += a[i * 3 + 1];					\
      x2 += a[i * 3 + 2];					\
    }								\
  b[0] = x0;							\
  b[1] = x1;							\
  b[2] = x2;							\
}

#define TEST_ALL(T)				\
  T (int8_t)					\
  T (uint8_t)					\
  T (int16_t)					\
  T (uint16_t)					\
  T (int32_t)					\
  T (uint32_t)					\
  T (int64_t)					\
  T (uint64_t)					\
  T (_Float16)					\
  T (float)					\
  T (double)

TEST_ALL (VEC_PERM)

/* These loops can't use SLP.  */
/* { dg-final { scan-assembler-not {\tld1b\t} } } */
/* { dg-final { scan-assembler-not {\tld1h\t} } } */
/* { dg-final { scan-assembler-not {\tld1w\t} } } */
/* { dg-final { scan-assembler-not {\tld1d\t} } } */
/* { dg-final { scan-assembler {\tld3b\t} } } */
/* { dg-final { scan-assembler {\tld3h\t} } } */
/* { dg-final { scan-assembler {\tld3w\t} } } */
/* { dg-final { scan-assembler {\tld3d\t} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_6_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include "slp_6.c"

#define N (77 * 3)

#define HARNESS(TYPE)					\
  {							\
    TYPE a[N], b[3] = { 40, 22, 75 };			\
    for (unsigned int i = 0; i < N; ++i)		\
      {							\
	a[i] = i * 2 + i % 5;				\
	asm volatile ("" ::: "memory");			\
      }							\
    vec_slp_##TYPE (a, b, N / 3);			\
    TYPE x0 = 40;					\
    TYPE x1 = 22;					\
    TYPE x2 = 75;					\
    for (unsigned int i = 0; i < N; i += 3)		\
      {							\
	x0 += a[i];					\
	x1 += a[i + 1];					\
	x2 += a[i + 2];					\
	asm volatile ("" ::: "memory");			\
      }							\
    /* _Float16 isn't precise enough for this.  */	\
    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000		\
	&& (x0 != b[0] || x1 != b[1] || x2 != b[2]))	\
      __builtin_abort ();				\
  }

int __attribute__ ((optimize (1)))
main (void)
{
  TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_7.c (new file)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */

#include <stdint.h>

#define VEC_PERM(TYPE)						\
void __attribute__ ((noinline, noclone))			\
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
{								\
  TYPE x0 = b[0];						\
  TYPE x1 = b[1];						\
  TYPE x2 = b[2];						\
  TYPE x3 = b[3];						\
  for (int i = 0; i < n; ++i)					\
    {								\
      x0 += a[i * 4];						\
      x1 += a[i * 4 + 1];					\
      x2 += a[i * 4 + 2];					\
      x3 += a[i * 4 + 3];					\
    }								\
  b[0] = x0;							\
  b[1] = x1;							\
  b[2] = x2;							\
  b[3] = x3;							\
}

#define TEST_ALL(T)				\
  T (int8_t)					\
  T (uint8_t)					\
  T (int16_t)					\
  T (uint16_t)					\
  T (int32_t)					\
  T (uint32_t)					\
  T (int64_t)					\
  T (uint64_t)					\
  T (_Float16)					\
  T (float)					\
  T (double)

TEST_ALL (VEC_PERM)

/* We can't use SLP for the 64-bit loops, since the number of reduction
   results might be greater than the number of elements in the vector.
   Otherwise we have two loads per loop, one for the initial vector
   and one for the loop body.  */
/* ??? At present we don't treat the int8_t and int16_t loops as
   reductions.  */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
/* { dg-final { scan-assembler-not {\tld4b\t} } } */
/* { dg-final { scan-assembler-not {\tld4h\t} } } */
/* { dg-final { scan-assembler-not {\tld4w\t} } } */
/* { dg-final { scan-assembler-not {\tld1d\t} } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 8 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 8 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 8 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include "slp_7.c"

#define N (54 * 4)

#define HARNESS(TYPE)						\
  {								\
    TYPE a[N], b[4] = { 40, 22, 75, 19 };			\
    for (unsigned int i = 0; i < N; ++i)			\
      {								\
	a[i] = i * 2 + i % 5;					\
	asm volatile ("" ::: "memory");				\
      }								\
    vec_slp_##TYPE (a, b, N / 4);				\
    TYPE x0 = 40;						\
    TYPE x1 = 22;						\
    TYPE x2 = 75;						\
    TYPE x3 = 19;						\
    for (unsigned int i = 0; i < N; i += 4)			\
      {								\
	x0 += a[i];						\
	x1 += a[i + 1];						\
	x2 += a[i + 2];						\
	x3 += a[i + 3];						\
	asm volatile ("" ::: "memory");				\
      }								\
    /* _Float16 isn't precise enough for this.  */		\
    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000			\
	&& (x0 != b[0] || x1 != b[1] || x2 != b[2] || x3 != b[3]))	\
      __builtin_abort ();					\
  }

int __attribute__ ((optimize (1)))
main (void)
{
  TEST_ALL (HARNESS)
}
gcc/tree-vect-slp.c
@@ -216,11 +216,11 @@ vect_get_place_in_interleaving_chain (gimple *stmt, gimple *first_stmt)
    (if nonnull) and the type of each intermediate vector in *VECTOR_TYPE_OUT
    (if nonnull).  */

-static bool
+bool
 can_duplicate_and_interleave_p (unsigned int count, machine_mode elt_mode,
-				unsigned int *nvectors_out = NULL,
-				tree *vector_type_out = NULL,
-				tree *permutes = NULL)
+				unsigned int *nvectors_out,
+				tree *vector_type_out,
+				tree *permutes)
 {
   poly_int64 elt_bytes = count * GET_MODE_SIZE (elt_mode);
   poly_int64 nelts;
@@ -3309,7 +3309,7 @@ vect_mask_constant_operand_p (gimple *stmt, int opnum)
    We try to find the largest IM for which this sequence works, in order
    to cut down on the number of interleaves.  */

-static void
+void
 duplicate_and_interleave (gimple_seq *seq, tree vector_type, vec<tree> elts,
			   unsigned int nresults, vec<tree> &results)
 {
gcc/tree-vectorizer.h
@@ -1352,6 +1352,11 @@
 extern void vect_get_slp_defs (vec<tree> , slp_tree, vec<vec<tree> > *);
 extern bool vect_slp_bb (basic_block);
 extern gimple *vect_find_last_scalar_stmt_in_slp (slp_tree);
 extern bool is_simple_and_all_uses_invariant (gimple *, loop_vec_info);
+extern bool can_duplicate_and_interleave_p (unsigned int, machine_mode,
+					    unsigned int * = NULL,
+					    tree * = NULL, tree * = NULL);
+extern void duplicate_and_interleave (gimple_seq *, tree, vec<tree>,
+				      unsigned int, vec<tree> &);

 /* In tree-vect-patterns.c.  */
 /* Pattern recognition functions.