Commit f1739b48 by Richard Sandiford

SLP reductions with variable-length vectors

Two things stopped us using SLP reductions with variable-length vectors:

(1) We didn't have a way of constructing the initial vector.
    This patch does it by creating a vector full of the neutral
    identity value and then using a shift-and-insert function
    to insert any non-identity inputs into the low-numbered elements.
    (The non-identity values are needed for double reductions.)
    Alternatively, for unchained MIN/MAX reductions that have no neutral
    value, we instead use the same duplicate-and-interleave approach as
    for SLP constant and external definitions (added by a previous
    patch).
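
    A minimal C model of this construction may help; it is illustrative
    only: vec_shl_insert below is a scalar stand-in for the new
    IFN_VEC_SHL_INSERT operation, and the fixed N stands in for the
    runtime vector length.

    #include <stdio.h>

    #define N 8   /* stand-in for the runtime vector length */

    /* Model of the shift-and-insert operation: shift every element one
       lane away from element 0 and put SCALAR into the vacated
       element 0.  */
    static void
    vec_shl_insert (int vec[N], int scalar)
    {
      for (int i = N - 1; i > 0; --i)
        vec[i] = vec[i - 1];
      vec[0] = scalar;
    }

    int
    main (void)
    {
      int init[N];

      /* Step 1: splat the neutral value (0 for an add reduction).  */
      for (int i = 0; i < N; ++i)
        init[i] = 0;

      /* Step 2: shift in the non-identity inputs, last SLP statement
         first, so that they end up in the low-numbered elements.  The
         two values are arbitrary initial accumulators for a
         two-statement reduction.  */
      int inputs[2] = { 40, 22 };
      for (int j = 1; j >= 0; --j)
        vec_shl_insert (init, inputs[j]);

      for (int i = 0; i < N; ++i)
        printf ("%d ", init[i]);   /* prints: 40 22 0 0 0 0 0 0 */
      printf ("\n");
      return 0;
    }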

(2) The epilogue for constant-length vectors would extract the vector
    elements associated with each SLP statement and do scalar arithmetic
    on these individual elements.  For variable-length vectors, the patch
    instead creates a reduction vector for each SLP statement, replacing
    the elements for other SLP statements with the identity value.
    It then uses a hardware reduction instruction on each vector.
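
    Likewise, a minimal C model of this epilogue, assuming an add
    reduction with two SLP statements interleaved in one vector;
    reduc_plus is a scalar stand-in for a hardware reduction instruction
    such as SVE's UADDV, and the lane layout and fixed N are purely
    illustrative.

    #include <stdio.h>

    #define N 8   /* stand-in for the runtime vector length */

    /* Model of a hardware reduction instruction.  */
    static int
    reduc_plus (const int vec[N])
    {
      int sum = 0;
      for (int i = 0; i < N; ++i)
        sum += vec[i];
      return sum;
    }

    int
    main (void)
    {
      /* Final loop-carried vector: even lanes belong to SLP statement 0,
         odd lanes to SLP statement 1.  */
      int acc[N] = { 1, 10, 2, 20, 3, 30, 4, 40 };

      for (int stmt = 0; stmt < 2; ++stmt)
        {
          int tmp[N];

          /* Keep this statement's lanes and stub out the other
             statement's lanes with the neutral value (0 for addition).  */
          for (int i = 0; i < N; ++i)
            tmp[i] = (i % 2 == stmt ? acc[i] : 0);

          /* Prints 10 (1+2+3+4) for statement 0, 100 for statement 1.  */
          printf ("scalar result %d = %d\n", stmt, reduc_plus (tmp));
        }
      return 0;
    }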

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (vec_shl_insert_@var{m}): New optab.
	* internal-fn.def (VEC_SHL_INSERT): New internal function.
	* optabs.def (vec_shl_insert_optab): New optab.
	* tree-vectorizer.h (can_duplicate_and_interleave_p): Declare.
	(duplicate_and_interleave): Likewise.
	* tree-vect-loop.c: Include internal-fn.h.
	(neutral_op_for_slp_reduction): New function, split out from
	get_initial_defs_for_reduction.
	(get_initial_def_for_reduction): Handle option 2 for variable-length
	vectors by loading the neutral value into a vector and then shifting
	the initial value into element 0.
	(get_initial_defs_for_reduction): Replace the code argument with
	the neutral value calculated by neutral_op_for_slp_reduction.
	Use gimple_build_vector for constant-length vectors.
	Use IFN_VEC_SHL_INSERT for variable-length vectors if all
	but the first group_size elements have a neutral value.
	Use duplicate_and_interleave otherwise.
	(vect_create_epilog_for_reduction): Take a neutral_op parameter.
	Update call to get_initial_defs_for_reduction.  Handle SLP
	reductions for variable-length vectors by creating one vector
	result for each scalar result, with the elements associated
	with other scalar results stubbed out with the neutral value.
	(vectorizable_reduction): Call neutral_op_for_slp_reduction.
	Require IFN_VEC_SHL_INSERT for double reductions on
	variable-length vectors, or SLP reductions that have
	a neutral value.  Require can_duplicate_and_interleave_p
	support for variable-length unchained SLP reductions if there
	is no neutral value, such as for MIN/MAX reductions.  Also require
	the number of vector elements to be a multiple of the number of
	SLP statements when doing variable-length unchained SLP reductions.
	Update call to vect_create_epilog_for_reduction.
	* tree-vect-slp.c (can_duplicate_and_interleave_p): Make public
	and remove initial values.
	(duplicate_and_interleave): Make public.
	* config/aarch64/aarch64.md (UNSPEC_INSR): New unspec.
	* config/aarch64/aarch64-sve.md (vec_shl_insert_<mode>): New insn.

gcc/testsuite/
	* gcc.dg/vect/pr37027.c: Remove XFAIL for variable-length vectors.
	* gcc.dg/vect/pr67790.c: Likewise.
	* gcc.dg/vect/slp-reduc-1.c: Likewise.
	* gcc.dg/vect/slp-reduc-2.c: Likewise.
	* gcc.dg/vect/slp-reduc-3.c: Likewise.
	* gcc.dg/vect/slp-reduc-5.c: Likewise.
	* gcc.target/aarch64/sve/slp_5.c: New test.
	* gcc.target/aarch64/sve/slp_5_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_6.c: Likewise.
	* gcc.target/aarch64/sve/slp_6_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_7.c: Likewise.
	* gcc.target/aarch64/sve/slp_7_run.c: Likewise.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256623
gcc/config/aarch64/aarch64-sve.md
@@ -2073,3 +2073,16 @@
     operands[5] = gen_reg_rtx (VNx4SImode);
   }
 )
+
+;; Shift an SVE vector left and insert a scalar into element 0.
+(define_insn "vec_shl_insert_<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w, w")
+	(unspec:SVE_ALL
+	  [(match_operand:SVE_ALL 1 "register_operand" "0, 0")
+	   (match_operand:<VEL> 2 "register_operand" "rZ, w")]
+	  UNSPEC_INSR))]
+  "TARGET_SVE"
+  "@
+   insr\t%0.<Vetype>, %<vwcore>2
+   insr\t%0.<Vetype>, %<Vetype>2"
+)
gcc/config/aarch64/aarch64.md
@@ -163,6 +163,7 @@
     UNSPEC_WHILE_LO
     UNSPEC_LDN
     UNSPEC_STN
+    UNSPEC_INSR
 ])

 (define_c_enum "unspecv" [
gcc/doc/md.texi
@@ -5224,6 +5224,14 @@
 operand 1.  Add operand 1 to operand 2 and place the widened result in
 operand 0.  (This is used express accumulation of elements into an accumulator
 of a wider mode.)

+@cindex @code{vec_shl_insert_@var{m}} instruction pattern
+@item @samp{vec_shl_insert_@var{m}}
+Shift the elements in vector input operand 1 left one element (i.e.
+away from element 0) and fill the vacated element 0 with the scalar
+in operand 2.  Store the result in vector output operand 0.  Operands
+0 and 1 have mode @var{m} and operand 2 has the mode appropriate for
+one element of @var{m}.
+
 @cindex @code{vec_shr_@var{m}} instruction pattern
 @item @samp{vec_shr_@var{m}}
 Whole vector right shift in bits, i.e. towards element 0.
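As a concrete reading of that description, here is a minimal C model of the
pattern's semantics on a fixed four-element vector; the type and function
names are hypothetical, and the optab itself operates on whole vector modes
rather than structs.

    #include <stdio.h>

    /* Hypothetical four-element vector, standing in for mode M.  */
    typedef struct { int e[4]; } v4int;

    static v4int
    vec_shl_insert_v4int (v4int op1, int op2)
    {
      v4int op0;
      op0.e[0] = op2;   /* the scalar fills the vacated element 0 */
      for (int i = 1; i < 4; ++i)
        op0.e[i] = op1.e[i - 1];   /* others shift away from element 0 */
      return op0;
    }

    int
    main (void)
    {
      v4int v = { { 1, 2, 3, 4 } };
      v = vec_shl_insert_v4int (v, 9);
      for (int i = 0; i < 4; ++i)
        printf ("%d ", v.e[i]);   /* prints: 9 1 2 3 */
      printf ("\n");
      return 0;
    }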
gcc/internal-fn.def
@@ -116,6 +116,9 @@
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
		        vec_mask_store_lanes, mask_store_lanes)

+DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
+		       vec_shl_insert, binary)
+
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)

 DEF_INTERNAL_OPTAB_FN (REDUC_PLUS, ECF_CONST | ECF_NOTHROW,
gcc/optabs.def
@@ -368,3 +368,4 @@
 OPTAB_D (set_thread_pointer_optab, "set_thread_pointer$I$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
+OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
gcc/testsuite/gcc.dg/vect/pr37027.c
@@ -32,5 +32,5 @@ foo (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
gcc/testsuite/gcc.dg/vect/pr67790.c
@@ -37,4 +37,4 @@ int main()
   return 0;
 }

-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail vect_variable_length } } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-1.c
@@ -43,5 +43,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-2.c
@@ -38,5 +38,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-3.c
@@ -58,7 +58,4 @@ int main (void)
 /* The initialization loop in main also gets vectorized.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 "vect" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target { vect_short_mult && { vect_widen_sum_hi_to_si && vect_unpack } } } } } */
-/* We can't yet create the necessary SLP constant vector for variable-length
-   SVE and so fall back to Advanced SIMD.  This means that we repeat each
-   analysis note.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { { ! vect_unpack } || { aarch64_sve && vect_variable_length } } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { ! vect_unpack } } } } } */
gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
@@ -43,5 +43,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail vect_no_int_min_max } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_min_max || vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_5.c (new file)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */

#include <stdint.h>

#define VEC_PERM(TYPE)						\
void __attribute__ ((noinline, noclone))			\
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
{								\
  TYPE x0 = b[0];						\
  TYPE x1 = b[1];						\
  for (int i = 0; i < n; ++i)					\
    {								\
      x0 += a[i * 2];						\
      x1 += a[i * 2 + 1];					\
    }								\
  b[0] = x0;							\
  b[1] = x1;							\
}

#define TEST_ALL(T)				\
  T (int8_t)					\
  T (uint8_t)					\
  T (int16_t)					\
  T (uint16_t)					\
  T (int32_t)					\
  T (uint32_t)					\
  T (int64_t)					\
  T (uint64_t)					\
  T (_Float16)					\
  T (float)					\
  T (double)

TEST_ALL (VEC_PERM)

/* ??? We don't think it's worth using SLP for the 64-bit loops and fall
   back to the less efficient non-SLP implementation instead.  */
/* ??? At present we don't treat the int8_t and int16_t loops as
   reductions.  */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-not {\tld2b\t} } } */
/* { dg-final { scan-assembler-not {\tld2h\t} } } */
/* { dg-final { scan-assembler-not {\tld2w\t} } } */
/* { dg-final { scan-assembler-not {\tld2d\t} { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 2 } } */
/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 2 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_5_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include "slp_5.c"

#define N (141 * 2)

#define HARNESS(TYPE)					\
  {							\
    TYPE a[N], b[2] = { 40, 22 };			\
    for (unsigned int i = 0; i < N; ++i)		\
      {							\
	a[i] = i * 2 + i % 5;				\
	asm volatile ("" ::: "memory");			\
      }							\
    vec_slp_##TYPE (a, b, N / 2);			\
    TYPE x0 = 40;					\
    TYPE x1 = 22;					\
    for (unsigned int i = 0; i < N; i += 2)		\
      {							\
	x0 += a[i];					\
	x1 += a[i + 1];					\
	asm volatile ("" ::: "memory");			\
      }							\
    /* _Float16 isn't precise enough for this.  */	\
    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000		\
	&& (x0 != b[0] || x1 != b[1]))			\
      __builtin_abort ();				\
  }

int __attribute__ ((optimize (1)))
main (void)
{
  TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_6.c (new file)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */

#include <stdint.h>

#define VEC_PERM(TYPE)						\
void __attribute__ ((noinline, noclone))			\
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
{								\
  TYPE x0 = b[0];						\
  TYPE x1 = b[1];						\
  TYPE x2 = b[2];						\
  for (int i = 0; i < n; ++i)					\
    {								\
      x0 += a[i * 3];						\
      x1 += a[i * 3 + 1];					\
      x2 += a[i * 3 + 2];					\
    }								\
  b[0] = x0;							\
  b[1] = x1;							\
  b[2] = x2;							\
}

#define TEST_ALL(T)				\
  T (int8_t)					\
  T (uint8_t)					\
  T (int16_t)					\
  T (uint16_t)					\
  T (int32_t)					\
  T (uint32_t)					\
  T (int64_t)					\
  T (uint64_t)					\
  T (_Float16)					\
  T (float)					\
  T (double)

TEST_ALL (VEC_PERM)

/* These loops can't use SLP.  */
/* { dg-final { scan-assembler-not {\tld1b\t} } } */
/* { dg-final { scan-assembler-not {\tld1h\t} } } */
/* { dg-final { scan-assembler-not {\tld1w\t} } } */
/* { dg-final { scan-assembler-not {\tld1d\t} } } */
/* { dg-final { scan-assembler {\tld3b\t} } } */
/* { dg-final { scan-assembler {\tld3h\t} } } */
/* { dg-final { scan-assembler {\tld3w\t} } } */
/* { dg-final { scan-assembler {\tld3d\t} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_6_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include "slp_6.c"

#define N (77 * 3)

#define HARNESS(TYPE)					\
  {							\
    TYPE a[N], b[3] = { 40, 22, 75 };			\
    for (unsigned int i = 0; i < N; ++i)		\
      {							\
	a[i] = i * 2 + i % 5;				\
	asm volatile ("" ::: "memory");			\
      }							\
    vec_slp_##TYPE (a, b, N / 3);			\
    TYPE x0 = 40;					\
    TYPE x1 = 22;					\
    TYPE x2 = 75;					\
    for (unsigned int i = 0; i < N; i += 3)		\
      {							\
	x0 += a[i];					\
	x1 += a[i + 1];					\
	x2 += a[i + 2];					\
	asm volatile ("" ::: "memory");			\
      }							\
    /* _Float16 isn't precise enough for this.  */	\
    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000		\
	&& (x0 != b[0] || x1 != b[1] || x2 != b[2]))	\
      __builtin_abort ();				\
  }

int __attribute__ ((optimize (1)))
main (void)
{
  TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_7.c (new file)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */

#include <stdint.h>

#define VEC_PERM(TYPE)						\
void __attribute__ ((noinline, noclone))			\
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
{								\
  TYPE x0 = b[0];						\
  TYPE x1 = b[1];						\
  TYPE x2 = b[2];						\
  TYPE x3 = b[3];						\
  for (int i = 0; i < n; ++i)					\
    {								\
      x0 += a[i * 4];						\
      x1 += a[i * 4 + 1];					\
      x2 += a[i * 4 + 2];					\
      x3 += a[i * 4 + 3];					\
    }								\
  b[0] = x0;							\
  b[1] = x1;							\
  b[2] = x2;							\
  b[3] = x3;							\
}

#define TEST_ALL(T)				\
  T (int8_t)					\
  T (uint8_t)					\
  T (int16_t)					\
  T (uint16_t)					\
  T (int32_t)					\
  T (uint32_t)					\
  T (int64_t)					\
  T (uint64_t)					\
  T (_Float16)					\
  T (float)					\
  T (double)

TEST_ALL (VEC_PERM)

/* We can't use SLP for the 64-bit loops, since the number of reduction
   results might be greater than the number of elements in the vector.
   Otherwise we have two loads per loop, one for the initial vector
   and one for the loop body.  */
/* ??? At present we don't treat the int8_t and int16_t loops as
   reductions.  */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
/* { dg-final { scan-assembler-not {\tld4b\t} } } */
/* { dg-final { scan-assembler-not {\tld4h\t} } } */
/* { dg-final { scan-assembler-not {\tld4w\t} } } */
/* { dg-final { scan-assembler-not {\tld1d\t} } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 8 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 8 { xfail *-*-* } } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 8 } } */
/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */

#include "slp_7.c"

#define N (54 * 4)

#define HARNESS(TYPE)						\
  {								\
    TYPE a[N], b[4] = { 40, 22, 75, 19 };			\
    for (unsigned int i = 0; i < N; ++i)			\
      {								\
	a[i] = i * 2 + i % 5;					\
	asm volatile ("" ::: "memory");				\
      }								\
    vec_slp_##TYPE (a, b, N / 4);				\
    TYPE x0 = 40;						\
    TYPE x1 = 22;						\
    TYPE x2 = 75;						\
    TYPE x3 = 19;						\
    for (unsigned int i = 0; i < N; i += 4)			\
      {								\
	x0 += a[i];						\
	x1 += a[i + 1];						\
	x2 += a[i + 2];						\
	x3 += a[i + 3];						\
	asm volatile ("" ::: "memory");				\
      }								\
    /* _Float16 isn't precise enough for this.  */		\
    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000			\
	&& (x0 != b[0] || x1 != b[1] || x2 != b[2] || x3 != b[3]))	\
      __builtin_abort ();					\
  }

int __attribute__ ((optimize (1)))
main (void)
{
  TEST_ALL (HARNESS)
}
gcc/tree-vect-slp.c
@@ -216,11 +216,11 @@ vect_get_place_in_interleaving_chain (gimple *stmt, gimple *first_stmt)
    (if nonnull) and the type of each intermediate vector in *VECTOR_TYPE_OUT
    (if nonnull).  */

-static bool
+bool
 can_duplicate_and_interleave_p (unsigned int count, machine_mode elt_mode,
-				unsigned int *nvectors_out = NULL,
-				tree *vector_type_out = NULL,
-				tree *permutes = NULL)
+				unsigned int *nvectors_out,
+				tree *vector_type_out,
+				tree *permutes)
 {
   poly_int64 elt_bytes = count * GET_MODE_SIZE (elt_mode);
   poly_int64 nelts;
@@ -3309,7 +3309,7 @@ vect_mask_constant_operand_p (gimple *stmt, int opnum)
    We try to find the largest IM for which this sequence works, in order
    to cut down on the number of interleaves.  */

-static void
+void
 duplicate_and_interleave (gimple_seq *seq, tree vector_type, vec<tree> elts,
			   unsigned int nresults, vec<tree> &results)
 {
gcc/tree-vectorizer.h
@@ -1352,6 +1352,11 @@
 extern void vect_get_slp_defs (vec<tree> , slp_tree, vec<vec<tree> > *);
 extern bool vect_slp_bb (basic_block);
 extern gimple *vect_find_last_scalar_stmt_in_slp (slp_tree);
 extern bool is_simple_and_all_uses_invariant (gimple *, loop_vec_info);
+extern bool can_duplicate_and_interleave_p (unsigned int, machine_mode,
+					    unsigned int * = NULL,
+					    tree * = NULL, tree * = NULL);
+extern void duplicate_and_interleave (gimple_seq *, tree, vec<tree>,
+				      unsigned int, vec<tree> &);

 /* In tree-vect-patterns.c.  */
 /* Pattern recognition functions.