Commit 7cfb4d93 by Richard Sandiford, committed by Richard Sandiford

Add support for fully-predicated loops

This patch adds support for using a single fully-predicated loop instead
of a vector loop and a scalar tail.  An SVE WHILELO instruction generates
the predicate for each iteration of the loop, given the current scalar
iv value and the loop bound.  This operation is wrapped up in a new internal
function called WHILE_ULT.  E.g.:

   WHILE_ULT (0, 3, { 0, 0, 0, 0 }) -> { 1, 1, 1, 0 }
   WHILE_ULT (UINT_MAX - 1, UINT_MAX, { 0, 0, 0, 0 }) -> { 1, 0, 0, 0 }

The third WHILE_ULT argument is needed to make the operation
unambiguous: without it, WHILE_ULT (0, 3) for one vector type would
seem equivalent to WHILE_ULT (0, 3) for another, even if the types have
different numbers of elements.
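
As a rough editorial illustration (not part of the patch), the semantics can
be modelled in plain C for a 4-element mask; "emulate_while_ult" is a made-up
name used only for this sketch:

   /* Scalar model of WHILE_ULT for a 4-element mask.  */
   #include <limits.h>
   #include <stdio.h>

   static void
   emulate_while_ult (unsigned int base, unsigned int limit, int mask[4])
   {
     for (int i = 0; i < 4; ++i)
       /* Treat a wrapped BASE + I as out of range, so the comparison
	  behaves as if it were done in infinite precision.  */
       mask[i] = (base + i >= base) && (base + i < limit);
   }

   int
   main (void)
   {
     int mask[4];
     emulate_while_ult (0, 3, mask);
     printf ("%d %d %d %d\n", mask[0], mask[1], mask[2], mask[3]); /* 1 1 1 0 */
     emulate_while_ult (UINT_MAX - 1, UINT_MAX, mask);
     printf ("%d %d %d %d\n", mask[0], mask[1], mask[2], mask[3]); /* 1 0 0 0 */
     return 0;
   }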

Note that the patch uses "mask" and "fully-masked" instead of
"predicate" and "fully-predicated", to follow existing GCC terminology.

This patch just handles the simple cases, punting for things like
reductions and live-out values.  Later patches remove most of these
restrictions.

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (while_ult_optab): New optab.
	* doc/md.texi (while_ult@var{m}@var{n}): Document.
	* internal-fn.def (WHILE_ULT): New internal function.
	* internal-fn.h (direct_internal_fn_supported_p): New override
	that takes two types as arguments.
	* internal-fn.c (while_direct): New macro.
	(expand_while_optab_fn): New function.
	(convert_optab_supported_p): Likewise.
	(direct_while_optab_supported_p): New macro.
	* wide-int.h (wi::udiv_ceil): New function.
	* tree-vectorizer.h (rgroup_masks): New structure.
	(vec_loop_masks): New typedef.
	(_loop_vec_info): Add masks, mask_compare_type, can_fully_mask_p
	and fully_masked_p.
	(LOOP_VINFO_CAN_FULLY_MASK_P, LOOP_VINFO_FULLY_MASKED_P)
	(LOOP_VINFO_MASKS, LOOP_VINFO_MASK_COMPARE_TYPE): New macros.
	(vect_max_vf): New function.
	(slpeel_make_loop_iterate_ntimes): Delete.
	(vect_set_loop_condition, vect_get_loop_mask_type, vect_gen_while)
	(vect_halve_mask_nunits, vect_double_mask_nunits): Declare.
	(vect_record_loop_mask, vect_get_loop_mask): Likewise.
	* tree-vect-loop-manip.c: Include tree-ssa-loop-niter.h,
	internal-fn.h, stor-layout.h and optabs-query.h.
	(vect_set_loop_mask): New function.
	(add_preheader_seq): Likewise.
	(add_header_seq): Likewise.
	(interleave_supported_p): Likewise.
	(vect_maybe_permute_loop_masks): Likewise.
	(vect_set_loop_masks_directly): Likewise.
	(vect_set_loop_condition_masked): Likewise.
	(vect_set_loop_condition_unmasked): New function, split out from
	slpeel_make_loop_iterate_ntimes.
	(slpeel_make_loop_iterate_ntimes): Rename to...
	(vect_set_loop_condition): ...this.  Use vect_set_loop_condition_masked
	for fully-masked loops and vect_set_loop_condition_unmasked otherwise.
	(vect_do_peeling): Update call accordingly.
	(vect_gen_vector_loop_niters): Use VF as the step for fully-masked
	loops.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
	mask_compare_type, can_fully_mask_p and fully_masked_p.
	(release_vec_loop_masks): New function.
	(_loop_vec_info): Use it to free the loop masks.
	(can_produce_all_loop_masks_p): New function.
	(vect_get_max_nscalars_per_iter): Likewise.
	(vect_verify_full_masking): Likewise.
	(vect_analyze_loop_2): Save LOOP_VINFO_CAN_FULLY_MASK_P around
	retries, and free the mask rgroups before retrying.  Check loop-wide
	reasons for disallowing fully-masked loops.  Make the final decision
	about whether to use a fully-masked loop or not.
	(vect_estimate_min_profitable_iters): Do not assume that peeling
	for the number of iterations will be needed for fully-masked loops.
	(vectorizable_reduction): Disable fully-masked loops.
	(vectorizable_live_operation): Likewise.
	(vect_halve_mask_nunits): New function.
	(vect_double_mask_nunits): Likewise.
	(vect_record_loop_mask): Likewise.
	(vect_get_loop_mask): Likewise.
	(vect_transform_loop): Handle the case in which the final loop
	iteration might handle a partial vector.  Call vect_set_loop_condition
	instead of slpeel_make_loop_iterate_ntimes.
	* tree-vect-stmts.c: Include tree-ssa-loop-niter.h and gimple-fold.h.
	(check_load_store_masking): New function.
	(prepare_load_store_mask): Likewise.
	(vectorizable_store): Handle fully-masked loops.
	(vectorizable_load): Likewise.
	(supportable_widening_operation): Use vect_halve_mask_nunits for
	booleans.
	(supportable_narrowing_operation): Likewise, using vect_double_mask_nunits.
	(vect_gen_while): New function.
	* config/aarch64/aarch64.md (umax<mode>3): New expander.
	(aarch64_uqdec<mode>): New insn.

gcc/testsuite/
	* gcc.dg/tree-ssa/cunroll-10.c: Disable vectorization.
	* gcc.dg/tree-ssa/peel1.c: Likewise.
	* gcc.dg/vect/vect-load-lanes-peeling-1.c: Remove XFAIL for
	variable-length vectors.
	* gcc.target/aarch64/sve/vcond_6.c: XFAIL test for AND.
	* gcc.target/aarch64/sve/vec_bool_cmp_1.c: Expect BIC instead of NOT.
	* gcc.target/aarch64/sve/slp_1.c: Check for a fully-masked loop.
	* gcc.target/aarch64/sve/slp_2.c: Likewise.
	* gcc.target/aarch64/sve/slp_3.c: Likewise.
	* gcc.target/aarch64/sve/slp_4.c: Likewise.
	* gcc.target/aarch64/sve/slp_6.c: Likewise.
	* gcc.target/aarch64/sve/slp_8.c: New test.
	* gcc.target/aarch64/sve/slp_8_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_9.c: Likewise.
	* gcc.target/aarch64/sve/slp_9_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_10.c: Likewise.
	* gcc.target/aarch64/sve/slp_10_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_11.c: Likewise.
	* gcc.target/aarch64/sve/slp_11_run.c: Likewise.
	* gcc.target/aarch64/sve/slp_12.c: Likewise.
	* gcc.target/aarch64/sve/slp_12_run.c: Likewise.
	* gcc.target/aarch64/sve/ld1r_2.c: Likewise.
	* gcc.target/aarch64/sve/ld1r_2_run.c: Likewise.
	* gcc.target/aarch64/sve/while_1.c: Likewise.
	* gcc.target/aarch64/sve/while_2.c: Likewise.
	* gcc.target/aarch64/sve/while_3.c: Likewise.
	* gcc.target/aarch64/sve/while_4.c: Likewise.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256625
gcc/config/aarch64/aarch64.md
@@ -3496,6 +3496,63 @@
[(set_attr "type" "csel")]
)
;; If X can be loaded by a single CNT[BHWD] instruction,
;;
;; A = UMAX (B, X)
;;
;; is equivalent to:
;;
;; TMP = UQDEC[BHWD] (B, X)
;; A = TMP + X
;;
;; Defining the pattern this way means that:
;;
;; A = UMAX (B, X) - X
;;
;; becomes:
;;
;; TMP1 = UQDEC[BHWD] (B, X)
;; TMP2 = TMP1 + X
;; A = TMP2 - X
;;
;; which combine can optimize to:
;;
;; A = UQDEC[BHWD] (B, X)
;;
;; We don't use match_operand predicates because the order of the operands
;; can vary: the CNT[BHWD] constant will come first if the other operand is
;; a simpler constant (such as a CONST_INT), otherwise it will come second.
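;;
;; Editorial illustration (not part of the patch): assume 128-bit SVE, so
;; that CNTW == 4.  Then the identity above works out as:
;;
;;   B == 7:  UQDECW gives max (7, 4) - 4 == 3, and 3 + 4 == 7 == UMAX (7, 4)
;;   B == 3:  UQDECW gives max (3, 4) - 4 == 0, and 0 + 4 == 4 == UMAX (3, 4)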
(define_expand "umax<mode>3"
[(set (match_operand:GPI 0 "register_operand")
(umax:GPI (match_operand:GPI 1 "")
(match_operand:GPI 2 "")))]
"TARGET_SVE"
{
if (aarch64_sve_cnt_immediate (operands[1], <MODE>mode))
std::swap (operands[1], operands[2]);
else if (!aarch64_sve_cnt_immediate (operands[2], <MODE>mode))
FAIL;
rtx temp = gen_reg_rtx (<MODE>mode);
operands[1] = force_reg (<MODE>mode, operands[1]);
emit_insn (gen_aarch64_uqdec<mode> (temp, operands[1], operands[2]));
emit_insn (gen_add<mode>3 (operands[0], temp, operands[2]));
DONE;
}
)
;; Saturating unsigned subtraction of a CNT[BHWD] immediate.
(define_insn "aarch64_uqdec<mode>"
[(set (match_operand:GPI 0 "register_operand" "=r")
(minus:GPI
(umax:GPI (match_operand:GPI 1 "register_operand" "0")
(match_operand:GPI 2 "aarch64_sve_cnt_immediate" "Usv"))
(match_dup 2)))]
"TARGET_SVE"
{
return aarch64_output_sve_cnt_immediate ("uqdec", "%<w>0", operands[2]);
}
)
;; -------------------------------------------------------------------
;; Logical operations
;; -------------------------------------------------------------------
gcc/doc/md.texi
@@ -4954,6 +4954,19 @@ rounding behavior for @var{i} > 1.
This pattern is not allowed to @code{FAIL}.
@cindex @code{while_ult@var{m}@var{n}} instruction pattern
@item @code{while_ult@var{m}@var{n}}
Set operand 0 to a mask that is true while incrementing operand 1
gives a value that is less than operand 2. Operand 0 has mode @var{n}
and operands 1 and 2 are scalar integers of mode @var{m}.
The operation is equivalent to:
@smallexample
operand0[0] = operand1 < operand2;
for (i = 1; i < GET_MODE_NUNITS (@var{n}); i++)
operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
@end smallexample
@cindex @code{vec_cmp@var{m}@var{n}} instruction pattern
@item @samp{vec_cmp@var{m}@var{n}}
Output a vector comparison. Operand 0 of mode @var{n} is the destination for
gcc/internal-fn.c
@@ -88,6 +88,7 @@ init_internal_fns ()
#define mask_store_lanes_direct { 0, 0, false }
#define unary_direct { 0, 0, true }
#define binary_direct { 0, 0, true }
#define while_direct { 0, 2, false }
const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
#define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2817,6 +2818,35 @@ expand_direct_optab_fn (internal_fn fn, gcall *stmt, direct_optab optab,
}
}
/* Expand WHILE_ULT call STMT using optab OPTAB. */
static void
expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
{
expand_operand ops[3];
tree rhs_type[2];
tree lhs = gimple_call_lhs (stmt);
tree lhs_type = TREE_TYPE (lhs);
rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
for (unsigned int i = 0; i < 2; ++i)
{
tree rhs = gimple_call_arg (stmt, i);
rhs_type[i] = TREE_TYPE (rhs);
rtx rhs_rtx = expand_normal (rhs);
create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type[i]));
}
insn_code icode = convert_optab_handler (optab, TYPE_MODE (rhs_type[0]),
TYPE_MODE (lhs_type));
expand_insn (icode, 3, ops);
if (!rtx_equal_p (lhs_rtx, ops[0].value))
emit_move_insn (lhs_rtx, ops[0].value);
}
/* Expanders for optabs that can use expand_direct_optab_fn. */
#define expand_unary_optab_fn(FN, STMT, OPTAB) \
@@ -2869,6 +2899,19 @@ direct_optab_supported_p (direct_optab optab, tree_pair types,
return direct_optab_handler (optab, mode, opt_type) != CODE_FOR_nothing;
}
/* Return true if OPTAB is supported for TYPES, where the first type
is the destination and the second type is the source. Used for
convert optabs. */
static bool
convert_optab_supported_p (convert_optab optab, tree_pair types,
optimization_type opt_type)
{
return (convert_optab_handler (optab, TYPE_MODE (types.first),
TYPE_MODE (types.second), opt_type)
!= CODE_FOR_nothing);
}
/* Return true if load/store lanes optab OPTAB is supported for
array type TYPES.first when the optimization type is OPT_TYPE. */
@@ -2891,6 +2934,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
#define direct_mask_store_optab_supported_p direct_optab_supported_p
#define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
#define direct_while_optab_supported_p convert_optab_supported_p
/* Return the optab used by internal function FN. */
gcc/internal-fn.def
@@ -116,6 +116,8 @@ DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
vec_mask_store_lanes, mask_store_lanes)
DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
vec_shl_insert, binary)
gcc/internal-fn.h
@@ -174,6 +174,20 @@ extern bool direct_internal_fn_supported_p (internal_fn, tree_pair,
optimization_type);
extern bool direct_internal_fn_supported_p (internal_fn, tree,
optimization_type);
/* Return true if FN is supported for types TYPE0 and TYPE1 when the
optimization type is OPT_TYPE. The types are those associated with
the "type0" and "type1" fields of FN's direct_internal_fn_info
structure. */
inline bool
direct_internal_fn_supported_p (internal_fn fn, tree type0, tree type1,
optimization_type opt_type)
{
return direct_internal_fn_supported_p (fn, tree_pair (type0, type1),
opt_type);
}
extern bool set_edom_supported_p (void);
extern void expand_internal_call (gcall *);
gcc/optabs.def
@@ -94,6 +94,8 @@ OPTAB_CD(maskstore_optab, "maskstore$a$b")
OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
OPTAB_CD(vec_init_optab, "vec_init$a$b")
OPTAB_CD (while_ult_optab, "while_ult$a$b")
OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
OPTAB_NX(add_optab, "add$F$a3")
OPTAB_NX(add_optab, "add$Q$a3")
gcc/testsuite/gcc.dg/tree-ssa/cunroll-10.c
/* { dg-do compile } */
-/* { dg-options "-O3 -Warray-bounds -fdump-tree-cunroll-details" } */
+/* { dg-options "-O3 -Warray-bounds -fno-tree-vectorize -fdump-tree-cunroll-details" } */
int a[3];
int b[4];
int
gcc/testsuite/gcc.dg/tree-ssa/peel1.c
/* { dg-do compile } */
-/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+/* { dg-options "-O3 -fno-tree-vectorize -fdump-tree-cunroll-details" } */
struct foo {int b; int a[3];} foo;
void add(struct foo *a,int l)
{
gcc/testsuite/gcc.dg/vect/vect-load-lanes-peeling-1.c
@@ -10,4 +10,4 @@ f (int *__restrict a, int *__restrict b)
}
/* { dg-final { scan-tree-dump-not "Data access with gaps" "vect" } } */
/* { dg-final { scan-tree-dump-not "epilog loop required" "vect" { xfail vect_variable_length } } } */
/* { dg-final { scan-tree-dump-not "epilog loop required" "vect" } } */
gcc/testsuite/gcc.target/aarch64/sve/ld1r_2.c (new test)
/* { dg-do compile } */
/* { dg-options "-O3 -fno-tree-loop-distribute-patterns" } */
#include <stdint.h>
#define NUM_ELEMS(TYPE) (1024 / sizeof (TYPE))
#define DEF_LOAD_BROADCAST(TYPE) \
void __attribute__ ((noinline, noclone)) \
set_##TYPE (TYPE *restrict a, TYPE *restrict b) \
{ \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
a[i] = *b; \
}
#define DEF_LOAD_BROADCAST_IMM(TYPE, IMM, SUFFIX) \
void __attribute__ ((noinline, noclone)) \
set_##TYPE##_##SUFFIX (TYPE *a) \
{ \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
a[i] = IMM; \
}
#define FOR_EACH_LOAD_BROADCAST(T) \
T (int8_t) \
T (int16_t) \
T (int32_t) \
T (int64_t)
#define FOR_EACH_LOAD_BROADCAST_IMM(T) \
T (int16_t, 129, imm_129) \
T (int32_t, 129, imm_129) \
T (int64_t, 129, imm_129) \
\
T (int16_t, -130, imm_m130) \
T (int32_t, -130, imm_m130) \
T (int64_t, -130, imm_m130) \
\
T (int16_t, 0x1234, imm_0x1234) \
T (int32_t, 0x1234, imm_0x1234) \
T (int64_t, 0x1234, imm_0x1234) \
\
T (int16_t, 0xFEDC, imm_0xFEDC) \
T (int32_t, 0xFEDC, imm_0xFEDC) \
T (int64_t, 0xFEDC, imm_0xFEDC) \
\
T (int32_t, 0x12345678, imm_0x12345678) \
T (int64_t, 0x12345678, imm_0x12345678) \
\
T (int32_t, 0xF2345678, imm_0xF2345678) \
T (int64_t, 0xF2345678, imm_0xF2345678) \
\
T (int64_t, (int64_t) 0xFEBA716B12371765, imm_FEBA716B12371765)
FOR_EACH_LOAD_BROADCAST (DEF_LOAD_BROADCAST)
FOR_EACH_LOAD_BROADCAST_IMM (DEF_LOAD_BROADCAST_IMM)
/* { dg-final { scan-assembler-times {\tld1rb\tz[0-9]+\.b, p[0-7]/z, } 1 } } */
/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, p[0-7]/z, } 5 } } */
/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, p[0-7]/z, } 7 } } */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, p[0-7]/z, } 8 } } */
gcc/testsuite/gcc.target/aarch64/sve/ld1r_2_run.c (new test)
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O3 -fno-tree-loop-distribute-patterns" } */
#include "ld1r_2.c"
#define TEST_LOAD_BROADCAST(TYPE) \
{ \
TYPE v[NUM_ELEMS (TYPE)]; \
TYPE val = 99; \
set_##TYPE (v, &val); \
for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
{ \
if (v[i] != (TYPE) 99) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
#define TEST_LOAD_BROADCAST_IMM(TYPE, IMM, SUFFIX) \
{ \
TYPE v[NUM_ELEMS (TYPE)]; \
set_##TYPE##_##SUFFIX (v); \
for (int i = 0; i < NUM_ELEMS (TYPE); i++ ) \
{ \
if (v[i] != (TYPE) IMM) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
int __attribute__ ((optimize (1)))
main (int argc, char **argv)
{
FOR_EACH_LOAD_BROADCAST (TEST_LOAD_BROADCAST)
FOR_EACH_LOAD_BROADCAST_IMM (TEST_LOAD_BROADCAST_IMM)
return 0;
}
gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
@@ -38,3 +38,22 @@ TEST_ALL (VEC_PERM)
/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
/* { dg-final { scan-assembler-not {\tzip2\t} } } */
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-times {\tstr} 2 } } */
/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_10.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
#include <stdint.h>
#define VEC_PERM(TYPE) \
void __attribute__ ((noinline, noclone)) \
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n) \
{ \
for (int i = 0; i < n; ++i) \
{ \
a[i] += 1; \
b[i * 4] += 2; \
b[i * 4 + 1] += 3; \
b[i * 4 + 2] += 4; \
b[i * 4 + 3] += 5; \
} \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (VEC_PERM)
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 10 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 10 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 10 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 10 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 15 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 15 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 15 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 15 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* We should use WHILEs for all accesses. */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 20 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 20 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 30 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 30 } } */
/* 6 for the 8-bit types and 2 for the 16-bit types. */
/* { dg-final { scan-assembler-times {\tuqdecb\t} 8 } } */
/* 4 for the 16-bit types and 3 for the 32-bit types. */
/* { dg-final { scan-assembler-times {\tuqdech\t} 7 } } */
/* 6 for the 32-bit types and 3 for the 64-bit types. */
/* { dg-final { scan-assembler-times {\tuqdecw\t} 9 } } */
/* { dg-final { scan-assembler-times {\tuqdecd\t} 6 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_10_run.c (new test)
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "slp_10.c"
#define N1 (103 * 2)
#define N2 (111 * 2)
#define HARNESS(TYPE) \
{ \
TYPE a[N2], b[N2 * 4]; \
for (unsigned int i = 0; i < N2; ++i) \
{ \
a[i] = i * 2 + i % 5; \
b[i * 4] = i * 3 + i % 7; \
b[i * 4 + 1] = i * 5 + i % 9; \
b[i * 4 + 2] = i * 7 + i % 11; \
b[i * 4 + 3] = i * 9 + i % 13; \
} \
vec_slp_##TYPE (a, b, N1); \
for (unsigned int i = 0; i < N2; ++i) \
{ \
TYPE orig_a = i * 2 + i % 5; \
TYPE orig_b1 = i * 3 + i % 7; \
TYPE orig_b2 = i * 5 + i % 9; \
TYPE orig_b3 = i * 7 + i % 11; \
TYPE orig_b4 = i * 9 + i % 13; \
TYPE expected_a = orig_a; \
TYPE expected_b1 = orig_b1; \
TYPE expected_b2 = orig_b2; \
TYPE expected_b3 = orig_b3; \
TYPE expected_b4 = orig_b4; \
if (i < N1) \
{ \
expected_a += 1; \
expected_b1 += 2; \
expected_b2 += 3; \
expected_b3 += 4; \
expected_b4 += 5; \
} \
if (a[i] != expected_a \
|| b[i * 4] != expected_b1 \
|| b[i * 4 + 1] != expected_b2 \
|| b[i * 4 + 2] != expected_b3 \
|| b[i * 4 + 3] != expected_b4) \
__builtin_abort (); \
} \
}
int __attribute__ ((optimize (1)))
main (void)
{
TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_11.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
#include <stdint.h>
#define VEC_PERM(TYPE1, TYPE2) \
void __attribute__ ((noinline, noclone)) \
vec_slp_##TYPE1##_##TYPE2 (TYPE1 *restrict a, \
TYPE2 *restrict b, int n) \
{ \
for (int i = 0; i < n; ++i) \
{ \
a[i * 2] += 1; \
a[i * 2 + 1] += 2; \
b[i * 4] += 3; \
b[i * 4 + 1] += 4; \
b[i * 4 + 2] += 5; \
b[i * 4 + 3] += 6; \
} \
}
#define TEST_ALL(T) \
T (int16_t, uint8_t) \
T (uint16_t, int8_t) \
T (int32_t, uint16_t) \
T (uint32_t, int16_t) \
T (float, uint16_t) \
T (int64_t, float) \
T (uint64_t, int32_t) \
T (double, uint32_t)
TEST_ALL (VEC_PERM)
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 5 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 5 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 6 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 6 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 3 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* We should use the same WHILEs for both accesses. */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-not {\twhilelo\tp[0-7]\.d} } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_11_run.c (new test)
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "slp_11.c"
#define N1 (103 * 2)
#define N2 (111 * 2)
#define HARNESS(TYPE1, TYPE2) \
{ \
TYPE1 a[N2]; \
TYPE2 b[N2 * 2]; \
for (unsigned int i = 0; i < N2; ++i) \
{ \
a[i] = i * 2 + i % 5; \
b[i * 2] = i * 3 + i % 7; \
b[i * 2 + 1] = i * 5 + i % 9; \
} \
vec_slp_##TYPE1##_##TYPE2 (a, b, N1 / 2); \
for (unsigned int i = 0; i < N2; ++i) \
{ \
TYPE1 orig_a = i * 2 + i % 5; \
TYPE2 orig_b1 = i * 3 + i % 7; \
TYPE2 orig_b2 = i * 5 + i % 9; \
TYPE1 expected_a = orig_a; \
TYPE2 expected_b1 = orig_b1; \
TYPE2 expected_b2 = orig_b2; \
if (i < N1) \
{ \
expected_a += i & 1 ? 2 : 1; \
expected_b1 += i & 1 ? 5 : 3; \
expected_b2 += i & 1 ? 6 : 4; \
} \
if (a[i] != expected_a \
|| b[i * 2] != expected_b1 \
|| b[i * 2 + 1] != expected_b2) \
__builtin_abort (); \
} \
}
int __attribute__ ((optimize (1)))
main (void)
{
TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_12.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
#include <stdint.h>
#define N1 (19 * 2)
#define VEC_PERM(TYPE) \
void __attribute__ ((noinline, noclone)) \
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b) \
{ \
for (int i = 0; i < N1; ++i) \
{ \
a[i] += 1; \
b[i * 4] += 2; \
b[i * 4 + 1] += 3; \
b[i * 4 + 2] += 4; \
b[i * 4 + 3] += 5; \
} \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (VEC_PERM)
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 10 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 10 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 10 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 10 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 15 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 15 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 15 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 15 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* We should use WHILEs for all accesses. */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 20 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 20 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 30 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 30 } } */
/* 6 for the 8-bit types and 2 for the 16-bit types. */
/* { dg-final { scan-assembler-times {\tuqdecb\t} 8 } } */
/* 4 for the 16-bit types and 3 for the 32-bit types. */
/* { dg-final { scan-assembler-times {\tuqdech\t} 7 } } */
/* 6 for the 32-bit types and 3 for the 64-bit types. */
/* { dg-final { scan-assembler-times {\tuqdecw\t} 9 } } */
/* { dg-final { scan-assembler-times {\tuqdecd\t} 6 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_12_run.c (new test)
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "slp_12.c"
#define N2 (31 * 2)
#define HARNESS(TYPE) \
{ \
TYPE a[N2], b[N2 * 4]; \
for (unsigned int i = 0; i < N2; ++i) \
{ \
a[i] = i * 2 + i % 5; \
b[i * 4] = i * 3 + i % 7; \
b[i * 4 + 1] = i * 5 + i % 9; \
b[i * 4 + 2] = i * 7 + i % 11; \
b[i * 4 + 3] = i * 9 + i % 13; \
} \
vec_slp_##TYPE (a, b); \
for (unsigned int i = 0; i < N2; ++i) \
{ \
TYPE orig_a = i * 2 + i % 5; \
TYPE orig_b1 = i * 3 + i % 7; \
TYPE orig_b2 = i * 5 + i % 9; \
TYPE orig_b3 = i * 7 + i % 11; \
TYPE orig_b4 = i * 9 + i % 13; \
TYPE expected_a = orig_a; \
TYPE expected_b1 = orig_b1; \
TYPE expected_b2 = orig_b2; \
TYPE expected_b3 = orig_b3; \
TYPE expected_b4 = orig_b4; \
if (i < N1) \
{ \
expected_a += 1; \
expected_b1 += 2; \
expected_b2 += 3; \
expected_b3 += 4; \
expected_b4 += 5; \
} \
if (a[i] != expected_a \
|| b[i * 4] != expected_b1 \
|| b[i * 4 + 1] != expected_b2 \
|| b[i * 4 + 2] != expected_b3 \
|| b[i * 4 + 3] != expected_b4) \
__builtin_abort (); \
} \
}
int __attribute__ ((optimize (1)))
main (void)
{
TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_2.c
@@ -35,3 +35,21 @@ TEST_ALL (VEC_PERM)
/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 3 } } */
/* { dg-final { scan-assembler-not {\tzip1\t} } } */
/* { dg-final { scan-assembler-not {\tzip2\t} } } */
/* The loop should be fully-masked. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
@@ -47,3 +47,23 @@ TEST_ALL (VEC_PERM)
ZIP1 ZIP2. */
/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
/* The loop should be fully-masked. The 64-bit types need two loads
and stores each. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 6 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 12 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* { dg-final { scan-assembler-not {\tuqdec[bhw]\t} } } */
/* { dg-final { scan-assembler-times {\tuqdecd\t} 3 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
@@ -59,3 +59,25 @@ TEST_ALL (VEC_PERM)
ZIP1 ZIP2 ZIP1 ZIP2. */
/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
/* The loop should be fully-masked. The 32-bit types need two loads
and stores each and the 64-bit types need four. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 6 } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 6 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 12 } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 12 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 12 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 24 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* { dg-final { scan-assembler-not {\tuqdec[bh]\t} } } */
/* We use UQDECW instead of UQDECD ..., MUL #2. */
/* { dg-final { scan-assembler-times {\tuqdecw\t} 6 } } */
/* { dg-final { scan-assembler-times {\tuqdecd\t} 6 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
@@ -45,3 +45,5 @@ TEST_ALL (VEC_PERM)
/* { dg-final { scan-assembler {\tld3h\t} } } */
/* { dg-final { scan-assembler {\tld3w\t} } } */
/* { dg-final { scan-assembler {\tld3d\t} } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_8.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define VEC_PERM(TYPE) \
void __attribute__ ((noinline, noclone)) \
vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n) \
{ \
for (int i = 0; i < n; ++i) \
{ \
a[i * 2] += 1; \
a[i * 2 + 1] += 2; \
b[i * 4] += 3; \
b[i * 4 + 1] += 4; \
b[i * 4 + 2] += 5; \
b[i * 4 + 3] += 6; \
} \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (VEC_PERM)
/* The loop should be fully-masked. The load XFAILs for fixed-length
SVE account for extra loads from the constant pool. */
/* { dg-final { scan-assembler-times {\tld1b\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1b\t} 6 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 6 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 9 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 9 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 9 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 9 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* We should use WHILEs for the accesses to "a" and ZIPs for the accesses
to "b". */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.b} 2 } } */
/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.s} 3 } } */
/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.d} 3 } } */
/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.b} 2 } } */
/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.h} 2 } } */
/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.s} 3 } } */
/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.d} 3 } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_8_run.c (new test)
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "slp_8.c"
#define N1 (103 * 2)
#define N2 (111 * 2)
#define HARNESS(TYPE) \
{ \
TYPE a[N2], b[N2 * 2]; \
for (unsigned int i = 0; i < N2; ++i) \
{ \
a[i] = i * 2 + i % 5; \
b[i * 2] = i * 3 + i % 7; \
b[i * 2 + 1] = i * 5 + i % 9; \
} \
vec_slp_##TYPE (a, b, N1 / 2); \
for (unsigned int i = 0; i < N2; ++i) \
{ \
TYPE orig_a = i * 2 + i % 5; \
TYPE orig_b1 = i * 3 + i % 7; \
TYPE orig_b2 = i * 5 + i % 9; \
TYPE expected_a = orig_a; \
TYPE expected_b1 = orig_b1; \
TYPE expected_b2 = orig_b2; \
if (i < N1) \
{ \
expected_a += i & 1 ? 2 : 1; \
expected_b1 += i & 1 ? 5 : 3; \
expected_b2 += i & 1 ? 6 : 4; \
} \
if (a[i] != expected_a \
|| b[i * 2] != expected_b1 \
|| b[i * 2 + 1] != expected_b2) \
__builtin_abort (); \
} \
}
int __attribute__ ((optimize (1)))
main (void)
{
TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/slp_9.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define VEC_PERM(TYPE1, TYPE2) \
void __attribute__ ((noinline, noclone)) \
vec_slp_##TYPE1##_##TYPE2 (TYPE1 *restrict a, \
TYPE2 *restrict b, int n) \
{ \
for (int i = 0; i < n; ++i) \
{ \
a[i * 2] += 1; \
a[i * 2 + 1] += 2; \
b[i * 2] += 3; \
b[i * 2 + 1] += 4; \
} \
}
#define TEST_ALL(T) \
T (int8_t, uint16_t) \
T (uint8_t, int16_t) \
T (int16_t, uint32_t) \
T (uint16_t, int32_t) \
T (int32_t, double) \
T (uint32_t, int64_t) \
T (float, uint64_t)
TEST_ALL (VEC_PERM)
/* The loop should be fully-masked. The load XFAILs for fixed-length
SVE account for extra loads from the constant pool. */
/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail { aarch64_sve && { ! vect_variable_length } } } } }*/
/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
/* { dg-final { scan-assembler-times {\tld1h\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1h\t} 6 } } */
/* { dg-final { scan-assembler-times {\tld1w\t} 7 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1w\t} 7 } } */
/* { dg-final { scan-assembler-times {\tld1d\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
/* { dg-final { scan-assembler-times {\tst1d\t} 6 } } */
/* { dg-final { scan-assembler-not {\tldr} } } */
/* { dg-final { scan-assembler-not {\tstr} } } */
/* We should use WHILEs for the accesses to "a" and unpacks for the accesses
to "b". */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
/* { dg-final { scan-assembler-not {\twhilelo\tp[0-7]\.d} } } */
/* { dg-final { scan-assembler-times {\tpunpklo\tp[0-7]\.h} 7 } } */
/* { dg-final { scan-assembler-times {\tpunpkhi\tp[0-7]\.h} 7 } } */
/* { dg-final { scan-assembler-not {\tuqdec} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_9_run.c (new test)
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "slp_9.c"
#define N1 (103 * 2)
#define N2 (111 * 2)
#define HARNESS(TYPE1, TYPE2) \
{ \
TYPE1 a[N2]; \
TYPE2 b[N2]; \
for (unsigned int i = 0; i < N2; ++i) \
{ \
a[i] = i * 2 + i % 5; \
b[i] = i * 3 + i % 7; \
} \
vec_slp_##TYPE1##_##TYPE2 (a, b, N1 / 2); \
for (unsigned int i = 0; i < N2; ++i) \
{ \
TYPE1 orig_a = i * 2 + i % 5; \
TYPE2 orig_b = i * 3 + i % 7; \
TYPE1 expected_a = orig_a; \
TYPE2 expected_b = orig_b; \
if (i < N1) \
{ \
expected_a += i & 1 ? 2 : 1; \
expected_b += i & 1 ? 4 : 3; \
} \
if (a[i] != expected_a || b[i] != expected_b) \
__builtin_abort (); \
} \
}
int __attribute__ ((noinline, noclone))
main (void)
{
TEST_ALL (HARNESS)
}
gcc/testsuite/gcc.target/aarch64/sve/vcond_6.c
@@ -40,7 +40,8 @@
TEST_ALL (LOOP)
-/* { dg-final { scan-assembler-times {\tand\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 } } */
+/* Currently we don't manage to remove ANDs from the other loops. */
+/* { dg-final { scan-assembler-times {\tand\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler {\tand\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} } } */
/* { dg-final { scan-assembler-times {\torr\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 } } */
/* { dg-final { scan-assembler-times {\teor\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 } } */
gcc/testsuite/gcc.target/aarch64/sve/vec_bool_cmp_1.c
@@ -36,5 +36,6 @@ TEST_ALL (VEC_BOOL)
/* Both cmpne and cmpeq loops will contain an exclusive predicate or. */
/* { dg-final { scan-assembler-times {\teors?\tp[0-9]*\.b, p[0-7]/z, p[0-9]*\.b, p[0-9]*\.b\n} 12 } } */
-/* cmpeq will also contain a predicate not operation. */
-/* { dg-final { scan-assembler-times {\tnot\tp[0-9]*\.b, p[0-7]/z, p[0-9]*\.b\n} 6 } } */
+/* cmpeq will also contain a masked predicate not operation, which gets
+   folded to BIC. */
+/* { dg-final { scan-assembler-times {\tbic\tp[0-9]+\.b, p[0-7]/z, p[0-9]+\.b, p[0-9]+\.b\n} 6 } } */
gcc/testsuite/gcc.target/aarch64/sve/while_1.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define ADD_LOOP(TYPE) \
void __attribute__ ((noinline, noclone)) \
vec_while_##TYPE (TYPE *restrict a, int n) \
{ \
for (int i = 0; i < n; ++i) \
a[i] += 1; \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (ADD_LOOP)
/* { dg-final { scan-assembler-not {\tuqdec} } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
gcc/testsuite/gcc.target/aarch64/sve/while_2.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define ADD_LOOP(TYPE) \
void __attribute__ ((noinline, noclone)) \
vec_while_##TYPE (TYPE *restrict a, unsigned int n) \
{ \
for (unsigned int i = 0; i < n; ++i) \
a[i] += 1; \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (ADD_LOOP)
/* { dg-final { scan-assembler-not {\tuqdec} } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
gcc/testsuite/gcc.target/aarch64/sve/while_3.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define ADD_LOOP(TYPE) \
TYPE __attribute__ ((noinline, noclone)) \
vec_while_##TYPE (TYPE *restrict a, int64_t n) \
{ \
for (int64_t i = 0; i < n; ++i) \
a[i] += 1; \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (ADD_LOOP)
/* { dg-final { scan-assembler-not {\tuqdec} } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
gcc/testsuite/gcc.target/aarch64/sve/while_4.c (new test)
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable" } */
#include <stdint.h>
#define ADD_LOOP(TYPE) \
TYPE __attribute__ ((noinline, noclone)) \
vec_while_##TYPE (TYPE *restrict a, uint64_t n) \
{ \
for (uint64_t i = 0; i < n; ++i) \
a[i] += 1; \
}
#define TEST_ALL(T) \
T (int8_t) \
T (uint8_t) \
T (int16_t) \
T (uint16_t) \
T (int32_t) \
T (uint32_t) \
T (int64_t) \
T (uint64_t) \
T (float) \
T (double)
TEST_ALL (ADD_LOOP)
/* { dg-final { scan-assembler-times {\tuqdec} 2 } } */
/* { dg-final { scan-assembler-times {\tuqdecb\tx[0-9]+} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
gcc/tree-vect-loop-manip.c
@@ -42,6 +42,11 @@ along with GCC; see the file COPYING3. If not see
#include "tree-vectorizer.h"
#include "tree-ssa-loop-ivopts.h"
#include "gimple-fold.h"
#include "tree-ssa-loop-niter.h"
#include "internal-fn.h"
#include "stor-layout.h"
#include "optabs-query.h"
#include "vec-perm-indices.h"
/*************************************************************************
Simple Loop Peeling Utilities
@@ -248,33 +253,441 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
gimple_bb (update_phi));
}
-/* Make LOOP iterate N == (NITERS - STEP) / STEP + 1 times,
-   where NITERS is known to be outside the range [1, STEP - 1].
-   This is equivalent to making the loop execute NITERS / STEP
-   times when NITERS is nonzero and (1 << M) / STEP times otherwise,
-   where M is the precision of NITERS.
-
-   NITERS_MAYBE_ZERO is true if NITERS can be zero, false it is known
-   to be >= STEP.  In the latter case N is always NITERS / STEP.
-
-   If FINAL_IV is nonnull, it is an SSA name that should be set to
-   N * STEP on exit from the loop.
-
-   Assumption: the exit-condition of LOOP is the last stmt in the loop.  */
-
-void
-slpeel_make_loop_iterate_ntimes (struct loop *loop, tree niters, tree step,
-				 tree final_iv, bool niters_maybe_zero)
+/* Define one loop mask MASK from loop LOOP.  INIT_MASK is the value that
+   the mask should have during the first iteration and NEXT_MASK is the
+   value that it should have on subsequent iterations.  */
+
+static void
+vect_set_loop_mask (struct loop *loop, tree mask, tree init_mask,
+		    tree next_mask)
+{
+  gphi *phi = create_phi_node (mask, loop->header);
+  add_phi_arg (phi, init_mask, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+  add_phi_arg (phi, next_mask, loop_latch_edge (loop), UNKNOWN_LOCATION);
+}
+
+/* Add SEQ to the end of LOOP's preheader block.  */
+
+static void
+add_preheader_seq (struct loop *loop, gimple_seq seq)
+{
+  if (seq)
+    {
+      edge pe = loop_preheader_edge (loop);
+      basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
+      gcc_assert (!new_bb);
+    }
+}
+
+/* Add SEQ to the beginning of LOOP's header block.  */
+
+static void
+add_header_seq (struct loop *loop, gimple_seq seq)
+{
+  if (seq)
+    {
+      gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+}
/* Return true if the target can interleave elements of two vectors.
OFFSET is 0 if the first half of the vectors should be interleaved
or 1 if the second half should. When returning true, store the
associated permutation in INDICES. */
static bool
interleave_supported_p (vec_perm_indices *indices, tree vectype,
unsigned int offset)
{
poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (vectype);
poly_uint64 base = exact_div (nelts, 2) * offset;
vec_perm_builder sel (nelts, 2, 3);
for (unsigned int i = 0; i < 3; ++i)
{
sel.quick_push (base + i);
sel.quick_push (base + i + nelts);
}
indices->new_vector (sel, 2, nelts);
return can_vec_perm_const_p (TYPE_MODE (vectype), *indices);
}
/* Try to use permutes to define the masks in DEST_RGM using the masks
in SRC_RGM, given that the former has twice as many masks as the
latter. Return true on success, adding any new statements to SEQ. */
static bool
vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
rgroup_masks *src_rgm)
{
tree src_masktype = src_rgm->mask_type;
tree dest_masktype = dest_rgm->mask_type;
machine_mode src_mode = TYPE_MODE (src_masktype);
if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
&& optab_handler (vec_unpacku_hi_optab, src_mode) != CODE_FOR_nothing
&& optab_handler (vec_unpacku_lo_optab, src_mode) != CODE_FOR_nothing)
{
/* Unpacking the source masks gives at least as many mask bits as
we need. We can then VIEW_CONVERT any excess bits away. */
tree unpack_masktype = vect_halve_mask_nunits (src_masktype);
for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
{
tree src = src_rgm->masks[i / 2];
tree dest = dest_rgm->masks[i];
tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR
: VEC_UNPACK_LO_EXPR);
gassign *stmt;
if (dest_masktype == unpack_masktype)
stmt = gimple_build_assign (dest, code, src);
else
{
tree temp = make_ssa_name (unpack_masktype);
stmt = gimple_build_assign (temp, code, src);
gimple_seq_add_stmt (seq, stmt);
stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
build1 (VIEW_CONVERT_EXPR,
dest_masktype, temp));
}
gimple_seq_add_stmt (seq, stmt);
}
return true;
}
vec_perm_indices indices[2];
if (dest_masktype == src_masktype
&& interleave_supported_p (&indices[0], src_masktype, 0)
&& interleave_supported_p (&indices[1], src_masktype, 1))
{
/* The destination requires twice as many mask bits as the source, so
we can use interleaving permutes to double up the number of bits. */
tree masks[2];
for (unsigned int i = 0; i < 2; ++i)
masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
{
tree src = src_rgm->masks[i / 2];
tree dest = dest_rgm->masks[i];
gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
src, src, masks[i & 1]);
gimple_seq_add_stmt (seq, stmt);
}
return true;
}
return false;
}
/* Helper for vect_set_loop_condition_masked. Generate definitions for
all the masks in RGM and return a mask that is nonzero when the loop
needs to iterate. Add any new preheader statements to PREHEADER_SEQ.
Use LOOP_COND_GSI to insert code before the exit gcond.
RGM belongs to loop LOOP. The loop originally iterated NITERS
times and has been vectorized according to LOOP_VINFO. Each iteration
of the vectorized loop handles VF iterations of the scalar loop.
It is known that:
NITERS * RGM->max_nscalars_per_iter
does not overflow. However, MIGHT_WRAP_P says whether an induction
variable that starts at 0 and has step:
VF * RGM->max_nscalars_per_iter
might overflow before hitting a value above:
NITERS * RGM->max_nscalars_per_iter
This means that we cannot guarantee that such an induction variable
would ever hit a value that produces a set of all-false masks for RGM. */
static tree
vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
gimple_seq *preheader_seq,
gimple_stmt_iterator loop_cond_gsi,
rgroup_masks *rgm, tree vf,
tree niters, bool might_wrap_p)
{
tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
tree mask_type = rgm->mask_type;
unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
/* Calculate the maximum number of scalar values that the rgroup
handles in total and the number that it handles for each iteration
of the vector loop. */
tree nscalars_total = niters;
tree nscalars_step = vf;
if (nscalars_per_iter != 1)
{
/* We checked before choosing to use a fully-masked loop that these
multiplications don't overflow. */
tree factor = build_int_cst (compare_type, nscalars_per_iter);
nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
nscalars_total, factor);
nscalars_step = gimple_build (preheader_seq, MULT_EXPR, compare_type,
nscalars_step, factor);
}
/* Create an induction variable that counts the number of scalars
processed. */
tree index_before_incr, index_after_incr;
gimple_stmt_iterator incr_gsi;
bool insert_after;
tree zero_index = build_int_cst (compare_type, 0);
standard_iv_increment_position (loop, &incr_gsi, &insert_after);
create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
insert_after, &index_before_incr, &index_after_incr);
tree test_index, test_limit;
gimple_stmt_iterator *test_gsi;
if (might_wrap_p)
{
/* In principle the loop should stop iterating once the incremented
IV reaches a value greater than or equal to NSCALARS_TOTAL.
However, there's no guarantee that the IV hits a value above
this value before wrapping around. We therefore adjust the
limit down by one IV step:
NSCALARS_TOTAL -[infinite-prec] NSCALARS_STEP
and compare the IV against this limit _before_ incrementing it.
Since the comparison type is unsigned, we actually want the
subtraction to saturate at zero:
NSCALARS_TOTAL -[sat] NSCALARS_STEP. */
test_index = index_before_incr;
test_limit = gimple_build (preheader_seq, MAX_EXPR, compare_type,
nscalars_total, nscalars_step);
test_limit = gimple_build (preheader_seq, MINUS_EXPR, compare_type,
test_limit, nscalars_step);
test_gsi = &incr_gsi;
}
else
{
/* Test the incremented IV, which will always hit a value above
the bound before wrapping. */
test_index = index_after_incr;
test_limit = nscalars_total;
test_gsi = &loop_cond_gsi;
}
/* Provide a definition of each mask in the group. */
tree next_mask = NULL_TREE;
tree mask;
unsigned int i;
FOR_EACH_VEC_ELT_REVERSE (rgm->masks, i, mask)
{
/* Previous masks will cover BIAS scalars. This mask covers the
next batch. */
poly_uint64 bias = nscalars_per_mask * i;
tree bias_tree = build_int_cst (compare_type, bias);
gimple *tmp_stmt;
/* See whether the first iteration of the vector loop is known
to have a full mask. */
poly_uint64 const_limit;
bool first_iteration_full
= (poly_int_tree_p (nscalars_total, &const_limit)
&& known_ge (const_limit, (i + 1) * nscalars_per_mask));
/* Rather than have a new IV that starts at BIAS and goes up to
TEST_LIMIT, prefer to use the same 0-based IV for each mask
and adjust the bound down by BIAS. */
tree this_test_limit = test_limit;
if (i != 0)
{
this_test_limit = gimple_build (preheader_seq, MAX_EXPR,
compare_type, this_test_limit,
bias_tree);
this_test_limit = gimple_build (preheader_seq, MINUS_EXPR,
compare_type, this_test_limit,
bias_tree);
}
/* Create the initial mask. */
tree init_mask = NULL_TREE;
if (!first_iteration_full)
{
tree start, end;
if (nscalars_total == test_limit)
{
/* Use a natural test between zero (the initial IV value)
and the loop limit. The "else" block would be valid too,
but this choice can avoid the need to load BIAS_TREE into
a register. */
start = zero_index;
end = this_test_limit;
}
else
{
start = bias_tree;
end = nscalars_total;
}
init_mask = make_temp_ssa_name (mask_type, NULL, "max_mask");
tmp_stmt = vect_gen_while (init_mask, start, end);
gimple_seq_add_stmt (preheader_seq, tmp_stmt);
}
if (!init_mask)
/* First iteration is full. */
init_mask = build_minus_one_cst (mask_type);
/* Get the mask value for the next iteration of the loop. */
next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask");
gcall *call = vect_gen_while (next_mask, test_index, this_test_limit);
gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
vect_set_loop_mask (loop, mask, init_mask, next_mask);
}
return next_mask;
}
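To see what the per-mask bias and limit adjustments produce, here is a hedged, standalone C emulation of the lane values for one rgroup in the non-wrapping case (the sizes and all names are purely illustrative):

  #include <stdio.h>

  int
  main (void)
  {
    unsigned int nscalars_total = 7;     /* scalar iterations to cover */
    unsigned int nscalars_per_mask = 4;  /* lanes per mask */
    unsigned int nmasks = 2;             /* masks in the rgroup */
    unsigned int step = nmasks * nscalars_per_mask;

    for (unsigned int iv = 0; iv < nscalars_total; iv += step)
      for (unsigned int i = 0; i < nmasks; ++i)
        {
          unsigned int bias = i * nscalars_per_mask;
          unsigned int limit = nscalars_total > bias ? nscalars_total - bias : 0;
          printf ("iteration %u, mask %u:", iv / step, i);
          for (unsigned int j = 0; j < nscalars_per_mask; ++j)
            /* Lane J plays the role of WHILE_ULT's IV + J < LIMIT test.  */
            printf (" %d", iv + j < limit);
          printf ("\n");
        }
    return 0;
  }

With these numbers the first mask comes out as 1 1 1 1 and the second as 1 1 1 0, i.e. scalars 0-6 are active and the trailing lane is switched off.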
/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
LOOP_VINFO describes the vectorization of LOOP. NITERS is the
number of iterations of the original scalar loop. NITERS_MAYBE_ZERO
and FINAL_IV are as for vect_set_loop_condition.
Insert the branch-back condition before LOOP_COND_GSI and return the
final gcond. */
static gcond *
vect_set_loop_condition_masked (struct loop *loop, loop_vec_info loop_vinfo,
tree niters, tree final_iv,
bool niters_maybe_zero,
gimple_stmt_iterator loop_cond_gsi)
{
gimple_seq preheader_seq = NULL;
gimple_seq header_seq = NULL;
tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
unsigned int compare_precision = TYPE_PRECISION (compare_type);
unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
tree orig_niters = niters;
/* Type of the initial value of NITERS. */
tree ni_actual_type = TREE_TYPE (niters);
unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type);
/* Convert NITERS to the same size as the compare. */
if (compare_precision > ni_actual_precision
&& niters_maybe_zero)
{
/* We know that there is always at least one iteration, so if the
count is zero then it must have wrapped. Cope with this by
subtracting 1 before the conversion and adding 1 to the result. */
gcc_assert (TYPE_UNSIGNED (ni_actual_type));
niters = gimple_build (&preheader_seq, PLUS_EXPR, ni_actual_type,
niters, build_minus_one_cst (ni_actual_type));
niters = gimple_convert (&preheader_seq, compare_type, niters);
niters = gimple_build (&preheader_seq, PLUS_EXPR, compare_type,
niters, build_one_cst (compare_type));
}
else
niters = gimple_convert (&preheader_seq, compare_type, niters);
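The subtract-then-widen-then-add dance above can be checked with ordinary C integer arithmetic. A minimal sketch, assuming an 8-bit count being widened to 16 bits (the values are only illustrative):

  #include <stdint.h>
  #include <stdio.h>

  int
  main (void)
  {
    uint8_t niters8 = 0;                       /* wrapped: really means 256 */
    uint16_t wide = (uint8_t) (niters8 - 1);   /* 255 in the narrow type */
    wide = wide + 1;                           /* 256 in the wide type */
    printf ("%u\n", (unsigned int) wide);      /* prints 256 */
    return 0;
  }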
/* Now calculate the value that the induction variable must be able
to hit in order to ensure that we end the loop with an all-false mask.
This involves adding the maximum number of inactive trailing scalar
iterations. */
widest_int iv_limit;
bool known_max_iters = max_loop_iterations (loop, &iv_limit);
if (known_max_iters)
{
/* IV_LIMIT is the maximum number of latch iterations, which is also
the maximum in-range IV value. Round this value down to the previous
vector alignment boundary and then add an extra full iteration. */
poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
}
/* Get the vectorization factor in tree form. */
tree vf = build_int_cst (compare_type,
LOOP_VINFO_VECT_FACTOR (loop_vinfo));
/* Iterate over all the rgroups and fill in their masks. We could use
the first mask from any rgroup for the loop condition; here we
arbitrarily pick the last. */
tree test_mask = NULL_TREE;
rgroup_masks *rgm;
unsigned int i;
vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
FOR_EACH_VEC_ELT (*masks, i, rgm)
if (!rgm->masks.is_empty ())
{
/* First try using permutes. This adds a single vector
instruction to the loop for each mask, but needs no extra
loop invariants or IVs. */
unsigned int nmasks = i + 1;
if ((nmasks & 1) == 0)
{
rgroup_masks *half_rgm = &(*masks)[nmasks / 2 - 1];
if (!half_rgm->masks.is_empty ()
&& vect_maybe_permute_loop_masks (&header_seq, rgm, half_rgm))
continue;
}
/* See whether zero-based IV would ever generate all-false masks
before wrapping around. */
bool might_wrap_p
= (!known_max_iters
|| (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter,
UNSIGNED)
> compare_precision));
/* Set up all masks for this group. */
test_mask = vect_set_loop_masks_directly (loop, loop_vinfo,
&preheader_seq,
loop_cond_gsi, rgm, vf,
niters, might_wrap_p);
}
/* Emit all accumulated statements. */
add_preheader_seq (loop, preheader_seq);
add_header_seq (loop, header_seq);
/* Get a boolean result that tells us whether to iterate. */
edge exit_edge = single_exit (loop);
tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
tree zero_mask = build_zero_cst (TREE_TYPE (test_mask));
gcond *cond_stmt = gimple_build_cond (code, test_mask, zero_mask,
NULL_TREE, NULL_TREE);
gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
/* The loop iterates (NITERS - 1) / VF + 1 times.
Subtract one from this to get the latch count. */
tree step = build_int_cst (compare_type,
LOOP_VINFO_VECT_FACTOR (loop_vinfo));
tree niters_minus_one = fold_build2 (PLUS_EXPR, compare_type, niters,
build_minus_one_cst (compare_type));
loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, compare_type,
niters_minus_one, step);
if (final_iv)
{
gassign *assign = gimple_build_assign (final_iv, orig_niters);
gsi_insert_on_edge_immediate (single_exit (loop), assign);
}
return cond_stmt;
}
/* Like vect_set_loop_condition, but handle the case in which there
are no loop masks. */
static gcond *
vect_set_loop_condition_unmasked (struct loop *loop, tree niters,
tree step, tree final_iv,
bool niters_maybe_zero,
gimple_stmt_iterator loop_cond_gsi)
{
tree indx_before_incr, indx_after_incr;
gcond *cond_stmt;
gcond *orig_cond;
edge pe = loop_preheader_edge (loop);
edge exit_edge = single_exit (loop);
gimple_stmt_iterator loop_cond_gsi;
gimple_stmt_iterator incr_gsi;
bool insert_after;
source_location loop_loc;
enum tree_code code;
tree niters_type = TREE_TYPE (niters);
......@@ -360,7 +773,6 @@ slpeel_make_loop_iterate_ntimes (struct loop *loop, tree niters, tree step,
standard_iv_increment_position (loop, &incr_gsi, &insert_after);
create_iv (init, step, NULL_TREE, loop,
&incr_gsi, insert_after, &indx_before_incr, &indx_after_incr);
indx_after_incr = force_gimple_operand_gsi (&loop_cond_gsi, indx_after_incr,
true, NULL_TREE, true,
GSI_SAME_STMT);
......@@ -372,19 +784,6 @@ slpeel_make_loop_iterate_ntimes (struct loop *loop, tree niters, tree step,
gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
/* Remove old loop exit test: */
gsi_remove (&loop_cond_gsi, true);
free_stmt_vec_info (orig_cond);
loop_loc = find_loop_location (loop);
if (dump_enabled_p ())
{
if (LOCATION_LOCUS (loop_loc) != UNKNOWN_LOCATION)
dump_printf (MSG_NOTE, "\nloop at %s:%d: ", LOCATION_FILE (loop_loc),
LOCATION_LINE (loop_loc));
dump_gimple_stmt (MSG_NOTE, TDF_SLIM, cond_stmt, 0);
}
/* Record the number of latch iterations. */
if (limit == niters)
/* Case A: the loop iterates NITERS times. Subtract one to get the
......@@ -403,6 +802,59 @@ slpeel_make_loop_iterate_ntimes (struct loop *loop, tree niters, tree step,
indx_after_incr, init);
gsi_insert_on_edge_immediate (single_exit (loop), assign);
}
return cond_stmt;
}
/* If we're using fully-masked loops, make LOOP iterate:
N == (NITERS - 1) / STEP + 1
times. When NITERS is zero, this is equivalent to making the loop
execute (1 << M) / STEP times, where M is the precision of NITERS.
NITERS_MAYBE_ZERO is true if this last case might occur.
If we're not using fully-masked loops, make LOOP iterate:
N == (NITERS - STEP) / STEP + 1
times, where NITERS is known to be outside the range [1, STEP - 1].
This is equivalent to making the loop execute NITERS / STEP times
when NITERS is nonzero and (1 << M) / STEP times otherwise.
NITERS_MAYBE_ZERO again indicates whether this last case might occur.
If FINAL_IV is nonnull, it is an SSA name that should be set to
N * STEP on exit from the loop.
Assumption: the exit-condition of LOOP is the last stmt in the loop. */
void
vect_set_loop_condition (struct loop *loop, loop_vec_info loop_vinfo,
tree niters, tree step, tree final_iv,
bool niters_maybe_zero)
{
gcond *cond_stmt;
gcond *orig_cond = get_loop_exit_condition (loop);
gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters,
final_iv, niters_maybe_zero,
loop_cond_gsi);
else
cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step,
final_iv, niters_maybe_zero,
loop_cond_gsi);
/* Remove old loop exit test. */
gsi_remove (&loop_cond_gsi, true);
free_stmt_vec_info (orig_cond);
if (dump_enabled_p ())
{
dump_printf_loc (MSG_NOTE, vect_location, "New loop exit condition: ");
dump_gimple_stmt (MSG_NOTE, TDF_SLIM, cond_stmt, 0);
}
}
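The two trip-count formulas in the comment above vect_set_loop_condition can be checked numerically. A hedged standalone example (the concrete values are only for illustration):

  #include <stdio.h>

  int
  main (void)
  {
    unsigned int step = 4;

    /* Fully-masked: any NITERS, the final iteration may be partial.  */
    unsigned int niters_masked = 10;
    unsigned int n_masked = (niters_masked - 1) / step + 1;          /* 3 */

    /* Not fully-masked: NITERS is known to lie outside [1, STEP - 1].  */
    unsigned int niters_unmasked = 12;
    unsigned int n_unmasked = (niters_unmasked - step) / step + 1;   /* 3 */

    printf ("%u %u\n", n_masked, n_unmasked);
    return 0;
  }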
/* Helper routine of slpeel_tree_duplicate_loop_to_edge_cfg.
......@@ -1319,7 +1771,8 @@ vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters,
ni_minus_gap = niters;
unsigned HOST_WIDE_INT const_vf;
if (vf.is_constant (&const_vf))
if (vf.is_constant (&const_vf)
&& !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
{
/* Create: niters >> log2(vf) */
/* If it's known that niters == number of latch executions + 1 doesn't
......@@ -1726,8 +2179,7 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
CHECK_PROFITABILITY is true.
Output:
- *NITERS_VECTOR and *STEP_VECTOR describe how the main loop should
iterate after vectorization; see slpeel_make_loop_iterate_ntimes
for details.
iterate after vectorization; see vect_set_loop_condition for details.
- *NITERS_VECTOR_MULT_VF_VAR is either null or an SSA name that
should be set to the number of scalar iterations handled by the
vector loop. The SSA name is only used on exit from the loop.
......@@ -1892,8 +2344,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor,
&bound_prolog);
tree step_prolog = build_one_cst (TREE_TYPE (niters_prolog));
slpeel_make_loop_iterate_ntimes (prolog, niters_prolog, step_prolog,
NULL_TREE, false);
vect_set_loop_condition (prolog, NULL, niters_prolog,
step_prolog, NULL_TREE, false);
/* Skip the prolog loop. */
if (skip_prolog)

......@@ -1121,12 +1121,15 @@ _loop_vec_info::_loop_vec_info (struct loop *loop_in)
versioning_threshold (0),
vectorization_factor (0),
max_vectorization_factor (0),
mask_compare_type (NULL_TREE),
unaligned_dr (NULL),
peeling_for_alignment (0),
ptr_mask (0),
slp_unrolling_factor (1),
single_scalar_iteration_cost (0),
vectorizable (false),
can_fully_mask_p (true),
fully_masked_p (false),
peeling_for_gaps (false),
peeling_for_niter (false),
operands_swapped (false),
......@@ -1168,6 +1171,17 @@ _loop_vec_info::_loop_vec_info (struct loop *loop_in)
gcc_assert (nbbs == loop->num_nodes);
}
/* Free all levels of MASKS. */
void
release_vec_loop_masks (vec_loop_masks *masks)
{
rgroup_masks *rgm;
unsigned int i;
FOR_EACH_VEC_ELT (*masks, i, rgm)
rgm->masks.release ();
masks->release ();
}
/* Free all memory used by the _loop_vec_info, as well as all the
stmt_vec_info structs of all the stmts in the loop. */
......@@ -1233,9 +1247,98 @@ _loop_vec_info::~_loop_vec_info ()
free (bbs);
release_vec_loop_masks (&masks);
loop->aux = NULL;
}
/* Return true if we can use CMP_TYPE as the comparison type to produce
all masks required to mask LOOP_VINFO. */
static bool
can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
{
rgroup_masks *rgm;
unsigned int i;
FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
if (rgm->mask_type != NULL_TREE
&& !direct_internal_fn_supported_p (IFN_WHILE_ULT,
cmp_type, rgm->mask_type,
OPTIMIZE_FOR_SPEED))
return false;
return true;
}
/* Calculate the maximum number of scalars per iteration for every
rgroup in LOOP_VINFO. */
static unsigned int
vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
{
unsigned int res = 1;
unsigned int i;
rgroup_masks *rgm;
FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
res = MAX (res, rgm->max_nscalars_per_iter);
return res;
}
/* Each statement in LOOP_VINFO can be masked where necessary. Check
whether we can actually generate the masks required. Return true if so,
storing the type of the scalar IV in LOOP_VINFO_MASK_COMPARE_TYPE. */
static bool
vect_verify_full_masking (loop_vec_info loop_vinfo)
{
struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
unsigned int min_ni_width;
/* Get the maximum number of iterations that is representable
in the counter type. */
tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
/* Get a more refined estimate for the number of iterations. */
widest_int max_back_edges;
if (max_loop_iterations (loop, &max_back_edges))
max_ni = wi::smin (max_ni, max_back_edges + 1);
/* Account for rgroup masks, in which each bit is replicated N times. */
max_ni *= vect_get_max_nscalars_per_iter (loop_vinfo);
/* Work out how many bits we need to represent the limit. */
min_ni_width = wi::min_precision (max_ni, UNSIGNED);
/* Find a scalar mode for which WHILE_ULT is supported. */
opt_scalar_int_mode cmp_mode_iter;
tree cmp_type = NULL_TREE;
FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
{
unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
if (cmp_bits >= min_ni_width
&& targetm.scalar_mode_supported_p (cmp_mode_iter.require ()))
{
tree this_type = build_nonstandard_integer_type (cmp_bits, true);
if (this_type
&& can_produce_all_loop_masks_p (loop_vinfo, this_type))
{
/* Although we could stop as soon as we find a valid mode,
it's often better to continue until we hit Pmode, since the
operands to the WHILE are more likely to be reusable in
address calculations. */
cmp_type = this_type;
if (cmp_bits >= GET_MODE_BITSIZE (Pmode))
break;
}
}
}
if (!cmp_type)
return false;
LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo) = cmp_type;
return true;
}
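The width calculation in vect_verify_full_masking boils down to: take the tightest iteration bound, scale it by the largest nS of any rgroup, and count the bits of the result. A hedged C sketch with made-up numbers:

  #include <stdio.h>

  static unsigned int
  min_unsigned_precision (unsigned long long x)
  {
    /* Smallest number of bits that can hold X as an unsigned value.  */
    unsigned int bits = 1;
    while (x >>= 1)
      ++bits;
    return bits;
  }

  int
  main (void)
  {
    unsigned long long max_ni = 1000;          /* refined iteration bound */
    unsigned int max_nscalars_per_iter = 2;    /* widest rgroup */
    unsigned long long limit = max_ni * max_nscalars_per_iter;
    /* 2000 needs 11 bits, so a 16-bit compare type would be wide enough
       (subject to WHILE_ULT support).  */
    printf ("%u\n", min_unsigned_precision (limit));
    return 0;
  }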
/* Calculate the cost of one scalar iteration of the loop. */
static void
......@@ -1980,6 +2083,12 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
vect_update_vf_for_slp (loop_vinfo);
}
bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo);
/* We don't expect to have to roll back to anything other than an empty
set of rgroups. */
gcc_assert (LOOP_VINFO_MASKS (loop_vinfo).is_empty ());
/* This is the point where we can re-start analysis with SLP forced off. */
start_over:
......@@ -2068,11 +2177,47 @@ start_over:
return false;
}
if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
&& LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
{
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop because peeling for"
" gaps is required.\n");
}
if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
&& LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
{
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop because peeling for"
" alignment is required.\n");
}
/* Decide whether to use a fully-masked loop for this vectorization
factor. */
LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
= (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
&& vect_verify_full_masking (loop_vinfo));
if (dump_enabled_p ())
{
if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
dump_printf_loc (MSG_NOTE, vect_location,
"using a fully-masked loop.\n");
else
dump_printf_loc (MSG_NOTE, vect_location,
"not using a fully-masked loop.\n");
}
/* If epilog loop is required because of data accesses with gaps,
one additional iteration needs to be peeled. Check if there is
enough iterations for vectorization. */
if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
&& LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
&& LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
&& !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
{
poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
tree scalar_niters = LOOP_VINFO_NITERSM1 (loop_vinfo);
......@@ -2153,8 +2298,11 @@ start_over:
th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
unsigned HOST_WIDE_INT const_vf;
if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
&& LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
/* The main loop handles all iterations. */
LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
&& LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
{
if (!multiple_p (LOOP_VINFO_INT_NITERS (loop_vinfo)
- LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo),
......@@ -2212,7 +2360,8 @@ start_over:
niters_th = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
/* Niters for at least one iteration of vectorized loop. */
niters_th += LOOP_VINFO_VECT_FACTOR (loop_vinfo);
if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
niters_th += LOOP_VINFO_VECT_FACTOR (loop_vinfo);
/* One additional iteration because of peeling for gap. */
if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
niters_th += 1;
......@@ -2315,11 +2464,14 @@ again:
destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));
LOOP_VINFO_TARGET_COST_DATA (loop_vinfo)
= init_cost (LOOP_VINFO_LOOP (loop_vinfo));
/* Reset accumulated rgroup information. */
release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo));
/* Reset assorted flags. */
LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0;
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p;
goto start_over;
}
......@@ -3523,7 +3675,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
= LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo);
/* Add additional cost for the peeled instructions in prologue and epilogue
loop.
loop. (For fully-masked loops there will be no peeling.)
FORNOW: If we don't know the value of peel_iters for prologue or epilogue
at compile-time - we assume it's vf/2 (the worst would be vf-1).
......@@ -3531,7 +3683,12 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
TODO: Build an expression that represents peel_iters for prologue and
epilogue to be used in a run-time test. */
if (npeel < 0)
if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
{
peel_iters_prologue = 0;
peel_iters_epilogue = 0;
}
else if (npeel < 0)
{
peel_iters_prologue = assumed_vf / 2;
dump_printf (MSG_NOTE, "cost model: "
......@@ -3762,8 +3919,9 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
" Calculated minimum iters for profitability: %d\n",
min_profitable_iters);
/* We want the vectorized loop to execute at least once. */
if (min_profitable_iters < (assumed_vf + peel_iters_prologue))
if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
&& min_profitable_iters < (assumed_vf + peel_iters_prologue))
/* We want the vectorized loop to execute at least once. */
min_profitable_iters = assumed_vf + peel_iters_prologue;
if (dump_enabled_p ())
......@@ -6737,6 +6895,15 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
if (!vec_stmt) /* transformation not required. */
{
if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop due to "
"reduction operation.\n");
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
}
if (first_p)
vect_model_reduction_cost (stmt_info, reduc_fn, ncopies);
STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
......@@ -7557,8 +7724,19 @@ vectorizable_live_operation (gimple *stmt,
}
if (!vec_stmt)
/* No transformation required. */
return true;
{
if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop because "
"a value is live outside the loop.\n");
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
}
/* No transformation required. */
return true;
}
/* If stmt has a related stmt, then use that for getting the lhs. */
if (is_pattern_stmt_p (stmt_info))
......@@ -7573,6 +7751,8 @@ vectorizable_live_operation (gimple *stmt,
: TYPE_SIZE (TREE_TYPE (vectype)));
vec_bitsize = TYPE_SIZE (vectype);
gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
/* Get the vectorized lhs of STMT and the lane to use (counted in bits). */
tree vec_lhs, bitstart;
if (slp_node)
......@@ -7706,6 +7886,97 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
return false;
}
/* Return a mask type with half the number of elements as TYPE. */
tree
vect_halve_mask_nunits (tree type)
{
poly_uint64 nunits = exact_div (TYPE_VECTOR_SUBPARTS (type), 2);
return build_truth_vector_type (nunits, current_vector_size);
}
/* Return a mask type with twice as many elements as TYPE. */
tree
vect_double_mask_nunits (tree type)
{
poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (type) * 2;
return build_truth_vector_type (nunits, current_vector_size);
}
/* Record that a fully-masked version of LOOP_VINFO would need MASKS to
contain a sequence of NVECTORS masks that each control a vector of type
VECTYPE. */
void
vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
unsigned int nvectors, tree vectype)
{
gcc_assert (nvectors != 0);
if (masks->length () < nvectors)
masks->safe_grow_cleared (nvectors);
rgroup_masks *rgm = &(*masks)[nvectors - 1];
/* The number of scalars per iteration and the number of vectors are
both compile-time constants. */
unsigned int nscalars_per_iter
= exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
if (rgm->max_nscalars_per_iter < nscalars_per_iter)
{
rgm->max_nscalars_per_iter = nscalars_per_iter;
rgm->mask_type = build_same_sized_truth_vector_type (vectype);
}
}
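The nscalars_per_iter computation is just nvectors * nunits / VF. Using the float/double example from the rgroup comment in tree-vectorizer.h, a hedged arithmetic check (not part of the patch):

  #include <stdio.h>

  int
  main (void)
  {
    unsigned int vf = 4;
    unsigned int f_nscalars = (1 * 8) / vf;   /* one 8-lane vector: nS = 2 */
    unsigned int d_nscalars = (1 * 4) / vf;   /* one 4-lane vector: nS = 1 */
    printf ("%u %u\n", f_nscalars, d_nscalars);
    return 0;
  }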
/* Given a complete set of masks MASKS, extract mask number INDEX
for an rgroup that operates on NVECTORS vectors of type VECTYPE,
where 0 <= INDEX < NVECTORS. Insert any set-up statements before GSI.
See the comment above vec_loop_masks for more details about the mask
arrangement. */
tree
vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
unsigned int nvectors, tree vectype, unsigned int index)
{
rgroup_masks *rgm = &(*masks)[nvectors - 1];
tree mask_type = rgm->mask_type;
/* Populate the rgroup's mask array, if this is the first time we've
used it. */
if (rgm->masks.is_empty ())
{
rgm->masks.safe_grow_cleared (nvectors);
for (unsigned int i = 0; i < nvectors; ++i)
{
tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
/* Provide a dummy definition until the real one is available. */
SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
rgm->masks[i] = mask;
}
}
tree mask = rgm->masks[index];
if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
TYPE_VECTOR_SUBPARTS (vectype)))
{
/* A loop mask for data type X can be reused for data type Y
if X has N times more elements than Y and if Y's elements
are N times bigger than X's. In this case each sequence
of N elements in the loop mask will be all-zero or all-one.
We can then view-convert the mask so that each sequence of
N elements is replaced by a single element. */
gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
TYPE_VECTOR_SUBPARTS (vectype)));
gimple_seq seq = NULL;
mask_type = build_same_sized_truth_vector_type (vectype);
mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
if (seq)
gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
}
return mask;
}
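The VIEW_CONVERT reuse above can be pictured with plain arrays: a mask built for the narrower element type has its lanes repeated in groups, so keeping one lane per group yields the mask for the wider type. A hedged standalone illustration using the float/double example again:

  #include <stdio.h>

  int
  main (void)
  {
    /* f rgroup mask for three active scalar iterations (nS = 2).  */
    int mask8[8] = { 1, 1, 1, 1, 1, 1, 0, 0 };
    int mask4[4];
    for (int i = 0; i < 4; ++i)
      mask4[i] = mask8[2 * i];   /* each pair is all-zero or all-one */
    for (int i = 0; i < 4; ++i)
      printf ("%d ", mask4[i]);  /* prints 1 1 1 0, the d rgroup mask */
    printf ("\n");
    return 0;
  }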
/* Scale profiling counters by estimation for LOOP which is vectorized
by factor VF. */
......@@ -7840,9 +8111,12 @@ vect_transform_loop (loop_vec_info loop_vinfo)
epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
&step_vector, &niters_vector_mult_vf, th,
check_profitability, niters_no_overflow);
if (niters_vector == NULL_TREE)
{
if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && known_eq (lowest_vf, vf))
if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
&& !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
&& known_eq (lowest_vf, vf))
{
niters_vector
= build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
......@@ -8124,13 +8398,15 @@ vect_transform_loop (loop_vec_info loop_vinfo)
a zero NITERS becomes a nonzero NITERS_VECTOR. */
if (integer_onep (step_vector))
niters_no_overflow = true;
slpeel_make_loop_iterate_ntimes (loop, niters_vector, step_vector,
niters_vector_mult_vf,
!niters_no_overflow);
vect_set_loop_condition (loop, loop_vinfo, niters_vector, step_vector,
niters_vector_mult_vf, !niters_no_overflow);
unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
scale_profile_for_vect_loop (loop, assumed_vf);
/* True if the final iteration might not handle a full vector's
worth of scalar iterations. */
bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
/* The minimum number of iterations performed by the epilogue. This
is 1 when peeling for gaps because we always need a final scalar
iteration. */
......@@ -8143,16 +8419,25 @@ vect_transform_loop (loop_vec_info loop_vinfo)
back to latch counts. */
if (loop->any_upper_bound)
loop->nb_iterations_upper_bound
= wi::udiv_floor (loop->nb_iterations_upper_bound + bias,
lowest_vf) - 1;
= (final_iter_may_be_partial
? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias,
lowest_vf) - 1
: wi::udiv_floor (loop->nb_iterations_upper_bound + bias,
lowest_vf) - 1);
if (loop->any_likely_upper_bound)
loop->nb_iterations_likely_upper_bound
= wi::udiv_floor (loop->nb_iterations_likely_upper_bound + bias,
lowest_vf) - 1;
= (final_iter_may_be_partial
? wi::udiv_ceil (loop->nb_iterations_likely_upper_bound + bias,
lowest_vf) - 1
: wi::udiv_floor (loop->nb_iterations_likely_upper_bound + bias,
lowest_vf) - 1);
if (loop->any_estimate)
loop->nb_iterations_estimate
= wi::udiv_floor (loop->nb_iterations_estimate + bias,
assumed_vf) - 1;
= (final_iter_may_be_partial
? wi::udiv_ceil (loop->nb_iterations_estimate + bias,
assumed_vf) - 1
: wi::udiv_floor (loop->nb_iterations_estimate + bias,
assumed_vf) - 1);
if (dump_enabled_p ())
{
......@@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see
#include "internal-fn.h"
#include "tree-vector-builder.h"
#include "vec-perm-indices.h"
#include "tree-ssa-loop-niter.h"
#include "gimple-fold.h"
/* For lang_hooks.types.type_for_mode. */
#include "langhooks.h"
......@@ -1694,6 +1696,113 @@ vectorizable_internal_function (combined_fn cfn, tree fndecl,
static tree permute_vec_elements (tree, tree, tree, gimple *,
gimple_stmt_iterator *);
/* Check whether a load or store statement in the loop described by
LOOP_VINFO is possible in a fully-masked loop. This is testing
whether the vectorizer pass has the appropriate support, as well as
whether the target does.
VLS_TYPE says whether the statement is a load or store and VECTYPE
is the type of the vector being loaded or stored. MEMORY_ACCESS_TYPE
says how the load or store is going to be implemented and GROUP_SIZE
is the number of load or store statements in the containing group.
Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
supported, otherwise record the required mask types. */
static void
check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
vec_load_store_type vls_type, int group_size,
vect_memory_access_type memory_access_type)
{
/* Invariant loads need no special support. */
if (memory_access_type == VMAT_INVARIANT)
return;
vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
machine_mode vecmode = TYPE_MODE (vectype);
bool is_load = (vls_type == VLS_LOAD);
if (memory_access_type == VMAT_LOAD_STORE_LANES)
{
if (is_load
? !vect_load_lanes_supported (vectype, group_size, true)
: !vect_store_lanes_supported (vectype, group_size, true))
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop because the"
" target doesn't have an appropriate masked"
" load/store-lanes instruction.\n");
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
return;
}
unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
return;
}
if (memory_access_type != VMAT_CONTIGUOUS
&& memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
{
/* Element X of the data must come from iteration i * VF + X of the
scalar loop. We need more work to support other mappings. */
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop because an access"
" isn't contiguous.\n");
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
return;
}
machine_mode mask_mode;
if (!(targetm.vectorize.get_mask_mode
(GET_MODE_NUNITS (vecmode),
GET_MODE_SIZE (vecmode)).exists (&mask_mode))
|| !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"can't use a fully-masked loop because the target"
" doesn't have the appropriate masked load or"
" store.\n");
LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
return;
}
/* We might load more scalars than we need for permuting SLP loads.
We checked in get_group_load_store_type that the extra elements
don't leak into a new vector. */
poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
unsigned int nvectors;
if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype);
else
gcc_unreachable ();
}
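For constant sizes, the final nvectors computation should amount to a ceiling division of group_size * vf by the number of vector lanes. A hedged sketch with illustrative numbers:

  #include <stdio.h>

  int
  main (void)
  {
    unsigned int group_size = 3, vf = 4, nunits = 8;
    /* Rounding the quotient away from zero: 12 scalars spread over
       8-lane vectors need 2 masks.  */
    unsigned int nvectors = (group_size * vf + nunits - 1) / nunits;
    printf ("%u\n", nvectors);
    return 0;
  }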
/* Return the mask input to a masked load or store. VEC_MASK is the vectorized
form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask
that needs to be applied to all loads and stores in a vectorized loop.
Return VEC_MASK if LOOP_MASK is null, otherwise return VEC_MASK & LOOP_MASK.
MASK_TYPE is the type of both masks. If new statements are needed,
insert them before GSI. */
static tree
prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
gimple_stmt_iterator *gsi)
{
gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
if (!loop_mask)
return vec_mask;
gcc_assert (TREE_TYPE (loop_mask) == mask_type);
tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
vec_mask, loop_mask);
gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
return and_res;
}
/* STMT is a non-strided load or store, meaning that it accesses
elements with a known constant step. Return -1 if that step
is negative, 0 if it is zero, and 1 if it is greater than zero. */
......@@ -5796,9 +5905,29 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
return false;
}
grouped_store = STMT_VINFO_GROUPED_ACCESS (stmt_info);
if (grouped_store)
{
first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt));
group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
}
else
{
first_stmt = stmt;
first_dr = dr;
group_size = vec_num = 1;
}
if (!vec_stmt) /* transformation not required. */
{
STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
if (loop_vinfo
&& LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
memory_access_type);
STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
/* The SLP costs are calculated during SLP analysis. */
if (!PURE_SLP_STMT (stmt_info))
......@@ -5962,13 +6091,8 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
return true;
}
grouped_store = STMT_VINFO_GROUPED_ACCESS (stmt_info);
if (grouped_store)
{
first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt));
group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))++;
/* FORNOW */
......@@ -6003,12 +6127,7 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
ref_type = get_group_alias_ptr_type (first_stmt);
}
else
{
first_stmt = stmt;
first_dr = dr;
group_size = vec_num = 1;
ref_type = reference_alias_ptr_type (DR_REF (first_dr));
}
ref_type = reference_alias_ptr_type (DR_REF (first_dr));
if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
......@@ -6030,6 +6149,7 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
/* Checked by get_load_store_type. */
unsigned int const_nunits = nunits.to_constant ();
gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
gcc_assert (!nested_in_vect_loop_p (loop, stmt));
stride_base
......@@ -6260,10 +6380,13 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
alignment_support_scheme = vect_supportable_dr_alignment (first_dr, false);
gcc_assert (alignment_support_scheme);
bool masked_loop_p = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
/* Targets with store-lane instructions must not require explicit
realignment. vect_supportable_dr_alignment always returns either
dr_aligned or dr_unaligned_supported for masked operations. */
gcc_assert ((memory_access_type != VMAT_LOAD_STORE_LANES && !mask)
gcc_assert ((memory_access_type != VMAT_LOAD_STORE_LANES
&& !mask
&& !masked_loop_p)
|| alignment_support_scheme == dr_aligned
|| alignment_support_scheme == dr_unaligned_supported);
......@@ -6320,6 +6443,7 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
prev_stmt_info = NULL;
tree vec_mask = NULL_TREE;
vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
for (j = 0; j < ncopies; j++)
{
......@@ -6429,8 +6553,15 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
}
tree final_mask = NULL;
if (masked_loop_p)
final_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
if (vec_mask)
final_mask = prepare_load_store_mask (mask_vectype, final_mask,
vec_mask, gsi);
gcall *call;
if (mask)
if (final_mask)
{
/* Emit:
MASK_STORE_LANES (DATAREF_PTR, ALIAS_PTR, VEC_MASK,
......@@ -6439,7 +6570,7 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
tree alias_ptr = build_int_cst (ref_type, align);
call = gimple_build_call_internal (IFN_MASK_STORE_LANES, 4,
dataref_ptr, alias_ptr,
vec_mask, vec_array);
final_mask, vec_array);
}
else
{
......@@ -6471,6 +6602,14 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
{
unsigned align, misalign;
tree final_mask = NULL_TREE;
if (masked_loop_p)
final_mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
vectype, vec_num * j + i);
if (vec_mask)
final_mask = prepare_load_store_mask (mask_vectype, final_mask,
vec_mask, gsi);
if (i > 0)
/* Bump the vector pointer. */
dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi,
......@@ -6517,14 +6656,14 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
}
/* Arguments are ready. Create the new vector stmt. */
if (mask)
if (final_mask)
{
align = least_bit_hwi (misalign | align);
tree ptr = build_int_cst (ref_type, align);
gcall *call
= gimple_build_call_internal (IFN_MASK_STORE, 4,
dataref_ptr, ptr,
vec_mask, vec_oprnd);
final_mask, vec_oprnd);
gimple_call_set_nothrow (call, true);
new_stmt = call;
}
......@@ -6891,6 +7030,8 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
return false;
}
}
else
group_size = 1;
vect_memory_access_type memory_access_type;
if (!get_load_store_type (stmt, vectype, slp, mask, VLS_LOAD, ncopies,
......@@ -6934,6 +7075,12 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
{
if (!slp)
STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
if (loop_vinfo
&& LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
memory_access_type);
STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
/* The SLP costs are calculated during SLP analysis. */
if (!PURE_SLP_STMT (stmt_info))
......@@ -6975,6 +7122,7 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
/* Checked by get_load_store_type. */
unsigned int const_nunits = nunits.to_constant ();
gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
gcc_assert (!nested_in_vect_loop);
if (slp && grouped_load)
......@@ -7251,9 +7399,13 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
alignment_support_scheme = vect_supportable_dr_alignment (first_dr, false);
gcc_assert (alignment_support_scheme);
/* Targets with load-lane instructions must not require explicit
realignment. */
gcc_assert (memory_access_type != VMAT_LOAD_STORE_LANES
bool masked_loop_p = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
/* Targets with load-lane instructions must not require explicit
realignment. vect_supportable_dr_alignment always returns either
dr_aligned or dr_unaligned_supported for masked operations. */
gcc_assert ((memory_access_type != VMAT_LOAD_STORE_LANES
&& !mask
&& !masked_loop_p)
|| alignment_support_scheme == dr_aligned
|| alignment_support_scheme == dr_unaligned_supported);
......@@ -7396,6 +7548,7 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
tree vec_mask = NULL_TREE;
prev_stmt_info = NULL;
poly_uint64 group_elt = 0;
vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
for (j = 0; j < ncopies; j++)
{
/* 1. Create the vector or array pointer update chain. */
......@@ -7471,8 +7624,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
vec_array = create_vector_array (vectype, vec_num);
tree final_mask = NULL_TREE;
if (masked_loop_p)
final_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
if (vec_mask)
final_mask = prepare_load_store_mask (mask_vectype, final_mask,
vec_mask, gsi);
gcall *call;
if (mask)
if (final_mask)
{
/* Emit:
VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR,
......@@ -7481,7 +7641,7 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
tree alias_ptr = build_int_cst (ref_type, align);
call = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
dataref_ptr, alias_ptr,
vec_mask);
final_mask);
}
else
{
......@@ -7510,6 +7670,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
{
for (i = 0; i < vec_num; i++)
{
tree final_mask = NULL_TREE;
if (masked_loop_p
&& memory_access_type != VMAT_INVARIANT)
final_mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
vectype, vec_num * j + i);
if (vec_mask)
final_mask = prepare_load_store_mask (mask_vectype, final_mask,
vec_mask, gsi);
if (i > 0)
dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi,
stmt, NULL_TREE);
......@@ -7540,14 +7709,14 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
set_ptr_info_alignment (get_ptr_info (dataref_ptr),
align, misalign);
if (mask)
if (final_mask)
{
align = least_bit_hwi (misalign | align);
tree ptr = build_int_cst (ref_type, align);
gcall *call
= gimple_build_call_internal (IFN_MASK_LOAD, 3,
dataref_ptr, ptr,
vec_mask);
final_mask);
gimple_call_set_nothrow (call, true);
new_stmt = call;
data_ref = NULL_TREE;
......@@ -9610,11 +9779,7 @@ supportable_widening_operation (enum tree_code code, gimple *stmt,
intermediate_mode = insn_data[icode1].operand[0].mode;
if (VECTOR_BOOLEAN_TYPE_P (prev_type))
{
poly_uint64 intermediate_nelts
= exact_div (TYPE_VECTOR_SUBPARTS (prev_type), 2);
intermediate_type
= build_truth_vector_type (intermediate_nelts,
current_vector_size);
intermediate_type = vect_halve_mask_nunits (prev_type);
if (intermediate_mode != TYPE_MODE (intermediate_type))
return false;
}
......@@ -9775,11 +9940,9 @@ supportable_narrowing_operation (enum tree_code code,
intermediate_mode = insn_data[icode1].operand[0].mode;
if (VECTOR_BOOLEAN_TYPE_P (prev_type))
{
intermediate_type
= build_truth_vector_type (TYPE_VECTOR_SUBPARTS (prev_type) * 2,
current_vector_size);
intermediate_type = vect_double_mask_nunits (prev_type);
if (intermediate_mode != TYPE_MODE (intermediate_type))
return false;
}
else
intermediate_type
......@@ -9810,3 +9973,21 @@ supportable_narrowing_operation (enum tree_code code,
interm_types->release ();
return false;
}
/* Generate and return a statement that sets vector mask MASK such that
MASK[I] is true iff J + START_INDEX < END_INDEX for all J <= I. */
gcall *
vect_gen_while (tree mask, tree start_index, tree end_index)
{
tree cmp_type = TREE_TYPE (start_index);
tree mask_type = TREE_TYPE (mask);
gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
cmp_type, mask_type,
OPTIMIZE_FOR_SPEED));
gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
start_index, end_index,
build_zero_cst (mask_type));
gimple_call_set_lhs (call, mask);
return call;
}
......@@ -211,6 +211,102 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
}
/* In general, we can divide the vector statements in a vectorized loop
into related groups ("rgroups") and say that for each rgroup there is
some nS such that the rgroup operates on nS values from one scalar
iteration followed by nS values from the next. That is, if VF is the
vectorization factor of the loop, the rgroup operates on a sequence:
(1,1) (1,2) ... (1,nS) (2,1) ... (2,nS) ... (VF,1) ... (VF,nS)
where (i,j) represents a scalar value with index j in a scalar
iteration with index i.
[ We use the term "rgroup" to emphasise that this grouping isn't
necessarily the same as the grouping of statements used elsewhere.
For example, if we implement a group of scalar loads using gather
loads, we'll use a separate gather load for each scalar load, and
thus each gather load will belong to its own rgroup. ]
In general this sequence will occupy nV vectors concatenated
together. If these vectors have nL lanes each, the total number
of scalar values N is given by:
N = nS * VF = nV * nL
None of nS, VF, nV and nL are required to be a power of 2. nS and nV
are compile-time constants but VF and nL can be variable (if the target
supports variable-length vectors).
In classical vectorization, each iteration of the vector loop would
handle exactly VF iterations of the original scalar loop. However,
in a fully-masked loop, a particular iteration of the vector loop
might handle fewer than VF iterations of the scalar loop. The vector
lanes that correspond to iterations of the scalar loop are said to be
"active" and the other lanes are said to be "inactive".
In a fully-masked loop, many rgroups need to be masked to ensure that
they have no effect for the inactive lanes. Each such rgroup needs a
sequence of booleans in the same order as above, but with each (i,j)
replaced by a boolean that indicates whether iteration i is active.
This sequence occupies nV vector masks that again have nL lanes each.
Thus the mask sequence as a whole consists of VF independent booleans
that are each repeated nS times.
We make the simplifying assumption that if a sequence of nV masks is
suitable for one (nS,nL) pair, we can reuse it for (nS/2,nL/2) by
VIEW_CONVERTing it. This holds for all current targets that support
fully-masked loops. For example, suppose the scalar loop is:
float *f;
double *d;
for (int i = 0; i < n; ++i)
{
f[i * 2 + 0] += 1.0f;
f[i * 2 + 1] += 2.0f;
d[i] += 3.0;
}
and suppose that vectors have 256 bits. The vectorized f accesses
will belong to one rgroup and the vectorized d access to another:
f rgroup: nS = 2, nV = 1, nL = 8
d rgroup: nS = 1, nV = 1, nL = 4
VF = 4
[ In this simple example the rgroups do correspond to the normal
SLP grouping scheme. ]
If only the first three lanes are active, the masks we need are:
f rgroup: 1 1 | 1 1 | 1 1 | 0 0
d rgroup: 1 | 1 | 1 | 0
Here we can use a mask calculated for f's rgroup for d's, but not
vice versa.
Thus for each value of nV, it is enough to provide nV masks, with the
mask being calculated based on the highest nL (or, equivalently, based
on the highest nS) required by any rgroup with that nV. We therefore
represent the entire collection of masks as a two-level table, with the
first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
the second being indexed by the mask index 0 <= i < nV. */
/* The masks needed by rgroups with nV vectors, according to the
description above. */
struct rgroup_masks {
/* The largest nS for all rgroups that use these masks. */
unsigned int max_nscalars_per_iter;
/* The type of mask to use, based on the highest nS recorded above. */
tree mask_type;
/* A vector of nV masks, in iteration order. */
vec<tree> masks;
};
typedef auto_vec<rgroup_masks> vec_loop_masks;
/*-----------------------------------------------------------------*/
/* Info on vectorized loops. */
/*-----------------------------------------------------------------*/
......@@ -251,6 +347,14 @@ typedef struct _loop_vec_info : public vec_info {
if there is no particular limit. */
unsigned HOST_WIDE_INT max_vectorization_factor;
/* The masks that a fully-masked loop should use to avoid operating
on inactive scalars. */
vec_loop_masks masks;
/* Type of the variables to use in the WHILE_ULT call for fully-masked
loops. */
tree mask_compare_type;
/* Unknown DRs according to which loop was peeled. */
struct data_reference *unaligned_dr;
......@@ -305,6 +409,12 @@ typedef struct _loop_vec_info : public vec_info {
/* Is the loop vectorizable? */
bool vectorizable;
/* Records whether we still have the option of using a fully-masked loop. */
bool can_fully_mask_p;
/* True if have decided to use a fully-masked loop. */
bool fully_masked_p;
/* When we have grouped data accesses with gaps, we may introduce invalid
memory accesses. We peel the last iteration of the loop to prevent
this. */
......@@ -365,8 +475,12 @@ typedef struct _loop_vec_info : public vec_info {
#define LOOP_VINFO_COST_MODEL_THRESHOLD(L) (L)->th
#define LOOP_VINFO_VERSIONING_THRESHOLD(L) (L)->versioning_threshold
#define LOOP_VINFO_VECTORIZABLE_P(L) (L)->vectorizable
#define LOOP_VINFO_CAN_FULLY_MASK_P(L) (L)->can_fully_mask_p
#define LOOP_VINFO_FULLY_MASKED_P(L) (L)->fully_masked_p
#define LOOP_VINFO_VECT_FACTOR(L) (L)->vectorization_factor
#define LOOP_VINFO_MAX_VECT_FACTOR(L) (L)->max_vectorization_factor
#define LOOP_VINFO_MASKS(L) (L)->masks
#define LOOP_VINFO_MASK_COMPARE_TYPE(L) (L)->mask_compare_type
#define LOOP_VINFO_PTR_MASK(L) (L)->ptr_mask
#define LOOP_VINFO_LOOP_NEST(L) (L)->loop_nest
#define LOOP_VINFO_DATAREFS(L) (L)->datarefs
......@@ -1172,6 +1286,17 @@ vect_nunits_for_cost (tree vec_type)
return estimated_poly_value (TYPE_VECTOR_SUBPARTS (vec_type));
}
/* Return the maximum possible vectorization factor for LOOP_VINFO. */
static inline unsigned HOST_WIDE_INT
vect_max_vf (loop_vec_info loop_vinfo)
{
unsigned HOST_WIDE_INT vf;
if (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&vf))
return vf;
return MAX_VECTORIZATION_FACTOR;
}
/* Return the size of the value accessed by unvectorized data reference DR.
This is only valid once STMT_VINFO_VECTYPE has been calculated for the
associated gimple statement, since that guarantees that DR accesses
......@@ -1194,8 +1319,8 @@ extern source_location vect_location;
/* Simple loop peeling and versioning utilities for vectorizer's purposes -
in tree-vect-loop-manip.c. */
extern void slpeel_make_loop_iterate_ntimes (struct loop *, tree, tree,
tree, bool);
extern void vect_set_loop_condition (struct loop *, loop_vec_info,
tree, tree, tree, bool);
extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
struct loop *, edge);
......@@ -1212,6 +1337,7 @@ extern tree get_vectype_for_scalar_type (tree);
extern tree get_vectype_for_scalar_type_and_size (tree, poly_uint64);
extern tree get_mask_type_for_scalar_type (tree);
extern tree get_same_sized_vectype (tree, tree);
extern bool vect_get_loop_mask_type (loop_vec_info);
extern bool vect_is_simple_use (tree, vec_info *, gimple **,
enum vect_def_type *);
extern bool vect_is_simple_use (tree, vec_info *, gimple **,
......@@ -1266,6 +1392,7 @@ extern bool vect_supportable_shift (enum tree_code, tree);
extern tree vect_gen_perm_mask_any (tree, const vec_perm_indices &);
extern tree vect_gen_perm_mask_checked (tree, const vec_perm_indices &);
extern void optimize_mask_stores (struct loop*);
extern gcall *vect_gen_while (tree, tree, tree);
/* In tree-vect-data-refs.c. */
extern bool vect_can_force_dr_alignment_p (const_tree, unsigned int);
......@@ -1322,6 +1449,13 @@ extern loop_vec_info vect_analyze_loop (struct loop *, loop_vec_info);
extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
tree *, bool);
extern tree vect_halve_mask_nunits (tree);
extern tree vect_double_mask_nunits (tree);
extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
unsigned int, tree);
extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
unsigned int, tree, unsigned int);
/* Drive for loop transformation stage. */
extern struct loop *vect_transform_loop (loop_vec_info);
extern loop_vec_info vect_analyze_loop_form (struct loop *);
......@@ -557,6 +557,7 @@ namespace wi
BINARY_FUNCTION udiv_floor (const T1 &, const T2 &);
BINARY_FUNCTION sdiv_floor (const T1 &, const T2 &);
BINARY_FUNCTION div_ceil (const T1 &, const T2 &, signop, bool * = 0);
BINARY_FUNCTION udiv_ceil (const T1 &, const T2 &);
BINARY_FUNCTION div_round (const T1 &, const T2 &, signop, bool * = 0);
BINARY_FUNCTION divmod_trunc (const T1 &, const T2 &, signop,
WI_BINARY_RESULT (T1, T2) *);
......@@ -2677,6 +2678,14 @@ wi::div_ceil (const T1 &x, const T2 &y, signop sgn, bool *overflow)
return quotient;
}
/* Return X / Y, rounding towards +inf. Treat X and Y as unsigned values. */
template <typename T1, typename T2>
inline WI_BINARY_RESULT (T1, T2)
wi::udiv_ceil (const T1 &x, const T2 &y)
{
return div_ceil (x, y, UNSIGNED);
}
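wi::udiv_ceil is used above by the latch-count adjustments in vect_transform_loop: when the final vector iteration may be partial, the scalar bound has to be divided rounding up rather than down. A hedged numeric example (ignoring the bias adjustment):

  #include <stdio.h>

  int
  main (void)
  {
    unsigned int niters = 10, vf = 4;
    /* Fully-masked: ceil (10 / 4) = 3 vector iterations, 2 latch edges.  */
    unsigned int masked_latch = (niters + vf - 1) / vf - 1;
    /* Unmasked: floor (10 / 4) = 2 vector iterations, 1 latch edge;
       the remaining scalars go to the epilogue.  */
    unsigned int unmasked_latch = niters / vf - 1;
    printf ("%u %u\n", masked_latch, unmasked_latch);
    return 0;
  }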
/* Return X / Y, rounding towards nearest with ties away from zero.
Treat X and Y as having the signedness given by SGN. Indicate
in *OVERFLOW if the result overflows. */