Commit 2c58d42c by Richard Sandiford

Use conditional internal functions in if-conversion

This patch uses IFN_COND_* to vectorise conditionally-executed,
potentially-trapping arithmetic, such as most floating-point
ops with -ftrapping-math.  E.g.:

    if (cond) { ... x = a + b; ... }

becomes:

    ...
    x = .COND_ADD (cond, a, b, else_value);
    ...

When this transformation is done on its own, the value of x for
!cond isn't important, so else_value is simply the target's
preferred_else_value (i.e. the value it can handle the most
efficiently).
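
As an illustrative sketch (not literal output of the pass): if x is
merged with an earlier value x_0 at the join point, if-conversion
still emits a select after the conditional operation:

    x_1 = .COND_ADD (cond, a, b, else_value);
    x = cond ? x_1 : x_0;

so the else value of the .COND_ADD itself is free to be whatever the
target handles best.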

However, the patch also looks for the equivalent of:

    y = cond ? x : c;

in which the "then" value is the result of the conditionally-executed
operation and the "else" value "c" is some value that is available at x.
In that case we can instead use:

    x = .COND_ADD (cond, a, b, c);

and replace uses of y with uses of x.
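
For example, a loop of the kind exercised by the new
vect-cond-arith-4.c test (simplified here, with the iteration count
as a parameter rather than the test's fixed N):

    void
    f (double *restrict a, double *restrict b, double x, int n)
    {
      for (int i = 0; i < n; ++i)
        a[i] = b[i] < 100 ? b[i] + x : b[i];
    }

previously could not be if-converted with -ftrapping-math, because
the addition might trap for the lanes where b[i] >= 100.  It can now
be vectorized, on targets with conditional FP arithmetic, into a
single .COND_ADD that uses b[i] as the else value, with no
VEC_COND_EXPR select left over.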

The patch also looks for:

    y = !cond ? c : x;

which can be transformed in the same way.  This involved adding a new
utility function, inverse_conditions_p, for a check that match.pd
previously open-coded in a more limited form.
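
For example (again an illustrative sketch), with an integer
predicate p:

    x = .COND_ADD (p != 1, a, b, else_value);
    y = p == 1 ? c : x;

becomes:

    x = .COND_ADD (p != 1, a, b, c);
    y = x;

since p == 1 tests the inverse condition of p != 1 on the same
operands, which is exactly the check inverse_conditions_p performs
(taking NaN handling into account for floating-point comparisons).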

2018-07-12  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* fold-const.h (inverse_conditions_p): Declare.
	* fold-const.c (inverse_conditions_p): New function.
	* match.pd: Use inverse_conditions_p.  Add folds of view_converts
	that test the inverse condition of a conditional internal function.
	* internal-fn.h (vectorized_internal_fn_supported_p): Declare.
	* internal-fn.c (internal_fn_mask_index): Handle conditional
	internal functions.
	(vectorized_internal_fn_supported_p): New function.
	* tree-if-conv.c: Include internal-fn.h and fold-const.h.
	(any_pred_load_store): Replace with...
	(need_to_predicate): ...this new variable.
	(redundant_ssa_names): New variable.
	(ifcvt_can_use_mask_load_store): Move initial checks to...
	(ifcvt_can_predicate): ...this new function.  Handle tree codes
	for which a conditional internal function exists.
	(if_convertible_gimple_assign_stmt_p): Use ifcvt_can_predicate
	instead of ifcvt_can_use_mask_load_store.  Update after variable
	name change.
	(predicate_load_or_store): New function, split out from
	predicate_mem_writes.
	(check_redundant_cond_expr): New function.
	(value_available_p): Likewise.
	(predicate_rhs_code): Likewise.
	(predicate_mem_writes): Rename to...
	(predicate_statements): ...this.  Use predicate_load_or_store
	and predicate_rhs_code.
	(combine_blocks, tree_if_conversion): Update after above name changes.
	(ifcvt_local_dce): Handle redundant_ssa_names.
	* tree-vect-patterns.c (vect_recog_mask_conversion_pattern): Handle
	general conditional functions.
	* tree-vect-stmts.c (vectorizable_call): Likewise.

gcc/testsuite/
	* gcc.dg/vect/vect-cond-arith-4.c: New test.
	* gcc.dg/vect/vect-cond-arith-5.c: Likewise.
	* gcc.target/aarch64/sve/cond_arith_1.c: Likewise.
	* gcc.target/aarch64/sve/cond_arith_1_run.c: Likewise.
	* gcc.target/aarch64/sve/cond_arith_2.c: Likewise.
	* gcc.target/aarch64/sve/cond_arith_2_run.c: Likewise.
	* gcc.target/aarch64/sve/cond_arith_3.c: Likewise.
	* gcc.target/aarch64/sve/cond_arith_3_run.c: Likewise.

From-SVN: r262589
@@ -2787,6 +2787,22 @@ compcode_to_comparison (enum comparison_code code)
}
}
/* Return true if COND1 tests the opposite condition of COND2. */
bool
inverse_conditions_p (const_tree cond1, const_tree cond2)
{
return (COMPARISON_CLASS_P (cond1)
&& COMPARISON_CLASS_P (cond2)
&& (invert_tree_comparison
(TREE_CODE (cond1),
HONOR_NANS (TREE_OPERAND (cond1, 0))) == TREE_CODE (cond2))
&& operand_equal_p (TREE_OPERAND (cond1, 0),
TREE_OPERAND (cond2, 0), 0)
&& operand_equal_p (TREE_OPERAND (cond1, 1),
TREE_OPERAND (cond2, 1), 0));
}
/* Return a tree for the comparison which is the combination of
doing the AND or OR (depending on CODE) of the two operations LCODE
and RCODE on the identical operands LL_ARG and LR_ARG. Take into account
@@ -127,6 +127,7 @@ extern enum tree_code swap_tree_comparison (enum tree_code);
extern bool ptr_difference_const (tree, tree, poly_int64_pod *);
extern enum tree_code invert_tree_comparison (enum tree_code, bool);
extern bool inverse_conditions_p (const_tree, const_tree);
extern bool tree_unary_nonzero_warnv_p (enum tree_code, tree, tree, bool *);
extern bool tree_binary_nonzero_warnv_p (enum tree_code, tree, tree, tree op1,
@@ -3466,7 +3466,8 @@ internal_fn_mask_index (internal_fn fn)
return 4;
default:
return -1;
return (conditional_internal_fn_code (fn) != ERROR_MARK
|| get_unconditional_internal_fn (fn) != IFN_LAST ? 0 : -1);
}
}
@@ -3531,6 +3532,26 @@ expand_internal_call (gcall *stmt)
expand_internal_call (gimple_call_internal_fn (stmt), stmt);
}
/* If TYPE is a vector type, return true if IFN is a direct internal
function that is supported for that type. If TYPE is a scalar type,
return true if IFN is a direct internal function that is supported for
the target's preferred vector version of TYPE. */
bool
vectorized_internal_fn_supported_p (internal_fn ifn, tree type)
{
scalar_mode smode;
if (!VECTOR_TYPE_P (type) && is_a <scalar_mode> (TYPE_MODE (type), &smode))
{
machine_mode vmode = targetm.vectorize.preferred_simd_mode (smode);
if (VECTOR_MODE_P (vmode))
type = build_vector_type_for_mode (type, vmode);
}
return (VECTOR_MODE_P (TYPE_MODE (type))
&& direct_internal_fn_supported_p (ifn, type, OPTIMIZE_FOR_SPEED));
}
void
expand_PHI (internal_fn, gcall *)
{
@@ -212,4 +212,6 @@ extern void expand_internal_call (gcall *);
extern void expand_internal_call (internal_fn, gcall *);
extern void expand_PHI (internal_fn, gcall *);
extern bool vectorized_internal_fn_supported_p (internal_fn, tree);
#endif
@@ -2951,21 +2951,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
from if-conversion. */
(simplify
(cnd @0 @1 (cnd @2 @3 @4))
(if (COMPARISON_CLASS_P (@0)
&& COMPARISON_CLASS_P (@2)
&& invert_tree_comparison
(TREE_CODE (@0), HONOR_NANS (TREE_OPERAND (@0, 0))) == TREE_CODE (@2)
&& operand_equal_p (TREE_OPERAND (@0, 0), TREE_OPERAND (@2, 0), 0)
&& operand_equal_p (TREE_OPERAND (@0, 1), TREE_OPERAND (@2, 1), 0))
(if (inverse_conditions_p (@0, @2))
(cnd @0 @1 @3)))
(simplify
(cnd @0 (cnd @1 @2 @3) @4)
(if (COMPARISON_CLASS_P (@0)
&& COMPARISON_CLASS_P (@1)
&& invert_tree_comparison
(TREE_CODE (@0), HONOR_NANS (TREE_OPERAND (@0, 0))) == TREE_CODE (@1)
&& operand_equal_p (TREE_OPERAND (@0, 0), TREE_OPERAND (@1, 0), 0)
&& operand_equal_p (TREE_OPERAND (@0, 1), TREE_OPERAND (@1, 1), 0))
(if (inverse_conditions_p (@0, @1))
(cnd @0 @3 @4)))
/* A ? B : B -> B. */
@@ -4913,7 +4903,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(vec_cond @0 (view_convert? (cond_op @0 @1 @2 @3)) @4)
(with { tree op_type = TREE_TYPE (@3); }
(if (element_precision (type) == element_precision (op_type))
(view_convert (cond_op @0 @1 @2 (view_convert:op_type @4)))))))
(view_convert (cond_op @0 @1 @2 (view_convert:op_type @4))))))
(simplify
(vec_cond @0 @1 (view_convert? (cond_op @2 @3 @4 @5)))
(with { tree op_type = TREE_TYPE (@5); }
(if (inverse_conditions_p (@0, @2)
&& element_precision (type) == element_precision (op_type))
(view_convert (cond_op @2 @3 @4 (view_convert:op_type @1)))))))
/* Same for ternary operations. */
(for cond_op (COND_TERNARY)
@@ -4921,4 +4917,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(vec_cond @0 (view_convert? (cond_op @0 @1 @2 @3 @4)) @5)
(with { tree op_type = TREE_TYPE (@4); }
(if (element_precision (type) == element_precision (op_type))
(view_convert (cond_op @0 @1 @2 @3 (view_convert:op_type @5)))))))
(view_convert (cond_op @0 @1 @2 @3 (view_convert:op_type @5))))))
(simplify
(vec_cond @0 @1 (view_convert? (cond_op @2 @3 @4 @5 @6)))
(with { tree op_type = TREE_TYPE (@6); }
(if (inverse_conditions_p (@0, @2)
&& element_precision (type) == element_precision (op_type))
(view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1)))))))
/* { dg-additional-options "-fdump-tree-optimized" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 11 / 64 + 3)
#define add(A, B) ((A) + (B))
#define sub(A, B) ((A) - (B))
#define mul(A, B) ((A) * (B))
#define div(A, B) ((A) / (B))
#define DEF(OP) \
void __attribute__ ((noipa)) \
f_##OP (double *restrict a, double *restrict b, double x) \
{ \
for (int i = 0; i < N; ++i) \
a[i] = b[i] < 100 ? OP (b[i], x) : b[i]; \
}
#define TEST(OP) \
{ \
f_##OP (a, b, 10); \
for (int i = 0; i < N; ++i) \
{ \
int bval = (i % 17) * 10; \
int truev = OP (bval, 10); \
if (a[i] != (bval < 100 ? truev : bval)) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
#define FOR_EACH_OP(T) \
T (add) \
T (sub) \
T (mul) \
T (div)
FOR_EACH_OP (DEF)
int
main (void)
{
double a[N], b[N];
for (int i = 0; i < N; ++i)
{
b[i] = (i % 17) * 10;
asm volatile ("" ::: "memory");
}
FOR_EACH_OP (TEST)
return 0;
}
/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target vect_double_cond_arith } } } */
/* { dg-additional-options "-fdump-tree-optimized" } */
#include "tree-vect.h"
#define N (VECTOR_BITS * 11 / 64 + 3)
#define add(A, B) ((A) + (B))
#define sub(A, B) ((A) - (B))
#define mul(A, B) ((A) * (B))
#define div(A, B) ((A) / (B))
#define DEF(OP) \
void __attribute__ ((noipa)) \
f_##OP (double *restrict a, double *restrict b, double x) \
{ \
for (int i = 0; i < N; ++i) \
if (b[i] < 100) \
a[i] = OP (b[i], x); \
}
#define TEST(OP) \
{ \
f_##OP (a, b, 10); \
for (int i = 0; i < N; ++i) \
{ \
int bval = (i % 17) * 10; \
int truev = OP (bval, 10); \
if (a[i] != (bval < 100 ? truev : i * 3)) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
#define FOR_EACH_OP(T) \
T (add) \
T (sub) \
T (mul) \
T (div)
FOR_EACH_OP (DEF)
int
main (void)
{
double a[N], b[N];
for (int i = 0; i < N; ++i)
{
a[i] = i * 3;
b[i] = (i % 17) * 10;
asm volatile ("" ::: "memory");
}
FOR_EACH_OP (TEST)
return 0;
}
/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define TEST(TYPE, NAME, OP) \
void __attribute__ ((noinline, noclone)) \
test_##TYPE##_##NAME (TYPE *__restrict x, \
TYPE *__restrict y, \
TYPE *__restrict z, \
TYPE *__restrict pred, int n) \
{ \
for (int i = 0; i < n; ++i) \
x[i] = pred[i] != 1 ? y[i] OP z[i] : y[i]; \
}
#define TEST_INT_TYPE(TYPE) \
TEST (TYPE, div, /)
#define TEST_FP_TYPE(TYPE) \
TEST (TYPE, add, +) \
TEST (TYPE, sub, -) \
TEST (TYPE, mul, *) \
TEST (TYPE, div, /)
#define TEST_ALL \
TEST_INT_TYPE (int8_t) \
TEST_INT_TYPE (uint8_t) \
TEST_INT_TYPE (int16_t) \
TEST_INT_TYPE (uint16_t) \
TEST_INT_TYPE (int32_t) \
TEST_INT_TYPE (uint32_t) \
TEST_INT_TYPE (int64_t) \
TEST_INT_TYPE (uint64_t) \
TEST_FP_TYPE (float) \
TEST_FP_TYPE (double)
TEST_ALL
/* { dg-final { scan-assembler-not {\t.div\tz[0-9]+\.b} } } */ \
/* { dg-final { scan-assembler-not {\t.div\tz[0-9]+\.h} } } */ \
/* { dg-final { scan-assembler-times {\tsdiv\tz[0-9]+\.s, p[0-7]/m,} 7 } } */
/* At present we don't vectorize the uint8_t or uint16_t loops because the
division is done directly in the narrow type, rather than being widened
to int first. */
/* { dg-final { scan-assembler-times {\tudiv\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tsdiv\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tudiv\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfsub\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfsub\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfmul\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfmul\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfdiv\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfdiv\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* We fail to optimize away the SEL for the int8_t and int16_t loops,
because the 32-bit result is converted before selection. */
/* { dg-final { scan-assembler-times {\tsel\t} 2 } } */
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "cond_arith_1.c"
#define N 99
#undef TEST
#define TEST(TYPE, NAME, OP) \
{ \
TYPE x[N], y[N], z[N], pred[N]; \
for (int i = 0; i < N; ++i) \
{ \
y[i] = i * i; \
z[i] = ((i + 2) % 3) * (i + 1); \
pred[i] = i % 3; \
} \
test_##TYPE##_##NAME (x, y, z, pred, N); \
for (int i = 0; i < N; ++i) \
{ \
TYPE expected = i % 3 != 1 ? y[i] OP z[i] : y[i]; \
if (x[i] != expected) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
int
main (void)
{
TEST_ALL
return 0;
}
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -fno-vect-cost-model" } */
#include <stdint.h>
#define TEST(DATA_TYPE, PRED_TYPE, NAME, OP) \
void __attribute__ ((noinline, noclone)) \
test_##DATA_TYPE##_##PRED_TYPE##_##NAME (DATA_TYPE *__restrict x, \
DATA_TYPE *__restrict y, \
DATA_TYPE *__restrict z, \
PRED_TYPE *__restrict pred, \
int n) \
{ \
for (int i = 0; i < n; ++i) \
x[i] = pred[i] != 1 ? y[i] OP z[i] : y[i]; \
}
#define TEST_INT_TYPE(DATA_TYPE, PRED_TYPE) \
TEST (DATA_TYPE, PRED_TYPE, div, /)
#define TEST_FP_TYPE(DATA_TYPE, PRED_TYPE) \
TEST (DATA_TYPE, PRED_TYPE, add, +) \
TEST (DATA_TYPE, PRED_TYPE, sub, -) \
TEST (DATA_TYPE, PRED_TYPE, mul, *) \
TEST (DATA_TYPE, PRED_TYPE, div, /)
#define TEST_ALL \
TEST_INT_TYPE (int32_t, int8_t) \
TEST_INT_TYPE (uint32_t, int8_t) \
TEST_INT_TYPE (int32_t, int16_t) \
TEST_INT_TYPE (uint32_t, int16_t) \
TEST_INT_TYPE (int64_t, int8_t) \
TEST_INT_TYPE (uint64_t, int8_t) \
TEST_INT_TYPE (int64_t, int16_t) \
TEST_INT_TYPE (uint64_t, int16_t) \
TEST_INT_TYPE (int64_t, int32_t) \
TEST_INT_TYPE (uint64_t, int32_t) \
TEST_FP_TYPE (float, int8_t) \
TEST_FP_TYPE (float, int16_t) \
TEST_FP_TYPE (double, int8_t) \
TEST_FP_TYPE (double, int16_t) \
TEST_FP_TYPE (double, int32_t)
TEST_ALL
/* { dg-final { scan-assembler-times {\tsdiv\tz[0-9]+\.s, p[0-7]/m,} 6 } } */
/* { dg-final { scan-assembler-times {\tudiv\tz[0-9]+\.s, p[0-7]/m,} 6 } } */
/* { dg-final { scan-assembler-times {\tsdiv\tz[0-9]+\.d, p[0-7]/m,} 14 } } */
/* { dg-final { scan-assembler-times {\tudiv\tz[0-9]+\.d, p[0-7]/m,} 14 } } */
/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.s, p[0-7]/m,} 6 } } */
/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.d, p[0-7]/m,} 14 } } */
/* { dg-final { scan-assembler-times {\tfsub\tz[0-9]+\.s, p[0-7]/m,} 6 } } */
/* { dg-final { scan-assembler-times {\tfsub\tz[0-9]+\.d, p[0-7]/m,} 14 } } */
/* { dg-final { scan-assembler-times {\tfmul\tz[0-9]+\.s, p[0-7]/m,} 6 } } */
/* { dg-final { scan-assembler-times {\tfmul\tz[0-9]+\.d, p[0-7]/m,} 14 } } */
/* { dg-final { scan-assembler-times {\tfdiv\tz[0-9]+\.s, p[0-7]/m,} 6 } } */
/* { dg-final { scan-assembler-times {\tfdiv\tz[0-9]+\.d, p[0-7]/m,} 14 } } */
/* { dg-final { scan-assembler-not {\tsel\t} } } */
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "cond_arith_2.c"
#define N 99
#undef TEST
#define TEST(DATA_TYPE, PRED_TYPE, NAME, OP) \
{ \
DATA_TYPE x[N], y[N], z[N]; \
PRED_TYPE pred[N]; \
for (int i = 0; i < N; ++i) \
{ \
y[i] = i * i; \
z[i] = ((i + 2) % 3) * (i + 1); \
pred[i] = i % 3; \
} \
test_##DATA_TYPE##_##PRED_TYPE##_##NAME (x, y, z, pred, N); \
for (int i = 0; i < N; ++i) \
{ \
DATA_TYPE expected = i % 3 != 1 ? y[i] OP z[i] : y[i]; \
if (x[i] != expected) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
int
main (void)
{
TEST_ALL
return 0;
}
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include <stdint.h>
#define TEST(TYPE, NAME, OP) \
void __attribute__ ((noinline, noclone)) \
test_##TYPE##_##NAME (TYPE *__restrict x, \
TYPE *__restrict y, \
TYPE *__restrict z, \
TYPE *__restrict pred, int n) \
{ \
for (int i = 0; i < n; ++i) \
x[i] = pred[i] != 1 ? y[i] OP z[i] : 1; \
}
#define TEST_INT_TYPE(TYPE) \
TEST (TYPE, div, /)
#define TEST_FP_TYPE(TYPE) \
TEST (TYPE, add, +) \
TEST (TYPE, sub, -) \
TEST (TYPE, mul, *) \
TEST (TYPE, div, /)
#define TEST_ALL \
TEST_INT_TYPE (int8_t) \
TEST_INT_TYPE (uint8_t) \
TEST_INT_TYPE (int16_t) \
TEST_INT_TYPE (uint16_t) \
TEST_INT_TYPE (int32_t) \
TEST_INT_TYPE (uint32_t) \
TEST_INT_TYPE (int64_t) \
TEST_INT_TYPE (uint64_t) \
TEST_FP_TYPE (float) \
TEST_FP_TYPE (double)
TEST_ALL
/* { dg-final { scan-assembler-not {\t.div\tz[0-9]+\.b} } } */ \
/* { dg-final { scan-assembler-not {\t.div\tz[0-9]+\.h} } } */ \
/* { dg-final { scan-assembler-times {\tsdiv\tz[0-9]+\.s, p[0-7]/m,} 7 } } */
/* At present we don't vectorize the uint8_t or uint16_t loops because the
division is done directly in the narrow type, rather than being widened
to int first. */
/* { dg-final { scan-assembler-times {\tudiv\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tsdiv\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tudiv\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfadd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfsub\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfsub\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfmul\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfmul\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfdiv\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tfdiv\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
/* { dg-final { scan-assembler-times {\tsel\t} 14 } } */
/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O2 -ftree-vectorize" } */
#include "cond_arith_3.c"
#define N 99
#undef TEST
#define TEST(TYPE, NAME, OP) \
{ \
TYPE x[N], y[N], z[N], pred[N]; \
for (int i = 0; i < N; ++i) \
{ \
x[i] = -1; \
y[i] = i * i; \
z[i] = ((i + 2) % 3) * (i + 1); \
pred[i] = i % 3; \
} \
test_##TYPE##_##NAME (x, y, z, pred, N); \
for (int i = 0; i < N; ++i) \
{ \
TYPE expected = i % 3 != 1 ? y[i] OP z[i] : 1; \
if (x[i] != expected) \
__builtin_abort (); \
asm volatile ("" ::: "memory"); \
} \
}
int
main (void)
{
TEST_ALL
return 0;
}
@@ -3905,65 +3905,68 @@ vect_recog_mask_conversion_pattern (stmt_vec_info stmt_vinfo, tree *type_out)
/* Check for MASK_LOAD and MASK_STORE calls requiring mask conversion. */
if (is_gimple_call (last_stmt)
&& gimple_call_internal_p (last_stmt)
&& (gimple_call_internal_fn (last_stmt) == IFN_MASK_STORE
|| gimple_call_internal_fn (last_stmt) == IFN_MASK_LOAD))
&& gimple_call_internal_p (last_stmt))
{
gcall *pattern_stmt;
bool load = (gimple_call_internal_fn (last_stmt) == IFN_MASK_LOAD);
if (load)
internal_fn ifn = gimple_call_internal_fn (last_stmt);
int mask_argno = internal_fn_mask_index (ifn);
if (mask_argno < 0)
return NULL;
bool store_p = internal_store_fn_p (ifn);
if (store_p)
{
lhs = gimple_call_lhs (last_stmt);
vectype1 = get_vectype_for_scalar_type (TREE_TYPE (lhs));
int rhs_index = internal_fn_stored_value_index (ifn);
tree rhs = gimple_call_arg (last_stmt, rhs_index);
vectype1 = get_vectype_for_scalar_type (TREE_TYPE (rhs));
}
else
{
rhs2 = gimple_call_arg (last_stmt, 3);
vectype1 = get_vectype_for_scalar_type (TREE_TYPE (rhs2));
lhs = gimple_call_lhs (last_stmt);
vectype1 = get_vectype_for_scalar_type (TREE_TYPE (lhs));
}
rhs1 = gimple_call_arg (last_stmt, 2);
rhs1_type = search_type_for_mask (rhs1, vinfo);
if (!rhs1_type)
tree mask_arg = gimple_call_arg (last_stmt, mask_argno);
tree mask_arg_type = search_type_for_mask (mask_arg, vinfo);
if (!mask_arg_type)
return NULL;
vectype2 = get_mask_type_for_scalar_type (rhs1_type);
vectype2 = get_mask_type_for_scalar_type (mask_arg_type);
if (!vectype1 || !vectype2
|| known_eq (TYPE_VECTOR_SUBPARTS (vectype1),
TYPE_VECTOR_SUBPARTS (vectype2)))
return NULL;
tmp = build_mask_conversion (rhs1, vectype1, stmt_vinfo);
tmp = build_mask_conversion (mask_arg, vectype1, stmt_vinfo);
if (load)
auto_vec<tree, 8> args;
unsigned int nargs = gimple_call_num_args (last_stmt);
args.safe_grow (nargs);
for (unsigned int i = 0; i < nargs; ++i)
args[i] = ((int) i == mask_argno
? tmp
: gimple_call_arg (last_stmt, i));
pattern_stmt = gimple_build_call_internal_vec (ifn, args);
if (!store_p)
{
lhs = vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL);
pattern_stmt
= gimple_build_call_internal (IFN_MASK_LOAD, 3,
gimple_call_arg (last_stmt, 0),
gimple_call_arg (last_stmt, 1),
tmp);
gimple_call_set_lhs (pattern_stmt, lhs);
}
else
pattern_stmt
= gimple_build_call_internal (IFN_MASK_STORE, 4,
gimple_call_arg (last_stmt, 0),
gimple_call_arg (last_stmt, 1),
tmp,
gimple_call_arg (last_stmt, 3));
gimple_call_set_nothrow (pattern_stmt, true);
pattern_stmt_info = new_stmt_vec_info (pattern_stmt, vinfo);
set_vinfo_for_stmt (pattern_stmt, pattern_stmt_info);
STMT_VINFO_DATA_REF (pattern_stmt_info)
= STMT_VINFO_DATA_REF (stmt_vinfo);
STMT_VINFO_DR_WRT_VEC_LOOP (pattern_stmt_info)
= STMT_VINFO_DR_WRT_VEC_LOOP (stmt_vinfo);
STMT_VINFO_GATHER_SCATTER_P (pattern_stmt_info)
= STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo);
if (STMT_VINFO_DATA_REF (stmt_vinfo))
{
STMT_VINFO_DATA_REF (pattern_stmt_info)
= STMT_VINFO_DATA_REF (stmt_vinfo);
STMT_VINFO_DR_WRT_VEC_LOOP (pattern_stmt_info)
= STMT_VINFO_DR_WRT_VEC_LOOP (stmt_vinfo);
STMT_VINFO_GATHER_SCATTER_P (pattern_stmt_info)
= STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo);
}
*type_out = vectype1;
vect_pattern_detected ("vect_recog_mask_conversion_pattern", last_stmt);
@@ -3126,12 +3126,14 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
vec_info *vinfo = stmt_info->vinfo;
tree fndecl, new_temp, rhs_type;
enum vect_def_type dt[3]
= {vect_unknown_def_type, vect_unknown_def_type, vect_unknown_def_type};
int ndts = 3;
enum vect_def_type dt[4]
= { vect_unknown_def_type, vect_unknown_def_type, vect_unknown_def_type,
vect_unknown_def_type };
int ndts = ARRAY_SIZE (dt);
gimple *new_stmt = NULL;
int ncopies, j;
vec<tree> vargs = vNULL;
auto_vec<tree, 8> vargs;
auto_vec<tree, 8> orig_vargs;
enum { NARROW, NONE, WIDEN } modifier;
size_t i, nargs;
tree lhs;
@@ -3170,22 +3172,38 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
/* Bail out if the function has more than three arguments, we do not have
interesting builtin functions to vectorize with more than two arguments
except for fma. No arguments is also not good. */
if (nargs == 0 || nargs > 3)
if (nargs == 0 || nargs > 4)
return false;
/* Ignore the argument of IFN_GOMP_SIMD_LANE, it is magic. */
if (gimple_call_internal_p (stmt)
&& gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
combined_fn cfn = gimple_call_combined_fn (stmt);
if (cfn == CFN_GOMP_SIMD_LANE)
{
nargs = 0;
rhs_type = unsigned_type_node;
}
int mask_opno = -1;
if (internal_fn_p (cfn))
mask_opno = internal_fn_mask_index (as_internal_fn (cfn));
for (i = 0; i < nargs; i++)
{
tree opvectype;
op = gimple_call_arg (stmt, i);
if (!vect_is_simple_use (op, vinfo, &dt[i], &opvectype))
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"use not simple.\n");
return false;
}
/* Skip the mask argument to an internal function. This operand
has been converted via a pattern if necessary. */
if ((int) i == mask_opno)
continue;
/* We can only handle calls with arguments of the same type. */
if (rhs_type
@@ -3199,14 +3217,6 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
if (!rhs_type)
rhs_type = TREE_TYPE (op);
if (!vect_is_simple_use (op, vinfo, &dt[i], &opvectype))
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"use not simple.\n");
return false;
}
if (!vectype_in)
vectype_in = opvectype;
else if (opvectype
@@ -3264,7 +3274,6 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
to vectorize other operations in the loop. */
fndecl = NULL_TREE;
internal_fn ifn = IFN_LAST;
combined_fn cfn = gimple_call_combined_fn (stmt);
tree callee = gimple_call_fndecl (stmt);
/* First try using an internal function. */
@@ -3328,6 +3337,7 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
needs to be generated. */
gcc_assert (ncopies >= 1);
vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
if (!vec_stmt) /* transformation not required. */
{
STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
@@ -3337,6 +3347,13 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
record_stmt_cost (cost_vec, ncopies / 2,
vec_promote_demote, stmt_info, 0, vect_body);
if (loop_vinfo && mask_opno >= 0)
{
unsigned int nvectors = (slp_node
? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node)
: ncopies);
vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out);
}
return true;
}
@@ -3349,25 +3366,24 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
scalar_dest = gimple_call_lhs (stmt);
vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
prev_stmt_info = NULL;
if (modifier == NONE || ifn != IFN_LAST)
{
tree prev_res = NULL_TREE;
vargs.safe_grow (nargs);
orig_vargs.safe_grow (nargs);
for (j = 0; j < ncopies; ++j)
{
/* Build argument list for the vectorized call. */
if (j == 0)
vargs.create (nargs);
else
vargs.truncate (0);
if (slp_node)
{
auto_vec<vec<tree> > vec_defs (nargs);
vec<tree> vec_oprnds0;
for (i = 0; i < nargs; i++)
vargs.quick_push (gimple_call_arg (stmt, i));
vargs[i] = gimple_call_arg (stmt, i);
vect_get_slp_defs (vargs, slp_node, &vec_defs);
vec_oprnds0 = vec_defs[0];
@@ -3382,6 +3398,9 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
}
if (modifier == NARROW)
{
/* We don't define any narrowing conditional functions
at present. */
gcc_assert (mask_opno < 0);
tree half_res = make_ssa_name (vectype_in);
gcall *call
= gimple_build_call_internal_vec (ifn, vargs);
@@ -3400,6 +3419,17 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
}
else
{
if (mask_opno >= 0 && masked_loop_p)
{
unsigned int vec_num = vec_oprnds0.length ();
/* Always true for SLP. */
gcc_assert (ncopies == 1);
tree mask = vect_get_loop_mask (gsi, masks, vec_num,
vectype_out, i);
vargs[mask_opno] = prepare_load_store_mask
(TREE_TYPE (mask), mask, vargs[mask_opno], gsi);
}
gcall *call;
if (ifn != IFN_LAST)
call = gimple_build_call_internal_vec (ifn, vargs);
@@ -3429,17 +3459,22 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
vec_oprnd0
= vect_get_vec_def_for_operand (op, stmt);
else
{
vec_oprnd0 = gimple_call_arg (new_stmt, i);
vec_oprnd0
= vect_get_vec_def_for_stmt_copy (dt[i], vec_oprnd0);
}
vec_oprnd0
= vect_get_vec_def_for_stmt_copy (dt[i], orig_vargs[i]);
vargs.quick_push (vec_oprnd0);
orig_vargs[i] = vargs[i] = vec_oprnd0;
}
if (mask_opno >= 0 && masked_loop_p)
{
tree mask = vect_get_loop_mask (gsi, masks, ncopies,
vectype_out, j);
vargs[mask_opno]
= prepare_load_store_mask (TREE_TYPE (mask), mask,
vargs[mask_opno], gsi);
}
if (gimple_call_internal_p (stmt)
&& gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
if (cfn == CFN_GOMP_SIMD_LANE)
{
tree cst = build_index_vector (vectype_out, j * nunits_out, 1);
tree new_var
@@ -3451,6 +3486,9 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
}
else if (modifier == NARROW)
{
/* We don't define any narrowing conditional functions at
present. */
gcc_assert (mask_opno < 0);
tree half_res = make_ssa_name (vectype_in);
gcall *call = gimple_build_call_internal_vec (ifn, vargs);
gimple_call_set_lhs (call, half_res);
@@ -3490,6 +3528,8 @@ vectorizable_call (gimple *gs, gimple_stmt_iterator *gsi, gimple **vec_stmt,
}
else if (modifier == NARROW)
{
/* We don't define any narrowing conditional functions at present. */
gcc_assert (mask_opno < 0);
for (j = 0; j < ncopies; ++j)
{
/* Build argument list for the vectorized call. */
......