Commit 535e7c11 by Richard Sandiford, committed by Richard Sandiford

Handle peeling for alignment with masking

This patch adds support for aligning vectors by using a partial
first iteration.  E.g. if the start pointer is 3 elements beyond
an aligned address, the first iteration will have a mask in which
the first three elements are false.

On SVE, the optimisation is only useful for vector-length-specific
code.  Vector-length-agnostic code doesn't try to align vectors
since the vector length might not be a power of 2.
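
As a rough illustration (my own sketch, not code from the patch), the lane
masks such a loop would use can be modelled in plain C: with 8-element
vectors and a start pointer 3 elements past an aligned boundary, lanes 0-2
of the first mask are false and every later iteration works on fully
aligned vectors.

/* Illustrative model only; VF, niters and skip are assumptions local to
   this example, not names used by the vectorizer.  */
#include <stdio.h>

#define VF 8	/* elements per vector iteration (assumed) */

int
main (void)
{
  unsigned int niters = 21;	/* scalar iterations to perform */
  unsigned int skip = 3;	/* misalignment in elements */

  /* Lane i of vector iteration v handles scalar iteration v * VF + i - skip;
     it is active iff that value lies in [0, niters).  */
  for (unsigned int v = 0; v * VF < niters + skip; ++v)
    {
      printf ("iteration %u mask:", v);
      for (unsigned int i = 0; i < VF; ++i)
	{
	  long elt = (long) (v * VF + i) - (long) skip;
	  printf (" %d", elt >= 0 && elt < (long) niters ? 1 : 0);
	}
      printf ("\n");
    }
  return 0;
}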

2018-01-13  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* tree-vectorizer.h (_loop_vec_info::mask_skip_niters): New field.
	(LOOP_VINFO_MASK_SKIP_NITERS): New macro.
	(vect_use_loop_mask_for_alignment_p): New function.
	(vect_prepare_for_masked_peels, vect_gen_while_not): Declare.
	* tree-vect-loop-manip.c (vect_set_loop_masks_directly): Add an
	niters_skip argument.  Make sure that the first niters_skip elements
	of the first iteration are inactive.
	(vect_set_loop_condition_masked): Handle LOOP_VINFO_MASK_SKIP_NITERS.
	Update call to vect_set_loop_masks_directly.
	(get_misalign_in_elems): New function, split out from...
	(vect_gen_prolog_loop_niters): ...here.
	(vect_update_init_of_dr): Take a code argument that specifies whether
	the adjustment should be added or subtracted.
	(vect_update_inits_of_drs): Likewise.
	(vect_prepare_for_masked_peels): New function.
	(vect_do_peeling): Skip prologue peeling if we're using a mask
	instead.  Update call to vect_update_inits_of_drs.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
	mask_skip_niters.
	(vect_analyze_loop_2): Allow fully-masked loops with peeling for
	alignment.  Do not include the number of peeled iterations in
	the minimum threshold in that case.
	(vectorizable_induction): Adjust the start value down by
	LOOP_VINFO_MASK_SKIP_NITERS iterations.
	(vect_transform_loop): Call vect_prepare_for_masked_peels.
	Take the number of skipped iterations into account when calculating
	the loop bounds.
	* tree-vect-stmts.c (vect_gen_while_not): New function.

gcc/testsuite/
	* gcc.target/aarch64/sve/nopeel_1.c: New test.
	* gcc.target/aarch64/sve/peel_ind_1.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_1_run.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_2.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_2_run.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_3.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_3_run.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_4.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_4_run.c: Likewise.

Co-Authored-By: Alan Hayward <alan.hayward@arm.com>
Co-Authored-By: David Sherwood <david.sherwood@arm.com>

From-SVN: r256630
parent c2700f74
gcc/ChangeLog, gcc/testsuite/ChangeLog:
	New entries added; their text is identical to the gcc/ and
	gcc/testsuite/ sections of the commit message above.
/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=256" } */
#include <stdint.h>
#define TEST(NAME, TYPE) \
void \
NAME##1 (TYPE *x, int n) \
{ \
for (int i = 0; i < n; ++i) \
x[i] += 1; \
} \
TYPE NAME##_array[1024]; \
void \
NAME##2 (void) \
{ \
for (int i = 1; i < 200; ++i) \
NAME##_array[i] += 1; \
}
TEST (s8, int8_t)
TEST (u8, uint8_t)
TEST (s16, int16_t)
TEST (u16, uint16_t)
TEST (s32, int32_t)
TEST (u32, uint32_t)
TEST (s64, int64_t)
TEST (u64, uint64_t)
TEST (f16, _Float16)
TEST (f32, float)
TEST (f64, double)
/* No scalar memory accesses. */
/* { dg-final { scan-assembler-not {[wx][0-9]*, \[} } } */
/* 2 for each NAME##1 test, one in the header and one in the main loop
and 1 for each NAME##2 test, in the main loop only. */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b,} 6 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h,} 9 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s,} 9 } } */
/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d,} 9 } } */
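
The whilelo counts above follow from simple bookkeeping; the sketch below
(my own reading of the test, not part of it) just multiplies the three
WHILELOs expected per type by the number of types of each element size.

/* Assumed reading: each NAME##1 loop needs two WHILELOs (one in the
   header, one in the main loop) and each NAME##2 loop needs one.  */
#include <stdio.h>

int
main (void)
{
  int per_type = 2 + 1;	/* NAME##1 + NAME##2 */
  int types_b = 2;	/* int8_t, uint8_t */
  int types_h = 3;	/* int16_t, uint16_t, _Float16 */
  int types_s = 3;	/* int32_t, uint32_t, float */
  int types_d = 3;	/* int64_t, uint64_t, double */
  printf (".b %d  .h %d  .s %d  .d %d\n",
	  per_type * types_b, per_type * types_h,
	  per_type * types_s, per_type * types_d);
  return 0;
}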

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_1.c (new file)

/* { dg-do compile } */
/* Pick an arbitrary target for which unaligned accesses are more
   expensive.  */
/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx" } */

#define N 512
#define START 1
#define END 505

int x[N] __attribute__((aligned(32)));

void __attribute__((noinline, noclone))
foo (void)
{
  unsigned int v = 0;
  for (unsigned int i = START; i < END; ++i)
    {
      x[i] = v;
      v += 5;
    }
}

/* We should operate on aligned vectors.  */
/* { dg-final { scan-assembler {\tadrp\tx[0-9]+, x\n} } } */
/* We should use an induction that starts at -5, with only the last
   7 elements of the first iteration being active.  */
/* { dg-final { scan-assembler {\tindex\tz[0-9]+\.s, #-5, #5\n} } } */
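
A worked version of the expected numbers above (an illustrative sketch, not
part of the test): with 256-bit vectors of 32-bit ints, x[START] is one
element past a 32-byte boundary, so one lane is masked off and the
induction value for lane 0 is the start value adjusted down by one step.

/* Sketch only; the variable names here are not taken from the compiler.  */
#include <stdio.h>

int
main (void)
{
  int vector_bits = 256, elt_bits = 32;
  int vf = vector_bits / elt_bits;	/* 8 lanes */
  int start_index = 1;			/* first element written is x[START] */
  int skip = start_index % vf;		/* leading lanes masked off: 1 */
  int step = 5, init = 0;		/* v starts at 0 and grows by 5 */

  printf ("active lanes in first iteration: %d\n", vf - skip);		/* 7 */
  printf ("lane 0 of the induction vector:  %d\n", init - skip * step);	/* -5 */
  return 0;
}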

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_1_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O3 -mtune=thunderx" } */
/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256" { target aarch64_sve256_hw } } */

#include "peel_ind_1.c"

int __attribute__ ((optimize (1)))
main (void)
{
  foo ();
  for (int i = 0; i < N; ++i)
    {
      if (x[i] != (i < START || i >= END ? 0 : (i - START) * 5))
	__builtin_abort ();
      asm volatile ("" ::: "memory");
    }
  return 0;
}

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_2.c (new file)

/* { dg-do compile } */
/* Pick an arbitrary target for which unaligned accesses are more
   expensive.  */
/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx" } */

#define N 512
#define START 7
#define END 22

int x[N] __attribute__((aligned(32)));

void __attribute__((noinline, noclone))
foo (void)
{
  for (unsigned int i = START; i < END; ++i)
    x[i] = i;
}

/* We should operate on aligned vectors.  */
/* { dg-final { scan-assembler {\tadrp\tx[0-9]+, x\n} } } */
/* We should unroll the loop three times.  */
/* { dg-final { scan-assembler-times "\tst1w\t" 3 } } */

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_2_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O3 -mtune=thunderx" } */
/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256" { target aarch64_sve256_hw } } */

#include "peel_ind_2.c"

int __attribute__ ((optimize (1)))
main (void)
{
  foo ();
  for (int i = 0; i < N; ++i)
    {
      if (x[i] != (i < START || i >= END ? 0 : i))
	__builtin_abort ();
      asm volatile ("" ::: "memory");
    }
  return 0;
}

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c (new file)

/* { dg-do compile } */
/* Pick an arbitrary target for which unaligned accesses are more
   expensive.  */
/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx" } */

#define N 32
#define MAX_START 8
#define COUNT 16

int x[MAX_START][N] __attribute__((aligned(32)));

void __attribute__((noinline, noclone))
foo (int start)
{
  for (int i = start; i < start + COUNT; ++i)
    x[start][i] = i;
}

/* We should operate on aligned vectors.  */
/* { dg-final { scan-assembler {\tadrp\tx[0-9]+, x\n} } } */
/* { dg-final { scan-assembler {\tubfx\t} } } */
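
Here the start index is a runtime value, so the number of leading lanes to
mask off has to be computed from the address itself; the scanned UBFX is
presumably the bitfield extract that pulls those low address bits out.
A rough model of that computation (my own sketch, not compiler code):

/* Assumed parameters: 32-byte vectors of 4-byte ints.  */
#include <stdint.h>
#include <stdio.h>

static unsigned int
misalign_in_elems (const void *p)
{
  /* Low bits of the address, in element units: (addr & 31) >> 2.  */
  return ((uintptr_t) p & 31) >> 2;
}

int
main (void)
{
  int x[16] __attribute__((aligned(32)));
  printf ("%u %u\n", misalign_in_elems (&x[0]), misalign_in_elems (&x[5]));
  return 0;
}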

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-O3 -mtune=thunderx" } */
/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256" { target aarch64_sve256_hw } } */

#include "peel_ind_3.c"

int __attribute__ ((optimize (1)))
main (void)
{
  for (int start = 0; start < MAX_START; ++start)
    {
      foo (start);
      for (int i = 0; i < N; ++i)
	{
	  if (x[start][i] != (i < start || i >= start + COUNT ? 0 : i))
	    __builtin_abort ();
	  asm volatile ("" ::: "memory");
	}
    }
  return 0;
}

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_4.c (new file)

/* { dg-do compile } */
/* Pick an arbitrary target for which unaligned accesses are more
   expensive.  */
/* { dg-options "-Ofast -msve-vector-bits=256 -mtune=thunderx -fno-vect-cost-model" } */

#define START 1
#define END 505

void __attribute__((noinline, noclone))
foo (double *x)
{
  double v = 10.0;
  for (unsigned int i = START; i < END; ++i)
    {
      x[i] = v;
      v += 5.0;
    }
}

/* We should operate on aligned vectors.  */
/* { dg-final { scan-assembler {\tubfx\t} } } */

gcc/testsuite/gcc.target/aarch64/sve/peel_ind_4_run.c (new file)

/* { dg-do run { target aarch64_sve_hw } } */
/* { dg-options "-Ofast -mtune=thunderx" } */
/* { dg-options "-Ofast -mtune=thunderx -msve-vector-bits=256" { target aarch64_sve256_hw } } */

#include "peel_ind_4.c"

int __attribute__ ((optimize (1)))
main (void)
{
  double x[END + 1];
  for (int i = 0; i < END + 1; ++i)
    {
      x[i] = i;
      asm volatile ("" ::: "memory");
    }
  foo (x);
  for (int i = 0; i < END + 1; ++i)
    {
      double expected;
      if (i < START || i >= END)
	expected = i;
      else
	expected = 10 + (i - START) * 5;
      if (x[i] != expected)
	__builtin_abort ();
      asm volatile ("" ::: "memory");
    }
  return 0;
}

gcc/tree-vect-loop.c:

@@ -1121,6 +1121,7 @@ _loop_vec_info::_loop_vec_info (struct loop *loop_in)
     versioning_threshold (0),
     vectorization_factor (0),
     max_vectorization_factor (0),
+    mask_skip_niters (NULL_TREE),
     mask_compare_type (NULL_TREE),
     unaligned_dr (NULL),
     peeling_for_alignment (0),
@@ -2269,16 +2270,6 @@ start_over:
 			 " gaps is required.\n");
     }
 
-  if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
-      && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
-    {
-      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because peeling for"
-			 " alignment is required.\n");
-    }
-
   /* Decide whether to use a fully-masked loop for this vectorization
      factor.  */
   LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
@@ -2379,18 +2370,21 @@ start_over:
      increase threshold for this case if necessary.  */
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
-      poly_uint64 niters_th;
+      poly_uint64 niters_th = 0;
 
-      /* Niters for peeled prolog loop.  */
-      if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
+      if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
 	{
-	  struct data_reference *dr = LOOP_VINFO_UNALIGNED_DR (loop_vinfo);
-	  tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr)));
-	  niters_th = TYPE_VECTOR_SUBPARTS (vectype) - 1;
+	  /* Niters for peeled prolog loop.  */
+	  if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
+	    {
+	      struct data_reference *dr = LOOP_VINFO_UNALIGNED_DR (loop_vinfo);
+	      tree vectype
+		= STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr)));
+	      niters_th += TYPE_VECTOR_SUBPARTS (vectype) - 1;
+	    }
+	  else
+	    niters_th += LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
 	}
-      else
-	niters_th = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
 
       /* Niters for at least one iteration of vectorized loop.  */
       if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
@@ -7336,9 +7330,28 @@ vectorizable_induction (gimple *phi,
       init_expr = PHI_ARG_DEF_FROM_EDGE (phi,
 					 loop_preheader_edge (iv_loop));
 
-      /* Convert the step to the desired type.  */
+      /* Convert the initial value and step to the desired type.  */
       stmts = NULL;
+      init_expr = gimple_convert (&stmts, TREE_TYPE (vectype), init_expr);
       step_expr = gimple_convert (&stmts, TREE_TYPE (vectype), step_expr);
+
+      /* If we are using the loop mask to "peel" for alignment then we need
+	 to adjust the start value here.  */
+      tree skip_niters = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
+      if (skip_niters != NULL_TREE)
+	{
+	  if (FLOAT_TYPE_P (vectype))
+	    skip_niters = gimple_build (&stmts, FLOAT_EXPR, TREE_TYPE (vectype),
+					skip_niters);
+	  else
+	    skip_niters = gimple_convert (&stmts, TREE_TYPE (vectype),
+					  skip_niters);
+	  tree skip_step = gimple_build (&stmts, MULT_EXPR, TREE_TYPE (vectype),
+					 skip_niters, step_expr);
+	  init_expr = gimple_build (&stmts, MINUS_EXPR, TREE_TYPE (vectype),
+				    init_expr, skip_step);
+	}
+
       if (stmts)
 	{
 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
@@ -8209,6 +8222,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
     split_edge (loop_preheader_edge (loop));
 
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    /* This will deal with any possible peeling.  */
+    vect_prepare_for_masked_peels (loop_vinfo);
+
   /* FORNOW: the vectorizer supports only loops which body consist
      of one basic block (header + empty latch).  When the vectorizer will
      support more involved loop forms, the order by which the BBs are
@@ -8488,29 +8506,40 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   /* +1 to convert latch counts to loop iteration counts,
      -min_epilogue_iters to remove iterations that cannot be performed
      by the vector code.  */
-  int bias = 1 - min_epilogue_iters;
+  int bias_for_lowest = 1 - min_epilogue_iters;
+  int bias_for_assumed = bias_for_lowest;
+  int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+  if (alignment_npeels && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      /* When the amount of peeling is known at compile time, the first
+	 iteration will have exactly alignment_npeels active elements.
+	 In the worst case it will have at least one.  */
+      int min_first_active = (alignment_npeels > 0 ? alignment_npeels : 1);
+      bias_for_lowest += lowest_vf - min_first_active;
+      bias_for_assumed += assumed_vf - min_first_active;
+    }
   /* In these calculations the "- 1" converts loop iteration counts
      back to latch counts.  */
   if (loop->any_upper_bound)
     loop->nb_iterations_upper_bound
       = (final_iter_may_be_partial
-	 ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias,
+	 ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
 			  lowest_vf) - 1
-	 : wi::udiv_floor (loop->nb_iterations_upper_bound + bias,
+	 : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
 			   lowest_vf) - 1);
   if (loop->any_likely_upper_bound)
     loop->nb_iterations_likely_upper_bound
       = (final_iter_may_be_partial
-	 ? wi::udiv_ceil (loop->nb_iterations_likely_upper_bound + bias,
-			  lowest_vf) - 1
-	 : wi::udiv_floor (loop->nb_iterations_likely_upper_bound + bias,
-			   lowest_vf) - 1);
+	 ? wi::udiv_ceil (loop->nb_iterations_likely_upper_bound
+			  + bias_for_lowest, lowest_vf) - 1
+	 : wi::udiv_floor (loop->nb_iterations_likely_upper_bound
+			   + bias_for_lowest, lowest_vf) - 1);
   if (loop->any_estimate)
     loop->nb_iterations_estimate
       = (final_iter_may_be_partial
-	 ? wi::udiv_ceil (loop->nb_iterations_estimate + bias,
+	 ? wi::udiv_ceil (loop->nb_iterations_estimate + bias_for_assumed,
 			  assumed_vf) - 1
-	 : wi::udiv_floor (loop->nb_iterations_estimate + bias,
+	 : wi::udiv_floor (loop->nb_iterations_estimate + bias_for_assumed,
 			   assumed_vf) - 1);
 
   if (dump_enabled_p ())
......
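
For peel_ind_1.c the bias adjustment in the vect_transform_loop hunk above
works out as follows (an illustrative calculation, not compiler output):
504 scalar iterations, VF = 8, 7 active lanes in the first masked iteration
and, assuming no epilogue is required, min_epilogue_iters == 0, so
bias_for_lowest is 1 - 0 + (8 - 7) = 2 and the ceiling division for a
possibly partial final iteration gives a latch-count bound of 63, i.e.
64 vector iterations in total.

/* The same arithmetic as a check, with the assumed values spelled out.  */
#include <stdio.h>

int
main (void)
{
  unsigned long latch_bound = 503;	/* scalar iterations - 1 */
  int vf = 8, min_epilogue_iters = 0, min_first_active = 7;
  int bias = 1 - min_epilogue_iters + (vf - min_first_active);
  unsigned long vec_latch = (latch_bound + bias + vf - 1) / vf - 1;
  printf ("vector latch-count bound: %lu\n", vec_latch);	/* 63 */
  return 0;
}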

gcc/tree-vect-stmts.c:

@@ -9991,3 +9991,16 @@ vect_gen_while (tree mask, tree start_index, tree end_index)
   gimple_call_set_lhs (call, mask);
   return call;
 }
+
+/* Generate a vector mask of type MASK_TYPE for which index I is false iff
+   J + START_INDEX < END_INDEX for all J <= I.  Add the statements to SEQ.  */
+
+tree
+vect_gen_while_not (gimple_seq *seq, tree mask_type, tree start_index,
+		    tree end_index)
+{
+  tree tmp = make_ssa_name (mask_type);
+  gcall *call = vect_gen_while (tmp, start_index, end_index);
+  gimple_seq_add_stmt (seq, call);
+  return gimple_build (seq, BIT_NOT_EXPR, mask_type, tmp);
+}
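
vect_gen_while_not simply inverts the mask produced by vect_gen_while
(WHILE_ULT), which is presumably combined with the normal loop mask so that
the leading niters_skip lanes of the first iteration end up inactive.
A lane-by-lane model of the two masks (an assumption-laden sketch, not the
GIMPLE the function emits):

#include <stdio.h>

#define LANES 8	/* assumed vector length in elements */

int
main (void)
{
  unsigned int start = 0, end = 3;	/* e.g. the number of lanes to skip */
  printf ("lane  while_ult  while_not\n");
  for (unsigned int i = 0; i < LANES; ++i)
    {
      int while_ult = (start + i) < end;	/* lane i of WHILE_ULT (start, end) */
      printf ("%4u  %9d  %9d\n", i, while_ult, !while_ult);
    }
  return 0;
}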

gcc/tree-vectorizer.h:

@@ -351,6 +351,12 @@ typedef struct _loop_vec_info : public vec_info {
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* If we are using a loop mask to align memory addresses, this variable
+     contains the number of vector elements that we should skip in the
+     first iteration of the vector loop (i.e. the number of leading
+     elements that should be false in the first mask).  */
+  tree mask_skip_niters;
+
   /* Type of the variables to use in the WHILE_ULT call for fully-masked
      loops.  */
   tree mask_compare_type;
@@ -480,6 +486,7 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_MASK_COMPARE_TYPE(L)    (L)->mask_compare_type
 #define LOOP_VINFO_PTR_MASK(L)             (L)->ptr_mask
 #define LOOP_VINFO_LOOP_NEST(L)            (L)->loop_nest
@@ -1230,6 +1237,17 @@ unlimited_cost_model (loop_p loop)
   return (flag_vect_cost_model == VECT_COST_MODEL_UNLIMITED);
 }
 
+/* Return true if the loop described by LOOP_VINFO is fully-masked and
+   if the first iteration should use a partial mask in order to achieve
+   alignment.  */
+
+static inline bool
+vect_use_loop_mask_for_alignment_p (loop_vec_info loop_vinfo)
+{
+  return (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	  && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo));
+}
+
 /* Return the number of vectors of type VECTYPE that are needed to get
    NUNITS elements.  NUNITS should be based on the vectorization factor,
    so it is always a known multiple of the number of elements in VECTYPE.  */
@@ -1328,6 +1346,7 @@ extern void vect_loop_versioning (loop_vec_info, unsigned int, bool,
 						 poly_uint64);
 extern struct loop *vect_do_peeling (loop_vec_info, tree, tree,
 				     tree *, tree *, tree *, int, bool, bool);
+extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern source_location find_loop_location (struct loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
@@ -1393,6 +1412,7 @@ extern tree vect_gen_perm_mask_any (tree, const vec_perm_indices &);
 extern tree vect_gen_perm_mask_checked (tree, const vec_perm_indices &);
 extern void optimize_mask_stores (struct loop*);
 extern gcall *vect_gen_while (tree, tree, tree);
+extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
 
 /* In tree-vect-data-refs.c.  */
 extern bool vect_can_force_dr_alignment_p (const_tree, unsigned int);
......