[AArch64] Prefer LD1RQ for big-endian SVE

This patch deals with cases in which a CONST_VECTOR contains a repeating bit pattern that is wider than one element but narrower than 128 bits. The current code: * treats the repeating pattern as a single element * uses the associated LD1R to load and replicate it (such as LD1RD for 64-bit patterns) * uses a subreg to cast the result back to the original vector type The problem is that for big-endian targets, the final cast is effectively a form of element reverse. E.g. say we're using LD1RD to load 16-bit elements, with h being the high parts and l being the low parts: +-----+-----+-----+-----+-----+---- lanes | 0 | 1 | 2 | 3 | 4 | ... +-----+-----+-----+-----+-----+---- memory bytes |h0 l0 h1 l1 h2 l2 h3 l3 h0 l0 .... +---------------------------------- V V V V V V V V ----------+-----------------------+ register .... | 0 | after ----------+-----------------------+ lsb LD1RD .... h3 l3 h0 l0 h1 l1 h2 l2 h3 l3| ----------------------------------+ ----+-----+-----+-----+-----+-----+ expected ... | 4 | 3 | 2 | 1 | 0 | register ----+-----+-----+-----+-----+-----+ lsb contents .... h0 l0 h3 l3 h2 l2 h1 l1 h0 l0| ----------------------------------+ A later patch fixes the handling of general subregs to account for this, but it means that we need to do a REV instruction after the load. It seems better to use LD1RQ[BHW] on a 128-bit pattern instead, since that gets the endianness right without a separate fixup instruction. 2018-02-01 Richard Sandiford <richard.sandiford@linaro.org> gcc/ * config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Prefer the TImode handling for big-endian targets. gcc/testsuite/ * gcc.target/aarch64/sve/slp_2.c: Expect LD1RQ to be used instead of LD1R[HWD] for multi-element constants on big-endian targets. * gcc.target/aarch64/sve/slp_3.c: Likewise. * gcc.target/aarch64/sve/slp_4.c: Likewise. Reviewed-by: James Greenhalgh <james.greenhalgh@arm.com> From-SVN: r257288

[AArch64] Prefer LD1RQ for big-endian SVE
This patch deals with cases in which a CONST_VECTOR contains a repeating bit pattern that is wider than one element but narrower than 128 bits. The current code: * treats the repeating pattern as a single element * uses the associated LD1R to load and replicate it (such as LD1RD for 64-bit patterns) * uses a subreg to cast the result back to the original vector type The problem is that for big-endian targets, the final cast is effectively a form of element reverse. E.g. say we're using LD1RD to load 16-bit elements, with h being the high parts and l being the low parts: +-----+-----+-----+-----+-----+---- lanes | 0 | 1 | 2 | 3 | 4 | ... +-----+-----+-----+-----+-----+---- memory bytes |h0 l0 h1 l1 h2 l2 h3 l3 h0 l0 .... +---------------------------------- V V V V V V V V ----------+-----------------------+ register .... | 0 | after ----------+-----------------------+ lsb LD1RD .... h3 l3 h0 l0 h1 l1 h2 l2 h3 l3| ----------------------------------+ ----+-----+-----+-----+-----+-----+ expected ... | 4 | 3 | 2 | 1 | 0 | register ----+-----+-----+-----+-----+-----+ lsb contents .... h0 l0 h3 l3 h2 l2 h1 l1 h0 l0| ----------------------------------+ A later patch fixes the handling of general subregs to account for this, but it means that we need to do a REV instruction after the load. It seems better to use LD1RQ[BHW] on a 128-bit pattern instead, since that gets the endianness right without a separate fixup instruction. 2018-02-01 Richard Sandiford <richard.sandiford@linaro.org> gcc/ * config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Prefer the TImode handling for big-endian targets. gcc/testsuite/ * gcc.target/aarch64/sve/slp_2.c: Expect LD1RQ to be used instead of LD1R[HWD] for multi-element constants on big-endian targets. * gcc.target/aarch64/sve/slp_3.c: Likewise. * gcc.target/aarch64/sve/slp_4.c: Likewise. Reviewed-by: James Greenhalgh <james.greenhalgh@arm.com> From-SVN: r257288
8179efe0 · Richard Sandiford · Richard Sandiford · 947b1372 · 8179efe0 · 8179efe0
Commit 8179efe0 authored Feb 01, 2018 by Richard Sandiford Committed by Richard Sandiford Feb 01, 2018
6 changed files
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
 2018-02-01  Richard Sandiford  <richard.sandiford@linaro.org>

+	* config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Prefer
+	the TImode handling for big-endian targets.
+
+2018-02-01  Richard Sandiford  <richard.sandiford@linaro.org>
+
 	* config/aarch64/aarch64-sve.md (sve_ld1rq): Replace with...
 	(*sve_ld1rq<Vesize>): ... this new pattern.  Handle all element sizes,
 	not just bytes.

--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2824,10 +2824,18 @@ aarch64_expand_sve_const_vector (rtx dest, rtx src)
      /* The constant is a repeating seqeuence of at least two elements,
 	 where the repeating elements occupy no more than 128 bits.
 	 Get an integer representation of the replicated value.  */
-      unsigned int int_bits = GET_MODE_UNIT_BITSIZE (mode) * npatterns;
-      gcc_assert (int_bits <= 128);
-
-      scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require ();
+      scalar_int_mode int_mode;
+      if (BYTES_BIG_ENDIAN)
+	/* For now, always use LD1RQ to load the value on big-endian
+	   targets, since the handling of smaller integers includes a
+	   subreg that is semantically an element reverse.  */
+	int_mode = TImode;
+      else
+	{
+	  unsigned int int_bits = GET_MODE_UNIT_BITSIZE (mode) * npatterns;
+	  gcc_assert (int_bits <= 128);
+	  int_mode = int_mode_for_size (int_bits, 0).require ();
+	}
      rtx int_value = simplify_gen_subreg (int_mode, src, mode, 0);
      if (int_value
 	  && aarch64_expand_sve_widened_duplicate (dest, int_mode, int_value))

--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
 2018-02-01  Richard Sandiford  <richard.sandiford@linaro.org>

+	* gcc.target/aarch64/sve/slp_2.c: Expect LD1RQ to be used instead
+	of LD1R[HWD] for multi-element constants on big-endian targets.
+	* gcc.target/aarch64/sve/slp_3.c: Likewise.
+	* gcc.target/aarch64/sve/slp_4.c: Likewise.
+
+2018-02-01  Richard Sandiford  <richard.sandiford@linaro.org>
+
 	* gcc.target/aarch64/sve/slp_2.c: Expect LD1RQD rather than LD1RQB.
 	* gcc.target/aarch64/sve/slp_3.c: Expect LD1RQW rather than LD1RQB.
 	* gcc.target/aarch64/sve/slp_4.c: Expect LD1RQH rather than LD1RQB.

--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_2.c
@@ -29,9 +29,12 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\

 TEST_ALL (VEC_PERM)

-/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 } } */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 } } */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 { target aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 { target aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 { target aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 { target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 3 } } */
 /* { dg-final { scan-assembler-not {\tzip1\t} } } */
 /* { dg-final { scan-assembler-not {\tzip2\t} } } */

--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
@@ -32,9 +32,12 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
 TEST_ALL (VEC_PERM)

 /* 1 for each 8-bit type.  */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 { target aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
 /* 1 for each 16-bit type and 4 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 { target aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 { target aarch64_big_endian } } } */
 /* 1 for each 32-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */

--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
@@ -36,7 +36,9 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
 TEST_ALL (VEC_PERM)

 /* 1 for each 8-bit type, 4 for each 32-bit type and 8 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 { target aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 20 { target aarch64_big_endian } } } */
 /* 1 for each 16-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]\.h, } 3 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */