Commit 8179efe0 by Richard Sandiford Committed by Richard Sandiford

[AArch64] Prefer LD1RQ for big-endian SVE

This patch deals with cases in which a CONST_VECTOR contains a
repeating bit pattern that is wider than one element but narrower
than 128 bits.  The current code:

* treats the repeating pattern as a single element
* uses the associated LD1R to load and replicate it (such as LD1RD
  for 64-bit patterns)
* uses a subreg to cast the result back to the original vector type

The problem is that for big-endian targets, the final cast is
effectively a form of element reverse.  E.g. say we're using LD1RD to load
16-bit elements, with h being the high parts and l being the low parts:

                               +-----+-----+-----+-----+-----+----
                         lanes |  0  |  1  |  2  |  3  |  4  | ...
                               +-----+-----+-----+-----+-----+----
     memory              bytes |h0 l0 h1 l1 h2 l2 h3 l3 h0 l0 ....
                               +----------------------------------
                                 V  V  V  V  V  V  V  V
                     ----------+-----------------------+
    register         ....      |           0           |
     after           ----------+-----------------------+  lsb
     LD1RD           .... h3 l3 h0 l0 h1 l1 h2 l2 h3 l3|
                     ----------------------------------+

                     ----+-----+-----+-----+-----+-----+
    expected         ... |  4  |  3  |  2  |  1  |  0  |
    register         ----+-----+-----+-----+-----+-----+  lsb
    contents         .... h0 l0 h3 l3 h2 l2 h1 l1 h0 l0|
                     ----------------------------------+

A later patch fixes the handling of general subregs to account
for this, but it means that we need to do a REV instruction
after the load.  It seems better to use LD1RQ[BHW] on a 128-bit
pattern instead, since that gets the endianness right without
a separate fixup instruction.

2018-02-01  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Prefer
	the TImode handling for big-endian targets.

gcc/testsuite/
	* gcc.target/aarch64/sve/slp_2.c: Expect LD1RQ to be used instead
	of LD1R[HWD] for multi-element constants on big-endian targets.
	* gcc.target/aarch64/sve/slp_3.c: Likewise.
	* gcc.target/aarch64/sve/slp_4.c: Likewise.

Reviewed-by: James Greenhalgh <james.greenhalgh@arm.com>

From-SVN: r257288
parent 947b1372
2018-02-01 Richard Sandiford <richard.sandiford@linaro.org> 2018-02-01 Richard Sandiford <richard.sandiford@linaro.org>
* config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Prefer
the TImode handling for big-endian targets.
2018-02-01 Richard Sandiford <richard.sandiford@linaro.org>
* config/aarch64/aarch64-sve.md (sve_ld1rq): Replace with... * config/aarch64/aarch64-sve.md (sve_ld1rq): Replace with...
(*sve_ld1rq<Vesize>): ... this new pattern. Handle all element sizes, (*sve_ld1rq<Vesize>): ... this new pattern. Handle all element sizes,
not just bytes. not just bytes.
......
...@@ -2824,10 +2824,18 @@ aarch64_expand_sve_const_vector (rtx dest, rtx src) ...@@ -2824,10 +2824,18 @@ aarch64_expand_sve_const_vector (rtx dest, rtx src)
/* The constant is a repeating seqeuence of at least two elements, /* The constant is a repeating seqeuence of at least two elements,
where the repeating elements occupy no more than 128 bits. where the repeating elements occupy no more than 128 bits.
Get an integer representation of the replicated value. */ Get an integer representation of the replicated value. */
unsigned int int_bits = GET_MODE_UNIT_BITSIZE (mode) * npatterns; scalar_int_mode int_mode;
gcc_assert (int_bits <= 128); if (BYTES_BIG_ENDIAN)
/* For now, always use LD1RQ to load the value on big-endian
scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require (); targets, since the handling of smaller integers includes a
subreg that is semantically an element reverse. */
int_mode = TImode;
else
{
unsigned int int_bits = GET_MODE_UNIT_BITSIZE (mode) * npatterns;
gcc_assert (int_bits <= 128);
int_mode = int_mode_for_size (int_bits, 0).require ();
}
rtx int_value = simplify_gen_subreg (int_mode, src, mode, 0); rtx int_value = simplify_gen_subreg (int_mode, src, mode, 0);
if (int_value if (int_value
&& aarch64_expand_sve_widened_duplicate (dest, int_mode, int_value)) && aarch64_expand_sve_widened_duplicate (dest, int_mode, int_value))
......
2018-02-01 Richard Sandiford <richard.sandiford@linaro.org> 2018-02-01 Richard Sandiford <richard.sandiford@linaro.org>
* gcc.target/aarch64/sve/slp_2.c: Expect LD1RQ to be used instead
of LD1R[HWD] for multi-element constants on big-endian targets.
* gcc.target/aarch64/sve/slp_3.c: Likewise.
* gcc.target/aarch64/sve/slp_4.c: Likewise.
2018-02-01 Richard Sandiford <richard.sandiford@linaro.org>
* gcc.target/aarch64/sve/slp_2.c: Expect LD1RQD rather than LD1RQB. * gcc.target/aarch64/sve/slp_2.c: Expect LD1RQD rather than LD1RQB.
* gcc.target/aarch64/sve/slp_3.c: Expect LD1RQW rather than LD1RQB. * gcc.target/aarch64/sve/slp_3.c: Expect LD1RQW rather than LD1RQB.
* gcc.target/aarch64/sve/slp_4.c: Expect LD1RQH rather than LD1RQB. * gcc.target/aarch64/sve/slp_4.c: Expect LD1RQH rather than LD1RQB.
......
...@@ -29,9 +29,12 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \ ...@@ -29,9 +29,12 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
TEST_ALL (VEC_PERM) TEST_ALL (VEC_PERM)
/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 } } */ /* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 { target aarch64_little_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 } } */ /* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */ /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 { target aarch64_little_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 { target aarch64_little_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 { target aarch64_big_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 3 } } */ /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 3 } } */
/* { dg-final { scan-assembler-not {\tzip1\t} } } */ /* { dg-final { scan-assembler-not {\tzip1\t} } } */
/* { dg-final { scan-assembler-not {\tzip2\t} } } */ /* { dg-final { scan-assembler-not {\tzip2\t} } } */
......
...@@ -32,9 +32,12 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \ ...@@ -32,9 +32,12 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
TEST_ALL (VEC_PERM) TEST_ALL (VEC_PERM)
/* 1 for each 8-bit type. */ /* 1 for each 8-bit type. */
/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */ /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 { target aarch64_little_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
/* 1 for each 16-bit type and 4 for double. */ /* 1 for each 16-bit type and 4 for double. */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 } } */ /* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 { target aarch64_little_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 { target aarch64_big_endian } } } */
/* 1 for each 32-bit type. */ /* 1 for each 32-bit type. */
/* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */ /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */ /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
......
...@@ -36,7 +36,9 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \ ...@@ -36,7 +36,9 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
TEST_ALL (VEC_PERM) TEST_ALL (VEC_PERM)
/* 1 for each 8-bit type, 4 for each 32-bit type and 8 for double. */ /* 1 for each 8-bit type, 4 for each 32-bit type and 8 for double. */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 } } */ /* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 { target aarch64_little_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 20 { target aarch64_big_endian } } } */
/* 1 for each 16-bit type. */ /* 1 for each 16-bit type. */
/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]\.h, } 3 } } */ /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]\.h, } 3 } } */
/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */ /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment