Commit 4aeb1ba7, authored and committed by Richard Sandiford

[AArch64] Improve SVE constant moves

If there's no SVE instruction to load a given constant directly, this
patch instead tries to use an Advanced SIMD constant move and then
duplicates the constant to fill an SVE vector.  The main use of this
is to support constants in which each byte is in { 0, 0xff }.
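
For example (this mirrors the new const_1.c test added below), a store
loop whose constant has every byte equal to 0x00 or 0xff:

    #include <stdint.h>

    void
    set (uint64_t *dst, int count)
    {
      for (int i = 0; i < count; ++i)
        dst[i] = 0xffff00ff00ffff00ULL;   /* each byte is 0x00 or 0xff */
    }

can now materialise the constant with an Advanced SIMD MOVI into a
128-bit register followed by "dup z0.q, z0.q[0]" to fill the SVE vector,
instead of loading it from the constant pool.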

Also, the patch prefers a simple integer move followed by a duplicate
over a load from memory, like we already do for Advanced SIMD.  This is
a useful option to have and would be easy to turn off via a tuning
parameter if necessary.
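
For example (in the style of the new const_3.c test below; the function
name here is illustrative), broadcasting a 16-bit constant such as
0x1234, which has no single-instruction SVE or Advanced SIMD encoding:

    #include <stdint.h>

    void
    set_u16 (uint16_t *dst, int count)
    {
      for (int i = 0; i < count; ++i)
        dst[i] = 0x1234;
    }

is now expected to use "mov w0, 4660" followed by "mov z0.h, w0" rather
than an LD1RH from the constant pool.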

The patch also extends the handling of wide LD1Rs to big endian,
whereas previously we punted to a full LD1RQ.
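
For instance (a sketch in the spirit of the slp_2.c test; the constants
are purely illustrative), a loop that adds a repeating pair of 32-bit
constants:

    #include <stdint.h>

    void
    vec_slp (int32_t *restrict a, int n)
    {
      for (int i = 0; i < n; ++i)
        {
          a[i * 2] += 10;
          a[i * 2 + 1] += 17;
        }
    }

needs the constant vector { 10, 17, 10, 17, ... }, which is a broadcast
of a single 64-bit chunk.  Little endian already used LD1RD for this;
with this patch big endian does too, instead of falling back to LD1RQW.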

2019-08-13  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* machmode.h (opt_mode::else_mode): New function.
	(opt_mode::else_blk): Use it.
	* config/aarch64/aarch64-protos.h (aarch64_vq_mode): Declare.
	(aarch64_full_sve_mode, aarch64_sve_ld1rq_operand_p): Likewise.
	(aarch64_gen_stepped_int_parallel): Likewise.
	(aarch64_stepped_int_parallel_p): Likewise.
	(aarch64_expand_mov_immediate): Remove the optional gen_vec_duplicate
	argument.
	* config/aarch64/aarch64.c
	(aarch64_expand_sve_widened_duplicate): Delete.
	(aarch64_expand_sve_dupq, aarch64_expand_sve_ld1rq): New functions.
	(aarch64_expand_sve_const_vector): Rewrite to handle more cases.
	(aarch64_expand_mov_immediate): Remove the optional gen_vec_duplicate
	argument.  Use early returns in the !CONST_INT_P handling.
	Pass all SVE data vectors to aarch64_expand_sve_const_vector rather
	than handling some inline.
	(aarch64_full_sve_mode, aarch64_vq_mode): New functions, split out
	from...
	(aarch64_simd_container_mode): ...here.
	(aarch64_gen_stepped_int_parallel, aarch64_stepped_int_parallel_p)
	(aarch64_sve_ld1rq_operand_p): New functions.
	* config/aarch64/predicates.md (descending_int_parallel)
	(aarch64_sve_ld1rq_operand): New predicates.
	* config/aarch64/constraints.md (UtQ): New constraint.
	* config/aarch64/aarch64.md (UNSPEC_REINTERPRET): New unspec.
	* config/aarch64/aarch64-sve.md (mov<SVE_ALL:mode>): Remove the
	gen_vec_duplicate from call to aarch64_expand_mov_immediate.
	(@aarch64_sve_reinterpret<mode>): New expander.
	(*aarch64_sve_reinterpret<mode>): New pattern.
	(@aarch64_vec_duplicate_vq<mode>_le): New pattern.
	(@aarch64_vec_duplicate_vq<mode>_be): Likewise.
	(*sve_ld1rq<Vesize>): Replace with...
	(@aarch64_sve_ld1rq<mode>): ...this new pattern.

gcc/testsuite/
	* gcc.target/aarch64/sve/init_2.c: Expect ld1rd to be used
	instead of a full vector load.
	* gcc.target/aarch64/sve/init_4.c: Likewise.
	* gcc.target/aarch64/sve/ld1r_2.c: Remove constants that no longer
	need to be loaded from memory.
	* gcc.target/aarch64/sve/slp_2.c: Expect the same output for
	big and little endian.
	* gcc.target/aarch64/sve/slp_3.c: Likewise.  Expect 3 of the
	doubles to be moved via integer registers rather than loaded
	from memory.
	* gcc.target/aarch64/sve/slp_4.c: Likewise but for 4 doubles.
	* gcc.target/aarch64/sve/spill_4.c: Expect 16-bit constants to be
	loaded via an integer register rather than from memory.
	* gcc.target/aarch64/sve/const_1.c: New test.
	* gcc.target/aarch64/sve/const_2.c: Likewise.
	* gcc.target/aarch64/sve/const_3.c: Likewise.

From-SVN: r274375
Parent: 4e55aefa
gcc/config/aarch64/aarch64-protos.h

@@ -416,6 +416,8 @@ unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
 bool aarch64_and_bitmask_imm (unsigned HOST_WIDE_INT val_in, machine_mode mode);
 int aarch64_branch_cost (bool, bool);
 enum aarch64_symbol_type aarch64_classify_symbolic_expression (rtx);
+opt_machine_mode aarch64_vq_mode (scalar_mode);
+opt_machine_mode aarch64_full_sve_mode (scalar_mode);
 bool aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode);
 bool aarch64_const_vec_all_same_int_p (rtx, HOST_WIDE_INT);
 bool aarch64_const_vec_all_same_in_range_p (rtx, HOST_WIDE_INT,
@@ -504,9 +506,12 @@ rtx aarch64_return_addr (int, rtx);
 rtx aarch64_simd_gen_const_vector_dup (machine_mode, HOST_WIDE_INT);
 bool aarch64_simd_mem_operand_p (rtx);
 bool aarch64_sve_ld1r_operand_p (rtx);
+bool aarch64_sve_ld1rq_operand_p (rtx);
 bool aarch64_sve_ldr_operand_p (rtx);
 bool aarch64_sve_struct_memory_operand_p (rtx);
 rtx aarch64_simd_vect_par_cnst_half (machine_mode, int, bool);
+rtx aarch64_gen_stepped_int_parallel (unsigned int, int, int);
+bool aarch64_stepped_int_parallel_p (rtx, int);
 rtx aarch64_tls_get_addr (void);
 tree aarch64_fold_builtin (tree, int, tree *, bool);
 unsigned aarch64_dbx_register_number (unsigned);
@@ -518,7 +523,7 @@ const char * aarch64_output_probe_stack_range (rtx, rtx);
 const char * aarch64_output_probe_sve_stack_clash (rtx, rtx, rtx, rtx);
 void aarch64_err_no_fpadvsimd (machine_mode);
 void aarch64_expand_epilogue (bool);
-void aarch64_expand_mov_immediate (rtx, rtx, rtx (*) (rtx, rtx) = 0);
+void aarch64_expand_mov_immediate (rtx, rtx);
 rtx aarch64_ptrue_reg (machine_mode);
 rtx aarch64_pfalse_reg (machine_mode);
 void aarch64_emit_sve_pred_move (rtx, rtx, rtx);
gcc/config/aarch64/aarch64-sve.md

@@ -207,8 +207,7 @@
     if (CONSTANT_P (operands[1]))
       {
-	aarch64_expand_mov_immediate (operands[0], operands[1],
-				      gen_vec_duplicate<mode>);
+	aarch64_expand_mov_immediate (operands[0], operands[1]);
 	DONE;
       }
@@ -326,6 +325,39 @@
   }
 )

+;; Reinterpret operand 1 in operand 0's mode, without changing its contents.
+;; This is equivalent to a subreg on little-endian targets but not for
+;; big-endian; see the comment at the head of the file for details.
+(define_expand "@aarch64_sve_reinterpret<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand")
+	(unspec:SVE_ALL [(match_operand 1 "aarch64_any_register_operand")]
+			UNSPEC_REINTERPRET))]
+  "TARGET_SVE"
+  {
+    if (!BYTES_BIG_ENDIAN)
+      {
+	emit_move_insn (operands[0], gen_lowpart (<MODE>mode, operands[1]));
+	DONE;
+      }
+  }
+)
+
+;; A pattern for handling type punning on big-endian targets.  We use a
+;; special predicate for operand 1 to reduce the number of patterns.
+(define_insn_and_split "*aarch64_sve_reinterpret<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
+	(unspec:SVE_ALL [(match_operand 1 "aarch64_any_register_operand" "0")]
+			UNSPEC_REINTERPRET))]
+  "TARGET_SVE"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (match_dup 1))]
+  {
+    emit_note (NOTE_INSN_DELETED);
+    DONE;
+  }
+)
+
 ;; -------------------------------------------------------------------------
 ;; ---- Moves of multiple vectors
 ;; -------------------------------------------------------------------------
@@ -787,6 +819,39 @@
   [(set_attr "length" "4,4,8")]
 )

+;; Duplicate an Advanced SIMD vector to fill an SVE vector (LE version).
+(define_insn "@aarch64_vec_duplicate_vq<mode>_le"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
+	(vec_duplicate:SVE_ALL
+	  (match_operand:<V128> 1 "register_operand" "w")))]
+  "TARGET_SVE && !BYTES_BIG_ENDIAN"
+  {
+    operands[1] = gen_rtx_REG (<MODE>mode, REGNO (operands[1]));
+    return "dup\t%0.q, %1.q[0]";
+  }
+)
+
+;; Duplicate an Advanced SIMD vector to fill an SVE vector (BE version).
+;; The SVE register layout puts memory lane N into (architectural)
+;; register lane N, whereas the Advanced SIMD layout puts the memory
+;; lsb into the register lsb.  We therefore have to describe this in rtl
+;; terms as a reverse of the V128 vector followed by a duplicate.
+(define_insn "@aarch64_vec_duplicate_vq<mode>_be"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
+	(vec_duplicate:SVE_ALL
+	  (vec_select:<V128>
+	    (match_operand:<V128> 1 "register_operand" "w")
+	    (match_operand 2 "descending_int_parallel"))))]
+  "TARGET_SVE
+   && BYTES_BIG_ENDIAN
+   && known_eq (INTVAL (XVECEXP (operands[2], 0, 0)),
+		GET_MODE_NUNITS (<V128>mode) - 1)"
+  {
+    operands[1] = gen_rtx_REG (<MODE>mode, REGNO (operands[1]));
+    return "dup\t%0.q, %1.q[0]";
+  }
+)
+
 ;; This is used for vec_duplicate<mode>s from memory, but can also
 ;; be used by combine to optimize selects of a a vec_duplicate<mode>
 ;; with zero.
@@ -802,17 +867,19 @@
   "ld1r<Vesize>\t%0.<Vetype>, %1/z, %2"
 )

-;; Load 128 bits from memory and duplicate to fill a vector.  Since there
-;; are so few operations on 128-bit "elements", we don't define a VNx1TI
-;; and simply use vectors of bytes instead.
-(define_insn "*sve_ld1rq<Vesize>"
+;; Load 128 bits from memory under predicate control and duplicate to
+;; fill a vector.
+(define_insn "@aarch64_sve_ld1rq<mode>"
   [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
 	(unspec:SVE_ALL
-	  [(match_operand:<VPRED> 1 "register_operand" "Upl")
-	   (match_operand:TI 2 "aarch64_sve_ld1r_operand" "Uty")]
+	  [(match_operand:<VPRED> 2 "register_operand" "Upl")
+	   (match_operand:<V128> 1 "aarch64_sve_ld1rq_operand" "UtQ")]
 	  UNSPEC_LD1RQ))]
   "TARGET_SVE"
-  "ld1rq<Vesize>\t%0.<Vetype>, %1/z, %2"
+  {
+    operands[1] = gen_rtx_MEM (<VEL>mode, XEXP (operands[1], 0));
+    return "ld1rq<Vesize>\t%0.<Vetype>, %2/z, %1";
+  }
 )

 ;; -------------------------------------------------------------------------
gcc/config/aarch64/aarch64.md

@@ -234,6 +234,7 @@
     UNSPEC_CLASTB
     UNSPEC_FADDA
     UNSPEC_REV_SUBREG
+    UNSPEC_REINTERPRET
     UNSPEC_SPECULATION_TRACKER
     UNSPEC_COPYSIGN
     UNSPEC_TTEST		; Represent transaction test.
gcc/config/aarch64/constraints.md

@@ -272,6 +272,12 @@
        (match_test "aarch64_legitimate_address_p (V2DImode,
						   XEXP (op, 0), 1)")))

+(define_memory_constraint "UtQ"
+  "@internal
+   An address valid for SVE LD1RQs."
+  (and (match_code "mem")
+       (match_test "aarch64_sve_ld1rq_operand_p (op)")))
+
 (define_memory_constraint "Uty"
   "@internal
    An address valid for SVE LD1Rs."
gcc/config/aarch64/predicates.md

@@ -431,6 +431,12 @@
   return aarch64_simd_check_vect_par_cnst_half (op, mode, false);
 })

+(define_predicate "descending_int_parallel"
+  (match_code "parallel")
+{
+  return aarch64_stepped_int_parallel_p (op, -1);
+})
+
 (define_special_predicate "aarch64_simd_lshift_imm"
   (match_code "const,const_vector")
 {
@@ -543,6 +549,10 @@
   (and (match_operand 0 "memory_operand")
        (match_test "aarch64_sve_ld1r_operand_p (op)")))

+(define_predicate "aarch64_sve_ld1rq_operand"
+  (and (match_code "mem")
+       (match_test "aarch64_sve_ld1rq_operand_p (op)")))
+
 ;; Like memory_operand, but restricted to addresses that are valid for
 ;; SVE LDR and STR instructions.
 (define_predicate "aarch64_sve_ldr_operand"
gcc/machmode.h

@@ -251,7 +251,8 @@ public:
   ALWAYS_INLINE opt_mode (from_int m) : m_mode (machine_mode (m)) {}

   machine_mode else_void () const;
-  machine_mode else_blk () const;
+  machine_mode else_blk () const { return else_mode (BLKmode); }
+  machine_mode else_mode (machine_mode) const;
   T require () const;
   bool exists () const;
@@ -271,13 +272,13 @@ opt_mode<T>::else_void () const
   return m_mode;
 }

-/* If the T exists, return its enum value, otherwise return E_BLKmode.  */
+/* If the T exists, return its enum value, otherwise return FALLBACK.  */
 template<typename T>
 inline machine_mode
-opt_mode<T>::else_blk () const
+opt_mode<T>::else_mode (machine_mode fallback) const
 {
-  return m_mode == E_VOIDmode ? E_BLKmode : m_mode;
+  return m_mode == E_VOIDmode ? fallback : m_mode;
 }

 /* Assert that the object contains a T and return it.  */
gcc/testsuite/gcc.target/aarch64/sve/const_1.c (new test)

/* { dg-do compile } */
/* { dg-options "-O3" } */

#include <stdint.h>

void
set (uint64_t *dst, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] = 0xffff00ff00ffff00ULL;
}

/* { dg-final { scan-assembler {\tmovi\tv([0-9]+)\.2d, 0xffff00ff00ffff00\n.*\tdup\tz[0-9]+\.q, z\1\.q\[0\]\n} } } */
gcc/testsuite/gcc.target/aarch64/sve/const_2.c (new test)

/* { dg-do compile } */
/* { dg-options "-O3" } */

#include <stdint.h>

#define TEST(TYPE, CONST)			\
  void						\
  set_##TYPE (TYPE *dst, int count)		\
  {						\
    for (int i = 0; i < count; ++i)		\
      dst[i] = CONST;				\
  }

TEST (uint16_t, 129)
TEST (uint32_t, 129)
TEST (uint64_t, 129)

/* { dg-final { scan-assembler {\tmovi\tv([0-9]+)\.8h, 0x81\n[^:]*\tdup\tz[0-9]+\.q, z\1\.q\[0\]\n} } } */
/* { dg-final { scan-assembler {\tmovi\tv([0-9]+)\.4s, 0x81\n[^:]*\tdup\tz[0-9]+\.q, z\1\.q\[0\]\n} } } */
/* { dg-final { scan-assembler {\tmov\t(x[0-9]+), 129\n[^:]*\tmov\tz[0-9]+\.d, \1\n} } } */
gcc/testsuite/gcc.target/aarch64/sve/const_3.c (new test)

/* { dg-do compile } */
/* { dg-options "-O3" } */

#include <stdint.h>

#define TEST(TYPE, CONST)			\
  void						\
  set_##TYPE (TYPE *dst, int count)		\
  {						\
    for (int i = 0; i < count; ++i)		\
      dst[i] = CONST;				\
  }

TEST (uint16_t, 0x1234)
TEST (uint32_t, 0x1234)
TEST (uint64_t, 0x1234)

/* { dg-final { scan-assembler {\tmov\t(w[0-9]+), 4660\n[^:]*\tmov\tz[0-9]+\.h, \1\n} } } */
/* { dg-final { scan-assembler {\tmov\t(w[0-9]+), 4660\n[^:]*\tmov\tz[0-9]+\.s, \1\n} } } */
/* { dg-final { scan-assembler {\tmov\t(x[0-9]+), 4660\n[^:]*\tmov\tz[0-9]+\.d, \1\n} } } */
gcc/testsuite/gcc.target/aarch64/sve/init_2.c

@@ -11,9 +11,9 @@ typedef int32_t vnx4si __attribute__((vector_size (32)));
 /*
 ** foo:
 **	...
-**	ld1w	(z[0-9]+\.s), p[0-9]+/z, \[x[0-9]+\]
-**	insr	\1, w1
-**	insr	\1, w0
+**	ld1rd	(z[0-9]+)\.d, p[0-9]+/z, \[x[0-9]+\]
+**	insr	\1\.s, w1
+**	insr	\1\.s, w0
 **	...
 */
 __attribute__((noipa))
gcc/testsuite/gcc.target/aarch64/sve/init_4.c

@@ -11,10 +11,10 @@ typedef int32_t vnx4si __attribute__((vector_size (32)));
 /*
 ** foo:
 **	...
-**	ld1w	(z[0-9]+\.s), p[0-9]+/z, \[x[0-9]+\]
-**	insr	\1, w1
-**	insr	\1, w0
-**	rev	\1, \1
+**	ld1rd	(z[0-9]+)\.d, p[0-9]+/z, \[x[0-9]+\]
+**	insr	\1\.s, w1
+**	insr	\1\.s, w0
+**	rev	\1\.s, \1\.s
 **	...
 */
 __attribute__((noipa))
gcc/testsuite/gcc.target/aarch64/sve/ld1r_2.c

@@ -28,22 +28,6 @@
   T (int64_t)

 #define FOR_EACH_LOAD_BROADCAST_IMM(T) \
-  T (int16_t, 129, imm_129) \
-  T (int32_t, 129, imm_129) \
-  T (int64_t, 129, imm_129) \
-\
-  T (int16_t, -130, imm_m130) \
-  T (int32_t, -130, imm_m130) \
-  T (int64_t, -130, imm_m130) \
-\
-  T (int16_t, 0x1234, imm_0x1234) \
-  T (int32_t, 0x1234, imm_0x1234) \
-  T (int64_t, 0x1234, imm_0x1234) \
-\
-  T (int16_t, 0xFEDC, imm_0xFEDC) \
-  T (int32_t, 0xFEDC, imm_0xFEDC) \
-  T (int64_t, 0xFEDC, imm_0xFEDC) \
-\
   T (int32_t, 0x12345678, imm_0x12345678) \
   T (int64_t, 0x12345678, imm_0x12345678) \
 \
@@ -56,6 +40,6 @@ FOR_EACH_LOAD_BROADCAST (DEF_LOAD_BROADCAST)
 FOR_EACH_LOAD_BROADCAST_IMM (DEF_LOAD_BROADCAST_IMM)

 /* { dg-final { scan-assembler-times {\tld1rb\tz[0-9]+\.b, p[0-7]/z, } 1 } } */
-/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, p[0-7]/z, } 5 } } */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, p[0-7]/z, } 7 } } */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, p[0-7]/z, } 8 } } */
+/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, p[0-7]/z, } 1 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, p[0-7]/z, } 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, p[0-7]/z, } 4 } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_2.c

@@ -29,12 +29,9 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
 TEST_ALL (VEC_PERM)

-/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 { target aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 { target aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 { target aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 { target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, w[0-9]+\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
 /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 3 } } */
 /* { dg-final { scan-assembler-not {\tzip1\t} } } */
 /* { dg-final { scan-assembler-not {\tzip2\t} } } */
gcc/testsuite/gcc.target/aarch64/sve/slp_3.c

@@ -32,18 +32,17 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
 TEST_ALL (VEC_PERM)

 /* 1 for each 8-bit type.  */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 { target aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
-/* 1 for each 16-bit type and 4 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 { target aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 { target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
+/* 1 for each 16-bit type plus 1 for double.  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
 /* 1 for each 32-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
+/* 3 for double.  */
+/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */

 /* The 64-bit types need:
    ZIP1 ZIP1 (2 ZIP2s optimized away)
gcc/testsuite/gcc.target/aarch64/sve/slp_4.c

@@ -35,10 +35,8 @@ vec_slp_##TYPE (TYPE *restrict a, int n) \
 TEST_ALL (VEC_PERM)

-/* 1 for each 8-bit type, 4 for each 32-bit type and 8 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 { target aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 20 { target aarch64_big_endian } } } */
+/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
 /* 1 for each 16-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]\.h, } 3 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
@@ -49,6 +47,8 @@ TEST_ALL (VEC_PERM)
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
+/* 4 for double.  */
+/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */

 /* The 32-bit types need:
    ZIP1 ZIP1 (2 ZIP2s optimized away)
gcc/testsuite/gcc.target/aarch64/sve/spill_4.c

@@ -24,10 +24,10 @@ TEST_LOOP (uint16_t, 0x1234);
 TEST_LOOP (uint32_t, 0x12345);
 TEST_LOOP (uint64_t, 0x123456);

-/* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-not {\tptrue\tp[0-9]+\.h,} } } */
 /* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.s,} 3 } } */
 /* { dg-final { scan-assembler-times {\tptrue\tp[0-9]+\.d,} 3 } } */
-/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h,} 3 } } */
+/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, w[0-9]+\n} 3 } } */
 /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s,} 3 } } */
 /* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d,} 3 } } */
 /* { dg-final { scan-assembler-not {\tldr\tz[0-9]} } } */