Commit 56b1c60e by Martin Jambor

backport: hsa-builtins.def: New file.

Merge from HSA branch to trunk

2016-11-23  Martin Jambor  <mjambor@suse.cz>
	    Martin Liska  <mliska@suse.cz>

gcc/
	* hsa-builtins.def: New file.
	* Makefile.in (BUILTINS_DEF): Add hsa-builtins.def dependency.
	* builtins.def: Include hsa-builtins.def.
	(DEF_HSA_BUILTIN): New macro.
	* dumpfile.h (OPTGROUP_OPENMP): Define.
	* dumpfile.c (optgroup_options): Added OPTGROUP_OPENMP.
	* gimple.h (gf_mask): Added elements GF_OMP_FOR_GRID_INTRA_GROUP and
	GF_OMP_FOR_GRID_GROUP_ITER.
	(gimple_omp_for_grid_phony): Added checking assert.
	(gimple_omp_for_set_grid_phony): Likewise.
	(gimple_omp_for_grid_intra_group): New function.
	(gimple_omp_for_set_grid_intra_group): Likewise.
	(gimple_omp_for_grid_group_iter): Likewise.
	(gimple_omp_for_set_grid_group_iter): Likewise.
	* omp-low.c (check_omp_nesting_restrictions): Allow GRID loop where
	previously only a distribute loop was permitted.
	(lower_lastprivate_clauses): Allow non-tcc_comparison predicates.
	(grid_get_kernel_launch_attributes): Support multiple HSA grid
	dimensions.
	(grid_expand_omp_for_loop): Likewise and also support standalone
	distribute constructs.  New parameter INTRA_GROUP, updated both users.
	(grid_expand_target_grid_body): Support standalone distribute
	constructs.
	(pass_data_expand_omp): Changed optinfo_flags to OPTGROUP_OPENMP.
	(pass_data_expand_omp_ssa): Likewise.
	(pass_data_omp_device_lower): Likewise.
	(pass_data_lower_omp): Likewise.
	(pass_data_diagnose_omp_blocks): Likewise.
	(pass_data_oacc_device_lower): Likewise.
	(pass_data_omp_target_link): Likewise.
	(grid_lastprivate_predicate): New function.
	(lower_omp_for_lastprivate): Call grid_lastprivate_predicate for
	gridified loops.
	(lower_omp_for): Support standalone distribute constructs.
	(grid_prop): New type.
	(grid_safe_assignment_p): Check for assignments to group_sizes, new
	parameter GRID.
	(grid_seq_only_contains_local_assignments): New parameter GRID, pass
	it to callee.
	(grid_find_single_omp_among_assignments_1): Likewise, improve missed
	optimization info messages.
	(grid_find_single_omp_among_assignments): Likewise.
	(grid_find_ungridifiable_statement): Do not bail out for SIMDs.
	(grid_parallel_clauses_gridifiable): New function.
	(grid_inner_loop_gridifiable_p): Likewise.
	(grid_dist_follows_simple_pattern): Likewise.
	(grid_gfor_follows_tiling_pattern): Likewise.
	(grid_call_permissible_in_distribute_p): Likewise.
	(grid_handle_call_in_distribute): Likewise.
	(grid_dist_follows_tiling_pattern): Likewise.
	(grid_target_follows_gridifiable_pattern): Support standalone distribute
	constructs.
	(grid_var_segment): New enum.
	(grid_mark_variable_segment): New function.
	(grid_copy_leading_local_assignments): Call grid_mark_variable_segment
	if a new argument says so.
	(grid_process_grid_body): New function.
	(grid_eliminate_combined_simd_part): Likewise.
	(grid_mark_tiling_loops): Likewise.
	(grid_mark_tiling_parallels_and_loops): Likewise.
	(grid_process_kernel_body_copy): Support standalone distribute
	constructs.
	(grid_attempt_target_gridification): New grid variable holding overall
	gridification state.  Support standalone distribute constructs and
	collapse clauses.
	* doc/optinfo.texi (Optimization groups): Document OPTGROUP_OPENMP.
	* hsa.h (hsa_bb): Add method append_phi.
	(hsa_insn_br): Renamed to hsa_insn_cbr; renamed all
	occurrences in all files too.
	(hsa_insn_br): New class, now the ancestor of hsa_insn_cbr.
	(is_a_helper <hsa_insn_br *>::test): New function.
	(is_a_helper <hsa_insn_cbr *>::test): Adjust to only cover conditional
	branch instructions.
	(hsa_insn_signal): Make a direct descendant of
	hsa_insn_basic.  Add memorder constructor parameter and
	m_memory_order and m_signalop member variables.
	(hsa_insn_queue): Changed constructor parameters to common form.
	Added m_segment and m_memory_order member variables.
	(hsa_summary_t): Add private member function
	process_gpu_implementation_attributes.
	(hsa_function_summary): Rename m_binded_function to
	m_bound_function.
	(hsa_insn_basic_p): Remove typedef.
	(hsa_op_with_type): Change hsa_insn_basic_p into plain pointers.
	(hsa_op_reg_p): Remove typedef.
	(hsa_function_representation): Change hsa_op_reg_p into plain
	pointers.
	(hsa_insn_phi): Removed new and delete operators.
	(hsa_insn_br): Likewise.
	(hsa_insn_cbr): Likewise.
	(hsa_insn_sbr): Likewise.
	(hsa_insn_cmp): Likewise.
	(hsa_insn_mem): Likewise.
	(hsa_insn_atomic): Likewise.
	(hsa_insn_signal): Likewise.
	(hsa_insn_seg): Likewise.
	(hsa_insn_call): Likewise.
	(hsa_insn_arg_block): Likewise.
	(hsa_insn_comment): Likewise.
	(hsa_insn_srctype): Likewise.
	(hsa_insn_packed): Likewise.
	(hsa_insn_cvt): Likewise.
	(hsa_insn_alloca): Likewise.
	* hsa.c (hsa_destroy_insn): Also handle instances of hsa_insn_br.
	(process_gpu_implementation_attributes): New function.
	(link_functions): Move some functionality into it.  Adjust after
	renaming m_binded_functions to m_bound_functions.
	(hsa_insn_basic::op_output_p): Add BRIG_OPCODE_DEBUGTRAP
	to the list of instructions with no output registers.
	(get_in_type): Return this if it is a register of
	matching size.
	(hsa_get_declaration_name): Moved to...
	* hsa-gen.c (hsa_get_declaration_name): ...here.  Allocate the
	temporary string on an obstack instead of from ggc.
	(query_hsa_grid): Renamed to query_hsa_grid_dim, reimplemented, cut
	down to two overloads.
	(hsa_allocp_operand_address): Removed.
	(hsa_allocp_operand_immed): Likewise.
	(hsa_allocp_operand_reg): Likewise.
	(hsa_allocp_operand_code_list): Likewise.
	(hsa_allocp_operand_operand_list): Likewise.
	(hsa_allocp_inst_basic): Likewise.
	(hsa_allocp_inst_phi): Likewise.
	(hsa_allocp_inst_mem): Likewise.
	(hsa_allocp_inst_atomic): Likewise.
	(hsa_allocp_inst_signal): Likewise.
	(hsa_allocp_inst_seg): Likewise.
	(hsa_allocp_inst_cmp): Likewise.
	(hsa_allocp_inst_br): Likewise.
	(hsa_allocp_inst_sbr): Likewise.
	(hsa_allocp_inst_call): Likewise.
	(hsa_allocp_inst_arg_block): Likewise.
	(hsa_allocp_inst_comment): Likewise.
	(hsa_allocp_inst_queue): Likewise.
	(hsa_allocp_inst_srctype): Likewise.
	(hsa_allocp_inst_packed): Likewise.
	(hsa_allocp_inst_cvt): Likewise.
	(hsa_allocp_inst_alloca): Likewise.
	(hsa_allocp_bb): Likewise.
	(hsa_obstack): New.
	(hsa_init_data_for_cfun): Initialize obstack.
	(hsa_deinit_data_for_cfun): Release memory of the obstack.
	(hsa_op_immed::operator new): Use obstack instead of object_allocator.
	(hsa_op_reg::operator new): Likewise.
	(hsa_op_address::operator new): Likewise.
	(hsa_op_code_list::operator new): Likewise.
	(hsa_op_operand_list::operator new): Likewise.
	(hsa_insn_basic::operator new): Likewise.
	(hsa_insn_phi::operator new): Likewise.
	(hsa_insn_br::operator new): Likewise.
	(hsa_insn_sbr::operator new): Likewise.
	(hsa_insn_cmp::operator new): Likewise.
	(hsa_insn_mem::operator new): Likewise.
	(hsa_insn_atomic::operator new): Likewise.
	(hsa_insn_signal::operator new): Likewise.
	(hsa_insn_seg::operator new): Likewise.
	(hsa_insn_call::operator new): Likewise.
	(hsa_insn_arg_block::operator new): Likewise.
	(hsa_insn_comment::operator new): Likewise.
	(hsa_insn_srctype::operator new): Likewise.
	(hsa_insn_packed::operator new): Likewise.
	(hsa_insn_cvt::operator new): Likewise.
	(hsa_insn_alloca::operator new): Likewise.
	(hsa_init_new_bb): Likewise.
	(hsa_bb::append_phi): New function.
	(gen_hsa_phi_from_gimple_phi): Use it.
	(get_symbol_for_decl): Fix distinguishing between
	global and local functions.  Put local variables into a segment
	according to their attribute or static flag, if there is one.
	(hsa_insn_br::hsa_insn_br): New.
	(hsa_insn_br::operator new): Likewise.
	(hsa_insn_cbr::hsa_insn_cbr): Set width via ancestor constructor.
	(query_hsa_grid_nodim): New function.
	(multiply_grid_dim_characteristics): Likewise.
	(gen_get_num_threads): Likewise.
	(gen_get_num_teams): Reimplemented.
	(gen_get_team_num): Likewise.
	(gen_hsa_insns_for_known_library_call): Updated calls to the above
	helper functions.
	(get_memory_order_name): Removed.
	(get_memory_order): Likewise.
	(hsa_memorder_from_tree): New function.
	(gen_hsa_ternary_atomic_for_builtin): Renamed to
	gen_hsa_atomic_for_builtin, can also create signals.
	(gen_hsa_insns_for_call): Handle many new builtins.  Adjust to use
	hsa_memory_order_from_tree and gen_hsa_atomic_for_builtin.
	(hsa_insn_atomic): Fix function comment.
	(hsa_insn_signal::hsa_insn_signal): Fix comment.  Update call to
	ancestor constructor and initialization of new member variables.
	(hsa_insn_queue::hsa_insn_queue): Added initialization of new
	member variables.
	(hsa_get_host_function): Handle functions with no bound CPU
	implementation.  Fix binded to bound.
	(get_brig_function_name): Likewise.
	(HSA_SORRY_ATV): Remove semicolon after macro.
	(HSA_SORRY_AT): Likewise.
	(omp_simple_builtin::generate): Add missing semicolons.
	(hsa_insn_phi::operator new): Removed.
	(hsa_insn_br::operator new): Likewise.
	(hsa_insn_cbr::operator new): Likewise.
	(hsa_insn_sbr::operator new): Likewise.
	(hsa_insn_cmp::operator new): Likewise.
	(hsa_insn_mem::operator new): Likewise.
	(hsa_insn_atomic::operator new): Likewise.
	(hsa_insn_signal::operator new): Likewise.
	(hsa_insn_seg::operator new): Likewise.
	(hsa_insn_call::operator new): Likewise.
	(hsa_insn_arg_block::operator new): Likewise.
	(hsa_insn_comment::operator new): Likewise.
	(hsa_insn_srctype::operator new): Likewise.
	(hsa_insn_packed::operator new): Likewise.
	(hsa_insn_cvt::operator new): Likewise.
	(hsa_insn_alloca::operator new): Likewise.
	(get_symbol_for_decl): Accept CONST_DECLs, put them to
	readonly segment.
	(gen_hsa_addr): Also process CONST_DECLs.
	(gen_hsa_addr_insns): Process CONST_DECLs by creating private
	copies.
	(gen_hsa_unary_operation): Make sure the function does
	not use bittype source type for firstbit and lastbit operations.
	(gen_hsa_popcount_to_dest): Make sure the function uses a bittype
	source type.
	* hsa-brig.c (emit_insn_operands): Cope with zero operands in an
	instruction.
	(emit_branch_insn): Renamed to emit_cond_branch_insn.
	Emit the width stored in the class.
	(emit_generic_branch_insn): New function.
	(emit_insn): Call emit_generic_branch_insn.
	(emit_signal_insn): Remove obsolete comment.  Update
	member variable name, pick a type according to profile.
	(emit_alloca_insn): Remove obsolete comment.
	(emit_atomic_insn): Likewise.
	(emit_queue_insn): Get segment and memory order from the IR object.
	(hsa_brig_section): Make allocate_new_chunk, chunks
	and cur_chunk private, add a default NULL parameter to the add method.
	(hsa_brig_section::add): Added a new parameter, store pointer to
	output data there if it is non-NULL.
	(emit_function_directives): Use this new parameter instead of
	calculating the pointer itself, fix function comment.
	(hsa_brig_emit_function): Add forgotten endian conversion.
	(hsa_output_kernels): Remove unnecessary building of
	kernel_dependencies_vector_type.
	(emit_immediate_operand): Declare.
	(emit_directive_variable): Also emit initializers of CONST_DECLs.
	(gen_hsa_insn_for_internal_fn_call): Also handle IFN_RSQRT.
	(verify_function_arguments): Properly detect variadic
	arguments.
	* hsa-dump.c (hsa_width_specifier_name): New function.
	(dump_hsa_insn_1): Dump generic branch instructions, update signal
	member variable name.  Special dumping for queue objects.
	* ipa-hsa.c (process_hsa_functions): Adjust after renaming
	m_binded_functions to m_bound_functions.  Copy externally visible flag
	to the node.
	(ipa_hsa_write_summary): Likewise.
	(ipa_hsa_read_section): Likewise.


	gcc/fortran/
	* f95-lang.c (DEF_HSA_BUILTIN): New macro.

gcc/testsuite/
	* c-c++-common/gomp/gridify-1.c: Update scan string.
	* gfortran.dg/gomp/gridify-1.f90: Likewise.
	* c-c++-common/gomp/gridify-2.c: New test.
	* c-c++-common/gomp/gridify-3.c: Likewise.

libgomp/
	* testsuite/libgomp.hsa.c/bits-insns.c: New test.
	* testsuite/libgomp.hsa.c/tiling-1.c: Likewise.
	* testsuite/libgomp.hsa.c/tiling-2.c: Likewise.


Co-Authored-By: Martin Liska <mliska@suse.cz>

From-SVN: r242761
parent f6cdfe82
......@@ -911,7 +911,8 @@ RTL_H = $(RTL_BASE_H) $(FLAGS_H) genrtl.h
READ_MD_H = $(OBSTACK_H) $(HASHTAB_H) read-md.h
PARAMS_H = params.h params-enum.h params.def
BUILTINS_DEF = builtins.def sync-builtins.def omp-builtins.def \
gtm-builtins.def sanitizer.def cilkplus.def cilk-builtins.def
gtm-builtins.def sanitizer.def cilkplus.def cilk-builtins.def \
hsa-builtins.def
INTERNAL_FN_DEF = internal-fn.def
INTERNAL_FN_H = internal-fn.h $(INTERNAL_FN_DEF)
TREE_CORE_H = tree-core.h coretypes.h all-tree.def tree.def \
......
......@@ -201,6 +201,19 @@ along with GCC; see the file COPYING3. If not see
|| flag_cilkplus \
|| flag_offload_abi != OFFLOAD_ABI_UNSET))
#undef DEF_HSA_BUILTIN
#ifdef ENABLE_HSA
#define DEF_HSA_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \
false, false, true, ATTRS, false, \
(!flag_disable_hsa))
#else
#define DEF_HSA_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \
false, false, true, ATTRS, false, \
(false))
#endif
/* Builtin used by implementation of Cilk Plus. Most of these are decomposed
by the compiler but a few are implemented in libcilkrts. */
#undef DEF_CILK_BUILTIN_STUB
......@@ -968,6 +981,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, ATTR_NOTHROW_LEAF_LIST)
/* Offloading and Multi Processing builtins. */
#include "omp-builtins.def"
/* Heterogeneous Systems Architecture. */
#include "hsa-builtins.def"
/* Cilk keywords builtins. */
#include "cilk-builtins.def"
......
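The DEF_HSA_BUILTIN hunks above follow GCC's usual .def X-macro scheme: each includer defines the macro to expand entries as it needs, then pulls in the .def file. A minimal, self-contained sketch of that pattern, using hypothetical MINI_* names rather than the real GCC API (GCC uses an #include of the .def file where this sketch uses an object-like list macro, but the expansion mechanics are the same):

```c
/* Hypothetical stand-in for a .def file: one DEF_MINI_BUILTIN entry per
   builtin, expanded differently by each "includer".  */
#define MINI_BUILTINS_DEF \
  DEF_MINI_BUILTIN (BI_FOO, "foo") \
  DEF_MINI_BUILTIN (BI_BAR, "bar")

/* First expansion: build an enum of builtin codes.  */
enum mini_builtin {
#define DEF_MINI_BUILTIN(ENUM, NAME) ENUM,
  MINI_BUILTINS_DEF
#undef DEF_MINI_BUILTIN
  BI_MAX
};

/* Second expansion: build a parallel table of builtin names.  */
static const char *const mini_builtin_names[] = {
#define DEF_MINI_BUILTIN(ENUM, NAME) NAME,
  MINI_BUILTINS_DEF
#undef DEF_MINI_BUILTIN
};
```

In the patch, builtins.def plays the role of the includer: it defines DEF_HSA_BUILTIN (conditionally on ENABLE_HSA) and then includes hsa-builtins.def.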
......@@ -59,6 +59,9 @@ Loop optimization passes. Enabled by @option{-loop}.
@item OPTGROUP_INLINE
Inlining passes. Enabled by @option{-inline}.
@item OPTGROUP_OPENMP
OpenMP passes. Enabled by @option{-openmp}.
@item OPTGROUP_VEC
Vectorization passes. Enabled by @option{-vec}.
......
......@@ -138,6 +138,7 @@ static const struct dump_option_value_info optgroup_options[] =
{"ipa", OPTGROUP_IPA},
{"loop", OPTGROUP_LOOP},
{"inline", OPTGROUP_INLINE},
{"openmp", OPTGROUP_OPENMP},
{"vec", OPTGROUP_VEC},
{"optall", OPTGROUP_ALL},
{NULL, 0}
......
......@@ -98,7 +98,8 @@ enum tree_dump_index
#define OPTGROUP_LOOP (1 << 2) /* Loop optimization passes */
#define OPTGROUP_INLINE (1 << 3) /* Inlining passes */
#define OPTGROUP_VEC (1 << 4) /* Vectorization passes */
#define OPTGROUP_OTHER (1 << 5) /* All other passes */
#define OPTGROUP_OPENMP (1 << 5) /* OpenMP specific transformations */
#define OPTGROUP_OTHER (1 << 6) /* All other passes */
#define OPTGROUP_ALL (OPTGROUP_IPA | OPTGROUP_LOOP | OPTGROUP_INLINE \
| OPTGROUP_VEC | OPTGROUP_OTHER)
......
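The dumpfile.h/dumpfile.c hunks above pair each optgroup name with a distinct bit so that -fopt-info group selectors can be OR-ed together. A hedged sketch of that layout with hypothetical OPTGRP_* names; note that, as in the hunk above, the "all" mask does not include the new OpenMP bit:

```c
/* Hypothetical mirror of the optgroup bit layout from the hunk above.  */
#define OPTGRP_IPA    (1u << 1)
#define OPTGRP_LOOP   (1u << 2)
#define OPTGRP_INLINE (1u << 3)
#define OPTGRP_VEC    (1u << 4)
#define OPTGRP_OPENMP (1u << 5)  /* New bit; OTHER moves up to bit 6.  */
#define OPTGRP_OTHER  (1u << 6)
/* As in the patch, the "optall" mask leaves out the OpenMP group.  */
#define OPTGRP_ALL (OPTGRP_IPA | OPTGRP_LOOP | OPTGRP_INLINE \
                    | OPTGRP_VEC | OPTGRP_OTHER)

/* Name-to-mask table, as optgroup_options does in dumpfile.c.  */
struct optgroup_entry { const char *name; unsigned value; };

static const struct optgroup_entry optgroup_entries[] = {
  { "openmp", OPTGRP_OPENMP },
  { "vec",    OPTGRP_VEC },
  { "optall", OPTGRP_ALL },
};
```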
2016-11-23 Martin Jambor <mjambor@suse.cz>
* f95-lang.c (DEF_HSA_BUILTIN): New macro.
2016-11-22 Steven G. Kargl <kargl@gcc.gnu.org>
PR fortran/78479
......
......@@ -1224,6 +1224,17 @@ gfc_init_builtin_functions (void)
#undef DEF_GOMP_BUILTIN
}
#ifdef ENABLE_HSA
if (!flag_disable_hsa)
{
#undef DEF_HSA_BUILTIN
#define DEF_HSA_BUILTIN(code, name, type, attr) \
gfc_define_builtin ("__builtin_" name, builtin_types[type], \
code, name, attr);
#include "../hsa-builtins.def"
}
#endif
gfc_define_builtin ("__builtin_trap", builtin_types[BT_FN_VOID],
BUILT_IN_TRAP, NULL, ATTR_NOTHROW_LEAF_LIST);
TREE_THIS_VOLATILE (builtin_decl_explicit (BUILT_IN_TRAP)) = 1;
......
......@@ -163,7 +163,13 @@ enum gf_mask {
GF_OMP_FOR_KIND_CILKSIMD = GF_OMP_FOR_SIMD | 1,
GF_OMP_FOR_COMBINED = 1 << 4,
GF_OMP_FOR_COMBINED_INTO = 1 << 5,
/* The following flag must not be used on GF_OMP_FOR_KIND_GRID_LOOP loop
statements. */
GF_OMP_FOR_GRID_PHONY = 1 << 6,
/* The following two flags should only be set on GF_OMP_FOR_KIND_GRID_LOOP
loop statements. */
GF_OMP_FOR_GRID_INTRA_GROUP = 1 << 6,
GF_OMP_FOR_GRID_GROUP_ITER = 1 << 7,
GF_OMP_TARGET_KIND_MASK = (1 << 4) - 1,
GF_OMP_TARGET_KIND_REGION = 0,
GF_OMP_TARGET_KIND_DATA = 1,
......@@ -5143,6 +5149,8 @@ gimple_omp_for_set_pre_body (gimple *gs, gimple_seq pre_body)
static inline bool
gimple_omp_for_grid_phony (const gomp_for *omp_for)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
!= GF_OMP_FOR_KIND_GRID_LOOP);
return (gimple_omp_subcode (omp_for) & GF_OMP_FOR_GRID_PHONY) != 0;
}
......@@ -5151,12 +5159,61 @@ gimple_omp_for_grid_phony (const gomp_for *omp_for)
static inline void
gimple_omp_for_set_grid_phony (gomp_for *omp_for, bool value)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
!= GF_OMP_FOR_KIND_GRID_LOOP);
if (value)
omp_for->subcode |= GF_OMP_FOR_GRID_PHONY;
else
omp_for->subcode &= ~GF_OMP_FOR_GRID_PHONY;
}
/* Return the kernel_intra_group of a GRID_LOOP OMP_FOR statement. */
static inline bool
gimple_omp_for_grid_intra_group (const gomp_for *omp_for)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
return (gimple_omp_subcode (omp_for) & GF_OMP_FOR_GRID_INTRA_GROUP) != 0;
}
/* Set kernel_intra_group flag of OMP_FOR to VALUE. */
static inline void
gimple_omp_for_set_grid_intra_group (gomp_for *omp_for, bool value)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
if (value)
omp_for->subcode |= GF_OMP_FOR_GRID_INTRA_GROUP;
else
omp_for->subcode &= ~GF_OMP_FOR_GRID_INTRA_GROUP;
}
/* Return true if iterations of a grid OMP_FOR statement correspond to HSA
groups. */
static inline bool
gimple_omp_for_grid_group_iter (const gomp_for *omp_for)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
return (gimple_omp_subcode (omp_for) & GF_OMP_FOR_GRID_GROUP_ITER) != 0;
}
/* Set group_iter flag of OMP_FOR to VALUE. */
static inline void
gimple_omp_for_set_grid_group_iter (gomp_for *omp_for, bool value)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
if (value)
omp_for->subcode |= GF_OMP_FOR_GRID_GROUP_ITER;
else
omp_for->subcode &= ~GF_OMP_FOR_GRID_GROUP_ITER;
}
/* Return the clauses associated with OMP_PARALLEL GS. */
static inline tree
......
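All four new accessors above follow the same subcode bit get/set idiom, guarded by a checking assert on the loop kind. A minimal standalone sketch of that idiom, using a hypothetical struct rather than the real GIMPLE API:

```c
#include <assert.h>

#define FLAG_GRID_INTRA_GROUP (1u << 6)
#define FLAG_GRID_GROUP_ITER  (1u << 7)

/* Hypothetical stand-in for a gimple statement's subcode word.  */
struct grid_stmt { unsigned subcode; };

/* Test the intra-group bit, as gimple_omp_for_grid_intra_group does.  */
static int
grid_intra_group_p (const struct grid_stmt *s)
{
  return (s->subcode & FLAG_GRID_INTRA_GROUP) != 0;
}

/* Set or clear the bit, as gimple_omp_for_set_grid_intra_group does.  */
static void
set_grid_intra_group (struct grid_stmt *s, int value)
{
  if (value)
    s->subcode |= FLAG_GRID_INTRA_GROUP;
  else
    s->subcode &= ~FLAG_GRID_INTRA_GROUP;
}

/* Exercise the pair: set, read back, clear, read back.  */
static int
grid_flag_demo (void)
{
  struct grid_stmt s = { 0 };
  set_grid_intra_group (&s, 1);
  int set_ok = grid_intra_group_p (&s);
  set_grid_intra_group (&s, 0);
  int clear_ok = !grid_intra_group_p (&s);
  return set_ok && clear_ok;
}
```

In the real accessors, the gcc_checking_assert on gimple_omp_for_kind ensures the two new bits are only interpreted on GF_OMP_FOR_KIND_GRID_LOOP statements, since bit 6 means GF_OMP_FOR_GRID_PHONY on every other loop kind.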
/* This file contains the definitions and documentation for the
Offloading and Multi Processing builtins used in the GNU compiler.
Copyright (C) 2005-2015 Free Software Foundation, Inc.
This file is part of GCC.
GCC is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 3, or (at your option) any later
version.
GCC is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with GCC; see the file COPYING3. If not see
<http://www.gnu.org/licenses/>. */
/* Before including this file, you should define a macro:
DEF_HSA_BUILTIN (ENUM, NAME, TYPE, ATTRS)
See builtins.def for details. */
/* The reason why they aren't in gcc/builtins.def is that the Fortran front end
doesn't source those. */
DEF_HSA_BUILTIN (BUILT_IN_HSA_WORKGROUPID, "hsa_workgroupid",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_WORKITEMID, "hsa_workitemid",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_WORKITEMABSID, "hsa_workitemabsid",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_GRIDSIZE, "hsa_gridsize",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_CURRENTWORKGROUPSIZE, "hsa_currentworkgroupsize",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
......@@ -621,6 +621,88 @@ hsa_m_atomicop_name (enum BrigAtomicOperation op)
}
}
/* Return textual name of a width specifier. */
static const char *
hsa_width_specifier_name (BrigWidth8_t width)
{
switch (width)
{
case BRIG_WIDTH_NONE:
return "none";
case BRIG_WIDTH_1:
return "1";
case BRIG_WIDTH_2:
return "2";
case BRIG_WIDTH_4:
return "4";
case BRIG_WIDTH_8:
return "8";
case BRIG_WIDTH_16:
return "16";
case BRIG_WIDTH_32:
return "32";
case BRIG_WIDTH_64:
return "64";
case BRIG_WIDTH_128:
return "128";
case BRIG_WIDTH_256:
return "256";
case BRIG_WIDTH_512:
return "512";
case BRIG_WIDTH_1024:
return "1024";
case BRIG_WIDTH_2048:
return "2048";
case BRIG_WIDTH_4096:
return "4096";
case BRIG_WIDTH_8192:
return "8192";
case BRIG_WIDTH_16384:
return "16384";
case BRIG_WIDTH_32768:
return "32768";
case BRIG_WIDTH_65536:
return "65536";
case BRIG_WIDTH_131072:
return "131072";
case BRIG_WIDTH_262144:
return "262144";
case BRIG_WIDTH_524288:
return "524288";
case BRIG_WIDTH_1048576:
return "1048576";
case BRIG_WIDTH_2097152:
return "2097152";
case BRIG_WIDTH_4194304:
return "4194304";
case BRIG_WIDTH_8388608:
return "8388608";
case BRIG_WIDTH_16777216:
return "16777216";
case BRIG_WIDTH_33554432:
return "33554432";
case BRIG_WIDTH_67108864:
return "67108864";
case BRIG_WIDTH_134217728:
return "134217728";
case BRIG_WIDTH_268435456:
return "268435456";
case BRIG_WIDTH_536870912:
return "536870912";
case BRIG_WIDTH_1073741824:
return "1073741824";
case BRIG_WIDTH_2147483648:
return "2147483648";
case BRIG_WIDTH_WAVESIZE:
return "wavesize";
case BRIG_WIDTH_ALL:
return "all";
default:
return "UNKNOWN_WIDTH";
}
}
/* Dump textual representation of HSA IL register REG to file F. */
static void
......@@ -793,9 +875,9 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
hsa_insn_signal *mem = as_a <hsa_insn_signal *> (insn);
fprintf (f, "%s", hsa_opcode_name (mem->m_opcode));
fprintf (f, "_%s", hsa_m_atomicop_name (mem->m_atomicop));
if (mem->m_memoryorder != BRIG_MEMORY_ORDER_NONE)
fprintf (f, "_%s", hsa_memsem_name (mem->m_memoryorder));
fprintf (f, "_%s", hsa_m_atomicop_name (mem->m_signalop));
if (mem->m_memory_order != BRIG_MEMORY_ORDER_NONE)
fprintf (f, "_%s", hsa_memsem_name (mem->m_memory_order));
fprintf (f, "_%s ", hsa_type_name (mem->m_type));
dump_hsa_operands (f, mem);
......@@ -884,9 +966,9 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
fprintf (f, ", ");
dump_hsa_operand (f, cmp->get_op (2));
}
else if (is_a <hsa_insn_br *> (insn))
else if (is_a <hsa_insn_cbr *> (insn))
{
hsa_insn_br *br = as_a <hsa_insn_br *> (insn);
hsa_insn_cbr *br = as_a <hsa_insn_cbr *> (insn);
basic_block target = NULL;
edge_iterator ei;
edge e;
......@@ -921,6 +1003,12 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
fprintf (f, ", ");
}
}
else if (is_a <hsa_insn_br *> (insn))
{
hsa_insn_br *br = as_a <hsa_insn_br *> (insn);
fprintf (f, "%s_width(%s) ", hsa_opcode_name (br->m_opcode),
hsa_width_specifier_name (br->m_width));
}
else if (is_a <hsa_insn_arg_block *> (insn))
{
hsa_insn_arg_block *arg_block = as_a <hsa_insn_arg_block *> (insn);
......@@ -1018,6 +1106,15 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
dump_hsa_operands (f, insn);
}
else if (hsa_insn_queue *qi = dyn_cast <hsa_insn_queue *> (insn))
{
fprintf (f, "%s_%s_%s_%s ", hsa_opcode_name (qi->m_opcode),
hsa_seg_name (qi->m_segment),
hsa_memsem_name (qi->m_memory_order),
hsa_type_name (qi->m_type));
dump_hsa_operands (f, qi);
}
else
{
fprintf (f, "%s_%s ", hsa_opcode_name (insn->m_opcode),
......
......@@ -170,6 +170,7 @@ hsa_insn_basic::op_output_p (unsigned opnum)
case BRIG_OPCODE_SBR:
case BRIG_OPCODE_ST:
case BRIG_OPCODE_SIGNALNORET:
case BRIG_OPCODE_DEBUGTRAP:
/* FIXME: There are probably missing cases here, double check. */
return false;
case BRIG_OPCODE_EXPAND:
......@@ -605,8 +606,8 @@ hsa_destroy_insn (hsa_insn_basic *insn)
{
if (hsa_insn_phi *phi = dyn_cast <hsa_insn_phi *> (insn))
phi->~hsa_insn_phi ();
else if (hsa_insn_br *br = dyn_cast <hsa_insn_br *> (insn))
br->~hsa_insn_br ();
else if (hsa_insn_cbr *br = dyn_cast <hsa_insn_cbr *> (insn))
br->~hsa_insn_cbr ();
else if (hsa_insn_cmp *cmp = dyn_cast <hsa_insn_cmp *> (insn))
cmp->~hsa_insn_cmp ();
else if (hsa_insn_mem *mem = dyn_cast <hsa_insn_mem *> (insn))
......@@ -621,6 +622,8 @@ hsa_destroy_insn (hsa_insn_basic *insn)
block->~hsa_insn_arg_block ();
else if (hsa_insn_sbr *sbr = dyn_cast <hsa_insn_sbr *> (insn))
sbr->~hsa_insn_sbr ();
else if (hsa_insn_br *br = dyn_cast <hsa_insn_br *> (insn))
br->~hsa_insn_br ();
else if (hsa_insn_comment *comment = dyn_cast <hsa_insn_comment *> (insn))
comment->~hsa_insn_comment ();
else
......@@ -783,32 +786,22 @@ hsa_brig_function_name (const char *p)
return buf;
}
/* Return declaration name if exists. */
/* Add a flatten attribute and disable vectorization for gpu implementation
function decl GDECL. */
const char *
hsa_get_declaration_name (tree decl)
void hsa_summary_t::process_gpu_implementation_attributes (tree gdecl)
{
if (!DECL_NAME (decl))
{
char buf[64];
snprintf (buf, 64, "__hsa_anonymous_%i", DECL_UID (decl));
const char *ggc_str = ggc_strdup (buf);
return ggc_str;
}
tree name_tree;
if (TREE_CODE (decl) == FUNCTION_DECL
|| (VAR_P (decl) && is_global_var (decl)))
name_tree = DECL_ASSEMBLER_NAME (decl);
else
name_tree = DECL_NAME (decl);
const char *name = IDENTIFIER_POINTER (name_tree);
/* User-defined assembly names have prepended asterisk symbol. */
if (name[0] == '*')
name++;
DECL_ATTRIBUTES (gdecl)
= tree_cons (get_identifier ("flatten"), NULL_TREE,
DECL_ATTRIBUTES (gdecl));
return name;
tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
if (fn_opts == NULL_TREE)
fn_opts = optimization_default_node;
fn_opts = copy_node (fn_opts);
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
}
void
......@@ -827,21 +820,10 @@ hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
gpu_summary->m_gridified_kernel_p = gridified_kernel_p;
host_summary->m_gridified_kernel_p = gridified_kernel_p;
gpu_summary->m_binded_function = host;
host_summary->m_binded_function = gpu;
tree gdecl = gpu->decl;
DECL_ATTRIBUTES (gdecl)
= tree_cons (get_identifier ("flatten"), NULL_TREE,
DECL_ATTRIBUTES (gdecl));
gpu_summary->m_bound_function = host;
host_summary->m_bound_function = gpu;
tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
if (fn_opts == NULL_TREE)
fn_opts = optimization_default_node;
fn_opts = copy_node (fn_opts);
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
process_gpu_implementation_attributes (gpu->decl);
/* Create reference between a kernel and a corresponding host implementation
to guarantee LTO streaming to the same LTRANS. */
......
......@@ -79,7 +79,7 @@ process_hsa_functions (void)
hsa_function_summary *s = hsa_summaries->get (node);
/* A linked function is skipped. */
if (s->m_binded_function != NULL)
if (s->m_bound_function != NULL)
continue;
if (s->m_kind != HSA_NONE)
......@@ -90,6 +90,7 @@ process_hsa_functions (void)
= node->create_virtual_clone (vec <cgraph_edge *> (),
NULL, NULL, "hsa");
TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
clone->externally_visible = node->externally_visible;
clone->force_output = true;
hsa_summaries->link_functions (clone, node, s->m_kind, false);
......@@ -107,6 +108,7 @@ process_hsa_functions (void)
= node->create_virtual_clone (vec <cgraph_edge *> (),
NULL, NULL, "hsa");
TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
clone->externally_visible = node->externally_visible;
if (!cgraph_local_p (node))
clone->force_output = true;
......@@ -131,7 +133,7 @@ process_hsa_functions (void)
hsa_function_summary *dst = hsa_summaries->get (e->callee);
if (dst->m_kind != HSA_NONE && !dst->m_gpu_implementation_p)
{
e->redirect_callee (dst->m_binded_function);
e->redirect_callee (dst->m_bound_function);
if (dump_file)
fprintf (dump_file,
"Redirecting edge to HSA function: %s->%s\n",
......@@ -193,10 +195,10 @@ ipa_hsa_write_summary (void)
bp = bitpack_create (ob->main_stream);
bp_pack_value (&bp, s->m_kind, 2);
bp_pack_value (&bp, s->m_gpu_implementation_p, 1);
bp_pack_value (&bp, s->m_binded_function != NULL, 1);
bp_pack_value (&bp, s->m_bound_function != NULL, 1);
streamer_write_bitpack (&bp);
if (s->m_binded_function)
stream_write_tree (ob, s->m_binded_function->decl, true);
if (s->m_bound_function)
stream_write_tree (ob, s->m_bound_function->decl, true);
}
}
......@@ -249,7 +251,7 @@ ipa_hsa_read_section (struct lto_file_decl_data *file_data, const char *data,
if (has_tree)
{
tree decl = stream_read_tree (&ib_main, data_in);
s->m_binded_function = cgraph_node::get_create (decl);
s->m_bound_function = cgraph_node::get_create (decl);
}
}
lto_free_section_data (file_data, LTO_section_ipa_hsa, NULL, data,
......
2016-11-23 Martin Jambor <mjambor@suse.cz>
* c-c++-common/gomp/gridify-1.c: Update scan string.
* gfortran.dg/gomp/gridify-1.f90: Likewise.
* c-c++-common/gomp/gridify-2.c: New test.
* c-c++-common/gomp/gridify-3.c: Likewise.
2016-11-23 Richard Biener <rguenther@suse.de>
PR tree-optimization/78396
......
......@@ -51,4 +51,4 @@ foo4 (int j, int n, int *a)
}
/* { dg-final { scan-tree-dump-times "Target construct will be turned into a gridified GPGPU kernel" 4 "omplower" } } */
/* { dg-final { scan-tree-dump-times "Target construct will be turned into a gridified HSA kernel" 4 "omplower" } } */
/* { dg-do compile } */
/* { dg-require-effective-target offload_hsa } */
/* { dg-options "-fopenmp -fdump-tree-omplower-details" } */
#define BLOCK_SIZE 16
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC){
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE)
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE)
{
// Each team has a local copy of these mini matrices
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
#pragma omp parallel
{
int C_row, C_col;
float Cval = 0.0;
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE )
{
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
for (int e = 0; e < BLOCK_SIZE; ++e)
Cval += As[row][e] * Bs[e][col];
}
} /* End for kblock .. */
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N))
C[(C_row*LDC)+C_col] = alpha*Cval + beta*C[(C_row*LDC)+C_col];
}
} /* end parallel */
} /* end target teams distribute */
}
/* { dg-final { scan-tree-dump "Target construct will be turned into a gridified HSA kernel" "omplower" } } */
/* { dg-do compile } */
/* { dg-require-effective-target offload_hsa } */
/* { dg-options "-fopenmp -fdump-tree-omplower-details" } */
#define BLOCK_SIZE 16
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC)
{
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE)
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE)
{
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
float Cs[BLOCK_SIZE][BLOCK_SIZE];
int C_row, C_col;
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
Cs[row][col] = 0.0;
}
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE )
{
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
for (int e = 0; e < BLOCK_SIZE; ++e)
Cs[row][col] += As[row][e] * Bs[e][col];
}
} /* End for kblock .. */
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N))
C[(C_row*LDC)+C_col] = alpha*Cs[row][col] + beta*C[(C_row*LDC)+C_col];
}
} /* End distribute */
}
/* { dg-final { scan-tree-dump "Target construct will be turned into a gridified HSA kernel" "omplower" } } */
@@ -13,4 +13,4 @@ subroutine vector_square(n, a, b)
!$omp end target teams
end subroutine vector_square
-! { dg-final { scan-tree-dump "Target construct will be turned into a gridified GPGPU kernel" "omplower" } }
+! { dg-final { scan-tree-dump "Target construct will be turned into a gridified HSA kernel" "omplower" } }
2016-11-23  Martin Jambor  <mjambor@suse.cz>

	* testsuite/libgomp.hsa.c/bits-insns.c: New test.
	* testsuite/libgomp.hsa.c/tiling-1.c: Likewise.
	* testsuite/libgomp.hsa.c/tiling-2.c: Likewise.

2016-11-23  Martin Liska  <mliska@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>
#include <math.h>
#define N 12
int main()
{
unsigned int arguments[N] = {0u, 1u, 2u, 3u, 111u, 333u, 444u, 0x80000000u, 0x0000ffffu, 0xf0000000u, 0xff000000u, 0xffffffffu};
int clrsb[N] = {};
int clz[N] = {};
int ctz[N] = {};
int ffs[N] = {};
int parity[N] = {};
int popcount[N] = {};
int ref_clrsb[N] = {};
int ref_clz[N] = {};
int ref_ctz[N] = {};
int ref_ffs[N] = {};
int ref_parity[N] = {};
int ref_popcount[N] = {};
for (unsigned i = 0; i < N; i++)
{
ref_clrsb[i] = __builtin_clrsb (arguments[i]);
ref_clz[i] = __builtin_clz (arguments[i]);
ref_ctz[i] = __builtin_ctz (arguments[i]);
ref_ffs[i] = __builtin_ffs (arguments[i]);
ref_parity[i] = __builtin_parity (arguments[i]);
ref_popcount[i] = __builtin_popcount (arguments[i]);
}
#pragma omp target map(from:clz, ctz, ffs, parity, popcount)
{
for (unsigned i = 0; i < N; i++)
{
clrsb[i] = __builtin_clrsb (arguments[i]);
clz[i] = __builtin_clz (arguments[i]);
ctz[i] = __builtin_ctz (arguments[i]);
ffs[i] = __builtin_ffs (arguments[i]);
parity[i] = __builtin_parity (arguments[i]);
popcount[i] = __builtin_popcount (arguments[i]);
}
}
for (unsigned i = 0; i < N; i++)
if (ref_clrsb[i] != clrsb[i])
__builtin_abort ();
/* CLZ is undefined for zero.  */
for (unsigned i = 1; i < N; i++)
if (ref_clz[i] != clz[i])
__builtin_abort ();
/* Likewise for CTZ.  */
for (unsigned i = 1; i < N; i++)
if (ref_ctz[i] != ctz[i])
__builtin_abort ();
for (unsigned i = 0; i < N; i++)
if (ref_ffs[i] != ffs[i])
__builtin_abort ();
for (unsigned i = 0; i < N; i++)
if (ref_parity[i] != parity[i])
__builtin_abort ();
for (unsigned i = 0; i < N; i++)
if (ref_popcount[i] != popcount[i])
__builtin_abort ();
return 0;
}
/*
matmul.c : Matrix Multiplication with tiling for openmp4 example
*/
#include <stdlib.h>
#include <math.h>
#define BLOCK_SIZE 16
/*
#define BLOCK_SIZE 32
*/
#define NSECPERSEC 1000000000L
typedef struct {
int width;
int height;
int stride;
int hpad;
float* elements;
} Matrix;
/* Correctly extract the number of nanoseconds from the two time structures */
long int get_nanosecs( struct timespec start_time, struct timespec end_time) {
long int nanosecs;
if ((end_time.tv_nsec-start_time.tv_nsec)<0) nanosecs =
((((long int) end_time.tv_sec- (long int) start_time.tv_sec )-1)*NSECPERSEC ) +
( NSECPERSEC + (long int) end_time.tv_nsec - (long int) start_time.tv_nsec) ;
else nanosecs =
(((long int) end_time.tv_sec- (long int) start_time.tv_sec )*NSECPERSEC ) +
( (long int) end_time.tv_nsec - (long int) start_time.tv_nsec );
return nanosecs;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void simple_sgemm_tn(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void tiled_sgemm_tt(const int M,const int N,const int K,const float alpha, const float*A, const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
int verify(float* v_res, float* v_ref, int len) {
int passed = 1;
int i;
for (i = 0; i < len; ++i) {
if (fabs(v_res[i] - v_ref[i]) > 0.001*v_ref[i]) {
__builtin_abort ();
}
}
return passed;
}
int main(int argc, char* argv[]){
Matrix A,B,Bt,C,Cref;
int a1,a2,a3,i,j;
struct timespec start_time1, end_time1;
struct timespec start_time2, end_time2;
long int nanosecs,total_ops;
float gflopsTiled,gflopsCPU;
a1 = 35;
a2 = 28;
a3 = 47;
A.height = a1;
A.width = a2;
A.stride = (((A.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.hpad = (((A.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.elements = (float*)malloc(A.stride * A.hpad* sizeof(float));
B.height = a2;
B.width = a3;
B.stride = (((B.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.hpad = (((B.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.elements = (float*)malloc(B.stride * B.hpad * sizeof(float));
/* Bt is the same as B but stored in column-major order */
Bt.height = B.height;
Bt.width = B.width;
Bt.stride = B.stride;
Bt.hpad = B.hpad;
Bt.elements = (float*)malloc(Bt.stride * Bt.hpad * sizeof(float));
C.height = a1;
C.width = a3;
C.stride = (((C.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.hpad = (((C.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.elements = (float*)malloc(C.stride * C.hpad * sizeof(float));
Cref.height = a1;
Cref.width = a3;
Cref.stride = (((Cref.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.hpad = (((Cref.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.elements = (float*)malloc(Cref.stride * Cref.hpad * sizeof(float));
for(i = 0; i < A.hpad ; i++)
for(j = 0; j < A.stride; j++) {
if (( j<A.width ) && (i<A.height)) {
A.elements[i*A.stride + j] = (i % 3);
} else {
A.elements[i*A.stride + j] = 0.0;
}
}
/* Initialize B and Bt */
for(i = 0; i < B.hpad ; i++)
for(j = 0; j < B.stride; j++) {
if (( j<B.width ) && (i<B.height)) {
B.elements[i*B.stride+j] = (j % 2);
Bt.elements[j*Bt.stride+i] = B.elements[i*B.stride+j] ;
} else {
B.elements[i*B.stride+j] = 0.0;
Bt.elements[j*Bt.stride+i] = 0.0;
}
}
/* zero C, and Cref */
for(i = 0; i < C.hpad; i++)
for(j = 0; j < C.stride; j++) {
C.elements[i*C.stride+j] = 0.0;
Cref.elements[i*Cref.stride+j] = 0.0;
}
simple_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,Cref.elements,Cref.stride);
tiled_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,C.elements,C.stride);
verify(C.elements, Cref.elements, C.height * C.stride);
return 0;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) {
/* A,B, and C are in row-major order */
int c_row,c_col,inner;
float sum;
for (c_col = 0 ; c_col<N; c_col++ ) {
for (c_row = 0 ; c_row<M; c_row++ ) {
sum = 0.0 ;
for (inner = 0 ; inner<K; inner++ ) {
sum += A[c_row*LDA + inner] * B[inner*LDB + c_col] ;
}
C[c_row*LDC + c_col] = alpha*sum + beta*C[ c_row*LDC + c_col] ;
}
}
}
/***************************
tiled_sgemm_tt: Tiled matrix multiplication:
***************************/
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC){
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE)
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE)
{
// Each team has a local copy of these mini matrices
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
#pragma omp parallel
{
int C_row, C_col;
float Cval = 0.0;
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE )
{
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
for (int e = 0; e < BLOCK_SIZE; ++e)
Cval += As[row][e] * Bs[e][col];
}
} /* End for kblock .. */
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N))
C[(C_row*LDC)+C_col] = alpha*Cval + beta*C[(C_row*LDC)+C_col];
}
} /* end parallel */
} /* end target teams distribute */
}
/*
matmul.c : Matrix Multiplication with tiling for openmp4 example
*/
#include <stdlib.h>
#include <math.h>
#define BLOCK_SIZE 16
/*
#define BLOCK_SIZE 32
*/
#define NSECPERSEC 1000000000L
typedef struct {
int width;
int height;
int stride;
int hpad;
float* elements;
} Matrix;
/* Correctly extract the number of nanoseconds from the two time structures */
long int get_nanosecs( struct timespec start_time, struct timespec end_time) {
long int nanosecs;
if ((end_time.tv_nsec-start_time.tv_nsec)<0) nanosecs =
((((long int) end_time.tv_sec- (long int) start_time.tv_sec )-1)*NSECPERSEC ) +
( NSECPERSEC + (long int) end_time.tv_nsec - (long int) start_time.tv_nsec) ;
else nanosecs =
(((long int) end_time.tv_sec- (long int) start_time.tv_sec )*NSECPERSEC ) +
( (long int) end_time.tv_nsec - (long int) start_time.tv_nsec );
return nanosecs;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void simple_sgemm_tn(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void tiled_sgemm_tt(const int M,const int N,const int K,const float alpha, const float*A, const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
int verify(float* v_res, float* v_ref, int len) {
int passed = 1;
int i;
for (i = 0; i < len; ++i) {
if (fabs(v_res[i] - v_ref[i]) > 0.001*v_ref[i]) {
__builtin_abort ();
}
}
return passed;
}
int main(int argc, char* argv[]){
Matrix A,B,Bt,C,Cref;
int a1,a2,a3,i,j;
struct timespec start_time1, end_time1;
struct timespec start_time2, end_time2;
long int nanosecs,total_ops;
float gflopsTiled,gflopsCPU;
a1 = 35;
a2 = 28;
a3 = 47;
A.height = a1;
A.width = a2;
A.stride = (((A.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.hpad = (((A.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.elements = (float*)malloc(A.stride * A.hpad* sizeof(float));
B.height = a2;
B.width = a3;
B.stride = (((B.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.hpad = (((B.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.elements = (float*)malloc(B.stride * B.hpad * sizeof(float));
/* Bt is the same as B but stored in column-major order */
Bt.height = B.height;
Bt.width = B.width;
Bt.stride = B.stride;
Bt.hpad = B.hpad;
Bt.elements = (float*)malloc(Bt.stride * Bt.hpad * sizeof(float));
C.height = a1;
C.width = a3;
C.stride = (((C.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.hpad = (((C.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.elements = (float*)malloc(C.stride * C.hpad * sizeof(float));
Cref.height = a1;
Cref.width = a3;
Cref.stride = (((Cref.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.hpad = (((Cref.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.elements = (float*)malloc(Cref.stride * Cref.hpad * sizeof(float));
for(i = 0; i < A.hpad ; i++)
for(j = 0; j < A.stride; j++) {
if (( j<A.width ) && (i<A.height)) {
A.elements[i*A.stride + j] = (i % 3);
} else {
A.elements[i*A.stride + j] = 0.0;
}
}
/* Initialize B and Bt */
for(i = 0; i < B.hpad ; i++)
for(j = 0; j < B.stride; j++) {
if (( j<B.width ) && (i<B.height)) {
B.elements[i*B.stride+j] = (j % 2);
Bt.elements[j*Bt.stride+i] = B.elements[i*B.stride+j] ;
} else {
B.elements[i*B.stride+j] = 0.0;
Bt.elements[j*Bt.stride+i] = 0.0;
}
}
/* zero C, and Cref */
for(i = 0; i < C.hpad; i++)
for(j = 0; j < C.stride; j++) {
C.elements[i*C.stride+j] = 0.0;
Cref.elements[i*Cref.stride+j] = 0.0;
}
simple_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,Cref.elements,Cref.stride);
tiled_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,C.elements,C.stride);
verify(C.elements, Cref.elements, C.height * C.stride);
return 0;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) {
/* A,B, and C are in row-major order */
int c_row,c_col,inner;
float sum;
for (c_col = 0 ; c_col<N; c_col++ ) {
for (c_row = 0 ; c_row<M; c_row++ ) {
sum = 0.0 ;
for (inner = 0 ; inner<K; inner++ ) {
sum += A[c_row*LDA + inner] * B[inner*LDB + c_col] ;
}
C[c_row*LDC + c_col] = alpha*sum + beta*C[ c_row*LDC + c_col] ;
}
}
}
/***************************
tiled_sgemm_tt: Tiled matrix multiplication:
***************************/
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC){
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE) {
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE) {
// We now have M/BLOCK_SIZE * N/BLOCK_SIZE teams = (M*N)/(BLOCK_SIZE*BLOCK_SIZE)
// The grid global dimensions are M,N,1
// The grid local dimensions are BLOCK_SIZE,BLOCK_SIZE,1
// -------------------------------------------------------------------
// The rest of this code forms the HSAIL kernel, with the
// pairs of "parallel for collapse(2)" loops replaced with a barrier.
// The kernel initializes these values
// C_row_start = get_group_id(0) * BLOCK_SIZE
// C_col_start = get_group_id(1) * BLOCK_SIZE
// row=get_local_id(0)
// col=get_local_id(1)
// -------------------------------------------------------------------
// Each team has a local copy of these mini matrices
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
float Cs[BLOCK_SIZE][BLOCK_SIZE];
int C_row, C_col;
/* Zero Cs for this BLOCK */
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
Cs[row][col] = 0.0;
}
}
// This kblock loop is run on the master thread of each team
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE ) {
// Copy global memory values to local memory
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
}
// Calculate Cs <- Sum(As X Bs) across all kblocks
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
for (int e = 0; e < BLOCK_SIZE; ++e)
Cs[row][col] += As[row][e] * Bs[e][col];
}
}
} /* End for kblock .. */
// Scale and update the actual C from Cs
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N)) {
C[(C_row*LDC)+C_col] = alpha*Cs[row][col] + beta*C[(C_row*LDC)+C_col];
}
}
}
// -------------------------------------------------------------------
// This is the end of the kernel
}
}
}