Commit 56b1c60e by Martin Jambor

backport: hsa-builtins.def: New file.

Merge from HSA branch to trunk

2016-11-23  Martin Jambor  <mjambor@suse.cz>
	    Martin Liska  <mliska@suse.cz>

gcc/
	* hsa-builtins.def: New file.
	* Makefile.in (BUILTINS_DEF): Add hsa-builtins.def dependency.
	* builtins.def: Include hsa-builtins.def.
	(DEF_HSA_BUILTIN): New macro.
	* dumpfile.h (OPTGROUP_OPENMP): Define.
	* dumpfile.c (optgroup_options): Added OPTGROUP_OPENMP.
	* gimple.h (gf_mask): Added elements GF_OMP_FOR_GRID_INTRA_GROUP and
	GF_OMP_FOR_GRID_GROUP_ITER.
	(gimple_omp_for_grid_phony): Added checking assert.
	(gimple_omp_for_set_grid_phony): Likewise.
	(gimple_omp_for_grid_intra_group): New function.
	(gimple_omp_for_set_grid_intra_group): Likewise.
	(gimple_omp_for_grid_group_iter): Likewise.
	(gimple_omp_for_set_grid_group_iter): Likewise.
	* omp-low.c (check_omp_nesting_restrictions): Allow GRID loop where
	previously only a distribute loop was permitted.
	(lower_lastprivate_clauses): Allow non-tcc_comparison predicates.
	(grid_get_kernel_launch_attributes): Support multiple HSA grid
	dimensions.
	(grid_expand_omp_for_loop): Likewise and also support standalone
	distribute constructs.  New parameter INTRA_GROUP, updated both users.
	(grid_expand_target_grid_body): Support standalone distribute
	constructs.
	(pass_data_expand_omp): Changed optinfo_flags to OPTGROUP_OPENMP.
	(pass_data_expand_omp_ssa): Likewise.
	(pass_data_omp_device_lower): Likewise.
	(pass_data_lower_omp): Likewise.
	(pass_data_diagnose_omp_blocks): Likewise.
	(pass_data_oacc_device_lower): Likewise.
	(pass_data_omp_target_link): Likewise.
	(grid_lastprivate_predicate): New function.
	(lower_omp_for_lastprivate): Call grid_lastprivate_predicate for
	gridified loops.
	(lower_omp_for): Support standalone distribute constructs.
	(grid_prop): New type.
	(grid_safe_assignment_p): Check for assignments to group_sizes, new
	parameter GRID.
	(grid_seq_only_contains_local_assignments): New parameter GRID, pass
	it to callee.
	(grid_find_single_omp_among_assignments_1): Likewise, improve missed
	optimization info messages.
	(grid_find_single_omp_among_assignments): Likewise.
	(grid_find_ungridifiable_statement): Do not bail out for SIMDs.
	(grid_parallel_clauses_gridifiable): New function.
	(grid_inner_loop_gridifiable_p): Likewise.
	(grid_dist_follows_simple_pattern): Likewise.
	(grid_gfor_follows_tiling_pattern): Likewise.
	(grid_call_permissible_in_distribute_p): Likewise.
	(grid_handle_call_in_distribute): Likewise.
	(grid_dist_follows_tiling_pattern): Likewise.
	(grid_target_follows_gridifiable_pattern): Support standalone distribute
	constructs.
	(grid_var_segment): New enum.
	(grid_mark_variable_segment): New function.
	(grid_copy_leading_local_assignments): Call grid_mark_variable_segment
	if a new argument says so.
	(grid_process_grid_body): New function.
	(grid_eliminate_combined_simd_part): Likewise.
	(grid_mark_tiling_loops): Likewise.
	(grid_mark_tiling_parallels_and_loops): Likewise.
	(grid_process_kernel_body_copy): Support standalone distribute
	constructs.
	(grid_attempt_target_gridification): New grid variable holding overall
	gridification state.  Support standalone distribute constructs and
	collapse clauses.
	* doc/optinfo.texi (Optimization groups): Document OPTGROUP_OPENMP.
	* hsa.h (hsa_bb): Add method append_phi.
	(hsa_insn_br): Renamed to hsa_insn_cbr; renamed all
	occurrences in all files too.
	(hsa_insn_br): New class, now the ancestor of hsa_insn_cbr.
	(is_a_helper <hsa_insn_br *>::test): New function.
	(is_a_helper <hsa_insn_cbr *>::test): Adjust to only cover conditional
	branch instructions.
	(hsa_insn_signal): Make a direct descendant of
	hsa_insn_basic.  Add memorder constructor parameter and
	m_memory_order and m_signalop member variables.
	(hsa_insn_queue): Changed constructor parameters to common form.
	Added m_segment and m_memory_order member variables.
	(hsa_summary_t): Add private member function
	process_gpu_implementation_attributes.
	(hsa_function_summary): Rename m_binded_function to
	m_bound_function.
	(hsa_insn_basic_p): Remove typedef.
	(hsa_op_with_type): Change hsa_insn_basic_p into plain pointers.
	(hsa_op_reg_p): Remove typedef.
	(hsa_function_representation): Change hsa_op_reg_p into plain
	pointers.
	(hsa_insn_phi): Removed new and delete operators.
	(hsa_insn_br): Likewise.
	(hsa_insn_cbr): Likewise.
	(hsa_insn_sbr): Likewise.
	(hsa_insn_cmp): Likewise.
	(hsa_insn_mem): Likewise.
	(hsa_insn_atomic): Likewise.
	(hsa_insn_signal): Likewise.
	(hsa_insn_seg): Likewise.
	(hsa_insn_call): Likewise.
	(hsa_insn_arg_block): Likewise.
	(hsa_insn_comment): Likewise.
	(hsa_insn_srctype): Likewise.
	(hsa_insn_packed): Likewise.
	(hsa_insn_cvt): Likewise.
	(hsa_insn_alloca): Likewise.
	* hsa.c (hsa_destroy_insn): Also handle instances of hsa_insn_br.
	(process_gpu_implementation_attributes): New function.
	(link_functions): Move some functionality into it.  Adjust after
	renaming m_binded_functions to m_bound_functions.
	(hsa_insn_basic::op_output_p): Add BRIG_OPCODE_DEBUGTRAP
	to the list of instructions with no output registers.
	(get_in_type): Return this if it is a register of
	matching size.
	(hsa_get_declaration_name): Moved to...
	* hsa-gen.c (hsa_get_declaration_name): ...here.  Allocate the
	temporary string on an obstack instead of from ggc.
	(query_hsa_grid): Renamed to query_hsa_grid_dim, reimplemented, cut
	down to two overloads.
	(hsa_allocp_operand_address): Removed.
	(hsa_allocp_operand_immed): Likewise.
	(hsa_allocp_operand_reg): Likewise.
	(hsa_allocp_operand_code_list): Likewise.
	(hsa_allocp_operand_operand_list): Likewise.
	(hsa_allocp_inst_basic): Likewise.
	(hsa_allocp_inst_phi): Likewise.
	(hsa_allocp_inst_mem): Likewise.
	(hsa_allocp_inst_atomic): Likewise.
	(hsa_allocp_inst_signal): Likewise.
	(hsa_allocp_inst_seg): Likewise.
	(hsa_allocp_inst_cmp): Likewise.
	(hsa_allocp_inst_br): Likewise.
	(hsa_allocp_inst_sbr): Likewise.
	(hsa_allocp_inst_call): Likewise.
	(hsa_allocp_inst_arg_block): Likewise.
	(hsa_allocp_inst_comment): Likewise.
	(hsa_allocp_inst_queue): Likewise.
	(hsa_allocp_inst_srctype): Likewise.
	(hsa_allocp_inst_packed): Likewise.
	(hsa_allocp_inst_cvt): Likewise.
	(hsa_allocp_inst_alloca): Likewise.
	(hsa_allocp_bb): Likewise.
	(hsa_obstack): New.
	(hsa_init_data_for_cfun): Initialize obstack.
	(hsa_deinit_data_for_cfun): Release memory of the obstack.
	(hsa_op_immed::operator new): Use obstack instead of object_allocator.
	(hsa_op_reg::operator new): Likewise.
	(hsa_op_address::operator new): Likewise.
	(hsa_op_code_list::operator new): Likewise.
	(hsa_op_operand_list::operator new): Likewise.
	(hsa_insn_basic::operator new): Likewise.
	(hsa_insn_phi::operator new): Likewise.
	(hsa_insn_br::operator new): Likewise.
	(hsa_insn_sbr::operator new): Likewise.
	(hsa_insn_cmp::operator new): Likewise.
	(hsa_insn_mem::operator new): Likewise.
	(hsa_insn_atomic::operator new): Likewise.
	(hsa_insn_signal::operator new): Likewise.
	(hsa_insn_seg::operator new): Likewise.
	(hsa_insn_call::operator new): Likewise.
	(hsa_insn_arg_block::operator new): Likewise.
	(hsa_insn_comment::operator new): Likewise.
	(hsa_insn_srctype::operator new): Likewise.
	(hsa_insn_packed::operator new): Likewise.
	(hsa_insn_cvt::operator new): Likewise.
	(hsa_insn_alloca::operator new): Likewise.
	(hsa_init_new_bb): Likewise.
	(hsa_bb::append_phi): New function.
	(gen_hsa_phi_from_gimple_phi): Use it.
	(get_symbol_for_decl): Fix distinguishing between
	global and local functions.  Put local variables into a segment
	according to their attribute or static flag, if there is one.
	(hsa_insn_br::hsa_insn_br): New.
	(hsa_insn_br::operator new): Likewise.
	(hsa_insn_cbr::hsa_insn_cbr): Set width via ancestor constructor.
	(query_hsa_grid_nodim): New function.
	(multiply_grid_dim_characteristics): Likewise.
	(gen_get_num_threads): Likewise.
	(gen_get_num_teams): Reimplemented.
	(gen_get_team_num): Likewise.
	(gen_hsa_insns_for_known_library_call): Updated calls to the above
	helper functions.
	(get_memory_order_name): Removed.
	(get_memory_order): Likewise.
	(hsa_memorder_from_tree): New function.
	(gen_hsa_ternary_atomic_for_builtin): Renamed to
	gen_hsa_atomic_for_builtin, can also create signals.
	(gen_hsa_insns_for_call): Handle many new builtins.  Adjust to use
	hsa_memory_order_from_tree and gen_hsa_atomic_for_builtin.
	(hsa_insn_atomic): Fix function comment.
	(hsa_insn_signal::hsa_insn_signal): Fix comment.  Update call to
	ancestor constructor and initialization of new member variables.
	(hsa_insn_queue::hsa_insn_queue): Added initialization of new
	member variables.
	(hsa_get_host_function): Handle functions with no bound CPU
	implementation.  Fix binded to bound.
	(get_brig_function_name): Likewise.
	(HSA_SORRY_ATV): Remove semicolon after macro.
	(HSA_SORRY_AT): Likewise.
	(omp_simple_builtin::generate): Add missing semicolons.
	(hsa_insn_phi::operator new): Removed.
	(hsa_insn_br::operator new): Likewise.
	(hsa_insn_cbr::operator new): Likewise.
	(hsa_insn_sbr::operator new): Likewise.
	(hsa_insn_cmp::operator new): Likewise.
	(hsa_insn_mem::operator new): Likewise.
	(hsa_insn_atomic::operator new): Likewise.
	(hsa_insn_signal::operator new): Likewise.
	(hsa_insn_seg::operator new): Likewise.
	(hsa_insn_call::operator new): Likewise.
	(hsa_insn_arg_block::operator new): Likewise.
	(hsa_insn_comment::operator new): Likewise.
	(hsa_insn_srctype::operator new): Likewise.
	(hsa_insn_packed::operator new): Likewise.
	(hsa_insn_cvt::operator new): Likewise.
	(hsa_insn_alloca::operator new): Likewise.
	(get_symbol_for_decl): Accept CONST_DECLs, put them to
	readonly segment.
	(gen_hsa_addr): Also process CONST_DECLs.
	(gen_hsa_addr_insns): Process CONST_DECLs by creating private
	copies.
	(gen_hsa_unary_operation): Make sure the function does
	not use bittype source type for firstbit and lastbit operations.
	(gen_hsa_popcount_to_dest): Make sure the function uses a bittype
	source type.
	* hsa-brig.c (emit_insn_operands): Cope with zero operands in an
	instruction.
	(emit_branch_insn): Renamed to emit_cond_branch_insn.
	Emit the width stored in the class.
	(emit_generic_branch_insn): New function.
	(emit_insn): Call emit_generic_branch_insn.
	(emit_signal_insn): Remove obsolete comment.  Update
	member variable name, pick a type according to profile.
	(emit_alloca_insn): Remove obsolete comment.
	(emit_atomic_insn): Likewise.
	(emit_queue_insn): Get segment and memory order from the IR object.
	(hsa_brig_section): Make allocate_new_chunk, chunks
	and cur_chunk private, add a default NULL parameter to the add method.
	(hsa_brig_section::add): Added a new parameter, store pointer to
	output data there if it is non-NULL.
	(emit_function_directives): Use this new parameter instead of
	calculating the pointer itself, fix function comment.
	(hsa_brig_emit_function): Add forgotten endian conversion.
	(hsa_output_kernels): Remove unnecessary building of
	kernel_dependencies_vector_type.
	(emit_immediate_operand): Declare.
	(emit_directive_variable): Also emit initializers of CONST_DECLs.
	(gen_hsa_insn_for_internal_fn_call): Also handle IFN_RSQRT.
	(verify_function_arguments): Properly detect variadic
	arguments.
	* hsa-dump.c (hsa_width_specifier_name): New function.
	(dump_hsa_insn_1): Dump generic branch instructions, update signal
	member variable name.  Special dumping for queue objects.
	* ipa-hsa.c (process_hsa_functions): Adjust after renaming
	m_binded_functions to m_bound_functions.  Copy externally visible flag
	to the node.
	(ipa_hsa_write_summary): Likewise.
	(ipa_hsa_read_section): Likewise.


	gcc/fortran/
	* f95-lang.c (DEF_HSA_BUILTIN): New macro.

gcc/testsuite/
	* c-c++-common/gomp/gridify-1.c: Update scan string.
	* gfortran.dg/gomp/gridify-1.f90: Likewise.
	* c-c++-common/gomp/gridify-2.c: New test.
	* c-c++-common/gomp/gridify-3.c: Likewise.

libgomp/
	* testsuite/libgomp.hsa.c/bits-insns.c: New test.
	* testsuite/libgomp.hsa.c/tiling-1.c: Likewise.
	* testsuite/libgomp.hsa.c/tiling-2.c: Likewise.


Co-Authored-By: Martin Liska <mliska@suse.cz>

From-SVN: r242761
parent f6cdfe82
......@@ -911,7 +911,8 @@ RTL_H = $(RTL_BASE_H) $(FLAGS_H) genrtl.h
READ_MD_H = $(OBSTACK_H) $(HASHTAB_H) read-md.h
PARAMS_H = params.h params-enum.h params.def
BUILTINS_DEF = builtins.def sync-builtins.def omp-builtins.def \
gtm-builtins.def sanitizer.def cilkplus.def cilk-builtins.def
gtm-builtins.def sanitizer.def cilkplus.def cilk-builtins.def \
hsa-builtins.def
INTERNAL_FN_DEF = internal-fn.def
INTERNAL_FN_H = internal-fn.h $(INTERNAL_FN_DEF)
TREE_CORE_H = tree-core.h coretypes.h all-tree.def tree.def \
......
......@@ -201,6 +201,19 @@ along with GCC; see the file COPYING3. If not see
|| flag_cilkplus \
|| flag_offload_abi != OFFLOAD_ABI_UNSET))
#undef DEF_HSA_BUILTIN
#ifdef ENABLE_HSA
#define DEF_HSA_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \
false, false, true, ATTRS, false, \
(!flag_disable_hsa))
#else
#define DEF_HSA_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \
false, false, true, ATTRS, false, \
(false))
#endif
/* Builtin used by implementation of Cilk Plus. Most of these are decomposed
by the compiler but a few are implemented in libcilkrts. */
#undef DEF_CILK_BUILTIN_STUB
......@@ -968,6 +981,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, ATTR_NOTHROW_LEAF_LIST)
/* Offloading and Multi Processing builtins. */
#include "omp-builtins.def"
/* Heterogeneous Systems Architecture. */
#include "hsa-builtins.def"
/* Cilk keywords builtins. */
#include "cilk-builtins.def"
......
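The DEF_HSA_BUILTIN hunks above follow GCC's usual .def X-macro scheme: each includer defines the macro to expand entries as it needs, then pulls in the .def file. A minimal, self-contained sketch of that pattern, using hypothetical MINI_* names rather than the real GCC API (GCC uses an #include of the .def file where this sketch uses an object-like list macro, but the expansion mechanics are the same):

```c
/* Hypothetical stand-in for a .def file: one DEF_MINI_BUILTIN entry per
   builtin, expanded differently by each "includer".  */
#define MINI_BUILTINS_DEF \
  DEF_MINI_BUILTIN (BI_FOO, "foo") \
  DEF_MINI_BUILTIN (BI_BAR, "bar")

/* First expansion: build an enum of builtin codes.  */
enum mini_builtin {
#define DEF_MINI_BUILTIN(ENUM, NAME) ENUM,
  MINI_BUILTINS_DEF
#undef DEF_MINI_BUILTIN
  BI_MAX
};

/* Second expansion: build a parallel table of builtin names.  */
static const char *const mini_builtin_names[] = {
#define DEF_MINI_BUILTIN(ENUM, NAME) NAME,
  MINI_BUILTINS_DEF
#undef DEF_MINI_BUILTIN
};
```

In the patch, builtins.def plays the role of the includer: it defines DEF_HSA_BUILTIN (conditionally on ENABLE_HSA) and then includes hsa-builtins.def.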
......@@ -59,6 +59,9 @@ Loop optimization passes. Enabled by @option{-loop}.
@item OPTGROUP_INLINE
Inlining passes. Enabled by @option{-inline}.
@item OPTGROUP_OPENMP
OpenMP passes. Enabled by @option{-openmp}.
@item OPTGROUP_VEC
Vectorization passes. Enabled by @option{-vec}.
......
......@@ -138,6 +138,7 @@ static const struct dump_option_value_info optgroup_options[] =
{"ipa", OPTGROUP_IPA},
{"loop", OPTGROUP_LOOP},
{"inline", OPTGROUP_INLINE},
{"openmp", OPTGROUP_OPENMP},
{"vec", OPTGROUP_VEC},
{"optall", OPTGROUP_ALL},
{NULL, 0}
......
......@@ -98,7 +98,8 @@ enum tree_dump_index
#define OPTGROUP_LOOP (1 << 2) /* Loop optimization passes */
#define OPTGROUP_INLINE (1 << 3) /* Inlining passes */
#define OPTGROUP_VEC (1 << 4) /* Vectorization passes */
#define OPTGROUP_OTHER (1 << 5) /* All other passes */
#define OPTGROUP_OPENMP (1 << 5) /* OpenMP specific transformations */
#define OPTGROUP_OTHER (1 << 6) /* All other passes */
#define OPTGROUP_ALL (OPTGROUP_IPA | OPTGROUP_LOOP | OPTGROUP_INLINE \
| OPTGROUP_VEC | OPTGROUP_OTHER)
......
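The dumpfile.h/dumpfile.c hunks above pair each optgroup name with a distinct bit so that -fopt-info group selectors can be OR-ed together. A hedged sketch of that layout with hypothetical OPTGRP_* names; note that, as in the hunk above, the "all" mask does not include the new OpenMP bit:

```c
/* Hypothetical mirror of the optgroup bit layout from the hunk above.  */
#define OPTGRP_IPA    (1u << 1)
#define OPTGRP_LOOP   (1u << 2)
#define OPTGRP_INLINE (1u << 3)
#define OPTGRP_VEC    (1u << 4)
#define OPTGRP_OPENMP (1u << 5)  /* New bit; OTHER moves up to bit 6.  */
#define OPTGRP_OTHER  (1u << 6)
/* As in the patch, the "optall" mask leaves out the OpenMP group.  */
#define OPTGRP_ALL (OPTGRP_IPA | OPTGRP_LOOP | OPTGRP_INLINE \
                    | OPTGRP_VEC | OPTGRP_OTHER)

/* Name-to-mask table, as optgroup_options does in dumpfile.c.  */
struct optgroup_entry { const char *name; unsigned value; };

static const struct optgroup_entry optgroup_entries[] = {
  { "openmp", OPTGRP_OPENMP },
  { "vec",    OPTGRP_VEC },
  { "optall", OPTGRP_ALL },
};
```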
2016-11-23 Martin Jambor <mjambor@suse.cz>
* f95-lang.c (DEF_HSA_BUILTIN): New macro.
2016-11-22 Steven G. Kargl <kargl@gcc.gnu.org>
PR fortran/78479
......
......@@ -1224,6 +1224,17 @@ gfc_init_builtin_functions (void)
#undef DEF_GOMP_BUILTIN
}
#ifdef ENABLE_HSA
if (!flag_disable_hsa)
{
#undef DEF_HSA_BUILTIN
#define DEF_HSA_BUILTIN(code, name, type, attr) \
gfc_define_builtin ("__builtin_" name, builtin_types[type], \
code, name, attr);
#include "../hsa-builtins.def"
}
#endif
gfc_define_builtin ("__builtin_trap", builtin_types[BT_FN_VOID],
BUILT_IN_TRAP, NULL, ATTR_NOTHROW_LEAF_LIST);
TREE_THIS_VOLATILE (builtin_decl_explicit (BUILT_IN_TRAP)) = 1;
......
......@@ -163,7 +163,13 @@ enum gf_mask {
GF_OMP_FOR_KIND_CILKSIMD = GF_OMP_FOR_SIMD | 1,
GF_OMP_FOR_COMBINED = 1 << 4,
GF_OMP_FOR_COMBINED_INTO = 1 << 5,
/* The following flag must not be used on GF_OMP_FOR_KIND_GRID_LOOP loop
statements. */
GF_OMP_FOR_GRID_PHONY = 1 << 6,
/* The following two flags should only be set on GF_OMP_FOR_KIND_GRID_LOOP
loop statements. */
GF_OMP_FOR_GRID_INTRA_GROUP = 1 << 6,
GF_OMP_FOR_GRID_GROUP_ITER = 1 << 7,
GF_OMP_TARGET_KIND_MASK = (1 << 4) - 1,
GF_OMP_TARGET_KIND_REGION = 0,
GF_OMP_TARGET_KIND_DATA = 1,
......@@ -5143,6 +5149,8 @@ gimple_omp_for_set_pre_body (gimple *gs, gimple_seq pre_body)
static inline bool
gimple_omp_for_grid_phony (const gomp_for *omp_for)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
!= GF_OMP_FOR_KIND_GRID_LOOP);
return (gimple_omp_subcode (omp_for) & GF_OMP_FOR_GRID_PHONY) != 0;
}
......@@ -5151,12 +5159,61 @@ gimple_omp_for_grid_phony (const gomp_for *omp_for)
static inline void
gimple_omp_for_set_grid_phony (gomp_for *omp_for, bool value)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
!= GF_OMP_FOR_KIND_GRID_LOOP);
if (value)
omp_for->subcode |= GF_OMP_FOR_GRID_PHONY;
else
omp_for->subcode &= ~GF_OMP_FOR_GRID_PHONY;
}
/* Return the kernel_intra_group of a GRID_LOOP OMP_FOR statement. */
static inline bool
gimple_omp_for_grid_intra_group (const gomp_for *omp_for)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
return (gimple_omp_subcode (omp_for) & GF_OMP_FOR_GRID_INTRA_GROUP) != 0;
}
/* Set kernel_intra_group flag of OMP_FOR to VALUE. */
static inline void
gimple_omp_for_set_grid_intra_group (gomp_for *omp_for, bool value)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
if (value)
omp_for->subcode |= GF_OMP_FOR_GRID_INTRA_GROUP;
else
omp_for->subcode &= ~GF_OMP_FOR_GRID_INTRA_GROUP;
}
/* Return true if iterations of a grid OMP_FOR statement correspond to HSA
groups. */
static inline bool
gimple_omp_for_grid_group_iter (const gomp_for *omp_for)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
return (gimple_omp_subcode (omp_for) & GF_OMP_FOR_GRID_GROUP_ITER) != 0;
}
/* Set group_iter flag of OMP_FOR to VALUE. */
static inline void
gimple_omp_for_set_grid_group_iter (gomp_for *omp_for, bool value)
{
gcc_checking_assert (gimple_omp_for_kind (omp_for)
== GF_OMP_FOR_KIND_GRID_LOOP);
if (value)
omp_for->subcode |= GF_OMP_FOR_GRID_GROUP_ITER;
else
omp_for->subcode &= ~GF_OMP_FOR_GRID_GROUP_ITER;
}
/* Return the clauses associated with OMP_PARALLEL GS. */
static inline tree
......
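All four new accessors above follow the same subcode bit get/set idiom, guarded by a checking assert on the loop kind. A minimal standalone sketch of that idiom, using a hypothetical struct rather than the real GIMPLE API:

```c
#include <assert.h>

#define FLAG_GRID_INTRA_GROUP (1u << 6)
#define FLAG_GRID_GROUP_ITER  (1u << 7)

/* Hypothetical stand-in for a gimple statement's subcode word.  */
struct grid_stmt { unsigned subcode; };

/* Test the intra-group bit, as gimple_omp_for_grid_intra_group does.  */
static int
grid_intra_group_p (const struct grid_stmt *s)
{
  return (s->subcode & FLAG_GRID_INTRA_GROUP) != 0;
}

/* Set or clear the bit, as gimple_omp_for_set_grid_intra_group does.  */
static void
set_grid_intra_group (struct grid_stmt *s, int value)
{
  if (value)
    s->subcode |= FLAG_GRID_INTRA_GROUP;
  else
    s->subcode &= ~FLAG_GRID_INTRA_GROUP;
}

/* Exercise the pair: set, read back, clear, read back.  */
static int
grid_flag_demo (void)
{
  struct grid_stmt s = { 0 };
  set_grid_intra_group (&s, 1);
  int set_ok = grid_intra_group_p (&s);
  set_grid_intra_group (&s, 0);
  int clear_ok = !grid_intra_group_p (&s);
  return set_ok && clear_ok;
}
```

In the real accessors, the gcc_checking_assert on gimple_omp_for_kind ensures the two new bits are only interpreted on GF_OMP_FOR_KIND_GRID_LOOP statements, since bit 6 means GF_OMP_FOR_GRID_PHONY on every other loop kind.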
/* This file contains the definitions and documentation for the
Offloading and Multi Processing builtins used in the GNU compiler.
Copyright (C) 2005-2015 Free Software Foundation, Inc.
This file is part of GCC.
GCC is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 3, or (at your option) any later
version.
GCC is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with GCC; see the file COPYING3. If not see
<http://www.gnu.org/licenses/>. */
/* Before including this file, you should define a macro:
DEF_HSA_BUILTIN (ENUM, NAME, TYPE, ATTRS)
See builtins.def for details. */
/* The reason why they aren't in gcc/builtins.def is that the Fortran front end
doesn't source those. */
DEF_HSA_BUILTIN (BUILT_IN_HSA_WORKGROUPID, "hsa_workgroupid",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_WORKITEMID, "hsa_workitemid",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_WORKITEMABSID, "hsa_workitemabsid",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_GRIDSIZE, "hsa_gridsize",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
DEF_HSA_BUILTIN (BUILT_IN_HSA_CURRENTWORKGROUPSIZE, "hsa_currentworkgroupsize",
BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
......@@ -621,6 +621,88 @@ hsa_m_atomicop_name (enum BrigAtomicOperation op)
}
}
/* Return textual name of a width specifier. */
static const char *
hsa_width_specifier_name (BrigWidth8_t width)
{
switch (width)
{
case BRIG_WIDTH_NONE:
return "none";
case BRIG_WIDTH_1:
return "1";
case BRIG_WIDTH_2:
return "2";
case BRIG_WIDTH_4:
return "4";
case BRIG_WIDTH_8:
return "8";
case BRIG_WIDTH_16:
return "16";
case BRIG_WIDTH_32:
return "32";
case BRIG_WIDTH_64:
return "64";
case BRIG_WIDTH_128:
return "128";
case BRIG_WIDTH_256:
return "256";
case BRIG_WIDTH_512:
return "512";
case BRIG_WIDTH_1024:
return "1024";
case BRIG_WIDTH_2048:
return "2048";
case BRIG_WIDTH_4096:
return "4096";
case BRIG_WIDTH_8192:
return "8192";
case BRIG_WIDTH_16384:
return "16384";
case BRIG_WIDTH_32768:
return "32768";
case BRIG_WIDTH_65536:
return "65536";
case BRIG_WIDTH_131072:
return "131072";
case BRIG_WIDTH_262144:
return "262144";
case BRIG_WIDTH_524288:
return "524288";
case BRIG_WIDTH_1048576:
return "1048576";
case BRIG_WIDTH_2097152:
return "2097152";
case BRIG_WIDTH_4194304:
return "4194304";
case BRIG_WIDTH_8388608:
return "8388608";
case BRIG_WIDTH_16777216:
return "16777216";
case BRIG_WIDTH_33554432:
return "33554432";
case BRIG_WIDTH_67108864:
return "67108864";
case BRIG_WIDTH_134217728:
return "134217728";
case BRIG_WIDTH_268435456:
return "268435456";
case BRIG_WIDTH_536870912:
return "536870912";
case BRIG_WIDTH_1073741824:
return "1073741824";
case BRIG_WIDTH_2147483648:
return "2147483648";
case BRIG_WIDTH_WAVESIZE:
return "wavesize";
case BRIG_WIDTH_ALL:
return "all";
default:
return "UNKNOWN_WIDTH";
}
}
/* Dump textual representation of HSA IL register REG to file F. */
static void
......@@ -793,9 +875,9 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
hsa_insn_signal *mem = as_a <hsa_insn_signal *> (insn);
fprintf (f, "%s", hsa_opcode_name (mem->m_opcode));
fprintf (f, "_%s", hsa_m_atomicop_name (mem->m_atomicop));
if (mem->m_memoryorder != BRIG_MEMORY_ORDER_NONE)
fprintf (f, "_%s", hsa_memsem_name (mem->m_memoryorder));
fprintf (f, "_%s", hsa_m_atomicop_name (mem->m_signalop));
if (mem->m_memory_order != BRIG_MEMORY_ORDER_NONE)
fprintf (f, "_%s", hsa_memsem_name (mem->m_memory_order));
fprintf (f, "_%s ", hsa_type_name (mem->m_type));
dump_hsa_operands (f, mem);
......@@ -884,9 +966,9 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
fprintf (f, ", ");
dump_hsa_operand (f, cmp->get_op (2));
}
else if (is_a <hsa_insn_br *> (insn))
else if (is_a <hsa_insn_cbr *> (insn))
{
hsa_insn_br *br = as_a <hsa_insn_br *> (insn);
hsa_insn_cbr *br = as_a <hsa_insn_cbr *> (insn);
basic_block target = NULL;
edge_iterator ei;
edge e;
......@@ -921,6 +1003,12 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
fprintf (f, ", ");
}
}
else if (is_a <hsa_insn_br *> (insn))
{
hsa_insn_br *br = as_a <hsa_insn_br *> (insn);
fprintf (f, "%s_width(%s) ", hsa_opcode_name (br->m_opcode),
hsa_width_specifier_name (br->m_width));
}
else if (is_a <hsa_insn_arg_block *> (insn))
{
hsa_insn_arg_block *arg_block = as_a <hsa_insn_arg_block *> (insn);
......@@ -1018,6 +1106,15 @@ dump_hsa_insn_1 (FILE *f, hsa_insn_basic *insn, int *indent)
dump_hsa_operands (f, insn);
}
else if (hsa_insn_queue *qi = dyn_cast <hsa_insn_queue *> (insn))
{
fprintf (f, "%s_%s_%s_%s ", hsa_opcode_name (qi->m_opcode),
hsa_seg_name (qi->m_segment),
hsa_memsem_name (qi->m_memory_order),
hsa_type_name (qi->m_type));
dump_hsa_operands (f, qi);
}
else
{
fprintf (f, "%s_%s ", hsa_opcode_name (insn->m_opcode),
......
......@@ -170,6 +170,7 @@ hsa_insn_basic::op_output_p (unsigned opnum)
case BRIG_OPCODE_SBR:
case BRIG_OPCODE_ST:
case BRIG_OPCODE_SIGNALNORET:
case BRIG_OPCODE_DEBUGTRAP:
/* FIXME: There are probably missing cases here, double check. */
return false;
case BRIG_OPCODE_EXPAND:
......@@ -605,8 +606,8 @@ hsa_destroy_insn (hsa_insn_basic *insn)
{
if (hsa_insn_phi *phi = dyn_cast <hsa_insn_phi *> (insn))
phi->~hsa_insn_phi ();
else if (hsa_insn_br *br = dyn_cast <hsa_insn_br *> (insn))
br->~hsa_insn_br ();
else if (hsa_insn_cbr *br = dyn_cast <hsa_insn_cbr *> (insn))
br->~hsa_insn_cbr ();
else if (hsa_insn_cmp *cmp = dyn_cast <hsa_insn_cmp *> (insn))
cmp->~hsa_insn_cmp ();
else if (hsa_insn_mem *mem = dyn_cast <hsa_insn_mem *> (insn))
......@@ -621,6 +622,8 @@ hsa_destroy_insn (hsa_insn_basic *insn)
block->~hsa_insn_arg_block ();
else if (hsa_insn_sbr *sbr = dyn_cast <hsa_insn_sbr *> (insn))
sbr->~hsa_insn_sbr ();
else if (hsa_insn_br *br = dyn_cast <hsa_insn_br *> (insn))
br->~hsa_insn_br ();
else if (hsa_insn_comment *comment = dyn_cast <hsa_insn_comment *> (insn))
comment->~hsa_insn_comment ();
else
......@@ -783,32 +786,22 @@ hsa_brig_function_name (const char *p)
return buf;
}
/* Return declaration name if exists. */
/* Add a flatten attribute and disable vectorization for gpu implementation
function decl GDECL. */
const char *
hsa_get_declaration_name (tree decl)
void hsa_summary_t::process_gpu_implementation_attributes (tree gdecl)
{
if (!DECL_NAME (decl))
{
char buf[64];
snprintf (buf, 64, "__hsa_anonymous_%i", DECL_UID (decl));
const char *ggc_str = ggc_strdup (buf);
return ggc_str;
}
tree name_tree;
if (TREE_CODE (decl) == FUNCTION_DECL
|| (VAR_P (decl) && is_global_var (decl)))
name_tree = DECL_ASSEMBLER_NAME (decl);
else
name_tree = DECL_NAME (decl);
const char *name = IDENTIFIER_POINTER (name_tree);
/* User-defined assembly names have prepended asterisk symbol. */
if (name[0] == '*')
name++;
DECL_ATTRIBUTES (gdecl)
= tree_cons (get_identifier ("flatten"), NULL_TREE,
DECL_ATTRIBUTES (gdecl));
return name;
tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
if (fn_opts == NULL_TREE)
fn_opts = optimization_default_node;
fn_opts = copy_node (fn_opts);
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
}
void
......@@ -827,21 +820,10 @@ hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
gpu_summary->m_gridified_kernel_p = gridified_kernel_p;
host_summary->m_gridified_kernel_p = gridified_kernel_p;
gpu_summary->m_binded_function = host;
host_summary->m_binded_function = gpu;
tree gdecl = gpu->decl;
DECL_ATTRIBUTES (gdecl)
= tree_cons (get_identifier ("flatten"), NULL_TREE,
DECL_ATTRIBUTES (gdecl));
gpu_summary->m_bound_function = host;
host_summary->m_bound_function = gpu;
tree fn_opts = DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl);
if (fn_opts == NULL_TREE)
fn_opts = optimization_default_node;
fn_opts = copy_node (fn_opts);
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
process_gpu_implementation_attributes (gpu->decl);
/* Create reference between a kernel and a corresponding host implementation
to guarantee LTO streaming to the same LTRANS. */
......
......@@ -79,7 +79,7 @@ process_hsa_functions (void)
hsa_function_summary *s = hsa_summaries->get (node);
/* A linked function is skipped. */
if (s->m_binded_function != NULL)
if (s->m_bound_function != NULL)
continue;
if (s->m_kind != HSA_NONE)
......@@ -90,6 +90,7 @@ process_hsa_functions (void)
= node->create_virtual_clone (vec <cgraph_edge *> (),
NULL, NULL, "hsa");
TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
clone->externally_visible = node->externally_visible;
clone->force_output = true;
hsa_summaries->link_functions (clone, node, s->m_kind, false);
......@@ -107,6 +108,7 @@ process_hsa_functions (void)
= node->create_virtual_clone (vec <cgraph_edge *> (),
NULL, NULL, "hsa");
TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
clone->externally_visible = node->externally_visible;
if (!cgraph_local_p (node))
clone->force_output = true;
......@@ -131,7 +133,7 @@ process_hsa_functions (void)
hsa_function_summary *dst = hsa_summaries->get (e->callee);
if (dst->m_kind != HSA_NONE && !dst->m_gpu_implementation_p)
{
e->redirect_callee (dst->m_binded_function);
e->redirect_callee (dst->m_bound_function);
if (dump_file)
fprintf (dump_file,
"Redirecting edge to HSA function: %s->%s\n",
......@@ -193,10 +195,10 @@ ipa_hsa_write_summary (void)
bp = bitpack_create (ob->main_stream);
bp_pack_value (&bp, s->m_kind, 2);
bp_pack_value (&bp, s->m_gpu_implementation_p, 1);
bp_pack_value (&bp, s->m_binded_function != NULL, 1);
bp_pack_value (&bp, s->m_bound_function != NULL, 1);
streamer_write_bitpack (&bp);
if (s->m_binded_function)
stream_write_tree (ob, s->m_binded_function->decl, true);
if (s->m_bound_function)
stream_write_tree (ob, s->m_bound_function->decl, true);
}
}
......@@ -249,7 +251,7 @@ ipa_hsa_read_section (struct lto_file_decl_data *file_data, const char *data,
if (has_tree)
{
tree decl = stream_read_tree (&ib_main, data_in);
s->m_binded_function = cgraph_node::get_create (decl);
s->m_bound_function = cgraph_node::get_create (decl);
}
}
lto_free_section_data (file_data, LTO_section_ipa_hsa, NULL, data,
......
2016-11-23 Martin Jambor <mjambor@suse.cz>
* c-c++-common/gomp/gridify-1.c: Update scan string.
* gfortran.dg/gomp/gridify-1.f90: Likewise.
* c-c++-common/gomp/gridify-2.c: New test.
* c-c++-common/gomp/gridify-3.c: Likewise.
2016-11-23 Richard Biener <rguenther@suse.de>
PR tree-optimization/78396
......
......@@ -51,4 +51,4 @@ foo4 (int j, int n, int *a)
}
/* { dg-final { scan-tree-dump-times "Target construct will be turned into a gridified GPGPU kernel" 4 "omplower" } } */
/* { dg-final { scan-tree-dump-times "Target construct will be turned into a gridified HSA kernel" 4 "omplower" } } */
/* { dg-do compile } */
/* { dg-require-effective-target offload_hsa } */
/* { dg-options "-fopenmp -fdump-tree-omplower-details" } */
#define BLOCK_SIZE 16
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC){
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE)
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE)
{
// Each team has a local copy of these mini matrices
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
#pragma omp parallel
{
int C_row, C_col;
float Cval = 0.0;
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE )
{
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
for (int e = 0; e < BLOCK_SIZE; ++e)
Cval += As[row][e] * Bs[e][col];
}
} /* End for kblock .. */
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N))
C[(C_row*LDC)+C_col] = alpha*Cval + beta*C[(C_row*LDC)+C_col];
}
} /* end parallel */
} /* end target teams distribute */
}
/* { dg-final { scan-tree-dump "Target construct will be turned into a gridified HSA kernel" "omplower" } } */
/* { dg-do compile } */
/* { dg-require-effective-target offload_hsa } */
/* { dg-options "-fopenmp -fdump-tree-omplower-details" } */
#define BLOCK_SIZE 16
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC)
{
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE)
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE)
{
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
float Cs[BLOCK_SIZE][BLOCK_SIZE];
int C_row, C_col;
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
Cs[row][col] = 0.0;
}
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE )
{
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
for (int e = 0; e < BLOCK_SIZE; ++e)
Cs[row][col] += As[row][e] * Bs[e][col];
}
} /* End for kblock .. */
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N))
C[(C_row*LDC)+C_col] = alpha*Cs[row][col] + beta*C[(C_row*LDC)+C_col];
}
} /* End distribute */
}
/* { dg-final { scan-tree-dump "Target construct will be turned into a gridified HSA kernel" "omplower" } } */
@@ -13,4 +13,4 @@ subroutine vector_square(n, a, b)
!$omp end target teams
end subroutine vector_square
-! { dg-final { scan-tree-dump "Target construct will be turned into a gridified GPGPU kernel" "omplower" } }
+! { dg-final { scan-tree-dump "Target construct will be turned into a gridified HSA kernel" "omplower" } }
2016-11-23  Martin Jambor  <mjambor@suse.cz>

	* testsuite/libgomp.hsa.c/bits-insns.c: New test.
	* testsuite/libgomp.hsa.c/tiling-1.c: Likewise.
	* testsuite/libgomp.hsa.c/tiling-2.c: Likewise.

2016-11-23  Martin Liska  <mliska@suse.cz>
	    Martin Jambor  <mjambor@suse.cz>
#include <math.h>
#define N 12
int main()
{
unsigned int arguments[N] = {0u, 1u, 2u, 3u, 111u, 333u, 444u, 0x80000000u, 0x0000ffffu, 0xf0000000u, 0xff000000u, 0xffffffffu};
int clrsb[N] = {};
int clz[N] = {};
int ctz[N] = {};
int ffs[N] = {};
int parity[N] = {};
int popcount[N] = {};
int ref_clrsb[N] = {};
int ref_clz[N] = {};
int ref_ctz[N] = {};
int ref_ffs[N] = {};
int ref_parity[N] = {};
int ref_popcount[N] = {};
for (unsigned i = 0; i < N; i++)
{
ref_clrsb[i] = __builtin_clrsb (arguments[i]);
ref_clz[i] = __builtin_clz (arguments[i]);
ref_ctz[i] = __builtin_ctz (arguments[i]);
ref_ffs[i] = __builtin_ffs (arguments[i]);
ref_parity[i] = __builtin_parity (arguments[i]);
ref_popcount[i] = __builtin_popcount (arguments[i]);
}
#pragma omp target map(from:clz, ctz, ffs, parity, popcount)
{
for (unsigned i = 0; i < N; i++)
{
clrsb[i] = __builtin_clrsb (arguments[i]);
clz[i] = __builtin_clz (arguments[i]);
ctz[i] = __builtin_ctz (arguments[i]);
ffs[i] = __builtin_ffs (arguments[i]);
parity[i] = __builtin_parity (arguments[i]);
popcount[i] = __builtin_popcount (arguments[i]);
}
}
for (unsigned i = 0; i < N; i++)
if (ref_clrsb[i] != clrsb[i])
__builtin_abort ();
/* CLZ is undefined for zero.  */
for (unsigned i = 1; i < N; i++)
if (ref_clz[i] != clz[i])
__builtin_abort ();
/* Likewise for CTZ.  */
for (unsigned i = 1; i < N; i++)
if (ref_ctz[i] != ctz[i])
__builtin_abort ();
for (unsigned i = 0; i < N; i++)
if (ref_ffs[i] != ffs[i])
__builtin_abort ();
for (unsigned i = 0; i < N; i++)
if (ref_parity[i] != parity[i])
__builtin_abort ();
for (unsigned i = 0; i < N; i++)
if (ref_popcount[i] != popcount[i])
__builtin_abort ();
return 0;
}
/*
matmul.c : Matrix Multiplication with tiling for openmp4 example
*/
#include <stdlib.h>
#include <math.h>
#define BLOCK_SIZE 16
/*
#define BLOCK_SIZE 32
*/
#define NSECPERSEC 1000000000L
typedef struct {
int width;
int height;
int stride;
int hpad;
float* elements;
} Matrix;
/* Correctly extract the number of nanoseconds from the two time structures */
long int get_nanosecs( struct timespec start_time, struct timespec end_time) {
long int nanosecs;
if ((end_time.tv_nsec-start_time.tv_nsec)<0) nanosecs =
((((long int) end_time.tv_sec- (long int) start_time.tv_sec )-1)*NSECPERSEC ) +
( NSECPERSEC + (long int) end_time.tv_nsec - (long int) start_time.tv_nsec) ;
else nanosecs =
(((long int) end_time.tv_sec- (long int) start_time.tv_sec )*NSECPERSEC ) +
( (long int) end_time.tv_nsec - (long int) start_time.tv_nsec );
return nanosecs;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void simple_sgemm_tn(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void tiled_sgemm_tt(const int M,const int N,const int K,const float alpha, const float*A, const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
int verify(float* v_res, float* v_ref, int len) {
int passed = 1;
int i;
for (i = 0; i < len; ++i) {
if (fabs(v_res[i] - v_ref[i]) > 0.001*v_ref[i]) {
__builtin_abort ();
}
}
return passed;
}
int main(int argc, char* argv[]){
Matrix A,B,Bt,C,Cref;
int a1,a2,a3,i,j;
struct timespec start_time1, end_time1;
struct timespec start_time2, end_time2;
long int nanosecs,total_ops;
float gflopsTiled,gflopsCPU;
a1 = 35;
a2 = 28;
a3 = 47;
A.height = a1;
A.width = a2;
A.stride = (((A.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.hpad = (((A.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.elements = (float*)malloc(A.stride * A.hpad* sizeof(float));
B.height = a2;
B.width = a3;
B.stride = (((B.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.hpad = (((B.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.elements = (float*)malloc(B.stride * B.hpad * sizeof(float));
/* Bt is the same as B but stored in column-major order */
Bt.height = B.height;
Bt.width = B.width;
Bt.stride = B.stride;
Bt.hpad = B.hpad;
Bt.elements = (float*)malloc(Bt.stride * Bt.hpad * sizeof(float));
C.height = a1;
C.width = a3;
C.stride = (((C.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.hpad = (((C.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.elements = (float*)malloc(C.stride * C.hpad * sizeof(float));
Cref.height = a1;
Cref.width = a3;
Cref.stride = (((Cref.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.hpad = (((Cref.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.elements = (float*)malloc(Cref.stride * Cref.hpad * sizeof(float));
for(i = 0; i < A.hpad ; i++)
for(j = 0; j < A.stride; j++) {
if (( j<A.width ) && (i<A.height)) {
A.elements[i*A.stride + j] = (i % 3);
} else {
A.elements[i*A.stride + j] = 0.0;
}
}
/* Initialize B and Bt */
for(i = 0; i < B.hpad ; i++)
for(j = 0; j < B.stride; j++) {
if (( j<B.width ) && (i<B.height)) {
B.elements[i*B.stride+j] = (j % 2);
Bt.elements[j*Bt.stride+i] = B.elements[i*B.stride+j] ;
} else {
B.elements[i*B.stride+j] = 0.0;
Bt.elements[j*Bt.stride+i] = 0.0;
}
}
/* zero C, and Cref */
for(i = 0; i < C.hpad; i++)
for(j = 0; j < C.stride; j++) {
C.elements[i*C.stride+j] = 0.0;
Cref.elements[i*Cref.stride+j] = 0.0;
}
simple_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,Cref.elements,Cref.stride);
tiled_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,C.elements,C.stride);
verify(C.elements, Cref.elements, C.height * C.stride);
return 0;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) {
/* A,B, and C are in row-major order */
int c_row,c_col,inner;
float sum;
for (c_col = 0 ; c_col<N; c_col++ ) {
for (c_row = 0 ; c_row<M; c_row++ ) {
sum = 0.0 ;
for (inner = 0 ; inner<K; inner++ ) {
sum += A[c_row*LDA + inner] * B[inner*LDB + c_col] ;
}
C[c_row*LDC + c_col] = alpha*sum + beta*C[ c_row*LDC + c_col] ;
}
}
}
/***************************
tiled_sgemm_tt: Tiled matrix multiplication:
***************************/
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC){
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE)
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE)
{
// Each team has a local copy of these mini matrices
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
#pragma omp parallel
{
int C_row, C_col;
float Cval = 0.0;
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE )
{
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
for (int e = 0; e < BLOCK_SIZE; ++e)
Cval += As[row][e] * Bs[e][col];
}
} /* End for kblock .. */
#pragma omp for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++)
for (int col=0 ; col < BLOCK_SIZE ; col++)
{
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N))
C[(C_row*LDC)+C_col] = alpha*Cval + beta*C[(C_row*LDC)+C_col];
}
} /* end parallel */
} /* end target teams distribute */
}
/*
matmul.c : Matrix Multiplication with tiling for openmp4 example
*/
#include <stdlib.h>
#include <math.h>
#define BLOCK_SIZE 16
/*
#define BLOCK_SIZE 32
*/
#define NSECPERSEC 1000000000L
typedef struct {
int width;
int height;
int stride;
int hpad;
float* elements;
} Matrix;
/* Correctly extract the number of nanoseconds from the two time structures */
long int get_nanosecs( struct timespec start_time, struct timespec end_time) {
long int nanosecs;
if ((end_time.tv_nsec-start_time.tv_nsec)<0) nanosecs =
((((long int) end_time.tv_sec- (long int) start_time.tv_sec )-1)*NSECPERSEC ) +
( NSECPERSEC + (long int) end_time.tv_nsec - (long int) start_time.tv_nsec) ;
else nanosecs =
(((long int) end_time.tv_sec- (long int) start_time.tv_sec )*NSECPERSEC ) +
( (long int) end_time.tv_nsec - (long int) start_time.tv_nsec );
return nanosecs;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void simple_sgemm_tn(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
void tiled_sgemm_tt(const int M,const int N,const int K,const float alpha, const float*A, const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) ;
int verify(float* v_res, float* v_ref, int len) {
int passed = 1;
int i;
for (i = 0; i < len; ++i) {
if (fabs(v_res[i] - v_ref[i]) > 0.001*v_ref[i]) {
__builtin_abort ();
}
}
return passed;
}
int main(int argc, char* argv[]){
Matrix A,B,Bt,C,Cref;
int a1,a2,a3,i,j;
struct timespec start_time1, end_time1;
struct timespec start_time2, end_time2;
long int nanosecs,total_ops;
float gflopsTiled,gflopsCPU;
a1 = 35;
a2 = 28;
a3 = 47;
A.height = a1;
A.width = a2;
A.stride = (((A.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.hpad = (((A.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
A.elements = (float*)malloc(A.stride * A.hpad* sizeof(float));
B.height = a2;
B.width = a3;
B.stride = (((B.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.hpad = (((B.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
B.elements = (float*)malloc(B.stride * B.hpad * sizeof(float));
/* Bt is the same as B but stored in column-major order */
Bt.height = B.height;
Bt.width = B.width;
Bt.stride = B.stride;
Bt.hpad = B.hpad;
Bt.elements = (float*)malloc(Bt.stride * Bt.hpad * sizeof(float));
C.height = a1;
C.width = a3;
C.stride = (((C.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.hpad = (((C.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
C.elements = (float*)malloc(C.stride * C.hpad * sizeof(float));
Cref.height = a1;
Cref.width = a3;
Cref.stride = (((Cref.width-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.hpad = (((Cref.height-1)/BLOCK_SIZE)+1) * BLOCK_SIZE;
Cref.elements = (float*)malloc(Cref.stride * Cref.hpad * sizeof(float));
for(i = 0; i < A.hpad ; i++)
for(j = 0; j < A.stride; j++) {
if (( j<A.width ) && (i<A.height)) {
A.elements[i*A.stride + j] = (i % 3);
} else {
A.elements[i*A.stride + j] = 0.0;
}
}
/* Initialize B and Bt */
for(i = 0; i < B.hpad ; i++)
for(j = 0; j < B.stride; j++) {
if (( j<B.width ) && (i<B.height)) {
B.elements[i*B.stride+j] = (j % 2);
Bt.elements[j*Bt.stride+i] = B.elements[i*B.stride+j] ;
} else {
B.elements[i*B.stride+j] = 0.0;
Bt.elements[j*Bt.stride+i] = 0.0;
}
}
/* zero C, and Cref */
for(i = 0; i < C.hpad; i++)
for(j = 0; j < C.stride; j++) {
C.elements[i*C.stride+j] = 0.0;
Cref.elements[i*Cref.stride+j] = 0.0;
}
simple_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,Cref.elements,Cref.stride);
tiled_sgemm_tt(A.height,B.width,B.height,1.0,A.elements,A.stride,B.elements,B.stride,1.0,C.elements,C.stride);
verify(C.elements, Cref.elements, C.height * C.stride);
return 0;
}
void simple_sgemm_tt(const int M,const int N,const int K,const float alpha, const float* A,const int LDA,
const float* B,const int LDB, const float beta,float* C, const int LDC) {
/* A,B, and C are in row-major order */
int c_row,c_col,inner;
float sum;
for (c_col = 0 ; c_col<N; c_col++ ) {
for (c_row = 0 ; c_row<M; c_row++ ) {
sum = 0.0 ;
for (inner = 0 ; inner<K; inner++ ) {
sum += A[c_row*LDA + inner] * B[inner*LDB + c_col] ;
}
C[c_row*LDC + c_col] = alpha*sum + beta*C[ c_row*LDC + c_col] ;
}
}
}
/***************************
tiled_sgemm_tt: Tiled matrix multiplication:
***************************/
void tiled_sgemm_tt(const int M, const int N, const int K, const float alpha, const float*A, const int LDA,
const float*B, const int LDB, const float beta, float*C, const int LDC){
#pragma omp target teams map(to:A[M*K],B[K*N]) map(from:C[M*N])
#pragma omp distribute collapse(2)
for (int C_row_start=0 ; C_row_start < M ; C_row_start+=BLOCK_SIZE) {
for (int C_col_start=0 ; C_col_start < N ; C_col_start+=BLOCK_SIZE) {
// We now have M/BLOCK_SIZE * N/BLOCK_SIZE teams = (M*N)/(BLOCK_SIZE*BLOCK_SIZE)
// The grid global dimensions are M,N,1
// The grid local dimensions are BLOCK_SIZE,BLOCK_SIZE,1
// -------------------------------------------------------------------
// The rest of this code forms the HSAIL kernel, with the
// pairs of "parallel for collapse(2)" loops replaced with a barrier.
// The kernel initializes these values
// C_row_start = get_group_id(0) * BLOCK_SIZE
// C_col_start = get_group_id(1) * BLOCK_SIZE
// row=get_local_id(0)
// col=get_local_id(1)
// -------------------------------------------------------------------
// Each team has a local copy of these mini matrices
float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
float Cs[BLOCK_SIZE][BLOCK_SIZE];
int C_row, C_col;
/* Zero Cs for this BLOCK */
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
Cs[row][col] = 0.0;
}
}
// This kblock loop is run on the master thread of each team
for (int kblock = 0; kblock < K ; kblock += BLOCK_SIZE ) {
// Copy global memory values to local memory
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (kblock + col < K))
As[row][col] = A[(C_row*LDA)+ kblock + col];
else
As[row][col] = 0;
if ((kblock + row < K) && C_col < N)
Bs[row][col] = B[((kblock+row)*LDB)+ C_col];
else
Bs[row][col] = 0;
}
}
// Calculate Cs <- Sum(As X Bs) across all kblocks
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
for (int e = 0; e < BLOCK_SIZE; ++e)
Cs[row][col] += As[row][e] * Bs[e][col];
}
}
} /* End for kblock .. */
// Scale and update the actual C from Cs
// - - - - - - - - - - - - - - - - - - - -
// REPLACE NEXT THREE LINES WITH A BARRIER
#pragma omp parallel for collapse(2)
for (int row=0 ; row < BLOCK_SIZE ; row++) {
for (int col=0 ; col < BLOCK_SIZE ; col++) {
// END BARRIER
// - - - - - - - - - - - - - - - - - - - -
C_row = C_row_start + row;
C_col = C_col_start + col;
if ((C_row < M) && (C_col < N)) {
C[(C_row*LDC)+C_col] = alpha*Cs[row][col] + beta*C[(C_row*LDC)+C_col];
}
}
}
// -------------------------------------------------------------------
// This is the end of the kernel
}
}
}