* Fix high-low bit bug in __pack_half2 * Fix vector load * Add unit8 support for PrintVecElemLoadExpr and BroadcastNode