* Fix int8x4 vectorize
* Fix gpu shared/local memory accumulate
* Add test_shared_memory for int8x4
* Adjust test format
* Fix cpplint