* add half and mixed precision support to cuBLAS backend
* add TensorCore support in cuDNN
* enhance cuDNN support
* address comments and fix lint
* fix
* add fp16 test