test_topi_tensor.py
[TOPI][CUDA] Enable vectorization on fp16 type (#4867) · 7013fc9a
- This allows better utilization of the memory bandwidth.
- Note that not all cases are vectorized for the fp16 datatype. For instance, when the size is not a multiple of 1024, the inner loop may be an expression that cannot be vectorized. In that case, a small inner loop is still beneficial for latency hiding.

Signed-off-by: Wei Pan <weip@nvidia.com>
wpan11nv committed
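The divisibility condition described in the commit message can be illustrated with a small standalone sketch. This is not TVM's actual implementation; `choose_vector_width` is a hypothetical helper showing the core idea: two fp16 lanes pack into a single 32-bit (`half2`) load, so the vector width must evenly divide the loop extent or the code falls back to scalar accesses.

```python
def choose_vector_width(extent, dtype, max_bits=32):
    """Illustrative helper (not TVM's real scheduling code): pick the
    largest vector width whose total bit-width fits in max_bits and
    which evenly divides the loop extent."""
    bits = {"float16": 16, "float32": 32}[dtype]
    width = max_bits // bits  # fp16 -> 2 lanes per 32-bit access
    # Halve the width until it divides the extent evenly; width 1 is scalar.
    while width > 1 and extent % width != 0:
        width //= 2
    return width

# An fp16 extent that is a multiple of 2 vectorizes into half2 loads.
print(choose_vector_width(1024, "float16"))  # -> 2
# An odd extent cannot be split evenly, so scalar code is emitted.
print(choose_vector_width(1023, "float16"))  # -> 1
```

This mirrors the note above: when the extent does not divide cleanly, vectorization is skipped, but keeping the inner loop small still helps hide memory latency.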