-
[CodeGen][CUDA] Improve CUDA vectorizer (#4736) · 2630ffcb
- Fixes issues to enable fp16 vectorizer. Now correct packing and unpacking CUDA code will be emitted. Enabled more unit tests. - Do not emit code to read the first lane from an undef variable int _3; _3 = _3 & ~(0x000000ff << 0) | ... and emit the following code instead: _3 = (((0x000000ff & (_1 >> 0))+(0x000000ff & (_2 >> 0))) << 0); Note that nvcc 10.2 is forgiving and emits the same code for both cases. A warning appears in test_codegen_cuda.py. Signed-off-by: Wei Pan <weip@nvidia.com>
wpan11nv committed
Name |
Last commit
|
Last update |
---|---|---|
.. | ||
cuda_half_t.h | Loading commit data... |