@@ -26,6 +26,899 @@ Refer to the Roadmap issue for complete list on on-going version features.
If you check in something that is not reflected in Roadmap issue, please reply
to that issue so it can get added.
## 0.6
### Relay in Production
Relay is a functional, differentiable programming language designed to be an expressive intermediate representation for machine learning systems. Relay supports algebraic data types, closures, control flow, and recursion, allowing it to directly represent more complex models than computation graph-based IRs (e.g., NNVM) can. In TVM v0.6, Relay has reached a stable phase and is ready for production.
* Algebraic Data Types (ADT) support (#2442, #2575). ADT provides an expressive, efficient, and safe way to realize recursive computation (e.g., RNN). Refer to https://docs.tvm.ai/langref/relay_adt.html for more information.
* Pass manager for Relay (#2546, #3226, #3234, #3191)
* Most frameworks are now supported in Relay, including ONNX, Keras, Tensorflow, Caffe2, CoreML, NNVMv1, MXNet (#2246).
* Explicitly manifest memory and tensor allocations in Relay. (#3560)
### Relay Virtual Machine
The Relay Virtual Machine (Relay VM) is a new generation of runtime that strikes a balance between performance and flexibility when deploying and executing Relay programs. Previously, the graph runtime was able to exploit the fully static nature of the input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. When we introduce models that make use of control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.
Relay VM is now usable and achieves decent performance for a variety of models and targets.
* Design (#2810, #2915) and a first version of implementation (#2889)
* Add VM runtime for Relay and compiler support (#3120, #3121, #2889, #3139)
* Relay VM (pattern matching #3470, port to python #3391, serialization #3647)
* Relay VM Profiler (#3727)
* Support execution on devices for Relay VM (#3678)
* [Relay][VM] Add more passes to VMCompiler (#4058)
* [relay][vm] Separate VM runtime with executable (#4100)
* Port VM, VM compiler, and Object into Python (#3391)
* VM: Add AllocTensor instruction and better instruction printer (#3306)
* [Relay][VM][Interpreter] Enable first-class constructors in VM and interpreter via eta expansion. (#4218)
* [Relay][VM] Clean up the VM and VM profiler code (#4391)
### Training
Relay is designed to natively support first-order and higher-order differentiation. The automatic differentiation infrastructure is now usable, and a number of operators with gradient support are available in the v0.6 release.
* Higher order reverse mode automatic differentiation that works with control flow (#2496)
* Higher order continuation passing style (#3456, #3485)
* Relay gradient registration (clip #3509, `max_pool2d` and `avg_pool2d` #3601)
* Relay AD algorithm (#3585)
* Relay Training - allow gradient to return a tuple (#3600), numerical gradient check (#3630)
* Improve AD for concatenate (#3729)
* [Relay][Training] Add missing gradient check to gradient pass (#4169)
* As a part of Relay's automatic differentiation system, we are adding primal gradients for Relay operators. Please refer to #2562 to track the progress.
* [Relay][Training] Add gradient for Crossentropy (#3925)
* [Relay][Training] Add and fix gradients (#4126)
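
The numerical gradient check mentioned above (#3630) compares an analytic gradient against a finite-difference estimate. The following is a minimal sketch of that idea in plain Python, not TVM's implementation; the function names, the test function `f`, and the tolerances are all assumptions chosen for illustration.

```python
from typing import Callable, List


def numerical_gradient(f: Callable[[List[float]], float],
                       x: List[float],
                       eps: float = 1e-6) -> List[float]:
    """Estimate each partial derivative of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def check_gradient(f: Callable[[List[float]], float],
                   analytic_grad: Callable[[List[float]], List[float]],
                   x: List[float],
                   rtol: float = 1e-4,
                   atol: float = 1e-6) -> bool:
    """Return True when the analytic gradient matches the numerical estimate."""
    expected = numerical_gradient(f, x)
    actual = analytic_grad(x)
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


# Hypothetical function under test: f(x, y) = x * y + x^2
def f(v: List[float]) -> float:
    x, y = v
    return x * y + x * x


def f_grad(v: List[float]) -> List[float]:
    x, y = v
    return [y + 2.0 * x, x]


def wrong_grad(v: List[float]) -> List[float]:
    x, y = v
    return [y + x, x]  # deliberately drops a factor of 2


ok = check_gradient(f, f_grad, [1.5, -2.0])
bad = check_gradient(f, wrong_grad, [1.5, -2.0])
print(ok, bad)
```

A check of this kind catches a mis-registered operator gradient without needing a reference implementation, at the cost of two function evaluations per input element.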
### Quantization
Low-bit inference is increasingly popular because it improves both performance and storage usage. TVM now supports two types of quantization. 1. Automatic quantization takes a floating-point model, performs per-layer calibration, and generates a low-bit model. 2. TVM also imports pre-quantized models from Tensorflow and MXNet; a new dialect, QNN, is introduced to handle further lowering to normal operators.
* Automatic Quantization
  - Low-bit automatic quantization supported. (#2116). The workflow includes annotation, calibration and transformation.
  - Refactor quantization codebase and fix model accuracy. (#3543)
  - Added tflite frontend support for quantized mean. (#4339)
  - [Relay][Legalize] Legalize `conv2d_transpose` for NHWC (#4399)
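
To make the calibration step above concrete, here is a minimal sketch of symmetric per-layer int8 quantization in plain Python. It is an illustration under stated assumptions, not TVM's algorithm: the max-abs calibration rule, the function names, and the sample values are all hypothetical.

```python
from typing import List, Sequence


def calibrate_scale(samples: Sequence[float], num_bits: int = 8) -> float:
    """Pick one scale per layer so the largest observed magnitude maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in samples)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: Sequence[float], scale: float, num_bits: int = 8) -> List[int]:
    """Map floats to signed integers, clamping to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    out = []
    for v in values:
        q = int(round(v / scale))  # Python's round() rounds halves to even
        out.append(max(-qmax, min(qmax, q)))
    return out


def dequantize(qvalues: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [q * scale for q in qvalues]


# Hypothetical activations observed for one layer during calibration.
samples = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale = calibrate_scale(samples)
q = quantize(samples, scale)
restored = dequantize(q, scale)
print(scale, q, restored)
```

With this scheme the reconstruction error of any in-range value is at most about half of `scale`, which is why the choice of calibration data directly affects model accuracy.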
### Accelerator and Microcontroller Support
TSIM is introduced to improve software and hardware integration and simulation accuracy. It integrates the hardware development process into the software stack. TSIM enables VTA to provide more accurate performance feedback (i.e., clock cycles) than the traditional functional model of a hardware accelerator. Moreover, a Chisel implementation of VTA is available and runs on top of TSIM.
There has been a proliferation of resource-constrained and embedded devices that do not have operating systems or a mature software stack. MicroTVM is intended to support TVM on such bare-metal devices.
* [TSIM] Enabling Cycle-Accurate Hardware Simulation for VTA (#3010, #3206, #3242)
* Chisel implementation for VTA that runs on top of TSIM (#3258, #3347)
* ChangeBatch pass for batched VTA compilation (#3656, #3660)
* VTA fast simulator statistics (#3481)
* TSIM improvements and fixes (#3505)
* Chisel VTA enhancements and fixes (32bit support #3558, alu instruction generation #3592, coherence support #3593, separate types #3605, tensor issue/commit #3637, uop load request #3643, uop dma requests #3654)
* VTA Runtime refactor for non-shared memory FPGAs (#3590)
* VTA HLS codebase refactor for Ultra96 (#3496)
* VTA support for batched inference (#3661)
* VTA bitstream compilation for Intel FPGA (#3494)
* TSIM: Introduce Virtual Memory for TSIM Driver (#3686)
* Parallel TSIM hardware compilation with macOS and debug support (#3797)
* Chisel: scale dram base address in hardware instead of runtime (#3772)
* Chisel: run all unittests by default (#3766)
* Chisel: improved Data Gen, Added ALU Test (#3743)
* Chisel dependencies for TSIM CI (#3721)
* Chisel: Added Module Unit Test Infrastructure (#3698)
* Add ISA BitPat generation (#3891)
* de10-nano driver (#3394)
* Extending Vision model coverage compilation for VTA (#3740)
* Conv2d transpose (deconvolution) operator support (#3777)
* Support TLPP in function simulator. (#3555)
* [VTA][Chisel] TSIM VTA Source Refactor (#4163)
* [VTA][TSIM] Serial GEMM Application Added (#4082)
### Rust Support
Rust language support in TVM includes two parts. 1. The frontend wraps the current C API and exposes a Rust programming model. 2. The backend serves as an alternative to the C++ runtime. It provides a standalone WASM module and security support, e.g., SGX.
* Rust frontend (#2292).
* Unify types between bindings and pure Rust impl (#2616)
* Rust: load syslib modules at compile time (#3274)
* Rustify PackedFunc & Friends (#2969)
* Rust DSO module (#2976)
### Operator Support
* A special operator `annotation.stop_fusion` to prevent it being fused with previous expressions (#2624).
* `batch_matmul` supported (#2561).
* `reverse_reshape` supported (#2503).
* Faster-RCNN proposal operator for CUDA (#2420).
* Vision operator for YOLO `yolo_reorg` (#1941).
* `slice` operator for MXNet (#2662).
* `arange` supported (#2621).
* Vision operator `roi_align` (#2618).
* `where` operator for MXNet (#2647).
* Deformable conv2d (#2908)
* Faster-RCNN Proposal OP (#2725)
* ROI Pool operator (#2811)
* Gluoncv SSD support on CPU (#2353)
* shape, reverse, and sign op (#2749, #2800, #2775)
* tile and repeat op (#2720)
* logical operators (#2743, #2453)
* stack op (#2729)
* NCHWc upsampling (#2806)
* clip and wrap mode support in take (#2858)
* AlterLayout support for `intel_graphics` conv2d, depthwise conv2d (#2729, #2806)
* Add foldr1 operator (#2928)
* Add rsqrt operator (#2949)
* Add clip and wrap mode support in take (#2858)
* `Gather_nd` exposed to relay (#2945)
* `bitserial_conv2d` move to autotvm template and updates (#2819)
* Port x86 NCHWc to AutoTVM for Task Extraction (#2664)
* Implement relay `nn.bias_add` compute in C++ (#3027)
* Rename output tensors for better readability (#3006)
* int8 dense on CUDA & Dense op quantization (#2877)
* Bitserial dense operators for CPU (#3051)
* Enhance upsample operator to adapt onnx opset v9 (#2968)
* Add adaptive pooling operator (#3085)
* Add all operator (#3124)
* Add cblas `batch_matmul` (#3210)
* Add packing for int8 1x1 convolution and support the int8 group convolution on X86 (#2991)
* Add op size (#3094)
* x86 TOPI (`roi_align` #3475, `conv2d_transpose` #3491)
* Intel INT8 (dilation in conv2d #3510, type checking #3516)
* Reinterpretation of tensor elements (#3599)
* Sparse-Dense for block-sparse multiplication (#3566)
* Winograd matrix computation (#3553)
* CUDA schedule for `pool_grad` (#3622), `group_conv2d` (#3663)
* Bitserial operations conv2d, dense and bitpack (#3844)
* Improve numeric gradient check (#3856)
* Resize rework (#3788)
* Improve `conv2d_transpose` CUDA schedule template (#3796)
* SpaceToDepth and MirrorPad Operators (#3718)
* Add variance and layer norm op (#3700)
* Add `sparse_transpose` for Square CSR matrices (#3707)
* Define more standard global functions in the prelude of relay program, including foldr1, hd, tl, nth, list update (#2928, #2917, #2771, #2866)
* Add SkipVectorize pass (#3222, #3228)
* [Relay][Pass] Add pass to remove unused functions in relay module (#4334)
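
The clip and wrap modes added to `take` (#2858) control what happens to out-of-range indices. The sketch below shows the two behaviours on a one-dimensional list in plain Python. It assumes the modes follow the same conventions as NumPy's `take`; the helper itself is hypothetical and is not the TVM operator.

```python
from typing import List, Sequence


def take(data: Sequence[float], indices: Sequence[int], mode: str = "clip") -> List[float]:
    """Gather elements of data at indices, handling out-of-range indices by mode."""
    n = len(data)
    if n == 0:
        raise ValueError("cannot take from an empty sequence")
    out = []
    for i in indices:
        if mode == "clip":
            # Clamp the index into [0, n - 1].
            j = max(0, min(n - 1, i))
        elif mode == "wrap":
            # Wrap around; Python's % yields a non-negative result for positive n.
            j = i % n
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(data[j])
    return out


data = [10, 20, 30, 40]
clipped = take(data, [-1, 2, 7], mode="clip")
wrapped = take(data, [-1, 2, 7], mode="wrap")
print(clipped)  # [10, 30, 40]
print(wrapped)  # [40, 30, 40]
```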
### Symbolic shape enhancement
* Add shape function for symbolic shape. It enables certain cases for broadcast with symbolic shapes. (#3606)
* [tvm][any] broadcast with values other than one (#3967)
* Symbolic shape support (broadcast op #3389)
* Support reshape for dynamic shape in tf converter (#4185)
* Runtime Shape Functions (#4179)
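
A shape function computes an operator's output shape from its input shapes. The sketch below shows what that might look like for broadcasting when some dimensions are unknown until run time. It is an assumption-laden illustration in plain Python, not TVM's shape function: `None` stands in for a symbolic dimension, and `broadcast_shape` is a hypothetical name.

```python
from itertools import zip_longest
from typing import Optional, Sequence, Tuple

Dim = Optional[int]  # None stands in for a dimension unknown until run time


def broadcast_shape(lhs: Sequence[Dim], rhs: Sequence[Dim]) -> Tuple[Dim, ...]:
    """Compute the broadcast result shape, keeping unknown dimensions symbolic."""
    out = []
    # Align shapes from the trailing dimension; missing leading dims act as 1.
    for a, b in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
        if a == 1:
            out.append(b)
        elif b == 1:
            out.append(a)
        elif a is None or b is None:
            # One side is unknown. If the other side is a known size greater
            # than one, the result must have that size; otherwise it stays unknown.
            out.append(a if b is None else b)
        elif a == b:
            out.append(a)
        else:
            raise ValueError(f"cannot broadcast dimensions {a} and {b}")
    return tuple(reversed(out))


print(broadcast_shape((None, 3), (1, 3)))  # (None, 3)
print(broadcast_shape((4, 1), (3,)))       # (4, 3)
print(broadcast_shape((None, 3), (5, 1)))  # (5, 3)
```

Once concrete shapes are available at run time, the same function applied to fully known shapes yields the concrete output shape, which is the role a runtime shape function plays for memory allocation.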
### Language and Architecture
* An optimization pass to eliminate expressions which have the same functionality and same inputs (#2639).
* Refactor text printer to add stream-like API and FunctionType support (#2605, #2882)
* Build a scaffold for structured error handling (#2838). The new mechanism detects and rewrites error messages so that C++ and Python stack traces are unified and not redundant. Guidelines and conventions for error handling are also discussed.
* Higher order reverse mode automatic differentiation that works with control flow (#2496)
* Integer arithmetic analyzers, including modular set analysis, const integer bound analysis and rewrite simplifier (#2904, #2851, #2768, #2722, #2668, #2860)
* Improve operator fusion for TupleGetItem in relay (#2914, #2929)
* Compute FLOP of autotvm template for int8 models (#2776)
* Common subexpression elimination pass in Relay (#2639)
* Improve quantization in Relay (#2723)
* Refactor `build_func` in measure module of autotvm to better support cross compiler (#2927)
* Fix typing.Deque import error for Python 3.5 (#4254)
* [VTA] Hotfix for padded load test in Chisel VTA (#4264)
* [Contrib] Fix error message at `callback_get_section_size()` (#4221)
* [TOPI] Fix bug in Winograd on CUDA (#4260)
* AutoTVM: Fix hang/crash issues on feature extraction (#3689)
* [TOPI][CUDA] Fix Winograd Kernel Size Support (#4276)
* [Relay][Frontend][Tensorflow] Fix type assignment for 'tf.range' operator (#4294)
* Fix incorrect call to Unicode Win32 InetPton (#4306)
* [Relay][Frontend][Keras] handle `batch_norm` op params well (#4310)
* [VTA] fix error when `memory_id` is `VTA_MEM_ID_OUT` (#4330)
* [Doc][fix] fix sphinx parsing for pass infra tutorial (#4337)
* [Codegen] remove fp16 function override for cuda (#4331)
* [TFLite] Fix Prelu unified shape error (#4326)
* [Relay][Frontend][TF] Fix transpose when axes is not a param (#4327)
* [VTA] Bug fix for padded load with large inputs (#4293)
* Fix inconsistent operator tag name (#4134)
* Fix for a specific case when loop partitioning with indivisible. (#4243)
* Send list as argument to `schedule_conv2d` (#4358)
* [Docker] Fix TVM folder name for installing on Android and OpenCL. (#4363)
* Fix TFLite Reshape assert (#4320)
* [Relay][Frontend][TF] Fix slice when begin or size is not Const (#4372)
* Fix compilation of bfloat16 on Windows (#4415)
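
The common subexpression elimination pass listed earlier in this section (#2639) removes expressions that have the same functionality and the same inputs. The sketch below illustrates the general technique on a toy expression tree in plain Python. The tuple-based representation and the `t0`, `t1` temporary names are assumptions made for the example and do not reflect Relay's IR or the actual pass.

```python
from typing import Dict, List, Tuple, Union

# A variable name, or a call written as (op, arg0, arg1, ...).
Expr = Union[str, tuple]


def eliminate_common_subexpr(expr: Expr) -> Tuple[List[Tuple[str, tuple]], str]:
    """Flatten expr into bindings where each distinct call is computed once."""
    seen: Dict[tuple, str] = {}
    bindings: List[Tuple[str, tuple]] = []

    def visit(e: Expr) -> str:
        if isinstance(e, str):
            return e  # variables are already names
        op, *args = e
        # Two calls are identical when the op and the (renamed) inputs match.
        key = (op, *[visit(a) for a in args])
        if key not in seen:
            name = f"t{len(bindings)}"
            seen[key] = name
            bindings.append((name, key))
        return seen[key]

    result = visit(expr)
    return bindings, result


# (x * y) + (x * y): the multiplication appears twice in the input.
expr = ("add", ("mul", "x", "y"), ("mul", "x", "y"))
bindings, result = eliminate_common_subexpr(expr)
for name, call in bindings:
    print(name, "=", call)
print("result:", result)
```

After the rewrite the multiplication is bound once and reused, so the toy program has two bindings instead of three calls.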
### Known Issues
* The performance of Relay VM is not yet good enough on GPU due to memory allocation overhead, which will be resolved later.
* TFLite and TVM round differently, which causes differences in accuracy and potentially off-by-one errors. See #3900 for reference.
* TFLite pre-quantized network support is still a work in progress and the project would welcome further contributions.
* TSIM build requires the `python` command to exist on the host. See [forum discussion](https://discuss.tvm.ai/t/vta-build-failure/4790) for details.
* Tensorflow control flow is not yet fully supported in the frontend converter.
* `topi.floor_div` is inconsistent with floor division semantics when the result is close to an integer.
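
The note on `topi.floor_div` does not say where the discrepancy comes from. As a generic illustration only (an assumption, not a description of TOPI's implementation), the Python sketch below shows one well-known way two floor-division strategies can disagree when the quotient is close to an integer: rounding the true quotient first can land exactly on the integer, while exact floor division stays just below it.

```python
import math


def floor_div_via_true_div(a: float, b: float) -> float:
    """Divide in floating point first, then take the floor of the rounded quotient."""
    return float(math.floor(a / b))


def floor_div_exact(a: float, b: float) -> float:
    """Python's built-in floor division, computed from the exact remainder."""
    return a // b


a, b = 1.0, 0.1
via_true_div = floor_div_via_true_div(a, b)
exact = floor_div_exact(a, b)
print(via_true_div)  # 10.0, because 1.0 / 0.1 rounds to exactly 10.0
print(exact)         # 9.0, because the stored 0.1 is slightly larger than 1/10
```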
### Deprecations
* Deprecating python2 support in the master branch and following release (v0.6). (#2994, #2986)
* NNVM is deprecated and will be removed in a future version. (#4333, #4368)
## 0.5
This release features several major improvements. Some of the highlights are: Arbitrary bits quantization algorithm; High-level auto-differentiable programming IR -- Relay.
...
@@ -279,3 +1172,5 @@ We also make major improvements in supporting new backends: ROCm for AMDGPUs and