Commits · f55609b4a19ed7166d9b4dbbee4acd48af3697ac · wenyuanbo / tic

29 Jul, 2019 1 commit

[VTA] Refactor to increase platform coverage (Ultra96 etc.) (#3496) · f55609b4

* hardware refactor for increased FPGA coverage, small optimizations

* fix header

* cleaning up parameters that won't be needed for now

* streamlining makefile, and simplifying tcl scripts

* moving parameter derivation into pkg_config.py, keeping tcl scripts lightweight

* refactoring tcl script to avoid global variables

* deriving AXI signals in pkg_config.py

* unifying address map definition for hardware and software drivers

* single channel design for ultra96 to simplify build

* enable alu by default, no mul opcode for now

* hardware fix

* new bitstream; vta version

* avoid error when env variable is not set

* ultra96 cleanup

* further cleaning up tcl script for bitstream generation

* preliminary rpc server support on ultra96

* rpc server tracker scripts

* ultra96 ldflag

* ultra96 support

* ultra96 support

* cleanup line

* cmake support for ultra96

* simplify memory instantiation

* cleaning up IP parameter initialization

* fix queue instantiation

* 2019.1 transition

* fix macro def

* removing bus width from config

* cleanup

* fix

* turning off testing for now

* cleanup ultra96 ps insantiation

* minor refactor

* adding comments

* upgrading to tophub v0.6

* model used in TVM target now refers to a specific version of VTA for better autoTVM scheduling

* revert change due to bug

* rename driver files to be for zynq-type devices

* streamlining address mapping

* unifying register map offset values between driver and hardware generator

* rely on cma library for cache flush/invalidation

* coherence management

* not make buffer packing depend on data types that can be wider than 64bits

* refactor config derivation to minimize free parameters

* fix environment/pkg config interaction

* adding cfg dump property to pkgconfig:

* fix rpc reconfig

* fix spacing

* cleanup

* fix spacing

* long line fix

* fix spacing and lint

* fix line length

* cmake fix

* environment fix

* renaming after pynq since the driver stack relies on the pynq library - see pynq.io

* update doc

* adding parameterization to  name

* space

* removing reg width

* vta RPC

* update doc on how to edit vta_config.json

* fix path

* fix path

committed Jul 28, 2019

f55609b4 Browse Files

28 Jul, 2019 3 commits
- fix comment/doc in TensorLoad (#3646) · bca8ac17
  Luis Vega committed Jul 28, 2019
  
  bca8ac17 Browse Files
- Hotfix for issue #3641. (#3644) · 026162ad
  Balint Cristian committed Jul 28, 2019
  
  026162ad Browse Files
- fix case when offset is odd and size is even (#3643) · 9a542e37
  Luis Vega committed Jul 28, 2019
  
  9a542e37 Browse Files
27 Jul, 2019 3 commits
- [VTA] [Chisel] fix tensor issue/commit in gemm (#3637) · da40645f
```
* fix tensor issue/commit in gemm

* remove trailing spaces
```
  Luis Vega committed Jul 27, 2019
  da40645f Browse Files
- [Relay][TF] add BatchMatMul (#3634) · 786c49f3
  Yong Wu committed Jul 27, 2019
  
  786c49f3 Browse Files
- Improve the x86 auto-tune tutorial (#3609) · 18d0ad31
  peterjc123 committed Jul 27, 2019
  
  18d0ad31 Browse Files
26 Jul, 2019 6 commits

Update tensorflow.py (#3632) · fbe42c26
YPBlib committed Jul 26, 2019

fbe42c26 Browse Files
Make Google Test usage configurable in CMake files (#3628) · f5464ce2
```
* Add USE_GTEST as a CMake variable

* Add GTest section in installation docs

* Incorporate feedback
```
Logan Weber committed Jul 26, 2019
f5464ce2 Browse Files
[TensorFlow] Fix a bug output index is ignored (#3631) · c1376a40
```
Enhance test to cover this case
```
lixiaoquan committed Jul 26, 2019
c1376a40 Browse Files
[TOPI][CUDA] Schedule for pool_grad (#3622) · f1ede9a9
```
* [TOPI][CUDA] Schedule for pool_grad

* Relay test

* Fix fused op

* doc

* Remove set scope local
```
Wuwei Lin committed Jul 26, 2019
f1ede9a9 Browse Files
[Relay] [Training] Add numerical gradient check. (#3630) · 8e0aaa29
```
* add check_grad

* finish

* what does the fox say?

* lint lint lint lint lint lint lint lint lint
```
雾雨魔理沙 committed Jul 26, 2019
8e0aaa29 Browse Files

[VTA] [Chisel] support for different inp/wgt bits, rewrote DotProduct for clarity (#3605) · 87e18a44

* support for different inp/wgt bits, rewrote dot for clarity

* [VTA] [Chisel] support for different inp/wgt bits, rewrote DotProduct for clarity

* [VTA] [Chisel] support for different inp/wgt bits, rewrote DotProduct for clarity

* change back to sim

* fix index

* fix index

* fix indent

* fix indent

* fix indent

* fix trailing spaces

* fix trailing spaces

* change to more descriptive name

* matric->matrix

* fix spacing

* fix spacing & added generic name for dot

* better parameter flow

* spacing

* spacing

* spacing

* update requirement (tested) for dot, spacing

* function call convention

* small edit

committed Jul 25, 2019

87e18a44 Browse Files

25 Jul, 2019 7 commits

[IR] Make iterators compatible with constructors of STL containers (#3624) · 0858c5ad
Lianmin Zheng committed Jul 25, 2019

0858c5ad Browse Files
Add Winograd matrices computation. (#3553) · 97e333ca
Balint Cristian committed Jul 26, 2019

97e333ca Browse Files

Implementation of uTVM (#3227) · ef909df1

* uTVM interfaces (#14)

* some minor interface changes

* implemented HostLowLevelDevice

* added MicroDeviceAPI

* implemented micro_common and added Python interfaces

* current status, semi implemented micro session

* added micro_common implementation and python interfaces (#18)

* added micro_common implementation and python interfaces (#18)

* current status, semi implemented

* host test working

* updated interfaces for MicroSession arguments allocation

* make somewhat lint compatible

* fix based on comments

* added rounding macro

* fix minor bug

* improvements based on comments

* Clean up `binutil.py` and make Python-3-compatible

* Change argument allocation design

* Address feedback and lint errors

* Improve binutil tests

* Simplify allocator (per @tqchen's suggestions)

* Doc/style fixes

* farts

* mcgee

* rodata section werks

(and so does `test_runtime_micro_workspace.py`)

* simple graph runtime werk

* TEMP

* ResNet works, yo

* First round of cleanup

* More cleanup

* runs a dyson over the code

* Another pass

* Fix `make lint` issues

* ready to pr... probably

* final

* Undo change

* Fix rebase resolution

* Minor fixes

* Undo changes to C codegen tests

* Add `obj_path` in `create_micro_lib`

* TEMP

* Address feedback

* Add missing TODO

* Partially address feedback

* Fix headers

* Switch to enum class for `SectionKind`

* Add missing ASF header

* Fix lint

* Fix lint again

* Fix lint

* Kill lint warnings

* Address feedback

* Change Python interface to MicroTVM

All interaction with the device is now through `Session` objects, which
are used through Python's `with` blocks.

* Reorder LowLevelDevice interface

* Store shared ptr to session in all alloced objects

* Move helper functions out of `tvm.micro`

* Switch static char arr to vector

* Improve general infra and code quality

Does not yet address all of tqchen's feedback

* Forgot a rename

* Fix lint

* Add ASF header

* Fix lint

* Partially address MarisaKirisame's feedback

* Lint

* Expose `MicroSession` as a node to Python

* Revert to using `Session` constructor

* Fix compiler error

* (Maybe) fix CI error

* Debugging

* Remove

* Quell lint

* Switch to stack-based session contexts

* Make uTVM less intrusive to host codegen

And use SSA for operands of generated ternary operators

* Inline UTVMArgs into UTVMTask struct

* Remove `HostLowLevelDevice` header

* Remove `BaseAddr` class

* Address feedback

* Add "utvm" prefix to global vars in runtime

* Fix lint

* Fix CI

* Fix `test_binutil.py`

* Fix submodules

* Remove ResNet tests

* Make `test_binutil.py` work with nose

* Fix CI

* I swear this actually fixes the binutil tests

* lint

* lint

* Add fcompile-compatible cross-compile func

* Add docs for uTVM runtime files

* Move pointer patching into `MicroSession`

* Fix lint

* First attempt at unifying cross-compile APIs

* Fix lint

* Rename `cross_compile` back to `cc`

* Address feedback

* Remove commented code

* Lint

* Figure out failing function

* Remove debugging code

* Change "micro_dev" target to "micro"

* Add checks in tests for whether uTVM is enabled

* Add TODO for 32-bit support

* Rename more "micro_dev" to "micro"

* Undo rename

We already have `tvm.micro` as a namespace.  Can't have it as a method
as well.

* Fix failing CI

Thanks to @tqchen for finding this bug.  Emitting ternary operators for
`min` and `max` causes concurrency bugs in CUDA, so we're moving the
ternary op emissions from `CodeGenC` to `CodeGenCHost`.

* Address feedback

* Fix lint

committed Jul 25, 2019

ef909df1 Browse Files

Add a missing header in cuda_device_api.cc (#3621) · 443d023b
Philip Hyunsu Cho committed Jul 24, 2019

443d023b Browse Files
[Relay][Keras] Permute, Softmax support (#3618) · dedcf82f
Yong Wu committed Jul 24, 2019

dedcf82f Browse Files
fix typo (#3611) · e7fb2d4d
Jian Weng committed Jul 24, 2019

e7fb2d4d Browse Files
[TOPI] Average Pool2D Bug. (#3607) · 3aa2eaed
```
* [TOPI] Average Pool2D Bug.

Issue - https://github.com/dmlc/tvm/issues/3581

* Add uint16 test.
```
Animesh Jain committed Jul 24, 2019
3aa2eaed Browse Files

24 Jul, 2019 6 commits
- Remove prints in `generic_op_impl.py` (#3616) · e0df6e12
  Logan Weber committed Jul 24, 2019
  
  e0df6e12 Browse Files
- Hotfix pylint (#3615) · 023fc6b4
  Tianqi Chen committed Jul 24, 2019
  
  023fc6b4 Browse Files
- [TEST] Fix testcase to make them more compatible to zero-rank (#3612) · 90eee087
  Tianqi Chen committed Jul 24, 2019
  
  90eee087 Browse Files
- init (#3571) · 814554e0
```
quickfix
```
  雾雨魔理沙 committed Jul 24, 2019
  814554e0 Browse Files
- [TOPI][Relay] max_pool2d & avg_pool2d gradient (#3601) · 5c410037
  Wuwei Lin committed Jul 24, 2019
  
  5c410037 Browse Files
- [Relay][vm] Small bug fix for DataTypeObject (#3604) · 440df0aa
```
* small bug fix for DataTypeObject

* retrigger ci
```
  Zhi committed Jul 24, 2019
  440df0aa Browse Files
23 Jul, 2019 7 commits

We observe multiple groups across a range of domains (ASR, NMT, LM, etc), (#3566) · d6dcd6c5

internally and externally, interested in replacing standard dense layers with
block-sparse matrix multiplication layers. The motivations are generally: higher
performance (due to reduction in FLOPs, memory bandwidth/cache footprint),
enabling larger models (e.g. fitting more layers in a given memory budget).

Some public work along these lines:

* https://openai.com/blog/block-sparse-gpu-kernels/
* https://openai.com/blog/sparse-transformer/
* https://arxiv.org/abs/1802.08435
* https://arxiv.org/abs/1711.02782

Various groups have been able to successfully train models with reasonable
levels of sparsity (90%+) with marginal accuracy changes, which suggests
substantial speedups are possible (as this implies a >10x reduction in FLOPs).

It is fairly straightforward to realize these theoretical speedups, see e.g. TVM
benchmarks for Intel CPUs in
https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902, and CUDA
results in https://github.com/openai/blocksparse, etc.

* https://github.com/openai/blocksparse (CUDA)
* https://software.intel.com/en-us/mkl-developer-reference-c-mkl-bsrmm (MKL BSRM)
* https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.bsr_matrix.html (SCIPY BSR representation)

This is extracted from an internal patch we've been using internally. There are
various extensions possible (int8/fp16/bf16, CUDA/other GPU architectures), but
this is a reasonable starting point. This needs more thorough unit test coverage
however.

We follow the conventions established by scipy.sparse.bsr_matrix and other
libraries, see the unit tests for details.

For folks interested in experimenting with scheduling/AutoTVM etc,
https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902 is a useful
starting point.

committed Jul 23, 2019

d6dcd6c5 Browse Files

{relay,topi}.reinterpret support (#3599) · 2ed31b24

= Motivation

It's useful to expose the tvm::reinterpret functionality to Relay/TOPI users, as
this allows them to build (fused) operators leveraging the bitwise
reinterpretation of an operator. An example is approximate transcendental
functions, which can be implemented similar to:

```.py
    def C(x):
        return relay.expr.const(x, "float32")

    def approx_exp(x):
        x = relay.minimum(relay.maximum(x, C(-88.0)), C(88.0))
        x = C(127.0) + x * C(1.44269504)
        xf = relay.floor(x)
        i = relay.cast(xf, "int32")
        x = x - xf
        Y = C(0.99992522) + x * (C(0.69583354) + x * (C(0.22606716) + x * C(0.078024523)))
        exponent = relay.left_shift(i, relay.expr.const(23, "int32"))
        exponent = relay.reinterpret(exponent, "float32")
        return exponent * Y

    def approx_sigmoid(x):
        # <2.0e-5 absolute error over [-5, 5]
        y = approx_exp(x)
        return y / (y + C(1.0))

    def approx_tanh(x):
        # <4.0e-5 absolute error over [-5, 5]
        x = x * C(2.0)
        y = approx_exp(x)
        return (y - C(1.0)) / (y + C(1.0))
```

See unit tests for implementations of these approximate transendentals.

committed Jul 23, 2019

2ed31b24 Browse Files

remove tabs (#3603) · 66f3bf83
Luis Vega committed Jul 23, 2019

66f3bf83 Browse Files

[Relay][Pass][Docs] Update the doc for adding a Relay pass to mention the pass infra (#3583) · 9911044b

* Update the Relay adding pass doc to reference the new pass infrastructure

* Correct pass name

Co-Authored-By: Zhi <5145158+zhiics@users.noreply.github.com>

* Align header equals signs

committed Jul 23, 2019

9911044b Browse Files

Checking the correct dtypes for choosing the Intel int8 instructions. (#3516) · 3ada7c0e
Animesh Jain committed Jul 23, 2019

3ada7c0e Browse Files
[Relay] [Training] Allow gradient to return a tuple (#3600) · 9e6a8c0d
雾雨魔理沙 committed Jul 23, 2019

9e6a8c0d Browse Files

[Runtime] [ThreadPool] Make SpscTaskQueue::Pop(..) spin_count configurable (#3577) · 9b1c2e08

In cases where we have multiple models or threadpools active, spinning around
`sched_yield()` may not be desirable, as it prevents the OS from effectively
scheduling other threads.

Thus, allow users to conditionally disable this behaviour (via an environment
variable `TVM_THREAD_POOL_SPIN_COUNT`, similar to existing environment flags for
the thread pool such as `TVM_BIND_THREADS`, etc).

This substantially improves tail latencies in some of our multi-tenant
workloads in practice.

Unit tests have been added - on my laptop, running:

```
TVM_THREAD_POOL_SPIN_COUNT=0 ./build/threading_backend_test;
TVM_THREAD_POOL_SPIN_COUNT=1 ./build/threading_backend_test;
./build/threading_backend_test;
```

gives https://gist.github.com/ajtulloch/1805ca6cbaa27f5d442d23f9d0021ce6 (i.e.
97ms -> <1ms after this change)

committed Jul 22, 2019

9b1c2e08 Browse Files

22 Jul, 2019 3 commits

Add support for Tflite operator SPLIT (#3520) · 19eb829e

* [RFC] Initial support for Tflite operator SPLIT

This patch adds initial support for the tflite operator split. However
I am not yet sure how to handle the axis parameter for the split
operator and support it in the test infrastructure. Putting this up for
an initial review and comment.

The split operator in tflite according to
https://www.tensorflow.org/lite/guide/ops_compatibility

appears to take num_or_size_split as a 0D tensor.

I also note that tflite.split is one of the few operators that returns
multiple outputs and thus the helper routines in the tests needed some
massaging to make this work.

@apivarov , could you please review this ?

Thanks,
Ramana

* Fix the axis parameter

Add more tests

* Address review comments

* Try out frozen_gene's suggestion

* Handle split of 1 element

* int32 is only supported in tflite 1.14, let's check that version here.

* Keep this at python3.5

* Add packaging as a python package to be installed

committed Jul 22, 2019

19eb829e Browse Files

Update Jenkinsfile · 443b5b46
Tianqi Chen committed Jul 22, 2019

443b5b46 Browse Files

[VTA] Runtime refactor to allow for non-shared memory FPGAs (e.g. F1) (#3554) · 9d64d321

* updated runtime to support non-shared memory FPGAs for instruction and micro-op kernels

* adding driver-defined memcpy function to handle F1 cases

* refactor to include flush/invalidate in memcpy driver function

* update tsim driver

* bug fixes

* cleanup

* pre-allocate fpga readable buffers to improve perf

* fix

* remove instruction stream address rewrite pass for micro op kernels

* fix:

* white spaces

* fix lint

* avoid signed/unsigned compilation warning

* avoid signed/unsigned compilation warning

* fix

* fix

* addressing comments

* whitespace

* moving flush/invalidate out of memmove

* clearnup

* fix

* cosmetic

* rename API

* comment fix

committed Jul 22, 2019

9d64d321 Browse Files

21 Jul, 2019 2 commits
- [CI] Upgrade LLVM envs (#3590) · 4d314833
  Tianqi Chen committed Jul 21, 2019
  
  4d314833 Browse Files
- add coherent, length, and user bits option to Shell Config (#3593) · 5b5ae980
  Luis Vega committed Jul 21, 2019
  
  5b5ae980 Browse Files
20 Jul, 2019 1 commit
- bugfix function args order in alu instruction generation (#3592) · 3116eeec
  Luis Vega committed Jul 19, 2019
  
  3116eeec Browse Files
19 Jul, 2019 1 commit
- [Relay] add some check for the ad algorithm (#3585) · 1a00cab9
```
* do

* fix test
```
  雾雨魔理沙 committed Jul 20, 2019
  1a00cab9 Browse Files