Commit 96488c11 authored by Thierry Moreau, committed by Tianqi Chen

[PYTHON, TVM] Python TVM library, unit tests and end to end example

* VTA python library
* Python unit tests
* End to end example with Resnet18
* README instructions
* Bug fixes
parent 56a0dea8
@@ -55,10 +55,10 @@ endif
all: lib/libvta.$(SHARED_LIBRARY_SUFFIX)
VTA_LIB_SRC = $(wildcard src/*.cc src/tvm/*.cc)
ifeq ($(TARGET), PYNQ_TARGET)
ifeq ($(TARGET), VTA_PYNQ_TARGET)
VTA_LIB_SRC += $(wildcard src/pynq/*.cc)
LDFLAGS += -L/usr/lib -lsds_lib
LDFLAGS += -L/opt/python3.6/lib/python3.6/site-packages/pynq/drivers/ -l:libdma.so
LDFLAGS += -L/opt/python3.6/lib/python3.6/site-packages/pynq/lib/ -l:libdma.so
endif
VTA_LIB_OBJ = $(patsubst %.cc, build/%.o, $(VTA_LIB_SRC))
@@ -79,7 +79,7 @@ cpplint:
python nnvm/dmlc-core/scripts/lint.py vta cpp include src hardware tests
pylint:
pylint python/vta --rcfile=$(ROOTDIR)/tests/lint/pylintrc
pylint python/tvm_vta --rcfile=$(ROOTDIR)/tests/lint/pylintrc
doc:
doxygen docs/Doxyfile
......
# PYNQ RPC Server for VTA
This guide describes how to set up a Pynq-based RPC server to accelerate deep learning workloads with VTA.
## Pynq Setup
Follow the getting started tutorial for the [Pynq board](http://pynq.readthedocs.io/en/latest/getting_started.html).
* For this RPC setup make sure to go with the *Connect to a Computer* Ethernet setup.
Make sure that you can ssh into your Pynq board successfully:
```bash
ssh xilinx@192.168.2.99
```
When ssh-ing onto the board, the default password for the `xilinx` account is `xilinx`.
For convenience, let's mount the Pynq board's file system so it's easy to access and maintain:
```bash
sshfs xilinx@192.168.2.99:/home/xilinx <mountpoint>
```
## Pynq TVM & VTA installation
On your **host PC**, go to the `<mountpoint>` directory of your Pynq board file system.
```bash
cd <mountpoint>
```
From there, clone the VTA repository:
```bash
git clone git@github.com:uwsaml/vta.git --recursive
```
Next, clone the TVM repository:
```bash
git clone git@github.com:dmlc/tvm.git --recursive
```
TVM is rapidly changing, and to ensure stability, we keep track of working TVM checkpoints.
As of now, the TVM checkpoint `e4c2af9abdcb3c7aabafba8084414d7739c17c4c` is known to work with VTA.
```bash
cd tvm
git checkout e4c2af9abdcb3c7aabafba8084414d7739c17c4c
```
Now, ssh into your **Pynq board** to build the TVM runtime with the following commands:
```bash
ssh xilinx@192.168.2.99 # ssh if you haven't done so
cd ~/tvm
cp make/config.mk .
echo USE_RPC=1 >> config.mk
make runtime -j2
```
## Pynq RPC server setup
We're now ready to build the Pynq RPC server on the Pynq board.
```bash
ssh xilinx@192.168.2.99 # ssh if you haven't done so
cd ~/vta
export TVM_PATH=/home/xilinx/tvm
make
```
The last stage will build the `/home/xilinx/vta/lib/libvta.so` library file on the Pynq board. We are now ready to launch the RPC server on the Pynq. In order to enable the FPGA drivers, we need to run the RPC server with administrator privileges (using `su`, account: `xilinx`, pwd: `xilinx`).
```bash
ssh xilinx@192.168.2.99 # ssh if you haven't done so
cd ~/vta
su
./apps/pynq_rpc/start_rpc_server.sh
```
You should see the following being displayed when starting the RPC server:
```
INFO:root:Load additional library /home/xilinx/vta/lib/libvta.so
INFO:root:RPCServer: bind to 0.0.0.0:9091
```
Note that it should be listening on port `9091`.
To kill the RPC server, just press `Ctrl + C`.
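To check from your **host PC** that the server is reachable, you can open an RPC session from Python (a minimal sketch, assuming the TVM Python package is on your `PYTHONPATH` and the board is at `192.168.2.99`):
```python
# Hedged sketch: verify the Pynq RPC server answers on port 9091.
from tvm.contrib import rpc

remote = rpc.connect("192.168.2.99", 9091)
print(remote.cpu(0))  # getting a remote context back confirms the session is live
```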
#!/bin/bash
export PYTHONPATH=${PYTHONPATH}:/home/xilinx/tvm/python
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/python3.6/lib/python3.6/site-packages/pynq/drivers/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/python3.6/lib/python3.6/site-packages/pynq/lib/
python -m tvm.exec.rpc_server --load-library /home/xilinx/vta/lib/libvta.so
quantize_graph.json
quantize_params.pkl
synset.txt
*.jpg
vta.bit
# Resnet-18 Example on Pynq-based VTA Design
In order to run this example you'll need to have:
* VTA installed
* TVM installed
* NNVM installed
* A Pynq-based RPC server running
## VTA installation
Clone the VTA repository in the directory of your choosing:
```bash
git clone git@github.com:uwsaml/vta.git --recursive
```
Update your `~/.bashrc` file to include the VTA python libraries in your `PYTHONPATH` (don't forget to source the newly modified `.bashrc` file!):
```bash
export PYTHONPATH=<vta root>/python:${PYTHONPATH}
```
## TVM installation
Clone the TVM repository in the directory of your choosing:
```bash
git clone git@github.com:dmlc/tvm.git --recursive
```
TVM is rapidly changing, and to ensure stability, we keep track of working TVM checkpoints.
As of now, the TVM checkpoint `e4c2af9abdcb3c7aabafba8084414d7739c17c4c` is known to work with VTA.
```bash
cd <tvm root>
git checkout e4c2af9abdcb3c7aabafba8084414d7739c17c4c
```
Before building TVM, copy the `make/config.mk` file into the root TVM directory:
```bash
cd <tvm root>
cp make/config.mk .
```
In the `config.mk` file, make sure that:
* `LLVM_CONFIG` points to the `llvm-config` executable (e.g. `LLVM_CONFIG = /usr/bin/llvm-config-4.0`). You'll need to have LLVM 4.0 or later installed.
* `USE_RPC` is set to 1

Launch the compilation; this takes about 5 minutes.
```bash
cd <tvm root>
make -j4
```
Finally update your `~/.bashrc` file to include the TVM python libraries in your `PYTHONPATH` (don't forget to source the newly modified `.bashrc` file!):
```bash
export PYTHONPATH=<tvm root>/python:<tvm root>/topi/python:${PYTHONPATH}
```
## NNVM installation
Clone the NNVM repository from `tqchen` in the directory of your choosing:
```bash
git clone git@github.com:tqchen/nnvm.git --recursive
```
To run this example, we rely on a special branch of NNVM: `qt`:
```bash
cd <nnvm root>
git checkout qt
```
Launch the compilation; this takes less than a minute.
```bash
cd <nnvm root>
make -j4
```
Finally update your `~/.bashrc` file to include the NNVM python libraries in your `PYTHONPATH` (don't forget to source the newly modified `.bashrc` file!):
```bash
export PYTHONPATH=<nnvm root>/python:${PYTHONPATH}
```
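At this point it's worth verifying that everything resolves from a fresh shell. A quick sanity check (a sketch, assuming you have sourced the updated `~/.bashrc`):
```python
# All four packages should import once PYTHONPATH includes the
# TVM, TOPI, NNVM and VTA python directories set up above.
import tvm
import topi
import nnvm
import vta
print("imports OK")
```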
## Pynq RPC Server Setup
Follow the [Pynq RPC Server Guide](https://github.com/uwsaml/vta/tree/master/apps/pynq_rpc/README.md).
## Running the example
Simply run the following Python script:
```bash
python imagenet_predict.py
```
This runs ImageNet classification on the cat image `cat.jpg`, using the ResNet18 architecture on a VTA design that performs 8-bit integer inference.
The script reports the runtime measured on the Pynq board and the top-1 result category:
```
('x', (1, 3, 224, 224))
Build complete...
('TVM prediction top-1:', 281, 'tabby, tabby cat')
t-cost=0.41906
```
# some standard imports
import nnvm
import tvm
from nnvm.compiler import graph_attr
import vta
import os
import numpy as np
from PIL import Image
import pickle
import json
import logging
import wget
from tvm.contrib import graph_runtime, rpc, util
factor = 16
host = "pynq"
port = 9091
verbose = False
# only run fpga component, mark non-conv ops as nop
debug_fpga_only = False
# Obtain model and hardware files (they're too large to check-in)
url = "https://homes.cs.washington.edu/~moreau/media/vta/"
TEST_FILE = 'cat.jpg'
CATEG_FILE = 'synset.txt'
RESNET_GRAPH_FILE = 'quantize_graph.json'
RESNET_PARAMS_FILE = 'quantize_params.pkl'
BITSTREAM_FILE = 'vta.bit'
for file in [TEST_FILE, CATEG_FILE, RESNET_GRAPH_FILE, RESNET_PARAMS_FILE, BITSTREAM_FILE]:
    if not os.path.isfile(file):
        print("Downloading {}".format(file))
        wget.download(url + file)
# Program the FPGA remotely
assert tvm.module.enabled("rpc")
remote = rpc.connect(host, port)
remote.upload(BITSTREAM_FILE, BITSTREAM_FILE)
fprogram = remote.get_function("tvm.contrib.vta.init")
fprogram(BITSTREAM_FILE)
if verbose:
    logging.basicConfig(level=logging.INFO)
# Change to -device=tcpu to run cpu only inference.
target = "llvm -device=vta"
synset = eval(open(os.path.join(CATEG_FILE)).read())
image = Image.open(os.path.join(TEST_FILE)).resize((224, 224))
def transform_image(image):
    image = np.array(image) - np.array([123., 117., 104.])
    image /= np.array([58.395, 57.12, 57.375])
    image = image.transpose((2, 0, 1))
    image = image[np.newaxis, :]
    return image
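# For reference (a hedged note, not part of the original script):
# transform_image subtracts the per-channel mean, divides by the
# per-channel std, and reorders HWC -> NCHW, so a (224, 224, 3) image
# becomes an array of shape (1, 3, 224, 224), the layout the graph expects.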
def mark_nop(graph, conv_layer=-1, skip_conv_layer=()):
    """Helper function to mark certain op as nop

    Useful to debug performance issues.
    """
    jgraph = json.loads(graph.json())
    counter = 0
    for nid, node in enumerate(jgraph["nodes"]):
        op_name = node["op"]
        if op_name != "tvm_op":
            continue
        attrs = node["attrs"]
        node_name = node["name"]
        func_name = attrs["func_name"]
        if func_name.find("quantized_conv2d") != -1:
            if conv_layer >= 0:
                if counter != conv_layer:
                    attrs["func_name"] = "__nop"
            if counter in skip_conv_layer:
                attrs["func_name"] = "__nop"
            counter += 1
        else:
            if conv_layer >= 0:
                attrs["func_name"] = "__nop"
            attrs["func_name"] = "__nop"
        if attrs["func_name"] != "__nop":
            print("Run function %s" % func_name)
    graph = nnvm.graph.load_json(json.dumps(jgraph))
    return graph
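# Example usage (a hedged sketch, mirroring the debug_fpga_only path in
# run_e2e below): nop all non-conv ops while also skipping conv layer 0:
#   graph = mark_nop(graph, skip_conv_layer=(0,))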
x = transform_image(image)
print('x', x.shape)
######################################################################
# now compile the graph
import nnvm.compiler
np.random.seed(0)
sym = nnvm.graph.load_json(
    open(os.path.join(RESNET_GRAPH_FILE)).read())
params = pickle.load(
    open(os.path.join(RESNET_PARAMS_FILE)))
shape_dict = {"data": x.shape}
dtype_dict = {"data": 'float32'}
shape_dict.update({k: v.shape for k, v in params.items()})
dtype_dict.update({k: str(v.dtype) for k, v in params.items()})
graph = nnvm.graph.create(sym)
graph_attr.set_shape_inputs(sym, shape_dict)
graph_attr.set_dtype_inputs(sym, dtype_dict)
graph = graph.apply("InferShape").apply("InferType")
dtype = "float32"
sym = vta.graph.remove_stochastic(sym)
sym = vta.graph.clean_cast(sym)
sym = vta.graph.clean_conv_fuse(sym)
if "vta" in target:
sym = vta.graph.pack(sym, shape_dict, factor)
graph_attr.set_shape_inputs(sym, shape_dict)
sym = sym.apply("InferShape")
graph_attr.set_dtype_inputs(sym, dtype_dict)
sym = sym.apply("InferType")
with nnvm.compiler.build_config(opt_level=3):
    bdict = {}
    if "vta" not in target:
        bdict = {"add_lower_pass": []}
    else:
        bdict = {"add_lower_pass": vta.debug_mode(0)}
    with tvm.build_config(**bdict):
        graph, lib, params = nnvm.compiler.build(
            sym, target, shape_dict, dtype_dict,
            params=params)
remote = rpc.connect(host, port)
temp = util.tempdir()
lib.save(temp.relpath("graphlib.o"))
remote.upload(temp.relpath("graphlib.o"))
lib = remote.load_module("graphlib.o")
ctx = remote.ext_dev(0) if "vta" in target else remote.cpu(0)
print("Build complete...")
def run_e2e(graph):
    """Run the end-to-end example."""
    if debug_fpga_only:
        graph = mark_nop(graph, skip_conv_layer=(0,))
    m = graph_runtime.create(graph, lib, ctx)
    # set inputs
    m.set_input('data', tvm.nd.array(x.astype("float32")))
    m.set_input(**params)
    # execute
    timer = m.module.time_evaluator("run", ctx, number=10)
    tcost = timer()
    # get outputs
    tvm_output = m.get_output(
        0, tvm.nd.empty((1000,), dtype, remote.cpu(0)))
    top1 = np.argmax(tvm_output.asnumpy())
    print('TVM prediction top-1:', top1, synset[top1])
    print("t-cost=%g" % tcost.mean)
def run_layer(old_graph):
    """Run a certain layer."""
    for layer_id in range(1, 2):
        graph = mark_nop(old_graph, layer_id)
        m = graph_runtime.create(graph, lib, ctx)
        # set inputs
        m.set_input('data', tvm.nd.array(x.astype("float32")))
        m.set_input(**params)
        # execute
        timer = m.module.time_evaluator("run", ctx, number=10)
        tcost = timer()
        print("resnet[%d]: %g\n" % (layer_id, tcost.mean))

run_e2e(graph)
# Directories
ROOTDIR = $(CURDIR)
BUILD_DIR = $(ROOTDIR)/../../build/hardware/vivado
BUILD_DIR = $(ROOTDIR)/../../build/hardware/xilinx
SCRIPT_DIR = $(ROOTDIR)/scripts
SRC_DIR = $(ROOTDIR)/src
SIM_DIR = $(ROOTDIR)/sim
@@ -64,7 +64,7 @@ bit: ip
cd $(HW_BUILD_PATH) && \
$(VIVADO) -mode tcl -source $(SCRIPT_DIR)/vivado.tcl \
-tclargs $(IP_BUILD_PATH) $(VTA_HW_COMP_THREADS) $(VTA_HW_COMP_CLOCK_FREQ) \
$(VTA_INP_WIDTH) $(VTA_WGT_WIDTH) $(OUT_WIDTH) \
$(VTA_INP_WIDTH) $(VTA_WGT_WIDTH) $(VTA_OUT_WIDTH) \
$(VTA_BATCH) $(VTA_IN_BLOCK) $(VTA_OUT_BLOCK) \
$(VTA_INP_BUFF_SIZE) $(VTA_WGT_BUFF_SIZE) $(VTA_OUT_BUFF_SIZE)
......
# Hardware Compilation Guide
**This hardware compilation guide aims to provide guidance on generating VTA bitstreams with the Xilinx Vivado toolchains.**
As of writing this guide, we recommend using `Vivado 2017.1` since our scripts have been tested to work on this version of the Xilinx toolchains.
# Vivado Toolchains Installation for Pynq Board
## Ubuntu instructions
You’ll need to install Xilinx’ FPGA compilation toolchain, [Vivado HL WebPACK 2017.1](https://www.xilinx.com/products/design-tools/vivado.html), which is a license-free version of the Vivado HLx toolchain.
### Obtaining and launching the installation binary
1. Go to the [download webpage](https://www.xilinx.com/support/download.html), and download the Linux Self Extracting Web Installer for Vivado HL 2017.1 WebPACK and Editions.
2. You’ll have to sign in with a Xilinx account. Creating an account takes about two minutes.
3. Complete the Name and Address Verification by clicking “Next”, and you will get the opportunity to download a binary file, called `Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin`.
4. Now that the file is downloaded, go to your `Downloads` directory, and change the file permissions so it can be executed:
```bash
chmod u+x Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
```
5. Now you can execute the binary:
```bash
./Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
```
### Installation Steps
At this point you've launched the Vivado 2017.1 Installer GUI program.
1. Click “Next” on the **Welcome** screen.
2. Enter your Xilinx user credentials under “User Authentication” and select “Download and Install Now” before clicking “Next” on the **Select Install Type** screen.
3. Accept all terms before clicking on “Next” on the **Accept License Agreements** screen.
4. Select “Vivado HL WebPACK” before clicking on “Next” on the **Select Edition to Install** screen.
5. Under the **Vivado HL WebPACK** screen, before hitting “Next", check the following options (the rest should be unchecked):
* Design Tools -> Vivado Design Suite -> Vivado
* Design Tools -> Vivado Design Suite -> Vivado High Level Synthesis
* Devices -> Production Services -> SoCs -> Zynq-7000 Series
6. The total download size is about 3GB, and the installation requires about 13GB of disk space.
7. Set the installation directory before clicking “Next” on the **Select Destination Directory** screen. It might highlight some paths in red; that’s because the installer doesn’t have permission to write to those directories. In that case, select a path that doesn’t require special write permissions (e.g. in your home directory).
8. Hit “Install” under the **Installation Summary** screen.
9. An **Installation Progress Window** will pop-up to track progress of the download and the installation.
10. This process will take about 20-30 minutes depending on your connection speed.
11. A pop-up window will inform you that the installation completed successfully. Click "OK".
12. Finally the **Vivado License Manager** will launch. Select "Get Free ISE WebPACK, ISE/Vivado IP or PetaLinux License" and click "Connect Now" to complete the license registration process.
### Environment Setup
The last step is to update your `~/.bashrc` with the following line:
```bash
# Xilinx Vivado 2017.1 environment
source <install_path>/Vivado/2017.1/settings64.sh
```
This will include all of the Xilinx binary paths so you can launch compilation scripts from the command line.
Note that sourcing this script overrides the GCC paths required to build TVM and NNVM. Therefore, before building TVM or NNVM, comment out this line in your `~/.bashrc` and re-source it.
# Bitstream compilation
High-level parameters are listed under `<vta root>/make/config.mk` and can be customized by the user.
Bitstream generation is driven by a makefile; all it takes is entering the following command:
```bash
make
```
The local `Makefile` contains several variables that can be tweaked by the user:
* `VTA_HW_COMP_THREADS`: determines the number of threads used for the Vivado compilation job (default 8 threads).
* `VTA_HW_COMP_CLOCK_FREQ`: determines the target frequency of the VTA design (default 100MHz). It can only be set to 100, 142, 167 or 200MHz.
* `VTA_HW_COMP_TIMING_COMP`: determines how much additional slack must be provided to close timing (default 0ns). Generally, when utilization is high for an FPGA design, setting this parameter to 1, 2 or 3 can help close timing.
Once the compilation completes, the generated bitstream can be found under `<vta root>/build/hardware/xilinx/vivado/<design name>/export/vta.bit`.
@@ -40,6 +40,8 @@ int main(void) {
status |= alu_test(VTA_ALU_OPCODE_ADD, true, 16, 128, false);
status |= alu_test(VTA_ALU_OPCODE_SHR, true, 16, 128, true);
status |= alu_test(VTA_ALU_OPCODE_SHR, true, 16, 128, false);
status |= alu_test(VTA_ALU_OPCODE_SHL, true, 16, 128, true);
status |= alu_test(VTA_ALU_OPCODE_SHL, true, 16, 128, false);
// Run ALU test (vector-vector operators)
status |= alu_test(VTA_ALU_OPCODE_MIN, false, 16, 128, true);
......
@@ -107,9 +107,9 @@ typedef ap_uint<VTA_LOG_ACC_WIDTH> aluop_sh_imm_T;
void fetch(
uint32_t insn_count,
volatile insn_T *insns,
hls::stream<insn_T> *load_queue,
hls::stream<insn_T> *gemm_queue,
hls::stream<insn_T> *store_queue);
hls::stream<insn_T> &load_queue,
hls::stream<insn_T> &gemm_queue,
hls::stream<insn_T> &store_queue);
/*!
* \brief Load module.
@@ -129,9 +129,9 @@ void fetch(
void load(
volatile inp_vec_T *inputs,
volatile wgt_vec_T *weights,
hls::stream<insn_T> *load_queue,
hls::stream<bool> *g2l_dep_queue,
hls::stream<bool> *l2g_dep_queue,
hls::stream<insn_T> &load_queue,
hls::stream<bool> &g2l_dep_queue,
hls::stream<bool> &l2g_dep_queue,
inp_vec_T inp_mem[VTA_INP_BUFF_DEPTH][VTA_BATCH],
wgt_vec_T wgt_mem[VTA_WGT_BUFF_DEPTH][VTA_BLOCK_OUT]);
@@ -159,14 +159,14 @@ void load(
* \param out_mem Local output SRAM buffer. Write only single port BRAM.
*/
void compute(
volatile uint32_t *done,
volatile uint32_t &done,
volatile uop_T *uops,
volatile acc_vec_T *biases,
hls::stream<insn_T> *gemm_queue,
hls::stream<bool> *l2g_dep_queue,
hls::stream<bool> *s2g_dep_queue,
hls::stream<bool> *g2l_dep_queue,
hls::stream<bool> *g2s_dep_queue,
hls::stream<insn_T> &gemm_queue,
hls::stream<bool> &l2g_dep_queue,
hls::stream<bool> &s2g_dep_queue,
hls::stream<bool> &g2l_dep_queue,
hls::stream<bool> &g2s_dep_queue,
out_vec_T inp_mem[VTA_INP_BUFF_DEPTH][VTA_BATCH],
wgt_vec_T wgt_mem[VTA_WGT_BUFF_DEPTH][VTA_BLOCK_OUT],
out_vec_T out_mem[VTA_ACC_BUFF_DEPTH][VTA_BATCH]);
@@ -186,9 +186,9 @@ void compute(
*/
void store(
volatile out_vec_T *outputs,
hls::stream<insn_T> *store_queue,
hls::stream<bool> *g2s_dep_queue,
hls::stream<bool> *s2g_dep_queue,
hls::stream<insn_T> &store_queue,
hls::stream<bool> &g2s_dep_queue,
hls::stream<bool> &s2g_dep_queue,
out_vec_T out_mem[VTA_ACC_BUFF_DEPTH][VTA_BATCH]);
/*!
......
@@ -84,7 +84,7 @@ VTA_ACC_BUFF_SIZE = $(shell echo "$$(( 1 << $(VTA_LOG_ACC_BUFF_SIZE) ))" )
VTA_LOG_OUT_BUFF_SIZE = \
$(shell echo "$$(( $(VTA_LOG_ACC_BUFF_SIZE) + $(VTA_LOG_OUT_WIDTH) - $(VTA_LOG_ACC_WIDTH) ))" )
# Out buffer size in Bytes
VTA_OUT_BUFF_SIZE = $(shell echo "$$(( 1 << $(LOG_OUT_BUFF_SIZE) ))" )
VTA_OUT_BUFF_SIZE = $(shell echo "$$(( 1 << $(VTA_LOG_OUT_BUFF_SIZE) ))" )
# Update ADD_CFLAGS
ADD_CFLAGS += \
......
"""VTA Python package backed by TVM"""
"""TVM VTA runtime"""
from __future__ import absolute_import as _abs
from .hw_spec import *
# version of this package
__version__ = "0.1.0"
from .runtime import SCOPE_INP, SCOPE_OUT, SCOPE_WGT, DMA_COPY, ALU
from .intrin import GEVM, GEMM
from .build import debug_mode
from . import mock, ir_pass
from . import arm_conv2d, vta_conv2d
from . import graph
"""Runtime function related hooks"""
from __future__ import absolute_import as _abs
import tvm
from tvm import build_module
from .runtime import CB_HANDLE
from . import ir_pass
def lift_coproc_scope(x):
    x = ir_pass.lift_alloc_to_scope_begin(x)
    x = tvm.ir_pass.LiftAttrScope(x, "coproc_scope", False)
    return x

def early_rewrite(stmt):
    try:
        return tvm.ir_pass.StorageRewrite(stmt)
    except tvm.TVMError:
        return stmt
def debug_mode(debug_flag):
    """Pass to enable vta debug mode.

    Parameters
    ----------
    debug_flag : int
        The debug flag to be passed.

    Returns
    -------
    pass_list: list of function
        The pass to set to build_config(add_lower_pass=vta.debug_mode(mode))
    """
    def add_debug(stmt):
        debug = tvm.call_extern(
            "int32", "VTASetDebugMode", CB_HANDLE, debug_flag)
        return tvm.make.stmt_seq(debug, stmt)
    pass_list = [(1, ir_pass.inject_dma_intrin),
                 (1, ir_pass.inject_skip_copy),
                 (1, ir_pass.annotate_alu_coproc_scope),
                 (1, lambda x: tvm.ir_pass.LiftAttrScope(x, "coproc_uop_scope", True)),
                 (1, lift_coproc_scope),
                 (1, ir_pass.inject_coproc_sync),
                 (1, early_rewrite)]
    if debug_flag:
        pass_list.append((1, add_debug))
    pass_list.append((2, ir_pass.inject_alu_intrin))
    pass_list.append((3, ir_pass.fold_uop_loop))
    pass_list.append((3, ir_pass.cpu_access_rewrite))
    return pass_list
# Add a lower pass to sync uop
build_module.BuildConfig.current.add_lower_pass = debug_mode(0)
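# Usage sketch (an assumption based on the docstring above): user code can
# scope the passes to a single build instead of relying on the global default, e.g.
#   with tvm.build_config(add_lower_pass=vta.debug_mode(0)):
#       tvm.build(s, args, "ext_dev", target)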
"""VTA configuration constants (should match hw_spec.h"""
from __future__ import absolute_import as _abs
# The Constants
VTA_WGT_WIDTH = 8
VTA_INP_WIDTH = VTA_WGT_WIDTH
VTA_OUT_WIDTH = 32
# Dimensions of the GEMM unit
# (BATCH,BLOCK_IN) x (BLOCK_IN,BLOCK_OUT)
VTA_BATCH = 1
VTA_BLOCK_IN = 16
VTA_BLOCK_OUT = 16
# log-2 On-chip wgt buffer size in Bytes
VTA_LOG_WGT_BUFF_SIZE = 15
# log-2 On-chip input buffer size in Bytes
VTA_LOG_INP_BUFF_SIZE = 15
# log-2 On-chip output buffer size in Bytes
VTA_LOG_OUT_BUFF_SIZE = 17
# On-chip wgt buffer size in Bytes
VTA_WGT_BUFF_SIZE = 1 << VTA_LOG_WGT_BUFF_SIZE
# Input buffer size
VTA_INP_BUFF_SIZE = 1 << VTA_LOG_INP_BUFF_SIZE
# Output buffer size.
VTA_OUT_BUFF_SIZE = 1 << VTA_LOG_OUT_BUFF_SIZE
# Number of bytes per buffer
VTA_INP_ELEM_BYTES = (VTA_BATCH*VTA_BLOCK_IN*VTA_INP_WIDTH//8)
VTA_WGT_ELEM_BYTES = (VTA_BLOCK_OUT*VTA_BLOCK_IN*VTA_WGT_WIDTH//8)
VTA_OUT_ELEM_BYTES = (VTA_BATCH*VTA_BLOCK_OUT*VTA_OUT_WIDTH//8)
# Maximum external buffer size in bytes
VTA_MAX_XFER = 1 << 22
# Number of elements
VTA_INP_BUFF_DEPTH = VTA_INP_BUFF_SIZE//VTA_INP_ELEM_BYTES
VTA_WGT_BUFF_DEPTH = VTA_WGT_BUFF_SIZE//VTA_WGT_ELEM_BYTES
VTA_OUT_BUFF_DEPTH = VTA_OUT_BUFF_SIZE//VTA_OUT_ELEM_BYTES
# Memory id for DMA
VTA_MEM_ID_UOP = 0
VTA_MEM_ID_WGT = 1
VTA_MEM_ID_INP = 2
VTA_MEM_ID_ACC = 3
VTA_MEM_ID_OUT = 4
# VTA ALU Opcodes
VTA_ALU_OPCODE_MIN = 0
VTA_ALU_OPCODE_MAX = 1
VTA_ALU_OPCODE_ADD = 2
VTA_ALU_OPCODE_SUB = 3
VTA_ALU_OPCODE_MUL = 4
VTA_ALU_OPCODE_SHL = 5
VTA_ALU_OPCODE_SHR = 6
VTA_ALU_OPCODE_UNSET = 7
# Task queue id (pipeline stage)
VTA_QID_LOAD_INP = 1
VTA_QID_LOAD_WGT = 1
VTA_QID_LOAD_OUT = 2
VTA_QID_STORE_OUT = 3
VTA_QID_COMPUTE = 2
VTA_QID_STORE_INP = 3
# Debug flags
DEBUG_DUMP_INSN = (1 << 1)
DEBUG_DUMP_UOP = (1 << 2)
DEBUG_SKIP_READ_BARRIER = (1 << 3)
DEBUG_SKIP_WRITE_BARRIER = (1 << 4)
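# Worked example (a sanity check on the constants above):
#   VTA_INP_ELEM_BYTES = 1 * 16 * 8 // 8   = 16 bytes
#   VTA_WGT_ELEM_BYTES = 16 * 16 * 8 // 8  = 256 bytes
#   VTA_OUT_ELEM_BYTES = 1 * 16 * 32 // 8  = 64 bytes
# so the buffer depths come out to
#   VTA_INP_BUFF_DEPTH = 2**15 // 16  = 2048
#   VTA_WGT_BUFF_DEPTH = 2**15 // 256 = 128
#   VTA_OUT_BUFF_DEPTH = 2**17 // 64  = 2048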
"""VTA related intrinsics"""
from __future__ import absolute_import as _abs
import tvm
from . import hw_spec as spec
from .runtime import VTA_AXIS, VTA_PUSH_UOP, get_task_qid
from .runtime import SCOPE_OUT, SCOPE_INP, SCOPE_WGT
# The memory information for the compiler
@tvm.register_func("tvm.info.mem.%s" % SCOPE_INP)
def mem_info_inp_buffer():
    return tvm.make.node("MemoryInfo",
                         unit_bits=spec.VTA_INP_ELEM_BYTES * 8,
                         max_simd_bits=spec.VTA_INP_ELEM_BYTES * 8,
                         max_num_bits=spec.VTA_INP_BUFF_SIZE * 8,
                         head_address=None)

@tvm.register_func("tvm.info.mem.%s" % SCOPE_WGT)
def mem_info_wgt_buffer():
    return tvm.make.node("MemoryInfo",
                         unit_bits=spec.VTA_WGT_ELEM_BYTES * 8,
                         max_simd_bits=spec.VTA_WGT_ELEM_BYTES * 8,
                         max_num_bits=spec.VTA_WGT_BUFF_SIZE * 8,
                         head_address=None)

@tvm.register_func("tvm.info.mem.%s" % SCOPE_OUT)
def mem_info_out_buffer():
    return tvm.make.node("MemoryInfo",
                         unit_bits=spec.VTA_OUT_ELEM_BYTES * 8,
                         max_simd_bits=spec.VTA_OUT_ELEM_BYTES * 8,
                         max_num_bits=spec.VTA_OUT_BUFF_SIZE * 8,
                         head_address=None)
def intrin_gevm(mock=False):
    """Vector-matrix multiply intrinsic"""
    wgt_lanes = spec.VTA_WGT_ELEM_BYTES * 8 // spec.VTA_WGT_WIDTH
    assert wgt_lanes == spec.VTA_BLOCK_OUT * spec.VTA_BLOCK_IN
    wgt_shape = (spec.VTA_BLOCK_OUT, spec.VTA_BLOCK_IN)
    assert wgt_shape[0] * wgt_shape[1] == wgt_lanes
    inp_lanes = spec.VTA_INP_ELEM_BYTES * 8 // spec.VTA_INP_WIDTH
    out_lanes = spec.VTA_OUT_ELEM_BYTES * 8 // spec.VTA_OUT_WIDTH
    wgt = tvm.placeholder((wgt_shape[0], wgt_shape[1]),
                          dtype="int%d" % spec.VTA_WGT_WIDTH,
                          name=SCOPE_WGT)
    inp = tvm.placeholder((wgt_shape[1], ),
                          dtype="int%d" % spec.VTA_INP_WIDTH,
                          name=SCOPE_INP)
    k = tvm.reduce_axis((0, wgt_shape[1]), name="k")
    out_dtype = "int%d" % spec.VTA_OUT_WIDTH
    out = tvm.compute((wgt_shape[0],),
                      lambda i: tvm.sum(inp[k].astype(out_dtype) *
                                        wgt[i, k].astype(out_dtype),
                                        axis=[k]),
                      name="out")
    wgt_layout = tvm.decl_buffer(
        wgt.shape, wgt.dtype, SCOPE_WGT,
        scope=SCOPE_WGT, offset_factor=wgt_lanes, data_alignment=wgt_lanes)
    inp_layout = tvm.decl_buffer(
        inp.shape, inp.dtype, SCOPE_INP,
        scope=SCOPE_INP, offset_factor=inp_lanes, data_alignment=inp_lanes)
    out_layout = tvm.decl_buffer(
        out.shape, out.dtype, SCOPE_OUT,
        scope=SCOPE_OUT, offset_factor=out_lanes, data_alignment=out_lanes)

    def intrin_func(ins, outs):
        """Vector-matrix multiply intrinsic function"""
        dinp, dwgt = ins
        dout = outs[0]

        def instr(index):
            """Generate vector-matrix multiply VTA instruction"""
            irb = tvm.ir_builder.create()
            irb.scope_attr(VTA_AXIS, "coproc_scope", get_task_qid(spec.VTA_QID_COMPUTE))
            irb.scope_attr(VTA_AXIS, "coproc_uop_scope", VTA_PUSH_UOP)
            if index == 0 or index == 2:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 0,
                    dout.access_ptr("rw", "int32"),
                    dinp.access_ptr("r", "int32"),
                    dwgt.access_ptr("r", "int32"),
                    0, 0, 0))
            else:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 1,
                    dout.access_ptr("rw", "int32"),
                    0,
                    0,
                    0, 0, 0))
            return irb.get()
        # return a triple of normal-set, reset, update
        nop = tvm.make.Evaluate(0)
        if mock:
            return (nop, nop, nop)
        return (instr(0), instr(1), instr(2))

    return tvm.decl_tensor_intrin(out.op, intrin_func,
                                  name="GEVM",
                                  binds={inp: inp_layout,
                                         wgt: wgt_layout,
                                         out: out_layout})
def intrin_gemm(mock=False):
    """Matrix-matrix multiply intrinsic"""
    wgt_lanes = spec.VTA_WGT_ELEM_BYTES * 8 // spec.VTA_WGT_WIDTH
    assert wgt_lanes == spec.VTA_BLOCK_OUT * spec.VTA_BLOCK_IN
    wgt_shape = (spec.VTA_BLOCK_OUT, spec.VTA_BLOCK_IN)
    assert wgt_shape[0] * wgt_shape[1] == wgt_lanes
    inp_lanes = spec.VTA_INP_ELEM_BYTES * 8 // spec.VTA_INP_WIDTH
    assert inp_lanes == spec.VTA_BATCH * spec.VTA_BLOCK_IN
    inp_shape = (spec.VTA_BATCH, spec.VTA_BLOCK_IN)
    assert inp_shape[0] * inp_shape[1] == inp_lanes
    out_lanes = spec.VTA_OUT_ELEM_BYTES * 8 // spec.VTA_OUT_WIDTH
    assert out_lanes == spec.VTA_BATCH * spec.VTA_BLOCK_OUT
    out_shape = (spec.VTA_BATCH, spec.VTA_BLOCK_OUT)
    assert out_shape[0] * out_shape[1] == out_lanes
    wgt = tvm.placeholder((wgt_shape[0], wgt_shape[1]),
                          dtype="int%d" % spec.VTA_WGT_WIDTH,
                          name=SCOPE_WGT)
    inp = tvm.placeholder((inp_shape[0], inp_shape[1]),
                          dtype="int%d" % spec.VTA_INP_WIDTH,
                          name=SCOPE_INP)
    k = tvm.reduce_axis((0, wgt_shape[1]), name="k")
    out_dtype = "int%d" % spec.VTA_OUT_WIDTH
    out = tvm.compute((out_shape[0], out_shape[1]),
                      lambda i, j: tvm.sum(inp[i, k].astype(out_dtype) *
                                           wgt[j, k].astype(out_dtype),
                                           axis=[k]),
                      name="out")
    wgt_layout = tvm.decl_buffer(
        wgt.shape, wgt.dtype, SCOPE_WGT,
        scope=SCOPE_WGT, offset_factor=wgt_lanes, data_alignment=wgt_lanes)
    inp_layout = tvm.decl_buffer(
        inp.shape, inp.dtype, SCOPE_INP,
        scope=SCOPE_INP, offset_factor=inp_lanes, data_alignment=inp_lanes)
    out_layout = tvm.decl_buffer(
        out.shape, out.dtype, SCOPE_OUT,
        scope=SCOPE_OUT, offset_factor=out_lanes, data_alignment=out_lanes)

    def intrin_func(ins, outs):
        """Matrix-matrix multiply intrinsic function"""
        dinp, dwgt = ins
        dout = outs[0]

        def instr(index):
            """Generate matrix-matrix multiply VTA instruction"""
            irb = tvm.ir_builder.create()
            irb.scope_attr(VTA_AXIS, "coproc_scope", get_task_qid(spec.VTA_QID_COMPUTE))
            irb.scope_attr(VTA_AXIS, "coproc_uop_scope", VTA_PUSH_UOP)
            if index == 0 or index == 2:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 0,
                    dout.access_ptr("rw", "int32"),
                    dinp.access_ptr("r", "int32"),
                    dwgt.access_ptr("r", "int32"),
                    0, 0, 0))
            else:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 1,
                    dout.access_ptr("rw", "int32"),
                    0,
                    0,
                    0, 0, 0))
            return irb.get()
        # return a triple of normal-set, reset, update
        nop = tvm.make.Evaluate(0)
        if mock:
            return (nop, nop, nop)
        return (instr(0), instr(1), instr(2))

    return tvm.decl_tensor_intrin(out.op, intrin_func,
                                  name="GEMM",
                                  binds={inp: inp_layout,
                                         wgt: wgt_layout,
                                         out: out_layout})
GEMM = intrin_gemm()
GEVM = intrin_gevm()
"""Mock interface for skip part of compute """
from .intrin import intrin_gevm, intrin_gemm
GEMM = intrin_gemm(True)
GEVM = intrin_gevm(True)
DMA_COPY = "skip_dma_copy"
ALU = "skip_alu"
"""Runtime function related hooks"""
from __future__ import absolute_import as _abs
import tvm
def thread_local_command_buffer():
    """Get thread local command buffer"""
    ctx = tvm.call_extern("handle", "VTATLSCommandHandle")
    return tvm.make.Call(
        "handle", "tvm_thread_context", [ctx], tvm.expr.Call.Intrinsic, None, 0)
CB_HANDLE = thread_local_command_buffer()
VTA_AXIS = tvm.thread_axis("vta")
VTA_PUSH_UOP = tvm.make.StringImm("VTAPushGEMMOp")
SCOPE_INP = "local.inp_buffer"
SCOPE_OUT = "local.out_buffer"
SCOPE_WGT = "local.wgt_buffer"
DMA_COPY = "dma_copy"
ALU = "alu"
DEBUG_NO_SYNC = False
def get_task_qid(qid):
    """Get transformed queue index."""
    return 1 if DEBUG_NO_SYNC else qid

@tvm.register_func("tvm.intrin.rule.default.vta.coproc_sync")
def coproc_sync(op):
    return tvm.call_extern(
        "int32", "VTASynchronize", CB_HANDLE, 1 << 31)

@tvm.register_func("tvm.intrin.rule.default.vta.coproc_dep_push")
def coproc_dep_push(op):
    return tvm.call_extern(
        "int32", "VTADepPush", CB_HANDLE, op.args[0], op.args[1])

@tvm.register_func("tvm.intrin.rule.default.vta.coproc_dep_pop")
def coproc_dep_pop(op):
    return tvm.call_extern(
        "int32", "VTADepPop", CB_HANDLE, op.args[0], op.args[1])
@@ -80,4 +80,4 @@ void xlnkInvalidateCache(void* buf, int size);
#ifdef __cplusplus
}
#endif
#endif // VTA_PYNQ_PYNQ_DRIVER_H_
#endif // VTA_PYNQ_PYNQ_DRIVER_H_
@@ -1043,9 +1043,9 @@ class CommandQueue {
VTAWriteMappedReg(vta_load_handle_, 0x10, 0);
// LOAD @ 0x18 : Data signal of weight_V
VTAWriteMappedReg(vta_load_handle_, 0x18, 0);
// COMPUTE @ 0x10 : Data signal of uops_V
// COMPUTE @ 0x20 : Data signal of uops_V
VTAWriteMappedReg(vta_compute_handle_, 0x20, 0);
// COMPUTE @ 0x18 : Data signal of biases_V
// COMPUTE @ 0x28 : Data signal of biases_V
VTAWriteMappedReg(vta_compute_handle_, 0x28, 0);
// STORE @ 0x10 : Data signal of outputs_V
VTAWriteMappedReg(vta_store_handle_, 0x10, 0);
......
@@ -39,7 +39,7 @@ uint64_t vta(
#else // NO_SIM
#include "../../../hardware/vivado/src/vta.h"
#include "../../../hardware/xilinx/src/vta.h"
#endif // NO_SIM
......
CC ?= g++
CFLAGS = -Wall -O3 -std=c++11 -I/usr/include
LDFLAGS = -L/usr/lib -L/home/xilinx/pynq/drivers
LDFLAGS = -L/usr/lib -L/opt/python3.6/lib/python3.6/site-packages/pynq/lib/
LIBS = -l:libsds_lib.so -l:libdma.so
INCLUDE_DIR = ../../../include
DRIVER_DIR = ../../../src/pynq
......
"""Testing if we can generate code in topi style"""
import topi
import tvm
from tvm.contrib import util, rpc
import vta
from vta import vta_conv2d
import numpy as np
import mxnet as mx
Workload = vta_conv2d.Workload
@tvm.tag_scope(tag=topi.tag.ELEMWISE)
def my_clip(x, a_min, a_max):
    """Unlike topi's current clip, put min and max into two stages."""
    const_min = tvm.const(a_min, x.dtype)
    const_max = tvm.const(a_max, x.dtype)
    x = tvm.compute(x.shape, lambda *i: tvm.min(x(*i), const_max), name="clipA")
    x = tvm.compute(x.shape, lambda *i: tvm.max(x(*i), const_min), name="clipB")
    return x
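# Numerically (a hedged note): the two stages compute
# max(min(x, a_max), a_min), the same result as np.clip(x, a_min, a_max),
# while keeping each stage a single elementwise tvm.compute op.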
host = "pynq"
port = 9091
out_dtype = "int%d" % vta.VTA_OUT_WIDTH
wgt_dtype = "int%d" % vta.VTA_WGT_WIDTH
inp_dtype = "int%d" % vta.VTA_INP_WIDTH
target = "llvm -target=armv7-none-linux-gnueabihf -mattr=+neon"
print_ir = False
def test_vta_conv2d(key, batch_size, wl, profile=True):
    data_shape = (batch_size, wl.in_filter//vta.VTA_BLOCK_IN,
                  wl.height, wl.width, vta.VTA_BLOCK_IN)
    kernel_shape = (wl.out_filter//vta.VTA_BLOCK_OUT, wl.in_filter//vta.VTA_BLOCK_IN,
                    wl.hkernel, wl.wkernel, vta.VTA_BLOCK_OUT, vta.VTA_BLOCK_IN)
    bias_shape = (wl.out_filter//vta.VTA_BLOCK_OUT, 1, 1, vta.VTA_BLOCK_OUT)
    fout_height = (wl.height + 2 * wl.hpad - wl.hkernel) // wl.hstride + 1
    fout_width = (wl.width + 2 * wl.wpad - wl.wkernel) // wl.wstride + 1
    data = tvm.placeholder(data_shape, name="data", dtype=inp_dtype)
    kernel = tvm.placeholder(kernel_shape, name="kernel", dtype=wgt_dtype)
    bias = tvm.placeholder(bias_shape, name="bias", dtype=out_dtype)

    res_conv = vta_conv2d.packed_conv2d(
        data, kernel, padding=(wl.hpad, wl.wpad), strides=(wl.hstride, wl.wstride))
    res = topi.right_shift(res_conv, 8)
    res = topi.broadcast_add(res, bias)
    res = my_clip(res, 0, 127)
    res = topi.cast(res, "int8")
    num_ops = fout_height * fout_width * wl.hkernel * wl.wkernel * wl.out_filter * wl.in_filter

    def verify(s, check_correctness):
        mod = tvm.build(s, [data, kernel, bias, res], "ext_dev", target, name="conv2d")
        temp = util.tempdir()
        remote = rpc.connect(host, port)
        mod.save(temp.relpath("conv2d.o"))
        remote.upload(temp.relpath("conv2d.o"))
        f = remote.load_module("conv2d.o")
        # verify
        ctx = remote.ext_dev(0)
        # Data in original format
        data_orig = (np.random.uniform(
            size=(batch_size, wl.in_filter, wl.height, wl.width)) * 4).astype(data.dtype)
        kernel_orig = (np.random.uniform(
            size=(wl.out_filter, wl.in_filter, wl.hkernel, wl.wkernel)) * 4).astype(kernel.dtype)
        bias_orig = (np.random.uniform(size=(wl.out_filter,)) * 4).astype("int32")
        data_orig = np.abs(data_orig)
        kernel_orig = np.abs(kernel_orig)
        bias_orig = np.abs(bias_orig)
        data_packed = data_orig.reshape(
            batch_size, wl.in_filter//vta.VTA_BLOCK_IN, vta.VTA_BLOCK_IN,
            wl.height, wl.width).transpose((0, 1, 3, 4, 2))
        kernel_packed = kernel_orig.reshape(
            wl.out_filter//vta.VTA_BLOCK_OUT, vta.VTA_BLOCK_OUT,
            wl.in_filter//vta.VTA_BLOCK_IN, vta.VTA_BLOCK_IN,
            wl.hkernel, wl.wkernel).transpose((0, 2, 4, 5, 1, 3))
        bias_packed = bias_orig.reshape(
            wl.out_filter//vta.VTA_BLOCK_OUT, 1, 1, vta.VTA_BLOCK_OUT)
        res_shape = topi.util.get_const_tuple(res.shape)
        res_np = np.zeros(res_shape).astype(res.dtype)
        data_arr = tvm.nd.array(data_packed, ctx)
        kernel_arr = tvm.nd.array(kernel_packed, ctx)
        bias_arr = tvm.nd.array(bias_packed, ctx)
        res_arr = tvm.nd.array(res_np, ctx)
        time_f = f.time_evaluator("conv2d", ctx, number=10)
        cost = time_f(data_arr, kernel_arr, bias_arr, res_arr)
        res_unpack = res_arr.asnumpy().transpose(
            (0, 1, 4, 2, 3)).reshape(batch_size, wl.out_filter, fout_height, fout_width)
        if check_correctness:
            res_ref = mx.nd.Convolution(
                mx.nd.array(data_orig.astype(out_dtype), mx.cpu(0)),
                mx.nd.array(kernel_orig.astype(out_dtype), mx.cpu(0)),
                stride=(wl.hstride, wl.wstride),
                kernel=(wl.hkernel, wl.wkernel),
                num_filter=wl.out_filter,
                no_bias=True,
                pad=(wl.hpad, wl.wpad)).asnumpy().astype(out_dtype)
            res_ref = res_ref >> 8
            res_ref += bias_orig.reshape(wl.out_filter, 1, 1)
            res_ref = np.clip(res_ref, 0, 127).astype("int8")
            np.testing.assert_allclose(res_unpack, res_ref)
            print("Correctness check pass...")
        return cost

    def conv_normal(print_ir):
        print("----- CONV2D End-to-End Test-------")
        with tvm.build_config(add_lower_pass=vta.debug_mode(0)):
            s = vta_conv2d.schedule_packed_conv2d([res])
            if print_ir:
                print(tvm.lower(s, [data, kernel, bias, res], simple_mode=True))
            cost = verify(s, True)
        gops = (num_ops / cost.mean) / float(10 ** 9)
        print("\tTime cost = %g sec/op, %g GFLOPS" % (cost.mean, gops))

    conv_normal(print_ir)
# ResNet18 workloads
resnet = {
    # Workloads of resnet18 on imagenet
    0: Workload(224, 224, 16, 64, 7, 7, 3, 3, 2, 2),
    1: Workload(56, 56, 64, 64, 3, 3, 1, 1, 1, 1),
    2: Workload(56, 56, 64, 64, 1, 1, 0, 0, 1, 1),
    3: Workload(56, 56, 64, 128, 3, 3, 1, 1, 2, 2),
    4: Workload(56, 56, 64, 128, 1, 1, 0, 0, 2, 2),
    5: Workload(28, 28, 128, 128, 3, 3, 1, 1, 1, 1),
    6: Workload(28, 28, 128, 256, 3, 3, 1, 1, 2, 2),
    7: Workload(28, 28, 128, 256, 1, 1, 0, 0, 2, 2),
    8: Workload(14, 14, 256, 256, 3, 3, 1, 1, 1, 1),
    9: Workload(14, 14, 256, 512, 3, 3, 1, 1, 2, 2),
    10: Workload(14, 14, 256, 512, 1, 1, 0, 0, 2, 2),
    11: Workload(7, 7, 512, 512, 3, 3, 1, 1, 1, 1),
}
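# Rough op count for workload 1 above (a sketch using the same formula as
# num_ops in test_vta_conv2d): fout = (56 + 2*1 - 3) // 1 + 1 = 56, so
# num_ops = 56 * 56 * 3 * 3 * 64 * 64 = 115,605,504 multiply-accumulates.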
batch_size = 1
for i in range(0, len(resnet)):
    wl = resnet[i]
    key = "resnet-cfg[%d]" % i
    print("key=%s" % key)
    print(wl)
    test_vta_conv2d(key, batch_size, wl)
import tvm
import vta
import os
from tvm.contrib import rpc, util
host = "pynq"
port = 9091
target = "llvm -target=armv7-none-linux-gnueabihf"
bit = "vta.bit"
curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__)))
bitstream = os.path.join(curr_path, "./", bit)
def test_program_rpc():
    assert tvm.module.enabled("rpc")
    remote = rpc.connect(host, port)
    remote.upload(bitstream, bit)
    fprogram = remote.get_function("tvm.contrib.vta.init")
    fprogram(bit)

test_program_rpc()