Commit 6bda4e33 by Thierry Moreau, committed by Tianqi Chen

[DOCS] VTA installation guide (#1428)

parent ee3c1b09
@@ -8,7 +8,7 @@ We present three installation guides, each extending on the previous one:
## VTA Simulation-Only Installation
You need [TVM installed](https://docs.tvm.ai/install/index.html) on your machine. For a quick and easy start, use the pre-built Docker image.
The VTA simulator library will be built by default along with TVM.
All you need to run the simulator is to add the vta library to your python path.
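For reference, a minimal sketch of that step, assuming the VTA Python package lives under `vta/python` in your TVM checkout (adjust the path to your setup):

```bash
# Make the VTA Python package importable (path assumed; adapt to your checkout)
export PYTHONPATH=<tvm root>/vta/python:${PYTHONPATH}
```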
@@ -23,10 +23,12 @@ Finally to ensure that you've properly installed the VTA package, we can run sim
Let's first run the 2D convolution test bench that will only run the ResNet-18 convolution layers.
```bash
python <tvm root>/vta/tests/python/integration/test_benchmark_topi_conv2d.py
```
> Note: You'll notice that for every convolution layer, the throughput gets reported in GOPS. These numbers are actually the computational throughput that the simulator achieves, by evaluating the convolution in software.
You can also try out our [VTA programming tutorials](https://docs.tvm.ai/vta/tutorials/index.html) on the VTA simulator.
### Advanced Configuration
@@ -39,7 +41,7 @@ You can modify the content to reconfigure VTA to a different mode. To do so,
```bash
cd <tvm root>
cp vta/config/vta_config.json vta_config.json
# edit vta_config.json
make vta
```
@@ -103,9 +105,6 @@ cd ..
sudo ./apps/pynq_rpc/start_rpc_server.sh # pw is 'xilinx'
```
Note that one key difference from the simulator build is that we changed the VTA configuration to `vta/config/pynq_sample.json`, which specifies PYNQ as the target.
You should see the following being displayed when starting the RPC server. In order to run the next examples, you'll need to leave the RPC server running in an `ssh` session.
```
INFO:root:RPCServer: bind to 0.0.0.0:9091
@@ -118,49 +117,46 @@ Tips regarding the Pynq RPC Server:
### Testing your VTA Pynq-based Hardware Setup
Before running the examples you'll need to configure your host environment as follows:
```bash
export VTA_PYNQ_RPC_HOST=192.168.2.99
export VTA_PYNQ_RPC_PORT=9091
```
In addition, you'll need to edit the `vta_config.json` file on the host to indicate that we are targeting the Pynq platform, by setting the `TARGET` field to `"pynq"`.
Alternatively, you can copy the default `vta/config/pynq_sample.json` into the TVM root as `vta_config.json`.
> Note: in contrast to our simulation setup, there are no libraries to compile on the host side since the host offloads all of the computation to the Pynq board.
```bash
cd <tvm root>
cp vta/config/pynq_sample.json vta_config.json
```
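If you'd rather edit the existing `vta_config.json` in place instead of copying the sample, a small sketch along these lines should work (the `TARGET` field name comes from the text above; the rewrite will reflow the file's JSON formatting):

```bash
cd <tvm root>
# Point the TARGET field at the Pynq board (edits vta_config.json in place)
python -c "import json; c = json.load(open('vta_config.json')); c['TARGET'] = 'pynq'; json.dump(c, open('vta_config.json', 'w'), indent=2)"
```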
This time again, we will run the 2D convolution testbench. But beforehand, we'll need to program the Pynq's own FPGA with a VTA bitstream, and build the VTA runtime on the Pynq via RPC. The following `test_program_rpc.py` script will perform two operations:
* FPGA programming, by downloading a pre-compiled bitstream from a [VTA bitstream repository](https://github.com/uwsaml/vta-distro) that matches the default `vta_config.json` configuration set by the host, and sending it over to the Pynq via RPC to program the Pynq's FPGA.
* Runtime building on the Pynq, which needs to be run every time the `vta_config.json` configuration is modified. This ensures that the VTA software runtime that generates the accelerator's executable via just-in-time (JIT) compilation matches the specifications of the VTA design that is programmed on the FPGA. The build process takes about 30 seconds to complete.
```bash
python <tvm root>/vta/tests/python/pynq/test_program_rpc.py
```
> Tip: You can track progress of the FPGA programming and the runtime rebuilding steps by looking at the RPC server's logging messages in your Pynq `ssh` session.
We are now ready to run the 2D convolution testbench for the ResNet-18 workload in hardware.
```bash
python <tvm root>/vta/tests/python/integration/test_benchmark_topi_conv2d.py
```
The performance metrics measured on the Pynq board will be reported for each convolutional layer.
You can also try out our [VTA programming tutorials](https://docs.tvm.ai/vta/tutorials/index.html).
## VTA Hardware Toolchain Installation
This third and last guide allows users to generate custom VTA bitstreams using free-to-use Xilinx compilation toolchains.
This guide includes:
1. Xilinx toolchain installation (for Linux)
2. Custom VTA bitstream compilation
3. Running the end-to-end ResNet-18 test with the new bitstream
### Xilinx Toolchain Installation
We recommend using `Vivado 2017.1` since our scripts have been tested to work on this version of the Xilinx toolchains. Our guide is written for Linux installation.
@@ -216,15 +212,15 @@ export PATH=${XILINX_SDK}/bin:${PATH}
### Custom VTA Bitstream Compilation
High-level parameters are listed under `<tvm root>/vta/config/vta_config.json` and can be customized by the user. For this custom VTA Bitstream Compilation exercise, we'll change the frequency of our design, so it can be clocked a little faster.
* Set the `HW_FREQ` field to `142`. The Pynq board supports 100, 142, 167 and 200MHz clocks. Note that the higher the frequency, the harder it will be to close timing. Increasing the frequency can lead to timing violations and thus faulty hardware.
* Set the `HW_CLK_TARGET` to `6`. This parameter refers to the target clock period in ns passed to HLS - a lower clock period leads to more aggressive pipelining to achieve timing closure at higher frequencies. Technically a 142MHz clock would require a 7ns target, but we intentionally lower the clock target to 6ns to more aggressively pipeline our design (a sketch of both edits follows below).
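As a concrete sketch of those two edits, using a throwaway Python one-liner (field names are taken from the bullets above; double-check them against your copy of the file):

```bash
cd <tvm root>
# Bump the clock to 142 MHz and tighten the HLS clock target to 6 ns
python -c "import json; p = 'vta/config/vta_config.json'; c = json.load(open(p)); c['HW_FREQ'] = 142; c['HW_CLK_TARGET'] = 6; json.dump(c, open(p, 'w'), indent=2)"
```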
Bitstream generation is driven by a top-level `Makefile` under `<tvm root>/vta/hardware/xilinx/`.
If you just want to simulate the VTA design in software emulation to make sure that it is functional, enter:
```bash
cd <tvm root>/vta/hardware/xilinx
make ip MODE=sim
```
@@ -232,8 +228,8 @@ If you just want to generate the HLS-based VTA IP cores without launching the en
```bash
make ip
```
You'll be able to view the HLS synthesis reports under `<tvm root>/vta/build/hardware/xilinx/hls/<configuration>/<block>/solution0/syn/report/<block>_csynth.rpt`
> Note: The `<configuration>` name is a string that summarizes the VTA configuration parameters specified in the `vta_config.json`. The `<block>` name refers to the specific module in the VTA pipeline.
Finally, to run the full hardware compilation and generate the bitstream, run:
@@ -243,14 +239,14 @@ make
This process is lengthy, and can take up to an hour to complete depending on your machine's specs. We recommend setting the `VTA_HW_COMP_THREADS` variable in the Makefile to take full advantage of all the cores on your development machine.
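For example, on an 8-core machine you could override that variable on the command line rather than editing the Makefile (standard `make` behavior; the variable name comes from the paragraph above):

```bash
cd <tvm root>/vta/hardware/xilinx
# Run the full compilation with 8 parallel jobs
make VTA_HW_COMP_THREADS=8
```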
Once the compilation completes, the generated bitstream can be found under `<tvm root>/vta/build/hardware/xilinx/vivado/<configuration>/export/vta.bit`.
### Use the Custom Bitstream
We can change the FPGA bitstream by simply passing the new bitstream path to the `vta.program_fpga` programming API.
```python
vta.program_fpga(remote, bitstream="<tvm root>/vta/build/hardware/xilinx/vivado/<configuration>/export/vta.bit")
```
Instead of downloading the bitstream from the bitstream repository, the programmer will instead use the custom bitstream you just generated, which is a VTA design clocked at a higher frequency.
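
# Conv2D benchmark for VTA: builds ResNet-18 convolution workloads with TVM, runs
# them on the accelerator over RPC, checks results against MXNet, and logs
# per-layer performance numbers.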
import os
import tvm
import mxnet as mx
import vta
import numpy as np
import topi
from collections import namedtuple
from tvm import rpc
from tvm.contrib import util
import pandas as pd
host = os.environ.get("VTA_PYNQ_RPC_HOST", "pynq")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
target = "llvm -target=armv7-none-linux-gnueabihf -mattr=+neon"

Workload = namedtuple("Conv2DWorkload",
                      ['batch', 'height', 'width', 'in_filter', 'out_filter',
                       'hkernel', 'wkernel', 'hpad', 'wpad', 'hstride', 'wstride'])

class Conv2DSchedule(object):
    def __init__(self,
                 b_factor=1,
                 oc_factor=1,
                 ko_factor=1,
                 h_factor=1,
                 w_factor=0,
                 oc_nthread=0,
                 h_nthread=0,
                 debug_sync=False):
        self.b_factor = b_factor
        self.oc_factor = oc_factor
        self.ko_factor = ko_factor
        self.h_factor = h_factor
        self.w_factor = w_factor
        self.oc_nthread = oc_nthread
        self.h_nthread = h_nthread
        self.debug_sync = debug_sync

    def __str__(self):
        return "{}.{}.{}.{}.{}.{}.{}".format(
            self.b_factor, self.oc_factor, self.ko_factor,
            self.h_factor, self.w_factor,
            self.oc_nthread, self.h_nthread)
Schedule = Conv2DSchedule

def get_insn_count(layer, sched):
    env = vta.get_env()
    b, h, w, ci, co = sched
    b_factor = b
    h_factor = layer.height // h
    w_factor = layer.width // w
    ci_factor = layer.in_filter // (ci * env.BLOCK_IN)
    co_factor = layer.out_filter // (co * env.BLOCK_OUT)
    input_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    weight_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    output_xfers = b_factor * h_factor * w_factor * co_factor
    # compute instruction count
    # output transfer factor: 4 (INIT, GEMM, ALU, STORE)
    # offset: 5 (3 uop kernels, 1 initial dep push, 1 finish, co_factor)
    insn_count = input_xfers + weight_xfers + (output_xfers * 4) + 5 + co_factor
    return insn_count
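
# find_schedules enumerates candidate tiling factors [b, h, w, ci, co] for a layer,
# keeps only those that fit VTA's on-chip buffers and instruction budget, and with
# bestOnly=True returns the single schedule with the least data movement.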
def find_schedules(layer, mtOnly=False, bestOnly=False):
    env = vta.get_env()
    # Helper function to get factors
    def find_factors(n):
        factors = []
        for i in range(1, n+1):
            if n % i == 0:
                factors.append(i)
        return factors
    # Scheduling exploration
    batch_factors = find_factors(layer.batch // env.BATCH)
    height_factors = find_factors(layer.height // layer.hstride)
    width_factors = find_factors(layer.width // layer.wstride)
    cin_factors = find_factors(layer.in_filter // env.BLOCK_IN)
    cout_factors = find_factors(layer.out_filter // env.BLOCK_OUT)
    ht_factors = [1, 2]
    cot_factors = [1, 2]
    # Explore schedules
    schedules = []
    for b in batch_factors:
        for h in height_factors:
            for w in width_factors:
                for ci in cin_factors:
                    for co in cout_factors:
                        # FIXME: filter because of 2D load
                        if w == layer.width/layer.wstride or (w != layer.width/layer.wstride and co == 1):
                            # FIXME: filter because of 2D load
                            if ci == 1:
                                schedules.append([b, h, w, ci, co])
    # Filter the schedules that wouldn't work in the available BRAM sizes
    input_elem_size_b = env.BATCH * env.BLOCK_IN * env.INP_WIDTH
    weight_elem_size_b = env.BLOCK_IN * env.BLOCK_OUT * env.WGT_WIDTH
    output_elem_size_b = env.BATCH * env.BLOCK_OUT * env.OUT_WIDTH
    input_brams_capacity_b = env.INP_BUFF_SIZE * 8
    weight_brams_capacity_b = env.WGT_BUFF_SIZE * 8
    output_brams_capacity_b = env.OUT_BUFF_SIZE * 8
    fil_sched = []
    xfer_size = []
    for sched in schedules:
        b, h, w, ci, co = sched
        for ht in [1, 2]:
            for cot in [1, 2]:
                # Make sure to filter cases where we apply threading on two axes
                # or cases where the threading factors for h and co are not
                # factors of h and co
                if not (ht == 2 and cot == 2) and h % ht == 0 and co % cot == 0:
                    # If in multi-threaded mode, only allow for mt configs:
                    if (mtOnly and (ht == 2 or cot == 2)) or not mtOnly:
                        h /= ht
                        co /= cot
                        input_tile_elems = b * \
                            ((h - 1) * layer.hstride + layer.hkernel) * \
                            ((w - 1) * layer.wstride + layer.wkernel) * ci
                        weight_tile_elems = layer.hkernel * layer.wkernel * ci * co
                        output_tile_elems = b * h * w * co
                        insn_count = get_insn_count(layer, sched)
                        # 1. Check input capacity
                        # 2. Check weight capacity
                        # 3. Check output capacity
                        # 4. Check instruction capacity
                        # 5. Make sure that we don't write to the same acc location
                        #    within 2 consecutive cycles
                        if input_tile_elems*input_elem_size_b <= input_brams_capacity_b/(cot*ht) and \
                           weight_tile_elems*weight_elem_size_b <= weight_brams_capacity_b and \
                           output_tile_elems*output_elem_size_b <= output_brams_capacity_b/(cot*ht) and \
                           insn_count <= env.MAX_XFER // 16 and \
                           h > 2 and w > 2:
                            schedule = Schedule(oc_factor=co, ko_factor=ci, h_factor=h,
                                                w_factor=w, oc_nthread=cot, h_nthread=ht)
                            fil_sched.append(schedule)
                            xfer_size.append(get_data_movementB(schedule, layer))
    if bestOnly:
        return [fil_sched[xfer_size.index(min(xfer_size))]]
    else:
        return fil_sched
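
# get_data_movementB estimates the total number of bytes transferred between DRAM
# and VTA's on-chip buffers for a given schedule applied to a layer.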
def get_data_movementB(sched, layer):
    env = vta.get_env()
    # Derive data movement
    input_elem_size_b = env.BATCH * env.BLOCK_IN * env.INP_WIDTH
    weight_elem_size_b = env.BLOCK_IN * env.BLOCK_OUT * env.WGT_WIDTH
    output_elem_size_b = env.BATCH * env.BLOCK_OUT * env.OUT_WIDTH
    b = sched.b_factor
    h = sched.h_factor
    w = sched.w_factor
    ci = sched.ko_factor
    co = sched.oc_factor
    ht = sched.h_nthread
    cot = sched.oc_nthread
    input_tile_elems = b * \
        ((h - 1) * layer.hstride + layer.hkernel) * \
        ((w - 1) * layer.wstride + layer.wkernel) * ci
    weight_tile_elems = layer.hkernel * layer.wkernel * ci
    output_tile_elems = b * h * w * co
    # Derive factors
    b_factor = layer.batch // (b * env.BATCH)
    h_factor = (layer.height // layer.hstride) // h
    w_factor = (layer.width // layer.wstride) // w
    ci_factor = int(np.ceil(float(layer.in_filter) // (ci * env.BLOCK_IN)))
    co_factor = int(np.ceil(float(layer.out_filter) // (co * env.BLOCK_OUT)))
    # Derive transfers
    input_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    weight_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    output_xfers = b_factor * h_factor * w_factor * co_factor
    # Compute total transfer sizes
    input_xfer_B = input_tile_elems * input_xfers * input_elem_size_b // 8
    weight_xfer_B = weight_tile_elems * weight_xfers * weight_elem_size_b // 8
    output_xfer_B = output_tile_elems * output_xfers * output_elem_size_b // 8
    total_xfer_B = input_xfer_B + weight_xfer_B + output_xfer_B
    return total_xfer_B
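
# test_conv2d_chwv builds a packed conv2d for one workload, offloads it to VTA over
# RPC with the given schedule, optionally checks correctness against MXNet, and
# profiles the load, GEMM, ALU, and store stages individually when profiling is enabled.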
def test_conv2d_chwv(layer, key, batch_size, wl, sched, log_frame, profile=True):
    env = vta.get_env()
    assert batch_size % env.BATCH == 0
    assert wl.in_filter % env.BLOCK_IN == 0
    assert wl.out_filter % env.BLOCK_OUT == 0
    data_shape = (batch_size // env.BATCH, wl.in_filter // env.BLOCK_IN,
                  wl.height, wl.width, env.BATCH, env.BLOCK_IN)
    kernel_shape = (wl.out_filter // env.BLOCK_OUT, wl.in_filter // env.BLOCK_IN,
                    wl.hkernel, wl.wkernel, env.BLOCK_OUT, env.BLOCK_IN)
    fout_height = (wl.height + 2 * wl.hpad - wl.hkernel) // wl.hstride + 1
    fout_width = (wl.width + 2 * wl.wpad - wl.wkernel) // wl.wstride + 1
    res_shape = (batch_size // env.BATCH, wl.out_filter // env.BLOCK_OUT,
                 fout_height, fout_width, env.BATCH, env.BLOCK_OUT)
    data = tvm.placeholder(data_shape, name="data", dtype=env.inp_dtype)
    kernel = tvm.placeholder(kernel_shape, name="kernel", dtype=env.wgt_dtype)
    if wl.hpad or wl.wpad:
        data_buf = topi.nn.pad(data, [0, 0, wl.hpad, wl.wpad, 0, 0], name="data_buf")
    else:
        data_buf = tvm.compute(data_shape, lambda *i: data(*i), "data_buf")
    kernel_buf = tvm.compute(kernel_shape, lambda *i: kernel(*i), "kernel_buf")
    di = tvm.reduce_axis((0, wl.hkernel), name='di')
    dj = tvm.reduce_axis((0, wl.wkernel), name='dj')
    ko = tvm.reduce_axis((0, wl.in_filter//env.BLOCK_IN), name='ko')
    ki = tvm.reduce_axis((0, env.BLOCK_IN), name='ki')
    res_cnv = tvm.compute(
        res_shape,
        lambda bo, co, i, j, bi, ci: tvm.sum(
            data_buf[bo, ko, i*wl.hstride+di, j*wl.wstride+dj, bi, ki].astype(env.acc_dtype) *
            kernel_buf[co, ko, di, dj, ci, ki].astype(env.acc_dtype),
            axis=[ko, di, dj, ki]),
        name="res_cnv")
    # res_shf = tvm.compute(res_shape, lambda *i: res_cnv(*i) >> 8, name="res_shf")
    res_shf = topi.right_shift(res_cnv, 8)
    res = tvm.compute(res_shape, lambda *i: res_shf(*i).astype(env.inp_dtype), name="res")
    num_ops = batch_size * fout_height * fout_width * wl.hkernel * wl.wkernel * wl.out_filter * wl.in_filter
    total_xfer_B = get_data_movementB(sched, wl)

    def verify(s, check_correctness):
        mod = vta.build(s, [data, kernel, res], "ext_dev", target, name="conv2d")
        temp = util.tempdir()
        remote = rpc.connect(host, port)
        mod.save(temp.relpath("conv2d.o"))
        remote.upload(temp.relpath("conv2d.o"))
        f = remote.load_module("conv2d.o")
        # verify
        ctx = remote.ext_dev(0)
        # Data in original format
        data_orig = np.random.randint(
            -128, 128, size=(batch_size, wl.in_filter, wl.height, wl.width)).astype(data.dtype)
        kernel_orig = np.random.randint(
            -128, 128, size=(wl.out_filter, wl.in_filter, wl.hkernel, wl.wkernel)).astype(kernel.dtype)
        data_packed = data_orig.reshape(
            batch_size//env.BATCH, env.BATCH,
            wl.in_filter//env.BLOCK_IN, env.BLOCK_IN,
            wl.height, wl.width).transpose((0, 2, 4, 5, 1, 3))
        kernel_packed = kernel_orig.reshape(
            wl.out_filter//env.BLOCK_OUT, env.BLOCK_OUT,
            wl.in_filter//env.BLOCK_IN, env.BLOCK_IN,
            wl.hkernel, wl.wkernel).transpose((0, 2, 4, 5, 1, 3))
        res_np = np.zeros(res_shape).astype(res.dtype)
        data_arr = tvm.nd.array(data_packed, ctx)
        kernel_arr = tvm.nd.array(kernel_packed, ctx)
        res_arr = tvm.nd.array(res_np, ctx)
        time_f = f.time_evaluator("conv2d", ctx, number=1)
        cost = time_f(data_arr, kernel_arr, res_arr)
        res_unpack = res_arr.asnumpy().transpose(
            (0, 4, 1, 5, 2, 3)).reshape(batch_size, wl.out_filter, fout_height, fout_width)
        if check_correctness:
            res_ref = mx.nd.Convolution(
                mx.nd.array(data_orig.astype(env.acc_dtype), mx.cpu(0)),
                mx.nd.array(kernel_orig.astype(env.acc_dtype), mx.cpu(0)),
                stride=(wl.hstride, wl.wstride),
                kernel=(wl.hkernel, wl.wkernel),
                num_filter=wl.out_filter,
                no_bias=True,
                pad=(wl.hpad, wl.wpad)).asnumpy().astype(env.acc_dtype)
            res_ref = np.right_shift(res_ref, 8).astype(res.dtype)
            np.testing.assert_allclose(res_unpack, res_ref)
            print("Correctness check pass...")
        return cost

    def run_schedule(load_inp, load_wgt, gemm, alu, store_out,
                     print_ir, check_correctness):
        # schedule1
        s = tvm.create_schedule(res.op)
        s[data_buf].set_scope(env.inp_scope)
        s[kernel_buf].set_scope(env.wgt_scope)
        s[res_cnv].set_scope(env.acc_scope)
        s[res_shf].set_scope(env.acc_scope)
        # tile
        oc_factor = (sched.oc_factor if sched.oc_factor
                     else wl.out_filter // env.BLOCK_OUT)
        h_factor = (sched.h_factor if sched.h_factor else fout_height)
        w_factor = (sched.w_factor if sched.w_factor else fout_width)
        xbo, xco, xi, xj, xbi, xci = s[res].op.axis
        xco0, xco1 = s[res].split(xco, factor=oc_factor)
        xi0, xi1 = s[res].split(xi, factor=h_factor)
        xj0, xj1 = s[res].split(xj, factor=w_factor)
        s[res].reorder(xbo, xi0, xco0, xj0, xco1, xi1, xj1, xbi, xci)
        s[res_cnv].compute_at(s[res], xj0)
        s[res_shf].compute_at(s[res], xj0)
        if sched.oc_nthread > 1:
            _, tx = s[res].split(xco0, factor=sched.oc_nthread)
            s[res].reorder(tx, xbo)
            s[res].bind(tx, tvm.thread_axis("cthread"))
        if sched.h_nthread > 1:
            xo, tx = s[res].split(xi0, factor=sched.h_nthread)
            s[res].reorder(tx, xbo)
            s[res].bind(tx, tvm.thread_axis("cthread"))
        xbo, xco, xi, xj, xbi, xci = s[res_cnv].op.axis
        s[res_cnv].reorder(xbo, ko, xj, dj, di, xco, xi, xbi, xci, ki)
        if sched.ko_factor:
            ko0, ko1 = s[res_cnv].split(ko, factor=sched.ko_factor)
            s[data_buf].compute_at(s[res_cnv], ko0)
            s[kernel_buf].compute_at(s[res_cnv], ko0)
        # Use VTA instructions
        s[data_buf].pragma(s[data_buf].op.axis[0], load_inp)
        s[kernel_buf].pragma(s[kernel_buf].op.axis[0], load_wgt)
        s[res_cnv].tensorize(xbi, gemm)
        s[res_shf].pragma(s[res_shf].op.axis[0], alu)
        s[res].pragma(xco1, store_out)
        if sched.debug_sync:
            s[res].pragma(xco0, "coproc_sync")
        if print_ir:
            print(tvm.lower(s, [data, kernel, res], simple_mode=True))
        return verify(s, check_correctness)

    def conv_normal(print_ir):
        print("----- CONV2D End-to-End Test-------")
        def run_test(header, print_ir, check_correctness):
            s = [1, sched.oc_factor, sched.ko_factor, sched.h_factor, sched.w_factor]
            cost = run_schedule(
                env.dma_copy, env.dma_copy,
                env.gemm, env.alu, env.dma_copy,
                print_ir, check_correctness)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["key"].append(key)
            log_frame["layer"].append(layer)
            log_frame["total-data"].append(total_xfer_B)
            log_frame["total-gops"].append(gops)
            log_frame["total-cost"].append(cost.mean)
            log_frame["total-insn"].append(get_insn_count(wl, s))
            log_frame["block-batch"].append(env.BATCH)
            log_frame["block-in"].append(env.BLOCK_IN)
            log_frame["block-out"].append(env.BLOCK_OUT)
            log_frame["inp-width"].append(env.INP_WIDTH)
            log_frame["wgt-width"].append(env.WGT_WIDTH)
            log_frame["uop-size"].append(env.UOP_BUFF_SIZE)
            log_frame["inp-size"].append(env.INP_BUFF_SIZE)
            log_frame["wgt-size"].append(env.WGT_BUFF_SIZE)
            log_frame["out-size"].append(env.OUT_BUFF_SIZE)
            log_frame["oc-factor"].append(sched.oc_factor)
            log_frame["ic-factor"].append(sched.ko_factor)
            log_frame["h-factor"].append(sched.h_factor)
            log_frame["w-factor"].append(sched.w_factor)
            log_frame["oc-threads"].append(sched.oc_nthread)
            log_frame["h-threads"].append(sched.h_nthread)
            log_frame["threaded"].append(True if sched.oc_nthread > 1 or sched.h_nthread > 1 else False)
        with vta.build_config():
            run_test("NORMAL", print_ir, True)

    def skip_alu_unittest(print_ir):
        mock = env.mock
        print("----- Skip ALU Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                env.dma_copy, env.dma_copy,
                env.gemm, mock.alu, env.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["skip-alu-gops"].append(gops)
            log_frame["skip-alu-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def gemm_unittest(print_ir):
        mock = env.mock
        print("----- GEMM Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, mock.dma_copy,
                env.gemm, mock.alu, mock.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["gemm-gops"].append(gops)
            log_frame["gemm-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def alu_unittest(print_ir):
        mock = env.mock
        print("----- ALU Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, mock.dma_copy,
                env.gemm, env.alu, mock.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["alu-gops"].append(gops)
            log_frame["alu-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def load_inp_unittest(print_ir):
        mock = env.mock
        print("----- LoadInp Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                env.dma_copy, mock.dma_copy,
                env.gemm, mock.alu, mock.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            bandwith = (batch_size * wl.in_filter * wl.height *
                        wl.width * env.INP_WIDTH / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS, bandwith=%g gbits" % (
                cost.mean, gops, bandwith))
            log_frame["ld-inp-gbits"].append(bandwith)
            log_frame["ld-inp-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def load_wgt_unittest(print_ir):
        mock = env.mock
        print("----- LoadWgt Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, env.dma_copy,
                env.gemm, mock.alu, mock.dma_copy, print_ir,
                False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            bandwith = (wl.out_filter * wl.in_filter * wl.hkernel *
                        wl.wkernel * env.WGT_WIDTH / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS, bandwith=%g gbits" % (
                cost.mean, gops, bandwith))
            log_frame["ld-wgt-gbits"].append(bandwith)
            log_frame["ld-wgt-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def store_out_unittest(print_ir):
        mock = env.mock
        print("----- StoreOut Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, mock.dma_copy,
                env.gemm, mock.alu, env.dma_copy, print_ir,
                False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            bandwith = (batch_size * wl.out_filter * fout_height *
                        fout_width * env.OUT_WIDTH / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS, bandwith=%g gbits" % (
                cost.mean, gops, bandwith))
            log_frame["st-out-gbits"].append(bandwith)
            log_frame["st-out-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def manual_unittest(print_ir):
        # Manual section used to tweak the components
        mock = env.mock
        print("----- Manual Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                env.dma_copy, env.dma_copy,
                env.gemm, env.alu, mock.dma_copy, print_ir,
                False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (
                cost.mean, gops))
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")
print("=================================")
print("key=%s" % key)
print(wl)
conv_normal(False)
if not profile:
return
skip_alu_unittest(False)
gemm_unittest(False)
alu_unittest(False)
load_inp_unittest(False)
load_wgt_unittest(False)
store_out_unittest(False)
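
# Driver: pick the lowest-data-movement schedule for each ResNet-18 layer and run the
# conv2d benchmark; with profile=True, per-stage results are also written to conv2d.csv.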
# Perform profiling
profile = False
# Data set batch size
batch_size = 1
# Use multi-threading for latency hiding
multi_threaded = False
# ResNet18 workloads
resnet = {
    # Workloads of resnet18 on imagenet
    0: Workload(1, 224, 224, 16, 64, 7, 7, 3, 3, 2, 2),
    1: Workload(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1),
    2: Workload(1, 56, 56, 64, 64, 1, 1, 0, 0, 1, 1),
    3: Workload(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2),
    4: Workload(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2),
    5: Workload(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1),
    6: Workload(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2),
    7: Workload(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2),
    8: Workload(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1),
    9: Workload(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2),
    10: Workload(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2),
    11: Workload(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1),
}
begin = 0
end = len(resnet)
resnet_schedules = []
for i in range(begin, end):
    scheds = find_schedules(resnet[i], mtOnly=multi_threaded, bestOnly=True)
    resnet_schedules.append([i, scheds])
keys = ["key", "layer", "total-data", "total-gops", "total-cost", "total-insn",
        "block-batch", "block-in", "block-out", "wgt-width", "inp-width",
        "uop-size", "inp-size", "wgt-size", "out-size",
        "oc-factor", "ic-factor", "h-factor", "w-factor",
        "oc-threads", "h-threads", "threaded"]
if profile:
    keys += ["skip-alu-gops", "skip-alu-cost",
             "gemm-gops", "gemm-cost", "alu-gops", "alu-cost",
             "ld-inp-cost", "ld-wgt-cost", "st-out-cost",
             "ld-inp-gbits", "ld-wgt-gbits", "st-out-gbits"]
log_frame = {
    k : [] for k in keys
}
for x in resnet_schedules:
    l, plans = x
    for plan in plans:
        key = "resnet-cfg[{}-{}]".format(l, plan)
        test_conv2d_chwv(l, key, batch_size, resnet[l], plan, log_frame, profile)
if profile:
    pd.set_option('expand_frame_repr', False)
    log_df = pd.DataFrame()
    for k in keys:
        log_df[k] = log_frame[k]
    print(log_df)
    log_df.to_csv("conv2d.csv")