Commit 6bda4e33 by Thierry Moreau, committed by Tianqi Chen

[DOCS] VTA installation guide (#1428)

parent ee3c1b09
@@ -8,7 +8,7 @@ We present three installation guides, each extending on the previous one:
## VTA Simulation-Only Installation
You need [TVM installed](https://docs.tvm.ai/install/index.html) on your machine. For a quick and easy start, use the pre-built Docker image.
The VTA simulator library will be built by default along with TVM.
All you need to run the simulator is to add the vta library to your python path.
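For reference, a minimal sketch of that step, assuming the VTA Python package lives under `vta/python` in your TVM checkout (adjust the path to your setup):

```bash
# Make the VTA Python package importable (path assumed; adapt to your checkout)
export PYTHONPATH=<tvm root>/vta/python:${PYTHONPATH}
```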
@@ -23,10 +23,12 @@ Finally to ensure that you've properly installed the VTA package, we can run sim
Let's first run the 2D convolution test bench that will only run the ResNet-18 convolution layers.
```bash
python <tvm root>/vta/tests/python/integration/test_benchmark_topi_conv2d.py
```
> Note: You'll notice that for every convolution layer, the throughput gets reported in GOPS. These numbers are actually the computational throughput that the simulator achieves, by evaluating the convolution in software.
You can also try out our [VTA programming tutorials](https://docs.tvm.ai/vta/tutorials/index.html) on the VTA simulator.
### Advanced Configuration
@@ -39,7 +41,7 @@ You can modify the content to reconfigure VTA to a different mode. To do so,
```bash
cd <tvm root>
cp vta/config/vta_config.json vta_config.json
# edit vta_config.json
make vta
```
@@ -103,9 +105,6 @@ cd ..
sudo ./apps/pynq_rpc/start_rpc_server.sh # pw is 'xilinx'
```
Note that one key difference from the simulator build is that we changed the VTA configuration to `vta/config/pynq_sample.json`, which specifies PYNQ as the target.
You should see the following being displayed when starting the RPC server. In order to run the next examples, you'll need to leave the RPC server running in an `ssh` session.
```
INFO:root:RPCServer: bind to 0.0.0.0:9091
@@ -118,49 +117,46 @@ Tips regarding the Pynq RPC Server:
### Testing your VTA Pynq-based Hardware Setup
Before running the examples you'll need to configure your host environment as follows:
```bash
export VTA_PYNQ_RPC_HOST=192.168.2.99
export VTA_PYNQ_RPC_PORT=9091
```
In addition, you'll need to edit the `vta_config.json` file on the host to indicate that we are targeting the Pynq platform, by setting the `TARGET` field to `"pynq"`.
Alternatively, you can copy the default `vta/config/pynq_sample.json` into the TVM root as `vta_config.json`.
> Note: in contrast to our simulation setup, there are no libraries to compile on the host side since the host offloads all of the computation to the Pynq board.
```bash
cd <tvm root>
cp vta/config/pynq_sample.json vta_config.json
```
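If you'd rather edit the existing `vta_config.json` in place instead of copying the sample, a small sketch along these lines should work (the `TARGET` field name comes from the text above; the rewrite will reflow the file's JSON formatting):

```bash
cd <tvm root>
# Point the TARGET field at the Pynq board (edits vta_config.json in place)
python -c "import json; c = json.load(open('vta_config.json')); c['TARGET'] = 'pynq'; json.dump(c, open('vta_config.json', 'w'), indent=2)"
```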
This time again, we will run the 2D convolution testbench. But beforehand, we'll need to program the Pynq's own FPGA with a VTA bitstream, and build the VTA runtime on the Pynq via RPC. The following `test_program_rpc.py` script will perform two operations:
* FPGA programming, by downloading a pre-compiled bitstream from a [VTA bitstream repository](https://github.com/uwsaml/vta-distro) that matches the default `vta_config.json` configuration set by the host, and sending it over to the Pynq via RPC to program the Pynq's FPGA.
* Runtime building on the Pynq, which needs to be run every time the `vta_config.json` configuration is modified. This ensures that the VTA software runtime that generates the accelerator's executable via just-in-time (JIT) compilation matches the specifications of the VTA design that is programmed on the FPGA. The build process takes about 30 seconds to complete.
```bash
python <tvm root>/vta/tests/python/pynq/test_program_rpc.py
```
> Tip: You can track progress of the FPGA programming and the runtime rebuilding steps by looking at the RPC server's logging messages in your Pynq `ssh` session.
We are now ready to run the 2D convolution testbench for the ResNet-18 workload in hardware.
```bash
python <tvm root>/vta/tests/python/integration/test_benchmark_topi_conv2d.py
```
The performance metrics measured on the Pynq board will be reported for each convolutional layer.
You can also try out our [VTA programming tutorials](https://docs.tvm.ai/vta/tutorials/index.html).
## VTA Hardware Toolchain Installation
This third and last guide allows users to generate custom VTA bitstreams using free-to-use Xilinx compilation toolchains.
This guide includes:
1. Xilinx toolchain installation (for Linux)
2. Custom VTA bitstream compilation
3. Running the end-to-end ResNet-18 test with the new bitstream
### Xilinx Toolchain Installation
We recommend using `Vivado 2017.1` since our scripts have been tested to work on this version of the Xilinx toolchains. Our guide is written for Linux installation.
@@ -216,15 +212,15 @@ export PATH=${XILINX_SDK}/bin:${PATH}
### Custom VTA Bitstream Compilation
High-level parameters are listed under `<tvm root>/vta/config/vta_config.json` and can be customized by the user. For this custom VTA Bitstream Compilation exercise, we'll change the frequency of our design, so it can be clocked a little faster.
* Set the `HW_FREQ` field to `142`. The Pynq board supports 100, 142, 167 and 200MHz clocks. Note that the higher the frequency, the harder it will be to close timing. Increasing the frequency can lead to timing violations and thus faulty hardware.
* Set the `HW_CLK_TARGET` to `6`. This parameter refers to the target clock period in ns passed to HLS - a lower clock period leads to more aggressive pipelining to achieve timing closure at higher frequencies. Technically a 142MHz clock would require a 7ns target, but we intentionally lower the clock target to 6ns to more aggressively pipeline our design (a sketch of both edits follows below).
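As a concrete sketch of those two edits, using a throwaway Python one-liner (field names are taken from the bullets above; double-check them against your copy of the file):

```bash
cd <tvm root>
# Bump the clock to 142 MHz and tighten the HLS clock target to 6 ns
python -c "import json; p = 'vta/config/vta_config.json'; c = json.load(open(p)); c['HW_FREQ'] = 142; c['HW_CLK_TARGET'] = 6; json.dump(c, open(p, 'w'), indent=2)"
```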
Bitstream generation is driven by a top-level `Makefile` under `<tvm root>/vta/hardware/xilinx/`.
If you just want to simulate the VTA design in software emulation to make sure that it is functional, enter:
```bash
cd <tvm root>/vta/hardware/xilinx
make ip MODE=sim
```
@@ -232,8 +228,8 @@ If you just want to generate the HLS-based VTA IP cores without launching the en
```bash
make ip
```
You'll be able to view the HLS synthesis reports under `<tvm root>/vta/build/hardware/xilinx/hls/<configuration>/<block>/solution0/syn/report/<block>_csynth.rpt`
> Note: The `<configuration>` name is a string that summarizes the VTA configuration parameters specified in the `vta_config.json`. The `<block>` name refers to the specific module in the VTA pipeline.
Finally, to run the full hardware compilation and generate the bitstream, run:
@@ -243,14 +239,14 @@ make
This process is lengthy, and can take up to an hour to complete depending on your machine's specs. We recommend setting the `VTA_HW_COMP_THREADS` variable in the Makefile to take full advantage of all the cores on your development machine.
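For example, on an 8-core machine you could override that variable on the command line rather than editing the Makefile (standard `make` behavior; the variable name comes from the paragraph above):

```bash
cd <tvm root>/vta/hardware/xilinx
# Run the full compilation with 8 parallel jobs
make VTA_HW_COMP_THREADS=8
```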
Once the compilation completes, the generated bitstream can be found under `<tvm root>/vta/build/hardware/xilinx/vivado/<configuration>/export/vta.bit`.
### Use the Custom Bitstream
We can change the FPGA bitstream by simply passing the new bitstream path to the `vta.program_fpga` programming API.
```python
vta.program_fpga(remote, bitstream="<tvm root>/vta/build/hardware/xilinx/vivado/<configuration>/export/vta.bit")
```
Instead of downloading the bitstream from the bitstream repository, the programmer will instead use the custom bitstream you just generated, which is a VTA design clocked at a higher frequency.
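
# Conv2D benchmark for VTA: builds ResNet-18 convolution workloads with TVM, runs
# them on the accelerator over RPC, checks results against MXNet, and logs
# per-layer performance numbers.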
import os
import tvm
import mxnet as mx
import vta
import numpy as np
import topi
from collections import namedtuple
from tvm import rpc
from tvm.contrib import util
import pandas as pd
host = os.environ.get("VTA_PYNQ_RPC_HOST", "pynq")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
target = "llvm -target=armv7-none-linux-gnueabihf -mattr=+neon"

Workload = namedtuple("Conv2DWorkload",
                      ['batch', 'height', 'width', 'in_filter', 'out_filter',
                       'hkernel', 'wkernel', 'hpad', 'wpad', 'hstride', 'wstride'])

class Conv2DSchedule(object):
    def __init__(self,
                 b_factor=1,
                 oc_factor=1,
                 ko_factor=1,
                 h_factor=1,
                 w_factor=0,
                 oc_nthread=0,
                 h_nthread=0,
                 debug_sync=False):
        self.b_factor = b_factor
        self.oc_factor = oc_factor
        self.ko_factor = ko_factor
        self.h_factor = h_factor
        self.w_factor = w_factor
        self.oc_nthread = oc_nthread
        self.h_nthread = h_nthread
        self.debug_sync = debug_sync

    def __str__(self):
        return "{}.{}.{}.{}.{}.{}.{}".format(
            self.b_factor, self.oc_factor, self.ko_factor,
            self.h_factor, self.w_factor,
            self.oc_nthread, self.h_nthread)
Schedule = Conv2DSchedule

def get_insn_count(layer, sched):
    env = vta.get_env()
    b, h, w, ci, co = sched
    b_factor = b
    h_factor = layer.height // h
    w_factor = layer.width // w
    ci_factor = layer.in_filter // (ci * env.BLOCK_IN)
    co_factor = layer.out_filter // (co * env.BLOCK_OUT)
    input_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    weight_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    output_xfers = b_factor * h_factor * w_factor * co_factor
    # compute instruction count
    # output transfer factor: 4 (INIT, GEMM, ALU, STORE)
    # offset: 5 (3 uop kernels, 1 initial dep push, 1 finish, co_factor)
    insn_count = input_xfers + weight_xfers + (output_xfers * 4) + 5 + co_factor
    return insn_count
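
# find_schedules enumerates candidate tiling factors [b, h, w, ci, co] for a layer,
# keeps only those that fit VTA's on-chip buffers and instruction budget, and with
# bestOnly=True returns the single schedule with the least data movement.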
def find_schedules(layer, mtOnly=False, bestOnly=False):
    env = vta.get_env()
    # Helper function to get factors
    def find_factors(n):
        factors = []
        for i in range(1, n+1):
            if n % i == 0:
                factors.append(i)
        return factors
    # Scheduling exploration
    batch_factors = find_factors(layer.batch // env.BATCH)
    height_factors = find_factors(layer.height // layer.hstride)
    width_factors = find_factors(layer.width // layer.wstride)
    cin_factors = find_factors(layer.in_filter // env.BLOCK_IN)
    cout_factors = find_factors(layer.out_filter // env.BLOCK_OUT)
    ht_factors = [1, 2]
    cot_factors = [1, 2]
    # Explore schedules
    schedules = []
    for b in batch_factors:
        for h in height_factors:
            for w in width_factors:
                for ci in cin_factors:
                    for co in cout_factors:
                        # FIXME: filter because of 2D load
                        if w == layer.width/layer.wstride or (w != layer.width/layer.wstride and co == 1):
                            # FIXME: filter because of 2D load
                            if ci == 1:
                                schedules.append([b, h, w, ci, co])
    # Filter the schedules that wouldn't work in the available BRAM sizes
    input_elem_size_b = env.BATCH * env.BLOCK_IN * env.INP_WIDTH
    weight_elem_size_b = env.BLOCK_IN * env.BLOCK_OUT * env.WGT_WIDTH
    output_elem_size_b = env.BATCH * env.BLOCK_OUT * env.OUT_WIDTH
    input_brams_capacity_b = env.INP_BUFF_SIZE * 8
    weight_brams_capacity_b = env.WGT_BUFF_SIZE * 8
    output_brams_capacity_b = env.OUT_BUFF_SIZE * 8
    fil_sched = []
    xfer_size = []
    for sched in schedules:
        b, h, w, ci, co = sched
        for ht in [1, 2]:
            for cot in [1, 2]:
                # Make sure to filter cases where we apply threading on two axes
                # or cases where the threading factors for h and co are not
                # factors of h and co
                if not (ht == 2 and cot == 2) and h % ht == 0 and co % cot == 0:
                    # If in multi-threaded mode, only allow for mt configs:
                    if (mtOnly and (ht == 2 or cot == 2)) or not mtOnly:
                        h /= ht
                        co /= cot
                        input_tile_elems = b * \
                            ((h - 1) * layer.hstride + layer.hkernel) * \
                            ((w - 1) * layer.wstride + layer.wkernel) * ci
                        weight_tile_elems = layer.hkernel * layer.wkernel * ci * co
                        output_tile_elems = b * h * w * co
                        insn_count = get_insn_count(layer, sched)
                        # 1. Check input capacity
                        # 2. Check weight capacity
                        # 3. Check output capacity
                        # 4. Check instruction capacity
                        # 5. Make sure that we don't write to the same acc location
                        #    within 2 consecutive cycles
                        if input_tile_elems*input_elem_size_b <= input_brams_capacity_b/(cot*ht) and \
                           weight_tile_elems*weight_elem_size_b <= weight_brams_capacity_b and \
                           output_tile_elems*output_elem_size_b <= output_brams_capacity_b/(cot*ht) and \
                           insn_count <= env.MAX_XFER // 16 and \
                           h > 2 and w > 2:
                            schedule = Schedule(oc_factor=co, ko_factor=ci, h_factor=h,
                                                w_factor=w, oc_nthread=cot, h_nthread=ht)
                            fil_sched.append(schedule)
                            xfer_size.append(get_data_movementB(schedule, layer))
    if bestOnly:
        return [fil_sched[xfer_size.index(min(xfer_size))]]
    else:
        return fil_sched
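
# get_data_movementB estimates the total number of bytes transferred between DRAM
# and VTA's on-chip buffers for a given schedule applied to a layer.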
def get_data_movementB(sched, layer):
    env = vta.get_env()
    # Derive data movement
    input_elem_size_b = env.BATCH * env.BLOCK_IN * env.INP_WIDTH
    weight_elem_size_b = env.BLOCK_IN * env.BLOCK_OUT * env.WGT_WIDTH
    output_elem_size_b = env.BATCH * env.BLOCK_OUT * env.OUT_WIDTH
    b = sched.b_factor
    h = sched.h_factor
    w = sched.w_factor
    ci = sched.ko_factor
    co = sched.oc_factor
    ht = sched.h_nthread
    cot = sched.oc_nthread
    input_tile_elems = b * \
        ((h - 1) * layer.hstride + layer.hkernel) * \
        ((w - 1) * layer.wstride + layer.wkernel) * ci
    weight_tile_elems = layer.hkernel * layer.wkernel * ci
    output_tile_elems = b * h * w * co
    # Derive factors
    b_factor = layer.batch // (b * env.BATCH)
    h_factor = (layer.height // layer.hstride) // h
    w_factor = (layer.width // layer.wstride) // w
    ci_factor = int(np.ceil(float(layer.in_filter) // (ci * env.BLOCK_IN)))
    co_factor = int(np.ceil(float(layer.out_filter) // (co * env.BLOCK_OUT)))
    # Derive transfers
    input_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    weight_xfers = b_factor * h_factor * w_factor * co_factor * ci_factor
    output_xfers = b_factor * h_factor * w_factor * co_factor
    # Compute total transfer sizes
    input_xfer_B = input_tile_elems * input_xfers * input_elem_size_b // 8
    weight_xfer_B = weight_tile_elems * weight_xfers * weight_elem_size_b // 8
    output_xfer_B = output_tile_elems * output_xfers * output_elem_size_b // 8
    total_xfer_B = input_xfer_B + weight_xfer_B + output_xfer_B
    return total_xfer_B
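
# test_conv2d_chwv builds a packed conv2d for one workload, offloads it to VTA over
# RPC with the given schedule, optionally checks correctness against MXNet, and
# profiles the load, GEMM, ALU, and store stages individually when profiling is enabled.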
def test_conv2d_chwv(layer, key, batch_size, wl, sched, log_frame, profile=True):
    env = vta.get_env()
    assert batch_size % env.BATCH == 0
    assert wl.in_filter % env.BLOCK_IN == 0
    assert wl.out_filter % env.BLOCK_OUT == 0
    data_shape = (batch_size // env.BATCH, wl.in_filter // env.BLOCK_IN,
                  wl.height, wl.width, env.BATCH, env.BLOCK_IN)
    kernel_shape = (wl.out_filter // env.BLOCK_OUT, wl.in_filter // env.BLOCK_IN,
                    wl.hkernel, wl.wkernel, env.BLOCK_OUT, env.BLOCK_IN)
    fout_height = (wl.height + 2 * wl.hpad - wl.hkernel) // wl.hstride + 1
    fout_width = (wl.width + 2 * wl.wpad - wl.wkernel) // wl.wstride + 1
    res_shape = (batch_size // env.BATCH, wl.out_filter // env.BLOCK_OUT,
                 fout_height, fout_width, env.BATCH, env.BLOCK_OUT)
    data = tvm.placeholder(data_shape, name="data", dtype=env.inp_dtype)
    kernel = tvm.placeholder(kernel_shape, name="kernel", dtype=env.wgt_dtype)
    if wl.hpad or wl.wpad:
        data_buf = topi.nn.pad(data, [0, 0, wl.hpad, wl.wpad, 0, 0], name="data_buf")
    else:
        data_buf = tvm.compute(data_shape, lambda *i: data(*i), "data_buf")
    kernel_buf = tvm.compute(kernel_shape, lambda *i: kernel(*i), "kernel_buf")
    di = tvm.reduce_axis((0, wl.hkernel), name='di')
    dj = tvm.reduce_axis((0, wl.wkernel), name='dj')
    ko = tvm.reduce_axis((0, wl.in_filter//env.BLOCK_IN), name='ko')
    ki = tvm.reduce_axis((0, env.BLOCK_IN), name='ki')
    res_cnv = tvm.compute(
        res_shape,
        lambda bo, co, i, j, bi, ci: tvm.sum(
            data_buf[bo, ko, i*wl.hstride+di, j*wl.wstride+dj, bi, ki].astype(env.acc_dtype) *
            kernel_buf[co, ko, di, dj, ci, ki].astype(env.acc_dtype),
            axis=[ko, di, dj, ki]),
        name="res_cnv")
    # res_shf = tvm.compute(res_shape, lambda *i: res_cnv(*i) >> 8, name="res_shf")
    res_shf = topi.right_shift(res_cnv, 8)
    res = tvm.compute(res_shape, lambda *i: res_shf(*i).astype(env.inp_dtype), name="res")
    num_ops = batch_size * fout_height * fout_width * wl.hkernel * wl.wkernel * wl.out_filter * wl.in_filter
    total_xfer_B = get_data_movementB(sched, wl)

    def verify(s, check_correctness):
        mod = vta.build(s, [data, kernel, res], "ext_dev", target, name="conv2d")
        temp = util.tempdir()
        remote = rpc.connect(host, port)
        mod.save(temp.relpath("conv2d.o"))
        remote.upload(temp.relpath("conv2d.o"))
        f = remote.load_module("conv2d.o")
        # verify
        ctx = remote.ext_dev(0)
        # Data in original format
        data_orig = np.random.randint(
            -128, 128, size=(batch_size, wl.in_filter, wl.height, wl.width)).astype(data.dtype)
        kernel_orig = np.random.randint(
            -128, 128, size=(wl.out_filter, wl.in_filter, wl.hkernel, wl.wkernel)).astype(kernel.dtype)
        data_packed = data_orig.reshape(
            batch_size//env.BATCH, env.BATCH,
            wl.in_filter//env.BLOCK_IN, env.BLOCK_IN,
            wl.height, wl.width).transpose((0, 2, 4, 5, 1, 3))
        kernel_packed = kernel_orig.reshape(
            wl.out_filter//env.BLOCK_OUT, env.BLOCK_OUT,
            wl.in_filter//env.BLOCK_IN, env.BLOCK_IN,
            wl.hkernel, wl.wkernel).transpose((0, 2, 4, 5, 1, 3))
        res_np = np.zeros(res_shape).astype(res.dtype)
        data_arr = tvm.nd.array(data_packed, ctx)
        kernel_arr = tvm.nd.array(kernel_packed, ctx)
        res_arr = tvm.nd.array(res_np, ctx)
        time_f = f.time_evaluator("conv2d", ctx, number=1)
        cost = time_f(data_arr, kernel_arr, res_arr)
        res_unpack = res_arr.asnumpy().transpose(
            (0, 4, 1, 5, 2, 3)).reshape(batch_size, wl.out_filter, fout_height, fout_width)
        if check_correctness:
            res_ref = mx.nd.Convolution(
                mx.nd.array(data_orig.astype(env.acc_dtype), mx.cpu(0)),
                mx.nd.array(kernel_orig.astype(env.acc_dtype), mx.cpu(0)),
                stride=(wl.hstride, wl.wstride),
                kernel=(wl.hkernel, wl.wkernel),
                num_filter=wl.out_filter,
                no_bias=True,
                pad=(wl.hpad, wl.wpad)).asnumpy().astype(env.acc_dtype)
            res_ref = np.right_shift(res_ref, 8).astype(res.dtype)
            np.testing.assert_allclose(res_unpack, res_ref)
            print("Correctness check pass...")
        return cost

    def run_schedule(load_inp, load_wgt, gemm, alu, store_out,
                     print_ir, check_correctness):
        # schedule1
        s = tvm.create_schedule(res.op)
        s[data_buf].set_scope(env.inp_scope)
        s[kernel_buf].set_scope(env.wgt_scope)
        s[res_cnv].set_scope(env.acc_scope)
        s[res_shf].set_scope(env.acc_scope)
        # tile
        oc_factor = (sched.oc_factor if sched.oc_factor
                     else wl.out_filter // env.BLOCK_OUT)
        h_factor = (sched.h_factor if sched.h_factor else fout_height)
        w_factor = (sched.w_factor if sched.w_factor else fout_width)
        xbo, xco, xi, xj, xbi, xci = s[res].op.axis
        xco0, xco1 = s[res].split(xco, factor=oc_factor)
        xi0, xi1 = s[res].split(xi, factor=h_factor)
        xj0, xj1 = s[res].split(xj, factor=w_factor)
        s[res].reorder(xbo, xi0, xco0, xj0, xco1, xi1, xj1, xbi, xci)
        s[res_cnv].compute_at(s[res], xj0)
        s[res_shf].compute_at(s[res], xj0)
        if sched.oc_nthread > 1:
            _, tx = s[res].split(xco0, factor=sched.oc_nthread)
            s[res].reorder(tx, xbo)
            s[res].bind(tx, tvm.thread_axis("cthread"))
        if sched.h_nthread > 1:
            xo, tx = s[res].split(xi0, factor=sched.h_nthread)
            s[res].reorder(tx, xbo)
            s[res].bind(tx, tvm.thread_axis("cthread"))
        xbo, xco, xi, xj, xbi, xci = s[res_cnv].op.axis
        s[res_cnv].reorder(xbo, ko, xj, dj, di, xco, xi, xbi, xci, ki)
        if sched.ko_factor:
            ko0, ko1 = s[res_cnv].split(ko, factor=sched.ko_factor)
            s[data_buf].compute_at(s[res_cnv], ko0)
            s[kernel_buf].compute_at(s[res_cnv], ko0)
        # Use VTA instructions
        s[data_buf].pragma(s[data_buf].op.axis[0], load_inp)
        s[kernel_buf].pragma(s[kernel_buf].op.axis[0], load_wgt)
        s[res_cnv].tensorize(xbi, gemm)
        s[res_shf].pragma(s[res_shf].op.axis[0], alu)
        s[res].pragma(xco1, store_out)
        if sched.debug_sync:
            s[res].pragma(xco0, "coproc_sync")
        if print_ir:
            print(tvm.lower(s, [data, kernel, res], simple_mode=True))
        return verify(s, check_correctness)

    def conv_normal(print_ir):
        print("----- CONV2D End-to-End Test-------")
        def run_test(header, print_ir, check_correctness):
            s = [1, sched.oc_factor, sched.ko_factor, sched.h_factor, sched.w_factor]
            cost = run_schedule(
                env.dma_copy, env.dma_copy,
                env.gemm, env.alu, env.dma_copy,
                print_ir, check_correctness)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["key"].append(key)
            log_frame["layer"].append(layer)
            log_frame["total-data"].append(total_xfer_B)
            log_frame["total-gops"].append(gops)
            log_frame["total-cost"].append(cost.mean)
            log_frame["total-insn"].append(get_insn_count(wl, s))
            log_frame["block-batch"].append(env.BATCH)
            log_frame["block-in"].append(env.BLOCK_IN)
            log_frame["block-out"].append(env.BLOCK_OUT)
            log_frame["inp-width"].append(env.INP_WIDTH)
            log_frame["wgt-width"].append(env.WGT_WIDTH)
            log_frame["uop-size"].append(env.UOP_BUFF_SIZE)
            log_frame["inp-size"].append(env.INP_BUFF_SIZE)
            log_frame["wgt-size"].append(env.WGT_BUFF_SIZE)
            log_frame["out-size"].append(env.OUT_BUFF_SIZE)
            log_frame["oc-factor"].append(sched.oc_factor)
            log_frame["ic-factor"].append(sched.ko_factor)
            log_frame["h-factor"].append(sched.h_factor)
            log_frame["w-factor"].append(sched.w_factor)
            log_frame["oc-threads"].append(sched.oc_nthread)
            log_frame["h-threads"].append(sched.h_nthread)
            log_frame["threaded"].append(True if sched.oc_nthread > 1 or sched.h_nthread > 1 else False)
        with vta.build_config():
            run_test("NORMAL", print_ir, True)

    def skip_alu_unittest(print_ir):
        mock = env.mock
        print("----- Skip ALU Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                env.dma_copy, env.dma_copy,
                env.gemm, mock.alu, env.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["skip-alu-gops"].append(gops)
            log_frame["skip-alu-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def gemm_unittest(print_ir):
        mock = env.mock
        print("----- GEMM Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, mock.dma_copy,
                env.gemm, mock.alu, mock.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["gemm-gops"].append(gops)
            log_frame["gemm-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def alu_unittest(print_ir):
        mock = env.mock
        print("----- ALU Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, mock.dma_copy,
                env.gemm, env.alu, mock.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (cost.mean, gops))
            log_frame["alu-gops"].append(gops)
            log_frame["alu-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def load_inp_unittest(print_ir):
        mock = env.mock
        print("----- LoadInp Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                env.dma_copy, mock.dma_copy,
                env.gemm, mock.alu, mock.dma_copy,
                print_ir, False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            bandwith = (batch_size * wl.in_filter * wl.height *
                        wl.width * env.INP_WIDTH / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS, bandwith=%g gbits" % (
                cost.mean, gops, bandwith))
            log_frame["ld-inp-gbits"].append(bandwith)
            log_frame["ld-inp-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def load_wgt_unittest(print_ir):
        mock = env.mock
        print("----- LoadWgt Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, env.dma_copy,
                env.gemm, mock.alu, mock.dma_copy, print_ir,
                False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            bandwith = (wl.out_filter * wl.in_filter * wl.hkernel *
                        wl.wkernel * env.WGT_WIDTH / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS, bandwith=%g gbits" % (
                cost.mean, gops, bandwith))
            log_frame["ld-wgt-gbits"].append(bandwith)
            log_frame["ld-wgt-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def store_out_unittest(print_ir):
        mock = env.mock
        print("----- StoreOut Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                mock.dma_copy, mock.dma_copy,
                env.gemm, mock.alu, env.dma_copy, print_ir,
                False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            bandwith = (batch_size * wl.out_filter * fout_height *
                        fout_width * env.OUT_WIDTH / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS, bandwith=%g gbits" % (
                cost.mean, gops, bandwith))
            log_frame["st-out-gbits"].append(bandwith)
            log_frame["st-out-cost"].append(cost.mean)
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")

    def manual_unittest(print_ir):
        # Manual section used to tweak the components
        mock = env.mock
        print("----- Manual Unit Test-------")
        def run_test(header, print_ir):
            cost = run_schedule(
                env.dma_copy, env.dma_copy,
                env.gemm, env.alu, mock.dma_copy, print_ir,
                False)
            gops = (num_ops / cost.mean) / float(10 ** 9)
            print(header)
            print("\tTime cost = %g sec/op, %g GOPS" % (
                cost.mean, gops))
        with vta.build_config():
            run_test("NORMAL", print_ir)
        print("")
print("=================================")
print("key=%s" % key)
print(wl)
conv_normal(False)
if not profile:
return
skip_alu_unittest(False)
gemm_unittest(False)
alu_unittest(False)
load_inp_unittest(False)
load_wgt_unittest(False)
store_out_unittest(False)
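
# Driver: pick the lowest-data-movement schedule for each ResNet-18 layer and run the
# conv2d benchmark; with profile=True, per-stage results are also written to conv2d.csv.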
# Perform profiling
profile = False
# Data set batch size
batch_size = 1
# Use multi-threading for latency hiding
multi_threaded = False
# ResNet18 workloads
resnet = {
    # Workloads of resnet18 on imagenet
    0: Workload(1, 224, 224, 16, 64, 7, 7, 3, 3, 2, 2),
    1: Workload(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1),
    2: Workload(1, 56, 56, 64, 64, 1, 1, 0, 0, 1, 1),
    3: Workload(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2),
    4: Workload(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2),
    5: Workload(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1),
    6: Workload(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2),
    7: Workload(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2),
    8: Workload(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1),
    9: Workload(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2),
    10: Workload(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2),
    11: Workload(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1),
}
begin = 0
end = len(resnet)
resnet_schedules = []
for i in range(begin, end):
    scheds = find_schedules(resnet[i], mtOnly=multi_threaded, bestOnly=True)
    resnet_schedules.append([i, scheds])
keys = ["key", "layer", "total-data", "total-gops", "total-cost", "total-insn",
        "block-batch", "block-in", "block-out", "wgt-width", "inp-width",
        "uop-size", "inp-size", "wgt-size", "out-size",
        "oc-factor", "ic-factor", "h-factor", "w-factor",
        "oc-threads", "h-threads", "threaded"]
if profile:
    keys += ["skip-alu-gops", "skip-alu-cost",
             "gemm-gops", "gemm-cost", "alu-gops", "alu-cost",
             "ld-inp-cost", "ld-wgt-cost", "st-out-cost",
             "ld-inp-gbits", "ld-wgt-gbits", "st-out-gbits"]
log_frame = {
    k : [] for k in keys
}
for x in resnet_schedules:
    l, plans = x
    for plan in plans:
        key = "resnet-cfg[{}-{}]".format(l, plan)
        test_conv2d_chwv(l, key, batch_size, resnet[l], plan, log_frame, profile)
if profile:
    pd.set_option('expand_frame_repr', False)
    log_df = pd.DataFrame()
    for k in keys:
        log_df[k] = log_frame[k]
    print(log_df)
    log_df.to_csv("conv2d.csv")