Commit 96488c11 authored by Thierry Moreau, committed by Tianqi Chen

[PYTHON, TVM] Python TVM library, unit tests and end to end example

* VTA python library
* Python unit tests
* End to end example with Resnet18
* README instructions
* Bug fixes
parent 56a0dea8
@@ -55,10 +55,10 @@ endif
all: lib/libvta.$(SHARED_LIBRARY_SUFFIX)
VTA_LIB_SRC = $(wildcard src/*.cc src/tvm/*.cc)
ifeq ($(TARGET), PYNQ_TARGET)
ifeq ($(TARGET), VTA_PYNQ_TARGET)
VTA_LIB_SRC += $(wildcard src/pynq/*.cc)
LDFLAGS += -L/usr/lib -lsds_lib
LDFLAGS += -L/opt/python3.6/lib/python3.6/site-packages/pynq/drivers/ -l:libdma.so
LDFLAGS += -L/opt/python3.6/lib/python3.6/site-packages/pynq/lib/ -l:libdma.so
endif
VTA_LIB_OBJ = $(patsubst %.cc, build/%.o, $(VTA_LIB_SRC))
@@ -79,7 +79,7 @@ cpplint:
python nnvm/dmlc-core/scripts/lint.py vta cpp include src hardware tests
pylint:
pylint python/vta --rcfile=$(ROOTDIR)/tests/lint/pylintrc
pylint python/tvm_vta --rcfile=$(ROOTDIR)/tests/lint/pylintrc
doc:
doxygen docs/Doxyfile
......
# PYNQ RPC Server for VTA
This guide describes how to set up a Pynq-based RPC server to accelerate deep learning workloads with VTA.
## Pynq Setup
Follow the getting started tutorial for the [Pynq board](http://pynq.readthedocs.io/en/latest/getting_started.html).
* For this RPC setup make sure to go with the *Connect to a Computer* Ethernet setup.
Make sure that you can ssh into your Pynq board successfully:
```bash
ssh xilinx@192.168.2.99
```
When ssh-ing onto the board, the default password for the `xilinx` account is `xilinx`.
For convenience, let's mount the Pynq board's file system so it's easy to access and maintain:
```bash
sshfs xilinx@192.168.2.99:/home/xilinx <mountpoint>
```
## Pynq TVM & VTA installation
On your **host PC**, go to the `<mountpoint>` directory of your Pynq board file system.
```bash
cd <mountpoint>
```
From there, clone the VTA repository:
```bash
git clone git@github.com:uwsaml/vta.git --recursive
```
Next, clone the TVM repository:
```bash
git clone git@github.com:dmlc/tvm.git --recursive
```
TVM is rapidly changing, and to ensure stability, we keep track of working TVM checkpoints.
As of now, the TVM checkpoint `e4c2af9abdcb3c7aabafba8084414d7739c17c4c` is known to work with VTA.
```bash
cd tvm
git checkout e4c2af9abdcb3c7aabafba8084414d7739c17c4c
```
Now, ssh into your **Pynq board** to build the TVM runtime with the following commands:
```bash
ssh xilinx@192.168.2.99 # ssh if you haven't done so
cd ~/tvm
cp make/config.mk .
echo USE_RPC=1 >> config.mk
make runtime -j2
```
## Pynq RPC server setup
We're now ready to build the Pynq RPC server on the Pynq board.
```bash
ssh xilinx@192.168.2.99 # ssh if you haven't done so
cd ~/vta
export TVM_PATH=/home/xilinx/tvm
make
```
The last stage will build the `/home/xilinx/vta/lib/libvta.so` library file on the Pynq board. We are now ready to launch the RPC server on the Pynq. In order to enable the FPGA drivers, we need to run the RPC server with administrator privileges (using `su`, account: `xilinx`, pwd: `xilinx`).
```bash
ssh xilinx@192.168.2.99 # ssh if you haven't done so
cd ~/vta
su
./apps/pynq_rpc/start_rpc_server.sh
```
You should see the following being displayed when starting the RPC server:
```
INFO:root:Load additional library /home/xilinx/vta/lib/libvta.so
INFO:root:RPCServer: bind to 0.0.0.0:9091
```
Note that it should be listening on port `9091`.
To kill the RPC server, just press `Ctrl + C`.
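To check from your **host PC** that the server is reachable, you can open an RPC session from Python (a minimal sketch, assuming the TVM Python package is on your `PYTHONPATH` and the board is at `192.168.2.99`):
```python
# Hedged sketch: verify the Pynq RPC server answers on port 9091.
from tvm.contrib import rpc

remote = rpc.connect("192.168.2.99", 9091)
print(remote.cpu(0))  # getting a remote context back confirms the session is live
```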
#!/bin/bash
export PYTHONPATH=${PYTHONPATH}:/home/xilinx/tvm/python
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/python3.6/lib/python3.6/site-packages/pynq/drivers/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/python3.6/lib/python3.6/site-packages/pynq/lib/
python -m tvm.exec.rpc_server --load-library /home/xilinx/vta/lib/libvta.so
quantize_graph.json
quantize_params.pkl
synset.txt
*.jpg
vta.bit
# Resnet-18 Example on Pynq-based VTA Design
In order to run this example you'll need to have:
* VTA installed
* TVM installed
* NNVM installed
* A Pynq-based RPC server running
## VTA installation
Clone the VTA repository in the directory of your choosing:
```bash
git clone git@github.com:uwsaml/vta.git --recursive
```
Update your `~/.bashrc` file to include the VTA python libraries in your `PYTHONPATH` (don't forget to source the newly modified `.bashrc` file!):
```bash
export PYTHONPATH=<vta root>/python:${PYTHONPATH}
```
## TVM installation
Clone the TVM repository in the directory of your choosing:
```bash
git clone git@github.com:dmlc/tvm.git --recursive
```
TVM is rapidly changing, and to ensure stability, we keep track of working TVM checkpoints.
As of now, the TVM checkpoint `e4c2af9abdcb3c7aabafba8084414d7739c17c4c` is known to work with VTA.
```bash
cd <tvm root>
git checkout e4c2af9abdcb3c7aabafba8084414d7739c17c4c
```
Before building TVM, copy the `make/config.mk` file into the root TVM directory:
```bash
cd <tvm root>
cp make/config.mk .
```
In the `config.mk` file, make sure that:
* `LLVM_CONFIG` points to the `llvm-config` executable (e.g. `LLVM_CONFIG = /usr/bin/llvm-config-4.0`). You'll need to have LLVM 4.0 or later installed.
* `USE_RPC` is set to 1

Launch the compilation; this takes about 5 minutes.
```bash
cd <tvm root>
make -j4
```
Finally update your `~/.bashrc` file to include the TVM python libraries in your `PYTHONPATH` (don't forget to source the newly modified `.bashrc` file!):
```bash
export PYTHONPATH=<tvm root>/python:<tvm root>/topi/python:${PYTHONPATH}
```
## NNVM installation
Clone the NNVM repository from `tqchen` in the directory of your choosing:
```bash
git clone git@github.com:tqchen/nnvm.git --recursive
```
To run this example, we rely on a special branch of NNVM: `qt`:
```bash
cd <nnvm root>
git checkout qt
```
Launch the compilation; this takes less than a minute.
```bash
cd <nnvm root>
make -j4
```
Finally update your `~/.bashrc` file to include the NNVM python libraries in your `PYTHONPATH` (don't forget to source the newly modified `.bashrc` file!):
```bash
export PYTHONPATH=<nnvm root>/python:${PYTHONPATH}
```
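At this point it's worth verifying that everything resolves from a fresh shell. A quick sanity check (a sketch, assuming you have sourced the updated `~/.bashrc`):
```python
# All four packages should import once PYTHONPATH includes the
# TVM, TOPI, NNVM and VTA python directories set up above.
import tvm
import topi
import nnvm
import vta
print("imports OK")
```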
## Pynq RPC Server Setup
Follow the [Pynq RPC Server Guide](https://github.com/uwsaml/vta/tree/master/apps/pynq_rpc/README.md).
## Running the example
Simply run the following Python script:
```bash
python imagenet_predict.py
```
This runs ImageNet classification on the cat image `cat.jpg`, using the ResNet18 architecture on a VTA design that performs 8-bit integer inference.
The script reports the runtime measured on the Pynq board and the top-1 result category:
```
('x', (1, 3, 224, 224))
Build complete...
('TVM prediction top-1:', 281, 'tabby, tabby cat')
t-cost=0.41906
```
# some standard imports
import nnvm
import tvm
from nnvm.compiler import graph_attr
import vta
import os
import numpy as np
from PIL import Image
import pickle
import json
import logging
import wget
from tvm.contrib import graph_runtime, rpc, util
factor = 16
host = "pynq"
port = 9091
verbose = False
# only run fpga component, mark non-conv ops as nop
debug_fpga_only = False
# Obtain model and hardware files (they're too large to check-in)
url = "https://homes.cs.washington.edu/~moreau/media/vta/"
TEST_FILE = 'cat.jpg'
CATEG_FILE = 'synset.txt'
RESNET_GRAPH_FILE = 'quantize_graph.json'
RESNET_PARAMS_FILE = 'quantize_params.pkl'
BITSTREAM_FILE = 'vta.bit'
for file in [TEST_FILE, CATEG_FILE, RESNET_GRAPH_FILE, RESNET_PARAMS_FILE, BITSTREAM_FILE]:
    if not os.path.isfile(file):
        print("Downloading {}".format(file))
        wget.download(url + file)
# Program the FPGA remotely
assert tvm.module.enabled("rpc")
remote = rpc.connect(host, port)
remote.upload(BITSTREAM_FILE, BITSTREAM_FILE)
fprogram = remote.get_function("tvm.contrib.vta.init")
fprogram(BITSTREAM_FILE)
if verbose:
    logging.basicConfig(level=logging.INFO)
# Change to -device=tcpu to run cpu only inference.
target = "llvm -device=vta"
synset = eval(open(os.path.join(CATEG_FILE)).read())
image = Image.open(os.path.join(TEST_FILE)).resize((224, 224))
def transform_image(image):
    image = np.array(image) - np.array([123., 117., 104.])
    image /= np.array([58.395, 57.12, 57.375])
    image = image.transpose((2, 0, 1))
    image = image[np.newaxis, :]
    return image
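# For reference (a hedged note, not part of the original script):
# transform_image subtracts the per-channel mean, divides by the
# per-channel std, and reorders HWC -> NCHW, so a (224, 224, 3) image
# becomes an array of shape (1, 3, 224, 224), the layout the graph expects.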
def mark_nop(graph, conv_layer=-1, skip_conv_layer=()):
    """Helper function to mark certain op as nop

    Useful to debug performance issues.
    """
    jgraph = json.loads(graph.json())
    counter = 0
    for nid, node in enumerate(jgraph["nodes"]):
        op_name = node["op"]
        if op_name != "tvm_op":
            continue
        attrs = node["attrs"]
        node_name = node["name"]
        func_name = attrs["func_name"]
        if func_name.find("quantized_conv2d") != -1:
            if conv_layer >= 0:
                if counter != conv_layer:
                    attrs["func_name"] = "__nop"
            if counter in skip_conv_layer:
                attrs["func_name"] = "__nop"
            counter += 1
        else:
            if conv_layer >= 0:
                attrs["func_name"] = "__nop"
            attrs["func_name"] = "__nop"
        if attrs["func_name"] != "__nop":
            print("Run function %s" % func_name)
    graph = nnvm.graph.load_json(json.dumps(jgraph))
    return graph
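# Example usage (a hedged sketch, mirroring the debug_fpga_only path in
# run_e2e below): nop all non-conv ops while also skipping conv layer 0:
#   graph = mark_nop(graph, skip_conv_layer=(0,))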
x = transform_image(image)
print('x', x.shape)
######################################################################
# now compile the graph
import nnvm.compiler
np.random.seed(0)
sym = nnvm.graph.load_json(
    open(os.path.join(RESNET_GRAPH_FILE)).read())
params = pickle.load(
    open(os.path.join(RESNET_PARAMS_FILE)))
shape_dict = {"data": x.shape}
dtype_dict = {"data": 'float32'}
shape_dict.update({k: v.shape for k, v in params.items()})
dtype_dict.update({k: str(v.dtype) for k, v in params.items()})
graph = nnvm.graph.create(sym)
graph_attr.set_shape_inputs(sym, shape_dict)
graph_attr.set_dtype_inputs(sym, dtype_dict)
graph = graph.apply("InferShape").apply("InferType")
dtype = "float32"
sym = vta.graph.remove_stochastic(sym)
sym = vta.graph.clean_cast(sym)
sym = vta.graph.clean_conv_fuse(sym)
if "vta" in target:
sym = vta.graph.pack(sym, shape_dict, factor)
graph_attr.set_shape_inputs(sym, shape_dict)
sym = sym.apply("InferShape")
graph_attr.set_dtype_inputs(sym, dtype_dict)
sym = sym.apply("InferType")
with nnvm.compiler.build_config(opt_level=3):
    bdict = {}
    if "vta" not in target:
        bdict = {"add_lower_pass": []}
    else:
        bdict = {"add_lower_pass": vta.debug_mode(0)}
    with tvm.build_config(**bdict):
        graph, lib, params = nnvm.compiler.build(
            sym, target, shape_dict, dtype_dict,
            params=params)
remote = rpc.connect(host, port)
temp = util.tempdir()
lib.save(temp.relpath("graphlib.o"))
remote.upload(temp.relpath("graphlib.o"))
lib = remote.load_module("graphlib.o")
ctx = remote.ext_dev(0) if "vta" in target else remote.cpu(0)
print("Build complete...")
def run_e2e(graph):
    """Run the end-to-end example."""
    if debug_fpga_only:
        graph = mark_nop(graph, skip_conv_layer=(0,))
    m = graph_runtime.create(graph, lib, ctx)
    # set inputs
    m.set_input('data', tvm.nd.array(x.astype("float32")))
    m.set_input(**params)
    # execute
    timer = m.module.time_evaluator("run", ctx, number=10)
    tcost = timer()
    # get outputs
    tvm_output = m.get_output(
        0, tvm.nd.empty((1000,), dtype, remote.cpu(0)))
    top1 = np.argmax(tvm_output.asnumpy())
    print('TVM prediction top-1:', top1, synset[top1])
    print("t-cost=%g" % tcost.mean)
def run_layer(old_graph):
    """Run a certain layer."""
    for layer_id in range(1, 2):
        graph = mark_nop(old_graph, layer_id)
        m = graph_runtime.create(graph, lib, ctx)
        # set inputs
        m.set_input('data', tvm.nd.array(x.astype("float32")))
        m.set_input(**params)
        # execute
        timer = m.module.time_evaluator("run", ctx, number=10)
        tcost = timer()
        print("resnet[%d]: %g\n" % (layer_id, tcost.mean))

run_e2e(graph)
# Directories
ROOTDIR = $(CURDIR)
BUILD_DIR = $(ROOTDIR)/../../build/hardware/vivado
BUILD_DIR = $(ROOTDIR)/../../build/hardware/xilinx
SCRIPT_DIR = $(ROOTDIR)/scripts
SRC_DIR = $(ROOTDIR)/src
SIM_DIR = $(ROOTDIR)/sim
@@ -64,7 +64,7 @@ bit: ip
cd $(HW_BUILD_PATH) && \
$(VIVADO) -mode tcl -source $(SCRIPT_DIR)/vivado.tcl \
-tclargs $(IP_BUILD_PATH) $(VTA_HW_COMP_THREADS) $(VTA_HW_COMP_CLOCK_FREQ) \
$(VTA_INP_WIDTH) $(VTA_WGT_WIDTH) $(OUT_WIDTH) \
$(VTA_INP_WIDTH) $(VTA_WGT_WIDTH) $(VTA_OUT_WIDTH) \
$(VTA_BATCH) $(VTA_IN_BLOCK) $(VTA_OUT_BLOCK) \
$(VTA_INP_BUFF_SIZE) $(VTA_WGT_BUFF_SIZE) $(VTA_OUT_BUFF_SIZE)
......
# Hardware Compilation Guide
**This hardware compilation guide aims to provide guidance on generating VTA bitstreams with the Xilinx Vivado toolchains.**
As of writing this guide, we recommend using `Vivado 2017.1` since our scripts have been tested to work on this version of the Xilinx toolchains.
# Vivado Toolchains Installation for Pynq Board
## Ubuntu instructions
You’ll need to install Xilinx’ FPGA compilation toolchain, [Vivado HL WebPACK 2017.1](https://www.xilinx.com/products/design-tools/vivado.html), which is a license-free version of the Vivado HLx toolchain.
### Obtaining and launching the installation binary
1. Go to the [download webpage](https://www.xilinx.com/support/download.html), and download the Linux Self Extracting Web Installer for Vivado HL 2017.1 WebPACK and Editions.
2. You’ll have to sign in with a Xilinx account. Creating an account takes about two minutes.
3. Complete the Name and Address Verification by clicking “Next”, and you will get the opportunity to download a binary file, called `Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin`.
4. Now that the file is downloaded, go to your `Downloads` directory, and change the file permissions so it can be executed:
```bash
chmod u+x Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
```
5. Now you can execute the binary:
```bash
./Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
```
### Installation Steps
At this point you've launched the Vivado 2017.1 Installer GUI program.
1. Click “Next” on the **Welcome** screen.
2. Enter your Xilinx user credentials under “User Authentication” and select “Download and Install Now” before clicking “Next” on the **Select Install Type** screen.
3. Accept all terms before clicking on “Next” on the **Accept License Agreements** screen.
4. Select “Vivado HL WebPACK” before clicking on “Next” on the **Select Edition to Install** screen.
5. Under the **Vivado HL WebPACK** screen, before hitting “Next", check the following options (the rest should be unchecked):
* Design Tools -> Vivado Design Suite -> Vivado
* Design Tools -> Vivado Design Suite -> Vivado High Level Synthesis
* Devices -> Production Services -> SoCs -> Zynq-7000 Series
6. The total download size is about 3GB, and the installation requires about 13GB of disk space.
7. Set the installation directory before clicking “Next” on the **Select Destination Directory** screen. It might highlight some paths in red; that’s because the installer doesn’t have permission to write to those directories. In that case, select a path that doesn’t require special write permissions (e.g. in your home directory).
8. Hit “Install” under the **Installation Summary** screen.
9. An **Installation Progress Window** will pop-up to track progress of the download and the installation.
10. This process will take about 20-30 minutes depending on your connection speed.
11. A pop-up window will inform you that the installation completed successfully. Click "OK".
12. Finally the **Vivado License Manager** will launch. Select "Get Free ISE WebPACK, ISE/Vivado IP or PetaLinux License" and click "Connect Now" to complete the license registration process.
### Environment Setup
The last step is to update your `~/.bashrc` with the following line:
```bash
# Xilinx Vivado 2017.1 environment
source <install_path>/Vivado/2017.1/settings64.sh
```
This will include all of the Xilinx binary paths so you can launch compilation scripts from the command line.
Note that sourcing this script overrides the GCC paths required to build TVM and NNVM. Therefore, before building TVM or NNVM, comment out this line in your `~/.bashrc` and re-source it.
# Bitstream compilation
High-level parameters are listed under `<vta root>/make/config.mk` and can be customized by the user.
Bitstream generation is driven by a makefile; all it takes is entering the following command:
```bash
make
```
The local `Makefile` contains several variables that can be tweaked by the user:
* `VTA_HW_COMP_THREADS`: determines the number of threads used for the Vivado compilation job (default 8 threads).
* `VTA_HW_COMP_CLOCK_FREQ`: determines the target frequency of the VTA design (default 100MHz). It can only be set to 100, 142, 167 or 200MHz.
* `VTA_HW_COMP_TIMING_COMP`: determines how much additional slack must be provided to close timing (default 0ns). Generally, when utilization is high for an FPGA design, setting this parameter to 1, 2 or 3 can help close timing.
Once the compilation completes, the generated bitstream can be found under `<vta root>/build/hardware/xilinx/vivado/<design name>/export/vta.bit`.
@@ -40,6 +40,8 @@ int main(void) {
status |= alu_test(VTA_ALU_OPCODE_ADD, true, 16, 128, false);
status |= alu_test(VTA_ALU_OPCODE_SHR, true, 16, 128, true);
status |= alu_test(VTA_ALU_OPCODE_SHR, true, 16, 128, false);
status |= alu_test(VTA_ALU_OPCODE_SHL, true, 16, 128, true);
status |= alu_test(VTA_ALU_OPCODE_SHL, true, 16, 128, false);
// Run ALU test (vector-vector operators)
status |= alu_test(VTA_ALU_OPCODE_MIN, false, 16, 128, true);
......
@@ -107,9 +107,9 @@ typedef ap_uint<VTA_LOG_ACC_WIDTH> aluop_sh_imm_T;
void fetch(
uint32_t insn_count,
volatile insn_T *insns,
hls::stream<insn_T> *load_queue,
hls::stream<insn_T> *gemm_queue,
hls::stream<insn_T> *store_queue);
hls::stream<insn_T> &load_queue,
hls::stream<insn_T> &gemm_queue,
hls::stream<insn_T> &store_queue);
/*!
* \brief Load module.
@@ -129,9 +129,9 @@ void fetch(
void load(
volatile inp_vec_T *inputs,
volatile wgt_vec_T *weights,
hls::stream<insn_T> *load_queue,
hls::stream<bool> *g2l_dep_queue,
hls::stream<bool> *l2g_dep_queue,
hls::stream<insn_T> &load_queue,
hls::stream<bool> &g2l_dep_queue,
hls::stream<bool> &l2g_dep_queue,
inp_vec_T inp_mem[VTA_INP_BUFF_DEPTH][VTA_BATCH],
wgt_vec_T wgt_mem[VTA_WGT_BUFF_DEPTH][VTA_BLOCK_OUT]);
@@ -159,14 +159,14 @@ void load(
* \param out_mem Local output SRAM buffer. Write only single port BRAM.
*/
void compute(
volatile uint32_t *done,
volatile uint32_t &done,
volatile uop_T *uops,
volatile acc_vec_T *biases,
hls::stream<insn_T> *gemm_queue,
hls::stream<bool> *l2g_dep_queue,
hls::stream<bool> *s2g_dep_queue,
hls::stream<bool> *g2l_dep_queue,
hls::stream<bool> *g2s_dep_queue,
hls::stream<insn_T> &gemm_queue,
hls::stream<bool> &l2g_dep_queue,
hls::stream<bool> &s2g_dep_queue,
hls::stream<bool> &g2l_dep_queue,
hls::stream<bool> &g2s_dep_queue,
out_vec_T inp_mem[VTA_INP_BUFF_DEPTH][VTA_BATCH],
wgt_vec_T wgt_mem[VTA_WGT_BUFF_DEPTH][VTA_BLOCK_OUT],
out_vec_T out_mem[VTA_ACC_BUFF_DEPTH][VTA_BATCH]);
@@ -186,9 +186,9 @@ void compute(
*/
void store(
volatile out_vec_T *outputs,
hls::stream<insn_T> *store_queue,
hls::stream<bool> *g2s_dep_queue,
hls::stream<bool> *s2g_dep_queue,
hls::stream<insn_T> &store_queue,
hls::stream<bool> &g2s_dep_queue,
hls::stream<bool> &s2g_dep_queue,
out_vec_T out_mem[VTA_ACC_BUFF_DEPTH][VTA_BATCH]);
/*!
......
@@ -84,7 +84,7 @@ VTA_ACC_BUFF_SIZE = $(shell echo "$$(( 1 << $(VTA_LOG_ACC_BUFF_SIZE) ))" )
VTA_LOG_OUT_BUFF_SIZE = \
$(shell echo "$$(( $(VTA_LOG_ACC_BUFF_SIZE) + $(VTA_LOG_OUT_WIDTH) - $(VTA_LOG_ACC_WIDTH) ))" )
# Out buffer size in Bytes
VTA_OUT_BUFF_SIZE = $(shell echo "$$(( 1 << $(LOG_OUT_BUFF_SIZE) ))" )
VTA_OUT_BUFF_SIZE = $(shell echo "$$(( 1 << $(VTA_LOG_OUT_BUFF_SIZE) ))" )
# Update ADD_CFLAGS
ADD_CFLAGS += \
......
"""VTA Python package backed by TVM"""
"""TVM VTA runtime"""
from __future__ import absolute_import as _abs
from .hw_spec import *
# version of this package
__version__ = "0.1.0"
from .runtime import SCOPE_INP, SCOPE_OUT, SCOPE_WGT, DMA_COPY, ALU
from .intrin import GEVM, GEMM
from .build import debug_mode
from . import mock, ir_pass
from . import arm_conv2d, vta_conv2d
from . import graph
"""Runtime function related hooks"""
from __future__ import absolute_import as _abs
import tvm
from tvm import build_module
from .runtime import CB_HANDLE
from . import ir_pass
def lift_coproc_scope(x):
    x = ir_pass.lift_alloc_to_scope_begin(x)
    x = tvm.ir_pass.LiftAttrScope(x, "coproc_scope", False)
    return x

def early_rewrite(stmt):
    try:
        return tvm.ir_pass.StorageRewrite(stmt)
    except tvm.TVMError:
        return stmt
def debug_mode(debug_flag):
    """Pass to enable vta debug mode.

    Parameters
    ----------
    debug_flag : int
        The debug flag to be passed.

    Returns
    -------
    pass_list: list of function
        The pass to set to build_config(add_lower_pass=vta.debug_mode(mode))
    """
    def add_debug(stmt):
        debug = tvm.call_extern(
            "int32", "VTASetDebugMode", CB_HANDLE, debug_flag)
        return tvm.make.stmt_seq(debug, stmt)
    pass_list = [(1, ir_pass.inject_dma_intrin),
                 (1, ir_pass.inject_skip_copy),
                 (1, ir_pass.annotate_alu_coproc_scope),
                 (1, lambda x: tvm.ir_pass.LiftAttrScope(x, "coproc_uop_scope", True)),
                 (1, lift_coproc_scope),
                 (1, ir_pass.inject_coproc_sync),
                 (1, early_rewrite)]
    if debug_flag:
        pass_list.append((1, add_debug))
    pass_list.append((2, ir_pass.inject_alu_intrin))
    pass_list.append((3, ir_pass.fold_uop_loop))
    pass_list.append((3, ir_pass.cpu_access_rewrite))
    return pass_list
# Add a lower pass to sync uop
build_module.BuildConfig.current.add_lower_pass = debug_mode(0)
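# Usage sketch (an assumption based on the docstring above): user code can
# scope the passes to a single build instead of relying on the global default, e.g.
#   with tvm.build_config(add_lower_pass=vta.debug_mode(0)):
#       tvm.build(s, args, "ext_dev", target)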
"""VTA configuration constants (should match hw_spec.h"""
from __future__ import absolute_import as _abs
# The Constants
VTA_WGT_WIDTH = 8
VTA_INP_WIDTH = VTA_WGT_WIDTH
VTA_OUT_WIDTH = 32
# Dimensions of the GEMM unit
# (BATCH,BLOCK_IN) x (BLOCK_IN,BLOCK_OUT)
VTA_BATCH = 1
VTA_BLOCK_IN = 16
VTA_BLOCK_OUT = 16
# log-2 On-chip wgt buffer size in Bytes
VTA_LOG_WGT_BUFF_SIZE = 15
# log-2 On-chip input buffer size in Bytes
VTA_LOG_INP_BUFF_SIZE = 15
# log-2 On-chip output buffer size in Bytes
VTA_LOG_OUT_BUFF_SIZE = 17
# On-chip wgt buffer size in Bytes
VTA_WGT_BUFF_SIZE = 1 << VTA_LOG_WGT_BUFF_SIZE
# Input buffer size
VTA_INP_BUFF_SIZE = 1 << VTA_LOG_INP_BUFF_SIZE
# Output buffer size.
VTA_OUT_BUFF_SIZE = 1 << VTA_LOG_OUT_BUFF_SIZE
# Number of bytes per buffer
VTA_INP_ELEM_BYTES = (VTA_BATCH*VTA_BLOCK_IN*VTA_INP_WIDTH//8)
VTA_WGT_ELEM_BYTES = (VTA_BLOCK_OUT*VTA_BLOCK_IN*VTA_WGT_WIDTH//8)
VTA_OUT_ELEM_BYTES = (VTA_BATCH*VTA_BLOCK_OUT*VTA_OUT_WIDTH//8)
# Maximum external buffer size in bytes
VTA_MAX_XFER = 1 << 22
# Number of elements
VTA_INP_BUFF_DEPTH = VTA_INP_BUFF_SIZE//VTA_INP_ELEM_BYTES
VTA_WGT_BUFF_DEPTH = VTA_WGT_BUFF_SIZE//VTA_WGT_ELEM_BYTES
VTA_OUT_BUFF_DEPTH = VTA_OUT_BUFF_SIZE//VTA_OUT_ELEM_BYTES
# Memory id for DMA
VTA_MEM_ID_UOP = 0
VTA_MEM_ID_WGT = 1
VTA_MEM_ID_INP = 2
VTA_MEM_ID_ACC = 3
VTA_MEM_ID_OUT = 4
# VTA ALU Opcodes
VTA_ALU_OPCODE_MIN = 0
VTA_ALU_OPCODE_MAX = 1
VTA_ALU_OPCODE_ADD = 2
VTA_ALU_OPCODE_SUB = 3
VTA_ALU_OPCODE_MUL = 4
VTA_ALU_OPCODE_SHL = 5
VTA_ALU_OPCODE_SHR = 6
VTA_ALU_OPCODE_UNSET = 7
# Task queue id (pipeline stage)
VTA_QID_LOAD_INP = 1
VTA_QID_LOAD_WGT = 1
VTA_QID_LOAD_OUT = 2
VTA_QID_STORE_OUT = 3
VTA_QID_COMPUTE = 2
VTA_QID_STORE_INP = 3
# Debug flags
DEBUG_DUMP_INSN = (1 << 1)
DEBUG_DUMP_UOP = (1 << 2)
DEBUG_SKIP_READ_BARRIER = (1 << 3)
DEBUG_SKIP_WRITE_BARRIER = (1 << 4)
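# Worked example (a sanity check on the constants above):
#   VTA_INP_ELEM_BYTES = 1 * 16 * 8 // 8   = 16 bytes
#   VTA_WGT_ELEM_BYTES = 16 * 16 * 8 // 8  = 256 bytes
#   VTA_OUT_ELEM_BYTES = 1 * 16 * 32 // 8  = 64 bytes
# so the buffer depths come out to
#   VTA_INP_BUFF_DEPTH = 2**15 // 16  = 2048
#   VTA_WGT_BUFF_DEPTH = 2**15 // 256 = 128
#   VTA_OUT_BUFF_DEPTH = 2**17 // 64  = 2048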
"""VTA related intrinsics"""
from __future__ import absolute_import as _abs
import tvm
from . import hw_spec as spec
from .runtime import VTA_AXIS, VTA_PUSH_UOP, get_task_qid
from .runtime import SCOPE_OUT, SCOPE_INP, SCOPE_WGT
# The memory information for the compiler
@tvm.register_func("tvm.info.mem.%s" % SCOPE_INP)
def mem_info_inp_buffer():
    return tvm.make.node("MemoryInfo",
                         unit_bits=spec.VTA_INP_ELEM_BYTES * 8,
                         max_simd_bits=spec.VTA_INP_ELEM_BYTES * 8,
                         max_num_bits=spec.VTA_INP_BUFF_SIZE * 8,
                         head_address=None)

@tvm.register_func("tvm.info.mem.%s" % SCOPE_WGT)
def mem_info_wgt_buffer():
    return tvm.make.node("MemoryInfo",
                         unit_bits=spec.VTA_WGT_ELEM_BYTES * 8,
                         max_simd_bits=spec.VTA_WGT_ELEM_BYTES * 8,
                         max_num_bits=spec.VTA_WGT_BUFF_SIZE * 8,
                         head_address=None)

@tvm.register_func("tvm.info.mem.%s" % SCOPE_OUT)
def mem_info_out_buffer():
    return tvm.make.node("MemoryInfo",
                         unit_bits=spec.VTA_OUT_ELEM_BYTES * 8,
                         max_simd_bits=spec.VTA_OUT_ELEM_BYTES * 8,
                         max_num_bits=spec.VTA_OUT_BUFF_SIZE * 8,
                         head_address=None)
def intrin_gevm(mock=False):
    """Vector-matrix multiply intrinsic"""
    wgt_lanes = spec.VTA_WGT_ELEM_BYTES * 8 // spec.VTA_WGT_WIDTH
    assert wgt_lanes == spec.VTA_BLOCK_OUT * spec.VTA_BLOCK_IN
    wgt_shape = (spec.VTA_BLOCK_OUT, spec.VTA_BLOCK_IN)
    assert wgt_shape[0] * wgt_shape[1] == wgt_lanes
    inp_lanes = spec.VTA_INP_ELEM_BYTES * 8 // spec.VTA_INP_WIDTH
    out_lanes = spec.VTA_OUT_ELEM_BYTES * 8 // spec.VTA_OUT_WIDTH
    wgt = tvm.placeholder((wgt_shape[0], wgt_shape[1]),
                          dtype="int%d" % spec.VTA_WGT_WIDTH,
                          name=SCOPE_WGT)
    inp = tvm.placeholder((wgt_shape[1], ),
                          dtype="int%d" % spec.VTA_INP_WIDTH,
                          name=SCOPE_INP)
    k = tvm.reduce_axis((0, wgt_shape[1]), name="k")
    out_dtype = "int%d" % spec.VTA_OUT_WIDTH
    out = tvm.compute((wgt_shape[0],),
                      lambda i: tvm.sum(inp[k].astype(out_dtype) *
                                        wgt[i, k].astype(out_dtype),
                                        axis=[k]),
                      name="out")
    wgt_layout = tvm.decl_buffer(
        wgt.shape, wgt.dtype, SCOPE_WGT,
        scope=SCOPE_WGT, offset_factor=wgt_lanes, data_alignment=wgt_lanes)
    inp_layout = tvm.decl_buffer(
        inp.shape, inp.dtype, SCOPE_INP,
        scope=SCOPE_INP, offset_factor=inp_lanes, data_alignment=inp_lanes)
    out_layout = tvm.decl_buffer(
        out.shape, out.dtype, SCOPE_OUT,
        scope=SCOPE_OUT, offset_factor=out_lanes, data_alignment=out_lanes)

    def intrin_func(ins, outs):
        """Vector-matrix multiply intrinsic function"""
        dinp, dwgt = ins
        dout = outs[0]

        def instr(index):
            """Generate vector-matrix multiply VTA instruction"""
            irb = tvm.ir_builder.create()
            irb.scope_attr(VTA_AXIS, "coproc_scope", get_task_qid(spec.VTA_QID_COMPUTE))
            irb.scope_attr(VTA_AXIS, "coproc_uop_scope", VTA_PUSH_UOP)
            if index == 0 or index == 2:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 0,
                    dout.access_ptr("rw", "int32"),
                    dinp.access_ptr("r", "int32"),
                    dwgt.access_ptr("r", "int32"),
                    0, 0, 0))
            else:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 1,
                    dout.access_ptr("rw", "int32"),
                    0,
                    0,
                    0, 0, 0))
            return irb.get()
        # return a triple of normal-set, reset, update
        nop = tvm.make.Evaluate(0)
        if mock:
            return (nop, nop, nop)
        return (instr(0), instr(1), instr(2))

    return tvm.decl_tensor_intrin(out.op, intrin_func,
                                  name="GEVM",
                                  binds={inp: inp_layout,
                                         wgt: wgt_layout,
                                         out: out_layout})
def intrin_gemm(mock=False):
    """Matrix-matrix multiply intrinsic"""
    wgt_lanes = spec.VTA_WGT_ELEM_BYTES * 8 // spec.VTA_WGT_WIDTH
    assert wgt_lanes == spec.VTA_BLOCK_OUT * spec.VTA_BLOCK_IN
    wgt_shape = (spec.VTA_BLOCK_OUT, spec.VTA_BLOCK_IN)
    assert wgt_shape[0] * wgt_shape[1] == wgt_lanes
    inp_lanes = spec.VTA_INP_ELEM_BYTES * 8 // spec.VTA_INP_WIDTH
    assert inp_lanes == spec.VTA_BATCH * spec.VTA_BLOCK_IN
    inp_shape = (spec.VTA_BATCH, spec.VTA_BLOCK_IN)
    assert inp_shape[0] * inp_shape[1] == inp_lanes
    out_lanes = spec.VTA_OUT_ELEM_BYTES * 8 // spec.VTA_OUT_WIDTH
    assert out_lanes == spec.VTA_BATCH * spec.VTA_BLOCK_OUT
    out_shape = (spec.VTA_BATCH, spec.VTA_BLOCK_OUT)
    assert out_shape[0] * out_shape[1] == out_lanes
    wgt = tvm.placeholder((wgt_shape[0], wgt_shape[1]),
                          dtype="int%d" % spec.VTA_WGT_WIDTH,
                          name=SCOPE_WGT)
    inp = tvm.placeholder((inp_shape[0], inp_shape[1]),
                          dtype="int%d" % spec.VTA_INP_WIDTH,
                          name=SCOPE_INP)
    k = tvm.reduce_axis((0, wgt_shape[1]), name="k")
    out_dtype = "int%d" % spec.VTA_OUT_WIDTH
    out = tvm.compute((out_shape[0], out_shape[1]),
                      lambda i, j: tvm.sum(inp[i, k].astype(out_dtype) *
                                           wgt[j, k].astype(out_dtype),
                                           axis=[k]),
                      name="out")
    wgt_layout = tvm.decl_buffer(
        wgt.shape, wgt.dtype, SCOPE_WGT,
        scope=SCOPE_WGT, offset_factor=wgt_lanes, data_alignment=wgt_lanes)
    inp_layout = tvm.decl_buffer(
        inp.shape, inp.dtype, SCOPE_INP,
        scope=SCOPE_INP, offset_factor=inp_lanes, data_alignment=inp_lanes)
    out_layout = tvm.decl_buffer(
        out.shape, out.dtype, SCOPE_OUT,
        scope=SCOPE_OUT, offset_factor=out_lanes, data_alignment=out_lanes)

    def intrin_func(ins, outs):
        """Matrix-matrix multiply intrinsic function"""
        dinp, dwgt = ins
        dout = outs[0]

        def instr(index):
            """Generate matrix-matrix multiply VTA instruction"""
            irb = tvm.ir_builder.create()
            irb.scope_attr(VTA_AXIS, "coproc_scope", get_task_qid(spec.VTA_QID_COMPUTE))
            irb.scope_attr(VTA_AXIS, "coproc_uop_scope", VTA_PUSH_UOP)
            if index == 0 or index == 2:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 0,
                    dout.access_ptr("rw", "int32"),
                    dinp.access_ptr("r", "int32"),
                    dwgt.access_ptr("r", "int32"),
                    0, 0, 0))
            else:
                irb.emit(tvm.call_extern(
                    "int32", "VTAUopPush",
                    0, 1,
                    dout.access_ptr("rw", "int32"),
                    0,
                    0,
                    0, 0, 0))
            return irb.get()
        # return a triple of normal-set, reset, update
        nop = tvm.make.Evaluate(0)
        if mock:
            return (nop, nop, nop)
        return (instr(0), instr(1), instr(2))

    return tvm.decl_tensor_intrin(out.op, intrin_func,
                                  name="GEMM",
                                  binds={inp: inp_layout,
                                         wgt: wgt_layout,
                                         out: out_layout})
GEMM = intrin_gemm()
GEVM = intrin_gevm()
"""Mock interface for skip part of compute """
from .intrin import intrin_gevm, intrin_gemm
GEMM = intrin_gemm(True)
GEVM = intrin_gevm(True)
DMA_COPY = "skip_dma_copy"
ALU = "skip_alu"
"""Runtime function related hooks"""
from __future__ import absolute_import as _abs
import tvm
def thread_local_command_buffer():
    """Get thread local command buffer"""
    ctx = tvm.call_extern("handle", "VTATLSCommandHandle")
    return tvm.make.Call(
        "handle", "tvm_thread_context", [ctx], tvm.expr.Call.Intrinsic, None, 0)
CB_HANDLE = thread_local_command_buffer()
VTA_AXIS = tvm.thread_axis("vta")
VTA_PUSH_UOP = tvm.make.StringImm("VTAPushGEMMOp")
SCOPE_INP = "local.inp_buffer"
SCOPE_OUT = "local.out_buffer"
SCOPE_WGT = "local.wgt_buffer"
DMA_COPY = "dma_copy"
ALU = "alu"
DEBUG_NO_SYNC = False
def get_task_qid(qid):
    """Get transformed queue index."""
    return 1 if DEBUG_NO_SYNC else qid

@tvm.register_func("tvm.intrin.rule.default.vta.coproc_sync")
def coproc_sync(op):
    return tvm.call_extern(
        "int32", "VTASynchronize", CB_HANDLE, 1 << 31)

@tvm.register_func("tvm.intrin.rule.default.vta.coproc_dep_push")
def coproc_dep_push(op):
    return tvm.call_extern(
        "int32", "VTADepPush", CB_HANDLE, op.args[0], op.args[1])

@tvm.register_func("tvm.intrin.rule.default.vta.coproc_dep_pop")
def coproc_dep_pop(op):
    return tvm.call_extern(
        "int32", "VTADepPop", CB_HANDLE, op.args[0], op.args[1])
@@ -80,4 +80,4 @@ void xlnkInvalidateCache(void* buf, int size);
#ifdef __cplusplus
}
#endif
#endif // VTA_PYNQ_PYNQ_DRIVER_H_
#endif // VTA_PYNQ_PYNQ_DRIVER_H_
@@ -1043,9 +1043,9 @@ class CommandQueue {
VTAWriteMappedReg(vta_load_handle_, 0x10, 0);
// LOAD @ 0x18 : Data signal of weight_V
VTAWriteMappedReg(vta_load_handle_, 0x18, 0);
// COMPUTE @ 0x10 : Data signal of uops_V
// COMPUTE @ 0x20 : Data signal of uops_V
VTAWriteMappedReg(vta_compute_handle_, 0x20, 0);
// COMPUTE @ 0x18 : Data signal of biases_V
// COMPUTE @ 0x28 : Data signal of biases_V
VTAWriteMappedReg(vta_compute_handle_, 0x28, 0);
// STORE @ 0x10 : Data signal of outputs_V
VTAWriteMappedReg(vta_store_handle_, 0x10, 0);
......
@@ -39,7 +39,7 @@ uint64_t vta(
#else // NO_SIM
#include "../../../hardware/vivado/src/vta.h"
#include "../../../hardware/xilinx/src/vta.h"
#endif // NO_SIM
......
CC ?= g++
CFLAGS = -Wall -O3 -std=c++11 -I/usr/include
LDFLAGS = -L/usr/lib -L/home/xilinx/pynq/drivers
LDFLAGS = -L/usr/lib -L/opt/python3.6/lib/python3.6/site-packages/pynq/lib/
LIBS = -l:libsds_lib.so -l:libdma.so
INCLUDE_DIR = ../../../include
DRIVER_DIR = ../../../src/pynq
......
"""Testing if we can generate code in topi style"""
import topi
import tvm
from tvm.contrib import util, rpc
import vta
from vta import vta_conv2d
import numpy as np
import mxnet as mx
Workload = vta_conv2d.Workload
@tvm.tag_scope(tag=topi.tag.ELEMWISE)
def my_clip(x, a_min, a_max):
    """Unlike topi's current clip, put min and max into two stages."""
    const_min = tvm.const(a_min, x.dtype)
    const_max = tvm.const(a_max, x.dtype)
    x = tvm.compute(x.shape, lambda *i: tvm.min(x(*i), const_max), name="clipA")
    x = tvm.compute(x.shape, lambda *i: tvm.max(x(*i), const_min), name="clipB")
    return x
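# Numerically (a hedged note): the two stages compute
# max(min(x, a_max), a_min), the same result as np.clip(x, a_min, a_max),
# while keeping each stage a single elementwise tvm.compute op.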
host = "pynq"
port = 9091
out_dtype = "int%d" % vta.VTA_OUT_WIDTH
wgt_dtype = "int%d" % vta.VTA_WGT_WIDTH
inp_dtype = "int%d" % vta.VTA_INP_WIDTH
target = "llvm -target=armv7-none-linux-gnueabihf -mattr=+neon"
print_ir = False
def test_vta_conv2d(key, batch_size, wl, profile=True):
    data_shape = (batch_size, wl.in_filter//vta.VTA_BLOCK_IN,
                  wl.height, wl.width, vta.VTA_BLOCK_IN)
    kernel_shape = (wl.out_filter//vta.VTA_BLOCK_OUT, wl.in_filter//vta.VTA_BLOCK_IN,
                    wl.hkernel, wl.wkernel, vta.VTA_BLOCK_OUT, vta.VTA_BLOCK_IN)
    bias_shape = (wl.out_filter//vta.VTA_BLOCK_OUT, 1, 1, vta.VTA_BLOCK_OUT)
    fout_height = (wl.height + 2 * wl.hpad - wl.hkernel) // wl.hstride + 1
    fout_width = (wl.width + 2 * wl.wpad - wl.wkernel) // wl.wstride + 1
    data = tvm.placeholder(data_shape, name="data", dtype=inp_dtype)
    kernel = tvm.placeholder(kernel_shape, name="kernel", dtype=wgt_dtype)
    bias = tvm.placeholder(bias_shape, name="bias", dtype=out_dtype)

    res_conv = vta_conv2d.packed_conv2d(
        data, kernel, padding=(wl.hpad, wl.wpad), strides=(wl.hstride, wl.wstride))
    res = topi.right_shift(res_conv, 8)
    res = topi.broadcast_add(res, bias)
    res = my_clip(res, 0, 127)
    res = topi.cast(res, "int8")
    num_ops = fout_height * fout_width * wl.hkernel * wl.wkernel * wl.out_filter * wl.in_filter

    def verify(s, check_correctness):
        mod = tvm.build(s, [data, kernel, bias, res], "ext_dev", target, name="conv2d")
        temp = util.tempdir()
        remote = rpc.connect(host, port)
        mod.save(temp.relpath("conv2d.o"))
        remote.upload(temp.relpath("conv2d.o"))
        f = remote.load_module("conv2d.o")
        # verify
        ctx = remote.ext_dev(0)
        # Data in original format
        data_orig = (np.random.uniform(
            size=(batch_size, wl.in_filter, wl.height, wl.width)) * 4).astype(data.dtype)
        kernel_orig = (np.random.uniform(
            size=(wl.out_filter, wl.in_filter, wl.hkernel, wl.wkernel)) * 4).astype(kernel.dtype)
        bias_orig = (np.random.uniform(size=(wl.out_filter,)) * 4).astype("int32")
        data_orig = np.abs(data_orig)
        kernel_orig = np.abs(kernel_orig)
        bias_orig = np.abs(bias_orig)
        data_packed = data_orig.reshape(
            batch_size, wl.in_filter//vta.VTA_BLOCK_IN, vta.VTA_BLOCK_IN,
            wl.height, wl.width).transpose((0, 1, 3, 4, 2))
        kernel_packed = kernel_orig.reshape(
            wl.out_filter//vta.VTA_BLOCK_OUT, vta.VTA_BLOCK_OUT,
            wl.in_filter//vta.VTA_BLOCK_IN, vta.VTA_BLOCK_IN,
            wl.hkernel, wl.wkernel).transpose((0, 2, 4, 5, 1, 3))
        bias_packed = bias_orig.reshape(
            wl.out_filter//vta.VTA_BLOCK_OUT, 1, 1, vta.VTA_BLOCK_OUT)
        res_shape = topi.util.get_const_tuple(res.shape)
        res_np = np.zeros(res_shape).astype(res.dtype)
        data_arr = tvm.nd.array(data_packed, ctx)
        kernel_arr = tvm.nd.array(kernel_packed, ctx)
        bias_arr = tvm.nd.array(bias_packed, ctx)
        res_arr = tvm.nd.array(res_np, ctx)
        time_f = f.time_evaluator("conv2d", ctx, number=10)
        cost = time_f(data_arr, kernel_arr, bias_arr, res_arr)
        res_unpack = res_arr.asnumpy().transpose(
            (0, 1, 4, 2, 3)).reshape(batch_size, wl.out_filter, fout_height, fout_width)
        if check_correctness:
            res_ref = mx.nd.Convolution(
                mx.nd.array(data_orig.astype(out_dtype), mx.cpu(0)),
                mx.nd.array(kernel_orig.astype(out_dtype), mx.cpu(0)),
                stride=(wl.hstride, wl.wstride),
                kernel=(wl.hkernel, wl.wkernel),
                num_filter=wl.out_filter,
                no_bias=True,
                pad=(wl.hpad, wl.wpad)).asnumpy().astype(out_dtype)
            res_ref = res_ref >> 8
            res_ref += bias_orig.reshape(wl.out_filter, 1, 1)
            res_ref = np.clip(res_ref, 0, 127).astype("int8")
            np.testing.assert_allclose(res_unpack, res_ref)
            print("Correctness check pass...")
        return cost

    def conv_normal(print_ir):
        print("----- CONV2D End-to-End Test-------")
        with tvm.build_config(add_lower_pass=vta.debug_mode(0)):
            s = vta_conv2d.schedule_packed_conv2d([res])
            if print_ir:
                print(tvm.lower(s, [data, kernel, bias, res], simple_mode=True))
            cost = verify(s, True)
        gops = (num_ops / cost.mean) / float(10 ** 9)
        print("\tTime cost = %g sec/op, %g GFLOPS" % (cost.mean, gops))

    conv_normal(print_ir)
# ResNet18 workloads
resnet = {
    # Workloads of resnet18 on imagenet
    0: Workload(224, 224, 16, 64, 7, 7, 3, 3, 2, 2),
    1: Workload(56, 56, 64, 64, 3, 3, 1, 1, 1, 1),
    2: Workload(56, 56, 64, 64, 1, 1, 0, 0, 1, 1),
    3: Workload(56, 56, 64, 128, 3, 3, 1, 1, 2, 2),
    4: Workload(56, 56, 64, 128, 1, 1, 0, 0, 2, 2),
    5: Workload(28, 28, 128, 128, 3, 3, 1, 1, 1, 1),
    6: Workload(28, 28, 128, 256, 3, 3, 1, 1, 2, 2),
    7: Workload(28, 28, 128, 256, 1, 1, 0, 0, 2, 2),
    8: Workload(14, 14, 256, 256, 3, 3, 1, 1, 1, 1),
    9: Workload(14, 14, 256, 512, 3, 3, 1, 1, 2, 2),
    10: Workload(14, 14, 256, 512, 1, 1, 0, 0, 2, 2),
    11: Workload(7, 7, 512, 512, 3, 3, 1, 1, 1, 1),
}
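# Rough op count for workload 1 above (a sketch using the same formula as
# num_ops in test_vta_conv2d): fout = (56 + 2*1 - 3) // 1 + 1 = 56, so
# num_ops = 56 * 56 * 3 * 3 * 64 * 64 = 115,605,504 multiply-accumulates.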
batch_size = 1
for i in range(0, len(resnet)):
    wl = resnet[i]
    key = "resnet-cfg[%d]" % i
    print("key=%s" % key)
    print(wl)
    test_vta_conv2d(key, batch_size, wl)
import tvm
import vta
import os
from tvm.contrib import rpc, util
host = "pynq"
port = 9091
target = "llvm -target=armv7-none-linux-gnueabihf"
bit = "vta.bit"
curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__)))
bitstream = os.path.join(curr_path, "./", bit)
def test_program_rpc():
    assert tvm.module.enabled("rpc")
    remote = rpc.connect(host, port)
    remote.upload(bitstream, bit)
    fprogram = remote.get_function("tvm.contrib.vta.init")
    fprogram(bit)

test_program_rpc()