[DOC, HARDWARE] Hardware developer guide, migrating to use Vivado 2018.2 (#1473)

Commit e806cd15 authored Jul 22, 2018 by Thierry Moreau Committed by Tianqi Chen Jul 22, 2018
11 changed files
--- a/docs/vta/dev/config.rst
+++ b/docs/vta/dev/config.rst
+VTA Configuration
+=================
+The VTA stack incorporates both a hardware accelerator stack and
+a TVM based software stack.
+VTA incorporates flexibility out of the box: by modifying the
+``vta/config/vta_config.json`` high-level configuration file,
+the user can change the shape of the tensor intrinsic,
+clock frequency, pipelining, data type width, and on-chip buffer sizes.
+Parameters Overview
+-------------------
+We explain the parameters listed in the ``vta_config.json`` file in the table
+below.
+-----------------------+------------+--------------------------------------------------------+
+| Attribute             | Format     | Description                                            |
+=======================+============+========================================================+
+| ``TARGET``            | String     | The TVM device target.                                 |
+-----------------------+------------+--------------------------------------------------------+
+| ``HW_TARGET``         | Int        | FPGA frequency in MHz.                                 |
+-----------------------+------------+--------------------------------------------------------+
+| ``HW_CLK_TARGET``     | Int        | FPGA clock period in ns target for HLS tool.           |
+-----------------------+------------+--------------------------------------------------------+
+| ``HW_VER``            | String     | VTA hardware version number.                           |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_INP_WIDTH``     | Int (log2) | Input data type signed integer width.                  |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_WGT_WIDTH``     | Int (log2) | Weight data type signed integer width.                 |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_ACC_WIDTH``     | Int (log2) | Accumulator data type signed integer width.            |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_OUT_WIDTH``     | Int (log2) | Output data type signed integer width.                 |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_BATCH``         | Int (log2) | VTA matrix multiply intrinsic output dimension 0.      |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_BLOCK_IN``      | Int (log2) | VTA matrix multiply reduction dimension.               |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_BLOCK_OUT``     | Int (log2) | VTA matrix multiply intrinsic output dimension 1.      |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_UOP_BUFF_SIZE`` | Int (log2) | Micro-op on-chip buffer in Bytes.                      |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_INP_BUFF_SIZE`` | Int (log2) | Input on-chip buffer in Bytes.                         |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_WGT_BUFF_SIZE`` | Int (log2) | Weight on-chip buffer in Bytes.                        |
+-----------------------+------------+--------------------------------------------------------+
+| ``LOG_ACC_BUFF_SIZE`` | Int (log2) | Accumulator on-chip buffer in Bytes.                   |
+-----------------------+------------+--------------------------------------------------------+
+ .. note::
+    When a parameter name is prefixed with ``LOG``, it describes a value that must be a power of two.
+    For that reason we specify these parameters by their log2 value.
+    For instance, to describe an integer width of 8 bits for the input data type, we set ``LOG_INP_WIDTH`` to 3, which is the log2 of 8.
+    Similarly, to describe a 64 kB micro-op buffer, we would set ``LOG_UOP_BUFF_SIZE`` to 16.
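As a minimal sketch of the log2 convention in the note, the snippet below decodes two ``LOG_``-prefixed values with a left shift. The dictionary holds only the two example values from the note, and the `decode_log_params` helper is hypothetical; it is not part of the VTA code base.

```python
# Hypothetical helper: not part of VTA, shown only to illustrate the log2 convention.
def decode_log_params(config):
    """Return the real value for every key that starts with ``LOG_``."""
    decoded = {}
    for key, value in config.items():
        if key.startswith("LOG_"):
            # 2 ** value, computed with a shift because value is an integer
            decoded[key[len("LOG_"):]] = 1 << value
    return decoded


# Example values taken from the note (assumed, not a full vta_config.json).
example_config = {"LOG_INP_WIDTH": 3, "LOG_UOP_BUFF_SIZE": 16}
decoded = decode_log_params(example_config)
print(decoded["INP_WIDTH"])               # 8 (bits)
print(decoded["UOP_BUFF_SIZE"] // 1024)   # 64 (kB)
```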
+We provide additional detail below regarding each parameter:
+ - ``TARGET``: Can be set to ``"pynq"`` or ``"sim"``.
+ - ``HW_TARGET``: In pynq mode, can be set to ``100``, ``142``, ``167``, or ``200`` MHz.
+ - ``HW_CLK_TARGET``: The lower the target, the more pipeline stages HLS will insert to achieve timing closure during place and route (this can also slightly decrease performance).
+ - ``HW_VER``: Hardware version, which increments every time the VTA hardware design changes. This parameter is used to uniquely identify hardware bitstreams.
+ - ``LOG_OUT_WIDTH``: We recommend matching ``LOG_OUT_WIDTH`` to ``LOG_INP_WIDTH``.
+ - ``LOG_BATCH``: Equivalent to A in multiplication of shape (A, B) x (B, C), or typically, the batch dimension.
+ - ``LOG_BLOCK_IN``: Equivalent to B in multiplication of shape (A, B) x (B, C), or typically, the input channel dimension.
+ - ``LOG_BLOCK_OUT``: Equivalent to C in multiplication of shape (A, B) x (B, C), or typically, the output channel dimension.
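To make the (A, B) x (B, C) shapes in the list above concrete, here is a small pure-Python sketch. The three log2 values are illustrative assumptions rather than values read from ``vta_config.json``, and `matmul` is a plain reference loop, not the VTA intrinsic.

```python
# Illustrative values only; the real ones come from vta_config.json.
LOG_BATCH, LOG_BLOCK_IN, LOG_BLOCK_OUT = 0, 4, 4
A, B, C = 1 << LOG_BATCH, 1 << LOG_BLOCK_IN, 1 << LOG_BLOCK_OUT


def matmul(x, w):
    """Multiply an (A, B) matrix by a (B, C) matrix using plain lists."""
    rows, inner, cols = len(x), len(w), len(w[0])
    assert all(len(row) == inner for row in x), "reduction dimensions must match"
    return [[sum(x[i][k] * w[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


x = [[1] * B for _ in range(A)]   # (A, B): batch by input channels
w = [[2] * C for _ in range(B)]   # (B, C): input channels by output channels
out = matmul(x, w)
print(len(out), len(out[0]))      # 1 16, i.e. an (A, C) result
print(out[0][0])                  # 32, the sum over B = 16 products of 1 * 2
```

The reduction dimension B disappears from the result, which is why only ``LOG_BATCH`` and ``LOG_BLOCK_OUT`` are described as output dimensions.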
--- a/docs/vta/dev/hardware.rst
+++ b/docs/vta/dev/hardware.rst
--- a/docs/vta/dev/index.rst
+++ b/docs/vta/dev/index.rst
+VTA Design and Developer Guide
+==============================
+This developer guide details the complete VTA-TVM hardware-software stack.
+.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_stack.png
+   :align: center
+   :width: 60%
+.. toctree::
+   :maxdepth: 2
+   config
+   hardware
\ No newline at end of file
--- a/docs/vta/index.rst
+++ b/docs/vta/index.rst
 VTA: Deep Learning Accelerator Stack
 ====================================
-Specialized accelerators are key enablers of future deep learning workloads. TVM stack targets specialized accelerators.
-VTA(versatile tensor accelerator) is a generic, modular open-source deep learning accelerator.
+The Versatile Tensor Accelerator (VTA) is an open, generic, and customizable deep learning accelerator with a complete TVM-based compiler stack. We designed VTA to expose the most salient and common characteristics of mainstream deep learning accelerators. Together TVM and VTA form an end-to-end hardware-software deep learning system stack that includes hardware design, drivers, a JIT runtime, and an optimizing compiler stack based on TVM.
+.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_overview.png
+   :align: center
+   :width: 60%
+VTA has the following key features:
+- Generic, modular, open-source hardware.
+- Streamlined workflow to deploy to FPGAs.
+- Simulator support to prototype compilation passes on regular workstations.
+- Pynq-based driver and JIT runtime for both simulated and FPGA hardware back-end.
+- End to end TVM stack integration.
 This page contains links to all the resources related to VTA:
 .. toctree::
   :maxdepth: 1
   install
+   dev/index
   tutorials/index
-Features
+Literature
--------
+----------
-VTA have the following key features:
- Generic, modular open-source hardware
+- Read the VTA `release blog post`_.
- Streamlined workflow to deploy to FPGAs.
+- Read the VTA tech report: `An Open Hardware Software Stack for Deep Learning`_.
- Simulator support to protoype compilation passes on regular workstations.
- Driver and JIT runtime for both simulated and FPGA hardware backend.
+.. _release blog post: https://tvm.ai/2018/07/12/vta-release-announcement.html
- End to end TVM stack integration
+.. _An Open Hardware Software Stack for Deep Learning: https://arxiv.org/abs/1807.04188
\ No newline at end of file
--- a/docs/vta/install.md
+++ b/docs/vta/install.md
--- a/vta/hardware/xilinx/scripts/vivado.tcl
+++ b/vta/hardware/xilinx/scripts/vivado.tcl
@@ -6,7 +6,7 @@
 #
 # Check if script is running in correct Vivado version.
-set scripts_vivado_version 2017.1
+set scripts_vivado_version 2018.2
 set current_vivado_version [version -short]
 if { [string first $scripts_vivado_version $current_vivado_version] == -1 } {
@@ -53,7 +53,8 @@ if { [llength $argv] eq 12 } {
  }
 } else {
  puts "Arg list incomplete: <path to ip dir> <num threads> <clock freq> \
-    <inp width> <wgt_width> <out_width> <batch> <in_block / 1024> <out_block>"
+    <inp width> <wgt_width> <out_width> <batch> <out_block> <in_block> \
+    <inp_mem_size> <wgt_mem_size> <out_mem_size>"
  return 1
 }
@@ -66,6 +67,7 @@ if {[expr $inp_part == 0]} {
  set inp_bus_width $inp_mem_width
 }
 set inp_mem_depth [expr $inp_mem_size * 8 / ($inp_mem_width * $inp_part)]
 # Derive weight mem parameters
 set wgt_mem_width [expr $wgt_width * $out_block * $in_block]
 set wgt_bus_width 1024
@@ -75,6 +77,7 @@ if {[expr $wgt_part == 0]} {
  set wgt_bus_width $wgt_mem_width
 }
 set wgt_mem_depth [expr $wgt_mem_size * 8 / ($wgt_mem_width * $wgt_part)]
 # Derive output mem parameters
 set out_mem_width [expr $out_width * $batch * $out_block]
 set out_bus_width 1024
@@ -252,7 +255,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
  ] $fetch_0
  # Create instance: g2l_queue, and set properties
-  set g2l_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 g2l_queue ]
+  set g2l_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 g2l_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -273,7 +276,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
  ] $g2l_queue
  # Create instance: g2s_queue, and set properties
-  set g2s_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 g2s_queue ]
+  set g2s_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 g2s_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -294,7 +297,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
  ] $g2s_queue
  # Create instance: gemm_queue, and set properties
-  set gemm_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 gemm_queue ]
+  set gemm_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 gemm_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {510} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -318,7 +321,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
  ] $gemm_queue
  # Create instance: l2g_queue, and set properties
-  set l2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 l2g_queue ]
+  set l2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 l2g_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -345,7 +348,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
  ] $load_0
  # Create instance: load_queue, and set properties
-  set load_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 load_queue ]
+  set load_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 load_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {510} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -406,7 +409,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
  ] $processing_system7_1
  # Create instance: s2g_queue, and set properties
-  set s2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 s2g_queue ]
+  set s2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 s2g_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -433,7 +436,7 @@ CONFIG.C_M_AXI_DATA_PORT_CACHE_VALUE {"1111"} \
  ] $store_0
  # Create instance: store_queue, and set properties
-  set store_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 store_queue ]
+  set store_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 store_queue ]
  set_property -dict [ list \
    CONFIG.Empty_Threshold_Assert_Value_axis {510} \
    CONFIG.Empty_Threshold_Assert_Value_rach {14} \
@@ -466,7 +469,7 @@ CONFIG.NUM_PORTS {5} \
  if {${inp_part} > 1} {
    for {set i 0} {$i < ${inp_part}} {incr i} {
      # Create instance: inp_mem, and set properties
-      set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 inp_mem_${i} ]
+      set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 inp_mem_${i} ]
      set_property -dict [ list \
        CONFIG.Byte_Size {8} \
        CONFIG.Enable_32bit_Address {true} \
@@ -494,7 +497,7 @@ CONFIG.NUM_PORTS {5} \
    }
  } else {
      # Create instance: inp_mem, and set properties
-      set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 inp_mem ]
+      set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 inp_mem ]
      set_property -dict [ list \
        CONFIG.Byte_Size {8} \
        CONFIG.Enable_32bit_Address {true} \
@@ -525,7 +528,7 @@ CONFIG.NUM_PORTS {5} \
  if {${wgt_part} > 1} {
    for {set i 0} {$i < ${wgt_part}} {incr i} {
      # Create instance: wgt_mem, and set properties
-      set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 wgt_mem_${i} ]
+      set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 wgt_mem_${i} ]
      set_property -dict [ list \
        CONFIG.Assume_Synchronous_Clk {true} \
        CONFIG.Byte_Size {8} \
@@ -553,7 +556,7 @@ CONFIG.NUM_PORTS {5} \
    }
  } else {
      # Create instance: wgt_mem, and set properties
-      set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 wgt_mem ]
+      set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 wgt_mem ]
      set_property -dict [ list \
        CONFIG.Assume_Synchronous_Clk {true} \
        CONFIG.Byte_Size {8} \
@@ -584,7 +587,7 @@ CONFIG.NUM_PORTS {5} \
  if {${out_part} > 1} {
    for {set i 0} {$i < ${out_part}} {incr i} {
      # Create instance: out_mem, and set properties
-      set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 out_mem_${i} ]
+      set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 out_mem_${i} ]
      set_property -dict [ list \
        CONFIG.Byte_Size {8} \
        CONFIG.Enable_32bit_Address {true} \
@@ -612,7 +615,7 @@ CONFIG.NUM_PORTS {5} \
    }
  } else {
      # Create instance: out_mem, and set properties
-      set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 out_mem ]
+      set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 out_mem ]
      set_property -dict [ list \
        CONFIG.Byte_Size {8} \
        CONFIG.Enable_32bit_Address {true} \

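The weight memory arithmetic that appears in the `vivado.tcl` hunks above can be restated in Python as a rough sketch. The `weight_mem_geometry` function name is hypothetical, and the partition count is passed in as a parameter because the line that derives `wgt_part` is not shown in the diff. The example numbers are assumptions chosen for easy arithmetic.

```python
def weight_mem_geometry(wgt_width, out_block, in_block, wgt_mem_size, wgt_part):
    """Mirror the two Tcl expressions for the weight memory.

    wgt_width is in bits, wgt_mem_size is in bytes, and wgt_part is the
    number of memory partitions (supplied by the caller in this sketch).
    """
    # set wgt_mem_width [expr $wgt_width * $out_block * $in_block]
    wgt_mem_width = wgt_width * out_block * in_block
    # set wgt_mem_depth [expr $wgt_mem_size * 8 / ($wgt_mem_width * $wgt_part)]
    wgt_mem_depth = wgt_mem_size * 8 // (wgt_mem_width * wgt_part)
    return wgt_mem_width, wgt_mem_depth


# Assumed example: 8-bit weights, a 16 x 16 block, a 256 kB buffer, 2 partitions.
width, depth = weight_mem_geometry(8, 16, 16, 1 << 18, 2)
print(width, depth)  # 2048 512
```

Integer division (`//`) is used to match Tcl's `expr`, which divides integers without producing a fractional part.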
--- a/vta/tutorials/convolution_opt.py
+++ b/vta/tutorials/convolution_opt.py
@@ -30,7 +30,7 @@ from tvm import rpc
 from tvm.contrib import util
 from vta.testing import simulator
-# Load VTA parameters from the config.json file
+# Load VTA parameters from the vta/config/vta_config.json file
 env = vta.get_env()
 # We read the Pynq RPC host IP address and port number from the OS environment
@@ -38,7 +38,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
 port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
 # We configure both the bitstream and the runtime system on the Pynq
-# to match the VTA configuration specified by the config.json file.
+# to match the VTA configuration specified by the vta_config.json file.
 if env.TARGET == "pynq":
    # Make sure that TVM was compiled with RPC=1

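The tutorials read the Pynq RPC endpoint from the environment with fallback defaults. The sketch below wraps the same two lookups in a hypothetical `read_rpc_endpoint` helper that accepts any mapping, so the fallback behaviour can be checked without touching the real environment.

```python
import os


def read_rpc_endpoint(environ=os.environ):
    """Return (host, port) using the same defaults as the tutorials."""
    host = environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
    port = int(environ.get("VTA_PYNQ_RPC_PORT", "9091"))
    return host, port


# With nothing set, the defaults from the tutorials apply.
print(read_rpc_endpoint({}))  # ('192.168.2.99', 9091)

# Overrides arrive as strings, so the port is converted to an int.
custom = {"VTA_PYNQ_RPC_HOST": "10.0.0.5", "VTA_PYNQ_RPC_PORT": "9092"}
print(read_rpc_endpoint(custom))  # ('10.0.0.5', 9092)
```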
--- a/vta/tutorials/matrix_multiply.py
+++ b/vta/tutorials/matrix_multiply.py
@@ -26,7 +26,7 @@ from tvm import rpc
 from tvm.contrib import util
 from vta.testing import simulator
-# Load VTA parameters from the config.json file
+# Load VTA parameters from the vta/config/vta_config.json file
 env = vta.get_env()
 # We read the Pynq RPC host IP address and port number from the OS environment
@@ -34,7 +34,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
 port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
 # We configure both the bitstream and the runtime system on the Pynq
-# to match the VTA configuration specified by the config.json file.
+# to match the VTA configuration specified by the vta_config.json file.
 if env.TARGET == "pynq":
    # Make sure that TVM was compiled with RPC=1
@@ -95,7 +95,7 @@ elif env.TARGET == "sim":
 #        :width: 480px
 #
 #   The dimensions of that matrix-matrix multiplication are specified in
-#   the :code:`config.json` configuration file.
+#   the :code:`vta_config.json` configuration file.
 #   The activation matrix has a :code:`(BATCH, BLOCK_IN)` shape
 #   and the transposed weight matrix has a :code:`(BLOCK_OUT, BLOCK_IN)` shape,
 #   thus inferring that the resulting output matrix has a
@@ -131,7 +131,7 @@ elif env.TARGET == "sim":
 #   dimension of VTA's tensor core, but also to match the specific data types
 #   expected by VTA.
 #   VTA for now only supports fixed point data types, whose integer width is
-#   specified in the :code:`config.json` file by :code:`INP_WIDTH` and
+#   specified in the :code:`vta_config.json` file by :code:`INP_WIDTH` and
 #   :code:`WGT_WIDTH` for the activations and weights data types respectively.
 #   In addition, the accumulator data type integer width is specified by
 #   :code:`ACC_WIDTH`.
@@ -284,7 +284,7 @@ print(tvm.lower(s, [A, B, C], simple_mode=True))
 #      that stores input matrices of shape :code:`(env.BATCH, env.BLOCK_IN)`
 #      of type :code:`env.inp_dtype`. The input buffer contains
 #      `2 ^ LOG_INP_BUFF_SIZE` matrix elements (as specified in the
-#      :code:`config.json` file).
+#      :code:`vta_config.json` file).
 #    - :code:`env.wgt_scope`: Weight buffer, which is a read-only SRAM buffer
 #      that stores weight matrices of shape :code:`(env.BLOCK_OUT, env.BLOCK_IN)`
 #      of type :code:`env.wgt_dtype`. The weight buffer contains

--- a/vta/tutorials/matrix_multiply_opt.py
+++ b/vta/tutorials/matrix_multiply_opt.py
@@ -29,7 +29,7 @@ from tvm import rpc
 from tvm.contrib import util
 from vta.testing import simulator
-# Load VTA parameters from the config.json file
+# Load VTA parameters from the vta/config/vta_config.json file
 env = vta.get_env()
 # We read the Pynq RPC host IP address and port number from the OS environment
@@ -37,7 +37,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
 port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
 # We configure both the bitstream and the runtime system on the Pynq
-# to match the VTA configuration specified by the config.json file.
+# to match the VTA configuration specified by the vta_config.json file.
 if env.TARGET == "pynq":
    # Make sure that TVM was compiled with RPC=1

--- a/vta/tutorials/resnet.py
+++ b/vta/tutorials/resnet.py
@@ -38,7 +38,7 @@ from io import BytesIO
 from matplotlib import pyplot as plt
 from PIL import Image
-# Load VTA parameters from the config.json file
+# Load VTA parameters from the vta/config/vta_config.json file
 env = vta.get_env()
 # Helper to crop an image to a square (224, 224)
@@ -180,7 +180,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
 port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
 # We configure both the bitstream and the runtime system on the Pynq
-# to match the VTA configuration specified by the config.json file.
+# to match the VTA configuration specified by the vta_config.json file.
 if env.TARGET == "pynq":
    # Make sure that TVM was compiled with RPC=1

--- a/vta/tutorials/vta_get_started.py
+++ b/vta/tutorials/vta_get_started.py
@@ -29,12 +29,12 @@ import numpy as np
 # VTA is a modular and customizable design. Consequently, the user
 # is free to modify high-level hardware parameters that affect
 # the hardware design layout.
-# These parameters are specified in the :code:`config.json` file by their
+# These parameters are specified in the :code:`vta_config.json` file by their
 # :code:`log2` values.
 # These VTA parameters can be loaded with the :code:`vta.get_env`
 # function.
 #
-# Finally, the TVM target is specified in the :code:`config.json` file.
+# Finally, the TVM target is also specified in the :code:`vta_config.json` file.
 # When set to *sim*, execution will take place inside of a behavioral
 # VTA simulator.
 # If you want to run this tutorial on the Pynq FPGA development platform,
@@ -58,7 +58,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
 port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
 # We configure both the bitstream and the runtime system on the Pynq
-# to match the VTA configuration specified by the config.json file.
+# to match the VTA configuration specified by the vta_config.json file.
 if env.TARGET == "pynq":
    # Make sure that TVM was compiled with RPC=1
@@ -110,11 +110,11 @@ elif env.TARGET == "sim":
 # For VTA's general purpose operations such as vector adds, the tile size is
 # :code:`(env.BATCH, env.BLOCK_OUT)`.
 # The dimensions are specified in
-# the :code:`config.json` configuration file and are set by default to
+# the :code:`vta_config.json` configuration file and are set by default to
 # a (1, 16) vector.
 #
 # In addition, A and B's data types also need to match the :code:`env.acc_dtype`
-# which is set by the :code:`config.json` file to be a 32-bit integer.
+# which is set by the :code:`vta_config.json` file to be a 32-bit integer.
 # Output channel factor m - total 64 x 16 = 1024 output channels
 m = 64
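The channel arithmetic in the comment above (64 x 16 = 1024) can be checked with a short sketch. `BATCH` and `BLOCK_OUT` are hard-coded here as assumptions matching the default (1, 16) tile mentioned in the tutorial; in the tutorial itself they come from `env`.

```python
# Assumed values matching the default (1, 16) tile; the tutorial reads them from env.
BATCH, BLOCK_OUT = 1, 16
m = 64  # output channel factor

# Flat list of channel indices, split into m tiles of BLOCK_OUT channels each.
channels = list(range(m * BLOCK_OUT))
tiles = [channels[i * BLOCK_OUT:(i + 1) * BLOCK_OUT] for i in range(m)]

print(len(channels))               # 1024
print(len(tiles), len(tiles[0]))   # 64 16
```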