Commit e806cd15 by Thierry Moreau Committed by Tianqi Chen

[DOC, HARDWARE] Hardware developer guide, migrating to use Vivado 2018.2 (#1473)

parent efe2f6a2
VTA Configuration
=================
The VTA stack incorporates both a hardware accelerator stack and
a TVM based software stack.
VTA incorporates flexibility out of the box: by modifying the
``vta/config/vta_config.json`` high-level configuration file,
the user can change the shape of the tensor intrinsic,
clock frequency, pipelining, data type width, and on-chip buffer sizes.
Parameters Overview
-------------------
We explain the parameters listed in the ``vta_config.json`` file in the table
below.
+-----------------------+------------+--------------------------------------------------------+
| Attribute             | Format     | Description                                            |
+=======================+============+========================================================+
| ``TARGET``            | String     | The TVM device target.                                 |
+-----------------------+------------+--------------------------------------------------------+
| ``HW_FREQ``           | Int        | FPGA frequency in MHz.                                 |
+-----------------------+------------+--------------------------------------------------------+
| ``HW_CLK_TARGET``     | Int        | FPGA clock period in ns target for HLS tool.           |
+-----------------------+------------+--------------------------------------------------------+
| ``HW_VER``            | String     | VTA hardware version number.                           |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_INP_WIDTH``     | Int (log2) | Input data type signed integer width.                  |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_WGT_WIDTH``     | Int (log2) | Weight data type signed integer width.                 |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_ACC_WIDTH``     | Int (log2) | Accumulator data type signed integer width.            |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_OUT_WIDTH``     | Int (log2) | Output data type signed integer width.                 |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_BATCH``         | Int (log2) | VTA matrix multiply intrinsic output dimension 0.      |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_BLOCK_IN``      | Int (log2) | VTA matrix multiply reduction dimension.               |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_BLOCK_OUT``     | Int (log2) | VTA matrix multiply intrinsic output dimension 1.      |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_UOP_BUFF_SIZE`` | Int (log2) | Micro-op on-chip buffer in Bytes.                      |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_INP_BUFF_SIZE`` | Int (log2) | Input on-chip buffer in Bytes.                         |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_WGT_BUFF_SIZE`` | Int (log2) | Weight on-chip buffer in Bytes.                        |
+-----------------------+------------+--------------------------------------------------------+
| ``LOG_ACC_BUFF_SIZE`` | Int (log2) | Accumulator on-chip buffer in Bytes.                   |
+-----------------------+------------+--------------------------------------------------------+
.. note::
When a parameter name is preceded with ``LOG``, it means that it describes a value that can only be expressed as a power of two.
For that reason we describe these parameters by their log2 value.
For instance, to describe an integer width of 8-bits for the input data types, we set the ``LOG_INP_WIDTH`` to be 3, which is the log2 of 8.
Similarly, to describe a 64kB micro-op buffer, we would set ``LOG_UOP_BUFF_SIZE`` to be 16.
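To make the ``LOG`` convention concrete, the short Python sketch below expands the log2-encoded entries of a ``vta_config.json``-style dictionary into concrete bit widths and buffer sizes (the derived variable names are purely illustrative):

.. code-block:: python

   import json

   def expand_config(path="vta/config/vta_config.json"):
       """Expand the LOG_* entries of the VTA configuration into concrete values."""
       with open(path) as f:
           cfg = json.load(f)
       return {
           "inp_width": 1 << cfg["LOG_INP_WIDTH"],       # e.g. LOG_INP_WIDTH = 3 -> 8-bit inputs
           "wgt_width": 1 << cfg["LOG_WGT_WIDTH"],
           "acc_width": 1 << cfg["LOG_ACC_WIDTH"],       # e.g. LOG_ACC_WIDTH = 5 -> 32-bit accumulator
           "batch":     1 << cfg["LOG_BATCH"],           # output dimension 0 of the GEMM intrinsic
           "block_in":  1 << cfg["LOG_BLOCK_IN"],        # reduction dimension of the GEMM intrinsic
           "block_out": 1 << cfg["LOG_BLOCK_OUT"],       # output dimension 1 of the GEMM intrinsic
           "uop_buff_bytes": 1 << cfg["LOG_UOP_BUFF_SIZE"],  # e.g. 16 -> 64 kB micro-op buffer
       }
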
We provide additional detail below regarding each parameter:
- ``TARGET``: Can be set to ``"pynq"`` or ``"sim"``.
- ``HW_FREQ``: In pynq mode, can be set to ``100``, ``142``, ``167``, or ``200`` MHz.
- ``HW_CLK_TARGET``: The lower the target, the more pipeline stages HLS will insert to achieve timing closure during place and route (this can also slightly decrease performance).
- ``HW_VER``: Hardware version which increments every time the VTA hardware design changes. This parameter is used to uniquely identify hardware bitstreams.
- ``LOG_OUT_WIDTH``: We recommend matching ``LOG_OUT_WIDTH`` to ``LOG_INP_WIDTH``.
- ``LOG_BATCH``: Equivalent to A in multiplication of shape (A, B) x (B, C), or typically, the batch dimension.
- ``LOG_BLOCK_IN``: Equivalent to B in multiplication of shape (A, B) x (B, C), or typically, the input channel dimension.
- ``LOG_BLOCK_OUT``: Equivalent to C in multiplication of shape (A, B) x (B, C), or typically, the output channel dimension.
VTA Hardware Guide
==================
We present a top-down overview of the VTA hardware design.
This hardware design guide covers VTA hardware at two levels:
- An architectural overview of the VTA design and its ISA hardware-software
interface.
- A micro-architectural overview of the VTA hardware modules, and the
micro-code specification for the compute core.
VTA Overview
------------
VTA is a generic deep learning accelerator built for fast and efficient dense linear algebra.
VTA incorporates a simple RISC-like processor that can perform dense linear algebra operations on rank 1 or 2 tensor registers.
In addition the design adopts decoupled access-execute to hide memory access latency.
To a broader extent, VTA can serve as a template deep learning accelerator design for full stack optimization, exposing a generic tensor computation interface to the compiler stack.
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_overview.png
:align: center
:width: 80%
The figure above gives a high-level overview of the VTA hardware organization.
VTA is composed of four modules that communicate among each other via FIFO queues and local memory blocks (SRAM), to enable task-level pipeline parallelism:
- The fetch module takes care of loading an instruction stream from DRAM. It also decodes those instructions to route them into one of three command queues.
- The load module takes care of loading input and weight tensors from DRAM into data-specialized on-chip memories.
- The compute module performs both dense linear algebra computation with its GEMM core, and general computation with its tensor ALU. It also takes care of loading data from DRAM into the register file, and loading micro-op kernels into the micro-op cache.
- The store module stores results produced by the compute core back to DRAM.
HLS Hardware Source Organization
--------------------------------
The VTA design is currently specified in Vivado HLS C++, which is only supported
by Xilinx toolchains.
The VTA hardware sources are contained under ``vta/hardware/xilinx/sources``:
- ``vta.cc`` contains the definitions for each VTA module, as well as a
behavioral model of the top-level VTA design.
- ``vta.h`` contains type definitions using Xilinx ``ap_int`` types, and
function prototype declarations.
In addition, preprocessor macros are defined in ``vta/include/vta/hw_spec.h``.
Many of these macro definitions are derived from the parameters listed in the
``vta/config/vta_config.json`` file.
The json file is processed by ``vta/config/vta_config.py`` to produce a string of
compile flags that define the preprocessor macros.
That string is used by the makefile in order to set those high-level
parameters in both the HLS hardware synthesis compiler, and the C++
compiler that builds the VTA runtime.
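Conceptually, the conversion from the json entries to compile flags boils down to the sketch below; the exact flag and macro names are owned by ``vta_config.py`` itself, and the ``VTA_`` prefix shown here is only illustrative:

.. code-block:: python

   import json

   # Conceptual sketch: turn the json configuration into preprocessor defines.
   with open("vta/config/vta_config.json") as f:
       cfg = json.load(f)
   cflags = " ".join("-DVTA_%s=%s" % (key, value)
                     for key, value in cfg.items()
                     if key.startswith("LOG_"))
   print(cflags)
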
HLS Module Example
~~~~~~~~~~~~~~~~~~
We show a definition of one of the VTA modules defined in C++:
.. code-block:: c

   void fetch(
     uint32_t insn_count,
     volatile insn_T *insns,
     hls::stream<insn_T> &load_queue,
     hls::stream<insn_T> &gemm_queue,
     hls::stream<insn_T> &store_queue) {
   #pragma HLS INTERFACE s_axilite port = insn_count bundle = CONTROL_BUS
   #pragma HLS INTERFACE m_axi port = insns offset = slave bundle = ins_port
   #pragma HLS INTERFACE axis port = load_queue
   #pragma HLS INTERFACE axis port = gemm_queue
   #pragma HLS INTERFACE axis port = store_queue
   #pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS

     INSN_DECODE: for (int pc = 0; pc < insn_count; pc++) {
   #pragma HLS PIPELINE II = 1
       // Read instruction fields
       insn_T insn = insns[pc];
       // Do some partial decoding
       opcode_T opcode = insn.range(VTA_INSN_MEM_0_1, VTA_INSN_MEM_0_0);
       memop_id_T memory_type = insn.range(VTA_INSN_MEM_5_1, VTA_INSN_MEM_5_0);
       // Push to appropriate instruction queue
       if (opcode == VTA_OPCODE_STORE) {
         store_queue.write(insn);
       } else if (opcode == VTA_OPCODE_LOAD &&
                  (memory_type == VTA_MEM_ID_INP || memory_type == VTA_MEM_ID_WGT)) {
         load_queue.write(insn);
       } else {
         gemm_queue.write(insn);
       }
     }
   }
A few observations on HLS coding:
- *Parameters:* The parameter list of each function, combined with the
interface pragmas, defines the hardware interface exposed by the
generated hardware module.
- Parameters passed by value indicate a read-only hardware memory-mapped
register that the host can write to.
This fetch function for instance has an ``insn_count`` parameter
which will be synthesized as a memory mapped register for the host
to write to, in order to set the length of a given VTA instruction
sequence.
- Pointer parameters can mean one of two things depending on the interface
pragma being used.
- When used with a ``m_axi`` interface pragma, an AXI master interface
gets generated to provide DMA access to DRAM.
- When used with a ``bram`` interface pragma, a BRAM interface gets
generated to expose read and/or write ports to an FPGA block-RAM.
- HLS streams being passed by reference combined with the ``axis`` interface
pragma produce FIFO interfaces to the module. Hardware FIFOs provide a
useful synchronization mechanism between modules.
- *Pragmas*: Compiler pragmas are essential to define hardware implementation
of each module. We list several pragmas used in the VTA design to communicate
implementation requirements to the compiler.
- ``HLS INTERFACE``: specifies the interface of the synthesized
hardware module.
- ``HLS PIPELINE``: defines hardware pipeline performance target by setting
an initiation interval goal. When the ``II == 1`` target is set, it tells
the compiler that the synthesized hardware pipeline should be able to
execute one loop iteration per cycle.
- ``HLS DEPENDENCE``: instructs the compiler to ignore certain types
of dependence checks in a given loop. Consider a loop body that writes
and reads to the same BRAM structure, and needs to achieve an II of 1.
The HLS compiler has to assume the worst-case scenario, whereby a read is
issued to an address that a past write updated the cycle prior: this
cannot be achieved given BRAM timing characteristics (it takes at least
2 cycles to see the updated value). Therefore, in order to achieve an II of 1,
the dependence checks have to be relaxed.
Note that when this optimization is turned on, it falls to
the software stack to prevent writes followed by reads to the same address.
.. note::
This `reference guide <https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2/ug902-vivado-high-level-synthesis.pdf>`_
provides a much more in-depth and complete specification of HLS for the Xilinx 2018.2 toolchains.
Architectural Overview
----------------------
Instruction Set Architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VTA's instruction set architecture (ISA) is composed of 4 CISC instructions that have a variable execution latency, two of which execute a micro-coded instruction sequence to perform computation.
The VTA instructions are listed below:
- ``LOAD`` instruction: loads a 2D tensor from DRAM into the input buffer, weight buffer, or register file. It can also load a micro-kernel into the micro-op cache. Supports dynamic padding when loading input and weight tiles.
- ``GEMM`` instruction: performs a micro-op sequence of matrix-matrix multiplications over an input tensor and a weight tensor, and adds the result to a register-file tensor.
- ``ALU`` instruction: performs a micro-op sequence of matrix-matrix ALU operations over register-file tensor data.
- ``STORE`` instruction: stores a 2D tensor from the output buffer to DRAM.
The ``LOAD`` instructions are executed by the load and compute modules depending on the destination on-chip memory buffer.
The ``GEMM`` and ``ALU`` instructions are executed by the compute module's GEMM core and tensor ALU.
Finally, the ``STORE`` instructions are executed by the store module exclusively.
The fields of each instruction are described in the figure below.
The meaning of each field will be further explained in the :ref:`vta-uarch` section.
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/developer/vta_instructions.png
:align: center
:width: 100%
.. note::
Note that the VTA ISA changes as VTA's architectural parameters are modified (i.e. GEMM core shape, data type, memory size etc.), and as a result the ISA does not guarantee compatibility across all variants of VTA.
This is acceptable however, since the VTA runtime adapts to parameter changes, and produces binary code tailored for the version of the accelerator that gets generated.
This exemplifies the co-design philosophy adopted by the VTA stack which embraces fluidity of the hardware-software interface.
Dataflow Execution
~~~~~~~~~~~~~~~~~~
VTA relies on dependence FIFO queues between hardware modules to synchronize the execution of concurrent tasks.
The figure below shows how a given hardware module can execute concurrently from its producer and consumer modules in a dataflow fashion through the use of dependence FIFO queues, and single-reader/single-writer SRAM buffers.
Each module is connected to its consumer and producer via read-after-write (RAW) and write-after-read (WAR) dependence queues.
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/developer/dataflow.png
:align: center
:width: 100%
The pseudo-code in the figure above describes how a module executes a given instruction predicated on dependences with other instructions.
First, the dependence flags within each instruction are decoded in hardware.
If the instruction has an incoming RAW dependence, execution is predicated upon receiving a RAW dependence token from the producer module.
Similarly, if the task has an incoming WAR dependence, execution is predicated upon receiving a WAR dependence token from the consumer module.
Finally when the task is done, we check for outgoing RAW and WAR dependences, and notify the consumer and producer modules respectively.
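The same handshake can be restated as the Python-style sketch below; the dependence flag names loosely mirror the instruction fields, while ``execute`` and the four token queues are placeholders rather than real APIs:

.. code-block:: python

   def run_instruction(insn, execute,
                       raw_from_producer, war_from_consumer,
                       raw_to_consumer, war_to_producer):
       """Execute one instruction, predicated on dependence tokens."""
       # Wait on incoming dependences encoded in the instruction's flag bits
       if insn.pop_prev_dep:            # RAW dependence on the producer stage
           raw_from_producer.get()      # blocks until the producer pushes a token
       if insn.pop_next_dep:            # WAR dependence on the consumer stage
           war_from_consumer.get()
       execute(insn)                    # the LOAD / GEMM / ALU / STORE task body
       # Once the task is done, notify the neighboring stages
       if insn.push_next_dep:           # consumer may now read what we wrote (RAW)
           raw_to_consumer.put(1)
       if insn.push_prev_dep:           # producer may now overwrite what we read (WAR)
           war_to_producer.put(1)
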
.. note::
Note that the dependence tokens in this scenario are information-less.
This is because the instructions executed by each module cannot be reordered by design, as they arrive in FIFO order.
Pipeline Expandability
~~~~~~~~~~~~~~~~~~~~~~
The default VTA design is composed of four modules that describe a 3-stage ``load-compute-store`` task pipeline.
Following the dataflow hardware organization principle, we can extend the VTA pipeline to include more stages.
For example, we can envision separating the tensor ALU from the GEMM core in order to maximize the utilization of the GEMM core.
This would result in a ``load-gemm-activate-store`` task pipeline which closely reflects the TPU design.
Adding more stages has a cost however: it can add storage and extra logic overhead, which is why we opted for a default 3-stage pipeline.
.. _vta-uarch:
Microarchitectural Overview
---------------------------
We describe the modules that compose the VTA design.
The module definitions are contained in ``vta/hardware/xilinx/sources/vta.cc``.
Fetch Module
~~~~~~~~~~~~
VTA is programmed by a linear instruction stream.
The fetch module is the entry point of VTA to the CPU and is programmed via three memory mapped registers:
- The read-write ``control`` register starts the fetch module, and is read to check for its completion.
- The write-only ``insn_count`` register sets the number of instructions to execute.
- The write-only ``insns`` register sets the start address of the instruction stream in DRAM.
The CPU prepares the instruction stream in DRAM in a physically contiguous buffer allocated by the VTA runtime.
When the instruction stream is ready, the CPU writes the start physical address into the ``insns`` register, the length of the instruction stream into the ``insn_count`` register, and asserts the start signal in the ``control`` register.
This procedure starts VTA, which reads in the instruction stream from DRAM via DMA.
Upon accessing the instruction stream, the fetch module partially decodes instructions, and pushes those instructions into command queues that feed into the load, compute, and store modules:
- ``STORE`` instructions are pushed to the store command queue to be processed by the store module.
- ``GEMM`` and ``ALU`` instructions are pushed to the compute command queue to be processed by the compute module.
- ``LOAD`` instructions that describe a load operation of micro-op kernels or register file data are pushed to the compute command queue to be processed by the compute module.
- ``LOAD`` instructions that describe a load operation of input or weight data are pushed to the load command queue to be processed by the load module.
When one of the command queues becomes full, the fetch module stalls until the queue is not full.
Consequently, the command queues are sized to be deep enough to allow for a wide execution window, and allow multiple tasks to be in flight concurrently across the ``load-compute-store`` pipeline.
Compute Module
~~~~~~~~~~~~~~
VTA's compute module acts as a RISC processor that performs computation on tensor registers rather than scalar registers.
Two functional units mutate the register file: the tensor ALU, and the GEMM core.
The compute module executes RISC micro-ops from the micro-op cache.
There are two types of compute micro-ops: ALU and GEMM operations.
To minimize the footprint of micro-op kernels, while avoiding the need for control-flow instructions such as conditional jumps, the compute module executes micro-op sequences inside a two-level nested loop that computes the location of each tensor register via an affine function.
This compression approach helps reduce the micro-kernel instruction footprint, and applies to both matrix multiplication and 2D convolution, commonly found in neural network operators.
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/developer/gemm_core.png
:align: center
:width: 100%
The **GEMM core** evaluates GEMM instructions by executing a micro-code sequence in a 2-level nested loop, as described in the figure above.
The GEMM core can perform one input-weight matrix multiplication per cycle.
The dimensions of the single-cycle matrix multiplication define a hardware *tensorization intrinsic* onto which the TVM compiler has to lower a computation schedule.
This tensorization intrinsic is defined by the dimensions of the input, weight and accumulator tensors.
Each data type can have a different integer precision: typically both weight and input types are low-precision (8-bits or less), while the accumulator tensor has a wider type to prevent overflows (32-bits).
In order to keep the GEMM core busy, each of the input buffer, weight buffer, and register file has to expose sufficient read/write bandwidth.
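As a behavioral reference (not the actual HLS implementation), the two-level nested micro-op loop described above can be sketched in Python; the micro-op field names and loop factors below are illustrative rather than the exact hardware field names:

.. code-block:: python

   import numpy as np

   def gemm_loop(uops, extent, acc_factor, inp_factor, wgt_factor,
                 acc_mem, inp_mem, wgt_mem):
       """Run a GEMM micro-op sequence inside a two-level nested loop."""
       for i0 in range(extent[0]):
           for i1 in range(extent[1]):
               for uop in uops:
                   # Affine address computation for each tensor register
                   acc = uop["acc_idx"] + i0 * acc_factor[0] + i1 * acc_factor[1]
                   inp = uop["inp_idx"] + i0 * inp_factor[0] + i1 * inp_factor[1]
                   wgt = uop["wgt_idx"] + i0 * wgt_factor[0] + i1 * wgt_factor[1]
                   # One (BATCH, BLOCK_IN) x (BLOCK_OUT, BLOCK_IN)^T product per "cycle",
                   # accumulated into the wider accumulator tensor
                   acc_mem[acc] += np.dot(inp_mem[inp].astype(np.int32),
                                          wgt_mem[wgt].T.astype(np.int32))
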
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/developer/alu_core.png
:align: center
:width: 100%
The **Tensor ALU** supports a set of standard operations to implement common activation, normalization, and pooling operators.
VTA being a modular design, the range of operators that the Tensor ALU supports can be extended for higher operator coverage, at the expense of higher resource utilization.
The Tensor ALU can perform tensor-tensor operations, as well as tensor-scalar operations on an immediate value.
The opcode of the tensor ALU, and the immediate value are specified by the high-level CISC instruction.
The micro-code in the context of tensor ALU computation only takes care of specifying data access patterns.
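For intuition, a single tensor ALU micro-op behaves roughly like the sketch below; the opcode set (add, max, min, shift-right) is representative of the kind of operations described above, and the argument names are illustrative:

.. code-block:: python

   import numpy as np

   def alu_op(opcode, dst_idx, src_idx, use_imm, imm, acc_mem):
       """Apply one ALU micro-op to (BATCH, BLOCK_OUT) tensors in the register file."""
       operand = imm if use_imm else acc_mem[src_idx]
       if opcode == "add":
           acc_mem[dst_idx] = acc_mem[dst_idx] + operand
       elif opcode == "max":
           acc_mem[dst_idx] = np.maximum(acc_mem[dst_idx], operand)
       elif opcode == "min":
           acc_mem[dst_idx] = np.minimum(acc_mem[dst_idx], operand)
       elif opcode == "shr":
           # Arithmetic shift right, commonly used for fixed-point re-scaling
           acc_mem[dst_idx] = acc_mem[dst_idx] >> operand
       return acc_mem[dst_idx]
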
.. note::
In terms of computational throughput, the Tensor ALU does not execute at a rate of one operation per cycle.
The limitation comes from the lack of read-ports: since one register file tensor can be read per cycle, the tensor ALU has an initiation interval of at least 2 (i.e. performs at most 1 operation every 2 cycles).
In addition, performing a single tensor-tensor operation at once can be expensive especially given that register file types are wide, typically 32-bit integers.
As a result, in order to balance the resource utilization footprint of the Tensor ALU with the GEMM core, a tensor-tensor operation is by default performed via vector-vector operations over multiple cycles.
Load and Store Modules
~~~~~~~~~~~~~~~~~~~~~~
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/developer/2d_dma.png
:align: center
:width: 100%
The load and store modules perform 2D DMA loads with a strided access pattern from DRAM to SRAM.
In addition, the load module can insert 2D padding on the fly, which is useful when blocking 2D convolution.
This means that VTA can tile 2D convolution inputs without paying the overhead of re-laying data out in DRAM to insert spatial padding around input and weight tiles.
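A behavioral sketch of such a 2D strided copy with on-the-fly zero padding is shown below; the parameter names loosely mirror the ``LOAD`` instruction fields rather than the exact hardware field names:

.. code-block:: python

   def load_2d(dram, sram, dram_base, sram_base, y_size, x_size, x_stride,
               y_pad_top, y_pad_bottom, x_pad_left, x_pad_right):
       """Copy a (y_size, x_size) tile row by row into SRAM, inserting zero padding.

       dram and sram are numpy-style 1-D arrays of tensor elements.
       """
       idx = sram_base
       padded_row = x_pad_left + x_size + x_pad_right
       # Rows of top padding
       for _ in range(y_pad_top):
           sram[idx:idx + padded_row] = 0
           idx += padded_row
       # Body rows: left pad, strided DRAM read, right pad
       for y in range(y_size):
           sram[idx:idx + x_pad_left] = 0
           idx += x_pad_left
           src = dram_base + y * x_stride
           sram[idx:idx + x_size] = dram[src:src + x_size]
           idx += x_size
           sram[idx:idx + x_pad_right] = 0
           idx += x_pad_right
       # Rows of bottom padding
       for _ in range(y_pad_bottom):
           sram[idx:idx + padded_row] = 0
           idx += padded_row
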
VTA Design and Developer Guide
==============================
This developer guide details the complete VTA-TVM hardware-software stack.
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_stack.png
:align: center
:width: 60%
.. toctree::
:maxdepth: 2
config
hardware
VTA: Deep Learning Accelerator Stack
====================================
Specialized accelerators are key enablers of future deep learning workloads, and the TVM stack is built to target them.
The Versatile Tensor Accelerator (VTA) is an open, generic, and customizable deep learning accelerator with a complete TVM-based compiler stack. We designed VTA to expose the most salient and common characteristics of mainstream deep learning accelerators. Together TVM and VTA form an end-to-end hardware-software deep learning system stack that includes hardware design, drivers, a JIT runtime, and an optimizing compiler stack based on TVM.
.. image:: http://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_overview.png
:align: center
:width: 60%
VTA has the following key features:
- Generic, modular, open-source hardware.
- Streamlined workflow to deploy to FPGAs.
- Simulator support to prototype compilation passes on regular workstations.
- Pynq-based driver and JIT runtime for both simulated and FPGA hardware back-end.
- End to end TVM stack integration.
This page contains links to all the resources related to VTA:
.. toctree::
:maxdepth: 1
install
dev/index
tutorials/index
Literature
----------
- Read the VTA `release blog post`_.
- Read the VTA tech report: `An Open Hardware Software Stack for Deep Learning`_.
.. _release blog post: https://tvm.ai/2018/07/12/vta-release-announcement.html
.. _An Open Hardware Software Stack for Deep Learning: https://arxiv.org/abs/1807.04188
......@@ -2,15 +2,17 @@ VTA Installation Guide
======================
We present three installation guides, each extending on the previous one:
1. [Simulator installation](#vta-simulator-installation)
2. [Hardware test setup](#vta-pynq-based-test-setup)
3. [FPGA toolchain installation](#vta-fpga-toolchain-installation)
## VTA Simulator Installation
You need [TVM installed](https://docs.tvm.ai/install/index.html) on your machine.
For a quick and easy start, use the pre-built [TVM Docker image](https://docs.tvm.ai/install/docker.html).
The VTA simulator library is built by default with TVM.
Add the VTA library to your python path to run the VTA examples.
```bash
export PYTHONPATH=/path/to/vta/python:${PYTHONPATH}
......@@ -18,25 +20,27 @@ export PYTHONPATH=/path/to/vta/python:${PYTHONPATH}
### Testing your VTA Simulation Setup
To ensure that you've properly installed the VTA python package, run the following 2D convolution testbench.
```bash
python <tvm root>/vta/tests/python/integration/test_benchmark_topi_conv2d.py
```
> Note: You'll notice that for every convolution layer, the throughput gets reported in GOPS. These numbers are actually the computational throughput that the simulator achieves, by evaluating the convolutions in software.
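As an additional quick sanity check (not part of the official test suite), you can verify from Python that the VTA package resolves and that the compiler target is set to the simulator:

```python
# Quick check that the VTA python package is on the path and targets the simulator
import vta
from vta.testing import simulator  # makes sure the simulator backend is importable

env = vta.get_env()
print(env.TARGET)  # expected to print "sim" with the default vta_config.json
```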
You can also try out our [VTA programming tutorials](https://docs.tvm.ai/vta/tutorials/index.html) on the VTA simulator.
### Advanced Configuration (optional)
VTA is a generic configurable deep learning accelerator.
The configuration is specified by `vta_config.json` under the TVM root folder.
This file provides an architectural specification of the VTA accelerator to parameterize both the TVM compiler stack and the VTA hardware stack.
The VTA configuration file also specifies the TVM compiler target.
When `TARGET` is set to `sim`, all TVM workloads execute on the VTA simulator.
You can modify the content of the configuration file to rebuild VTA to a different parameterization.
To do so,
```bash
cd <tvm root>
......@@ -45,28 +49,28 @@ cp vta/config/vta_config.json vta_config.json
make vta
```
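After rebuilding, you can double-check which parameterization the Python stack picked up; the snippet below uses the same `vta.get_env()` handle as the VTA tutorials (attribute names as used there):

```python
import vta

env = vta.get_env()
# Compiler target and GEMM intrinsic shape derived from vta_config.json
print(env.TARGET, env.BATCH, env.BLOCK_IN, env.BLOCK_OUT)
```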
## VTA Pynq-Based Test Setup
This second guide extends the *VTA Simulator Installation* guide above to run FPGA hardware tests of the complete TVM and VTA software-hardware stack.
In terms of hardware components you'll need:
* The [Pynq](http://www.pynq.io/) FPGA development board which can be acquired for $200, or $150 for academics from [Digilent](https://store.digilentinc.com/pynq-z1-python-productivity-for-zynq/).
* An Ethernet-to-USB adapter to connect the Pynq board to your development machine.
* An 8+GB micro SD card.
* An AC to DC 12V 3A power adapter.
This guide covers the following themes:
1. Pynq board setup instructions.
2. Pynq-side RPC server build and deployment.
3. Revisiting the test examples from the *VTA Simulator Installation* guide, this time executing on the Pynq board.
### Pynq Board Setup
Setup your Pynq board based on the [Pynq board getting started tutorial](http://pynq.readthedocs.io/en/latest/getting_started.html).
You should follow the instructions up to and including the *Turning On the PYNQ-Z1* step (no need to pursue the tutorial beyond this point).
* Make sure that you've downloaded the latest Pynq image, [PYNQ-Z1 v2.1](http://pynq-testing.readthedocs.io/en/image_v2.2/getting_started/pynq_image.html) (released 21 Feb 2018), and have imaged your SD card with it (we recommend the free [Etcher](https://etcher.io/) program).
* For this test setup, follow the ["Connect to a Computer"](http://pynq.readthedocs.io/en/latest/getting_started.html#connect-to-a-computer) Ethernet setup instructions. To be able to talk to the board, make sure to [assign your computer a static IP address](http://pynq.readthedocs.io/en/latest/appendix.html#assign-your-computer-a-static-ip)
Once the board is powered on and connected to your development machine, try connecting to it to make sure you've properly set up your Pynq board:
```bash
# To connect to the Pynq board use the [username, password] combo: [xilinx, xilinx]
ssh xilinx@192.168.2.99
......@@ -74,9 +78,10 @@ ssh xilinx@192.168.2.99
### Pynq-Side RPC Server Build & Deployment
Because the direct board-to-computer connection prevents the board from directly accessing the internet, we'll need to mount the Pynq's file system to your development machine's file system with [sshfs](https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh). Next we directly clone the TVM repository into the sshfs mountpoint on your development machine.
```bash
# On the Host-side
mkdir <mountpoint>
sshfs xilinx@192.168.2.99:/home/xilinx <mountpoint>
cd <mountpoint>
......@@ -95,7 +100,7 @@ ssh xilinx@192.168.2.99
cd /home/xilinx/tvm
mkdir build
cp cmake/config.cmake build/.
# Copy pynq specific configuration
cp vta/config/pynq_sample.json build/vta_config.json
cd build
cmake ..
......@@ -115,10 +120,11 @@ Tips regarding the Pynq RPC Server:
* To kill the RPC server, just send the `Ctrl + c` command. You can re-run it with `sudo ./apps/pynq_rpc/start_rpc_server.sh`.
* If unresponsive, the board can be rebooted by power-cycling it with the physical power switch.
### Testing your Pynq-based Hardware Setup
Before running the examples on your development machine, you'll need to configure your host environment as follows:
```bash
# On the Host-side
export VTA_PYNQ_RPC_HOST=192.168.2.99
export VTA_PYNQ_RPC_PORT=9091
```
......@@ -128,23 +134,28 @@ Alternatively, you can copy the default `vta/config/pynq_sample.json` into the T
> Note: in contrast to our simulation setup, there are no libraries to compile on the host side since the host offloads all of the computation to the Pynq board.
```bash
# On the Host-side
cd <tvm root>
cp vta/config/pynq_sample.json vta_config.json
```
This time again, we will run the 2D convolution testbench.
Beforehand, we need to program the Pynq board FPGA with a VTA bitstream, and build the VTA runtime via RPC.
The following `test_program_rpc.py` script will perform two operations:
* FPGA programming, by downloading a pre-compiled bitstream from a [VTA bitstream repository](https://github.com/uwsaml/vta-distro) that matches the default `vta_config.json` configuration set by the host, and sending it over to the Pynq via RPC to program the Pynq's FPGA.
* Runtime building on the Pynq, which needs to be run every time the `vta_config.json` configuration is modified. This ensures that the VTA software runtime that generates the accelerator's executable via just-in-time (JIT) compilation matches the specifications of the VTA design that is programmed on the FPGA. The build process takes about 30 seconds to complete so be patient!
```bash
# On the Host-side
python <tvm root>/vta/tests/python/pynq/test_program_rpc.py
```
> Tip: You can track progress of the FPGA programming and the runtime rebuilding steps by looking at the RPC server's logging messages in your Pynq `ssh` session.
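For reference, the host-side logic of that script boils down to the following sketch; it reuses the `vta.program_fpga` helper shown later in this guide, and assumes `vta.reconfig_runtime` is the matching runtime-rebuild helper:

```python
# Host-side sketch of what test_program_rpc.py does
import os
from tvm import rpc
import vta

host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
remote = rpc.connect(host, port)

# Program the FPGA; bitstream=None is assumed to select the pre-built bitstream
# that matches the vta_config.json configuration on the host
vta.program_fpga(remote, bitstream=None)
# Rebuild the VTA runtime on the Pynq so it matches the programmed design
vta.reconfig_runtime(remote)
```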
We are now ready to run the 2D convolution testbench in hardware.
```bash
# On the Host-side
python <tvm root>/vta/tests/python/integration/test_benchmark_topi_conv2d.py
```
......@@ -153,28 +164,29 @@ The performance metrics measured on the Pynq board will be reported for each con
You can also try out our [VTA programming tutorials](https://docs.tvm.ai/vta/tutorials/index.html).
## VTA FPGA Toolchain Installation
This third and last guide allows users to generate custom VTA bitstreams using free-to-use Xilinx compilation toolchains.
### Xilinx Toolchain Installation
We recommend using `Vivado 2018.2` since our scripts have been tested to work on this version of the Xilinx toolchains.
Our guide is written for Linux (Ubuntu) installation.
You’ll need to install Xilinx’ FPGA compilation toolchain, [Vivado HL WebPACK 2018.2](https://www.xilinx.com/products/design-tools/vivado.html), which is a license-free version of the Vivado HLx toolchain.
#### Obtaining and Launching the Vivado GUI Installer
1. Go to the [download webpage](https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2018-2.html), and download the Linux Self Extracting Web Installer for Vivado HLx 2018.2: WebPACK and Editions.
2. You’ll have to sign in with a Xilinx account. This requires a Xilinx account creation that will take 2 minutes.
3. Complete the Name and Address Verification by clicking “Next”, and you will get the opportunity to download a binary file, called `Xilinx_Vivado_SDK_Web_2018.2_0614_1954_Lin64.bin`.
4. Now that the file is downloaded, go to your `Downloads` directory, and change the file permissions so it can be executed:
```bash
chmod u+x Xilinx_Vivado_SDK_Web_2018.2_0614_1954_Lin64.bin
```
5. Now you can execute the binary:
```bash
./Xilinx_Vivado_SDK_Web_2018.2_0614_1954_Lin64.bin
```
#### Xilinx Vivado GUI Installer Steps
......@@ -182,17 +194,17 @@ chmod u+x Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
At this point you've launched the Vivado 2018.2 Installer GUI program.
1. Click “Next” on the *Welcome* screen.
2. On the *Select Install Type* screen, enter your Xilinx user credentials under the “User Authentication” box and select the “Download and Install Now” option before clicking “Next”.
3. On the *Accept License Agreements* screen, accept all terms before clicking “Next”.
4. On the *Select Edition to Install* screen, select the “Vivado HL WebPACK” before clicking “Next”.
5. Under the *Vivado HL WebPACK* screen, before hitting “Next”, check the following options (the rest should be unchecked):
* Design Tools -> Vivado Design Suite -> Vivado
* Design Tools -> Vivado Design Suite -> Vivado High Level Synthesis
* Devices -> Production Devices -> SoCs -> Zynq-7000 (if you are targeting the Pynq board)
* Devices -> Production Devices -> SoCs -> UltraScale+ MPSoC (if you are targeting the Ultra-96 board)
6. Your total download size should be about 5GB and the amount of Disk Space Required 23GB.
7. On the *Select Destination Directory* screen, set the installation directory before clicking “Next”. It might highlight some paths as red - that’s because the installer doesn’t have the permission to write to the directory. In that case select a path that doesn’t require special write permissions (e.g. your home directory).
8. On the *Installation Summary* screen, hit “Install”.
9. An *Installation Progress* window will pop-up to track progress of the download and the installation.
10. This process will take about 20-30 minutes depending on your connection speed.
11. A pop-up window will inform you that the installation completed successfully. Click "OK".
12. Finally the *Vivado License Manager* will launch. Select "Get Free ISE WebPACK, ISE/Vivado IP or PetaLinux License" and click "Connect Now" to complete the license registration process.
......@@ -201,20 +213,17 @@ At this point you've launched the Vivado 2017.1 Installer GUI program.
The last step is to update your `~/.bashrc` with the following lines. This will include all of the Xilinx binary paths so you can launch compilation scripts from the command line.
```bash
# Xilinx Vivado 2018.2 environment
export XILINX_VIVADO=${XILINX_PATH}/Vivado/2018.2
export PATH=${XILINX_VIVADO}/bin:${PATH}
```
### Custom VTA Bitstream Compilation
High-level hardware parameters are listed in the VTA configuration file and can be customized by the user.
For this custom VTA bitstream compilation exercise, we'll change the frequency of our design, so it can be clocked a little faster.
* Set the `HW_FREQ` field to `142`. The Pynq board supports 100, 142, 167 and 200MHz clocks. Note that the higher the frequency, the harder it will be to close timing. Increasing the frequency can lead to timing violation and thus faulty hardware execution.
* Set the `HW_CLK_TARGET` to `6`. This parameter refers to the target clock period in nanoseconds for HLS - a lower clock period leads to more aggressive pipelining to achieve timing closure at higher frequencies. Technically a 142MHz clock would require a 7ns target, but we intentionally lower the clock target to 6ns to more aggressively pipeline our design.
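The clock-period arithmetic behind these two settings is simple; the snippet below just spells it out:

```python
# Clock period corresponding to each supported Pynq frequency (in MHz)
for freq_mhz in (100, 142, 167, 200):
    print(freq_mhz, "MHz ->", round(1e3 / freq_mhz, 2), "ns period")
# With HW_FREQ = 142 the true period is ~7.04 ns; HW_CLK_TARGET = 6 asks HLS
# to pipeline more aggressively than strictly required.
```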
Bitstream generation is driven by a top-level `Makefile` under `<tvm root>/vta/hardware/xilinx/`.
......@@ -229,25 +238,26 @@ If you just want to generate the HLS-based VTA IP cores without launching the en
make ip
```
You'll be able to view the HLS synthesis reports under `<tvm root>/vta/build/hardware/xilinx/hls/<configuration>/<block>/solution0/syn/report/<block>_csynth.rpt`.
> Note: The `<configuration>` name is a string that summarizes the VTA configuration parameters listed in `vta_config.json`. The `<block>` name refers to the specific module (or HLS function) in the high-level VTA pipeline.
Finally to run the full hardware compilation and generate the VTA bitstream, run:
```bash
make
```
This process is lengthy, and can take up to an hour to complete depending on your machine's specs.
We recommend setting the `VTA_HW_COMP_THREADS` variable in the Makefile to take full advantage of all the cores on your development machine.
Once the compilation completes, the generated bitstream can be found under `<tvm root>/vta/build/hardware/xilinx/vivado/<configuration>/export/vta.bit`.
### Use the Custom Bitstream
We can program the new VTA FPGA bitstream by setting the bitstream path of the `vta.program_fpga()` function in the tutorial examples, or in the `test_program_rpc.py` script.
```python
vta.program_fpga(remote, bitstream="<tvm root>/vta/build/hardware/xilinx/vivado/<configuration>/export/vta.bit")
```
Instead of downloading a pre-built bitstream from the VTA bitstream repository, TVM will instead use the new bitstream you just generated, which is a VTA design clocked at a higher frequency.
Do you observe a noticeable performance increase on the ImageNet classification example?
......@@ -6,7 +6,7 @@
#
# Check if script is running in correct Vivado version.
set scripts_vivado_version 2017.1
set scripts_vivado_version 2018.2
set current_vivado_version [version -short]
if { [string first $scripts_vivado_version $current_vivado_version] == -1 } {
......@@ -53,7 +53,8 @@ if { [llength $argv] eq 12 } {
}
} else {
puts "Arg list incomplete: <path to ip dir> <num threads> <clock freq> \
<inp width> <wgt_width> <out_width> <batch> <in_block / 1024> <out_block>"
<inp width> <wgt_width> <out_width> <batch> <batch> <out_block> <in_block
<inp_mem_size> <wgt_mem_size> <out_mem_size>"
return 1
}
......@@ -66,6 +67,7 @@ if {[expr $inp_part == 0]} {
set inp_bus_width $inp_mem_width
}
set inp_mem_depth [expr $inp_mem_size * 8 / ($inp_mem_width * $inp_part)]
# Derive weight mem parameters
set wgt_mem_width [expr $wgt_width * $out_block * $in_block]
set wgt_bus_width 1024
......@@ -75,6 +77,7 @@ if {[expr $wgt_part == 0]} {
set wgt_bus_width $wgt_mem_width
}
set wgt_mem_depth [expr $wgt_mem_size * 8 / ($wgt_mem_width * $wgt_part)]
# Derive output mem parameters
set out_mem_width [expr $out_width * $batch * $out_block]
set out_bus_width 1024
......@@ -252,7 +255,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
] $fetch_0
# Create instance: g2l_queue, and set properties
set g2l_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 g2l_queue ]
set g2l_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 g2l_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -273,7 +276,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
] $g2l_queue
# Create instance: g2s_queue, and set properties
set g2s_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 g2s_queue ]
set g2s_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 g2s_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -294,7 +297,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
] $g2s_queue
# Create instance: gemm_queue, and set properties
set gemm_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 gemm_queue ]
set gemm_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 gemm_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {510} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -318,7 +321,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
] $gemm_queue
# Create instance: l2g_queue, and set properties
set l2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 l2g_queue ]
set l2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 l2g_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -345,7 +348,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
] $load_0
# Create instance: load_queue, and set properties
set load_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 load_queue ]
set load_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 load_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {510} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -406,7 +409,7 @@ proc create_root_design { parentCell clk inp_part wgt_part out_part inp_bus_widt
] $processing_system7_1
# Create instance: s2g_queue, and set properties
set s2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 s2g_queue ]
set s2g_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 s2g_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {1022} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -433,7 +436,7 @@ CONFIG.C_M_AXI_DATA_PORT_CACHE_VALUE {"1111"} \
] $store_0
# Create instance: store_queue, and set properties
set store_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.1 store_queue ]
set store_queue [ create_bd_cell -type ip -vlnv xilinx.com:ip:fifo_generator:13.2 store_queue ]
set_property -dict [ list \
CONFIG.Empty_Threshold_Assert_Value_axis {510} \
CONFIG.Empty_Threshold_Assert_Value_rach {14} \
......@@ -466,7 +469,7 @@ CONFIG.NUM_PORTS {5} \
if {${inp_part} > 1} {
for {set i 0} {$i < ${inp_part}} {incr i} {
# Create instance: inp_mem, and set properties
set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 inp_mem_${i} ]
set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 inp_mem_${i} ]
set_property -dict [ list \
CONFIG.Byte_Size {8} \
CONFIG.Enable_32bit_Address {true} \
......@@ -494,7 +497,7 @@ CONFIG.NUM_PORTS {5} \
}
} else {
# Create instance: inp_mem, and set properties
set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 inp_mem ]
set inp_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 inp_mem ]
set_property -dict [ list \
CONFIG.Byte_Size {8} \
CONFIG.Enable_32bit_Address {true} \
......@@ -525,7 +528,7 @@ CONFIG.NUM_PORTS {5} \
if {${wgt_part} > 1} {
for {set i 0} {$i < ${wgt_part}} {incr i} {
# Create instance: wgt_mem, and set properties
set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 wgt_mem_${i} ]
set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 wgt_mem_${i} ]
set_property -dict [ list \
CONFIG.Assume_Synchronous_Clk {true} \
CONFIG.Byte_Size {8} \
......@@ -553,7 +556,7 @@ CONFIG.NUM_PORTS {5} \
}
} else {
# Create instance: wgt_mem, and set properties
set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 wgt_mem ]
set wgt_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 wgt_mem ]
set_property -dict [ list \
CONFIG.Assume_Synchronous_Clk {true} \
CONFIG.Byte_Size {8} \
......@@ -584,7 +587,7 @@ CONFIG.NUM_PORTS {5} \
if {${out_part} > 1} {
for {set i 0} {$i < ${out_part}} {incr i} {
# Create instance: out_mem, and set properties
set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 out_mem_${i} ]
set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 out_mem_${i} ]
set_property -dict [ list \
CONFIG.Byte_Size {8} \
CONFIG.Enable_32bit_Address {true} \
......@@ -612,7 +615,7 @@ CONFIG.NUM_PORTS {5} \
}
} else {
# Create instance: out_mem, and set properties
set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.3 out_mem ]
set out_mem [ create_bd_cell -type ip -vlnv xilinx.com:ip:blk_mem_gen:8.4 out_mem ]
set_property -dict [ list \
CONFIG.Byte_Size {8} \
CONFIG.Enable_32bit_Address {true} \
......
......@@ -30,7 +30,7 @@ from tvm import rpc
from tvm.contrib import util
from vta.testing import simulator
# Load VTA parameters from the config.json file
# Load VTA parameters from the vta/config/vta_config.json file
env = vta.get_env()
# We read the Pynq RPC host IP address and port number from the OS environment
......@@ -38,7 +38,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the config.json file.
# to match the VTA configuration specified by the vta_config.json file.
if env.TARGET == "pynq":
# Make sure that TVM was compiled with RPC=1
......
......@@ -26,7 +26,7 @@ from tvm import rpc
from tvm.contrib import util
from vta.testing import simulator
# Load VTA parameters from the config.json file
# Load VTA parameters from the vta/config/vta_config.json file
env = vta.get_env()
# We read the Pynq RPC host IP address and port number from the OS environment
......@@ -34,7 +34,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the config.json file.
# to match the VTA configuration specified by the vta_config.json file.
if env.TARGET == "pynq":
# Make sure that TVM was compiled with RPC=1
......@@ -95,7 +95,7 @@ elif env.TARGET == "sim":
# :width: 480px
#
# The dimensions of that matrix-matrix multiplication are specified in
# the :code:`config.json` configuration file.
# the :code:`vta_config.json` configuration file.
# The activation matrix has a :code:`(BATCH, BLOCK_IN)` shape
# and the transposed weight matrix has a :code:`(BLOCK_OUT, BLOCK_IN)` shape,
# thus inferring that the resulting output matrix has a
......@@ -131,7 +131,7 @@ elif env.TARGET == "sim":
# dimension of VTA's tensor core, but also to match the specific data types
# expected by VTA.
# VTA for now only supports fixed point data types, which integer width is
# specified in the :code:`config.json` file by :code:`INP_WIDTH` and
# specified in the :code:`vta_config.json` file by :code:`INP_WIDTH` and
# :code:`WGT_WIDTH` for the activations and weights data types respectively.
# In addition, the accumulator data type integer width is specified by
# :code:`ACC_WIDTH`.
......@@ -284,7 +284,7 @@ print(tvm.lower(s, [A, B, C], simple_mode=True))
# that stores input matrices of shape :code:`(env.BATCH, env.BLOCK_IN)`
# of type :code:`env.inp_dtype`. The input buffer contains
# `2 ^ LOG_INP_BUFF_SIZE` matrix elements (as specified in the
# :code:`config.json` file).
# :code:`vta_config.json` file).
# - :code:`env.wgt_scope`: Weight buffer, which is a read-only SRAM buffer
# that stores weight matrices of shape :code:`(env.BLOCK_OUT, env.BLOCK_IN)`
# of type :code:`env.wgt_dtype`. The weight buffer contains
......
......@@ -29,7 +29,7 @@ from tvm import rpc
from tvm.contrib import util
from vta.testing import simulator
# Load VTA parameters from the config.json file
# Load VTA parameters from the vta/config/vta_config.json file
env = vta.get_env()
# We read the Pynq RPC host IP address and port number from the OS environment
......@@ -37,7 +37,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the config.json file.
# to match the VTA configuration specified by the vta_config.json file.
if env.TARGET == "pynq":
# Make sure that TVM was compiled with RPC=1
......
......@@ -38,7 +38,7 @@ from io import BytesIO
from matplotlib import pyplot as plt
from PIL import Image
# Load VTA parameters from the config.json file
# Load VTA parameters from the vta/config/vta_config.json file
env = vta.get_env()
# Helper to crop an image to a square (224, 224)
......@@ -180,7 +180,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the config.json file.
# to match the VTA configuration specified by the vta_config.json file.
if env.TARGET == "pynq":
# Make sure that TVM was compiled with RPC=1
......
......@@ -29,12 +29,12 @@ import numpy as np
# VTA is a modular and customizable design. Consequently, the user
# is free to modify high-level hardware parameters that affect
# the hardware design layout.
# These parameters are specified in the :code:`config.json` file by their
# These parameters are specified in the :code:`vta_config.json` file by their
# :code:`log2` values.
# These VTA parameters can be loaded with the :code:`vta.get_env`
# function.
#
# Finally, the TVM target is specified in the :code:`config.json` file.
# Finally, the TVM target is also specified in the :code:`vta_config.json` file.
# When set to *sim*, execution will take place inside of a behavioral
# VTA simulator.
# If you want to run this tutorial on the Pynq FPGA development platform,
......@@ -58,7 +58,7 @@ host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.2.99")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
# We configure both the bitstream and the runtime system on the Pynq
# to match the VTA configuration specified by the config.json file.
# to match the VTA configuration specified by the vta_config.json file.
if env.TARGET == "pynq":
# Make sure that TVM was compiled with RPC=1
......@@ -110,11 +110,11 @@ elif env.TARGET == "sim":
# For VTA's general purpose operations such as vector adds, the tile size is
# :code:`(env.BATCH, env.BLOCK_OUT)`.
# The dimensions are specified in
# the :code:`config.json` configuration file and are set by default to
# the :code:`vta_config.json` configuration file and are set by default to
# a (1, 16) vector.
#
# In addition, A and B's data types also needs to match the :code:`env.acc_dtype`
# which is set by the :code:`config.json` file to be a 32-bit integer.
# which is set by the :code:`vta_config.json` file to be a 32-bit integer.
# Output channel factor m - total 64 x 16 = 1024 output channels
m = 64
......