Unverified commit 98e7709f by Cody Yu, committed by GitHub

[DOCS] Fix Sphinx Warnings (RST indent, cross-ref, and image scale) (#4920)

* fix indents

* Fix image scale and cross-ref
parent efd35e86
@@ -21,6 +21,7 @@ Python API
 .. toctree::
    :maxdepth: 2

+   tvm
    runtime
    ndarray
    error
......
@@ -53,8 +53,3 @@ tvm.relay.op
 .. automodule:: tvm.relay.op.nn
    :members:

-.. automodule:: tvm.relay.op.vision.multibox
-   :members:
-
-.. autofunction:: tvm.relay.vision.nms
@@ -27,7 +27,7 @@ tvm.runtime
 .. autoclass:: tvm.runtime.PackedFunc
    :members:
-   :inheritated-members:
+   :inherited-members:

 .. autofunction:: tvm.register_func
......
@@ -29,12 +29,11 @@ This is a quick guide to submit a pull request, please also refer to the detaile
       git rebase upstream/master

 - Make sure the code style check passes by typing the following command, and that all the existing test-cases pass.
-  - ``docker/bash.sh tvmai/ci-lint ./tests/scripts/task_lint.sh``
-    (Note: You must install docker beforehand so you can run a docker image.)
+  - ``docker/bash.sh tvmai/ci-lint ./tests/scripts/task_lint.sh``. (Note: You must install docker beforehand so you can run a docker image.)
 - Add test-cases to cover the new features or bugfix the patch introduces.
 - Document the code you wrote, see more at :ref:`doc_guide`
-- Send the pull request, fix the problems reported by automatic checks.
-  Request code reviews from other contributors and improves your patch according to feedbacks.
+- Send the pull request and fix the problems reported by automatic checks.
+- Request code reviews from other contributors and improve your patch according to feedback.
 - To get your code reviewed quickly, we encourage you to help review others' code so they can do the favor in return.
 - Code review is a shepherding process that helps to improve contributor's code quality.
......
@@ -56,7 +56,6 @@ embedded devices is through TVM's RPC API.
 Here are the links to the related tutorials.

 - :ref:`tutorial-cross-compilation-and-rpc`
-- :ref:`tutorial-deploy-model-on-mali-gpu`
 - :ref:`tutorial-deploy-model-on-rasp`

 After you have finished tuning and benchmarking, you might need to deploy the model on the
@@ -68,3 +67,5 @@ target device without relying on RPC. see the following resources on how to do s
    cpp_deploy
    android
    integrate
+   aocl_fpga
+   aws_fpga
@@ -118,13 +118,11 @@ In the Operation class declaration above, we can see that each operation also ha
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/stage_graph.png
    :align: center
-   :scale: 70%

 InferBound makes one pass through the graph, visiting each stage exactly once. InferBound starts from the output stages (i.e., the solid blue nodes in the graph above), and moves upwards (in the opposite direction of the edges). This is achieved by performing a reverse topological sort on the nodes of the graph. Therefore, when InferBound visits a stage, each of its consumer stages has already been visited.

 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/inferbound_traversal.png
    :align: center
-   :scale: 70%
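
The traversal order itself is simple to express; here is a minimal sketch in Python, using a hypothetical stage-to-consumers mapping rather than TVM's actual data structures:

.. code-block:: python

    # Hypothetical stage graph: each stage maps to the stages that consume it.
    consumers = {"A": ["B"], "B": ["C"], "C": []}  # C is the output stage

    def inferbound_order(consumers):
        """Visit every stage after all of its consumers (reverse topological order)."""
        visited, order = set(), []

        def visit(stage):
            if stage in visited:
                return
            visited.add(stage)
            for c in consumers[stage]:
                visit(c)            # visit all consumers first...
            order.append(stage)     # ...then the stage itself

        for stage in consumers:
            visit(stage)
        return order

    print(inferbound_order(consumers))  # ['C', 'B', 'A']: outputs first, moving upwards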
 The InferBound pass is shown in the following pseudo-code:

@@ -162,7 +160,6 @@ Recall that all IterVars of the stage are related by IterVarRelations. The IterV
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/relations.png
    :align: center
-   :scale: 70%

 The above diagram shows the IterVar hyper-graph for one stage. The stage has one root_iter_var, ``i``. It has been split, and the resulting inner axis, ``i.inner``, has been split again. The leaf_iter_vars of the stage are shown in green: ``i.outer``, ``i.inner.outer``, and ``i.inner.inner``.
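
This nesting can be reproduced with a small schedule; a sketch using the pre-0.7 ``tvm`` API, where the extent and the split factors are arbitrary choices:

.. code-block:: python

    import tvm

    n = 16
    A = tvm.placeholder((n,), name="A")
    B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")

    s = tvm.create_schedule(B.op)
    # Split the root IterVar i, then split the resulting inner axis again.
    i_outer, i_inner = s[B].split(B.op.axis[0], factor=4)
    i_inner_outer, i_inner_inner = s[B].split(i_inner, factor=2)

    # The leaf IterVars are now i.outer, i.inner.outer, and i.inner.inner.
    print(s[B].leaf_iter_vars)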
@@ -208,7 +205,6 @@ As mentioned above, a consumer may only require a small number of elements from
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/inferbound_phases.png
    :align: center
-   :scale: 70%

 IntSets
 ~~~~~~~

@@ -323,14 +319,12 @@ A ComputeOp has only a single output Tensor, whose axes correspond to the axis v
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/gatherbound.png
    :align: center
-   :scale: 70%

 The union of IntSets is computed by converting each IntSet to an Interval, and then taking the minimum of all of these intervals' minimums, and the maximum of all of their maximums.

 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/union.png
    :align: center
-   :scale: 70%
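
Over plain ``(min, max)`` pairs the rule looks as follows; a minimal sketch with a hypothetical helper, not TVM's actual ``IntSet`` API:

.. code-block:: python

    def interval_union(intervals):
        """Cover a list of (min, max) intervals with a single interval."""
        return (min(lo for lo, _ in intervals),
                max(hi for _, hi in intervals))

    # Two disjoint ranges collapse into one interval that also covers the gap
    # between them, so elements that are never used may still be computed.
    print(interval_union([(0, 3), (8, 11)]))  # (0, 11)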
 This clearly results in some unnecessary computation, i.e., tensor elements will be computed that are never used.

@@ -340,7 +334,6 @@ Unfortunately, even if we're lucky and the IntervalSet unions do not produce unn
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/gatherbound_problem.png
    :align: center
-   :scale: 70%

 .. _InferBoundCA:

@@ -696,7 +689,6 @@ When InferRootBound is working on stage B, it visits B's consumer stage C to fin
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_problem.png
    :align: center
-   :scale: 70%

@@ -756,17 +748,14 @@ If the split factor is 4, or 8, in the above example, the region of B needed in
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_div.png
    :align: center
-   :scale: 70%

 However, if the split factor is changed from 4 to 3 in the example above, it is easy to see that the region of B that C needs can no longer be described by an independent Range for each of its axes.

 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_nodiv.png
    :align: center
-   :scale: 70%

 The best that can be done with rectangular regions is shown in the following diagram. The orange regions are the minimum rectangular regions covering the region of B that needs to be computed, at each iteration of the outer loop.

 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/docs/inferbound/passupdomain_min.png
    :align: center
-   :scale: 70%
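
The dependence on divisibility is easy to reproduce numerically; a sketch assuming a hypothetical tensor B of width 4 whose two axes have been fused and then split by ``factor``:

.. code-block:: python

    WIDTH = 4  # hypothetical extent of B's second axis

    def region(i_outer, factor, width=WIDTH):
        """2D elements of B touched by one iteration of the outer loop."""
        flat = range(i_outer * factor, (i_outer + 1) * factor)
        return [(f // width, f % width) for f in flat]

    print(region(1, 4))  # [(1, 0), (1, 1), (1, 2), (1, 3)]: one whole row, a rectangle
    print(region(1, 3))  # [(0, 3), (1, 0), (1, 1)]: wraps across two rows, not a rectangle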
@@ -535,7 +535,7 @@ To simplify, we define a graph representation named "ExampleJSON" in this guide.
 Then the ExampleJSON of this subgraph looks like:

-.. code-block:: json
+.. code-block:: none

   subgraph_0
     input 0 10 10
@@ -544,7 +544,7 @@ Then the ExampleJSON of this subgraph looks like:
     input 3 10 10
     add 4 inputs: 0 1 shape: 10 10
     sub 5 inputs: 4 2 shape: 10 10
-    add 6 inputs: 5 3 shape: 10 10
+    mul 6 inputs: 5 3 shape: 10 10

 The ``input`` keyword declares an input tensor with its ID and shape, while the other statements describe computations in ``<op> <output ID> inputs: [input ID] shape: [shape]`` syntax.
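
A tiny parser makes the syntax concrete; a sketch with a hypothetical helper, not part of the codegen developed in this guide:

.. code-block:: python

    def parse_statement(line):
        """Parse one ExampleJSON statement into a dict."""
        tokens = line.split()
        if tokens[0] == "input":
            return {"kind": "input", "id": int(tokens[1]),
                    "shape": [int(t) for t in tokens[2:]]}
        sep = tokens.index("shape:")
        return {"kind": "op", "op": tokens[0], "id": int(tokens[1]),
                "inputs": [int(t) for t in tokens[tokens.index("inputs:") + 1:sep]],
                "shape": [int(t) for t in tokens[sep + 1:]]}

    print(parse_statement("input 0 10 10"))
    print(parse_statement("add 4 inputs: 0 1 shape: 10 10"))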
......
@@ -39,7 +39,6 @@ compile for heterogeneous execution environments (e.g., executing parts of the g
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/relay/dataflow.png
    :align: center
-   :scale: 70%

 You can use Relay to build a computational (dataflow) graph. Specifically, the above code shows how to

@@ -130,7 +129,6 @@ The code example below shows one program with two forms side by side.
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/relay/dataflow_vs_func.png
    :align: center
-   :scale: 70%

 The nested let binding is called A-normal form, and it is commonly used as an IR in functional programming languages.

@@ -155,7 +153,6 @@ which does not use let bindings.
 .. image:: https://raw.githubusercontent.com/tvmai/tvmai.github.io/master/images/relay/let_scope.png
    :align: center
-   :scale: 70%

 The problem comes when we try to decide where we should evaluate node ``%1``. In particular, while the text format seems
 to suggest that we should evaluate node ``%1`` outside the if scope, the AST (as shown in the picture) does not suggest so.
......
@@ -258,6 +258,7 @@ It also allows us to get members of an object easily in front-end language.
 For example, in the following code, we accessed the op field of the TensorNode.

 .. code:: python
+
   import tvm
   x = tvm.placeholder((3,4), name="x")
......
@@ -91,6 +91,7 @@ Ret
 ^^^
 **Arguments**:
 ::
+
   RegName dst
   RegName result

@@ -100,6 +101,7 @@ InvokePacked
 ^^^^^^^^^^^^
 **Arguments**:
 ::
+
   Index packed_index
   Index arity
   Index output_size

@@ -114,6 +116,7 @@ AllocTensor
 ^^^^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   RegName storage
   uint32_t ndim

@@ -127,6 +130,7 @@ AllocTensorReg
 ^^^^^^^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   RegName storage
   RegName shape_register

@@ -139,6 +143,7 @@ AllocStorage
 ^^^^^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   RegName size
   RegName alignment

@@ -151,6 +156,7 @@ AllocADT
 ^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   Index tag
   Index num_fields

@@ -163,6 +169,7 @@ AllocClosure
 ^^^^^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   Index clo_index
   Index num_freevar

@@ -176,6 +183,7 @@ GetField
 ^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   RegName object
   Index field_index

@@ -186,6 +194,7 @@ If
 ^^
 **Arguments**:
 ::
+
   RegName test
   RegName target
   Index true_offset

@@ -199,6 +208,7 @@ GetTag
 ^^^^^^
 **Arguments**:
 ::
+
   RegName object
   RegName dst

@@ -212,6 +222,7 @@ Goto
 ^^^^
 **Arguments**:
 ::
+
   Index pc_offset

 Relative unconditional jump by ``pc_offset``.
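
How a relative jump interacts with the program counter can be sketched with a toy dispatch loop; this is illustrative only, not the Relay VM's implementation:

.. code-block:: python

    def run(program):
        """Toy interpreter: pc indexes into a list of (opcode, operand) tuples."""
        pc, acc = 0, 0
        while pc < len(program):
            op, arg = program[pc]
            if op == "goto":
                pc += arg        # relative unconditional jump by pc_offset
                continue
            if op == "add":
                acc += arg
            pc += 1              # ordinary instructions fall through
        return acc

    # The goto skips the middle add: 1 + 4 = 5
    print(run([("add", 1), ("goto", 2), ("add", 2), ("add", 4)]))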
@@ -220,6 +231,7 @@ Invoke
 ^^^^^^
 **Arguments**:
 ::
+
   Index func_index

 Invoke function at ``func_index``, consumes the number of arguments contained in the VMFunction's

@@ -229,6 +241,7 @@ InvokeClosure
 ^^^^^^^^^^^^^
 **Arguments**:
 ::
+
   RegName closure
   Index num_closure_args
   RegName* closure_args

@@ -239,6 +252,7 @@ LoadConst
 ^^^^^^^^^
 **Arguments**:
 ::
+
   RegName dst
   Index const_index

@@ -248,6 +262,7 @@ LoadConsti
 ^^^^^^^^^^
 **Arguments**:
 ::
+
   Index val
   RegName dst

@@ -277,7 +292,7 @@ previous call. Registers are allocated in a continuous space (virtual register f
 We keep track of a set of Relay functions we have called, a pointer into its bytecode, and an offset into the bytecode (known as the program counter).

-::
+.. code-block:: c

   struct VirtualMachine {
     ...
@@ -331,6 +346,7 @@ Optimizations marked with `TODO` are not implemented yet.
 Serialization
 ~~~~~~~~~~~~~
+
 Serializing and deserializing the executable generated by the Relay VM compiler is a must as
 we may want to save the model to disk and perform inference later. Previously, Relay has produced
 a serialized form in a JSON file for the graph runtime. However, the same format is not directly

@@ -372,14 +388,17 @@ Unresolved Questions
 How do we handle dynamic shapes?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 TODO

 How can we modify the VM to support JIT compilation of certain code paths?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 In the code generation space there are still many tradeoffs to be analyzed, and the VM is designed
 to be very flexible so we can modify it for future experiments.

 How do we support heterogeneous execution?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 Heterogeneous execution should work out of the box assuming we have annotated the appropriate device copies.
 In order to do this properly we need to run the device annotation and copying passes.
@@ -215,6 +215,7 @@ This would result in a ``load-gemm-activate-store`` task pipeline which closely
 Adding more stages has a cost however: it can add storage and extra logic overhead, which is why we opted for a default 3-stage pipeline.

 .. _vta-uarch:
+
 Microarchitectural Overview
 ---------------------------
......
..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements. See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership. The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License. You may obtain a copy of the License at

..  http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied. See the License for the
    specific language governing permissions and limitations
    under the License.
VTA Hardware Design Overview
============================
@@ -16,32 +16,29 @@
 # under the License.
 """
 Building a Graph Convolutional Network
-=====================
+======================================
 **Author**: `Yulun Yao <https://yulunyao.io/>`_, \
             `Chien-Yu Lin <https://homes.cs.washington.edu/~cyulin/>`_

 This article is an introductory tutorial to build a Graph Convolutional Network (GCN) with Relay.
 In this tutorial, we will run our GCN on the Cora dataset to demonstrate.
 The Cora dataset is a common benchmark for Graph Neural Networks (GNN) and frameworks that support GNN training and inference.
 We directly load the dataset from the DGL library to do an apples-to-apples comparison against DGL.

 Please refer to DGL doc for DGL installation at
-https://docs.dgl.ai/install/index.html
-and refer to PyTorch guide for PyTorch installation at
-https://pytorch.org/get-started/locally/
+https://docs.dgl.ai/install/index.html.
+Please refer to PyTorch guide for PyTorch installation at
+https://pytorch.org/get-started/locally/.
 """
 ######################################################################
 # Define GCN in DGL with PyTorch backend
-# ------------------
+# --------------------------------------
 #
 # DGL example: https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcn
-# This part reuses the code from the above example
+# This part reuses the code from the above example.
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
@@ -78,7 +75,7 @@ class GCN(nn.Module):
 ######################################################################
 # Define the functions to load dataset and evaluate accuracy
-# ------------------
+# ----------------------------------------------------------
 # You may substitute this part with your own dataset, here we load data from DGL
 from dgl.data import load_data
 from collections import namedtuple
@@ -106,7 +103,7 @@ def evaluate(data, logits):
 ######################################################################
 # Load the data and set up model parameters
-# ------------------
+# -----------------------------------------
 """
 Parameters
 ----------
@@ -136,7 +133,7 @@ num_classes = data.num_labels
 ######################################################################
 # Set up the DGL-PyTorch model and get the golden results
-# ------------------
+# -------------------------------------------------------
 #
 # The weights are trained with https://github.com/dmlc/dgl/blob/master/examples/pytorch/gcn/train.py
 from tvm.contrib.download import download_testdata
@@ -162,7 +159,7 @@ torch_model.load_state_dict(torch.load(model_path))
 ######################################################################
 # Run the DGL model and test for accuracy
-# ------------------
+# ---------------------------------------
 torch_model.eval()
 with torch.no_grad():
     logits_torch = torch_model(features)
@@ -174,9 +171,8 @@ print("Test accuracy of DGL results: {:.2%}".format(acc))
 ######################################################################
 # Define Graph Convolution Layer in Relay
-# ----------------------------
+# ---------------------------------------
 # To run GCN on TVM, we first need to implement Graph Convolution Layer.
-#
 # You may refer to https://github.com/dmlc/dgl/blob/master/python/dgl/nn/mxnet/conv.py for a GraphConv Layer implemented in DGL with MXNet Backend
 #
 # The layer is defined with below operations, note that we apply two transposes to keep adjacency matrix on right hand side of sparse_dense operator,
@@ -251,7 +247,7 @@ def GraphConv(layer_name,
 ######################################################################
 # Prepare the parameters needed in the GraphConv layers
-# ------------------
+# -----------------------------------------------------
 #
 import numpy as np
 import networkx as nx
@@ -282,7 +278,7 @@ assert params['infeats'].shape[0] == params['indptr'].shape[0] - 1
 ######################################################################
 # Put layers together
-# ------------------
+# -------------------
 # Define input features, norms, adjacency matrix in Relay
 infeats = relay.var("infeats", shape=data.features.shape)
@@ -321,7 +317,8 @@ func = relay.Function(relay.analysis.free_vars(output), output)
 ######################################################################
 # Compile and run with TVM
-# ------------------
+# ------------------------
+#
 # Export the weights from PyTorch model to Python Dict
 model_params = {}
 for param_tensor in torch_model.state_dict():
@@ -345,7 +342,7 @@ m.set_input(**params)
 ######################################################################
 # Run the TVM model, test for accuracy and verify with DGL
-# ------------------
+# --------------------------------------------------------
 m.run()
 logits_tvm = m.get_output(0).asnumpy()
 print("Print the first five outputs from TVM execution\n", logits_tvm[:5])
......
@@ -16,13 +16,12 @@
 # under the License.
 """
 Compile TFLite Models
-===================
+=====================
 **Author**: `Zhao Wu <https://github.com/FrozenGene>`_

 This article is an introductory tutorial to deploy TFLite models with Relay.

 To get started, the Flatbuffers and TFLite packages need to be installed as prerequisites.
 A quick solution is to install Flatbuffers via pip

 .. code-block:: bash
@@ -68,7 +67,7 @@ Below you can find an example on how to compile TFLite model using TVM.
 """
 ######################################################################
 # Utils for downloading and extracting zip files
-# ---------------------------------------------
+# ----------------------------------------------
 import os

 def extract(path):
@@ -84,28 +83,28 @@ def extract(path):
 ######################################################################
 # Load pretrained TFLite model
-# ---------------------------------------------
-# we load mobilenet V1 TFLite model provided by Google
+# ----------------------------
+# Load mobilenet V1 TFLite model provided by Google
 from tvm.contrib.download import download_testdata

 model_url = "http://download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_1.0_224.tgz"

-# we download model tar file and extract, finally get mobilenet_v1_1.0_224.tflite
+# Download model tar file and extract it to get mobilenet_v1_1.0_224.tflite
 model_path = download_testdata(model_url, "mobilenet_v1_1.0_224.tgz", module=['tf', 'official'])
 model_dir = os.path.dirname(model_path)
 extract(model_path)

-# now we have mobilenet_v1_1.0_224.tflite on disk and open it
+# Now we can open mobilenet_v1_1.0_224.tflite
 tflite_model_file = os.path.join(model_dir, "mobilenet_v1_1.0_224.tflite")
 tflite_model_buf = open(tflite_model_file, "rb").read()

-# get TFLite model from buffer
+# Get TFLite model from buffer
 import tflite.Model
 tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
 ######################################################################
 # Load a test image
-# ---------------------------------------------
+# -----------------
 # A single cat dominates the examples!
 from PIL import Image
 from matplotlib import pyplot as plt
@@ -118,10 +117,10 @@ plt.imshow(resized_image)
 plt.show()
 image_data = np.asarray(resized_image).astype("float32")

-# after expand_dims, we have format NHWC
+# Add a dimension to the image so that we have NHWC format layout
 image_data = np.expand_dims(image_data, axis=0)

-# preprocess image as described here:
+# Preprocess image as described here:
 # https://github.com/tensorflow/models/blob/edb6ed22a801665946c63d650ab9a0b23d98e1b1/research/slim/preprocessing/inception_preprocessing.py#L243
 image_data[:, :, :, 0] = 2.0 / 255.0 * image_data[:, :, :, 0] - 1
 image_data[:, :, :, 1] = 2.0 / 255.0 * image_data[:, :, :, 1] - 1
@@ -130,50 +129,50 @@ print('input', image_data.shape)
 ######################################################################
 # Compile the model with relay
-# ---------------------------------------------
+# ----------------------------

 # TFLite input tensor name, shape and type
 input_tensor = "input"
 input_shape = (1, 224, 224, 3)
 input_dtype = "float32"

-# parse TFLite model and convert into Relay computation graph
+# Parse TFLite model and convert it to a Relay module
 from tvm import relay
 mod, params = relay.frontend.from_tflite(tflite_model,
                                          shape_dict={input_tensor: input_shape},
                                          dtype_dict={input_tensor: input_dtype})

-# target x86 CPU
+# Build the module against x86 CPU
 target = "llvm"
 with relay.build_config(opt_level=3):
     graph, lib, params = relay.build(mod, target, params=params)
 ######################################################################
 # Execute on TVM
-# ---------------------------------------------
+# --------------
 import tvm
 from tvm.contrib import graph_runtime as runtime

-# create a runtime executor module
+# Create a runtime executor module
 module = runtime.create(graph, lib, tvm.cpu())

-# feed input data
+# Feed input data
 module.set_input(input_tensor, tvm.nd.array(image_data))

-# feed related params
+# Feed related params
 module.set_input(**params)

-# run
+# Run
 module.run()

-# get output
+# Get output
 tvm_output = module.get_output(0).asnumpy()
 ######################################################################
 # Display results
-# ---------------------------------------------
+# ---------------

-# load label file
+# Load label file
 label_file_url = ''.join(['https://raw.githubusercontent.com/',
                           'tensorflow/tensorflow/master/tensorflow/lite/java/demo/',
                           'app/src/main/assets/',
@@ -181,15 +180,15 @@ label_file_url = ''.join(['https://raw.githubusercontent.com/',
 label_file = "labels_mobilenet_quant_v1_224.txt"
 label_path = download_testdata(label_file_url, label_file, module='data')

-# list of 1001 classes
+# List of 1001 classes
 with open(label_path) as f:
     labels = f.readlines()

-# convert result to 1D data
+# Convert result to 1D data
 predictions = np.squeeze(tvm_output)

-# get top 1 prediction
+# Get top 1 prediction
 prediction = np.argmax(predictions)

-# convert id to class name and show the result
+# Convert id to class name and show the result
 print("The image prediction result is: id " + str(prediction) + " name: " + labels[prediction])
@@ -16,7 +16,7 @@
 # under the License.
 """
 Using External Libraries in Relay
-================================
+=================================
 **Author**: `Masahiro Masuda <https://github.com/masahi>`_, `Truman Tian <https://github.com/SiNZeRo>`_

 This is a short tutorial on how to use external libraries such as cuDNN or cuBLAS with Relay.
......
@@ -56,7 +56,7 @@ print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True))
 # operators, and the inputs will collaborate together, e.g. :code:`argmax`.
 # In the reduction procedure, :code:`argmax` needs to compare the values of
 # the operands and also keep their indices. It can be expressed
-# with :any:`comm_reducer` as below:
+# with :py:func:`tvm.comm_reducer` as below:

 # x and y are the operands of reduction; both of them are tuples of index
 # and value.
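 # For reference, such a reducer can be written as follows; this is a sketch
 # against the pre-0.7 tvm API, not necessarily the exact definition used in
 # the tutorial (named argmax_sketch to avoid clashing with it).
 argmax_sketch = tvm.comm_reducer(
     # Combine two (index, value) pairs, keeping the one with the larger value.
     lambda x, y: (tvm.expr.Select(x[1] >= y[1], x[0], y[0]),
                   tvm.expr.Select(x[1] >= y[1], x[1], y[1])),
     # Identity element: index -1 paired with the smallest representable value.
     lambda t0, t1: (tvm.const(-1, t0), tvm.min_value(t1)),
     name='argmax_sketch')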
......
@@ -18,7 +18,7 @@
 .. _opt-conv-tensorcore:

 How to optimize convolution using TensorCores
-==================================
+=============================================
 **Author**: `Siyuan Feng <https://github.com/Hzfengsy>`_

 In this tutorial, we will demonstrate how to write a high performance convolution
@@ -29,7 +29,7 @@ convolution has a large batch. We strongly recommend covering the :ref:`opt-conv
 ################################################################
 # TensorCore Introduction
-# -------------------------
+# -----------------------
 # Each Tensor Core provides a 4x4x4 matrix processing array that operates
 # :code:`D = A * B + C`, where A, B, C and D are 4x4 matrices as Figure shows.
 # The matrix multiplication inputs A and B are FP16 matrices, while the accumulation
@@ -45,7 +45,7 @@ convolution has a large batch. We strongly recommend covering the :ref:`opt-conv
 ################################################################
 # Preparation and Algorithm
-# --------------------------
+# -------------------------
 # We use a fixed size for the input tensors, with 256 channels and 14 x 14 dimensions.
 # The batch size is 256. Convolution filters contain 512 filters of size 3 x 3.
 # We use stride size 1 and padding size 1 for the convolution. In the example, we use
@@ -126,8 +126,7 @@ s[Apad].compute_inline()
 ###############################################################################
 # Memory Scope
-# ----------------
-#
+# ------------
 # In a traditional GPU schedule, we have global, shared and local memory scopes.
 # To support TensorCores, we add three more special memory scopes: :code:`wmma.matrix_a`,
 # :code:`wmma.matrix_b` and :code:`wmma.accumulator`. On hardware, all fragment scopes
@@ -142,6 +141,7 @@ ConvF = s.cache_write(Conv, 'wmma.accumulator')
 ###############################################################################
 # Define Tensor Intrinsic
+# -----------------------
 # In fact, TensorCore is a special hardware operation. So, we can just use tensorize
 # to replace a unit of computation with the TensorCore instruction. The first thing is
 # that we need to define the tensor intrinsic.
@@ -246,7 +246,6 @@ def intrin_wmma_store_matrix():
 # easiest way to solve this. Then we can bind threadIdx.x to any loops except those that contain
 # TensorCore intrinsics directly or indirectly. Also note that it is not the unique solution.
 # The only thing we should do is to make sure all threads in a warp can call TensorCore at the same time.
-#

 # Define tiling sizes
 block_row_warps = 4
@@ -312,10 +311,9 @@ print(tvm.lower(s, [A, W, Conv], simple_mode=True))
 ###############################################################################
 # Lowering Computation to Intrinsics
-# --------------------------
+# ----------------------------------
 # The last phase is to lower the computation loops down to TensorCore hardware intrinsics
 # by mapping the 2D convolution to tensor intrinsics
-#

 s[AF].tensorize(AF.op.axis[-2], intrin_wmma_load_matrix('wmma.matrix_a'))
 s[WF].tensorize(WF.op.axis[-2], intrin_wmma_load_matrix('wmma.matrix_b'))
@@ -344,5 +342,6 @@ if nvcc.have_tensorcore(ctx.compute_version):
 ###############################################################################
 # Summary
+# -------
 # This tutorial demonstrates how TVM scheduling primitives can be used to
 # call TensorCores on specific GPUs.
@@ -232,7 +232,6 @@ print(tvm.lower(s, [A, B, C], simple_mode=True))
 #
 # .. image:: https://github.com/dmlc/web-data/raw/master/tvm/tutorial/array-packing.png
 #      :align: center
-#      :scale: 100%
 #
......
@@ -18,7 +18,7 @@
 .. _opt-matmul-auto-tensorcore:

 How to optimize matmul with Auto TensorCore CodeGen
-==================================
+===================================================
 **Author**: `Minmin Sun <https://github.com/minminsun>`_, \
             `Lanbo Li <https://github.com/Orion34C>`_, \
             `Chenfan Jia <https://github.com/jcf94>`_, \
@@ -31,12 +31,11 @@ with most transformations done in ir passes.
 Users can also write schedules with tensorization to generate TensorCore code.
 Both solutions use the same tensorcore intrinsics.
 Please refer to the :ref:`opt-conv-tensorcore` tutorial for more details.
 """

 ################################################################
 # Preparation and Algorithm
-# --------------------------
+# -------------------------
 # Two kinds of input data types are supported: float16 and int8.
 # For float16, the accumulator is float32.
 # For int8, the accumulator is int32.
@@ -215,7 +214,7 @@ def test_gemm(N, L, M, dtype, layout):
 ###############################################################################
 # AutoTune and Test
-# --------------------
+# -----------------
 # Finally we use a tuner to tune the schedule, generate code with the best config,
 # and run the kernel to compare with numpy to check whether the results are correct.
############################################################################### ###############################################################################
# Summary # Summary
# -------------------------- # -------
# This tutorial demonstrates how to use the AutoTensorCoreCodeGen of TVM # This tutorial demonstrates how to use the AutoTensorCoreCodeGen of TVM
# to generate tensorcore kernels. # to generate tensorcore kernels.
@@ -18,7 +18,7 @@
 .. _tutorial-relay-quick-start:

 Quick Start Tutorial for Compiling Deep Learning Models
-======================================================
+=======================================================
 **Author**: `Yao Wang <https://github.com/kevinthesun>`_, `Truman Tian <https://github.com/SiNZeRo>`_

 This example shows how to build a neural network with Relay python frontend and
@@ -33,7 +33,6 @@ Notice that you need to build TVM with cuda and llvm enabled.
 #
 # .. image:: https://github.com/dmlc/web-data/raw/master/tvm/tutorial/tvm_support_list.png
 #      :align: center
-#      :scale: 100%
 #
 # In this tutorial, we'll choose cuda and llvm as target backends.
 # To begin with, let's import Relay and TVM.
@@ -47,7 +46,7 @@ from tvm.contrib import graph_runtime
 ######################################################################
 # Define Neural Network in Relay
-# -----------------------------
+# ------------------------------
 # First, let's define a neural network with relay python frontend.
 # For simplicity, we'll use pre-defined resnet-18 network in Relay.
 # Parameters are initialized with Xavier initializer.
......
-.. _tutorial-frontend:
+.. _vta-tutorial-frontend:

 Compile Deep Learning Models
 ----------------------------
@@ -94,7 +94,7 @@ assert model in pack_dict
 ######################################################################
 # Obtain an execution remote
-# ---------------------------------
+# --------------------------
 # When target is 'pynq', reconfigure FPGA and runtime.
 # Otherwise, if target is 'sim', execute locally.
@@ -136,15 +136,16 @@ ctx = remote.ext_dev(0) if device == "vta" else remote.cpu(0)
 # ---------------------------------
 # Grab vision model from Gluon model zoo and compile with Relay.
 # The compilation steps are:
-# 1) Front end translation from MxNet into Relay module.
-# 2) Apply 8-bit quantization: here we skip the first conv layer,
-#    and dense layer which will both be executed in fp32 on the CPU.
-# 3) Perform graph packing to alter the data layout for tensorization.
-# 4) Perform constant folding to reduce number of operators (e.g. eliminate
-#    batch norm multiply).
-# 5) Perform relay build to object file.
-# 6) Load the object file onto remote (FPGA device).
-# 7) Generate graph runtime, `m`.
+#
+# 1. Front end translation from MxNet into Relay module.
+# 2. Apply 8-bit quantization: here we skip the first conv layer,
+#    and dense layer which will both be executed in fp32 on the CPU.
+# 3. Perform graph packing to alter the data layout for tensorization.
+# 4. Perform constant folding to reduce number of operators (e.g. eliminate batch norm multiply).
+# 5. Perform relay build to object file.
+# 6. Load the object file onto remote (FPGA device).
+# 7. Generate graph runtime, `m`.
+#

 # Load pre-configured AutoTVM schedules
 with autotvm.tophub.context(target):
......