Commit fe51c498 authored by Mercy, committed by Tianqi Chen

[DOC] Fix typos in tutorials (#287)

parent cf2f5197
...@@ -47,7 +47,7 @@ This specifies an out of source build using the MSVC 12 64 bit generator. Open t ...@@ -47,7 +47,7 @@ This specifies an out of source build using the MSVC 12 64 bit generator. Open t
### Customized Building ### Customized Building
The configuration of tvm can be modified by ```config.mk``` The configuration of tvm can be modified by ```config.mk```
- First copy make/config.mk to the project root, on which - First copy ```make/config.mk``` to the project root, on which
any local modification will be ignored by git, then modify the according flags. any local modification will be ignored by git, then modify the according flags.
- TVM optionally depends on LLVM. LLVM is required for CPU codegen that needs LLVM. - TVM optionally depends on LLVM. LLVM is required for CPU codegen that needs LLVM.
- LLVM 4.0 is needed for build with LLVM - LLVM 4.0 is needed for build with LLVM
......
...@@ -3,12 +3,12 @@ External Tensor Functions ...@@ -3,12 +3,12 @@ External Tensor Functions
========================= =========================
**Author**: `Tianqi Chen <https://tqchen.github.io>`_ **Author**: `Tianqi Chen <https://tqchen.github.io>`_
While tvm support transparent code generation, sometimes While TVM supports transparent code generation, sometimes
it is also helpful to incorporate manual written code into it is also helpful to incorporate manual written code into
the pipeline. For example, we might want to use cuDNN for the pipeline. For example, we might want to use cuDNN for
some of the convolution kernels and define the rest of the stages. some of the convolution kernels and define the rest of the stages.
TVM support these black box function calls natively. TVM supports these black box function calls natively.
Specfically, tvm support all the tensor functions that are DLPack compatible. Specfically, tvm support all the tensor functions that are DLPack compatible.
Which means we can call any function with POD types(pointer, int, float) Which means we can call any function with POD types(pointer, int, float)
or pointer to DLTensor as argument. or pointer to DLTensor as argument.
...@@ -27,12 +27,12 @@ from tvm.contrib import cblas ...@@ -27,12 +27,12 @@ from tvm.contrib import cblas
# of output tensors. In the second argument we provide the list of inputs. # of output tensors. In the second argument we provide the list of inputs.
# #
# User will need to provide a function describing how to compute the result. # User will need to provide a function describing how to compute the result.
# The compute function takes list of symbolic are placeholder for the inputs, # The compute function takes list of symbolic placeholder for the inputs,
# list of symbolic placeholder for the outputs and returns the executing statement. # list of symbolic placeholder for the outputs and returns the executing statement.
# #
# In this case we simply call a registered tvm function, which invokes a CBLAS call. # In this case we simply call a registered tvm function, which invokes a CBLAS call.
# TVM do not control internal of the extern array function and treats it as blackbox. # TVM does not control internal of the extern array function and treats it as blackbox.
# We can further mix schedulable TVM calls that add a bias to term to the result. # We can further mix schedulable TVM calls that add a bias term to the result.
# #
n = 1024 n = 1024
l = 128 l = 128
...@@ -103,7 +103,7 @@ np.testing.assert_allclose(b.asnumpy(), a.asnumpy() + 1, rtol=1e-5) ...@@ -103,7 +103,7 @@ np.testing.assert_allclose(b.asnumpy(), a.asnumpy() + 1, rtol=1e-5)
###################################################################### ######################################################################
# Summary # Summary
# ------- # -------
# - TVM call extern tensor function via :any:`tvm.extern` # - TVM calls extern tensor function via :any:`tvm.extern`
# - Use contrib wrappers for short sugars of extern tensor calls. # - Use contrib wrappers for short sugars of extern tensor calls.
# - We can hook front-end function as extern tensor callbacks. # - We can hook front-end function as extern tensor callbacks.
# #
...@@ -84,7 +84,7 @@ s = tvm.create_schedule(C.op) ...@@ -84,7 +84,7 @@ s = tvm.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64) bx, tx = s[C].split(C.op.axis[0], factor=64)
###################################################################### ######################################################################
# Finally we bind the iteratio axis bx and tx to threads in the GPU # Finally we bind the iteration axis bx and tx to threads in the GPU
# compute grid. These are GPU specific constructs that allows us # compute grid. These are GPU specific constructs that allows us
# to generate code that runs on GPU. # to generate code that runs on GPU.
# #
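As a rough illustration of what the `split(..., factor=64)` in this hunk does to the iteration space, the sketch below maps a flat index to an outer/inner pair in plain Python, without TVM. The helper names `split_index` and `fused_index` are hypothetical and not part of the TVM API; the sketch also assumes the extent is a multiple of the factor, so it omits the bounds guard that would otherwise be needed.

```python
FACTOR = 64  # same factor as in the hunk above


def split_index(i, factor=FACTOR):
    """Map a flat index i to (outer, inner), mirroring a split by `factor`."""
    return i // factor, i % factor


def fused_index(outer, inner, factor=FACTOR):
    """Inverse mapping: recover the flat index from (outer, inner)."""
    return outer * factor + inner


n = 1024
# Iterate the way a GPU grid would: outer plays the role of bx (block),
# inner plays the role of tx (thread within the block).
covered = sorted(
    fused_index(bx, tx) for bx in range(n // FACTOR) for tx in range(FACTOR)
)
```

Every element of the original axis is visited exactly once, which is why binding `bx` and `tx` to block and thread indices still computes the full result.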
...@@ -120,7 +120,7 @@ fadd_cuda = tvm.build(s, [A, B, C], "cuda", target_host="llvm", name="myadd") ...@@ -120,7 +120,7 @@ fadd_cuda = tvm.build(s, [A, B, C], "cuda", target_host="llvm", name="myadd")
# The array API is based on `DLPack <https://github.com/dmlc/dlpack>`_ standard. # The array API is based on `DLPack <https://github.com/dmlc/dlpack>`_ standard.
# #
# - We first create a gpu context. # - We first create a gpu context.
# - Then tvm.nd.array copies the data to cpu. # - Then tvm.nd.array copies the data to gpu.
# - fadd runs the actual computation. # - fadd runs the actual computation.
# - asnumpy() copies the gpu array back to cpu and we can use this to verify correctness # - asnumpy() copies the gpu array back to cpu and we can use this to verify correctness
# #
...@@ -153,9 +153,9 @@ print(dev_module.get_source()) ...@@ -153,9 +153,9 @@ print(dev_module.get_source())
# to pass only single shape argument to the kernel, as you will find in # to pass only single shape argument to the kernel, as you will find in
# the printed device code. This is one form of specialization. # the printed device code. This is one form of specialization.
# #
# On the host side, TVM will automatically generate check codes # On the host side, TVM will automatically generate check code
# that checks the constraints in the parameters. So if you pass # that checks the constraints in the parameters. So if you pass
# arrays with different shape into the fadd, an error will be raised. # arrays with different shapes into the fadd, an error will be raised.
# #
# We can do more specializations. For example, we can write # We can do more specializations. For example, we can write
# :code:`n = tvm.convert(1024)` instead of :code:`n = tvm.var("n")`, # :code:`n = tvm.convert(1024)` instead of :code:`n = tvm.var("n")`,
...@@ -166,7 +166,7 @@ print(dev_module.get_source()) ...@@ -166,7 +166,7 @@ print(dev_module.get_source())
###################################################################### ######################################################################
# Save Compiled Module # Save Compiled Module
# -------------------- # --------------------
# Besides runtime compilation, we can save the compiled module into # Besides runtime compilation, we can save the compiled modules into
# file and load them back later. This is called ahead of time compilation. # file and load them back later. This is called ahead of time compilation.
# #
# The following code first does the following step: # The following code first does the following step:
...@@ -210,7 +210,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy()) ...@@ -210,7 +210,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
# Pack Everything into One Library # Pack Everything into One Library
# -------------------------------- # --------------------------------
# In the above example, we store the device and host code seperatedly. # In the above example, we store the device and host code seperatedly.
# TVM also support export everything as one shared library. # TVM also supports export everything as one shared library.
# Under the hood, we pack the device modules into binary blobs and link # Under the hood, we pack the device modules into binary blobs and link
# them together with the host code. # them together with the host code.
# Currently we support packing of Metal, OpenCL and CUDA modules. # Currently we support packing of Metal, OpenCL and CUDA modules.
...@@ -225,7 +225,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy()) ...@@ -225,7 +225,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
# #
# The compiled modules of TVM do not depend on the TVM compiler. # The compiled modules of TVM do not depend on the TVM compiler.
# Instead, it only depends on a minimum runtime library. # Instead, it only depends on a minimum runtime library.
# TVM runtime library wraps the device drivers and provide # TVM runtime library wraps the device drivers and provides
# thread-safe and device agnostic call into the compiled functions. # thread-safe and device agnostic call into the compiled functions.
# #
# This means you can call the compiled TVM function from any thread, # This means you can call the compiled TVM function from any thread,
......
...@@ -3,8 +3,8 @@ Intrinsics and Math Functions ...@@ -3,8 +3,8 @@ Intrinsics and Math Functions
============================= =============================
**Author**: `Tianqi Chen <https://tqchen.github.io>`_ **Author**: `Tianqi Chen <https://tqchen.github.io>`_
While tvm support basic arithmetic operations. In many cases While TVM supports basic arithmetic operations. In many cases
usually we will need more complicated buildin functions. usually we will need more complicated builtin functions.
For example :code:`exp` to take the exponetial of the function. For example :code:`exp` to take the exponetial of the function.
These functions are target system dependent and may have different These functions are target system dependent and may have different
...@@ -135,7 +135,7 @@ print(fcuda.imported_modules[0].get_source()) ...@@ -135,7 +135,7 @@ print(fcuda.imported_modules[0].get_source())
###################################################################### ######################################################################
# Summary # Summary
# ------- # -------
# - TVM call call extern target dependent math function. # - TVM can call extern target dependent math function.
# - Use intrinsic to defined a unified interface for the functions. # - Use intrinsic to defined a unified interface for the functions.
# - For more intrinsics available in tvm, take a look at :any:`tvm.intrin` # - For more intrinsics available in tvm, take a look at :any:`tvm.intrin`
# - You can customize the intrinsic behavior by defining your own rules. # - You can customize the intrinsic behavior by defining your own rules.
......
...@@ -9,7 +9,7 @@ algorithm in high-performance schedule breaks the algorithm's readability and mo ...@@ -9,7 +9,7 @@ algorithm in high-performance schedule breaks the algorithm's readability and mo
trying various seemingly promising schedules is time-consuming. With the help of TVM, we can trying various seemingly promising schedules is time-consuming. With the help of TVM, we can
try these schedules efficiently to enhance the performance. try these schedules efficiently to enhance the performance.
In this tutorial, we will demonstrate how squre matrix multiplication is optimized step by step by In this tutorial, we will demonstrate how square matrix multiplication is optimized step by step by
writing TVM. writing TVM.
There are two important optmizations on intense computation applications executed on CPU: There are two important optmizations on intense computation applications executed on CPU:
...@@ -25,14 +25,14 @@ Actually, all the methodologies used in this tutorial is a subset of tricks ment ...@@ -25,14 +25,14 @@ Actually, all the methodologies used in this tutorial is a subset of tricks ment
`repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by TVM `repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by TVM
abstraction automatically, but some of them cannot be simply applied due to TVM constraints. abstraction automatically, but some of them cannot be simply applied due to TVM constraints.
All the experiment results mentioned below, are executed on 2013's 15' MacBook equiped All the experiment results mentioned below, are executed on 2013's 15' MacBook equiped with
Intel i7-2760QM CPU. The cache line size should be 64 bytes for all the x86 CPU. Intel i7-2760QM CPU. The cache line size should be 64 bytes for all the x86 CPU.
""" """
############################################################################### ###############################################################################
# Preparation and Baseline # Preparation and Baseline
# ------------------------ # ------------------------
# In this tutorial we assume all the matrix tensors are squre and fix-bounded. # In this tutorial we assume all the matrix tensors are square and fix-bounded.
# We use 1024x1024 float32 matrix in demonstration. Before actually demonstrating, # We use 1024x1024 float32 matrix in demonstration. Before actually demonstrating,
# we first define these variables. Then we write a baseline implementation, # we first define these variables. Then we write a baseline implementation,
# the simplest way to write a matrix mulplication in TVM. # the simplest way to write a matrix mulplication in TVM.
...@@ -42,7 +42,7 @@ import tvm ...@@ -42,7 +42,7 @@ import tvm
import numpy import numpy
import time import time
# The size of the squre matrix # The size of the square matrix
N = 1024 N = 1024
# The default tensor type in tvm # The default tensor type in tvm
dtype = "float32" dtype = "float32"
...@@ -152,8 +152,8 @@ print('Opt3: %f' % evaluator(a, b, c).mean) ...@@ -152,8 +152,8 @@ print('Opt3: %f' % evaluator(a, b, c).mean)
################################################################################################## ##################################################################################################
# Summary # Summary
# ------- # -------
# After applying three main tricks, we can almost 90% performance of numpy. Further observation is # After applying three main tricks, we can achieve almost 90% performance of numpy.
# required to catch up with the performance of numpy. # Further observation is required to catch up with the performance of numpy.
# #
# TODO(Jian Weng): Catch up with the performance of numpy. # TODO(Jian Weng): Catch up with the performance of numpy.
......
...@@ -20,7 +20,7 @@ import numpy as np ...@@ -20,7 +20,7 @@ import numpy as np
# Assume we want to compute sum of rows as our example. # Assume we want to compute sum of rows as our example.
# In numpy semantics this can be written as :code:`B = numpy.sum(A, axis=1)` # In numpy semantics this can be written as :code:`B = numpy.sum(A, axis=1)`
# #
# The following lines describes the row sum operation. # The following lines describe the row sum operation.
# To create a reduction formula, we declare a reduction axis using # To create a reduction formula, we declare a reduction axis using
# :any:`tvm.reduce_axis`. :any:`tvm.reduce_axis` takes in the range of reductions. # :any:`tvm.reduce_axis`. :any:`tvm.reduce_axis` takes in the range of reductions.
# :any:`tvm.sum` takes in the expression to be reduced as well as the reduction # :any:`tvm.sum` takes in the expression to be reduced as well as the reduction
...@@ -65,8 +65,8 @@ print(tvm.lower(s, [A, B], simple_mode=True)) ...@@ -65,8 +65,8 @@ print(tvm.lower(s, [A, B], simple_mode=True))
###################################################################### ######################################################################
# If we are building a GPU kernel, we can bind the rows of B to GPU threads. # If we are building a GPU kernel, we can bind the rows of B to GPU threads.
s[B.op].bind(xo, tvm.thread_axis("blockIdx.x")) s[B].bind(xo, tvm.thread_axis("blockIdx.x"))
s[B.op].bind(xi, tvm.thread_axis("threadIdx.x")) s[B].bind(xi, tvm.thread_axis("threadIdx.x"))
print(tvm.lower(s, [A, B], simple_mode=True)) print(tvm.lower(s, [A, B], simple_mode=True))
###################################################################### ######################################################################
...@@ -96,18 +96,18 @@ print(s[B].op.body) ...@@ -96,18 +96,18 @@ print(s[B].op.body)
# Cross Thread Reduction # Cross Thread Reduction
# ---------------------- # ----------------------
# We can now parallelize over the factored axis. # We can now parallelize over the factored axis.
# Here mark the reduction axis of B is marked to be a thread. # Here the reduction axis of B is marked to be a thread.
# tvm allow reduction axis to be marked as thread if it is the only # TVM allows reduction axis to be marked as thread if it is the only
# axis in reduction and cross thread reduction is possible in the device. # axis in reduction and cross thread reduction is possible in the device.
# #
# This is indeed the case after the factoring. # This is indeed the case after the factoring.
# We can directly compute BF at the reduction axis as well. # We can directly compute BF at the reduction axis as well.
# The final generated kernel will divides the rows by blockIdx.x and threadIdx.y # The final generated kernel will divide the rows by blockIdx.x and threadIdx.y
# columns by threadIdx.x and finally do a cross thread reduction over threadIdx.x # columns by threadIdx.x and finally do a cross thread reduction over threadIdx.x
# #
xo, xi = s[B].split(s[B].op.axis[0], factor=32) xo, xi = s[B].split(s[B].op.axis[0], factor=32)
s[B.op].bind(xo, tvm.thread_axis("blockIdx.x")) s[B].bind(xo, tvm.thread_axis("blockIdx.x"))
s[B.op].bind(xi, tvm.thread_axis("threadIdx.y")) s[B].bind(xi, tvm.thread_axis("threadIdx.y"))
s[B].bind(s[B].op.reduce_axis[0], tvm.thread_axis("threadIdx.x")) s[B].bind(s[B].op.reduce_axis[0], tvm.thread_axis("threadIdx.x"))
s[BF].compute_at(s[B], s[B].op.reduce_axis[0]) s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
fcuda = tvm.build(s, [A, B], "cuda") fcuda = tvm.build(s, [A, B], "cuda")
......
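The cross thread reduction described in the last hunk can be sketched in plain Python, without TVM. This is only a model of the idea, under the assumption that each "thread" accumulates a strided slice of the row (standing in for the factored stage `BF`) and the partial sums are then combined; `row_sum_factored` and `num_threads` are hypothetical names chosen for illustration.

```python
def row_sum_factored(A, num_threads=16):
    """Row sum computed as per-thread partial sums followed by a final combine."""
    result = []
    for row in A:
        # Thread t handles elements k with k % num_threads == t.
        partial = [sum(row[t::num_threads]) for t in range(num_threads)]
        # The cross thread reduction combines the per-thread partial sums.
        result.append(sum(partial))
    return result


A = [list(range(r, r + 40)) for r in range(3)]
expected = [sum(row) for row in A]
actual = row_sum_factored(A)
```

The factored version gives the same answer as the direct row sum because addition is associative and commutative, which is what makes it safe to regroup the reduction across threads.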
...@@ -81,7 +81,7 @@ np.testing.assert_allclose(b.asnumpy(), np.cumsum(a_np, axis=0)) ...@@ -81,7 +81,7 @@ np.testing.assert_allclose(b.asnumpy(), np.cumsum(a_np, axis=0))
# computation stage in s_update. It is possible to use multiple # computation stage in s_update. It is possible to use multiple
# Tensor stages in the scan cell. # Tensor stages in the scan cell.
# #
# The following lines demonstrates a scan with two stage operations # The following lines demonstrate a scan with two stage operations
# in the scan cell. # in the scan cell.
# #
m = tvm.var("m") m = tvm.var("m")
...@@ -108,7 +108,7 @@ print(tvm.lower(s, [X, s_scan], simple_mode=True)) ...@@ -108,7 +108,7 @@ print(tvm.lower(s, [X, s_scan], simple_mode=True))
# --------------- # ---------------
# For complicated applications like RNN, we might need more than one # For complicated applications like RNN, we might need more than one
# recurrent state. Scan support multiple recurrent states. # recurrent state. Scan support multiple recurrent states.
# The following example demonstrate how we can build recurrence with two states. # The following example demonstrates how we can build recurrence with two states.
# #
m = tvm.var("m") m = tvm.var("m")
n = tvm.var("n") n = tvm.var("n")
......
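For readers unfamiliar with scan, the following plain Python sketch models the single-state recurrence that the tutorial checks against `np.cumsum(a_np, axis=0)`: the first row initializes the state, and each later step adds the current input row to the previous state. The function name `scan_cumsum` is hypothetical and the sketch does not use the TVM API.

```python
def scan_cumsum(X):
    """Cumulative sum along axis 0, written as an init step plus an update step."""
    if not X:
        return []
    # Init: the state at t = 0 is the first input row.
    state = [list(X[0])]
    # Update: the state at t depends on the state at t - 1 and the input at t.
    for t in range(1, len(X)):
        prev = state[t - 1]
        state.append([prev[i] + X[t][i] for i in range(len(X[t]))])
    return state


out = scan_cumsum([[1, 2], [3, 4], [5, 6]])
```

A recurrence with several states follows the same pattern, with one init and one update rule per state.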
...@@ -30,7 +30,7 @@ m = tvm.var('m') ...@@ -30,7 +30,7 @@ m = tvm.var('m')
###################################################################### ######################################################################
# A schedule can be created from a list of ops, by default the # A schedule can be created from a list of ops, by default the
# schedule compute tensor in a serial manner in a row-major order. # schedule computes tensor in a serial manner in a row-major order.
# declare a matrix element-wise multiply # declare a matrix element-wise multiply
A = tvm.placeholder((m, n), name='A') A = tvm.placeholder((m, n), name='A')
...@@ -182,7 +182,7 @@ print(tvm.lower(s, [A, B, C], simple_mode=True)) ...@@ -182,7 +182,7 @@ print(tvm.lower(s, [A, B, C], simple_mode=True))
# tvm, which permits users schedule the computation easily and # tvm, which permits users schedule the computation easily and
# flexibly. # flexibly.
# #
# In order to get an good performance kernel implementation, the # In order to get a good performance kernel implementation, the
# general workflow often is: # general workflow often is:
# #
# - Describe your computation via series of operations. # - Describe your computation via series of operations.
......
...@@ -36,7 +36,7 @@ print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True)) ...@@ -36,7 +36,7 @@ print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True))
# #
# Describe Reduction with Collaborative Inputs # Describe Reduction with Collaborative Inputs
# -------------------------------------------- # --------------------------------------------
# Sometimes, we requires multiple inputs to express some reduction # Sometimes, we require multiple inputs to express some reduction
# operators, and the inputs will collaborate together, e.g. :code:`argmax`. # operators, and the inputs will collaborate together, e.g. :code:`argmax`.
# In the reduction procedure, :code:`argmax` need to compare the value of # In the reduction procedure, :code:`argmax` need to compare the value of
# operands, also need to keep the index of operand. It can be expressed # operands, also need to keep the index of operand. It can be expressed
......
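To make the "collaborative inputs" idea concrete, here is a minimal plain Python sketch of `argmax` as a reduction over (index, value) pairs. It does not use the TVM API; the names `argmax_combine` and `argmax` and the choice of identity element are assumptions made for illustration.

```python
from functools import reduce


def argmax_combine(x, y):
    """Combine two (index, value) pairs, keeping the one with the larger value.

    On ties the left operand wins, so the earliest index is kept.
    """
    return x if x[1] >= y[1] else y


def argmax(values):
    """Index of the largest value, or -1 for an empty input."""
    identity = (-1, float("-inf"))
    return reduce(argmax_combine, enumerate(values), identity)[0]


best = argmax([3, 9, 2, 9])
```

The index and the value have to travel together through the reduction: the comparison uses the value, while the result that is kept is the index.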