Skip to content
Projects
Groups
Snippets
Help
This project
Loading...
Sign in / Register
Toggle navigation
T
tic
Overview
Overview
Details
Activity
Cycle Analytics
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Charts
Issues
0
Issues
0
List
Board
Labels
Milestones
Merge Requests
0
Merge Requests
0
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Charts
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Charts
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
wenyuanbo
tic
Commits
fe51c498
Commit
fe51c498
authored
Aug 01, 2017
by
Mercy
Committed by
Tianqi Chen
Jul 31, 2017
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
[DOC] Fix typos in tutorials (#287)
parent
cf2f5197
Show whitespace changes
Inline
Side-by-side
Showing
9 changed files
with
36 additions
and
36 deletions
+36
-36
docs/how_to/install.md
+1
-1
tutorials/python/extern_op.py
+6
-6
tutorials/python/get_started.py
+7
-7
tutorials/python/intrin_math.py
+3
-3
tutorials/python/opt_gemm.py
+6
-6
tutorials/python/reduction.py
+8
-8
tutorials/python/scan.py
+2
-2
tutorials/python/schedule_primitives.py
+2
-2
tutorials/python/tuple_inputs.py
+1
-1
No files found.
docs/how_to/install.md
View file @
fe51c498
...
@@ -47,7 +47,7 @@ This specifies an out of source build using the MSVC 12 64 bit generator. Open t
...
@@ -47,7 +47,7 @@ This specifies an out of source build using the MSVC 12 64 bit generator. Open t
### Customized Building
### Customized Building
The configuration of tvm can be modified by
```config.mk```
The configuration of tvm can be modified by
```config.mk```
-
First copy
make/config.mk
to the project root, on which
-
First copy
```make/config.mk```
to the project root, on which
any local modification will be ignored by git, then modify the according flags.
any local modification will be ignored by git, then modify the according flags.
-
TVM optionally depends on LLVM. LLVM is required for CPU codegen that needs LLVM.
-
TVM optionally depends on LLVM. LLVM is required for CPU codegen that needs LLVM.
-
LLVM 4.0 is needed for build with LLVM
-
LLVM 4.0 is needed for build with LLVM
...
...
tutorials/python/extern_op.py
View file @
fe51c498
...
@@ -3,12 +3,12 @@ External Tensor Functions
...
@@ -3,12 +3,12 @@ External Tensor Functions
=========================
=========================
**Author**: `Tianqi Chen <https://tqchen.github.io>`_
**Author**: `Tianqi Chen <https://tqchen.github.io>`_
While
tvm support
transparent code generation, sometimes
While
TVM supports
transparent code generation, sometimes
it is also helpful to incorporate manual written code into
it is also helpful to incorporate manual written code into
the pipeline. For example, we might want to use cuDNN for
the pipeline. For example, we might want to use cuDNN for
some of the convolution kernels and define the rest of the stages.
some of the convolution kernels and define the rest of the stages.
TVM support these black box function calls natively.
TVM support
s
these black box function calls natively.
Specfically, tvm support all the tensor functions that are DLPack compatible.
Specfically, tvm support all the tensor functions that are DLPack compatible.
Which means we can call any function with POD types(pointer, int, float)
Which means we can call any function with POD types(pointer, int, float)
or pointer to DLTensor as argument.
or pointer to DLTensor as argument.
...
@@ -27,12 +27,12 @@ from tvm.contrib import cblas
...
@@ -27,12 +27,12 @@ from tvm.contrib import cblas
# of output tensors. In the second argument we provide the list of inputs.
# of output tensors. In the second argument we provide the list of inputs.
#
#
# User will need to provide a function describing how to compute the result.
# User will need to provide a function describing how to compute the result.
# The compute function takes list of symbolic
are
placeholder for the inputs,
# The compute function takes list of symbolic placeholder for the inputs,
# list of symbolic placeholder for the outputs and returns the executing statement.
# list of symbolic placeholder for the outputs and returns the executing statement.
#
#
# In this case we simply call a registered tvm function, which invokes a CBLAS call.
# In this case we simply call a registered tvm function, which invokes a CBLAS call.
# TVM do not control internal of the extern array function and treats it as blackbox.
# TVM do
es
not control internal of the extern array function and treats it as blackbox.
# We can further mix schedulable TVM calls that add a bias t
o t
erm to the result.
# We can further mix schedulable TVM calls that add a bias term to the result.
#
#
n
=
1024
n
=
1024
l
=
128
l
=
128
...
@@ -103,7 +103,7 @@ np.testing.assert_allclose(b.asnumpy(), a.asnumpy() + 1, rtol=1e-5)
...
@@ -103,7 +103,7 @@ np.testing.assert_allclose(b.asnumpy(), a.asnumpy() + 1, rtol=1e-5)
######################################################################
######################################################################
# Summary
# Summary
# -------
# -------
# - TVM call extern tensor function via :any:`tvm.extern`
# - TVM call
s
extern tensor function via :any:`tvm.extern`
# - Use contrib wrappers for short sugars of extern tensor calls.
# - Use contrib wrappers for short sugars of extern tensor calls.
# - We can hook front-end function as extern tensor callbacks.
# - We can hook front-end function as extern tensor callbacks.
#
#
tutorials/python/get_started.py
View file @
fe51c498
...
@@ -84,7 +84,7 @@ s = tvm.create_schedule(C.op)
...
@@ -84,7 +84,7 @@ s = tvm.create_schedule(C.op)
bx
,
tx
=
s
[
C
]
.
split
(
C
.
op
.
axis
[
0
],
factor
=
64
)
bx
,
tx
=
s
[
C
]
.
split
(
C
.
op
.
axis
[
0
],
factor
=
64
)
######################################################################
######################################################################
# Finally we bind the iteratio axis bx and tx to threads in the GPU
# Finally we bind the iteratio
n
axis bx and tx to threads in the GPU
# compute grid. These are GPU specific constructs that allows us
# compute grid. These are GPU specific constructs that allows us
# to generate code that runs on GPU.
# to generate code that runs on GPU.
#
#
...
@@ -120,7 +120,7 @@ fadd_cuda = tvm.build(s, [A, B, C], "cuda", target_host="llvm", name="myadd")
...
@@ -120,7 +120,7 @@ fadd_cuda = tvm.build(s, [A, B, C], "cuda", target_host="llvm", name="myadd")
# The array API is based on `DLPack <https://github.com/dmlc/dlpack>`_ standard.
# The array API is based on `DLPack <https://github.com/dmlc/dlpack>`_ standard.
#
#
# - We first create a gpu context.
# - We first create a gpu context.
# - Then tvm.nd.array copies the data to
c
pu.
# - Then tvm.nd.array copies the data to
g
pu.
# - fadd runs the actual computation.
# - fadd runs the actual computation.
# - asnumpy() copies the gpu array back to cpu and we can use this to verify correctness
# - asnumpy() copies the gpu array back to cpu and we can use this to verify correctness
#
#
...
@@ -153,9 +153,9 @@ print(dev_module.get_source())
...
@@ -153,9 +153,9 @@ print(dev_module.get_source())
# to pass only single shape argument to the kernel, as you will find in
# to pass only single shape argument to the kernel, as you will find in
# the printed device code. This is one form of specialization.
# the printed device code. This is one form of specialization.
#
#
# On the host side, TVM will automatically generate check code
s
# On the host side, TVM will automatically generate check code
# that checks the constraints in the parameters. So if you pass
# that checks the constraints in the parameters. So if you pass
# arrays with different shape into the fadd, an error will be raised.
# arrays with different shape
s
into the fadd, an error will be raised.
#
#
# We can do more specializations. For example, we can write
# We can do more specializations. For example, we can write
# :code:`n = tvm.convert(1024)` instead of :code:`n = tvm.var("n")`,
# :code:`n = tvm.convert(1024)` instead of :code:`n = tvm.var("n")`,
...
@@ -166,7 +166,7 @@ print(dev_module.get_source())
...
@@ -166,7 +166,7 @@ print(dev_module.get_source())
######################################################################
######################################################################
# Save Compiled Module
# Save Compiled Module
# --------------------
# --------------------
# Besides runtime compilation, we can save the compiled module into
# Besides runtime compilation, we can save the compiled module
s
into
# file and load them back later. This is called ahead of time compilation.
# file and load them back later. This is called ahead of time compilation.
#
#
# The following code first does the following step:
# The following code first does the following step:
...
@@ -210,7 +210,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
...
@@ -210,7 +210,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
# Pack Everything into One Library
# Pack Everything into One Library
# --------------------------------
# --------------------------------
# In the above example, we store the device and host code seperatedly.
# In the above example, we store the device and host code seperatedly.
# TVM also support export everything as one shared library.
# TVM also support
s
export everything as one shared library.
# Under the hood, we pack the device modules into binary blobs and link
# Under the hood, we pack the device modules into binary blobs and link
# them together with the host code.
# them together with the host code.
# Currently we support packing of Metal, OpenCL and CUDA modules.
# Currently we support packing of Metal, OpenCL and CUDA modules.
...
@@ -225,7 +225,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
...
@@ -225,7 +225,7 @@ np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
#
#
# The compiled modules of TVM do not depend on the TVM compiler.
# The compiled modules of TVM do not depend on the TVM compiler.
# Instead, it only depends on a minimum runtime library.
# Instead, it only depends on a minimum runtime library.
# TVM runtime library wraps the device drivers and provide
# TVM runtime library wraps the device drivers and provide
s
# thread-safe and device agnostic call into the compiled functions.
# thread-safe and device agnostic call into the compiled functions.
#
#
# This means you can call the compiled TVM function from any thread,
# This means you can call the compiled TVM function from any thread,
...
...
tutorials/python/intrin_math.py
View file @
fe51c498
...
@@ -3,8 +3,8 @@ Intrinsics and Math Functions
...
@@ -3,8 +3,8 @@ Intrinsics and Math Functions
=============================
=============================
**Author**: `Tianqi Chen <https://tqchen.github.io>`_
**Author**: `Tianqi Chen <https://tqchen.github.io>`_
While
tvm support
basic arithmetic operations. In many cases
While
TVM supports
basic arithmetic operations. In many cases
usually we will need more complicated buil
d
in functions.
usually we will need more complicated buil
t
in functions.
For example :code:`exp` to take the exponetial of the function.
For example :code:`exp` to take the exponetial of the function.
These functions are target system dependent and may have different
These functions are target system dependent and may have different
...
@@ -135,7 +135,7 @@ print(fcuda.imported_modules[0].get_source())
...
@@ -135,7 +135,7 @@ print(fcuda.imported_modules[0].get_source())
######################################################################
######################################################################
# Summary
# Summary
# -------
# -------
# - TVM ca
ll
call extern target dependent math function.
# - TVM ca
n
call extern target dependent math function.
# - Use intrinsic to defined a unified interface for the functions.
# - Use intrinsic to defined a unified interface for the functions.
# - For more intrinsics available in tvm, take a look at :any:`tvm.intrin`
# - For more intrinsics available in tvm, take a look at :any:`tvm.intrin`
# - You can customize the intrinsic behavior by defining your own rules.
# - You can customize the intrinsic behavior by defining your own rules.
...
...
tutorials/python/opt_gemm.py
View file @
fe51c498
...
@@ -9,7 +9,7 @@ algorithm in high-performance schedule breaks the algorithm's readability and mo
...
@@ -9,7 +9,7 @@ algorithm in high-performance schedule breaks the algorithm's readability and mo
trying various seemingly promising schedules is time-consuming. With the help of TVM, we can
trying various seemingly promising schedules is time-consuming. With the help of TVM, we can
try these schedules efficiently to enhance the performance.
try these schedules efficiently to enhance the performance.
In this tutorial, we will demonstrate how squre matrix multiplication is optimized step by step by
In this tutorial, we will demonstrate how squ
a
re matrix multiplication is optimized step by step by
writing TVM.
writing TVM.
There are two important optmizations on intense computation applications executed on CPU:
There are two important optmizations on intense computation applications executed on CPU:
...
@@ -25,14 +25,14 @@ Actually, all the methodologies used in this tutorial is a subset of tricks ment
...
@@ -25,14 +25,14 @@ Actually, all the methodologies used in this tutorial is a subset of tricks ment
`repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by TVM
`repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by TVM
abstraction automatically, but some of them cannot be simply applied due to TVM constraints.
abstraction automatically, but some of them cannot be simply applied due to TVM constraints.
All the experiment results mentioned below, are executed on 2013's 15' MacBook equiped
All the experiment results mentioned below, are executed on 2013's 15' MacBook equiped
with
Intel i7-2760QM CPU. The cache line size should be 64 bytes for all the x86 CPU.
Intel i7-2760QM CPU. The cache line size should be 64 bytes for all the x86 CPU.
"""
"""
###############################################################################
###############################################################################
# Preparation and Baseline
# Preparation and Baseline
# ------------------------
# ------------------------
# In this tutorial we assume all the matrix tensors are squre and fix-bounded.
# In this tutorial we assume all the matrix tensors are squ
a
re and fix-bounded.
# We use 1024x1024 float32 matrix in demonstration. Before actually demonstrating,
# We use 1024x1024 float32 matrix in demonstration. Before actually demonstrating,
# we first define these variables. Then we write a baseline implementation,
# we first define these variables. Then we write a baseline implementation,
# the simplest way to write a matrix mulplication in TVM.
# the simplest way to write a matrix mulplication in TVM.
...
@@ -42,7 +42,7 @@ import tvm
...
@@ -42,7 +42,7 @@ import tvm
import
numpy
import
numpy
import
time
import
time
# The size of the squre matrix
# The size of the squ
a
re matrix
N
=
1024
N
=
1024
# The default tensor type in tvm
# The default tensor type in tvm
dtype
=
"float32"
dtype
=
"float32"
...
@@ -152,8 +152,8 @@ print('Opt3: %f' % evaluator(a, b, c).mean)
...
@@ -152,8 +152,8 @@ print('Opt3: %f' % evaluator(a, b, c).mean)
##################################################################################################
##################################################################################################
# Summary
# Summary
# -------
# -------
# After applying three main tricks, we can a
lmost 90% performance of numpy. Further observation is
# After applying three main tricks, we can a
chieve almost 90% performance of numpy.
# required to catch up with the performance of numpy.
#
Further observation is
required to catch up with the performance of numpy.
#
#
# TODO(Jian Weng): Catch up with the performance of numpy.
# TODO(Jian Weng): Catch up with the performance of numpy.
...
...
tutorials/python/reduction.py
View file @
fe51c498
...
@@ -20,7 +20,7 @@ import numpy as np
...
@@ -20,7 +20,7 @@ import numpy as np
# Assume we want to compute sum of rows as our example.
# Assume we want to compute sum of rows as our example.
# In numpy semantics this can be written as :code:`B = numpy.sum(A, axis=1)`
# In numpy semantics this can be written as :code:`B = numpy.sum(A, axis=1)`
#
#
# The following lines describe
s
the row sum operation.
# The following lines describe the row sum operation.
# To create a reduction formula, we declare a reduction axis using
# To create a reduction formula, we declare a reduction axis using
# :any:`tvm.reduce_axis`. :any:`tvm.reduce_axis` takes in the range of reductions.
# :any:`tvm.reduce_axis`. :any:`tvm.reduce_axis` takes in the range of reductions.
# :any:`tvm.sum` takes in the expression to be reduced as well as the reduction
# :any:`tvm.sum` takes in the expression to be reduced as well as the reduction
...
@@ -65,8 +65,8 @@ print(tvm.lower(s, [A, B], simple_mode=True))
...
@@ -65,8 +65,8 @@ print(tvm.lower(s, [A, B], simple_mode=True))
######################################################################
######################################################################
# If we are building a GPU kernel, we can bind the rows of B to GPU threads.
# If we are building a GPU kernel, we can bind the rows of B to GPU threads.
s
[
B
.
op
]
.
bind
(
xo
,
tvm
.
thread_axis
(
"blockIdx.x"
))
s
[
B
]
.
bind
(
xo
,
tvm
.
thread_axis
(
"blockIdx.x"
))
s
[
B
.
op
]
.
bind
(
xi
,
tvm
.
thread_axis
(
"threadIdx.x"
))
s
[
B
]
.
bind
(
xi
,
tvm
.
thread_axis
(
"threadIdx.x"
))
print
(
tvm
.
lower
(
s
,
[
A
,
B
],
simple_mode
=
True
))
print
(
tvm
.
lower
(
s
,
[
A
,
B
],
simple_mode
=
True
))
######################################################################
######################################################################
...
@@ -96,18 +96,18 @@ print(s[B].op.body)
...
@@ -96,18 +96,18 @@ print(s[B].op.body)
# Cross Thread Reduction
# Cross Thread Reduction
# ----------------------
# ----------------------
# We can now parallelize over the factored axis.
# We can now parallelize over the factored axis.
# Here
mark
the reduction axis of B is marked to be a thread.
# Here the reduction axis of B is marked to be a thread.
#
tvm allow
reduction axis to be marked as thread if it is the only
#
TVM allows
reduction axis to be marked as thread if it is the only
# axis in reduction and cross thread reduction is possible in the device.
# axis in reduction and cross thread reduction is possible in the device.
#
#
# This is indeed the case after the factoring.
# This is indeed the case after the factoring.
# We can directly compute BF at the reduction axis as well.
# We can directly compute BF at the reduction axis as well.
# The final generated kernel will divide
s
the rows by blockIdx.x and threadIdx.y
# The final generated kernel will divide the rows by blockIdx.x and threadIdx.y
# columns by threadIdx.x and finally do a cross thread reduction over threadIdx.x
# columns by threadIdx.x and finally do a cross thread reduction over threadIdx.x
#
#
xo
,
xi
=
s
[
B
]
.
split
(
s
[
B
]
.
op
.
axis
[
0
],
factor
=
32
)
xo
,
xi
=
s
[
B
]
.
split
(
s
[
B
]
.
op
.
axis
[
0
],
factor
=
32
)
s
[
B
.
op
]
.
bind
(
xo
,
tvm
.
thread_axis
(
"blockIdx.x"
))
s
[
B
]
.
bind
(
xo
,
tvm
.
thread_axis
(
"blockIdx.x"
))
s
[
B
.
op
]
.
bind
(
xi
,
tvm
.
thread_axis
(
"threadIdx.y"
))
s
[
B
]
.
bind
(
xi
,
tvm
.
thread_axis
(
"threadIdx.y"
))
s
[
B
]
.
bind
(
s
[
B
]
.
op
.
reduce_axis
[
0
],
tvm
.
thread_axis
(
"threadIdx.x"
))
s
[
B
]
.
bind
(
s
[
B
]
.
op
.
reduce_axis
[
0
],
tvm
.
thread_axis
(
"threadIdx.x"
))
s
[
BF
]
.
compute_at
(
s
[
B
],
s
[
B
]
.
op
.
reduce_axis
[
0
])
s
[
BF
]
.
compute_at
(
s
[
B
],
s
[
B
]
.
op
.
reduce_axis
[
0
])
fcuda
=
tvm
.
build
(
s
,
[
A
,
B
],
"cuda"
)
fcuda
=
tvm
.
build
(
s
,
[
A
,
B
],
"cuda"
)
...
...
tutorials/python/scan.py
View file @
fe51c498
...
@@ -81,7 +81,7 @@ np.testing.assert_allclose(b.asnumpy(), np.cumsum(a_np, axis=0))
...
@@ -81,7 +81,7 @@ np.testing.assert_allclose(b.asnumpy(), np.cumsum(a_np, axis=0))
# computation stage in s_update. It is possible to use multiple
# computation stage in s_update. It is possible to use multiple
# Tensor stages in the scan cell.
# Tensor stages in the scan cell.
#
#
# The following lines demonstrate
s
a scan with two stage operations
# The following lines demonstrate a scan with two stage operations
# in the scan cell.
# in the scan cell.
#
#
m
=
tvm
.
var
(
"m"
)
m
=
tvm
.
var
(
"m"
)
...
@@ -108,7 +108,7 @@ print(tvm.lower(s, [X, s_scan], simple_mode=True))
...
@@ -108,7 +108,7 @@ print(tvm.lower(s, [X, s_scan], simple_mode=True))
# ---------------
# ---------------
# For complicated applications like RNN, we might need more than one
# For complicated applications like RNN, we might need more than one
# recurrent state. Scan support multiple recurrent states.
# recurrent state. Scan support multiple recurrent states.
# The following example demonstrate how we can build recurrence with two states.
# The following example demonstrate
s
how we can build recurrence with two states.
#
#
m
=
tvm
.
var
(
"m"
)
m
=
tvm
.
var
(
"m"
)
n
=
tvm
.
var
(
"n"
)
n
=
tvm
.
var
(
"n"
)
...
...
tutorials/python/schedule_primitives.py
View file @
fe51c498
...
@@ -30,7 +30,7 @@ m = tvm.var('m')
...
@@ -30,7 +30,7 @@ m = tvm.var('m')
######################################################################
######################################################################
# A schedule can be created from a list of ops, by default the
# A schedule can be created from a list of ops, by default the
# schedule compute tensor in a serial manner in a row-major order.
# schedule compute
s
tensor in a serial manner in a row-major order.
# declare a matrix element-wise multiply
# declare a matrix element-wise multiply
A
=
tvm
.
placeholder
((
m
,
n
),
name
=
'A'
)
A
=
tvm
.
placeholder
((
m
,
n
),
name
=
'A'
)
...
@@ -182,7 +182,7 @@ print(tvm.lower(s, [A, B, C], simple_mode=True))
...
@@ -182,7 +182,7 @@ print(tvm.lower(s, [A, B, C], simple_mode=True))
# tvm, which permits users schedule the computation easily and
# tvm, which permits users schedule the computation easily and
# flexibly.
# flexibly.
#
#
# In order to get a
n
good performance kernel implementation, the
# In order to get a good performance kernel implementation, the
# general workflow often is:
# general workflow often is:
#
#
# - Describe your computation via series of operations.
# - Describe your computation via series of operations.
...
...
tutorials/python/tuple_inputs.py
View file @
fe51c498
...
@@ -36,7 +36,7 @@ print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True))
...
@@ -36,7 +36,7 @@ print(tvm.lower(s, [A0, A1, B0, B1], simple_mode=True))
#
#
# Describe Reduction with Collaborative Inputs
# Describe Reduction with Collaborative Inputs
# --------------------------------------------
# --------------------------------------------
# Sometimes, we require
s
multiple inputs to express some reduction
# Sometimes, we require multiple inputs to express some reduction
# operators, and the inputs will collaborate together, e.g. :code:`argmax`.
# operators, and the inputs will collaborate together, e.g. :code:`argmax`.
# In the reduction procedure, :code:`argmax` need to compare the value of
# In the reduction procedure, :code:`argmax` need to compare the value of
# operands, also need to keep the index of operand. It can be expressed
# operands, also need to keep the index of operand. It can be expressed
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment