Commit 29d253d0 by Jian Weng, committed by Tianqi Chen

[TUTORIAL] Optimize gemm on CPU add! (#270)

* [TUTORIAL] Optimize gemm add!

* temp commitment

* [TUTORIAL] python3 compatiblity made; doc generation updated!

* [DOCS] gen_modules clean and ignore add!

* [TUTORIAL] title modified!

* [TUTORIAL] some rolled-back modification re-write

* [TUTORIAL] title underscore extended!

* CONTRIBUTORS add me!
parent 3935329b
@@ -62,6 +62,7 @@ instance/
 # Sphinx documentation
 docs/_build/
+docs/gen_modules
 # PyBuilder
 target/

@@ -24,3 +24,4 @@ List of Contributors
 - To contributors: please add your name to the list.
 - [Qiao Zhang](https://github.com/zhangqiaorjc)
 - [Yizhi Liu](https://github.com/javelinjs)
+- [Jian Weng](https://github.com/were)

@@ -50,6 +50,7 @@ help:
 clean:
 	rm -rf $(BUILDDIR)/*
+	rm -rf gen_modules
 html:
 	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
"""
How to optimize GEMM on CPU
===========================
**Author**: `Jian Weng <https://github.com/were>`_

(TL;DR) TVM provides abstract interfaces which allow users to describe an algorithm and the
algorithm's implementation organization (the so-called schedule) separately. Typically, writing
an algorithm directly in a high-performance layout breaks its readability and modularity, and
trying out every seemingly promising schedule by hand is time-consuming. With the help of TVM,
we can try these schedules efficiently and enhance the performance.

In this tutorial, we will demonstrate how square matrix multiplication is optimized step by step
in TVM.

There are two important optimizations for compute-intensive applications executed on CPU:

1. Increase the cache hit rate of memory accesses. Both complex numerical computation and
   hot-spot memory accesses can be accelerated by a high cache hit rate. This requires us to
   transform the original memory access pattern into one that fits the cache policy.

2. SIMD (single instruction, multiple data), also called the vector processing unit. On each
   step, a small batch of data, rather than a single value, is processed. This requires us to
   transform the data access in the loop body into a uniform pattern so that the LLVM backend
   can lower it to SIMD.

Actually, all the methodologies used in this tutorial are a subset of the tricks mentioned in
this `repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by
the TVM abstraction automatically, but some of them cannot be simply applied due to TVM
constraints.

All the experiment results mentioned below were obtained on a 2013 15-inch MacBook equipped with
an Intel i7-2760QM CPU. The cache line size is 64 bytes for all x86 CPUs.
"""
###############################################################################
# Preparation and Baseline
# ------------------------
# In this tutorial we assume all the matrix tensors are square and of fixed size.
# We use 1024x1024 float32 matrices for the demonstration. Before the actual demonstration,
# we first define these variables. Then we write a baseline implementation,
# the simplest way to write matrix multiplication in TVM.
#
import tvm
import numpy
import time
# The size of the square matrix
N = 1024
# The default tensor type in tvm
dtype = "float32"
# Randomly generated tensors for testing
a = tvm.nd.array(numpy.random.rand(N, N).astype(dtype), tvm.cpu(0))
b = tvm.nd.array(numpy.random.rand(N, N).astype(dtype), tvm.cpu(0))
# The expected answer
answer = numpy.dot(a.asnumpy(), b.asnumpy())
# Algorithm
k = tvm.reduce_axis((0, N), 'k')
A = tvm.placeholder((N, N), name = 'A')
B = tvm.placeholder((N, N), name = 'B')
C = tvm.compute(
A.shape,
lambda x, y: tvm.sum(A[x, k] * B[k, y], axis = k),
name = 'C')
# Default schedule
s = tvm.create_schedule(C.op)
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 1)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Baseline: %f' % evaluator(a, b, c).mean)
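# As a quick sanity check, verify that the result computed by the generated function matches
# the numpy `answer` computed above (using numpy's assert_allclose with a small tolerance).
numpy.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)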
################################################################################################
# Blocking
# --------
# An important trick to enhance the cache hit rate is blocking --- the data are computed
# block by block. The memory accesses inside a block form a small neighbourhood with high
# memory locality. In this tutorial, I pick 8 as the blocking factor, a relatively small value
# (8 float32 values occupy 32 bytes, well within a 64-byte cache line).
#
bn = 8
# Blocking by loop tiling
yo, xo, yi, xi = s[C].tile(C.op.axis[1], C.op.axis[0], bn, bn)
# Hoist reduction domain outside the blocking loop
s[C].reorder(yo, xo, k, yi, xi)
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
# By simply tiling the loop 8x8 and hoisting k outside the blocking loops, we can see a nearly
# 4x speedup compared with the baseline.
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 5)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Opt1: %f' % evaluator(a, b, c).mean)
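# To see what the schedule did, we can print the lowered IR. This is a minimal sketch and
# assumes your TVM build exposes `tvm.lower` with the `simple_mode` flag; the output should show
# the reduction loop k hoisted above the two 8-wide inner loops.
print(tvm.lower(s, [A, B, C], simple_mode=True))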
###################################################################################################
# Vectorization
# -------------
# Another important trick is vectorization. When the memory access pattern is uniform, the
# compiler can detect this pattern and pass the contiguous memory to the vector processor. In
# TVM, we can use the `vectorize` interface to hint the compiler about this pattern, so that the
# code can be accelerated vastly.
#
# After trying different schedules, we found that we benefit most from vectorizing
# the inner axis yi.
s[C].vectorize(yi)
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
# We can get almost another 4x speedup compared with the previous schedule.
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 5)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Opt2: %f' % evaluator(a, b, c).mean)
###################################################################################################
# Array Packing
# -------------
# Another important trick is array packing. This trick reorders the storage dimensions of the
# array so that the continuous access pattern along a certain dimension becomes a sequential
# pattern after flattening. For the convenience of drawing a figure, we use 4x4 blocking as an
# example to demonstrate array packing:
#
# First we observe the memory access pattern of A * B = C (diagram: to produce one 4x4 block
# of C, a 4-row strip of A is read left to right while a 4-wide column strip of B is read top
# to bottom):
# We access A sequentially, but the successive accesses to B walk along its row dimension
# (down a column strip), which is not sequential in the flattened layout. Thus, what we want to
# do is to move this dimension to the innermost storage dimension. For 1x1 blocking, this is
# simply transposing matrix B. For the 4x4 case here, array B is packed in the following fashion:
# (Diagram: the 16x16 matrix B, with rows labelled 0-F and column blocks 0123 / 4567 / 89AB /
#  CDEF, is rearranged so that each 4-wide column strip becomes its own contiguous panel, 0 to
#  3, of the packed array.)
###################################################################################################
# We reorder a 16x16 array to a [16/4][16][4] array so that the access pattern of B becomes
# sequential when grabbing the corresponding values from the packed array.
#
# We have to re-write the algorithm slightly.
packedB = tvm.compute((N // bn, N, bn), lambda x, y, z: B[y, x * bn + z], name = 'packedB')
C = tvm.compute(A.shape,
lambda x, y: tvm.sum(A[x, k] * packedB[y / bn, k, y % bn], axis = k),
name = 'C')
# Same schedule
s = tvm.create_schedule(C.op)
yo, xo, yi, xi = s[C].tile(C.op.axis[1], C.op.axis[0], bn, bn)
s[C].reorder(yo, xo, k, yi, xi)
s[C].vectorize(yi)
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
# We can accelerate it almost 3x compared with the previous schedule.
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 5)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Opt3: %f' % evaluator(a, b, c).mean)
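# Since the algorithm itself was re-written for packing, it is worth re-checking correctness.
# As an illustration, the layout held by `packedB` can also be produced in numpy: the compute
# rule packedB[x, y, z] = B[y, x * bn + z] corresponds to a reshape followed by a transpose
# (a sketch for illustration only; `packedB_np` is not used by the schedule).
numpy.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)
packedB_np = b.asnumpy().reshape(N, N // bn, bn).transpose(1, 0, 2)
print('packedB layout: %s' % str(packedB_np.shape))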
##################################################################################################
# Summary
# -------
# After applying the three main tricks above, we reach nearly 90% of the performance of numpy.
# Further observation is required to catch up with the performance of numpy.
#
# TODO(Jian Weng): Catch up with the performance of numpy.
now = time.clock()
answer = numpy.dot(a.asnumpy(), b.asnumpy())
print("Numpy: %f" % (time.clock() - now))