"""
How to optimize GEMM on CPU
===========================
**Author**: `Jian Weng <https://github.com/were>`_

(TL;DR) TVM provides abstract interfaces that allow users to describe an algorithm and the
algorithm's implementation organization (the so-called schedule) separately. Typically, writing
an algorithm directly in a high-performance schedule breaks its readability and modularity. Also,
trying various seemingly promising schedules is time-consuming. With the help of TVM, we can
try these schedules efficiently to enhance the performance.

In this tutorial, we will demonstrate how to optimize square matrix multiplication step by step
with TVM.

There are two important optimizations for compute-intensive applications executed on CPU:
    1. Increase the cache hit rate of memory accesses. Both complex numerical computation and
       hot-spot memory access can be accelerated by a high cache hit rate. This requires us to
       transform the original memory access pattern into one that fits the cache policy.
    2. SIMD (single instruction, multiple data), also known as the vector processing unit. On
       each cycle, a small batch of data, rather than a single value, is processed. This requires
       us to transform the data access pattern in the loop body into a uniform pattern so that
       the LLVM backend can lower it to SIMD.

Actually, all the methodologies used in this tutorial are a subset of the tricks mentioned in this
`repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by the TVM
abstraction automatically, but some of them cannot be simply applied due to TVM constraints.

All the experiment results mentioned below were obtained on a 2013 15-inch MacBook equipped with an
Intel i7-2760QM CPU. The cache line size should be 64 bytes for all x86 CPUs.
"""

###############################################################################
# Preparation and Baseline
# ------------------------
# In this tutorial we assume all the matrix tensors are square and of fixed, known size.
# We use 1024x1024 float32 matrices for the demonstration. Before actually demonstrating,
# we first define these variables. Then we write a baseline implementation,
# the simplest way to write a matrix multiplication in TVM.
#

import tvm
import numpy
import time

# The size of the square matrix
N = 1024
# The default tensor type in tvm
dtype = "float32"
# Randomly generated tensors for testing
a = tvm.nd.array(numpy.random.rand(N, N).astype(dtype), tvm.cpu(0))
b = tvm.nd.array(numpy.random.rand(N, N).astype(dtype), tvm.cpu(0))
# The expected answer
answer = numpy.dot(a.asnumpy(), b.asnumpy())

# Algorithm
k = tvm.reduce_axis((0, N), 'k')
A = tvm.placeholder((N, N), name = 'A')
B = tvm.placeholder((N, N), name = 'B')
C = tvm.compute(
           A.shape,
           lambda x, y: tvm.sum(A[x, k] * B[k, y], axis = k),
           name = 'C')

# Default schedule
s = tvm.create_schedule(C.op)
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 1)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Baseline: %f' % evaluator(a, b, c).mean)
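
# As an extra sanity check (not part of the timing above), verify that the TVM result
# matches the numpy reference computed earlier.
numpy.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)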

################################################################################################
# Blocking
# --------
# An important trick to enhance the cache hit rate is blocking --- the data chunk will be computed
# block by block. The memory access inside each block is a small neighbourhood with high
# memory locality. In this tutorial, I pick 8, a relatively small value (8 float32 values = 32
# bytes < 64 bytes, one cache line), as the blocking size.
#

bn = 8
# Blocking by loop tiling
yo, xo, yi, xi = s[C].tile(C.op.axis[1], C.op.axis[0], bn, bn)
# Hoist reduction domain outside the blocking loop
s[C].reorder(yo, xo, k, yi, xi)
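# To see how tiling and the reorder reshape the loop nest, we can print the lowered IR
# (an optional inspection step, assuming the usual `tvm.lower` interface of this TVM version).
print(tvm.lower(s, [A, B, C], simple_mode=True))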
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
# By simply tiling the loop 8x8 and hoisting k outside the blocking loops, we can get nearly a 4x
# speedup compared with the baseline.
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 5)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Opt1: %f' % evaluator(a, b, c).mean)

###################################################################################################
# Vectorization
# -------------
# Another important trick is vectorization. When the memory access pattern is uniform, the
# compiler can detect this pattern and pass the contiguous memory to the vector processor. In
# TVM, we can use the `vectorize` interface to hint the compiler about this pattern, so that we
# can accelerate it vastly.
#

# After trying different schedules, we finally found that we benefit most from vectorizing
# the inner loop yi (the loop along a row, which is contiguous in memory).
s[C].vectorize(yi)
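# Optional inspection again: in the lowered IR the vectorized loop typically shows up as
# vector (ramp) loads/stores instead of a scalar inner loop.
print(tvm.lower(s, [A, B, C], simple_mode=True))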
func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
# We can get almost another 4x speedup compared with the previous schedule.
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 5)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Opt2: %f' % evaluator(a, b, c).mean)

###################################################################################################
# Array Packing
# -------------
# Another important trick is array packing. This trick reorders the storage dimensions of the
# array so that the continuous access pattern on a certain dimension becomes sequential after
# flattening.
#
# .. image:: https://github.com/dmlc/web-data/raw/master/tvm/tutorial/array-packing.png
#      :align: center
#      :scale: 100%
#


###################################################################################################
# Just as shown in the figure above, after blocking the computations, we can observe that the array
# access pattern of B (after flattening) is regular but discontinuous. We expect that after
# some transformation we can get a continuous access pattern. For example, we can reorder a
# [16][16] array into a [16/4][16][4] array, so that the access pattern of B becomes sequential
# when grabbing the corresponding values from the packed array (a small numpy sketch after the
# new definitions below illustrates this layout).
#

# We have to re-write the algorithm slightly.
packedB = tvm.compute((N / bn, N, bn), lambda x, y, z: B[y, x * bn + z], name = 'packedB')
C = tvm.compute(A.shape,
                lambda x, y: tvm.sum(A[x, k] * packedB[y / bn, k, y % bn], axis = k),
                name = 'C')
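
# A small numpy sketch (illustration only, not part of the TVM computation) of the layout
# relation used above: packedB[x, y, z] corresponds to B[y, x * bn + z], i.e. each bn-wide
# column block of B becomes the contiguous innermost dimension.
_Bdemo = numpy.arange(16 * 16, dtype = dtype).reshape(16, 16)
_packedDemo = _Bdemo.reshape(16, 16 // bn, bn).transpose(1, 0, 2)  # shape: (16 // bn, 16, bn)
assert _packedDemo[1, 2, 3] == _Bdemo[2, 1 * bn + 3]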

# Same schedule
s = tvm.create_schedule(C.op)
yo, xo, yi, xi = s[C].tile(C.op.axis[1], C.op.axis[0], bn, bn)
s[C].reorder(yo, xo, k, yi, xi)
s[C].vectorize(yi)

func = tvm.build(s, [A, B, C], name = 'mmult')
assert func
# We can get almost another 3x speedup compared with the previous schedule.
evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number = 5)
c = tvm.nd.array(numpy.zeros((N, N), dtype = dtype), tvm.cpu(0))
print('Opt3: %f' % evaluator(a, b, c).mean)
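
# One more optional sanity check: the rewritten (packed) algorithm should still produce
# the same result as the numpy reference.
numpy.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)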

##################################################################################################
# Summary
# -------
# After applying the three main tricks above, we can achieve almost 90% of the performance of numpy.
# Further observation is required to catch up with the performance of numpy.
#

# TODO(Jian Weng): Catch up with the performance of numpy.
_a = a.asnumpy()
_b = b.asnumpy()
now = time.clock()
answer = numpy.dot(_a, _b)
print("Numpy: %f" % (time.clock() - now))