# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""
.. _opt-gemm:

How to optimize GEMM on CPU
===========================
**Author**: `Jian Weng <https://github.com/were>`_, \
            `Ruofei Yu <https://github.com/yuruofeifei>`_

(TL;DR) TVM provides abstract interfaces that allow users to describe an algorithm and the
algorithm's implementing organization (the so-called schedule) separately. Typically, writing
an algorithm in a high-performance schedule breaks its readability and modularity. Also,
trying various seemingly promising schedules is time-consuming. With the help of TVM, we can
try these schedules efficiently to enhance the performance.

In this tutorial, we will demonstrate how to use TVM to optimize square matrix multiplication
and make it 200 times faster than the baseline by simply adding 18 extra lines of code.

There are two important optimizations for compute-intensive applications executed on CPU:
    1. Increase the cache hit rate of memory accesses. Both complex numerical computation and
       hot-spot memory accesses can be accelerated by a high cache hit rate. This requires us to
       transform the original memory access pattern into one that fits the cache policy.
    2. SIMD (Single Instruction Multiple Data), also known as the vector processing unit. At each
       step, a small batch of data, rather than a single value, is processed. This requires us to
       transform the data access pattern in the loop body into a uniform pattern so that the
       LLVM backend can lower it to SIMD.

Actually, all the methodologies used in this tutorial are a subset of the tricks mentioned in this
`repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied automatically
by the TVM abstraction, but some of them cannot be applied simply because of TVM constraints.

All the experiment results mentioned below were obtained on a 2015 15" MacBook equipped with an
Intel i7-4770HQ CPU. The cache line size should be 64 bytes for all x86 CPUs.
"""

################################################################################################
# Preparation and Baseline
# ------------------------
# In this tutorial, we will demo how to use TVM to optimize matrix multiplication.
# Before actually demonstrating, we first define these variables.
# Then we write a baseline implementation, the simplest way to write a matrix multiplication in TVM.

import tvm
import numpy
import timeit

# The size of the matrix
# (M, K) x (K, N)
# You are free to try out different shapes; sometimes TVM optimization outperforms numpy with MKL.
M = 1024
K = 1024
N = 1024

# The default tensor type in tvm
dtype = "float32"

# Using Intel AVX2 (Advanced Vector Extensions) ISA for SIMD.
# To get the best performance, please change the following line
# to "llvm -mcpu=core-avx2", or the specific type of CPU you use.
target = 'llvm'
ctx = tvm.context(target, 0)

# Randomly generated tensors for testing
a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), ctx)
b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), ctx)

np_repeat = 100
np_running_time = timeit.timeit(setup='import numpy\n'
                                      'M = ' + str(M) + '\n'
                                      'K = ' + str(K) + '\n'
                                      'N = ' + str(N) + '\n'
                                      'dtype = "float32"\n'
                                      'a = numpy.random.rand(M, K).astype(dtype)\n'
                                      'b = numpy.random.rand(K, N).astype(dtype)\n',
                                stmt='answer = numpy.dot(a, b)',
                                number=np_repeat)
print("Numpy running time: %f" % (np_running_time / np_repeat))

answer = numpy.dot(a.asnumpy(), b.asnumpy())

# Algorithm
k = tvm.reduce_axis((0, K), 'k')
A = tvm.placeholder((M, K), name='A')
B = tvm.placeholder((K, N), name='B')
C = tvm.compute(
           (M, N),
           lambda x, y: tvm.sum(A[x, k] * B[k, y], axis=k),
           name='C')

# Default schedule
s = tvm.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=1)
print('Baseline: %f' % evaluator(a, b, c).mean)

################################################################################################
# In TVM, we can always inspect lower level IR to debug or optimize our schedule.
# Here is the generated IR using our baseline schedule.

print(tvm.lower(s, [A, B, C], simple_mode=True))

################################################################################################
# Blocking
# --------
# An important trick to enhance the cache hit rate is blocking --- data chunks are computed
# block by block. The memory access inside a block is a small neighbourhood with high
# memory locality. In this tutorial, we pick 32 as the blocking factor, so one block fills
# 32 * 32 * sizeof(float), which is 4KB, in a cache whose total size is 32KB (L1 data cache).
# The plain-Python sketch below illustrates the resulting loop order.
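
# The plain-Python function below is only an illustrative sketch (it is not what TVM
# generates, and it would be far too slow to actually run on 1024 x 1024 matrices).
# It spells out the loop nest that the tile/split/reorder calls below produce:
# x is split into (xo, xi), y into (yo, yi), k into (ko, ki), and the reduction axes
# are hoisted outside the 32 x 32 block.
def blocked_matmul_sketch(A_np, B_np, C_np, bn=32, kfactor=4):
    # Assumes C_np is zero-initialized and that all dimensions divide evenly.
    M_, K_ = A_np.shape
    _, N_ = B_np.shape
    for xo in range(M_ // bn):
        for yo in range(N_ // bn):
            for ko in range(K_ // kfactor):
                for ki in range(kfactor):
                    for xi in range(bn):
                        for yi in range(bn):
                            x = xo * bn + xi
                            y = yo * bn + yi
                            kk = ko * kfactor + ki
                            C_np[x, y] += A_np[x, kk] * B_np[kk, y]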

bn = 32
s = tvm.create_schedule(C.op)

# Blocking by loop tiling
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
k, = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

# Hoist reduction domain outside the blocking loop
s[C].reorder(xo, yo, ko, ki, xi, yi)

func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

# By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,
# we can see a big speedup compared with the baseline.
evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print('Opt1: %f' % evaluator(a, b, c).mean)

################################################################################################
# Here is the generated IR after blocking.

print(tvm.lower(s, [A, B, C], simple_mode=True))

###################################################################################################
# Vectorization
# -------------
# Another important trick is vectorization. When the memory access pattern is uniform,
# the compiler can detect this pattern and pass the continuous memory to the vector processor.
# In TVM, we can use the `vectorize` interface to hint the compiler about this pattern,
# so that it can be accelerated vastly.
#
# In this tutorial, we choose to vectorize the inner loop row data since it is cache friendly.

s = tvm.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
k, = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

s[C].reorder(xo, yo, ko, ki, xi, yi)

# Vectorization
s[C].vectorize(yi)

func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print('Opt2: %f' % evaluator(a, b, c).mean)

################################################################################################
# Here is the generated IR after vectorization.

print(tvm.lower(s, [A, B, C], simple_mode=True))

###################################################################################################
# Loop Permutation
# ----------------
# If we look at the above IR, we can see that the inner loop row data is vectorized and
# B is transformed into PackedB. The traversal of PackedB is sequential now.
# So we will look at the access pattern of A. In the current schedule, A is accessed column by
# column, which is not cache friendly. If we change the nested loop order of ki and the inner
# axis xi, the access pattern for the A matrix will be more cache friendly.

s = tvm.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
k, = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

# re-ordering
s[C].reorder(xo, yo, ko, xi, ki, yi)
s[C].vectorize(yi)

func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print('Opt3: %f' % evaluator(a, b, c).mean)

################################################################################################
# Here is the generated IR after loop permutation.

print(tvm.lower(s, [A, B, C], simple_mode=True))

###################################################################################################
# Array Packing
# -------------
# Another important trick is array packing. This trick is to reorder the storage dimensions of the
# array to convert the continuous access pattern on a certain dimension into a sequential pattern
# after flattening.
#
# .. image:: https://github.com/dmlc/web-data/raw/master/tvm/tutorial/array-packing.png
#      :align: center
#      :scale: 100%
#


###################################################################################################
# Just as shown in the figure above, after blocking the computations, we can observe the array
# access pattern of B (after flattening), which is regular but discontinuous. We expect that after
# some transformation we can get a continuous access pattern. We can reorder a [16][16] array into
# a [16/4][16][4] array, so that the access pattern of B will be sequential when grabbing
# the corresponding value from the packed array, as the numpy sketch below illustrates.
#
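
# The numpy snippet below is only an illustration of this layout change (it is not used
# by the TVM schedule): for a small [16][16] matrix and a packing factor of 4,
# packed[j // 4][k][j % 4] holds the same value as original[k][j], but the innermost
# 4 elements are now contiguous in memory.
demoB = numpy.arange(16 * 16).reshape(16, 16)              # original [K][N] layout
packedDemoB = demoB.reshape(16, 4, 4).transpose(1, 0, 2)   # packed [N/4][K][4] layout
assert packedDemoB[7 // 4, 3, 7 % 4] == demoB[3, 7]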

# We have to re-write the algorithm slightly.
packedB = tvm.compute((N / bn, K, bn), lambda x, y, z: B[y, x * bn + z], name='packedB')
C = tvm.compute((M, N),
                lambda x, y: tvm.sum(A[x, k] * packedB[y // bn, k, tvm.indexmod(y, bn)], axis=k),
                name='C')

s = tvm.create_schedule(C.op)

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
k, = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

s[C].reorder(xo, yo, ko, xi, ki, yi)
s[C].vectorize(yi)

x, y, z = s[packedB].op.axis
s[packedB].vectorize(z)
s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print('Opt4: %f' % evaluator(a, b, c).mean)

################################################################################################
# Here is the generated IR after array packing.

print(tvm.lower(s, [A, B, C], simple_mode=True))

################################################################################################
# Write cache for blocks
# ----------------------
# After blocking, the program writes results to C block by block, so the access pattern
# is not sequential. Therefore, we can use a sequential cache array to hold the block results
# and write them to C when all the block results are ready (see the plain-Python sketch below).
#
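
# A plain-Python illustration only (not what TVM generates): accumulate each 32 x 32
# block of the result into a small contiguous buffer, then copy the finished block
# into C in one sequential pass.
def write_cache_sketch(A_np, B_np, C_np, bn=32):
    M_, K_ = A_np.shape
    _, N_ = B_np.shape
    for xo in range(M_ // bn):
        for yo in range(N_ // bn):
            CC_block = numpy.zeros((bn, bn), dtype=A_np.dtype)
            for kk in range(K_):
                for xc in range(bn):
                    for yc in range(bn):
                        CC_block[xc, yc] += A_np[xo * bn + xc, kk] * B_np[kk, yo * bn + yc]
            # Sequential write of the completed block into C.
            C_np[xo * bn:(xo + 1) * bn, yo * bn:(yo + 1) * bn] = CC_block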

s = tvm.create_schedule(C.op)

# Allocate write cache
CC = s.cache_write(C, 'global')

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

# Write cache is computed at yo
s[CC].compute_at(s[C], yo)

# New inner axes
xc, yc = s[CC].op.axis

k, = s[CC].op.reduce_axis
ko, ki = s[CC].split(k, factor=4)
s[CC].reorder(ko, xc, ki, yc)
s[CC].unroll(ki)
s[CC].vectorize(yc)

x, y, z = s[packedB].op.axis
s[packedB].vectorize(z)
s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
print('Opt5: %f' % evaluator(a, b, c).mean)

################################################################################################
# Here is the generated IR after write caching.

print(tvm.lower(s, [A, B, C], simple_mode=True))

###################################################################################################
# Parallel
# --------
# Furthermore, we can also utilize multi-core processors to do the thread-level parallelization.

s = tvm.create_schedule(C.op)

CC = s.cache_write(C, 'global')

xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

s[CC].compute_at(s[C], yo)

xc, yc = s[CC].op.axis

k, = s[CC].op.reduce_axis
ko, ki = s[CC].split(k, factor=4)
s[CC].reorder(ko, xc, ki, yc)
s[CC].unroll(ki)
s[CC].vectorize(yc)

# parallel
s[C].parallel(xo)

x, y, z = s[packedB].op.axis
s[packedB].vectorize(z)
s[packedB].parallel(x)

func = tvm.build(s, [A, B, C], target=target, name='mmult')
assert func

c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), ctx)
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)

evaluator = func.time_evaluator(func.entry_name, ctx, number=50)
opt6_time = evaluator(a, b, c).mean
print('Opt6: %f' % opt6_time)
################################################################################################
# Here is the generated IR after parallelization.

print(tvm.lower(s, [A, B, C], simple_mode=True))

###################################################################################################

##################################################################################################
# Summary
# -------
# After applying the above simple optimizations with only 18 lines of code,
# our generated code can achieve 60% of the `numpy` performance with MKL.
# Note that the outputs on the web page reflect the running times on a non-exclusive
# Docker container, so they are *unreliable*. You are highly encouraged to run the
# tutorial yourself to observe the performance gain achieved by TVM.
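
# As a rough sanity check (a minimal sketch reusing the timings measured above; the exact
# ratio will vary across machines and runs), we can print the final TVM time relative to
# the numpy baseline:
print('Relative performance vs. numpy (numpy_time / tvm_time): %f'
      % ((np_running_time / np_repeat) / opt6_time))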